Proceedings of FTC Conference
Kohei Arai Editor
Proceedings of the Future Technologies Conference (FTC) 2022, Volume 1
Lecture Notes in Networks and Systems
Volume 559
Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences,
Warsaw, Poland
Advisory Editors
Fernando Gomide, Department of Computer Engineering and Automation—DCA,
School of Electrical and Computer Engineering—FEEC, University of Campinas—
UNICAMP, São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering,
Bogazici University, Istanbul, Turkey
Derong Liu, Department of Electrical and Computer Engineering, University
of Illinois at Chicago, Chicago, USA
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering, University of
Alberta, Alberta, Canada
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering,
KIOS Research Center for Intelligent Systems and Networks, University of Cyprus,
Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong,
Kowloon, Hong Kong
The series “Lecture Notes in Networks and Systems” publishes the latest
developments in Networks and Systems—quickly, informally and with high quality.
Original research reported in proceedings and post-proceedings represents the core
of LNNS.
Volumes published in LNNS embrace all aspects and subfields of, as well as new
challenges in, Networks and Systems.
The series contains proceedings and edited volumes in systems and networks,
spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor
Networks, Control Systems, Energy Systems, Automotive Systems, Biological
Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems,
Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems,
Robotics, Social Systems, Economic Systems and others. Of particular value to both
the contributors and the readership are the short publication timeframe and
the world-wide distribution and exposure which enable both a wide and rapid
dissemination of research output.
The series covers the theory, applications, and perspectives on the state of the art
and future developments relevant to systems and networks, decision making, control,
complex processes and related areas, as embedded in the fields of interdisciplinary
and applied sciences, engineering, computer science, physics, economics, social, and
life sciences, as well as the paradigms and methodologies behind them.
Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago.
All books published in the series are submitted for consideration in Web of Science.
For proposals from Asia please contact Aninda Bose ([email protected]).
Editor
Kohei Arai
Faculty of Science and Engineering
Saga University
Saga, Japan
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Editor’s Preface
We are delighted to present the seventh Future Technologies Conference (FTC 2022), which was successfully held during 20–21 October 2022. COVID-19 necessitated holding this conference virtually for two years. However, as the pandemic waned and restrictions eased, we recreated the scholarly atmosphere by holding the conference in hybrid mode, with learned researchers from across the globe presenting either in person or online. Around 250 participants from over 60 countries took part, making this event a great academic success.
The conference provided a wonderful academic exchange platform to share the latest research, developments, advances and new technologies in the fields of computing, electronics, AI, robotics, security and communications. The conference succeeded in disseminating novel ideas and emerging trends as well as in discussing research results and achievements. We received 511 submissions, of which 177 papers were thoroughly reviewed and selected for publication in the final proceedings.
Many people collaborated and worked hard to produce a successful FTC 2022 conference. We would like to thank all the authors and distinguished keynote speakers for their interest in this conference; the Technical Committee members, who carried out the most difficult work of carefully evaluating the submitted papers with professional reviewing and prompt responses; and the Session Chairs Committee for their efforts. Finally, we would also like to express our gratitude to the Organizing Committee, who worked very hard to ensure the high standard and quality of the keynotes, panels, presentations and discussions.
We hope that readers are able to satisfy their appetite for knowledge in the field of AI and its useful applications across diverse fields. We also look forward to even more enthusiastic participation in this coveted event next year.
Kind Regards,
Kohei Arai
Conference Program Chair
Min-Max Cost and Information Control

R. Kamimura and R. Kitajima

Abstract. The present paper aims to propose a new method to minimize and
maximize information and its cost, accompanied by the ordinary error minimiza-
tion. All these computational procedures are operated as independently as pos-
sible from each other. This method aims to solve the contradiction in conven-
tional computational methods in which many procedures are intertwined with
each other, making it hard to compromise among them. In particular, we try
to minimize information at the expense of cost, followed by information max-
imization, to reduce humanly biased information obtained through artificially
created input variables. The new method was applied to the detection of rela-
tions between mission statements and firms’ financial performance. Though the
relation between them has been considered one of the main factors for strategic
planning in management, the past studies could only confirm very small positive
relations between them. In addition, those results turned out to be very dependent
on the operationalization and variable selection. The studies suggest that there
may be some indirect and mediating variables or factors to internalize the mis-
sion statements in organizational members. If neural networks have an ability to
infer those mediating variables or factors, new insight into the relation can be
obtained. Keeping this in mind, the experiments were performed to infer some
positive relations. The new method, based on minimizing the humanly biased
effects from inputs, could produce linear, non-linear, and indirect relations, which
could not be extracted by the conventional methods. Thus, this study shows a pos-
sibility for neural networks to interpret complex phenomena in human and social
sciences, which, in principle, conventional models cannot deal with.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 1–17, 2023.
https://doi.org/10.1007/978-3-031-18461-1_1

1 Introduction

1.1 Necessity of Min-Max Property

One of the main characteristics of multi-layered neural networks is their ability to create the sparse distributed representation in which all components should be used on average, while a very few should respond to a specific input [1]. This type of representation has
been the base of many learning strategies from the beginning of research on neural net-
works. For example, the conventional competitive learning and self-organizing maps [2–
6] need the winner-take-all type of responses to specific inputs, but actually, all neurons
are forced to be equally used, with many additional constraints to reduce dead neurons
[7–11]. This seemingly contradictory statement of specific and non-specific property
can be explained by the economical and efficient use of components in living systems,
in which all available components should respond to inputs equally, but at the same time,
all the components should have their own meaning or specific information. This means
that information specific to inputs should be minimized and at the same time maximized
in some ways. For clarifying this contradictory min-max property more concretely, we
explain here this property in terms of min-max selectivity, information, and cost.
First, the importance of the min-max property can be found in the recent discussion
on the necessity of selectivity in neural networks. As has been well known, the selec-
tivity has played important roles in the neurosciences from the beginning [12, 13], with
many experimental results accumulated [14]. The selectivity has also played impor-
tant roles in neural networks in improving generalization [15]. In particular, the selec-
tivity has been so far discussed in the field of convolutional neural networks (CNN).
For example, the majority of interpretation methods have tried to show which components in a neural network are the most responsible for a specific input pattern or specific output [16, 17]. However, there have been recent discussions on the importance of
selectivity in neural networks, saying that the selectivity should be reduced as much as
possible, especially for improving generalization performance [18, 19]. This discussion
suggests that selectivity should be minimized and maximized, depending on different
situations in neural networks. Selectivity becomes useful in improving generalization performance only under certain learning conditions. We need to adjust the strength of
selectivity according to learning conditions and objectives, suggesting the importance
of controlling min-max property in selectivity.
The min-max property can also be found in the regularization approach [20–23].
Usually, the regularization has been realized by decreasing the strength of weights,
such as weight decay. Suppose that we consider the absolute strength of weights as a
representational cost; then, the majority of regularization methods have been realized
by reducing the cost of representation. Actually, this cost minimization naturally reduces the average cost, while the cost, in terms of weight strength, associated with a few specific neurons may increase. Thus, the regularization approach has the property that the average cost decreases while the cost of a few specific neurons increases in the end; regularization methods can therefore be used to realize the contradictory cost minimization and maximization at the same time.
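The weight-decay form of regularization mentioned above can be sketched in a few lines; the learning rate and decay coefficient below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def weight_decay_step(w, grad, lr=0.01, decay=1e-3):
    """One gradient step with weight decay: besides following the error
    gradient, every weight is shrunk toward zero, reducing the average
    weight magnitude (the 'representational cost')."""
    return w - lr * (grad + decay * w)

w = np.array([1.0, -2.0, 0.5])
# With a zero error gradient, the decay term alone shrinks every weight.
w_next = weight_decay_step(w, np.zeros_like(w))
```

Note that while the average magnitude shrinks, nothing prevents the error gradient from strengthening a few specific weights, which is exactly the min-max behavior described above.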
In addition, the min-max property of the sparse distributed representation can be
closely related to min-max information control. This concept of information can be used
to improve interpretation and generalization. For example, when we try to interpret a
neural network, we need to focus on specific information on a small number of com-
ponents. On the contrary, when we try to improve generalization, we need to distribute
information on input patterns over as many components as possible, because we need
to deal with new and unseen input patterns. This means that information minimization,
responding to all inputs, and information maximization, responding to specific inputs,
should be used, depending on the purposes of study. In addition, the situation is much
better if we can use both types of information optimization procedures at the same time.
In the information-theoretic approach, the need to maximize and minimize infor-
mation has been well recognized. For example, the information-theoretic methods have
used the complicated measure of mutual information, whose use in neural networks
has been initiated by the pioneering works on the maximum information preservation
by Linsker [24–26]. In mutual information, the properties of information maximization
and minimization are implicitly supposed from our viewpoint. For example, in mutual
information maximization, unconditional entropy should be maximized, while condi-
tional entropy should be minimized to achieve its maximum value. Since entropy max-
imization means that all components are uniformly distributed on average, and condi-
tional entropy minimization means that a component tends to be closely connected with
a specific one, mutual information is also a special type of operation with the min-max
property in neural networks.
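The decomposition just described, maximizing unconditional entropy while minimizing conditional entropy, can be checked numerically. This is a generic illustration of the mutual-information identity, not code from the paper.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(joint):
    """I(X;Y) = H(Y) - H(Y|X) for a discrete joint table p(x, y):
    H(Y) is the unconditional entropy to be maximized, H(Y|X) the
    conditional entropy to be minimized."""
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    h_y_given_x = sum(px[i] * entropy(joint[i] / px[i])
                      for i in range(joint.shape[0]) if px[i] > 0)
    return entropy(py) - h_y_given_x

# Y fully determined by X: H(Y) = 1 bit, H(Y|X) = 0, so I(X;Y) = 1 bit.
deterministic = np.array([[0.5, 0.0],
                          [0.0, 0.5]])
# X and Y independent: H(Y|X) = H(Y), so I(X;Y) = 0.
independent = np.outer([0.5, 0.5], [0.5, 0.5])
```

The two test distributions make the min-max reading concrete: mutual information is largest when each input component connects to one specific output (conditional entropy minimized) while the outputs stay uniformly used on average (unconditional entropy maximized).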
The method was applied to the analysis of the relation between mission statements and
firms’ financial performance [36]. The mission statement has been considered one of
the most widespread methods for strategic planning [37–39] in companies’ manage-
ment. There have been a number of attempts to show positive relations between them [40], while there are also some reports against positive influences, in particular on organizational members [41]. Roughly speaking, the past studies seem to show that the
relation between mission statements and firms’ financial performance may be a small
positive one [42]. The past studies on this relation suggest that the conclusions are
highly dependent on the operationalization decisions. In particular, the variable and tar-
get selection can have much influence on the final results on the relations. We think that
one of the main problems lies in the limitations of the methods for analysis, where the
conventional linear models such as the regression ones are usually used and only direct
and linear relations between inputs and targets are extensively examined. For dealing
with those complex problems, we need a method to deal with non-linear relations and,
in addition, to reduce the influence of operationalization and variable selection [43] due
to the fact that extracted information is biased toward very specific purposes. The new
method proposed here tries to reduce the information in input variables to the extreme
point and even at the expense of cost. Then, the present method can show a new insight
on this relation, which can be independent of inputs, or at least, as independent as pos-
sible of input variables. Since input variables are artificial and humanly biased, not nec-
essarily representing the core information in inputs, we need to minimize information
through inputs at the expense of cost, and then maximize the information for interpre-
tation. Thus, this application can be used to demonstrate the performance of the present
method for showing the relations, whose existence has not been definitely decided by
the conventional methods.
The present paper tries to unfold the complicated, intertwined, and folded combination
of information, cost minimization and error minimization (assimilation) into completely
separated and independently operated procedures. Figure 1(a) shows the intertwined
and folded combination of three procedures. They are processed in complicated and
interwoven ways, which makes it hard to compromise among the three types of processing. Figure 1(b) shows that the three procedures are separated and serially operated. In
Fig. 1. Conventional folded processing (a) and serially unfolded processing (b).
where the max operation is over all connection weights between the layers. In addition,
we define the complementary one by
\bar{g}_{jk}^{(2,3)} = 1 - \frac{u_{jk}^{(2,3)}}{\max_{j'k'} u_{j'k'}^{(2,3)}}    (3)
When all potentialities become equal, naturally, the selective information becomes
zero. On the other hand, when only one potentiality becomes one, while all the oth-
ers are zero, the selective information becomes maximum. For simplicity, we suppose
that at least one connection weight should be larger than zero. Finally, the cost can be
computed simply by the sum of all absolute weights.
C^{(2,3)} = \sum_{j=1}^{n_2} \sum_{k=1}^{n_3} u_{jk}^{(2,3)}    (5)
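The potentiality, its complement, and the cost can be sketched as follows. The form of the potentiality u (relative absolute weight) is an assumption here, since the defining equations (1), (2) and (4) are not reproduced in this excerpt.

```python
import numpy as np

def potentialities(w):
    """Relative absolute weights u_jk = |w_jk| / max|w|. This form is
    assumed; the defining equations are not reproduced in this excerpt."""
    u = np.abs(w)
    return u / u.max()

def complementary_potentialities(w):
    """Eq. (3): the complementary potentiality 1 - u."""
    return 1.0 - potentialities(w)

def cost(w):
    """Eq. (5): sum of all (relative absolute) potentialities u_jk."""
    return float(potentialities(w).sum())

w = np.array([[2.0, -1.0],
              [0.0, 1.0]])
```

When all potentialities are equal the distribution is uniform (zero selective information), while a single potentiality of one with all others zero gives maximum selective information, matching the description above.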
Fig. 2. Information minimization with cost augmentation (a), cost minimization (b), and informa-
tion maximization (c) only for the first learning cycle.
information maximization is applied in Fig. 2(c1), and the number of stronger con-
nection weights is forced to be smaller, and the strength of the weights is also forced
to be smaller. Then, the effect by information maximization should be assimilated in
Fig. 2(c2).
Connection weights are changed by multiplying them by the corresponding poten-
tialities. Since in the actual implementation, there are several parameters to be adjusted,
we need to discuss this implementation more concretely. Due to the page limitation,
detailed implementation is eliminated for easy interpretation of information flow in this
study. In the first place, information is forced to be minimized, which is realized by
adding the complementary potentiality. For the (n + 1)th step, weights are computed
by
w_{jk}^{(2,3)}(n+1) = \bar{g}_{jk}^{(2,3)}(n)\, w_{jk}^{(2,3)}(n)    (6)
In this process, the cost augmentation should be applied, which can be realized by
a learning parameter larger than one. Then, cost minimization is applied by
w_{jk}^{(2,3)}(n+1) = \theta_1\, w_{jk}^{(2,3)}(n)    (7)
In the cost minimization, only the learning parameter θ1 should be less than one.
Finally, to increase the selective information, we have
w_{jk}^{(2,3)}(n+1) = g_{jk}^{(2,3)}(n)\, w_{jk}^{(2,3)}(n)    (8)
For increasing information, the potentiality g should be used, where strong weights
are forced to be stronger, while small ones are forced to be smaller. This has an effect
of reducing the number of strong weights.
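Taken together, eqs. (6)–(8) suggest the following update cycle. The potentiality formula and the two parameter values are assumptions, since the detailed implementation is omitted in the paper.

```python
import numpy as np

def min_max_cycle(w, theta0=1.2, theta1=0.9):
    """One cycle of information/cost control following eqs. (6)-(8).
    theta0 > 1 realizes the cost augmentation mentioned for the first
    phase, theta1 < 1 the cost minimization; both values here are
    illustrative assumptions, not the paper's settings."""
    u = np.abs(w) / np.abs(w).max()     # potentialities (assumed form)
    w = theta0 * (1.0 - u) * w          # eq. (6): information minimization
    w = theta1 * w                      # eq. (7): cost minimization
    u = np.abs(w) / np.abs(w).max()
    return u * w                        # eq. (8): information maximization

w0 = np.array([[2.0, -1.0],
               [0.5, 1.0]])
w1 = min_max_cycle(w0)
```

The first phase suppresses the currently strongest weights (the largest weight, with potentiality one, is zeroed), spreading information over many components; the last phase then sharpens the surviving weights, reducing the number of strong ones.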
Then, we should deal with the problem of the multi-layered property of neural networks.
This means that, for the interpretation of neural networks, we need to reduce the number
of layers or to make neural networks de-layered as much as possible. Naturally, when
the number of hidden layers increases, neurons tend to be connected in more compli-
cated ways, which has made it impossible to understand the inner mechanism, and this
black-box problem has been one of the serious problems in neural networks. For mak-
ing the interpretation possible, the complicated multi-layered neural networks should
be de-layered or compressed, and the number of hidden layers should be decreased. As
has been well known, the method of model compression has been taken seriously these
days [45–52]. However, the typical compression methods have tried to replace multi-
layered networks with fewer-layered ones, different from the original ones. Thus, we
need to develop a method to directly make the number of hidden layers smaller to keep
the original information as untouched as possible.
For interpreting multi-layered neural networks, we compress them into the simplest
ones, as shown in Fig. 3. We try here to trace all routes from inputs to the corresponding
outputs by multiplying and summing all corresponding connection weights.
Fig. 3. De-layered compression to the simplest network (b), and further unfolded one (c).
First, we compress connection weights from the first to the second layer, denoted by
(1,2), and from the second to the third layer (2,3) for an initial condition and a subset of
a data set. Then, we have the compressed weights between the first and the third layer,
denoted by (1,3).
w_{ik}^{(1,3)} = \sum_{j=1}^{n_2} w_{ij}^{(1,2)} w_{jk}^{(2,3)}    (9)
Those compressed weights are further combined with weights from the third to the
fourth layer (3,4), and we have the compressed weights between the first and the fourth
layer (1,4).
w_{il}^{(1,4)} = \sum_{k=1}^{n_3} w_{ik}^{(1,3)} w_{kl}^{(3,4)}    (10)
By repeating these processes, we have the compressed weights between the first and the sixth layer, denoted by w_{iq}^{(1,6)}. Using those connection weights, we have the final and fully compressed weights (1,7).

w_{ir}^{(1,7)} = \sum_{q=1}^{n_6} w_{iq}^{(1,6)} w_{qr}^{(6,7)}    (11)
Considering all routes from the inputs to the outputs, the final connection weights
should represent the overall characteristics of connection weights of the original multi-
layered neural networks. Finally, connection weights are expected to be unfolded and
disentangled, meaning that each input can be separately treated as shown in Fig. 3(c).
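The repeated sums in eqs. (9)–(11) amount to a chain of matrix products, so the de-layered compression can be sketched in a few lines (a minimal sketch, not the authors' code):

```python
import numpy as np

def compress(weights):
    """De-layered compression, eqs. (9)-(11): multiply the weight
    matrices of successive layers so that every route from an input to
    an output is summed over the intermediate neurons, yielding direct
    input-to-output weights."""
    w = weights[0]
    for w_next in weights[1:]:
        w = w @ w_next
    return w

# Three layers of random weights compress into a single 4-by-3 matrix
# connecting the four inputs directly to the three outputs.
rng = np.random.default_rng(0)
ws = [rng.standard_normal((4, 5)),
      rng.standard_normal((5, 6)),
      rng.standard_normal((6, 3))]
direct = compress(ws)
```

Because matrix multiplication already sums over every route through the intermediate neurons, the compressed matrix preserves the original network's overall input-output characteristics, as stated above.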
The data set consisted of 300 companies, listed in the first section of the Tokyo Stock Exchange, which were summarized by five input variables extracted by natural language processing systems.
For demonstrating the performance of the present method, we used the very redundant
ten-hidden-layered neural networks, where each hidden layer had ten neurons. The data
set seemed very easy, but the conventional methods such as the linear regression and
random forest could not improve generalization. In addition, simple information max-
imization and minimization also failed to improve generalization. Then, we repeated
the process of information minimization and maximization up to 20 cycles, because
improved performance could not be seen even if we repeated them further. In the exper-
iments, we tried to increase generalization performance, focusing on information min-
imization, where information was first minimized and then maximized. We used infor-
mation assimilation steps containing the fixed number of learning epochs (50 epochs)
for assimilating the information by the information minimizer and maximizer. In the
experiments, several learning parameters had to be controlled, and among them, the
most important one was the parameter θ1, which was changed from 1 to 1.5 to control
the strength of weights or cost. When this parameter increased, the strength of weights
or potentialities (cost) increased gradually to decrease the information content.
The experimental results show that, when the number of learning cycles increased to 20,
information decreased gradually, and at the same time the fluctuation of selective infor-
mation became smaller, meaning that information could be assimilated in connection
weights smoothly.
Figure 4 shows selective information (left), cost (middle), and ratio of information
to its cost (right), when the number of cycles increased from two (a) to 20 (d), and by the
conventional method without information control (e). When the number of cycles was
two in Fig. 4(a), selective information (left), cost (middle), and ratio (right) increased
to a maximum point and decreased to a minimum point drastically. Then, when the
number of cycles increased from five in Fig. 4(b) to 20 in Fig. 4(d), the strength of
fluctuations of information, cost, and ratio decreased gradually. In particular, when the
number of cycles was 20, information and its cost decreased sufficiently through many
small fluctuations. On the other hand, by using the conventional method in Fig. 4(e),
information, cost and its ratio remained almost constant for the entire series of learning
steps.
3.3 Potentialities
When the learning cycles increased, the number of weights with stronger potentialities
became larger and more regularly distributed. Figure 5 shows the potentialities in terms
of relative absolute weights when the number of cycles increased from two (a) to 20
(d) and by the conventional method (e). One of the main characteristics is that, when
the number of cycles increased, the number of stronger weights with higher poten-
tialities increased gradually. When the number of cycles was two in Fig. 5(a), only
one weight tended to have higher potentialities. On the contrary, when the number
Fig. 4. Selective information (left), cost (middle), and ratio of information to its cost (right) as a
function of the number of steps when the number of learning cycles increased from two (a) to 20
(d) and when the conventional method (e) was used for the mission statement data set.
of cycles increased to 20 in Fig. 5(d), the number of weights with higher potentiali-
ties, shown in white, increased gradually, going through the second hidden layer to the
ninth hidden layer. In addition, potentialities responded very regularly to neurons in the
precedent and subsequent layers. From the information-theoretic viewpoint, informa-
tion decreases as the number of stronger components increases gradually. This means
that the potentialities, when the number of cycles was 20 in Fig. 5(d), tended to be dis-
tributed evenly when the hidden layers become higher. Thus, this shows that the present
method can realize a process of information minimization over hidden layers. Finally,
when the conventional method was used in Fig. 5(e), the number of strong weights
became smaller, and many weights seemed to be randomly distributed.
Fig. 5. Potentialities for all hidden layers by the selective information when the number of cycles
increased from two (a) to 20 (d) and by the conventional method without information control (e)
for the mission statement data set.
of cycles increased from five in Fig. 6(b) to 10 in Fig. 6(c), the compressed weights
(left) gradually became similar to the correlation coefficients (right), but still, relative
input No. 5 had the highest strength. When the number of cycles increased to 20 in
Fig. 6(d), the compressed weights were almost equal to the correlation coefficients, but
the strength of relative input No. 5 became smaller. Finally, by the conventional method
in Fig. 6(e), the compressed weights were close to the corresponding correlation coeffi-
cients, and the strength of relative input No. 5 became smaller. This tendency was more
clearly seen in the relative information, normalized by the original correlation in the
middle figures in Fig. 6. As can be seen in the figure, input No. 5 showed the highest
strength for all cases. The results show that input No. 5 should play some important
roles, though non-linearly.
The present method produced compressed weights close to the correlation coefficients,
and at the same time, it could improve generalization performance by changing the
compressed weights. Thus, the method could extract linear and non-linear relations
between inputs and targets explicitly.
Table 1 shows the summary of correlation coefficients and generalization perfor-
mance. The conventional method produced the second highest correlation coefficient of
0.974, but the accuracy was relatively low (0.597). When the number of cycles was 20,
the correlation coefficient of 0.966 was close to the 0.974 by the conventional method.
However, the method produced the best and highest accuracy of 0.645. The logistic
regression produced the highest correlation coefficient of 0.980, but the accuracy was
the lowest (0.527). Finally, the random forest produced the lowest correlation coeffi-
cient of 0.238, and the accuracy was the second worst at 0.529. The results show that
the conventional and very redundant multi-layered neural networks could extract the
linear and independent relations between inputs and outputs. When the information was
controlled, the compressed weights tended to be less similar to the correlation coeffi-
cient, and Input No. 5, considered not so important, tended to show higher importance.
Input No. 5 represents the abstract property, meaning that the input and target relation,
Fig. 6. The compressed weights (left), relative weights (middle), and the original correlation coefficients (right) when the number of cycles increased from two (a) to 20 (d) and by the conventional method (e) for the mission statement data set. The compressed weights were averages over ten different initial conditions and input patterns.
or more concretely, inputs and profitability can be based on Input No. 5, which could
not be extracted by the conventional methods.
4 Conclusion
The present paper aimed to present a method to resolve a contradiction among com-
putational procedures in neural networks. We particularly focused on the contradiction
among information minimization, maximization, cost minimization, and error mini-
mization. Those computational procedures were intertwined with each other, which
has made it hard to reconcile among those contradictory procedures. To cope with
References
1. Hinton, G.E., McClelland, J.L., Rumelhart, D.E.: Distributed representations. In: Parallel
Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations,
pp. 77–109 (1986)
2. Rumelhart, D.E., Zipser, D.: Feature discovery by competitive learning. Cogn. Sci. 9, 75–112
(1985)
3. Kohonen, T.: Self-Organization and Associative Memory. Springer, New York (1988).
https://doi.org/10.1007/978-3-642-88163-3
4. Kohonen, T.: Self-Organizing Maps. Springer, Heidelberg (1995). https://doi.org/10.1007/
978-3-642-97610-0
5. Xu, Y., Xu, L., Chow, T.W.S.: PPoSOM: a new variant of PolSOM by using probabilis-
tic assignment for multidimensional data visualization. Neurocomputing 74(11), 2018–2027
(2011)
6. Xu, L., Chow, T.W.S.: Multivariate data classification using PolSOM. In: Prognostics and
System Health Management Conference (PHM-Shenzhen), pp. 1–4. IEEE (2011)
7. DeSieno, D.: Adding a conscience to competitive learning. In: IEEE International Confer-
ence on Neural Networks, vol. 1, pp. 117–124. Institute of Electrical and Electronics Engi-
neers, New York (1988)
8. Lei, X.: Rival penalized competitive learning for clustering analysis, RBF net, and curve
detection. IEEE Trans. Neural Netw. 4(4), 636–649 (1993)
9. Choy, C.S., Siu, W.: A class of competitive learning models which avoids neuron underuti-
lization problem. IEEE Trans. Neural Netw. 9(6), 1258–1269 (1998)
10. Banerjee, A., Ghosh, J.: Frequency-sensitive competitive learning for scalable balanced clus-
tering on high-dimensional hyperspheres. IEEE Trans. Neural Netw. 15(3), 702–719 (2004)
11. Van Hulle, M.M.: Entropy-based kernel modeling for topographic map formation. IEEE
Trans. Neural Netw. 15(4), 850–858 (2004)
12. Hubel, D.H., Wiesel, T.N.: Receptive fields, binocular interaction and functional architecture
in cat’s visual cortex. J. Physiol. 160, 106–154 (1962)
13. Bienenstock, E.L., Cooper, L.N., Munro, P.W.: Theory for the development of neuron selec-
tivity. J. Neurosci. 2, 32–48 (1982)
14. Schoups, A., Vogels, R., Qian, N., Orban, G.: Practising orientation identification improves
orientation coding in V1 neurons. Nature 412(6846), 549–553 (2001)
15. Ukita, J.: Causal importance of low-level feature selectivity for generalization in image
recognition. Neural Netw. 125, 185–193 (2020)
16. Nguyen, A., Yosinski, J., Clune, J.: Understanding neural networks via feature visualiza-
tion: a survey. In: Samek, W., Montavon, G., Vedaldi, A., Hansen, L.K., Müller, K.-R. (eds.)
Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. LNCS (LNAI), vol.
11700, pp. 55–76. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-28954-6 4
17. Montavon, G., Binder, A., Lapuschkin, S., Samek, W., Müller, K.-R.: Layer-wise relevance
propagation: an overview. In: Samek, W., Montavon, G., Vedaldi, A., Hansen, L.K., Müller,
K.-R. (eds.) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. LNCS
(LNAI), vol. 11700, pp. 193–209. Springer, Cham (2019). https://doi.org/10.1007/978-3-
030-28954-6 10
18. Morcos, A.S., Barrett, D.G.T., Rabinowitz, N.C., Botvinick, M.: On the importance of single
directions for generalization. Stat 1050, 15 (2018)
19. Leavitt, M.L., Morcos, A.: Selectivity considered harmful: evaluating the causal impact of
class selectivity in DNNs. arXiv preprint arXiv:2003.01262 (2020)
20. Arpit, D., Zhou, Y., Ngo, H., Govindaraju, V.: Why regularized auto-encoders learn sparse
representation? In: International Conference on Machine Learning, pp. 136–144. PMLR
(2016)
21. Goodfellow, I., Bengio, Y., Courville, A.: Regularization for deep learning. Deep Learn.
216–261 (2016)
22. Kukačka, J., Golkov, V., Cremers, D.: Regularization for deep learning: a taxonomy. arXiv
preprint arXiv:1710.10686 (2017)
23. Wu, C., Gales, M.J.F., Ragni, A., Karanasou, P., Sim, K.C.: Improving interpretability and
regularization in deep learning. IEEE/ACM Trans. Audio Speech Lang. Process. 26(2), 256–
265 (2017)
24. Linsker, R.: Self-organization in a perceptual network. Computer 21(3), 105–117 (1988)
25. Linsker, R.: Local synaptic rules suffice to maximize mutual information in a linear network.
Neural Comput. 4, 691–702 (1992)
26. Linsker, R.: Improved local learning rule for information maximization and related applica-
tions. Neural Netw. 18, 261–265 (2005)
27. Moody, J., Hanson, S., Krogh, A., Hertz, J.A.: A simple weight decay can improve general-
ization. Adv. Neural Inf. Process. Syst. 4, 950–957 (1995)
28. Fan, F.-L., Xiong, J., Li, M., Wang, G.: On interpretability of artificial neural networks: a
survey. IEEE Trans. Radiat. Plasma Med. Sci. 5(6), 741–760 (2021)
29. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspec-
tives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
30. Hu, J., et al.: Architecture disentanglement for deep neural networks. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision, pp. 672–681 (2021)
31. Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: Autoaugment: learning aug-
mentation strategies from data. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 113–123 (2019)
32. Gupta, A., Murali, A., Gandhi, D., Pinto, L.: Robot learning in homes: improving general-
ization and reducing dataset bias. arXiv preprint arXiv:1807.07049 (2018)
Min-Max Cost and Information Control 17
33. Kim, B., Kim, H., Kim, K., Kim, S., Kim, J.: Learning not to learn: training deep neural net-
works with biased data. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp. 9012–9020 (2019)
34. Wang, T., Zhao, J., Yatskar, M., Chang, K.W., Ordonez, V.: Balanced datasets are not enough:
estimating and mitigating gender bias in deep image representations. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision, pp. 5310–5319 (2019)
35. Hendricks, L.A., Burns, K., Saenko, K., Darrell, T., Rohrbach, A.: Women also snowboard:
overcoming bias in captioning models. In: Proceedings of the European Conference on Com-
puter Vision (ECCV), pp. 771–787 (2018)
36. Cortés-Sánchez, J.D., Rivera, L.: Mission statements and financial performance in Latin-
American firms. Verslas: Teorija ir praktika/Business Theory Pract. 20, 270–283 (2019)
37. Bart, C.K., Bontis, N., Taggar, S.: A model of the impact of mission statements on firm
performance. Manag. Decis. 39(1), 19–35 (2001)
38. Hirota, S., Kubo, K., Miyajima, H., Hong, P., Park, Y.W.: Corporate mission, corporate poli-
cies and business outcomes: evidence from japan. Manag. Decis. (2010)
39. Alegre, I., Berbegal-Mirabent, J., Guerrero, A., Mas-Machuca, M.: The real mission of the
mission statement: a systematic review of the literature. J. Manag. Organ. 24(4), 456–473
(2018)
40. Atrill, P., Omran, M., Pointon, J.: Company mission statements and financial performance.
Corp. Ownersh. Control. 2(3), 28–35 (2005)
41. Vandijck, D., Desmidt, S., Buelens, M.: Relevance of mission statements in flemish not-for-
profit healthcare organizations. J. Nurs. Manag. 15(2), 131–141 (2007)
42. Desmidt, S., Prinzie, A., Decramer, A.: Looking for the value of mission statements: a meta-
analysis of 20 years of research. Manag. Decis. (2011)
43. Macedo, I.M., Pinho, J.C., Silva, A.M.: Revisiting the link between mission statements and
organizational performance in the non-profit sector: the mediating effect of organizational
commitment. Eur. Manag. J. 34(1), 36–46 (2016)
44. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, Hoboken (1991)
45. Buciluǎ, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: Proceedings of the
12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
pp. 535–541. ACM (2006)
46. Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Advances in Neural Information
Processing Systems, pp. 2654–2662 (2014)
47. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint
arXiv:1503.02531 (2015)
48. Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Hints for thin deep
nets. In: Proceedings of ICLR, Fitnets (2015)
49. Luo, P., Zhu, Z., Liu, Z., Wang, X., Tang, X.: Face model compression by distilling knowl-
edge from neurons. In: Thirtieth AAAI Conference on Artificial Intelligence (2016)
50. Neill, J.O.: An overview of neural network compression. arXiv preprint arXiv:2006.03669
(2020)
51. Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: a survey. Int. J. Comput. Vis.
129(6), 1789–1819 (2021)
52. Cheng, Y., Wang, D., Zhou, P., Zhang, T.: A survey of model compression and acceleration
for deep neural networks (2020)
Face Generation from Skull Photo Using
GAN and 3D Face Models
Abstract. Generating face images from skull images has many applications in fields such as archaeology, anthropology, and especially forensics. However, face/skull image generation remains a challenging problem, because face images and skull images have different characteristics and the available skull image data are limited. We therefore treat this transformation as an unpaired image-to-image translation problem and study the recently popular generative adversarial networks (GANs) for generating face images from skull images. To this end, we use a synthesis framework called U-GAT-IT, a recent framework for unsupervised image-to-image translation. This framework uses AdaLIN (Adaptive Layer-Instance Normalization), a normalization function that helps the model focus on the more important regions between the source and target domains. Furthermore, to visualize the generated face from other viewpoints, we use an additional 3D facial generation model called DECA (Detailed Expression Capture and Animation), a 3D facial reconstruction model trained to robustly produce a UV displacement map from a low-dimensional latent representation. Experimental results show that the proposed method achieves positive results compared with current unpaired image-to-image translation models.
1 Introduction
In recent years, information technology has developed at a rapid rate, and applications of artificial intelligence in particular have attracted growing interest. This has helped make information technology one of the important fields contributing to a country's economic development and to improving people's lives.
One of the problems that has attracted the interest of computer scientists is reconstruction, especially facial generation, which is the most commonly researched owing to the convenience of collecting face data. Among these, the problem of "Face Image Generation from Skull Images" is
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 18–31, 2023.
https://doi.org/10.1007/978-3-031-18461-1_2
2 Related Work
Generating facial images from skull images has been studied since the late 19th century, starting with Paul Broca's 1867 study of the relationship between bone and the thickness of the soft tissue of the face [5]; the work was later completed and officially published by the Russian scientist Gerasimov [21]. In the field of forensic science, typical works include the facial reconstruction of Karen Price (a Welsh girl murdered in 1981, whose remains were found eight years later)
20 D. K. Vo et al.
and the second domain is the face images. With the help of science and technology, two-dimensional images of the face are superimposed on the skull [7,13]. A semi-supervised formulation of transform learning and a skull dataset called IdentifyMe were proposed in 2017 by Nagpal et al. [14] (see Fig. 3 for an example of this approach).
3 Proposed Method
In this section, the problem formulation is presented followed by a detailed
description of the proposed method. Also, details of generator and discriminator
network architecture are presented.
3.1 Formulation
Given a dataset D consisting of a source domain {xi}i=1..N ⊂ Xs of skull images and a target domain {yj}j=1..M ⊂ Xt of face images, the goal of the skull-to-face generation model is to learn two mapping functions: (1) G : Xs → Xt, which generates a face image in Xt from a skull image in Xs; and (2) F : Xt → Xs, which generates a skull image in Xs from a face image in Xt. In addition, we introduce two discriminators Ds and Dt, where Ds aims to distinguish between real skull images {x} and translated images Ft→s(y), while Dt aims to distinguish between real face images {y} and translated images Gs→t(x). The full objective contains four types of terms: adversarial loss, cycle loss, identity loss, and CAM loss.
Adversarial Loss. An adversarial loss is employed to match the distribution of the translated images to the target image distribution:

L_gan^(s→t) = E_{y∼Xt}[(Dt(y))²] + E_{x∼Xs}[(1 − Dt(Gs→t(x)))²],   (1)
where Gs→t tries to generate images Gs→t(x) that look similar to images from the target domain Xt, while Dt tries to distinguish between generated images Gs→t(x) and real images y from the target domain Xt. Gs→t aims to minimize this objective, while Dt aims to maximize it, i.e., min_{Gs→t} max_{Dt} L_gan^(s→t). A similar adversarial loss is defined for F : Xt → Xs and its discriminator Ds, i.e., min_{Ft→s} max_{Ds} L_gan^(t→s).
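For illustration, the single least-squares objective of Eq. (1) can be written out on lists of scalar discriminator outputs (a hypothetical simplification; in practice these are batched network outputs and the expectations are batch means):

```python
def adversarial_objective(d_real, d_fake):
    """Eq. (1): E_y[(Dt(y))^2] + E_x[(1 - Dt(Gs->t(x)))^2].
    Dt maximises this value; Gs->t minimises it."""
    real_term = sum(d ** 2 for d in d_real) / len(d_real)
    fake_term = sum((1.0 - d) ** 2 for d in d_fake) / len(d_fake)
    return real_term + fake_term
```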
Cycle Consistency Loss. Cycle consistency loss was introduced with the CycleGAN [24] architecture to constrain the optimization problem: if we translate a skull image to a face image and then back to a skull image, we should recover the original input image.
In this paper, we want to learn two mapping functions, G : Xs → Xt and F : Xt → Xs. Cycle consistency loss encourages F(G(x)) ≈ x and G(F(y)) ≈ y.
Lcycle = E_{x∼Xs}[|x − Ft→s(Gs→t(x))|₁] + E_{y∼Xt}[|y − Gs→t(Ft→s(y))|₁],   (2)
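For concreteness, Eq. (2) can be sketched in plain Python on images flattened to lists of pixel values (a simplified sketch, not the authors' code: the expectation and L1 norm become a mean absolute error):

```python
def cycle_loss(x, x_rec, y, y_rec):
    """L1 cycle-consistency loss of Eq. (2): F(G(x)) should
    recover x and G(F(y)) should recover y."""
    l1 = lambda a, b: sum(abs(ai - bi) for ai, bi in zip(a, b)) / len(a)
    return l1(x, x_rec) + l1(y, y_rec)
```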
Identity Loss. To ensure that the color distributions of the input image and the generated image are similar, an identity consistency constraint is applied to the generator: for an image x ∈ Xs, the image should remain largely unchanged after translation by Gs→t.

L_identity^(s→t) = E_{x∼Xs}[|x − Gs→t(x)|₁],   (3)
CAM Loss. By exploiting the information from the auxiliary classifiers ηs and
ηDt , given an image x ∈ {Xs , Xt }, Gs→t and Dt get to know where they need to
improve or what makes the most difference between two domains in the current
state [10]:
L_cam^(s→t) = −(E_{x∼Xs}[log(ηs(x))] + E_{x∼Xt}[log(1 − ηs(x))]),   (4)

L_cam^(Dt) = E_{x∼Xt}[(ηDt(x))²] + E_{x∼Xs}[(1 − ηDt(Gs→t(x)))²],   (5)
The Full Objective. Finally, the full objective is the sum of the four objectives with pre-set default coefficients:

min_{Gs→t, Ft→s, ηs, ηDt} max_{Ds, Dt}  λ1·Lgan + λ2·Lcycle + λ3·Lidentity + λ4·Lcam,   (6)

where Lgan = L_gan^(s→t) + L_gan^(t→s), and Lcycle, Lidentity, and Lcam are defined analogously over both translation directions [10].
3.2 Generator
The main goal of the generator is to train a function Gs→t that maps an image
from the source domain Xs to the target domain Xt using unpaired images from
each domain [10] in the dataset.
The Gs→t model consists of an encoder Es , a decoder Gt , and an auxiliary
classifier ηs , where ηs (x) represents the probability that x comes from Xs [10].
Let x ∈ {Xs, Xt} be a sample from the source or the target domain, Es^k(x) be the k-th activation map of the encoder, Es^k_ij(x) be its value at position (i, j), and ws^k be the weight of the k-th feature map for the source domain, trained together with ηs(x). The attention value is then computed as

ηs(x) = σ( Σ_k ws^k Σ_{i,j} Es^k_ij(x) ),   (7)

where σ denotes the sigmoid function.
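Equation (7) amounts to global sum-pooling of each encoder feature map, a weighted sum over the maps, and a sigmoid. A minimal sketch (feature maps as nested Python lists; the number and size of maps are illustrative):

```python
import math

def eta(feature_maps, weights):
    """Eq. (7): eta_s(x) = sigmoid( sum_k w_k * sum_ij E_k[i][j] ).
    feature_maps: K two-dimensional activation maps; weights: K scalars."""
    logit = sum(
        w * sum(v for row in fmap for v in row)
        for w, fmap in zip(weights, feature_maps)
    )
    return 1.0 / (1.0 + math.exp(-logit))
```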
âI = (a − μI) / √(σI² + ε),   âL = (a − μL) / √(σL² + ε),   (10)

where μI, σI and μL, σL are the channel-wise (instance) and layer-wise mean and standard deviation of the activation map a, respectively, and ε is a small constant added for numerical stability [10].
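In U-GAT-IT [10], the instance-normalized and layer-normalized activations of Eq. (10) are then blended with a learnable ratio ρ and affine parameters γ, β. A minimal sketch on plain Python lists (the shapes and parameter values here are illustrative assumptions):

```python
import math

def adalin(channels, rho, gamma, beta, eps=1e-5):
    """channels: list of channels, each a flat list of activations.
    Blends instance-normalised (per-channel) and layer-normalised
    (all-channel) activations as in AdaLIN."""
    def stats(values):
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values)
        return mu, var

    flat = [v for ch in channels for v in ch]
    mu_l, var_l = stats(flat)              # layer statistics
    out = []
    for ch in channels:
        mu_i, var_i = stats(ch)            # instance statistics
        row = []
        for a in ch:
            a_i = (a - mu_i) / math.sqrt(var_i + eps)
            a_l = (a - mu_l) / math.sqrt(var_l + eps)
            row.append(gamma * (rho * a_i + (1 - rho) * a_l) + beta)
        out.append(row)
    return out
```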
3.3 Discriminator
4 Experimental Results
In this section, experimental settings and evaluation of the proposed method
are discussed in detail. Three reference databases are used: the Flickr-Faces-HQ
(FFHQ) [9], the Chinese University of Hong Kong Face Sketch (CUFS) [22] and
IdentifyMe [14].
4.1 Datasets
The paper focuses on generating face images from skull images and uses a face-to-3D transformation model to visualize the obtained results. However, datasets that contain both face images and skull images are few. Therefore, to be able to train the model, we propose a composite dataset that combines the IdentifyMe [14] dataset with the two facial datasets FFHQ [9] and CUFS [22].
The IdentifyMe dataset introduced in [14] consists of 464 skull and face
images divided into two parts:
– Part 1: Skull and Face Image Pairs. A total of 35 skull images and their cor-
responding face images are collected from various sources, some of these pairs
correspond to real world cases where a skull was found and later identified to
belong to a missing person [14].
– Part 2: Unlabeled Supplementary Skull Images. A total of 429 skull images are collected from various sources on the Internet and in real life.
The Flickr-Faces-HQ (FFHQ) dataset is a dataset of human faces consisting of 70,000 high-quality images at 1024 × 1024 resolution. The images were crawled from Flickr (thus inheriting all the biases of that website) and automatically aligned and cropped [9].
The CUHK Face Sketch database (CUFS) [22] is a viewed sketch database,
but we only use face images in this paper. We collect 188 face images from the
Chinese University of Hong Kong (CUHK) student database.
The proposed dataset is fed into the system to form two training sets used
to compare the results with each other. The first training set skull2ffhq is a
combination of two datasets IdentifyMe and FFHQ and the second training set
skull2CUFS is a combination of two datasets IdentifyMe and CUFS.
rate of 0.0002 and a batch size of 1. In particular, we replace the negative log-likelihood objective with a least-squares loss [12], which is more stable during training and generates higher-quality results [24]. For all the experiments, we set λ1 = 1, λ2 = 10, λ3 = 10, λ4 = 1000. We keep the same learning rate for the first 50 epochs and linearly decay it to zero over the next 50 epochs (see Figs. 5 and 6).
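The schedule just described (constant learning rate for the first 50 epochs, then linear decay to zero over the next 50) can be sketched as:

```python
def learning_rate(epoch, base_lr=0.0002, constant_epochs=50, decay_epochs=50):
    """Keep base_lr for the first `constant_epochs` epochs, then
    decay linearly to zero over the following `decay_epochs`."""
    if epoch < constant_epochs:
        return base_lr
    remaining = max(0, constant_epochs + decay_epochs - epoch)
    return base_lr * remaining / decay_epochs
```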
Fig. 6. Facial images from the two face datasets [9,22] after data preprocessing
4.3 Evaluation
We first evaluate the performance of the proposed method on the two proposed datasets. We then use the better-performing dataset to train a model for a recent unpaired image-to-image translation method, CycleGAN. Finally, we compare the performance of our method against CycleGAN (see Fig. 7).
For evaluation, we use two metrics: the Inception Score and the Kernel Inception Distance.
The Inception Score (IS), proposed in [18], is an objective metric for evaluating the quality of generated images, specifically synthetic images output by generative adversarial network models. The Inception Score has a lowest value of 1.0 and a highest value equal to the number of classes supported by the classification model.
The Kernel Inception Distance (KID), proposed in [4], is used to evaluate the images generated by a GAN model; the lower the KID, the more similar the generated images are to the images in the source domain.
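KID is the squared maximum mean discrepancy (MMD) between Inception features of real and generated images under the polynomial kernel k(x, y) = (x·y/d + 1)³ [4]. A toy sketch on low-dimensional feature vectors (real implementations use 2048-dimensional Inception features and average over random subsets):

```python
def poly_kernel(x, y):
    # Cubic polynomial kernel used by KID; d is the feature dimension.
    d = len(x)
    return (sum(a * b for a, b in zip(x, y)) / d + 1.0) ** 3

def kid(real, fake):
    """Unbiased MMD^2 estimate between two sets of feature vectors;
    self-pairs are excluded from the within-set terms."""
    m, n = len(real), len(fake)
    k_rr = sum(poly_kernel(a, b) for i, a in enumerate(real)
               for j, b in enumerate(real) if i != j) / (m * (m - 1))
    k_ff = sum(poly_kernel(a, b) for i, a in enumerate(fake)
               for j, b in enumerate(fake) if i != j) / (n * (n - 1))
    k_rf = sum(poly_kernel(a, b) for a in real for b in fake) / (m * n)
    return k_rr + k_ff - 2.0 * k_rf
```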
Fig. 7. Visualization of the skull images and their generated images: (a) source images from [14]; (b) images generated with the FFHQ facial image dataset [9]; (c) images generated with the CUFS facial image dataset [22]; (d) images generated by the CycleGAN model [24]
Model                   | IS            | KID
------------------------|---------------|---------------
U-GAT-IT and skull2ffhq | 2.019 ± 0.157 | 9.868 ± 0.441
U-GAT-IT and skull2CUFS | 1.465 ± 0.101 | 3.445 ± 0.318
Based on the results in the table above, we find that the face images generated using the FFHQ dataset are more diverse than those generated using the CUFS dataset, but the images generated using the CUFS dataset are more stable and more similar to the source domain. This is because the facial images of the FFHQ dataset are confounded by accessories and attributes such as hats, glasses, and age, which cause the generative model to lose focus on the face.
Model    | IS            | KID
---------|---------------|----------------
U-GAT-IT | 1.465 ± 0.101 | 3.445 ± 0.318
CycleGAN | 1.304 ± 0.084 | 20.011 ± 0.617
Based on the results in the table above, we find that the images generated by the proposed method are more similar to the source domain than those of the CycleGAN model. It can be seen that the CycleGAN model has not converged, so it generates the same image for different input images.
5 Conclusion
The paper focuses on generating face images from skull images in order to support the process of identifying a skull. Through studying the problem, we have grasped the methods and techniques for generating a face from a skull, along with some basic knowledge about the skull and the face. The approach presented in this paper can generate a face image and a 3D model of the face, achieving a KID of 3.445 ± 0.318. However,
the obtained results depend entirely on the training dataset: when the training dataset changes, the results change as well. Therefore, the obtained results are for reference only and are not yet applicable in practice.
In future work, we will train and test on more data to build additional transformation models and, at the same time, improve the accuracy of the generated faces so that the method can be applied in practice. In addition, we plan to develop a system to convert skull images to face images and to build a data normalization process so that the system gives the most accurate results and can serve other studies with the same research object.
References
1. Abate, A.F., et al.: FACES: 3D FAcial reConstruction from anciEnt Skulls using
content based image retrieval. J. Vis. Lang. Comput. 15(5), 373–389 (2004)
2. Andersson, B., Valfridsson, M.: Digital 3D facial reconstruction based on computed
tomography (2005)
3. Biederman, I., Kalocsai, P.: Neural and psychophysical analysis of object and face
recognition. In: Wechsler, H., Phillips, P.J., Bruce, V., Soulié, F.F., Huang, T.S.
(eds.) Face Recognition, pp. 3–25. Springer, Heidelberg (1998). https://doi.org/10.1007/978-3-642-72201-1_1
4. Bińkowski, M., et al.: Demystifying MMD GANs. In: International Conference on
Learning Representations (2018)
5. Buzug, T.M., et al.: Reconstruction of soft facial parts (2005)
6. Feng, Y., et al.: Learning an animatable detailed 3D face model from in-the-wild images. ACM Trans. Graph. (TOG) 40(4), 1–13 (2021)
7. Grüner, O.: Identification of skulls: a historical review and practical applications. In: Iscan, M.Y., Helmer, R.P. (eds.) Forensic Analysis of the Skull: Craniofacial Analysis, Reconstruction, and Identification. Wiley-Liss, New York (1993)
8. Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance
normalization. In: Proceedings of the IEEE International Conference on Computer
Vision, pp. 1501–1510 (2017)
9. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative
adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 4401–4410 (2019)
10. Kim, J., et al.: U-GAT-IT: unsupervised generative attentional networks with
adaptive layer-instance normalization for image-to-image translation (2020)
11. Li, T., et al.: Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph. 36(6), 194:1–194:17 (2017)
12. Mao, X., et al.: Least squares generative adversarial networks. In: Proceedings of
the IEEE International Conference on Computer Vision, pp. 2794–2802 (2017)
13. Miyasaka, S., et al.: The computer-aided facial reconstruction system. Forensic Sci.
Int. 74(1–2), 155–165 (1995)
14. Nagpal, S., et al.: On matching skulls to digital face images: a preliminary approach.
In: 2017 IEEE International Joint Conference on Biometrics (IJCB), pp. 813–819.
IEEE (2017)
15. Delgado, A.N.: The problematic use of race in facial reconstruction. Sci. Cult.
29(5), 568–593 (2020)
16. Paoletti, M.E., et al.: Deep learning classifiers for hyperspectral imaging: a review.
ISPRS J. Photogramm. Remote. Sens. 158, 279–317 (2019)
17. Pearson, K.: On the skull and portraits of George Buchanan. Biometrika, 233–256
(1926)
18. Salimans, T., et al.: Improved techniques for training GANs. Adv. Neural. Inf.
Process. Syst. 29, 2234–2242 (2016)
19. Singh, M., et al.: Learning a shared transform model for skull to digital face
image matching. In: 2018 IEEE 9th International Conference on Biometrics The-
ory, Applications and Systems (BTAS), pp. 1–7. IEEE (2018)
20. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information
Processing Systems, pp. 5998–6008 (2017)
21. Verzé, L.: History of facial reconstruction. Acta Biomed. 80(1), 5–12 (2009)
22. Wang, L., Sindagi, V., Patel, V.: High-quality facial photo-sketch synthesis using
multi-adversarial networks. In: 2018 13th IEEE International Conference on Auto-
matic Face & Gesture Recognition (FG 2018), pp. 83–90. IEEE (2018)
23. Wilkinson, C.: Facial reconstruction-anatomical art or artistic anatomy? J. Anat.
216(2), 235–250 (2010)
24. Zhu, J.-Y., et al.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: 2017 IEEE International Conference on Computer Vision (ICCV) (2017)
Exploring Deep Learning in Road Traffic
Accident Recognition for Roadside Sensing
Technologies
Swee Tee Fu(B) , Bee Theng Lau , Mark Kit Tsun Tee ,
and Brian Chung Shiong Loh
1 Introduction
Road traffic accidents have become a leading cause of casualties and death across the world. Currently, they rank as the ninth leading cause of human casualties worldwide and are projected to become the fifth leading cause by 2030 [1]. A road traffic accident is a phenomenon involving collisions between motorised vehicles, or crashes between motorised vehicles and non-motorised vehicles, pedestrians, or other stationary obstructions, which may lead to property damage and injury to road users. One of the most common causes of fatalities in road traffic accidents is the latency between the detection of an accident and the arrival of paramedic teams at the scene [1–4]. Hence, early recognition of and reaction to accidents are crucial to reducing road traffic fatalities.
Different traffic flow environments have distinct characteristics and traffic phenom-
ena, which is essential for a thorough understanding of the occurrences of road traffic
These research works focused on developing their models based on complex Deep
Learning frameworks.
CNN is the most commonly used Deep Learning approach for studying patterns in spatiotemporal traffic data and classifying traffic conditions. Many research works employ CNNs to perform image or video classification on a frame-by-frame basis, in which the model is mainly trained to distinguish accident scenes from normal traffic scenes. The authors
in [22] use the invariant property of CNN architecture to extract spatial features from
images and video frames at the initial layer before combining them into meaningful pat-
terns at the subsequent layers to detect traffic accident scenes. Traffic-Net image dataset
is used to train the proposed CNN model to classify each video frame into four prede-
fined categories which are accident, dense traffic, fire, and sparse traffic. The proposed
CNN model is compared against a pre-trained ResNet50 model and concludes that the
proposed CNN model achieved a higher accuracy at 94.4% on the four target accident
classes. It is also reported that the accuracy of accident detection is enhanced with Deep
Learning compared to traditional neural networks.
The paper [23] proposed a CNN model trained on 2500 accident images and 2500 non-accident images to classify each input video frame. The classification results for the video frames are stored in a deque, and rolling prediction averaging is used to predict the occurrence of an accident. The proposed model achieved an image prediction accuracy of 85%. Also, Inception V3 is a type of CNN that is 48 layers deep,
and [4] customised the Inception V3 with two Deep Learning network architectures:
DenseNet and SENet to detect high-speed head-on and single-vehicle collision. The
former is to increase the depth of the high dimensional neural networks by repeating the
use of features and the latter is to act as a filtering mechanism to remove the last output
features that are insignificant for the traffic accident detection. The proposed model is
trained upon various traffic collision accident images collected from the Internet. The
result showed that the modified Inception V3 model can reach 96% of accuracy in
traffic collision detection. However, the model suffers accuracy loss when the test set
is derived from a different dataset than the training set. The main reason for this is the
limited training dataset used to train on the traffic collision scenarios. Hence, a large
pool of training datasets is needed to increase the accuracy of the image classification
result.
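The rolling prediction averaging used by [23] can be sketched with a fixed-length deque of per-frame accident scores (the window size and threshold below are illustrative assumptions, not values from the paper):

```python
from collections import deque

class RollingAccidentDetector:
    """Averages the last `window` per-frame accident scores and fires
    only when the running mean crosses `threshold`, smoothing out
    single-frame misclassifications."""
    def __init__(self, window=10, threshold=0.5):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def update(self, frame_score):
        self.scores.append(frame_score)
        mean = sum(self.scores) / len(self.scores)
        return mean >= self.threshold
```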
Besides, [24] utilizes video dynamic detection, whereby feature extraction is performed across a series of video frames using a CNN model and the differences between the video frames are obtained through a GRU model before the prediction result is output through a fully connected neural network. The result is based on the occurrence of frame
changes. It has been reported that the training time is significantly reduced through the
fusion of both CNN and GRU models. This model achieved an accuracy of 96.64% but the
training dataset is limited which may cause overfitting issues. Recently, [25] addressed
the overfitting issue by proposing a CNN with drop-out method which uses Global Aver-
age Pooling (GAP) to reduce the number of learning parameters. The results concluded
that the performance of the proposed deep model is better than LR, DT, SVC, RF, and
KNN as it improves both the accuracy and F-1 score for crash detection. However, it can be seen that interpreting traffic accident occurrence with a CNN on a frame-by-frame basis might not be accurate, as a traffic accident is a time-series event and the relationship between frames is important to consider within the model.
To address this issue, [17] combines both CNN (Inception V3) and LSTM for real-
time accident detection on highways. The LSTM layers were added to the existing CNN,
and both temporal and spatial features were taken into consideration. The extracted
features from each video frame were saved into sequences to be further trained in the
LSTM model for final classification as shown in Fig. 1. Recently, [26] proposed a
time-distributed model which combines a time-distributed LSTM, an LSTM network,
and a dense layer for traffic accident classification as shown in Fig. 2. The proposed
model is trained using the DETRAC dataset for normal traffic scenarios and the accident
video dataset from the YouTube platform. The time-distributed LSTM used allows the
modelling of long-term sequential dependencies across video frames which could further
improve the classification accuracy.
On the other hand, LSTM is also applicable to non-vision-based data, for example,
traffic flow data captured through loop detectors or radar sensors installed at the roadside.
The research [18] performed a comparative study between LSTM and GRU in terms
of traffic accident detection using spatiotemporal data collected from loop detectors.
Synthetic Minority Over Sampling Technique (SMOTE) is employed to balance the
dataset in terms of accident and non-accident cases. It has been demonstrated that both
models share similar false alarm rates but GRU performed slightly better than LSTM in
detection rate. Both models are capable of dealing with dependencies at different time
scales and have a performance accuracy as high as 96%. Moreover, it has been observed
that considering traffic data of only one temporal resolution might not be sufficient to
represent the traffic trends at different time intervals.
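SMOTE, as used in [18], balances the classes by interpolating new minority (accident) samples between a minority sample and one of its minority-class neighbours. A simplified sketch (the actual technique samples among the k nearest neighbours; the function name and data layout here are illustrative):

```python
import random

def smote_like(minority, n_new, seed=0):
    """Generate n_new synthetic minority samples: pick a minority
    sample, find its nearest minority neighbour, and interpolate at
    a random point on the segment between them."""
    rng = random.Random(seed)
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbour = min((p for p in minority if p is not x),
                        key=lambda p: dist(x, p))
        t = rng.random()
        synthetic.append([xi + t * (ni - xi)
                          for xi, ni in zip(x, neighbour)])
    return synthetic
```

The synthetic rows would be appended to the accident class until it matches the size of the non-accident class before training.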
Hence, [16] proposed a novel LSTM-based framework that comprises three LSTM
networks to comprehensively capture traffic states at different time intervals. The outputs
of the three LSTM networks are combined by a fully connected layer with a dropout
layer to avoid the overfitting issue. The model is tested on other similar freeways and achieves a crash detection accuracy of 65.15% in transferability, which is better than models using only a single temporal resolution. Hence, considering traffic data of different temporal resolutions can improve prediction performance, especially in a real-world setting.
Apart from interpreting traffic accident occurrence based on the entire image scene, it would be more accurate if each object within a given frame were detected and tracked across frames for motion and appearance discrepancies. In [27], a CNN is used for lane boundary line extraction and a selective search method for vehicle detection. The prediction of accident occurrence is based solely on the lane boundary lines and the positional relationship with the vehicle's trajectory. Hence, this approach is only applicable in a non-mixed traffic flow environment, where lane discipline is adhered to and vehicle types are of the same size.
As traffic accident detection requires high real-time performance, one-stage models
such as You Only Look Once (YOLO) are one of the preferred Deep Learning techniques
in performing object detection. The author in [28] used YOLO net for car detection and
Fast Fourier Transform algorithm for building the object tracker. The Violent Flow (ViF)
descriptor is then used as input to a Support Vector Machine (SVM) classifier to detect
car crashes. Moreover, [29] utilises YOLO V3 based on darknet for vehicle detection and
uses bounding box prediction to track the movement through identifying the centroid of
the objects. Conservation of momentum is used to calculate the probability of accident
occurrence based on the set of features extracted through the object tracking process and
further classify it into major and minor accidents. Also, [30] combines YOLO V3 and
canny edge detection algorithm to detect cars to perform preliminary classification on the
severity of an accident based on three main pre-trained classes of cars which are normal
car, damaged car and overturned car. To enhance the performance in detecting small
objects, [3] proposes YOLO-CA which is a combination of one-stage model YOLO and
Multi-Scale Feature Fusion (MSFF). This model is comprised of 228 neural network
layers and is formally evaluated against several Deep Learning models such as Fast R-
CNN, Faster R-CNN, Faster R-CNN with FPN, Single Shot MultiBox Detector (SSD),
YOLO V3 without MSFF and YOLO V3. It is demonstrated that the proposed model
can detect car accident occurrences in 0.0461 s with 90.02% average precision.
On the other hand, [31] proposed an improved model based on their previous work on traffic accident classification [32] that supports the detection of a group of vehicles travelling through a predefined cell by incorporating Faster R-CNN for vehicle separation; each individual vehicle is then tracked using a data association scheme to obtain its trajectory for classifying whether a traffic incident has happened. The paper [1] uses Mask
R-CNN for its vehicle detection framework, which employs the RoI Align algorithm
to automatically segment and build pixel-wise masks for every object in the video to
provide more accurate results. Even though this model has higher detection accuracy as
compared to Faster R-CNN, it is still ineffective when dealing with high-density traffic
and occlusion issues. A centroid-based object tracking algorithm is used to track each
of the vehicles detected and the probability of the accident occurred is calculated based
on the acceleration and trajectory anomaly obtained.
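A centroid-based tracker of the kind used in [1] can be sketched as greedy nearest-centroid matching between consecutive frames (a simplified illustration without the track creation, deletion, and disappearance logic a full tracker needs):

```python
import math

def match_centroids(prev, curr):
    """Greedily match each previous-frame centroid to the nearest
    unclaimed current-frame centroid; returns {prev_index: curr_index}."""
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    # All candidate pairs, cheapest distance first.
    pairs = sorted((dist(p, c), i, j)
                   for i, p in enumerate(prev)
                   for j, c in enumerate(curr))
    assigned_prev, assigned_curr, matches = set(), set(), {}
    for _, i, j in pairs:
        if i not in assigned_prev and j not in assigned_curr:
            matches[i] = j
            assigned_prev.add(i)
            assigned_curr.add(j)
    return matches
```

Per-object acceleration and trajectory anomalies can then be computed from the chained matches across frames.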
However, the aforementioned approaches focus on detecting crashes in motorised traffic environments, such as vehicle-vehicle or single-vehicle collisions. A mixed traffic flow environment is even more significant, as large numbers of pedestrians and cyclists share roadways with automobiles. Hence, it is critical for any model built to take into account the traffic characteristics of a mixed traffic flow environment. To support this, [19] initiated an attempt to investigate collision detection in a dense, lane-less mixed traffic flow environment using a newly proposed framework known as Siamese Interaction LSTM (SILSTM). This framework comprises one or more bidirectional LSTM (BLSTM) layers that learn salient features from long vehicle interaction trajectories as well as pedestrian spatial trajectories, allowing highly effective detection of safe and collision-prone interaction trajectories in lane-less traffic. A bidirectional LSTM can capture the propensity of
a collision caused by a sudden change in speed of the vehicles more accurately through
incorporating both future and past context of each time step using a separate LSTM on
the reversed sequence [19]. This approach can better represent the complex interaction trajectories of every road user in mixed traffic flow, especially in the absence of lane discipline and with inconsistent vehicle speeds, by examining the relative distance and speed information of each vehicle and its neighbouring vehicles in a bidirectional LSTM. The proposed model is compared against different variants of LSTM and GRU; it achieves higher recall, precision, and F1 score but is relatively expensive in terms of computational cost.
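The bidirectional mechanism can be illustrated with a toy NumPy forward pass: one LSTM reads the interaction features forward in time, a second reads them in reverse, and their hidden states are concatenated per time step so every step carries both past and future context (random weights for illustration; this is not the SILSTM architecture itself):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(xs, W, U, b, hidden):
    """Run one LSTM over a sequence xs of shape (T, D); return all hidden states."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    out = []
    for x in xs:
        z = W @ x + U @ h + b                 # (4*hidden,) gate pre-activations
        i, f, o, g = np.split(z, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c = f * c + i * g                     # cell state update
        h = o * np.tanh(c)                    # hidden state
        out.append(h)
    return np.stack(out)                      # (T, hidden)

def bilstm(xs, params_fwd, params_bwd, hidden):
    """Bidirectional LSTM: a forward pass plus a separate LSTM on the reversed
    sequence, concatenated so every step sees past and future context."""
    hf = lstm_forward(xs, *params_fwd, hidden)
    hb = lstm_forward(xs[::-1], *params_bwd, hidden)[::-1]
    return np.concatenate([hf, hb], axis=1)   # (T, 2*hidden)

T, D, H = 6, 4, 5                             # trajectory length, feature dim, hidden size
make = lambda: (0.1 * rng.standard_normal((4 * H, D)),
                0.1 * rng.standard_normal((4 * H, H)),
                np.zeros(4 * H))
traj = rng.standard_normal((T, D))            # e.g. relative distance/speed features
feats = bilstm(traj, make(), make(), H)
print(feats.shape)  # (6, 10)
```

The doubled feature dimension per time step is why a sudden speed change late in a trajectory can influence the representation of earlier steps.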
Clearly, object detection and tracking in a mixed traffic flow environment are more complex than in a non-mixed environment due to the wide variation in the sizes and types of road users. On top of that, a good object detector contributes significantly to the accuracy of object tracking and, in turn, to good traffic accident recognition. The study [7] incorporated the Retinex image enhancement technique to enhance the input images before YOLO V3 with Darknet-53 is trained to detect multiple objects in images, such as fallen pedestrians/cyclists, moving/stopped pedestrians/cyclists/vehicles, single-vehicle collisions, and multiple-object collisions. Features extracted from YOLO V3 are then fed into a decision tree model to further classify the crashes. The results show that the proposed model achieves a detection accuracy of 92.5% on crashes in the testing dataset, which comprised a total of 30214 crash frames and 42148 normal frames over a total of 12736 frames.
To improve both object detection speed and accuracy in a mixed traffic flow environment, [33] proposed the Detection Transformer (DETR) approach, which comprises a CNN ResNet-50 backbone, a transformer encoder-decoder block that helps focus on the most influential features of the car, motorcycle, and truck for a particular detection, and fully connected layers for class and bounding box predictions. It is reported that the DETR approach achieved a detection rate of 78.2% with low latency compared with previous work. The DETR output is then fed into a random forest classifier to classify each frame as either an accident or a non-accident frame. Lastly, the probability of an accident occurring is derived from the predictions over the past 60 frames using a sliding window technique. The detection rate of DETR is remarkable; however, there is still a gap, especially in localising diminished objects.

38 S. T. Fu et al.

On the
other hand, [34] presents a condensed version of YOLO known as Mini-YOLO, composed of a pre-trained YOLO V3 with ResNet-152 and MobileNet-v2. Mini-YOLO is primarily trained to detect motorcycles, cars, trucks, and buses. Simple Online Realtime Tracking (SORT) is then used to track each detected object for its damage status, and a Support Vector Machine (SVM) classifies accident occurrence based on the damage status of the objects across frames.
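The sliding-window step reported for [33] can be sketched as follows: per-frame accident/non-accident predictions are kept in a fixed-length window and the fraction of accident frames serves as the accident probability (the 60-frame window follows the text; everything else here is an illustrative assumption):

```python
from collections import deque

class SlidingWindowAccidentScore:
    """Derive an accident probability from the last `window` frame-level predictions."""

    def __init__(self, window=60):
        self.preds = deque(maxlen=window)   # oldest predictions fall out automatically

    def update(self, is_accident_frame):
        """Push one binary frame classification and return the current probability."""
        self.preds.append(1 if is_accident_frame else 0)
        return sum(self.preds) / len(self.preds)

scorer = SlidingWindowAccidentScore(window=60)
for _ in range(50):
    scorer.update(False)            # normal traffic
p = 0.0
for _ in range(30):
    p = scorer.update(True)         # a run of accident frames raises the score
print(round(p, 2))                  # 0.5: 30 accident frames in the 60-frame window
```

Smoothing over a window like this suppresses one-off misclassifications of single frames, at the cost of a short detection delay.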
Detecting smaller road users such as motorcycles, bicycles, and pedestrians has always been challenging for traffic accident recognition in a mixed traffic flow environment. The paper [35] deploys Context Mining and Augmented Context Mining on top of Faster R-CNN to improve the detection of pedestrians, which occupy smaller image segments than other vehicle categories. The features extracted from Faster R-CNN are fed into a Dynamic-Spatial-Attention LSTM (DSA-LSTM) for accident forecasting. However, Faster R-CNN is a two-stage model that involves an additional stage of selecting proposal regions before object detection, so processing is much slower and not suitable for real-time applications.
From the literature, it can be seen that most Deep Learning-based recognition algorithms rely on a single feature-based approach to traffic accident detection: either appearance feature-based approaches [7, 35] or motion feature-based approaches [1, 19, 27, 28, 31]. These examine appearance or motion crash features within video frames to determine whether an accident has occurred. Recently, a fusion feature-based approach that combines both appearance and motion crash features has been widely adopted in traffic accident detection, especially in mixed traffic flow environments. This is mainly because the complex nature of mixed traffic flow means that considering only a single perception in traffic accident recognition may cause accuracy to suffer.
The study [15] proposed a two-stream traffic accident detection framework with one stream focusing on collision detection and the other on abnormality detection. Abnormality detection is performed by extracting deep representations of three modalities (appearance, motion, and their joint representation) using a stacked autoencoder to obtain anomaly scores and reconstruction errors, but this approach is computationally intensive [36]. The collision score is then determined from the trajectories of the moving vehicles and is used to increase the reliability of the overall result.
In addition, [20] proposed an integrated two-stream CNN architecture that performs near-accident detection of mixed road users using traffic data sourced from fisheye surveillance footage, multi-scale drone footage, and simulated videos. The two-stream CNNs comprise a spatial stream network for object detection, capturing appearance features using the YOLO detector, and a temporal stream network (ConvNet model) that leverages the motion information of multiple objects to generate individual trajectories for each tracked target, as shown in Fig. 3. In addition, a deep cosine metric learning method called DeepSORT is used to train the temporal stream for
Table 1. (continued)

| Ref | Year | Deep Learning Algorithm Used | Data Source | Type of Area | Environmental Condition | Motorized/Mixed Traffic Flow | Dataset Used | Comparative Models | Performance |
|---|---|---|---|---|---|---|---|---|---|
| [22] | 2019 | Deep CNN | Image dataset from CCTV surveillance footage | Urban | - | Motorized | TrafficNet | ResNet50 | 94.4% accuracy |
| [23] | 2020 | CNN | Image dataset from CCTV surveillance footage | Urban | - | Motorized | Google Images on accident scenes | - | 85% accuracy |
| [4] | 2019 | Inception V3 | Simulation videos | - | - | Motorized | NA | - | 96% accuracy |
| [24] | 2018 | CNN + RNN (GRU) | Simulated videos | - | - | Motorized | NA | STM | 96.64% accuracy |
| [25] | 2020 | CNN + ANN | Radar sensor traffic data | Highway | - | Motorized | Des Moines, IA | LR, DT, RF, SVC, KNN | 76% accuracy |
| [4] | 2019 | Inception V3 + LSTM (CNN-LSTM) | CCTV surveillance footage | Highway | - | Motorized | NA | - | 92.38% accuracy |
| [26] | 2021 | LSTM | CCTV surveillance footage | - | - | Motorized (car only) | UA-DETRAC and YouTube videos | - | 94.33% accuracy |
| [18] | 2019 | LSTM | Loop detector | Expressway | Under various weather conditions and traffic congestion status | Motorized | Chicago Expressway | - | 96% accuracy |
| [18] | 2019 | GRU | Loop detector | Expressway | Under various weather conditions and traffic congestion status | Motorized | Chicago Expressway | - | 95.9% accuracy |
| [16] | 2020 | LSTMDTR (LSTM for Different Temporal Resolution) | Loop detector | Freeway | - | Motorized | I880-N and I805-N | KNN, LR, NB, ANN, SVM, RF | 70.43% accuracy |
| [27] | 2019 | CNN | CCTV surveillance footage | Highway | Under various weather and illumination conditions | Motorized | UA-DETRAC | GMM | 95.2% detection rate |
| [28] | 2018 | YOLO net | CCTV surveillance footage | City | Under various weather and illumination conditions | Motorized (car only) | CCV | - | 75% accuracy |
| [30] | 2020 | YOLO V3 | CCTV surveillance footage | City | - | Motorized | NA | - | NA |
| [3] | 2019 | YOLO-CA | CCTV surveillance footage | Urban road | Under various weather and illumination conditions | Motorized (car only) | CAD-CVIS | ARRS, DSA-RNN | 90.02% average precision |
| [31, 32] | 2018 | CNN + Faster R-CNN | CCTV surveillance footage | Highway, urban and rural | Under various weather and illumination conditions | Motorized | 6 videos consisting of various traffic incidents | SVM + PHOG + GMM, GoogleNet CNN | F1 score of 80% |
| [1] | 2019 | Mask R-CNN | CCTV surveillance footage | Intersection | Under various weather and illumination conditions | Motorized | CCTV videos from YouTube | Vision-based model (ARRS) | 71% detection rate |
| [19] | 2019 | SILSTM | CCTV surveillance footage + aerial footage | Intersections and highways | - | Mixed traffic flow | SkyEye dataset | Different variants of SILSTM | - |
| [7] | 2020 | YOLO V3 | CCTV surveillance footage | City | Under various … | Mixed traffic flow | CCTV videos from … | - | 92.5% detection … |
| [33] | 2020 | ResNet + DETR | CCTV surveillance footage | Intersections and highways | Under various weather and illumination conditions | Mixed traffic flow (multi-vehicle crashes and pedestrian-vehicle crashes) | CADP | Mask R-CNN, Stacked Autoencoder | 78.2% detection rate |
| [34] | 2021 | Mini-YOLO | CCTV surveillance footage | Urban | - | Mixed traffic flow (car, motorcycle, truck and bus) | Boxy vehicles dataset | - | - |
| [35] | 2019 | Faster R-CNN + DSA-LSTM | CCTV surveillance footage | Intersections and highways | Under various weather and illumination conditions | Mixed traffic flow (multi-vehicle crashes and pedestrian-vehicle crashes) | CADP | - | 80% recall |
| [15] | 2019 | Stacked Autoencoder | CCTV surveillance footage | City | Under various illumination conditions | Motorized (bike and car) | - | ARRS, RTADS | - |
| [20] | 2020 | Two-stream CNN (spatial stream + temporal stream) | Fisheye surveillance footage + multi-scale drone footage + simulation videos | Intersections | Under various illumination conditions | Mixed traffic flow (cars, bikes and pedestrians) | TNAD datasets | - | F1 score of 93.09% |
| [21] | 2020 | ResNet + Conv-LSTM | CCTV surveillance footage | City | Under various weather and illumination conditions | Mixed traffic flow (cars, bikes and pedestrians) | Traffic image datasets in China | Faster R-CNN + SORT, ResNet-50 + CBAM | 87.78% accuracy |
After a study of state-of-the-art Deep Learning techniques for traffic accident recognition from roadside sensing technologies, there are still many issues worth exploring. Traditionally, traffic data is mainly collected through active sensors such as radar and lightwave photosensors installed at the roadside, which may not provide consistent and reliable counts to support traffic accident detection, especially in mixed traffic flow environments [25]. Recently, video data captured by the vision sensors within CCTV surveillance cameras has become viable with the introduction of Deep Learning techniques, as it captures real-time traffic information and conditions [39], and Deep Learning models have proven to be a promising tool for real-time traffic accident detection [40]. Deep Learning is also the state-of-the-art method for time-series problems; it can simulate dynamic changes in traffic conditions and detect anomalous activity, which can improve the performance of traffic accident recognition [16, 41].
However, various external factors, such as weather, illumination, and congestion conditions, can severely influence the reliability of traffic accident recognition. Hence, it is crucial to consider these external factors when designing the Deep Learning architecture in order to improve recognition accuracy [39]. Multiple research works have addressed these external factors, but consideration of congestion conditions is rarely highlighted. Also, due to the dynamic nature of traffic data, the detection technique used must be able to model traffic flow at different time intervals. However, most works considered traffic data at only one temporal resolution, which is not sufficient to represent traffic trends at different time intervals.
As reported by [7], most of the previous research works focused on detecting traffic
accidents in non-mixed traffic flow environments, for example, single-vehicle collisions
and vehicle-vehicle collisions [1, 3, 4, 31] instead of in mixed traffic flow environments
[7, 19–21, 35]. A mixed traffic flow environment is usually found in urban traffic scenes
and intersections which often have the highest number of traffic accident fatalities.
Nonetheless, there is limited literature on traffic accident recognition in this context. On
top of that, mixed traffic flow environments present an exponential increase in the factors
that influence traffic accident modelling when compared to its non-mixed counterpart,
which cannot be easily addressed through conventional methods. The authors in [19] and
[7] reported that crash detections remain a challenge in mixed traffic flow environments
as non-motorized traffic such as pedestrians or cyclists tend to be blocked by other
objects. These findings show that there is a gap between the implementation of existing
conventional traffic accident detection systems for non-mixed versus mixed traffic flow
environments.
In a mixed traffic flow environment, one of the greatest concerns is the complexity of detecting smaller objects such as pedestrians, motorcyclists, and cyclists: these road users occupy smaller image footprints than other vehicles, and it is often challenging to draw a tight bounding box around them, especially across video frames. This causes significant degradation of object detection accuracy, especially in accident scenes that may involve fallen motorcyclists and fallen pedestrians.
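The difficulty can be made concrete with intersection-over-union (IoU), the standard localisation measure: the same few-pixel shift that barely affects a large vehicle box pushes a small pedestrian box below a typical 0.5 matching threshold (the box sizes here are illustrative, not taken from any cited dataset):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

# the same 4-pixel shift applied to a 100-px car box and a 20-px pedestrian box
car_gt, car_pred = (0, 0, 100, 100), (4, 4, 104, 104)
ped_gt, ped_pred = (0, 0, 20, 20), (4, 4, 24, 24)
print(round(iou(car_gt, car_pred), 2))  # 0.85: large box stays well above 0.5
print(round(iou(ped_gt, ped_pred), 2))  # 0.47: small box falls below 0.5
```

Identical localisation error in pixels thus counts as a correct detection for the car but a miss for the pedestrian, which is one reason small-road-user accuracy degrades.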
Besides, the absence of lane discipline in a mixed traffic flow environment makes it challenging to annotate unsafe interaction trajectories, especially when the vehicle creeping phenomenon occurs, in which smaller vehicles move through gaps to reach the front of the queue during overtaking, causing vehicles to travel very close to each other. Also, traditional object tracking approaches such as background subtraction [42–44] or optical flow [45–48] are found to be sub-optimal under complex traffic scenarios in a mixed traffic flow environment, such as dense traffic, sudden vehicle acceleration, and occlusion. In light of the many challenges faced in object-level detection and tracking, one concept that may help improve performance in mixed traffic scenarios is first interpreting the whole traffic scene across a series of video frames. In [21] it is stated that scene understanding could provide contextual information and scene structure that might be helpful in traffic accident recognition. Hence, it is suspected that instead of pinpointing a single object within a single video frame or across frames, looking at the bigger picture, in this case the whole traffic scene, might provide better accuracy in traffic accident detection.
Reviewing the covered Deep Learning-based traffic accident recognition approaches in a mixed traffic flow environment, it can be seen that the majority of proposed models focus on modelling the motion flow of road users to detect abnormality [1, 19, 27, 28, 31] or on modelling the crash appearance of road users [7, 35]. Recently, a fusion feature-based approach that combines both appearance and motion crash features has been widely adopted in traffic accident detection, especially in mixed traffic flow [15, 20, 21, 29]. It is worth noting that the fusion feature-based approach has significantly improved the overall performance of traffic accident recognition models by considering multiple perspectives at the object level across video frames. Other prominent features could be considered in this fusion approach, serving as reinforcement for the decisions made on the basis of the other features. Perhaps, in the context of a mixed traffic flow environment, a fusion feature-based Deep Learning model that takes into consideration features at both the scene and object levels could improve overall accuracy. However, the recognition accuracy, speed, and computational cost of such a fusion feature-based Deep Learning approach warrant further investigation.
Lastly, comprehensive validation is necessary for any real-time detection model. From the literature, the most common performance evaluation metrics for Deep Learning models are detection rate, false alarm rate, and overall accuracy. However, only a few studies have employed all three measures to comprehensively validate their model's performance [1, 7, 15, 18, 27], and most did not report specific measurements of model performance. It is observed that many proposed Deep Learning models are validated against Machine Learning models; however, comprehensive validation against other Deep Learning models is also recommended for better benchmarking.
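For reference, the three measures follow directly from the confusion counts, with detection rate being recall on the accident class and false alarm rate the share of normal cases flagged as accidents (the counts below are illustrative):

```python
def evaluate(tp, fp, tn, fn):
    """Detection rate, false alarm rate, and overall accuracy from confusion counts."""
    detection_rate = tp / (tp + fn)          # recall on the accident class
    false_alarm_rate = fp / (fp + tn)        # normal cases wrongly flagged as accidents
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return detection_rate, false_alarm_rate, accuracy

# illustrative counts: 90 of 100 accidents detected, 20 of 900 normal frames flagged
dr, far, acc = evaluate(tp=90, fp=20, tn=880, fn=10)
print(f"DR={dr:.2f} FAR={far:.3f} ACC={acc:.2f}")  # DR=0.90 FAR=0.022 ACC=0.97
```

The example also shows why accuracy alone is insufficient: with accidents rare, a model can score high accuracy while missing many accidents or raising frequent false alarms.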
This remains a challenge, as most research works depend on their own private datasets, which cannot be accessed for comparison against the proposed models [7]. To date, benchmarking and balancing recognition accuracy against false positives/negatives among various Deep Learning models is an important issue that has yet to be addressed in the real-time traffic accident recognition literature.
The previous section highlighted the need for Deep Learning models that handle traffic accident recognition effectively. Drawing inspiration from the fusion feature-based deep learning approach that combines appearance and motion crash features, this section presents a proposed fusion feature-based deep learning model for vision-based traffic accident recognition that takes into consideration features at both the scene and object levels in the context of a mixed traffic flow environment. The proposed model is inspired by the research works of [15, 20, 21] and is believed to improve robustness in accommodating traffic accident recognition in a mixed traffic flow environment. A traffic accident, as defined for this proposed model, is a major accident causing serious injuries and damage due to a high-impact crash on the roadway: for instance, a multi-vehicle crash causing severe damage to vehicle appearance, a vehicle rollover, or fallen motorists or pedestrians detected continuously over a substantial amount of time.
The architecture of the proposed model comprises three main streams: Model 1 classifies accident scenes at the frame level; Model 2 classifies the damage status of each detected road user; and Model 3 tracks objects across frames, modelling the damage status of each tracked object as well as the motion pattern across frames for abnormality. The post-fusion score from these three streams is used to obtain the final score for determining the occurrence of a traffic accident. The complete process flow of the proposed fusion feature-based deep learning model for traffic accident recognition is illustrated in Fig. 5.
As discussed in the previous section, scene understanding is an important component for improving the accuracy of road traffic accident recognition in a mixed traffic scenario. Hence, in the proposed model, Model 1 accepts a video input that is split into video frames in sequence; these frames are run through a recurrent neural network for feature extraction and video classification, producing two independent outputs (accident and non-accident). From the reviews, recurrent neural networks such as LSTM, CNN-LSTM, and ConvLSTM are among the most popular Deep Learning frameworks in this context, as they are particularly effective for real-time data such as traffic time series, weather data, and congestion status data.
On the other hand, determining appearance changes of detected road users, such as crash damage, is also one of the prominent features in road traffic accident recognition. Model 2 is trained using a deep neural network to identify patterns in the detected objects across space, classifying damage features for different classes of road users such as cars, trucks, motorcycles, and pedestrians. This model is embedded as a component in the development of Model 3.
Also, observing road user interaction is a primary component of traffic accident recognition. This can be achieved through Multiple Object Tracking (MOT), in which the trajectories of several moving objects across video frames are extracted and analysed [49]. The main responsibilities of MOT are to locate multiple objects, maintain the identities of the objects, and yield individual object trajectories across the video frames [50]. The emergence of Deep Learning-based models has provided a powerful framework for formulating and modelling the target association problem, which can boost tracking performance significantly. Leveraging the remarkable studies by past researchers on MOT in the transportation domain [20, 51], it is evident that a Deep Learning-based approach can better track different types of road users in a mixed traffic flow environment for traffic accident recognition. Hence, Model 3 utilizes a deep learning-based approach to detect objects within each video frame of the video sequence and takes the deep learning features of each object into account as one of its tracking metrics. Tracking information such as bounding boxes, stacked trajectories, centroid positions, speeds, and interaction-related parameters of moving objects across several consecutive frames is saved into sequences, which are fed into a recurrent neural network to learn the motion sequences across frames. On top of that, each object tracked in each frame is passed to Model 2 as input for classification of the object's damage status. The damage status of each object tracked across consecutive frames is saved into a sequence and passed into a recurrent neural network to learn the pattern of appearance change features. The outputs from both the motion and appearance abnormality are used to determine the occurrence of the traffic accident.
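One simple way to realise the post-fusion step is a weighted late fusion of the three stream scores followed by a decision threshold (the weights and threshold are illustrative placeholders, not values specified by the proposed model):

```python
def fuse_scores(scene_score, appearance_score, motion_score,
                weights=(0.4, 0.3, 0.3), threshold=0.5):
    """Late fusion: weighted average of per-stream accident scores in [0, 1]."""
    w1, w2, w3 = weights
    final = w1 * scene_score + w2 * appearance_score + w3 * motion_score
    return final, final >= threshold

# Model 1 (scene), Model 2 (appearance change), Model 3 (motion abnormality)
score, is_accident = fuse_scores(0.9, 0.7, 0.8)
print(round(score, 2), is_accident)  # 0.81 True
```

In practice the weights would be tuned (or learned) on validation data, since the relative reliability of the scene, appearance, and motion streams varies with the traffic scenario.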
In short, the proposed model comprises three main streams, each of which houses a technique catering for a prominent feature in traffic accident recognition. This proposed model can be dynamically scaled to accommodate more simultaneous streams in an effort to increase traffic accident recognition accuracy. Likewise, adding more streams may not result in a significant accuracy increase, as performance depends on the selected features and Deep Learning techniques used. On the other hand, running too many streams risks a negligible accuracy gain while exacerbating computation time, making real-time operation impossible. Therefore, much care is needed to strike a balance between performance improvement and the number of streams used, to avoid these extremes.
Fig. 5. The proposed fusion feature-based deep learning model for traffic accident recognition in
a mixed traffic flow environment
References
1. Ijjina, E.P., Chand, D., Gupta, S., Goutham, K.: Computer vision-based accident detection
in traffic surveillance. In: 2019 10th International Conference on Computing, Communication
and Networking Technologies (ICCCNT 2019) (2019). https://doi.org/10.1109/ICCCNT45670.2019.8944469
2. Almaadeed, N., Asim, M., Al-Maadeed, S., Bouridane, A., Beghdadi, A.: Automatic detection
and classification of audio events for road surveillance applications. Sensors 18(6), 1858
(2018). https://doi.org/10.3390/s18061858
3. Tian, D., Zhang, C., Duan, X., Wang, X.: An automatic car accident detection method based on
cooperative vehicle infrastructure systems. IEEE Access 7, 127453–127463 (2019). https://
doi.org/10.1109/ACCESS.2019.2939532
4. Chang, W.J., Chen, L.B., Su, K.Y.: DeepCrash: a deep learning-based internet of vehicles
system for head-on and single-vehicle accident detection with emergency notification. IEEE
Access 7, 148163–148175 (2019). https://doi.org/10.1109/ACCESS.2019.2946468
5. Sai Kiran, M., Verma, A.: Review of studies on mixed traffic flow: perspective of develop-
ing economies. Transp. Dev. Econ. 2(1), 1–16 (2016). https://doi.org/10.1007/s40890-016-
0010-0
6. Phogat, A., Gupta, R., Kumar, E. N.: Study on effect of mixed traffic in highways, pp. 1288–
1291 (2020)
7. Wang, C., Dai, Y., Zhou, W., Geng, Y.: A vision-based video crash detection framework for
mixed traffic flow environment considering low-visibility condition. J. Adv. Transp. 2020,
1–11 (2020). https://doi.org/10.1155/2020/9194028
8. Parsa, A.B., Taghipour, H., Derrible, S., (Kouros) Mohammadian, A.: Real-time accident
detection: coping with imbalanced data. Accid. Anal. Prev. 129, 202–210 (2019). https://doi.
org/10.1016/j.aap.2019.05.014
9. Zhang, K.: Towards transferable incident detection algorithms. 6, 2263–2274 (2005)
10. Ki, Y.K., Lee, D.Y.: A traffic accident recording and reporting model at intersections. IEEE
Trans. Intell. Transp. Syst. 8(2), 188–194 (2007). https://doi.org/10.1109/TITS.2006.890070
11. Hui, Z., Xie, Y., Lu, M., Fu, J.: Vision-based real-time traffic accident detection. In: Proceedings
of the 11th World Congress on Intelligent Control and Automation 2014, pp. 1035–1038
(2015). https://doi.org/10.1109/WCICA.2014.7052859
12. Kwak, H.C., Kho, S.: Predicting crash risk and identifying crash precursors on Korean express-
ways using loop detector data. Accid. Anal. Prev. 88, 9–19 (2016). https://doi.org/10.1016/j.
aap.2015.12.004
13. Ravindran, V., Viswanathan, L., Rangaswamy, S.: A novel approach to automatic road-
accident detection using machine vision techniques. Int. J. Adv. Comput. Sci. Appl. 7(11),
235–242 (2016). https://doi.org/10.14569/ijacsa.2016.071130
14. Basso, F., Basso, L.J., Bravo, F., Pezoa, R.: Real-time crash prediction in an urban expressway
using disaggregated data. Transp. Res. Part C Emerg. Technol. 86, 202–219 (2018). https://
doi.org/10.1016/j.trc.2017.11.014
15. Singh, D., Mohan, C.K.: Deep spatio-temporal representation for detection of road accidents
using stacked autoencoder. IEEE Trans. Intell. Transp. Syst. 20(3), 879–887 (2019). https://
doi.org/10.1109/TITS.2018.2835308
16. Jiang, F., Yuen, K.K.R., Lee, E.W.M.: A long short-term memory-based framework for crash
detection on freeways with traffic data of different temporal resolutions. Accid. Anal. Prev.
141, 105520 (2020)
50 S. T. Fu et al.
17. Ghosh, S., Sunny, S.J., Roney, R.: Accident detection using convolutional neural networks.
In: Int. Conf. Data Sci. Commun. (IconDSC 2019), pp. 1–6 (2019). https://doi.org/10.1109/IconDSC.
2019.8816881
18. Parsa, A.B., Chauhan, R.S., Taghipour, H., Derrible, S., Mohammadian, A.: Applying deep
learning to detect traffic accidents in real time using spatiotemporal sequential data (2019).
http://arxiv.org/abs/1912.06991
19. Roy, D., Ishizaka, T., Krishna Mohan C., Fukuda, A.: Detection of collision-prone vehicle
behavior at intersections using siamese interaction LSTM. IEEE Trans. Intell. Transp. Syst.,
1–10 (2019). http://arxiv.org/abs/1912.04801
20. Huang, X., He, P., Rangarajan, A., Ranka, S.: Intelligent intersection: two-stream convo-
lutional networks for real-time near-accident detection in traffic video. ACM Trans. Spat.
Algorithms Syst. 6(2), 1–28 (2020). https://doi.org/10.1145/3373647
21. Lu, Z., Zhou, W., Zhang, S., Wang, C.: A new video-based crash detection method: balancing
speed and accuracy using a feature fusion deep learning framework. J. Adv. Transp. 2020,
1–12 (2020). https://doi.org/10.1155/2020/8848874
22. Kumeda, B., Fengli, Z., Oluwasanmi, A., Owusu, F., Assefa, M., Amenu, T.: Vehicle accident
and traffic classification using deep convolutional neural networks. In: 2019 16th International
Computer Conference on Wavelet Active Media Technology and Information Processing
ICCWAMTIP 2019, pp. 323–328 (2019). https://doi.org/10.1109/ICCWAMTIP47768.2019.
9067530
23. Rajesh, G., Benny, A. R., Harikrishnan, A., Jacobabraham, J., John, N. P.: A deep learning
based accident detection system. In: Proceeding of the 2020 IEEE International Conference
on Communication and Signal Processing ICCSP 2020, pp. 1322–1325 (2020). https://doi.
org/10.1109/ICCSP48568.2020.9182224
24. Zheng, K., Yan, W.Q., Nand, P.: Video dynamics detection using deep neural networks. IEEE
Trans. Emerg. Top. Comput. Intell. 2(3), 224–234 (2018). https://doi.org/10.1109/TETCI.
2017.2778716
25. Huang, T., Wang, S., Sharma, A.: Highway crash detection and risk estimation using deep
learning. Accid. Anal. Prev. 135, p. 105392 (2020). https://doi.org/10.1016/j.aap.2019.105392
26. Gupta, G., Singh, R., Patel, A.S., Ojha, M.: Time-distributed model in videos (2021)
27. Wang, P., Ni, C., Li, K.: Vision-based highway traffic accident detection. In: ACM Interna-
tional Conference on Proceeding Series, pp. 5–9 (2019). https://doi.org/10.1145/3371425.
3371449
28. Machaca Arceda, V. E., Laura Riveros, E.: Fast car crash detection in video. In: Proceeding
of the 2018 44th Latin American Computer Conference (CLEI) 2018, pp. 632–637 (2018).
https://doi.org/10.1109/CLEI.2018.00081
29. Paul, A. R.: Semantic video mining for accident detection, 5(6), 670–678 (2020)
30. Chung, Y.L., Lin, C.K.: Application of a model that combines the YOLOv3 object detection
algorithm and canny edge detection algorithm to detect highway accidents. Symmetry (Basel)
12(11), 1–26 (2020). https://doi.org/10.3390/sym12111875
31. Vu, H.N., Dang, N.H.: An improvement of traffic incident recognition by deep convolutional
neural network. Int. J. Innov. Technol. Explor. Eng. 8(1), 10–14 (2018)
32. Vu, N., Pham, C.: Traffic incident recognition using empirical deep convolutional neural
networks model. In: Cong Vinh, P., Ha Huy Cuong, N., Vassev, E. (eds.) ICCASA/ICTCC
-2017. LNICSSITE, vol. 217, pp. 90–99. Springer, Cham (2018). https://doi.org/10.1007/
978-3-319-77818-1_9
33. Srinivasan, A., Srikanth, A., Indrajit, H., Narasimhan, V.: A novel approach for road acci-
dent detection using DETR algorithm. In: 2020 International Conference on Intelligent Data
Science Technologies and Applications IDSTA 2020, pp. 75–80, (2020). https://doi.org/10.
1109/IDSTA50958.2020.9263703
Exploring Deep Learning in Road Traffic Accident 51
34. Pillai, M.S., Chaudhary, G., Khari, M., Crespo, R.G.: Real-time image enhancement for an
automatic automobile accident detection through CCTV using deep learning. Soft. Comput.
25(18), 11929–11940 (2021). https://doi.org/10.1007/s00500-021-05576-w
35. Shah, A. P., Lamare, J. B., Nguyen-Anh, T., Hauptmann, A.: CADP: a novel dataset for
CCTV traffic camera based accident analysis. In: Proceeding of the AVSS 2018 15th IEEE
International Conference on Advanced Video and Signal Based Surveillance, no. i (2019).
https://doi.org/10.1109/AVSS.2018.8639160
36. Tsiktsiris, D., Dimitriou, N., Lalas, A., Dasygenis, M., Votis, K., Tzovaras, D.: Real-time
abnormal event detection for enhanced security in autonomous shuttles mobility infrastruc-
tures. Sensors (Switzerland) 20(17), 1–24 (2020). https://doi.org/10.3390/s20174943
37. Li, P., Abdel-Aty, M., Yuan, J.: Real-time crash risk prediction on arterials based on LSTM-
CNN. Accid. Anal. Prev. 135, 105371 (2020). https://doi.org/10.1016/j.aap.2019.105371
38. Wen, L., et al.: UA-DETRAC: A new benchmark and protocol for multi-object detection and
tracking. Comput. Vis. Image Underst. 193, 102907 (2020). https://doi.org/10.1016/j.cviu.
2020.102907
39. Nguyen, H., Kieu, L.M., Wen, T., Cai, C.: Deep learning methods in transportation domain: a
review. IET Intell. Transp. Syst. 12(9), 998–1004 (2018). https://doi.org/10.1049/iet-its.2018.
0064
40. Theofilatos, A., Chen, C., Antoniou, C.: Comparing machine learning and deep learning
methods for real-time crash prediction. Transp. Res. Rec. 2673(8), 169–178 (2019). https://
doi.org/10.1177/0361198119841571
41. Pawar, K., Attar, V.: Deep learning approaches for video-based anomalous activity detection.
World Wide Web 22(2), 571–601 (2018). https://doi.org/10.1007/s11280-018-0582-1
42. Jun, G., Aggarwal, J. K., Gökmen, M.: Tracking and segmentation of highway vehicles in
cluttered and crowded scenes. 2008 IEEE Workshop on Applications of Computer Vision,
WACV (2008). https://doi.org/10.1109/WACV.2008.4544017
43. Kim, Z. W.: Real time object tracking based on dynamic feature grouping with background
subtraction. 26th IEEE Conference on Computer Vision Pattern Recognition, CVPR (2008).
https://doi.org/10.1109/CVPR.2008.4587551
44. Mendes, J. C., Bianchi, A. G. C., Pereira, Á. R.: Vehicle tracking and origin-destination count-
ing system for urban environment. VISAPP 2015 - International Conference on Computer
Vision Theory and Applications VISIGRAPP, vol. 3, pp. 600–607 (2015). https://doi.org/10.
5220/0005317106000607
45. Wu, S., Moore, B. E., Shah, M.: Chaotic invariants of Lagrangian particle trajectories for
anomaly detection in crowded scenes. In: Proceeding of the IEEE Computer Society Confer-
ence on Computer Vision and Pattern Recognition, pp. 2054–2060 (2010). https://doi.org/10.
1109/CVPR.2010.5539882
46. Patoliya, P., Bombaywala, P. S. R.: Object detection and tracking for surveillance system, vol.
3, issue 6, pp. 18–24 (2015)
47. Pradhan, B., Ibrahim Sameen, M.: Laser Scanning Systems in Highway and Safety
Assessment, vol. 7 (2020).https://doi.org/10.1007/978-3-030-10374-3
48. Veni, S., Anand, R., Santosh, B.: Road accident detection and severity determination from
CCTV surveillance. In: Tripathy, A.K., Sarkar, M., Sahoo, J.P., Li, K.-C., Chinara, S. (eds.)
Advances in Distributed Computing and Machine Learning. LNNS, vol. 127, pp. 247–256.
Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-4218-3_25
49. Ooi, H.-L., Bilodeau, G.-A., Saunier, N., Beaupré, D.-A.: Multiple object tracking in urban
traffic scenes with a multiclass object detector. In: Bebis, G., et al. (eds.) ISVC 2018. LNCS,
vol. 11241, pp. 727–736. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-03801-
4_63
52 S. T. Fu et al.
50. Luo, W., Xing, J., Milan, A., Zhang, X., Liu, W., Kim, T.-K.: Multiple object tracking:
a literature review. Artif. Intell. 293, 103448 (2020). https://doi.org/10.1016/j.artint.2020.
103448
51. Chan, Z. Y., Suandi, S. A.: City tracker: multiple object tracking in urban mixed traffic
scenes. In: Proceedings of the 2019 IEEE International Conference Signal Image Processing
Applications ICSIPA 2019, pp. 335–339 (2019). https://doi.org/10.1109/ICSIPA45851.2019.
8977783
Alternate Approach to GAN Model
for Colorization of Grayscale Images: Deeper
U-Net + GAN
Seunghyun Lee(B)
1 Introduction
1.1 Background
Image colorization is the task of taking a grayscale photo as input and producing, as output, the same photo with natural, appropriate colors. Because grayscale information alone does not determine which color is appropriate for a given region of the photo, we use color information from typical color photos to construct an algorithm that assigns plausible colors to the grayscale photo.
Due to the ambiguity of the problem, it has not been fully solved, nor is there a definitive approach to a solution. Beyond merely colorizing the given photo, the two crucial goals of this research are to 1) choose colors that look natural to any viewer, and 2) use a wide variety of colors to capture significant details. These goals make the problem even more challenging, as they must be achieved without clear criteria for judging the accuracy of the output.
Evaluating the algorithm's accuracy depends entirely on the satisfaction of human viewers, who use their experience to judge whether the colorization looks natural. Thus, although a method for assessing accuracy exists, it requires extensive effort, and its objectivity is difficult to guarantee [1].
1.2 Purpose
This research aims to construct a new image colorization program. Existing colorization programs rely on deep learning (illustrated in Fig. 1) as a means of exploiting large amounts of data, which opens up various models on which such a program can be built. This research therefore investigates which model provides an algorithm with shorter computation time and higher-resolution output.
Accuracy is another focus of this research. Existing colorization programs suffer from low accuracy on "outlier" photos. For example, a grayscale photo of an unusually colored flower, such as an orange one, tends to be colorized red or pink, since the algorithm mostly learns the typical colors of flowers.
In this research, we applied the typical methods (CNN-based U-net, autoencoder, and GAN) and compared them in terms of time efficiency and accuracy, revealing the limitations of currently available colorization algorithms. We then propose a new algorithm based on the original GAN model and compare its results with those of the original algorithm to measure the improvement. The rest of the paper proceeds as follows: first, the related work section introduces several previous studies on our topic. Then, the materials and methods section introduces the deep learning models and datasets used. The results section exhibits the experimental results, especially the photos generated by the proposed model. The discussion section covers the principal findings of our research, their real-life applications, and future research. Lastly, the conclusion section summarizes the sections mentioned above.
Fig. 1. The architecture of deep neural networks consists of three hidden layers
2 Related Work
Kiani et al. used the DIV2K dataset from the NTIRE image coloring challenge. The dataset includes 1000 color images of 2K resolution: 800 training images, 100 validation images, and 100 test images. For training, they used the VGG19 model, a 19-layer convolutional neural network (CNN) pre-trained on a large dataset. They presented their test results alongside the original photos to show the accuracy of the results visually. However, they did not provide a detailed analysis of their results' accuracy: although they define the PSNR index to represent accuracy numerically, they give no insight into the link between the index value and perceived accuracy, making the index hard to interpret on its own. Although the team presents their program as the second most accurate among programs using the DIV2K dataset, it still fails to maintain high accuracy on "outlier" photos whose true colors differ completely from the colorized image [2].
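The PSNR index mentioned above has a standard definition in terms of mean squared error; as an illustration (not the cited authors' code), a minimal numpy sketch for 8-bit images:

```python
import numpy as np

def psnr(original: np.ndarray, restored: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((original.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy example: a constant offset of 10 gray levels gives MSE = 100.
a = np.full((4, 4), 100, dtype=np.uint8)
b = np.full((4, 4), 110, dtype=np.uint8)
print(round(psnr(a, b), 2))  # 10*log10(255^2/100), about 28.13 dB
```

Higher PSNR means the restored image is numerically closer to the original, though, as noted above, the link to perceived colorization quality is indirect.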
An et al. used VGG-16 as a base model but trained mainly on ImageNet in order to learn a variety of colors. They provide a more detailed function expressing the accuracy of the result and use it as an objective standard for evaluating colorization algorithms, which they apply to compare their algorithm with other prominent ones. They conclude that the method of Zhang et al. yielded the greatest per-pixel RMSE [3].
Shankar et al. used Inception ResNet V2, a pre-trained model publicly released by Google, as the basis of their model, with a fusion layer to train the model further toward the required output. They measured the algorithm's performance in terms of its degree of error, defining an error function and applying it to find the optimal number of epochs and steps per epoch [4].
Varga et al. mainly used images from the SUN database, along with other images, with VGG-16 and a two-stage CNN as their model. They differ from other studies in that they established a two-stage CNN-based algorithm that predicts the U and V channels of the input. They relied on Quaternion Structural Similarity (QSSIM) as the basis for evaluating colorization accuracy, giving reasons for choosing QSSIM and a quantitative analysis using it. In our own work, three different experiments were carried out to compare models: the first uses an autoencoder-based convolutional model, while the second and third are a VGG16-based U-net and our proposed model, respectively [5].
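The two-stage scheme above predicts the U and V chrominance channels from the luminance input. That channel decomposition can be made concrete with a standard RGB-to-YUV conversion (BT.601 coefficients; an illustrative sketch, not Varga et al.'s code): Y is the grayscale input a colorization network sees, and U and V are what it must predict.

```python
import numpy as np

def rgb_to_yuv(rgb: np.ndarray) -> np.ndarray:
    """Convert an H x W x 3 RGB image (floats in [0, 1]) to YUV (BT.601)."""
    m = np.array([[ 0.299,    0.587,    0.114],
                  [-0.14713, -0.28886,  0.436],
                  [ 0.615,   -0.51499, -0.10001]])
    return rgb @ m.T

rgb = np.random.rand(2, 2, 3)
yuv = rgb_to_yuv(rgb)
y = yuv[..., 0]    # luminance: the grayscale input to the network
uv = yuv[..., 1:]  # chrominance: the two channels a colorization model predicts
print(y.shape, uv.shape)  # (2, 2) (2, 2, 2)
```

A neutral gray pixel maps to (near-)zero U and V, which is why a network that fails to predict chrominance produces the "nearly grayscale" outputs discussed later.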
to accurately estimate the colored version from a given grayscale image. The dataset is gathered from the Kaggle website and can be accessed via https://www.kaggle.com/theblackmamba31/landscape-image-colorization. Figure 2 shows sample images (color and grayscale) from this dataset [6].
Fig. 2. Examples of the Image Datasets Gathered from the Kaggle Website
3.2 CNN
3.3 Autoencoder
Unlike previous research, we propose a novel approach based on the GAN (Generative Adversarial Network). A GAN consists of a generator and a discriminator: the generator synthesizes data based on the probability distribution of the given data, and the discriminator decides whether the generated data is genuine or not. In previous research, the authors suggested a model with a similar architecture to the GAN, utilizing a U-net as the generator and a PatchGAN as the discriminator [12].
However, we propose a new model based on an asymmetric U-net architecture. The original U-net has a symmetric architecture similar to an autoencoder, but we constructed a deeper model than in previous research in order to attain better performance. For the encoder part of the U-net, we added a 1x1 convolution layer before every encoding layer. The 1x1 convolution layer is mainly used to reduce computation and is a principal building block of GoogLeNet, presented by Google [13]. Through this approach, we could also add non-linearity to our encoder layers with activation functions: the rectified linear unit (ReLU) and the leaky ReLU. The ReLU is defined as f(x) = 0 for x < 0 and f(x) = x for x >= 0, while the leaky ReLU is f(x) = 0.01x for x < 0 and f(x) = x for x >= 0. The main difference between them is that the leaky ReLU keeps a small nonzero output for negative inputs instead of clamping them to zero. These activation functions are depicted in Fig. 5.
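The two activation functions can be written directly from the definitions above; a minimal numpy sketch:

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """ReLU: f(x) = 0 for x < 0, x for x >= 0."""
    return np.maximum(0.0, x)

def leaky_relu(x: np.ndarray, slope: float = 0.01) -> np.ndarray:
    """Leaky ReLU: f(x) = slope*x for x < 0, x for x >= 0.
    Keeps a small nonzero output (and gradient) for negative inputs."""
    return np.where(x < 0, slope * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [ 0.     0.     0.     1.5  ]
print(leaky_relu(x))  # [-0.02  -0.005  0.     1.5  ]
```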
Combined with the asymmetric architecture, this let us construct a deeper model while reducing the total number of parameters, which efficiently decreases computation time and memory use. Previous approaches used a lot of both GPU and RAM memory, making them difficult to run in relatively constrained environments such as Colab; with such high memory usage, applying colorization techniques in a mobile environment would also be difficult.
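The parameter savings from inserting a 1x1 reduction layer can be illustrated with a back-of-the-envelope count (the channel widths below are hypothetical, not the paper's actual layer sizes):

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weight count of a k x k convolution, ignoring biases."""
    return k * k * c_in * c_out

# Direct 3x3 convolution, 256 -> 256 channels:
direct = conv_params(3, 256, 256)  # 589,824 weights

# Same output width, but with a 1x1 reduction to 64 channels first,
# in the spirit of the GoogLeNet-style bottleneck:
bottleneck = conv_params(1, 256, 64) + conv_params(3, 64, 256)  # 163,840 weights

print(direct, bottleneck, round(direct / bottleneck, 2))  # ~3.6x fewer weights
```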
Our proposed model consists of two main parts, a generator and a discriminator, much like other GAN models. The generator uses the asymmetric U-net architecture, whose encoder matches the VGG11 architecture, and a PatchGAN is used as the discriminator. Because the PatchGAN judges the authenticity of the generated image per patch, it is far more effective for colorization than the original GAN discriminator. The overall process is shown in Fig. 6.
Fig. 6. The overall architecture of the proposed model: A deeper VGG 11 for the generator and
patch GAN for the discriminator: The generator produces fake images, and the discriminator
decides whether the input images are fake or not.
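The PatchGAN idea of judging real vs. fake per local patch, rather than emitting a single scalar for the whole image, can be sketched as follows; the mean-intensity scorer is a toy stand-in for the real convolutional discriminator:

```python
import numpy as np

def patch_decisions(img: np.ndarray, patch: int, score) -> np.ndarray:
    """Apply a real/fake scoring function to each non-overlapping patch,
    returning an (H//patch) x (W//patch) decision map instead of one scalar."""
    h, w = img.shape[:2]
    out = np.empty((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            out[i, j] = score(img[i*patch:(i+1)*patch, j*patch:(j+1)*patch])
    return out

# Toy scorer: mean intensity stands in for the discriminator's logit.
img = np.random.rand(64, 64)
dmap = patch_decisions(img, 16, np.mean)
print(dmap.shape)  # (4, 4) -- one decision per 16x16 patch
```

Penalizing each patch separately is what pushes the generator toward locally plausible color, rather than one globally averaged judgment.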
4 Result
The figures below (Fig. 7) present predicted images produced by this algorithm. These images are based on the same test cases used for our improved algorithm, and the results are a clear failure: the outputs are nearly grayscale, and it is questionable whether they have been colorized at all. In the section below, we reexamine these test cases and show how the results improve with our newly constructed algorithm.
The figures below (Fig. 8) also show predicted images produced by this model; note that they use the same test cases as our improved model below. Comparing the results, we see a clear distinction in the first and third test cases: results from this algorithm are overall monotone in color and distinct from the actual color image, while the improved algorithm provides a more vivid colorization of the grayscale images. The second test case turns out to be similar for both algorithms, but the improved algorithm still provides a clearer colorization than the results above.
The figures below (Fig. 9) present three test cases, comparing each grayscale image with its true color image and the predicted image produced by the algorithm. All three results are evidently similar to the actual color images in overall color. Some detailed parts differ in color, but notably, distinct objects such as the sky, the computer, and the desk have nearly accurate colors. Also, comparing these test cases against the two previously examined algorithms, the results of this algorithm are clearly more accurate. Although a huge dataset was used to evaluate our model, we chose these three pictures because we believed they were especially hard to colorize. For instance, the photo of the pizza and the third image contain diverse colors, with complicated boundaries between objects, and for the second image the gradation of the sky was difficult to depict; for these reasons, the three images were chosen as representative of our results. In addition to colorization performance, the proposed model was much better than the other models in terms of time required: training the other models in the same environment took almost an hour, while the proposed model took less than 20 min.
5 Discussion
5.1 Principal Finding
In this research, we examined the typical methods used by colorization algorithms: the autoencoder and the CNN-based U-net. We tested these algorithms on sample data, comparing the true color images with the predicted images produced by each algorithm, and found that the conventional algorithms lacked accuracy in the majority of the test cases.
This research also proposed a new model that improves on the traditional GAN model. This method produced better predicted images than the previously examined algorithms, and we conclude that the GAN model can be made more accurate with less memory usage (RAM and GPU) and computation time.
find further resources within the photos. It can also let viewers capture the scene more
realistically.
6 Conclusion
This research proposed an alternative colorization algorithm developed from the conventional GAN model. We reformed the generator of the original GAN, which was based on a U-net model with a VGG11 encoder, by adding 1x1 convolution layers to the symmetric U-net model. This let us create a deeper model and use nonlinear activation functions such as ReLU. As a result, the number of parameters in the algorithm was reduced and the computation time was shortened. By comparing the results of the original and reformed algorithms, we also confirmed that our algorithm had greater accuracy in terms of the plausibility of the colorization.
References
1. Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 649–666. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_40
2. Kiani, L., Saeed, M., Nezamabadi-Pour, H.: Image colorization using a deep transfer learning. In: 2020 8th Iranian Joint Congress on Fuzzy and Intelligent Systems (CFIS) (2020)
3. An, J., Gagnon, K.K., Shi, Q., Xie, H., Cao, R.: Image colorization with convolutional neural networks. In: 2019 12th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) (2019)
4. Shankar, R., Mahesh, G., Murthy, K.V.S.S., Ravibabu, D.: A novel approach for gray scale image colorization using convolutional neural networks. In: 2020 International Conference on System, Computation, Automation and Networking (ICSCAN) (2020)
5. Varga, D., Sziranyi, T.: Fully automatic image colorization based on convolutional neural network. In: 2016 23rd International Conference on Pattern Recognition (ICPR) (2016)
6. Kaggle: https://www.kaggle.com/theblackmamba31/landscape-image-colorization. Accessed 23 Jan 2022
7. Joo, H., Choi, H., Yun, C., Cheon, M.: Efficient network traffic classification and visualizing abnormal part via hybrid deep learning approach: Xception + Bidirectional GRU. Global Journal of Computer Science and Technology, pp. 1–10 (2022)
8. Albawi, S., Mohammed, T.A., Al-Zawi, S.: Understanding of a convolutional neural network. In: 2017 International Conference on Engineering and Technology (ICET) (2017)
9. Badino, L., Canevari, C., Fadiga, L., Metta, G.: An auto-encoder based approach to unsupervised learning of subword units. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (2014)
10. Ye, M., Ma, A.J., Zheng, L., Li, J., Yuen, P.C.: Dynamic label graph matching for unsupervised video re-identification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5142–5150 (2017)
11. Seyfioglu, M.S., Ozbayoglu, A.M., Gurbuz, S.Z.: Deep convolutional autoencoder for radar-based classification of similar aided and unaided human activities. IEEE Trans. Aerosp. Electron. Syst. 54(4), 1709–1723 (2018)
12. Ren, H., Li, J., Gao, N.: Two-stage sketch colorization with color parsing. IEEE Access 8, 44599–44610 (2020)
13. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
Implementing Style Transfer with Korean
Artworks via VGG16: For Introducing Shin
Saimdang and Hongdo KIM’S Paintings
Jeanne Suh(B)
Abstract. By introducing genre painting to the artists of his time, Kim Hongdo is known for opening the Renaissance era of Joseon's art history. Not only did he introduce new genres of art, but he also combined delicate techniques with his own unique style, completing over 130 art pieces during his lifetime. Shin Saimdang is another prominent artist of the Joseon Dynasty: despite the Confucian beliefs that constrained women at the time, she introduced her meticulous art style to the public and received acknowledgment from many officials. This research aimed to build a deep learning-based algorithm that recreates original photos in Kim Hongdo's and Shin Saimdang's art styles. Unlike previous research, which used Western paintings as the target of style transfer, this paper uses traditional Korean artwork; this difference is what makes the research meaningful. Furthermore, the paper proposes a method based on the VGG16 model in order to reduce computation time compared to the VGG19 model. The model applied style transfer to five original photos and produced successful results, capable of introducing Kim Hongdo's and Shin Saimdang's art styles and techniques to the general public. The brush strokes and color themes of each artist are successfully recreated in the new images, while, despite such drastic changes, the overall structure of each original photo is well maintained. The five examples can serve as a helpful guide to a better understanding of Kim Hongdo's and Shin Saimdang's art styles, and further of Joseon's art history as a whole.
1 Introduction
1.1 Background
Kim Hongdo, the exemplar of Joseon art history, is still considered a phenomenal artist who greatly elevated the style of art throughout Korean art history. Hongdo was born in 1745, the only son of a poor family [1]. Unlike most artists of the time, Hongdo was not from an artist family, so his early exposure to art came mostly from his teacher Gang Se-hwang, a government official also known as an art critic and part-time artist. On Gang Se-hwang's recommendation, Kim Hongdo began working under Jeongjo, the 22nd ruler of the Joseon dynasty, from his mid-twenties. Hongdo earned high credibility with Jeongjo, and within a few years he was entrusted with several confidential errands for the country, such as traveling to Tsushima island to draw a map of Japan on his own [2]. He also drew many portraits of members of the royal family. Although Hongdo was a versatile, talented artist capable of diverse genres such as landscapes and portraits, the reason he is considered a crucial influence on artists of the 18th and 19th centuries is his paintings depicting the lives of ordinary people, a style known as genre painting. Twenty-five of Hongdo's genre paintings were compiled into an art book called "Dan won Pung sok do chub." One famous painting from the book is "Seodang," shown in Fig. 1. At first glance, "Seodang" reads from the center outward: the focal point first rests on the little boy weeping in sadness, then opens up to the surrounding view of the other boys and the teacher irresistibly laughing at the crying boy. The rough, simple strokes and the low-saturation gouache colors are also notable. The minor details of Hongdo's painting arouse the viewer's curiosity. For instance, Hongdo intentionally hid the face of the boy in the lowest part of the painting while exaggerating the wrinkles of his clothing, making viewers wonder whether the boy is laughing together with his classmates or trembling in fear that he will be the next to be scolded. Such minor details add a comical element to the painting. The exact year of Hongdo's death remains a mystery, although some historians place it around 1806 [3]. What is certain is that Hongdo had a greater influence than any other artist in Korean art history.
One of the major branches of Confucianism is Neo-Confucianism, known as "Xing li xue" in traditional Chinese, which was the cardinal national ideology of the Joseon period. Neo-Confucianism, and Confucianism in general, upheld "Nam jon yeo bi," a discriminatory idea that men were superior to women [4]. The Joseon period was therefore not a favorable time for women to show the world their talent. Shin Saimdang is the only female Joseon artist left on record. She was born into a very wealthy and prestigious family; thanks to the Shin family's great wealth, she had more opportunities than most girls of her age to devote herself to different studies. From a very young age she studied poetry, drawing, and calligraphy, at which she was very talented. Shin Saimdang was most famous for her floricultural drawings: drawings of nature, especially flowers. Figure 2 shows one of her famous paintings, "Gaji and Banga kkaebi." Her drawings were aesthetically eye-catching: feminine and poetic, with very meticulous and detailed brushstrokes [5]. They were therefore acknowledged by many famous scholars of the time, most notably Sukjong, king of the Joseon dynasty from 1674 to 1720. Records of her married life clearly show that Shin Saimdang was not the passive, obedient type of woman that Confucianism valued. Today, however, she is better known as the "exemplar mother" of Yulgok Yi i, a famous philosopher. Despite her talent in drawing, it was overshadowed by the Confucianist discrimination against women that pervaded the Joseon period [6].
1.2 Purpose
Throughout art history, Western and Eastern artists built different artistic styles depending on their cultures' views of aesthetics. These disparities were not made suddenly in one day; they are the product of centuries of cultural separation between West and East. Unlike the European artists of past centuries, who are now very well known to the public, many Eastern artists have yet to be acknowledged by a wider audience. Through this project, the art styles and techniques of Kim Hongdo can be easily approached and understood by a wider public through the various examples produced by this algorithm.
There are many artists throughout Korean art history, but Kim Hongdo and Shin Saimdang are cardinal examples, necessary for understanding Korean art history. Hongdo was so influential among the artists of his time that Joseon's art can be divided into the eras before and after him. He opened the era of genre painting in the late 18th century of the Joseon dynasty [7]. Unlike other artists of his period, Hongdo excelled at capturing a specific scene and rendering it on his canvas with high accuracy and delicacy. He would also omit the background in order to focus attention on the scene he depicted. Furthermore, his paintings did not follow the traditional Joseon style of combining multiple points of view into one picture; instead, Hongdo followed the camera-like direct angle of view and used that single point of view in his paintings [8].
Shin Saimdang was a competent figure who succeeded in leaving her name in history despite the limits she faced as a woman in the Joseon period. Although her name is better known as that of a wife and mother, it is undeniable that she was a phenomenal painter. In recognition of her virtues and talents, a portrait of Shin Saimdang appears on the front of the Korean 50,000 won banknote.
In conclusion, Hongdo used simple methods and drawing styles to give depth to his paintings and devised techniques that other artists of his time failed to come up with, while Shin Saimdang drew beautiful pictures of nature and was known for her meticulous style. By applying a deep learning algorithm, original photos are recreated in Hongdo's and Shin's art styles, which can help introduce their paintings efficiently. The rest of the paper consists of the following sections: related work, which introduces previous research; materials and methods, covering the deep learning algorithms and the proposed model; results; discussion; and conclusion.
2 Related Work
Even though Chinese painting is quite popular in Asian countries, neural style transfer research on Chinese painting is scarce. The authors of [9] therefore proposed a modified extended difference-of-Gaussians (MXDoG) filter to develop a novel neural transfer model for Chinese-style paintings. They begin with the MXDoG filter and then combine the MXDoG operation with three new loss values for the training process. The proposed model consists of two networks: a generative network and a loss network. For the loss network, VGG16, a pre-trained CNN model, was used, and an MXDoG loss was used to evaluate the loss function of the proposed network. Chinese paintings were collected from various search platforms, including Google, Baidu, and Bing. Their experimental results reveal that the proposed technique yields more attractive stylized outcomes when transferring the style of traditional Chinese painting, compared to state-of-the-art neural style transfer models [9].
Zhao et al. also researched neural style transfer but focused mainly on a novel loss function combining global and local losses in order to enhance the quality of the proposed model. They constructed a deep learning model based on the VGG19 network, choosing layers 'relu1_2', 'relu2_2', 'relu3_3', and 'relu4_2' for the global loss and 'relu_3' for the local loss. The global loss part gathers more global information from the given image, while the local loss captures local information. With these novel methods, the authors successfully reduced artifacts and transferred the style onto the content image while preserving the basic structure and color of the given image [10].
Gatys et al. investigated methods to overcome a drawback of conventional neural
style transfer: such models can duplicate the color distribution of the style image.
The authors therefore proposed two novel color-preserving style transfer methods.
The first is color histogram matching, which modifies the colors of the style image
to match the colors of the content image. The second is luminance-only transfer, in
which luminance channels are extracted from both the style and content images to generate
a luminance image. Their experiments showed that the second method better preserves
the colors, even though the relationship between the luminance and color channels
no longer exists in the output image of the model [11].
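The luminance-extraction step of the second method can be sketched in a few lines. The ITU-R BT.601 weights below are an illustrative assumption; [11] works in a luminance/chrominance color space such as YIQ:

```python
import numpy as np

def luminance(rgb):
    """Extract a luminance channel from an H x W x 3 RGB image in [0, 1].

    Uses ITU-R BT.601 weights (an assumption for illustration; the exact
    color space used in [11] may differ).
    """
    weights = np.array([0.299, 0.587, 0.114])
    return rgb @ weights

# Toy 1x1 "image": pure white has luminance 1.0
white = np.ones((1, 1, 3))
print(luminance(white))  # [[1.]]
```

In the full method, style transfer is then run on the luminance images only, so the color channels of the content image can be kept.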
Gupta et al. compared neural style transfer across various CNN-based deep learning
algorithms. Pre-trained CNN models were utilized via the Keras API, and VGG16, VGG19,
ResNet50, and InceptionV3 were used for the comparison. Gram matrices were utilized to
compute the loss of the proposed models. The results showed that InceptionV3 and
ResNet50 were not suitable for style transfer; in particular, the InceptionV3 model
produced only a black image. Furthermore, the performance of VGG19 was greater than that
of VGG16, since VGG19 consists of more layers. However, the training speed of VGG19
is slower than that of VGG16, so each model has its own pros and cons [12].
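The Gram-matrix style loss mentioned above can be sketched with plain numpy; the feature-map shapes below are hypothetical stand-ins for the activations of a pre-trained CNN layer:

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a feature map of shape (H, W, C).

    Entry (i, j) is the correlation between channels i and j, which
    captures style (textures, strokes) rather than spatial content.
    """
    h, w, c = features.shape
    flat = features.reshape(h * w, c)
    return flat.T @ flat / (h * w)

def style_loss(gen_feats, style_feats):
    """Mean squared difference between the two Gram matrices."""
    return float(np.mean((gram_matrix(gen_feats) - gram_matrix(style_feats)) ** 2))

feats = np.random.rand(8, 8, 3)  # hypothetical 8x8x3 activation map
assert style_loss(feats, feats) == 0.0  # identical features -> zero style loss
```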
“ImageNet”. As the model was pre-trained with this dataset beforehand, the weights learned
during that training are preserved, which enables better performance when testing on
the user’s dataset. The input size of VGG16 is fixed at 224×224, the stride of the
convolution is fixed to 1, and padding is applied during the operation for precise
analysis. As for pooling, max pooling is applied after the convolutional layers,
and a total of five max-pooling layers are used. These components of VGG16 can be found
in Fig. 3. The default layers of VGG16 can be easily loaded through the Keras API.
Therefore, when applying the model to their own dataset, users can combine it with various
deep learning models, including the deep neural network (DNN), long short-term memory
(LSTM), recurrent neural network (RNN), and gated recurrent unit (GRU) [13].
Style transfer refers to rendering a content image in the manner of a style image when
both are given. To this end, the two images, VGG19, and a generator model that creates
images are used. The purpose is to correctly blend the features collected as the content
and style images pass through the VGG19 model, thereby combining the content and the style.
From the style image, features are extracted from almost all layers, while from the
content image, features are extracted from the fourth layer
and passed to the generator network to create an image. Extracting features from an image
with a CNN proceeds as follows: in the shallow layers, monotonous patterns are repeated,
but in the deeper layers, the features from the shallow layers become more complex and
enlarged. Therefore, the CNN can efficiently extract the features from the style image
and apply them to the content image. Previous neural style transfer algorithms had a
critical shortcoming: whenever a new content image came in, the model had to be trained
again, which made them quite time-consuming. However, a VGG19-based model can overcome
this problem by proceeding with a feed-forward process [14].
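The objective described above can be sketched in numpy: style is matched at every layer via Gram matrices, while content is matched only at the fourth layer. The content-layer index and style weight are illustrative assumptions, and the feature lists stand in for a VGG19 forward pass:

```python
import numpy as np

def gram(f):
    """Gram matrix of an (H, W, C) feature map."""
    h, w, c = f.shape
    flat = f.reshape(h * w, c)
    return flat.T @ flat / (h * w)

def transfer_loss(gen_feats, content_feats, style_feats,
                  content_layer=3, style_weight=1e-2):
    """gen_feats, content_feats, style_feats: lists of (H, W, C) feature
    maps, one per layer, as produced by a forward pass through VGG19.
    Style is matched at every layer via Gram matrices; content only at
    `content_layer` (the fourth layer, as described above).
    style_weight is a hypothetical blending weight."""
    c_loss = np.mean((gen_feats[content_layer] - content_feats[content_layer]) ** 2)
    s_loss = sum(np.mean((gram(g) - gram(s)) ** 2)
                 for g, s in zip(gen_feats, style_feats))
    return float(c_loss + style_weight * s_loss)

# Toy check with random "feature maps" from five layers
rng = np.random.default_rng(0)
feats = [rng.random((4, 4, 2)) for _ in range(5)]
assert transfer_loss(feats, feats, feats) == 0.0
```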
4 Result
4.1 Results of Hongdo Kim’s Paintings
In order to implement the VGG16-based style transfer model, “Seodang” was selected
as the style image and five different pictures as the content images. During the execution,
the total loss, style loss, and content loss were calculated, yielding 1.9979e+05,
9.9740e+04, and 1.0005e+05, respectively.
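The reported total is simply the sum of the style and content terms, i.e., unit weights on both (an observation from the reported numbers, not a statement of the exact training configuration):

```python
# Loss values as reported in the experiment
style_loss = 9.9740e+04
content_loss = 1.0005e+05

# Unit weights on both terms reproduce the reported total exactly
total_loss = style_loss + content_loss
print(total_loss)  # 199790.0, i.e. 1.9979e+05 as reported
```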
Overall, the model succeeded in copying the general characteristics noticeable in
Hongdo’s paintings, including the rough, simple, diluted black brush strokes, as well
as the yellowish, low-saturation opaque colors. Hongdo’s paintings are mostly drawn in
a yellowish drab, which caused the model to render colors opposite to yellow, such as
blue, with less precision. Although the model struggled with details that Hongdo’s
paintings do not provide, it succeeded in recreating a good overview of Hongdo’s art
style and techniques; the results can be found in Figs. 4, 5, 6, 7 and 8.
The training epoch count for the model was set to 500, and at epochs 25, 75, 395, and 495
the model yielded the transferred output; it can be concluded that as the number of epochs
increases, the performance of the model is also enhanced. Furthermore, the figures below
show that the minimum epoch count should be higher than 75, since the picture at epoch 75
does not exhibit the desired output. The experiment was conducted on Colab Pro with a
Tesla P100 GPU, and it took only 125 s to complete the execution. This fast computation
is mainly due to the proposed model being based on VGG16. The figures below depict the
output of the model at different epochs (Figs. 9, 10, 11, 12).
72 J. Suh
Fig. 4. Style transfer results from the proposed model (Data #1)
Fig. 5. Style transfer results from the proposed model (data #2)
Fig. 6. Style transfer results from the proposed model (data #3)
Fig. 7. Style transfer results from the proposed model (data #4)
Fig. 8. Style transfer results from the proposed model (data #5)
Fig. 9. Style transfer results from the proposed model (epoch: 25)
Fig. 10. Style transfer results from the proposed model (epoch: 75)
Fig. 11. Style transfer results from the proposed model (epoch: 395)
Fig. 12. Style transfer results from the proposed model (epoch: 495)
Furthermore, the colors are not diluted but solid and clear to the eye. As shown in
Figs. 13, 14, 15, 16 and 17, the VGG16 model successfully captured these attributes of
Shin Saimdang’s paintings in terms of color. As a result, the style transfer images use
a more vivid and bright color palette compared to the original content images.
Fig. 13. Style transfer results from the proposed model (data #1)
Fig. 14. Style transfer results from the proposed model (data #2)
Fig. 15. Style transfer results from the proposed model (data #3)
Fig. 16. Style transfer results from the proposed model (data #4)
Fig. 17. Style transfer results from the proposed model (data #5)
5 Conclusion
5.1 Discussion
One factor that differentiates this model from other analogous research is that the
algorithm specifically recreates pieces by Korean artists of the Joseon period. Similar
research tends to recreate the art styles of prominent European artists, such as Vincent
van Gogh, a representative Post-Impressionist of the 19th century. This research can
raise awareness of ethnic artists; here, Kim Hongdo, the artist who opened the
Renaissance period of Joseon dynasty art, was chosen. The algorithm successfully copied
Hongdo’s art style in terms of color and composition. There were certain limits in
recreating cerulean colors and the backgrounds of the photos, such as the sky. This is
because Hongdo did not draw backgrounds in any of his genre paintings and also did not
use many cerulean colors. The piece used to recreate the photos, “Seodang,” has neither
a background nor cerulean colors. Despite these limitations, the algorithm is unique in
that it promotes a traditional Korean artist by recreating his art style, suggesting an
easier approach for the public.
In terms of Shin Saimdang’s artwork, Shin’s usage of vivid and bright colors was
successfully implemented in the resulting images. Like Hongdo, Shin did not draw
backgrounds in her paintings, nor did she use cerulean colors. This led to results
similar to the Hongdo-styled pictures above: a lack of representation of blue colors
and backgrounds. However, Shin Saimdang did use a wider range of colors, such as green,
so a glimpse of blueish colors is noticeable in the style transfer images. Another
limitation is that Shin’s meticulous, detailed style of drawing is reproduced in a
rather disappointing manner. To overcome this shortcoming, further research will focus
on constructing a pre-trained model with a large number of Korean paintings, which could
enhance the overall quality of the style transfer results, especially in reproducing the
meticulous brush stroke style of Shin Saimdang. Furthermore, in future research we
intend to construct a virtual museum in the metaverse to introduce Korean traditional
paintings efficiently.
5.2 Summary/Restatement
The deep-learning-based algorithm captured the focal characteristics of Hongdo’s and
Shin’s paintings and applied them to the original photos. Furthermore, this model
successfully provides an opportunity for the public to acknowledge and appreciate the
art styles of Kim Hongdo and Shin Saimdang through the different examples processed by
the algorithm. The original photos rendered in Hongdo’s style were recreated with high
accuracy, replacing the rigid lines of the original photos with opaque and curvy brush
strokes. The overall color theme was also changed, using yellowish gouaches throughout
the painting. The model was successful in recreating a new color palette for each
picture based on Shin Saimdang’s usage of vivid colors; the resulting images utilized a
relatively larger color palette, using colors such as violet, pink, and cerulean. This
model can serve as a new guideline for those who want to understand Hongdo’s and Shin’s
art styles, as well as the art history of the Joseon period, since Kim Hongdo and Shin
Saimdang were two of the most influential artists of the Joseon dynasty.
References
1. Lee, K.H.: The painting in our times: Kang Sehwang’s criticism of art and the genre
paintings of Kim Hongdo. Art History and Visual Culture 15(0), 28–61 (2015)
2. Newsis: https://www.chosun.com/site/data/html_dir/2020/05/27/2020052704647.html
Accessed 26 Oct 2021
3. Cho, J.Y.: Samgong bulhwando (三公不換圖) by Gim Hongdo: changes in Gim Hongdo’s
paintings after 1800 and his relationship with Hong Uiyeong. J. Art History Association
of Korea 275/276, 149–175 (2012)
4. Koreatimes. https://www.koreatimes.co.kr/www/art/2017/03/691_225097.html Accessed 14
Jan 2022
5. Google Arts & Culture. https://artsandculture.google.com/story/animals-and-plants-in-
korean-traditional-paintings-i-plants-and-insects-national-museum-of-korea/mgXBU0JJW
G92Lg?hl=en Accessed 14 Jan 2022
6. K-Paper. https://k-paper.com/en/magazine_k_no1_sinsaimdang/?v=38dd815e66db
Accessed 4 Jan 2022
7. Kim, M.H., Chung, H.K.: A study on the characteristic of the tableware pottery and the food
culture for genre painting in the 18th Chosun period-focused on the works of Dan-won Kim
Hong-do. J. Korean Society Food Culture 22(6), 653–664 (2007)
8. KyungHyang Shinmun. http://m.khan.co.kr/amp/view.html?art_id=200408181759431
Accessed 6 Nov 2021
9. Li, B., Xiong, C., Wu, T., Zhou, Y., Zhang, L., Chu, R.: Neural abstract style transfer for
chinese traditional painting. In: Asian Conference on Computer Vision, pp. 212–227 (2018)
10. Zhao, H.H., Rosin, P.L., Lai, Y.K., Lin, M.G., Liu, Q.Y.: Image neural style transfer with
global and local optimization fusion. IEEE Access 7, 85573–85580 (2019)
11. Gatys, L.A., Bethge, M., Hertzmann, A., Shechtman, E.: Preserving color in neural artistic
style transfer. arXiv preprint arXiv:1606.05897 (2016)
12. Gupta, V., Sadana, R., Moudgil, S.: Image style transfer using convolutional neural networks
based on transfer learning. Int. J. Computational Systems Eng. 5(1), 53–60 (2019)
13. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556 (2014)
14. Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural net-
works. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 2414–2423 (2016)
Feature Extraction and Nuclei
Classification in Tissue Samples
of Colorectal Cancer
1 Introduction
Humans are able to perceive the three-dimensional (3D) nature of a two-dimensional (2D)
picture that depicts the 3D environment around us. Furthermore, we are able to recognize
diverse objects, and we can even overcome some optical illusions that may hinder our
visual understanding. In other words, our eyes are attracted by many visual cues, but
the brain focuses on only a small part of this visual information for interpretation
purposes [4].
Computer/machine vision is a multi-disciplinary research field that aims at making a
machine capable of understanding the contents of images taken by a variety of sensors.
In particular, researchers in this field hope to extract and interpret more information
from images than we, humans, can perceive [2]. Furthermore, this has to be practical
enough to be used
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 79–99, 2023.
https://doi.org/10.1007/978-3-031-18461-1_6
80 B. Boufama et al.
in real-life applications. Such systems should mix both the science (theory) and the
engineering aspects of vision. Examples of research topics in machine vision include
motion estimation, autonomous navigation, 3D reconstruction, object/pattern recognition,
and augmented reality (AR). Given that an image is a simplified representation of the
real object, machine vision tasks are inherently challenging and sensitive to a
multitude of factors. These challenges include noise, lighting changes, resolution,
occlusions, deformation of non-rigid objects, and intra-class variations [3].
This paper aims to contribute computer vision methods for computer-aided diagnosis
(CAD) and is inspired by [1]. As is common in this field, we follow a scheme based on
the extraction and combination of features obtained from pre-processed images. These
features are then utilised to create models for the extraction of hidden information,
making it possible to reach valuable outcomes in performing CAD tasks.
Firstly, we address the problem statement in Subsect. 1.1, connecting it to a CADx
contribution provided by machine learning. We then explain supervised machine learning
techniques for feature-based nuclei classification in Subsect. 1.2.
The WHO (World Health Organization) states that cancer is the third cause of death in
the world and the second cause in developed countries, right after cardiovascular
diseases. According to GLOBOCAN (an online database for global cancer statistics), over
16 million Americans have or have had cancer. Factoring in population growth and aging,
this number is poised to exceed 22 million by the year 2030. This report is published
every three years by the American Cancer Society and the National Cancer Institute [12].
Current statistics imply that 1 out of 5 males and 1 out of 6 females will develop
cancer by the age of 75. In particular, cancer will eventually kill 1 out of 8 males and
1 out of 12 females [15]. The most alarming fact is that the developing countries
accounted for 57% of the new cases and 65% of the total number of deaths.
Among men, three cancers stand out: prostate cancer (over 3.5 million), colon/rectal
cancer (over 750k), and skin melanoma (close to 700k). The three dominant cancers in the
female population are, in order: breast cancer (close to 4 million), endometrial or
uterine body cancer (over 800k), and colon/rectal cancer (over 750k). According to [12],
it is estimated that there will be over 22 million cancer survivors by the year 2030.
According to [15], it is estimated that 57% (roughly 8 million) of all new cancer cases,
65% (roughly 5 million) of cancer deaths, and 48% (roughly 15 million) of living cancer
patients will occur in the developing countries.
In machine learning, sample data from past experiences are used to program computers to
find optimal solutions to new instances of the same problem [7].
Feature Extraction and Nuclei Classification 81
Fig. 3. Sample image of epithelium and stroma tumours (image acquired from http://
fimm.webmicroscope.net/supplements/epistroma [6])
Fig. 4. H&E and IHC stained images (Image Acquired from [5])
Fig. 5. The workflow diagram for the acquisition of histopathology pictures [14].
Because the above steps involve humans, there is a risk of variability resulting in
visual heterogeneity. At least three factors are worth mentioning here [14]:
(i) magnification: it depends on how the human operator adjusts the lenses of the
microscope; (ii) staining: it is meant to enhance the contrast of the sample;
(iii) slice orientation: even with the same staining, the slice orientation is affected
by how the cut is done (cross-section vs. longitudinal) and, consequently, the
appearance of the tissue varies.
A typical workflow for acquiring a histopathology image is shown in Fig. 5 [14].
3 Methodology
The classification model has been applied to two separate data sets. The performance
and robustness of the proposed model on these data sets are compared and discussed in
Sect. 4.3.
The two data sets we have used are the following.
– Data set I: This data set contains eight classes (described in Sect. 3.1). It
consists of 5000 rows and 161 visual features: 15 color features and around
150 morphological features.
– Data set II: It consists of 165 images obtained from 16 H&E-stained histological
sections of colorectal adenocarcinoma at stage T3 or T4. They are labeled as “Cancer”
or “No Cancer”, depending on their overall glandular architecture.
Data Set II. This data set was obtained from the database of the “GlaS MICCAI’2015:
Gland Segmentation Challenge Contest”. It contains a total of 165 images, obtained from
16 H&E-stained histological sections at stages T3 or T4 of colorectal adenocarcinoma.
Note that the most common colon cancer is colorectal adenocarcinoma, which originates in
intestinal glandular structures. Pathologists use the morphology of intestinal glands,
such as glandular formation and architectural appearance, to establish a prognosis and
to choose individualized treatments for patients [13].
Every section comes from a separate patient, as the sections were obtained on different
dates. The digitization of histological sections to create a whole slide image
(WSI) was done with the Zeiss MIRAX MIDI slide scanner, with a pixel resolution of
0.465 µm. The WSIs were rescaled to a pixel resolution of 0.620 µm (similar to an
objective magnification of 20X).
In order to cover the wide range of tissue variety, 52 visual fields of
benign/malignant regions across the whole set of WSIs were chosen. An expert
pathologist (DRJS) then classified every visual field as “malignant” or “benign”, using
the general glandular architecture. In addition, the boundaries of each individual
glandular region were delineated by the pathologist (see Table 2).
This phase extracts features that are either morphological or color-based. Table 3
describes these features, which are important for the model.
– Color Features
We use level-1 statistics, i.e., the median, mean, skewness, kurtosis, standard
deviation and color histogram, to characterize the color features. These statistics are
calculated for different color spaces, namely RGB, HSV, YUV and grayscale. Figure 7
describes this process of extracting the information for RGB (the same kind of
information is extracted from the other channels, such as YUV and HSV).
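A minimal numpy sketch of these per-channel statistics (the 16-bin histogram is an illustrative assumption; the paper does not state the bin count):

```python
import numpy as np

def channel_stats(channel):
    """Level-1 statistics for a single color channel."""
    x = channel.astype(float).ravel()
    mean, std = x.mean(), x.std()
    z = (x - mean) / std  # standardized values (assumes std > 0)
    return {
        "mean": mean,
        "median": float(np.median(x)),
        "std": std,
        "skewness": float(np.mean(z ** 3)),
        "kurtosis": float(np.mean(z ** 4) - 3.0),  # excess kurtosis
        "histogram": np.histogram(x, bins=16, range=(0, 256))[0],
    }

def rgb_color_features(img):
    """img: H x W x 3 uint8 image; per-channel statistics for R, G, B."""
    return {name: channel_stats(img[..., i]) for i, name in enumerate("RGB")}

img = (np.arange(48) % 256).reshape(4, 4, 3).astype(np.uint8)
features = rgb_color_features(img)
print(sorted(features))  # ['B', 'G', 'R']
```

The same routine would be applied again after converting the image to HSV, YUV, and grayscale.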
– Morphological Features
The extraction of morphological features, on the other hand, is done via nuclei
segmentation, which is performed in the hematoxylin and eosin color space. The
hematoxylin color component is extracted, and the watershed algorithm is used to detect
nuclei, as illustrated in Fig. 8. Furthermore, the nuclei must be segmented before the
above-mentioned statistical quantities can be estimated. Finally, the main features,
such as eccentricity, circularity, solidity, area, and perimeter, are obtained.
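A simplified sketch of how such shape descriptors can be computed from a binary nucleus mask, assuming the watershed step above has already produced the mask; solidity and eccentricity, which need a convex hull and image moments, are omitted here:

```python
import numpy as np

def morphological_features(mask):
    """Simple shape descriptors for a 2-D binary (0/1) nucleus mask.

    Perimeter is approximated by counting foreground pixels that touch
    at least one background 4-neighbour; circularity = 4*pi*A / P^2
    (close to 1 for compact, round shapes).
    """
    mask = mask.astype(bool)
    area = int(mask.sum())
    padded = np.pad(mask, 1)
    # A pixel is interior if all four 4-neighbours are also foreground
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = int((mask & ~interior).sum())
    circularity = 4 * np.pi * area / perimeter ** 2 if perimeter else 0.0
    return {"area": area, "perimeter": perimeter, "circularity": circularity}

# A 10 x 10 filled square inside a 12 x 12 image
mask = np.zeros((12, 12), dtype=int)
mask[1:11, 1:11] = 1
print(morphological_features(mask)["area"])  # 100
```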
After the above features are obtained, we mine the useful data. Using the pandas
library, we generated a clean data frame to prepare sparse features for the next-stage
classification algorithms. In the context of ML, the categorical variables in our data
sets are viewed as discrete entities that are encoded as feature vectors. “Dirty”,
non-organized data produces redundant categorical variables, that is, several categories
representing the same entity [11]. This can be overcome by converting categorical data
into numbers. In our case, one-hot encoding was used, where binary vectors represent the
categorical variables. The binary vector is all zeros except at the index of the
integer, which is marked as 1. One-hot encoding allows a more expressive representation
of categorical variables and eliminates redundancy, which enhances the performance of
our model.
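A minimal sketch of the one-hot encoding described above, using two class names from the data set as example labels:

```python
import numpy as np

def one_hot(labels):
    """Encode a list of categorical labels as binary vectors.

    Each vector is all zeros except for a 1 at the index of the
    category, so no category is treated as "larger" than another.
    """
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(labels), len(categories)), dtype=int)
    for row, label in enumerate(labels):
        encoded[row, index[label]] = 1
    return categories, encoded

cats, enc = one_hot(["Mucosa", "Debris", "Mucosa"])
print(cats)          # ['Debris', 'Mucosa']
print(enc.tolist())  # [[0, 1], [1, 0], [0, 1]]
```

In practice the same result is obtained with library routines such as pandas `get_dummies`.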
Furthermore, we decreased the size of our two data sets by ignoring variables that do
not represent relevant data for building our model (e.g., the sample ID). We also
evaluated whether the resulting data sets are balanced, making sure that the classes
have the same number of samples.
In the end, the obtained clean data sets were split into 80%/20% for training and
testing, respectively. Each set consists of random samples from the corresponding clean
data set. This way, we are able to evaluate the performance and robustness of the
classification methods.
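The 80%/20% random split can be sketched as follows (the fixed seed is an illustrative choice for reproducibility):

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Randomly split rows into train/test sets (80/20 by default)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))          # random row order
    n_test = int(len(X) * test_fraction)
    test, train = order[:n_test], order[n_test:]
    return X[train], X[test], y[train], y[test]

X = np.arange(100).reshape(50, 2)  # 50 toy samples, 2 features each
y = np.arange(50)
X_tr, X_te, y_tr, y_te = train_test_split(X, y)
print(len(X_tr), len(X_te))  # 40 10
```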
Parameter                          Value
Dense-1 (units)                    100
Dense-2 (units)                    50
Dense-3 (units)                    8
Layers 1, 2 - activation function  ReLU
Layer 3 - activation function      Softmax
Epochs                             30
Validation split                   10%
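The architecture in the table corresponds to a three-layer MLP; a numpy sketch of its forward pass is shown below. The weights are random stand-ins, and the training loop (30 epochs, 10% validation split) is omitted:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, params):
    """Forward pass matching the table:
    Dense(100, ReLU) -> Dense(50, ReLU) -> Dense(8, Softmax)."""
    h = relu(x @ params["W1"] + params["b1"])
    h = relu(h @ params["W2"] + params["b2"])
    return softmax(h @ params["W3"] + params["b3"])

rng = np.random.default_rng(0)
n_features = 161  # number of visual features in data set I
params = {
    "W1": rng.normal(0, 0.1, (n_features, 100)), "b1": np.zeros(100),
    "W2": rng.normal(0, 0.1, (100, 50)),         "b2": np.zeros(50),
    "W3": rng.normal(0, 0.1, (50, 8)),           "b3": np.zeros(8),
}
probs = mlp_forward(rng.random((1, n_features)), params)
assert probs.shape == (1, 8)  # one probability per class
```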
4 Results
4.1 Eight-Class Data Set: Results of the Classification
The above four ML algorithms were applied to the first (eight-class) data set. We aim
here at comparing the performances of SVM, MLP, RF and Naive Bayes.
The results of our experiments are shown in Table 5. As the table shows, SVM yielded
the best results among the four in terms of Accuracy, Precision and Recall. However, RF
and SVM perform equally well when the F1-Score is used for comparison. Overall, all four
algorithms achieve performances exceeding 0.8, with the lowest precision being MLP’s
0.89.
Figure 9 shows the 4-metric overall classification results for each of our ML
techniques. This figure suggests that the best classifications are obtained by SVM and
RF, with scores exceeding 0.95. On the other hand, Naive Bayes scored below 0.93 on each
metric, while MLP scored even lower, at 0.85.
One can notice from Table 6 that, in addition to the F1-score, it is crucial to look at
precision and recall in order to get a good estimate of the performance of each method.
In addition, Table 6 provides some details about the classes giving our model more
trouble. In particular, some special patterns in histopathological images make their
analysis more challenging. For example, MLP’s poor precision of 0.64 on the 6th class
tells us that MLP is not good at predicting a true positive for “Mucosa”, a non-cancer
class. Four values in Table 6 are under 0.70 for MLP (classes 2, 3, 5, 6).
SVM is a great computational diagnosis tool. For most of the metrics used, SVM obtained
scores close to 1. We can therefore say that SVM can classify every class of the data
set correctly.
We can also say that RF is fairly reliable. The model obtained only 3 scores below
0.90. It is less computationally intensive and was able to achieve a score of 1 on three
easy classes.
Naive Bayes achieved scores above 0.90; however, it was challenged by classes 5 and 6.
Comparing all four models, MLP was merely acceptable, and we believe it is the least
desirable model for CAD tools.
Figure 10 shows the performance by Precision for every class, to compare results among
the proposed models. Similarly, Fig. 11 illustrates the performance by Recall. Last,
Fig. 12 shows the performance using the F1-Score. Considering class 8, one can notice a
singularity: all methods achieved very high scores, all above 0.98. We can conclude that
class 8 is an easy class to classify.
In addition to these scores, confusion matrices for all proposed methods are
needed to further evaluate their performances.
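A confusion matrix, and the per-class precision and recall derived from it, can be computed as follows (the toy labels are illustrative):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def precision_recall(cm, cls):
    """Precision and recall for one class, read off the matrix:
    precision = tp / column sum, recall = tp / row sum."""
    tp = cm[cls, cls]
    precision = float(tp / cm[:, cls].sum()) if cm[:, cls].sum() else 0.0
    recall = float(tp / cm[cls].sum()) if cm[cls].sum() else 0.0
    return precision, recall

cm = confusion_matrix([0, 0, 1, 1, 1], [0, 1, 1, 1, 0], n_classes=2)
print(cm.tolist())  # [[1, 1], [1, 2]]
```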
First, Fig. 13 shows the confusion matrix of MLP. One can notice that class 2 is
confused with class 3. The reason could be that the MLP model does not distinguish
between the shapes (morphological characteristics) of the classes “Debris” (3) and
“Complex” (2). In addition, classes 5 and 6 are close. Because there are no obvious
common characteristics, one might attribute this to a model deficiency revealed by the
validation error.
Similarly, the confusion matrix of Naive Bayes (Fig. 14) suggests very good
classification results, where the classes are clearly distinguished. Although classes 5
and 6 are confused, this is not a major classification issue because both of them are
non-cancer classes.
From Fig. 15, we can see that RF confuses class 6, “Mucosa”, with class 1, “Adipose”.
This is due to the small variability in shapes between the classes “Adipose” and
“Mucosa”. Despite this, most samples were classified correctly, and we can conclude
that this is still an efficient model. Even though the model slightly confuses these
classes, they are not “cancer” classes; hence, this is not a major issue that might
affect a CADx system. Figure 16 shows the results of SVM, which is robust in
classifying all classes.
4.2 Two-Class Data Set: Results of the Classification
The performance of SVM, MLP, RF and Naive Bayes is compared when the models are
trained on the second (2-class) data set.
Fig. 17. Performance of SVM, MLP RF and Naive Bayes with the 2-class data set
The obtained results show that SVM and RF perform very well, with scores exceeding 0.8.
Figure 17 suggests that RF yields the best classification, with scores over 0.90. On the
other hand, MLP and Naive Bayes did poorly, with scores below 0.7.
When selecting features, working with subsets of features, such as morphology and
color, can be useful to discover the relevant characteristics to be used for
classification. In the case of RF, all its scores on the performance metrics are high,
as they exceed 0.85. The scores of the other three algorithms are in the range
0.70–0.85. This suggests that RF ranks first, followed by SVM, Naive Bayes and MLP.
We have added Fig. 18 to investigate the detailed precision of the four models tested
on the 2-class (binary) data set. In turn, Fig. 19 depicts the
performance by recall, and Fig. 20 does the same for the F1-Score. One can see that
precision and recall scores, though variable among classes, yield a high F1-Score of
0.90. Every algorithm has difficulty with recall on class 0 (“no cancer”), which
prevents it from reaching a value greater than 0.750; the precision score thus has to
compensate, and the resulting mean values differ between algorithms. Given the
variability in the Precision, Recall and F1-Score values, we can conclude that RF, Naive
Bayes, SVM and MLP do not have enough training data. Therefore, we cannot say whether
or not they are reliable and/or robust when using only color and morphological
features.
We also obtained the confusion matrices for these algorithms to evaluate
their performance in distinguishing between ‘cancer’ and ‘no-cancer’.
Fig. 18. Performance of the 4 models with precision - 2-class data set
Fig. 19. Performance of the 4 models with recall - 2-class data set
Unlike in Subsect. 4.1, this data set is binary, as it consists of two classes only:
“c0 = cancer” and “c1 = no-cancer”. The MLP confusion matrix is shown in Fig. 21; it
shows that the model easily confuses the “cancer” class, while “no cancer” is well
recognized by this model.
On the other hand, RF and SVM seem to perform better according to Fig. 24. Both models
are able to clearly distinguish between a cancer tissue sample and a non-cancer one.
However, RF has better precision than SVM.
Figure 22 suggests that Naive Bayes yields results comparable to those of RF. However,
Naive Bayes suffers from a higher level of confusion for the “no-cancer” class
Fig. 20. Performance of the 4 models with F1-score - 2-class data set
(Fig. 23). RF comes out as the best-performing model when considering all the obtained
classification scores (all metrics) as well as the confusion matrices. Its scores of
0.91 for accuracy and 0.89 for F1-score are high.
4.3 Comparison
First, looking at the results in Figs. 9 and 17, we can see that Naive Bayes and MLP
achieve performance scores exceeding 0.80 on the first (8-class) data set. Their scores
fall to the range 0.60–0.70 on the second (2-class) data set. The results of these
models could possibly be improved by customizing them further, for instance by adding a
new layer to the MLP model or by changing the activation function of its final layer.
For Naive Bayes, one can replace its “Gaussian” configuration by a “Multinomial” one. It
is worth noting that the second (2-class) data set is quite small, explaining in part
why the methods performed poorly (misclassifications).
Using the same Figs. 9 and 17, SVM easily distinguishes the classes from each other on
the 8-class data set (scoring 0.90). However, SVM does not perform as well on the
2-class data set (scoring 0.83). RF, on the other hand, yields great performance on both
data sets: it achieves scores exceeding 0.90 for all metrics and easily differentiates
between the classes, as shown in its confusion matrix.
After careful analysis of the metrics, and to avoid misclassification of morphological
and color features, especially on the binary data set, we believe that the number of
training samples should be increased.
Any change in the performance of the methods on the binary data set, where both the
number of classes and the number of samples are reduced, is very likely due to the
small size of the data set and of its validation set.
To summarize, considering the number of features detected properly, the current results
suggest that RF performs best, with both large and small data sets. We can also say that
MLP performs the worst on both the 8-class and 2-class data sets, mainly due to the
smaller number of samples in the context of neural networks.
5 Conclusion
Similar techniques have been used in the past to classify images in the context of
colorectal cancer classification. However, those techniques mainly used texture as a
feature. This paper has shown that other features, i.e., color and morphological
characteristics, are also very useful in the context of colorectal cancer
classification, paving the way for other ways to investigate the same kind of images.
Among the four ML models, RF achieved the best performance when considering the
Accuracy, Precision and Recall scores. These scores exceeded 0.95 for the first
(8-class) data set and 0.90 for the second (2-class, binary) data set. The proposed
models were tested on two very different data sets to investigate both computational
cost and classification performance. RF proved to yield high accuracy even when the
training data set was smaller; in other words, RF has the capability to increase the
learner’s generalization. To a certain extent, the performance of SVM, a classical
classification model, is comparable to that of RF.
In future work, the configuration could be improved, with the goal of raising
precision and accuracy, through different kinds of experiments. In particular,
adding more images of adenocarcinoma to increase the number of samples will
likely improve the 2-class (binary) classification. Comparison with other ML
techniques and combining multiple models, such as RF and SVM, could lead to
better results. In addition, one might explore other customizations, such as
increasing the tree depth in RF, to enhance performance and robustness.
A Compact Spectral Model
for Convolutional Neural Network
1 Introduction
The convolutional neural network (CNN) is a machine learning model that has
been successful in handling complex problems in Computer Vision (CV). It is
one of the most accurate solutions for image recognition tasks, and it achieves
this by leveraging the input image's intrinsic invariance to motion, rotation,
and deformation [1]. CNNs are well suited to non-linear problems since they
can learn the features and patterns of the given data. They have been
successfully used in a wide range of applications, including natural language
processing [2], document processing [3], financial forecasting [4], face
detection and recognition [5], speech recognition [6], monitoring and
surveillance [7,8], image classification
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 100–120, 2023.
https://doi.org/10.1007/978-3-031-18461-1_7
[9,10], autonomous robot vision [11], and character recognition [12]. These
applications mainly utilize deep learning algorithms that can autonomously
extract features from input data to achieve high accuracy.
One of the most significant challenges of CNNs is the computational cost of
executing them [13,14]. This is especially true in view of the rapid increase
in the usage of big data in web servers, as well as the large number of samples
to be classified in the cloud. The growing number of datasets and the
complexity of CNN models place a significant computing burden on any processing
platform. Beyond the issue of high data complexity, the higher precision
required in real-world applications has also raised the computational
difficulty of CNNs. To achieve the high accuracy required by today's practical
recognition applications, CNNs must be larger and deeper, thus requiring more
processing resources.
The most computationally intensive aspect of a CNN is its convolution layers
[13] (hereafter referred to as CONV layers). In the inference (classification)
phase, spectral representation can significantly speed up the computation of
convolutions [15–18]. The convenient operator duality between convolution in
the spatial domain and Element-Wise Matrix Multiplication (EWMM) in the
spectral (frequency) domain allows convolutions to be computed in the spectral
domain with higher computational efficiency. Thanks to this duality, computing
CONV layers in the spectral domain can significantly lower their computational
cost from O(N² × K²) to O(N²), where N is the width of the input feature map
and K is the width of the kernel.
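This duality is easy to verify numerically. The following minimal sketch (an illustration, not the paper's implementation) checks that the element-wise product of 2-D FFTs matches a direct circular convolution:

```python
import numpy as np

# Check the duality: IFFT(FFT(x) * FFT(w)) equals the circular convolution
# of x with w, so a CONV layer can be computed as one element-wise product.
rng = np.random.default_rng(0)
N, K = 8, 3
x = rng.standard_normal((N, N))            # N×N input feature map
w = rng.standard_normal((K, K))            # K×K kernel

# Zero-pad the kernel to N×N so both spectra have the same shape.
w_pad = np.zeros((N, N))
w_pad[:K, :K] = w

# Spectral path: one EWMM in the frequency domain, O(N^2) per output.
y_spectral = np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(w_pad)))

# Spatial path: direct circular convolution, O(N^2 * K^2), for reference.
y_direct = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        for a in range(K):
            for b in range(K):
                y_direct[i, j] += w[a, b] * x[(i - a) % N, (j - b) % N]

assert np.allclose(y_spectral, y_direct)
```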
Since the high computational complexity (CC) of existing spatial and spectral
CNNs is an obstacle to real-time deployment in embedded applications, this
research aims to provide efficient optimizations for reducing the computational
burden of CNNs. More specifically, the objective is to eliminate unnecessary
complex computations in different layers of the spectral CNN model while
retaining an acceptable level of accuracy. These optimizations make CNNs more
suitable for implementation in resource-constrained systems.
Existing spectral CNN models can be memory-intensive due to larger-sized
kernels and multiple spectral-spatial domain transformations, necessitating a
lot of memory and processing power [1,16]. The training of these networks is
typically conducted offline to save energy and remove some computing strain
from embedded devices, and only the time-sensitive task of classification
(testing) is implemented on the target system [19,20]. Therefore, the goal of
this research is to increase the run-time speed of the testing (classification)
algorithm, as measured in milliseconds (ms), while maintaining acceptable
recognition accuracy, as evaluated by the misclassification error rate (MCR).
The following are the primary contributions of this work: we present a
compact spectral convolutional neural network model with a smaller feature
map (FM) size that has minimal CC and higher classification speed. The
proposed spectral CNN model outperforms conventional models by 24.11× and
4.96× in terms of classification speed on the AT&T face recognition and MNIST
(Mixed National Institute of Standards and Technology) digit/fashion
classification datasets, respectively, when compared to the equivalent network
in the spatial domain.
102 S. O. Ayat et al.
2 Related Work
The size and depth of state-of-the-art CNNs have grown in response to the rapid
increase in today’s datasets and the higher accuracy demands of new applica-
tions. On the other hand, most hand-held devices that use artificial intelligence,
such as smartphones, are becoming smaller and more energy-efficient. These
trends necessitate algorithmic optimizations that are hardware-friendly in order
to lower the computational workload required to run these compute-intensive
algorithms [21].
Singular Value Decomposition (SVD) is one of the techniques from linear
algebra that has been used extensively to reduce the CC of a network. It was
first applied by Denton et al. [22] to object recognition systems and works by
factorizing the input matrix. Another popular compression method is pruning,
employed by Han et al. [23] to decrease the complexity of CNNs. This technique
is based on the idea that some neurons contribute less to the network
performance and can therefore be removed. These methods are hard to implement
in real-world applications and may result in accuracy loss if overused [24].
Another work [25] applied a data quantization (or precision scaling) method
that dynamically changes the bit-width of the data from 4 bits to 16 bits in
different CNN layers.
The binarized neural network (BNN) is a commonly used approach for reducing
the number of computations [26–29]. The goal here is to cut down on the number
of bits used to represent activations (output FMs of CONV layers) or kernels
(filter weights). As a result, although binarized convolution still has a CC
of O(N² × K²), the storage requirements as well as the computational cost of
the CNN model are significantly reduced [30]. For instance, the work in [27]
proposed the use of a one/two-bit data format in their CNN model, which
considerably reduces the computation cost and therefore the run-time of the
algorithm. The classification time of this model on the MNIST dataset is
almost seven times faster than its baseline implementation. In addition to
fewer computations and faster operations, BNNs also improve the energy
efficiency of the embedded hardware, as proven in [31].
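The arithmetic saving in BNNs comes from replacing multiply-accumulate with bit operations. A minimal sketch of the standard XNOR-popcount dot product for ±1 values (an illustration of the general idea, not any cited model):

```python
# Binarized dot product: ±1 values packed as bits, computed with XNOR and
# popcount instead of multiply-accumulate (bit=1 encodes +1, bit=0 encodes -1).
def bin_dot(a_bits, b_bits, n):
    # XNOR marks positions where the two signs agree.
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)
    # (#agreements) - (#disagreements) = 2*popcount - n
    return 2 * bin(xnor).count("1") - n

def pack(v):
    # Pack a ±1 vector into an integer bit mask.
    return sum((x > 0) << i for i, x in enumerate(v))

a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
assert bin_dot(pack(a), pack(b), 4) == sum(x * y for x, y in zip(a, b))
```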
Another approach is to use stochastic computing (SC), which is suitable for
area-optimized hardware implementations [32–35]. In this method, a sequence
of randomly generated bit-streams is used to represent the actual number in the
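The bit-stream representation just mentioned can be sketched as follows, a minimal illustration of SC multiplication under the standard unipolar encoding (not any cited design):

```python
import random

# Stochastic computing sketch: a value p in [0, 1] is encoded as a random
# bit-stream whose fraction of 1s is p; the product of two independent
# streams is then just a bit-wise AND.
random.seed(1)
L = 100_000                                   # stream length
a, b = 0.6, 0.5
sa = [random.random() < a for _ in range(L)]  # stream encoding a
sb = [random.random() < b for _ in range(L)]  # stream encoding b
prod = sum(x and y for x, y in zip(sa, sb)) / L
assert abs(prod - a * b) < 0.01               # approximates a*b = 0.30
```

The accuracy of the result grows with the stream length, which is the trade-off SC designs manage.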
The proposed spectral CNN model for handwritten digit classification using
the MNIST datasets in the spectral domain is shown in Fig. 1. This model is
referred to as CNN3 in this work, and it goes through two steps of
computational reduction. The first step is based on the fusion technique
proposed in [15], and the resulting CNN is labeled CNN2. The second step
(which creates the CNN3 model from CNN2) is to compact the input FM sizes,
and it is explained further in Sect. 4. Compared to the conventional spatial
CNN model (denoted CNN1) depicted in Fig. 2, CNN3 has fused layers, smaller
FM sizes, and executes CNN computations in the spectral domain.
In our proposed spectral CNN model in Fig. 1, the entire feature-extraction
segment, including the CONV layers C1, C2, and C3 (and the associated pooling
and activation layers), is performed in the spectral domain. After layer C3,
we apply an IFFT so that the classification segment (involving the fully
connected and softmax layers, F4 and X5) is computed in the spatial domain.
The layers in the classification segment are therefore computed with real
numbers rather than complex-valued numbers. The spectral rectified linear
unit (SReLU) is employed here to prevent multiple domain switchings, similar
to the spectral CNN model proposed in [15]. These methods are only summarized
in this paper; readers are advised to refer to the original paper for more
information.
Fig. 1. Proposed spectral CNN model with reduced FM sizes (labeled as CNN3)
for handwritten digit classification (Input 1@3×3 → C1 20@3×3 → C2 50@3×3 →
C3 150@1×1 → F4 10@1×1, full connection → X5 10@1×1, softmax)
Fig. 2. CNN model for handwritten digit classification using the conventional
spatial model (labeled as CNN1) (Input 1@28×28 → C1 20@24×24, 4×4 convolution
→ S1 20@12×12 → C2 50@8×8 → S2 50@4×4 → C3 150@1×1 → F4 10@1×1, full
connection → X5 10@1×1, softmax)
Convolution layers play an essential role in the CNN architecture, as the
network's name implies. These layers form the main memory of the network:
they extract information from the input data and store it in the kernel
weight parameters. Their main job is therefore feature extraction from the
raw data given to them. Deep learning algorithms usually employ multiple CONV
layers in their structures, each designed to extract features at a different
level. The early layers extract superficial information such as curves and
edges; as we go deeper into the network, the CONV layers extract higher-level
information, such as semi-circles and squares [48], and also cover more area
of the input pixel space.
Algorithm 1 shows the computation steps involved in these layers, which can
only be expressed as multiple nested loops covering the whole input FM, the
output FM, and the kernels. These nested loops are the main reason why CONV
layers are the most computation-intensive part of the whole network, as they
can occupy 90% of the computation of the entire network [19,44].
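A sketch of this nested-loop structure (an illustrative Python version, not the paper's Algorithm 1) makes the O(N² × K²) cost per input/output FM pair explicit:

```python
import numpy as np

# Nested-loop spatial CONV layer (valid mode, CNN-style correlation): for
# every input/output FM pair, each of the (N-K+1)^2 outputs needs K^2
# multiply-accumulates, i.e. the O(N^2 * K^2) cost discussed above.
def conv_layer(x, w):
    # x: (C_in, N, N) input FMs; w: (C_out, C_in, K, K) kernels
    c_in, n, _ = x.shape
    c_out, _, k, _ = w.shape
    out = n - k + 1
    y = np.zeros((c_out, out, out))
    for j in range(c_out):            # loop over output FMs
        for i in range(c_in):         # sum over input FMs
            for r in range(out):      # output rows
                for c in range(out):  # output columns
                    y[j, r, c] += np.sum(x[i, r:r + k, c:c + k] * w[j, i])
    return y
```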
After each input FM (x) has gone through the EWMM process with the associated
trainable weight kernel (w), each output FM (y) is computed as the sum over
the input FMs. As shown in Eq. 1, this concept is used to implement the
compute-intensive CNN CONV layers as simple EWMMs in the Fourier domain:
Y_j = Σ_{i=1}^{I} X_i · W_{i,j} = F( Σ_{i=1}^{I} x_i * w_{i,j} )   (1)
where i and j are the indexes of the input FMs and output FMs, respectively.
Algorithm 2 demonstrates the operations involved in the EWMM, where C1_nFMs
and C2_nFMs are the FM sizes in layers C1 and C2, R and C are the row and
column sizes of the FM, and K is the kernel size of the convolution operation.
In this paper, capital letters denote the Fourier transform of the original
signal, while lowercase letters denote the original FMs in the spatial domain.
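Eq. 1 can be sketched as a single summation over input FM spectra (an illustrative reimplementation, not the paper's Algorithm 2):

```python
import numpy as np

# Eq. (1) in code: each output spectrum Y_j is the sum over input spectra X_i
# of element-wise products with the kernel spectra W_{i,j}.
def spectral_conv(X, W):
    # X: (I, R, C) complex input FM spectra; W: (I, J, R, C) kernel spectra
    return np.einsum('irc,ijrc->jrc', X, W)

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 4, 4)) + 1j * rng.standard_normal((2, 4, 4))
W = rng.standard_normal((2, 3, 4, 4)) + 1j * rng.standard_normal((2, 3, 4, 4))
Y = spectral_conv(X, W)

# Reference: explicit loops over input/output FMs.
Y_ref = np.zeros((3, 4, 4), dtype=complex)
for j in range(3):
    for i in range(2):
        Y_ref[j] += X[i] * W[i, j]
assert np.allclose(Y, Y_ref)
```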
Input: X ∈ C^{M×N}; Output: Y ← Crop(X, H × W), Y ∈ C^{H×W}   (2)
f(x) = c_0 + c_1 · x + c_2 · x²   (3)
X̂ ← (X_i(0,0) − μ_B × M × N) / √(σ_B² + ε), for each sample i = 1, …, m_b of the mini-batch   (4)
Y_i ← γ X̂ + DC   (5)
DC ← X̂_new = X̂_old + (β × δ(u,v) × M × N) ≡ BN_{γ,β}(X_i)   (6)
As in the original work [49], the scale and shift parameters γ and β are
determined using the back-propagation step in the spatial domain. As proposed
in [15], we have added a BN layer after each CONV layer. All activation
functions thus benefit from a uniformly shaped, continuous, and stable
distribution over their optimal values.
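For reference, the standard spatial-domain batch normalization of [49], which the spectral variant above is designed to reproduce without leaving the frequency domain (a minimal sketch, not the authors' spectral implementation):

```python
import numpy as np

# Standard spatial-domain batch normalization [49]: normalize each activation
# over the mini-batch, then apply the learned scale (gamma) and shift (beta).
def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (m_b, ...) mini-batch of activations
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # scale and shift

x = np.random.default_rng(0).standard_normal((64, 8, 8))
y = batch_norm(x, gamma=1.0, beta=0.0)
```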
Table 1. FM size reduction involved in the spectral CNN for the digit
classification task
Fig. 3. Effect of reduction in the input FM size of CNN3 (from 12×12 down to
1×1) on the input images from the MNIST, mnist-back-image, mnist-back-random,
Fashion-MNIST, mnist-rot, and AT&T datasets
From Fig. 3, it is quite clear that shrinking the input FM sizes in CNNs makes
the visible features more blurred. Most of the samples remain recognizable
down to the size of 3 × 3. This situation is problematic for smaller FMs, whereas in the
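The FM-size reduction illustrated in Fig. 3 amounts to cropping the centered spectrum, which acts as a low-pass filter on the image. A minimal sketch (an assumed illustration, not the paper's code):

```python
import numpy as np

# Keep only the lowest H×W frequency components of an M×N spectrum: the
# compact spectrum discards high frequencies, like a low-pass filter.
def crop_spectrum(x, h, w):
    X = np.fft.fftshift(np.fft.fft2(x))       # move the DC term to the center
    m, n = X.shape
    top, left = m // 2 - h // 2, n // 2 - w // 2
    return X[top:top + h, left:left + w]      # compact H×W spectrum

img = np.random.default_rng(0).standard_normal((6, 6))
dc = crop_spectrum(img, 1, 1)[0, 0]           # a 1×1 crop keeps only DC
```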
The compact spectral CNN model proposed in this paper was evaluated on
different standard datasets. The effectiveness of the proposed computation
reduction method is compared to the spatial model (CNN1) and the conventional
spectral model (CNN2). Both accuracy and classification times are compared
among the three CNN models discussed in this paper. Furthermore, the accuracy
of the proposed model was compared to some earlier works on spatial CNNs
employing the SC and BNN methods for computation reduction. Due to differences
in the platforms, CNN models, and test cases employed, a complete comparison
with prior publications that use spectral representation in neural networks
is not feasible.
To make a fair comparison with earlier works, we have applied the previous
approaches to the same LeNet-5 architecture, utilizing the same test cases
(MNIST variations and the AT&T face dataset), and operating on the same
system. As a result, three alternative CNN models have been constructed for
our tests in order to benchmark the proposed spectral CNN model's performance.
All network weights and inputs in the spatial model (named CNN1) are in
real-number format because it is run in the spatial domain. The baseline
spectral CNN model, dubbed CNN2, is the subject of the second experiment. The
final network proposed in this paper (named CNN3) is similar to CNN2, with
the exception that the input FM sizes are reduced to 3 × 3 pixels.
The CNN models were trained on the MNIST, Fashion-MNIST, mnist-back-random,
mnist-rot, mnist-back-image, and AT&T datasets. Figures 4 and 5 show some
randomly selected samples of these datasets. The networks are trained off-line
in MATLAB using the open-source MatConvNet [50] package. The network's
classification task is implemented in the C programming language, with the
Fourier transforms of the network weights performed off-line in MATLAB.
Fig. 4. Ten different classes in the (a) MNIST, (b) Mnist-Back-Random,
(c) Mnist-Rot, (d) Mnist-Back-Image, and (e) Fashion-MNIST datasets
Table 2. Classification speed (per sample image) of the proposed spectral
model in comparison with previous approaches
As evident from the results provided in Table 2, the proposed spectral CNN
model outperforms previous approaches in terms of classification time. The
proposed model outperforms the spatial model by about 5 times on the MNIST
variants and 24 times on the AT&T dataset. Essentially, the high CC of the
CONV layers [51] limits the classification speed of the spatial CNN model
[50]. This computational burden is removed in this work by computing the
convolutions as EWMMs in the spectral domain. Furthermore, because of the
reduced FM sizes in its design, the algorithm for the proposed model (CNN3)
requires fewer computations for classification than existing spectral CNN
models such as CNN2. For this reason, CNN3 can perform faster than CNN2 on
any computing platform.
The CNN models CNN1 and CNN2 are trained for 50 epochs. Because of its slower
convergence rate, CNN3 had to go through 200 epochs of training. Each CNN
model is tested independently on the MNIST variations as well as on the AT&T
face recognition dataset. The test accuracy of the three CNN models is
measured in terms of MCR. Figures 6 and 7 show the training and testing MCRs
of the proposed spectral model (CNN3) on the MNIST variants and the AT&T
dataset, respectively. Table 3 documents the test accuracy attained by all
three CNN models on the six datasets employed in this work for evaluating the
models.
Fig. 6. Training and testing MCRs of spectral CNN3, with and without the
Batch Normalization (BNorm) technique, on the (a) MNIST, (b) Mnist-Back-Random,
(c) Mnist-Rot, and (d) Mnist-Back-Image datasets
Fig. 7. Training and testing MCRs of spectral CNN3, with and without the
Batch Normalization (BNorm) technique, on the (a) AT&T and (b) Fashion-MNIST
datasets
The difference between CNN2 and CNN3 lies only in the FM sizes of the CONV
layers. The input FM size in CNN2 is 12 × 12, whereas in CNN3 it is reduced
to 3 × 3. The reason why 3 × 3 was chosen for this network is illustrated in
Fig. 8, which shows the classification accuracy of the network for different
input FM sizes in the C1 layer. As can be observed from the plot, the
performance does not drop considerably as the number of frequency components
in the FM is reduced. This holds until the FM size is reduced to 2 × 2, at
which point the accuracy is no longer in the acceptable range. It is worth
reporting that the network with only the single DC component (1 × 1 pixel)
does not converge at all. The accuracy shows a minor drop for the 3 × 3 FM
size in the first 50 epochs, which is compensated for by a longer training
of 200 epochs.
In all of the experiments, the spectral-domain network takes longer (more
epochs) to converge in the training phase than the conventional method,
especially CNN3, whose training duration is increased from 50 to 200 epochs.
However, this is not a significant issue, particularly for the end-user, who
merely requires a quick and accurate classifier, independent of how much time
it requires for the training stage. In other words, dedicated hardware can be
used to speed up the time-critical inference process while leaving the
training phase to servers or a powerful host PC.
We have benchmarked the accuracy of our proposed model against previous CNN
models (both spatial and spectral) that employed state-of-the-art strategies
for reducing the CC of CNNs. Table 4 shows the accuracy attained by our model
as well as by these previous works. The majority of these works are aimed at
various embedded system applications.
The proposed model’s accuracy was compared to previous works that use
spectral representation, as well as works that use various spatial domain tech-
niques (such as SC and BNN) to reduce CC in CNN. The CC of computing
convolution in all the three approaches are also listed in Table 4. On the MNIST
dataset, the results show that the proposed model is more accurate than alter-
native approaches. With BNN and SC approaches, the reduction in computation
frequently comes at the cost of a significant loss of accuracy in network perfor-
mance. In other words, in neural networks, there is always a trade-off between
model accuracy and data bit-width of the weights and activations [13].
6 Conclusions
In this work, we have demonstrated that computing a CNN in the spectral domain
can be accurate and computationally inexpensive. This gain in computational
efficiency arises from computing convolutions as element-wise products in the
spectral domain, with a much lower CC of O(N²) (instead of O(N² × K²)), and
from employing extremely compact FM sizes. This work contributes to the
spectral CNN paradigm with a model that computes inference with fewer
computations than state-of-the-art spectral CNNs. It achieves its objectives
by introducing an FM-size reduction approach that results in a CNN model
offering faster inference than state-of-the-art spectral CNN models. As
originally intended, the proposed method does not impact the classification
accuracy and, moreover, produces reduced MCRs on datasets with noisy images.
This is because the proposed model discards high-frequency components like an
LPF, which results in a noise-tolerant solution.
As discussed in Sect. 2, BNN and SC are other successful approaches for
simplifying computations in CNNs. Therefore, a possible path for future work
is to design a hybrid CNN model with BNN or SC data representation in the
spectral domain (a hybrid spectral-BNN or hybrid spectral-SC model). Unlike
the BNN and SC methods, the spectral representation is independent of the
number of binary digits (bit length) representing the data. In other words,
the focus of the spectral representation is on the CNN algorithm rather than
on the bit length of the data. Hence, it is possible to further compress the
bit width of the complex-valued data in the spectral domain toward a binary
or stochastic implementation. The main concern would be how to maintain the
accuracy of the hybrid network at an acceptable range.
Acknowledgment. The authors thank Universiti Teknologi Malaysia (UTM) for their
support under the Research University Grant (GUP), grant number 16J83.
References
1. Almasi, A.D., Wozniak, S., Cristea, V., Leblebici, Y., Engbersen, T.: Review of
advances in neural networks: neural design technology stack. Neurocomputing 174,
31–41 (2016)
2. Collobert, R., Weston, J.: A unified architecture for natural language processing:
deep neural networks with multitask learning. In: Proceedings of the 25th Inter-
national Conference on Machine Learning, pp. 160–167. ACM (2008)
3. Simard, P.Y., Steinkraus, D., Platt, J.C.: Best practices for convolutional neural
networks applied to visual document analysis. In: Proceedings of the 7th Interna-
tional Conference on Document Analysis and Recognition (ICDAR), vol. 3, pp.
958–963. IEEE (2003)
4. McNelis, P.D.: Neural Networks in Finance: Gaining Predictive Edge in the Market.
Academic Press, Cambridge (2005)
5. Ahmad Radzi, S., Mohamad, K.H., Liew, S.S., Bakhteri, R.: Convolutional neural
network for face recognition with pose and illumination variation. Int. J. Eng.
Technol. (IJET) 6(1), 44–57 (2014)
6. Huang, P.S., He, X., Gao, J., Deng, L., Acero, A., Heck, L.: Learning deep struc-
tured semantic models for web search using clickthrough data. In: Proceedings of
the 22nd ACM International Conference on Information & Knowledge Manage-
ment, pp. 2333–2338. ACM (2013)
7. Rasti, P., Uiboupin, T., Escalera, S., Anbarjafari, G.: Convolutional neural network
super resolution for face recognition in surveillance monitoring. In: Perales, F.J.J.,
Kittler, J. (eds.) AMDO 2016. LNCS, vol. 9756, pp. 175–184. Springer, Cham
(2016). https://doi.org/10.1007/978-3-319-41778-3_18
8. Yeap, Y.Y., Sheikh, U.U., Ab Rahman, A.A.: Image forensic for digital image
copy move forgery detection. In: 14th IEEE International Colloquium on Signal
Processing and Its Applications (CSPA), pp. 239–244. IEEE (2018)
9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep con-
volutional neural networks. Commun. ACM 60(6), 84–90 (2017)
10. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. In: Proceedings of the 3rd International Conference on Learning
Representations, ICLR (2015)
11. Sermanet, P., et al.: A multirange architecture for collision-free off-road robot
navigation. J. Field Robot. 26(1), 52–87 (2009)
12. LeCun, Y., et al.: Handwritten digit recognition with a back-propagation network.
In: Proceedings of the 2nd International Conference on Neural Information Pro-
cessing Systems (NIPS), pp. 396–404. NeurIPS (1989)
13. Zhang, Q., Zhang, M., Chen, T., Sun, Z., Ma, Y., Yu, B.: Recent advances in
convolutional neural network acceleration. Neurocomputing 323, 37–51 (2019)
14. Amer, H., Ab Rahman, A., Amer, I., Lucarz, C., Mattavelli, M.: Methodology
and technique to improve throughput of FPGA-based Cal dataflow programs: case
study of the RVC MPEG-4 SP intra decoder. In: Proceedings of the IEEE Work-
shop on Signal Processing Systems (SiPS), pp. 186–191. IEEE (2011)
15. Ayat, S., Khalil-Hani, M., Ab Rahman, A., Abdellatef, H.: Spectral-based con-
volutional neural network without multiple spatial-frequency domain switchings.
Neurocomputing 364, 152–167 (2019)
16. Rizvi, S., Ab Rahman, A., Khalil-Hani, M., Ayat, S.: A low-complexity complex-
valued activation function for fast and accurate spectral domain convolutional
neural network. Indones. J. Electr. Eng. Inform. (IJEEI) 9(1), 173–184 (2021)
17. Liu, S., Luk, W.: Optimizing fully spectral convolutional neural networks on
FPGA. In: Proceedings of the 19th IEEE International Conference on Field-
Programmable Technology (ICFPT), pp. 39–47. IEEE (2020)
18. Watanabe, T., Wolf, D.: Image classification in frequency domain with 2SReLU:
a second harmonics superposition activation function. Appl. Soft Comput. 112,
107851 (2021)
19. Zhang, C., Li, P., Sun, G., Guan, Y., Xiao, B., Cong, J.: Optimizing FPGA-based
accelerator design for deep convolutional neural networks. In: Proceedings of the
2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays,
pp. 161–170. ACM (2015)
20. Qiu, J., et al.: Going deeper with embedded FPGA platform for convolutional neu-
ral network. In: Proceedings of the 2016 ACM/SIGDA International Symposium
on Field-Programmable Gate Arrays, pp. 26–35. ACM (2016)
21. Gysel, P., Motamedi, M., Ghiasi, S.: Ristretto: hardware-oriented approximation
of convolutional neural networks. arXiv preprint (arXiv:1605.06402). arXiv (2016)
22. Denton, E.L., Zaremba, W., Bruna, J., LeCun, Y., Fergus, R.: Exploiting linear
structure within convolutional networks for efficient evaluation. In: Proceedings of
the Advances in Neural Information Processing Systems, pp. 1269–1277. NeurIPS
(2014)
23. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for
efficient neural network. In: Proceedings of the 28th International Conference on
Neural Information Processing Systems (NIPS), pp. 1135–1143. NeurIPS (2015)
24. Shawahna, A., Sait, S.M., El-Maleh, A.: FPGA-based accelerators of deep learning
networks for learning and classification: a review. IEEE Access 7, 7823–7859 (2018)
25. Zhang, X., Zou, J., Ming, X., He, K., Sun, J.: Efficient and accurate approximations
of nonlinear convolutional networks. In: Proceedings of the IEEE Conference on
Computer Vision and pattern Recognition (CVPR), pp. 1984–1992. IEEE (2015)
26. Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural
networks: Training deep neural networks with weights and activations constrained
to +1 or -1. arXiv preprint (arXiv:1602.02830). arXiv (2016)
27. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neu-
ral networks. In: Proceedings of the Advances in Neural Information Processing
Systems, pp. 4107–4115. NeurIPS (2016)
28. Liang, S., Yin, S., Liu, L., Luk, W., Wei, S.: FP-BNN: binarized neural network
on FPGA. Neurocomputing 275, 1072–1086 (2018)
29. Wu, Q., Lu, X., Xue, S., Wang, C., Wu, X., Fan, J.: SBNN: slimming binarized
neural network. Neurocomputing 401, 113–122 (2020)
30. Mittal, S.: A survey of FPGA-based accelerators for convolutional neural networks.
Neural Comput. Appl. 32(4), 1109–1139 (2018). https://doi.org/10.1007/s00521-
018-3761-1
31. Umuroglu, Y., et al.: A framework for fast, scalable binarized neural network infer-
ence. In: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, pp. 65–74. ACM (2017)
32. Ma, X., et al.: An area and energy efficient design of domain-wall memory-based
deep convolutional neural networks using stochastic computing. In: Proceedings
of the 19th International Symposium on Quality Electronic Design (ISQED), pp.
314–321. IEEE (2018)
33. Li, J., et al.: Hardware-driven nonlinear activation for stochastic computing based
deep convolutional neural networks. In: Proceedings of the International Joint Con-
ference on Neural Networks (IJCNN), pp. 1230–1236. IEEE (2017)
34. Li, Z., et al.: HEIF: highly efficient stochastic computing-based inference framework
for deep neural networks. IEEE Trans. Comput.-Aid. Des. Integr. Circ. Syst. 38(8),
1543–1556 (2019)
35. Abdellatef, H., Khalil-Hani, M., Shaikh-Husin, N., Ayat, S.: Accurate and com-
pact convolutional neural network based on stochastic computing. Neurocomputing
471, 31–47 (2022)
36. Qian, W., Li, X., Riedel, M.D., Bazargan, K., Lilja, D.J.: An architecture for fault-
tolerant computation with stochastic logic. IEEE Trans. Comput. 60(1), 93–105
(2010)
37. Hayes, J.P.: Introduction to stochastic computing and its challenges. In: Proceed-
ings of the 52nd ACM/IEEE Design Automation Conference (DAC), pp. 1–3. IEEE
(2015)
38. Bottleson, J., Kim, S., Andrews, J., Bindu, P., Murthy, D.N., Jin, J.: clCaffe:
OpenCL accelerated Caffe for convolutional neural networks. In: Proceedings of
the IEEE International Parallel and Distributed Processing Symposium Workshops
(IPDPSW), pp. 50–57. IEEE (2016)
39. Bareiss, E.H.: Numerical solution of linear equations with Toeplitz and vector
Toeplitz matrices. Numerische Mathematik 13(5), 404–424 (1969)
40. Sze, V., Chen, Y.H., Yang, T.J., Emer, J.S.: Efficient processing of deep neural
networks: a tutorial and survey. Proc. IEEE 105(12), 2295–2329 (2017)
41. Winograd, S.: Arithmetic Complexity of Computations. SIAM (1980)
42. Lavin, A., Gray, S.: Fast algorithms for convolutional neural networks. In: Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 4013–4021. IEEE (2016)
43. Rippel, O., Snoek, J., Adams, R.P.: Spectral representations for convolutional neu-
ral networks. In: Proceedings of the 28th International Conference on Neural Infor-
mation Processing Systems (NIPS), pp. 2449–2457. ACM (2015)
44. Ayat, S., Khalil-Hani, M., Ab Rahman, A.: Optimizing FPGA-based CNN accel-
erator for energy efficiency with an extended Roofline model. Turk. J. Electr. Eng.
Comput. Sci. 26(2), 919–935 (2018)
45. Niu, Y., et al.: SPEC2: SPECtral SParsE CNN accelerator on FPGAs. In: Proceed-
ings of the 26th IEEE International Conference on High Performance Computing,
Data, and Analytics (HiPC), pp. 195–204. IEEE (2019)
46. Sun, W., Zeng, H., Yang, Y.-h., Prasanna, V.: Throughput-optimized frequency
domain CNN with fixed-point quantization on FPGA. In: International Conference
on ReConFigurable Computing and FPGAs (ReConFig), pp. 1–8. IEEE (2018)
47. Guan, B., Zhang, J., Sethares, W., Kijowski, R., Liu, F.: Spectral domain convolu-
tional neural network. In: Proceedings of the 46th IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pp. 2795–2799. IEEE (2021)
48. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444
(2015)
49. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by
reducing internal covariate shift. arXiv preprint (arXiv:1502.03167). arXiv (2015)
120 S. O. Ayat et al.
50. Vedaldi, A., Lenc, K.: MatConvNet: convolutional neural networks for MATLAB.
In: Proceedings of the 23rd ACM International Conference on Multimedia (MM),
pp. 689–692. ACM (2015)
51. Cong, J., Xiao, B.: Minimizing computation in convolutional neural networks. In:
Proceedings of the 24th International Conference on Artificial Neural Networks
(ICANN), pp. 281–290. IEEE (2014)
Hybrid Context-Content Based Music
Recommendation System
Abstract. Due to advances in technology and research over the past few decades,
music has become increasingly available to the public, but with such a vast
selection, it becomes challenging to choose which songs to listen to. Research
on music recommendation systems (MRS) identifies three main methods for
recommending songs: context-based, content-based, and collaborative filtering.
A hybrid combination of the three methods has the potential to improve music
recommendation; however, it has not been fully explored. In this paper, a hybrid
music recommendation system using emotion as the context and musical data as
the content is proposed. To achieve this, the outputs of a convolutional neural
network (CNN) and a weight extraction method are combined. The CNN extracts
user emotion from a favorite playlist and extracts audio features from the songs and
metadata. The output of the user emotion and audio features is combined, and a
collaborative filtering method is used to select the best song for recommendation.
For performance evaluation, the proposed recommendation system is compared with
the content similarity music recommendation system (CSMRS) as well as other
personalized music recommendation systems.
1 Introduction
Music has always been an important topic in discussions of the development of the
modern internet. In the earlier days, music was mostly bought from stores in the form
of vinyl or discs. As the internet evolved, music piracy increased and became ubiquitous
in the listening, sharing, and storage of music, because music was more accessible
in the mp3 format. On the modern internet, music streaming services such as Spotify,
Pandora, and Deezer have increased the availability of music through subscription or free plans.
Due to the enormous amount of music content available, these streaming services have
developed recommender systems that account for user preferences when recommending
music. Recommender systems are important because they can enhance the user
experience on a particular music streaming site. An enhanced experience helps
provide quick, relevant recommendations to users so that they can be retained
and more potential users can be gained. Knowing user preferences is also important for
placing advertisements, which serve as a major income generator for streaming sites.
Music information retrieval, generation, and recommendation are
discussed in the following sections.
Cosine Similarity:  S_IK = (X_I · X_K) / (‖X_I‖ ‖X_K‖)    (2)
where S_IK is the similarity between the ratings given by users I and K, L_IK is the
set of items rated/liked by both users (over which the rating vectors X_I and X_K in
Eq. 2 are formed), and X̄_I and X̄_K are the average ratings of the two users [2]. r_Ix is
the predicted rating when user I has not listened to song x, and is calculated using Eq. 3 below.

r_Ix = X̄_I + ( Σ_{n∈N_I} S_nI (r_nx − X̄_n) ) / ( Σ_{n∈N_I} S_nI )    (3)

where N_I is the set of user I's nearest neighbors who rated item x, and S_nI is the
similarity score between neighbor n and user I. Finally, the items with the highest r_Ix
are recommended to user I.
The process for CF is to calculate the similarities between user I and other users,
select the users with the highest similarities, and take the weighted average of their ratings,
using the similarities as the weights. Because different users rate on different scales,
creating bias, each user's average rating is subtracted and the target user I's average is
added back, as shown in Eq. 3. CF thus finds users similar to the target user and recommends
songs to the target user based on the weights attached to each song.
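The CF computation above can be sketched in a few lines of NumPy. This is a minimal illustration of Eqs. (2) and (3), not the authors' implementation; the toy ratings matrix and user names are invented, and 0 is used here to mark an unrated item.

```python
import numpy as np

def cosine_similarity(a, b):
    # Eq. (2): S_IK = (X_I . X_K) / (||X_I|| * ||X_K||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_rating(target, neighbors, ratings, item):
    # Eq. (3): mean-centered, similarity-weighted average over the
    # nearest neighbors; 0 marks an unrated item in this toy encoding.
    x_t = ratings[target][ratings[target] > 0].mean()
    num = den = 0.0
    for n in neighbors:
        s = cosine_similarity(ratings[target], ratings[n])
        x_n = ratings[n][ratings[n] > 0].mean()
        num += s * (ratings[n][item] - x_n)
        den += s
    return x_t + (num / den if den else 0.0)

# Toy data: user "I" has not heard song 2; "A" and "B" are neighbors.
ratings = {
    "I": np.array([5.0, 3.0, 0.0, 4.0]),
    "A": np.array([4.0, 3.0, 5.0, 4.0]),
    "B": np.array([2.0, 1.0, 4.0, 2.0]),
}
print(round(predict_rating("I", ["A", "B"], ratings, item=2), 2))
```

Both neighbors rated song 2 above their own averages, so the prediction lands above user I's own average of 4.0.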
Another form of recommendation is content-based filtering (CBF). This approach
recommends songs based on features extracted from the audio signal, such as rhythm,
tempo, and melody, together with metadata. The information can also include low-level
audio descriptors such as the Mel-Frequency Cepstral Coefficients (MFCCs), the Zero
Crossing Rate (ZCR), and the Teager Energy Operator (TEO). Since audio information is
used to recommend songs, CBF eliminates the need for a large user base and is also
more accurate than CF in predicting users' taste [3].
The last form of recommender is context-based. These recommenders factor in contexts
such as time, location, weather, news, and emotion. Since music is often laden with contexts
like emotion, it is important to factor them in when recommending songs to a user. Many
music streaming services take advantage of this by creating playlists such as morning
moods, sad/happy, or Sunday tunes.
In this article, a preliminary music recommendation system that incorporates
content-based, collaborative-filtering, and context-based methods to analyze a user's taste
and accurately predict songs to recommend is proposed and explained. Emotion is used as the
context, metadata is used for the content, and collaborative filtering is to be applied once
a large number of users has been acquired. The rest of the paper is structured as follows. In the
next section, related work on music recommendation is presented. In Sect. 3, our
proposed model is discussed. Section 4 gives more details on the experimental results.
Analysis of the proposed system is done in Sect. 5. Finally, the paper concludes with a
summary and directions for future work.
2 Related Work
Researchers have used audio content to recommend music in a two-stage approach,
extracting low-level audio content such as MFCCs and using these features to predict what
a user likes [4, 5]. The results achieved, however, were unsatisfactory. Researchers
therefore developed better recommendation systems, such as the use of a deep belief
network and a probabilistic graphical model to unify the two steps into an automated
process [6].
124 V. Omowonuola et al.
One of the earliest uses of deep learning (DL) in content-based music recommendation
(MR) is by van den Oord, who used a CNN with Rectified Linear Units (ReLU) to reduce
processing time. The input to the CNN consisted of short audio samples from 7digital
and track data from the Million Song Dataset containing the song metadata, which were
used to train the network. Each song was represented by 50 latent factors, used to
minimize the mean square error (MSE) when predicting a user's likes. The experiment
showed that a CNN performs best in user predictions.
Other researchers used collaborative filtering to collect and recommend song data
to users. This method was feasible, as shown in [5], but it had a cold-start problem:
due to the lack of user data at the beginning of the process, it was impossible to
recommend songs to a new user. To mitigate this problem, content-based music
recommendation was incorporated into collaborative filtering methods. Papers by
Richard Stenzel and Thomas Kamps show the advantages of incorporating multiple
recommendation methods.
To increase the accuracy of recommender models, researchers have also extracted
emotional data from songs by training a CNN on the MediaEval 1000 Songs database
(DEAM). This database contains song information classified on an emotional scale of
valence and arousal. The CNN model is based on regression and outputs values for both
valence and arousal. Mel spectrograms of 15 s audio snippets are used as inputs to the
CNN, and the network is trained on the DEAM dataset. The experiments obtained a mean
absolute error below 0.1, meaning that the predicted valence and arousal are less than
1 value away from the truth.
Previous work on music recommendation has been done by our group of researchers;
the paper involved using the Spotify API to access song information and attributes
such as danceability, energy, and valence. Attributes were normalized and then used in
K-means clustering, in which each song belongs to the cluster with the nearest mean. A
function then utilized the cosine distance to recommend songs to a user. When this model
was tested with a subset of 20 people, recommended songs were close to a user's
preference about 75% of the time [7].
Overall, this paper combines three music recommendation methods (content-based,
collaborative filtering, and context-based) to create a system that achieves
higher accuracy in predicting musical likes and recommending music to a user. Based
on research done by previous authors, we believe this hybrid method should lead
to greater accuracy in predictions.
The next section gives an overview of the recommender system, describes the datasets
used, and explains the processing of the audio signals and the architecture. The content
used and the valence-arousal method chosen to represent emotion are also explained
in that section.
3 Model Construction
3.1 Emotional Representation
Music is a great conveyor of emotions, and the emotions perceived from listening to
music vary by person. However, research over the recent years has shown improvements
in classifying music based on moods/emotions. This can be seen by the enormous number
of emotional playlists in music streaming sites, and success in the use of CNN and
recurrent neural networks (RNN) in classifying songs [8, 9]. Due to the difficulty in
granulating emotions and the variability in emotion perceived by humans, Posner et al.
proposed a model which splits emotions into two components: valence and arousal.
Valence describes the attractiveness (positive valence) or aversiveness (negative valence)
of stimuli along a continuum (negative – neutral – positive), and arousal refers to the
perceived intensity of an event from very calming to extremely exciting or agitating [9].
The valence-arousal scale and its association with emotions are depicted in Fig. 1. Since
this scale represents a wide range of emotions, it was chosen as the form to represent
emotions found in songs.
As shown in Fig. 1, most emotions can be represented using this scale. For example,
a low arousal and low valence score represents tiredness, while the opposite represents
excitement.
For emotional classification, the authors chose to use a CNN, since it is an
established method of classifying emotion [8–11]. Since CNNs were created to
classify images, the audio data must be represented as an image. Spectrograms are
widely used in the machine learning field to represent audio data, so the authors
chose to use them as well. Audio spectrograms visualize sound on a 2D plane:
the horizontal axis represents time, the vertical axis represents frequency, and color
represents the amplitude of each time-frequency pair. The spectrogram can be defined as
an intensity plot of the Short-Time Fourier Transform (STFT) magnitude. Let X be a
signal of length Y. Consecutive equal segments of length n are taken, where n < Y; the
segments may overlap. Each segment is windowed, and its spectrum is computed
using the STFT. The STFT is represented by Eq. (4) below.
STFT{x(t)}(τ, ω) ≡ ∫_{−∞}^{∞} x(t) w(t − τ) e^{−iωt} dt    (4)
where w(t − τ) is the window function centered around τ, and x(t) is the signal to be
transformed. Research has shown that humans do not hear frequencies on a linear scale,
so a normal spectrogram is limited in fully representing the way audio is heard
and interpreted by our brains. A more accurate image representation is the
Mel-spectrogram, which converts the frequencies onto the Mel scale, as shown
in Eq. 5.
M(f) = 1125 ln(1 + f / 700)    (5)
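As a quick sanity check on Eq. (5), the conversion and its inverse can be written directly. The inverse function is an addition of ours, useful when placing Mel filter-bank edges; note that Eq. (8) later in the paper uses the equivalent 2595·log10 form of the same scale.

```python
import math

def hz_to_mel(f):
    # Eq. (5): M(f) = 1125 * ln(1 + f / 700)
    return 1125.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of Eq. (5).
    return 700.0 * (math.exp(m / 1125.0) - 1.0)

# The scale compresses high frequencies: a fixed step in Mel covers an
# ever-wider band in Hz.
print(round(hz_to_mel(1000), 1))  # ≈ 998.2
print(round(hz_to_mel(8000) - hz_to_mel(7000), 1))
```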
Training and testing of the neural network were first done with the DEAM dataset.
The dataset consists of 1000 songs, with 744 unique songs selected from the Free Music
Archive. Each audio snippet has a length of 45 s from a random starting point in a song.
The songs are annotated in a table of valence and arousal values ranging from
1 to 9 [14]. The arousal and valence values are centered around 5, and there is a lack
of songs with low arousal and valence values. The distribution of the arousal values is
shown in Fig. 3. Because the distribution of values is not balanced, the results may be
biased. To mitigate this, data from the Deezer emotional classification
dataset was added to improve the balance of the new dataset.
First, each song is divided into three 15 s samples. The audio files are converted to
mono and then resampled at 16 kHz. The songs are converted into Mel-spectrograms
using the 'Librosa' Python library [15], with a window size of 2048 and a hop size
of 512.
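The framing just described (mono 16 kHz audio, window 2048, hop 512) can be reproduced with a hand-rolled discrete STFT. This is a sketch of Eq. (4), not the Librosa internals; in practice `librosa.feature.melspectrogram` with `n_fft=2048`, `hop_length=512`, and `n_mels=128` performs this step plus the Mel filter bank.

```python
import numpy as np

def stft(x, n_fft=2048, hop=512):
    # Discrete counterpart of Eq. (4): slide a Hann window over the
    # signal and Fourier-transform each frame; trailing partial frames
    # are dropped for simplicity.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

# One second of a 440 Hz tone at the paper's 16 kHz sample rate.
sr = 16000
t = np.arange(sr) / sr
spec = np.abs(stft(np.sin(2 * np.pi * 440 * t)))
print(spec.shape)  # (28, 1025): 28 frames, n_fft // 2 + 1 frequency bins
```

The energy concentrates near bin 440 · 2048 / 16000 ≈ 56, as expected for a pure tone.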
The valence-arousal values are normalized to [0, 1] and the spectrogram is normalized
to [−1, +1]. Two network architectures are chosen for comparison of results: (a) the
pre-trained mobilenet_v2_130_224 model developed by Google, and (b) a simple network
as shown in previous research [7].
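The paper does not spell out the normalization; a standard min-max rescale matching the stated ranges would look like this (a sketch, not the authors' code):

```python
import numpy as np

def minmax(a, lo=0.0, hi=1.0):
    # Linearly rescale an array into [lo, hi].
    a = np.asarray(a, dtype=float)
    return lo + (hi - lo) * (a - a.min()) / (a.max() - a.min())

# Valence/arousal annotations (1-9 scale) -> [0, 1].
va = minmax(np.array([1.0, 5.0, 9.0]))
# A Mel-spectrogram -> [-1, +1].
spec = minmax(np.random.rand(128, 1024), lo=-1.0, hi=1.0)
print(va, spec.min(), spec.max())
```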
The simple network developed consists of an input layer of 128 × 1024 (128 Mel
scales and a 1024-step time window), followed by a convolution layer with 16 filters of
3×3 kernel size, a ReLU activation (used to reduce processing time), a 3×3 pooling layer,
a dense layer of 64 neurons, and an output layer of 10 nodes. As mentioned before,
the hidden layers use the ReLU activation function, and the output layer uses the SoftMax
function. The model was used to obtain outputs for both valence and arousal.
For the pre-trained mobilenet_v2_130_224 model, each spectrogram was converted
into an image of 224×224 pixels, and the pixels were converted into tensors used as the
input. The sequential Keras API was used since our model is not complicated. The first
layer takes the images and finds patterns in them; the second layer takes the information
from the first layer and outputs it into ten unique labels (0–9). SoftMax is used as the
activation function and Adam as the optimizer. For both models, the dataset
was split 80:10:10 for training, testing, and validation.
O_i = Sig_i × HW_i    (7)
The output signal is then converted into the frequency domain by the fast Fourier
transform and mapped onto the Mel scale using Eq. 8.
Mel(i) = 2595 × log10(1 + i / 700)    (8)
Pitch. Pitch extraction calculates the distances between the peaks of a given segment
of the music audio signal. Let Sig_i denote the samples of the audio segment, k the pitch
period of a peak, and Len the window length of the segment; the pitch feature can
be obtained using Eq. (9).

Pitch(k) = Σ_{i=0}^{Len−k−1} Sig_i Sig_{i+k}    (9)
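Eq. (9) is an autocorrelation; picking the lag k with the largest value inside a plausible range turns it into a pitch estimate. A minimal NumPy sketch follows; the search range of 40-400 samples is our assumption, covering roughly 40-400 Hz at 16 kHz.

```python
import numpy as np

def autocorr_pitch(sig, sr, k_min=40, k_max=400):
    # Eq. (9): Pitch(k) = sum_{i=0}^{Len-k-1} Sig_i * Sig_{i+k};
    # the lag k maximizing the autocorrelation gives the pitch period.
    n = len(sig)
    ac = np.array([np.dot(sig[: n - k], sig[k:]) for k in range(k_max + 1)])
    k_best = k_min + int(np.argmax(ac[k_min : k_max + 1]))
    return sr / k_best  # fundamental frequency in Hz

sr = 16000
t = np.arange(2048) / sr
print(autocorr_pitch(np.sin(2 * np.pi * 200 * t), sr))  # prints 200.0
```

A 200 Hz tone at 16 kHz has an exact period of 80 samples, so the peak lands at lag 80 and the estimate is exact.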
Tonality: Tones are letter names that humans attach to different frequencies. Most music
contains various tones but settles on a home tone from which variation starts.
This home tone is used by musicians as a feature of the song. Since it is in the form of
letters, the letters are encoded into numbers to serve as input to the neural network.
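The paper does not specify the encoding; one hypothetical scheme maps the twelve pitch-class letters (plus a major/minor flag) to integers:

```python
# Hypothetical encoding -- the paper only states that home-tone letters
# are turned into numbers before entering the network.
TONES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def encode_tonality(tone, minor=False):
    # 0-11 for major keys, 12-23 for minor keys.
    return TONES.index(tone) + (12 if minor else 0)

print(encode_tonality("G"), encode_tonality("A", minor=True))  # prints 7 21
```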
Other features like beat and rhythm are also used as inputs to the neural network. To
increase the accuracy of the predictions, high level features used in previous research
are also used as inputs. These features as defined by Spotify are:
Danceability: Describes how suitable a track is for dancing based on a combina-
tion of musical elements including tempo, rhythm stability, beat strength, and overall
regularity.
Energy: Represents a perceptual measure of intensity and activity. Typically, ener-
getic tracks feel fast, loud, and noisy. For example, death metal has high energy, while
a Bach prelude scores low on the scale.
Loudness: The overall loudness of a track in decibels (dB). Loudness values are
averaged across the entire track and are useful for comparing the relative loudness of
tracks.
Speechiness: This detects the presence of spoken words in a track. The more exclu-
sively speech-like the recording (e.g., talk show, audiobook, poetry), the closer to 1.0
the attribute value.
Instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah”
sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly
“vocal.”
Liveness: Detects the presence of an audience in the recording. Higher liveness
values represent an increased probability that the track was performed live.
Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic.
Duration: The duration of the track in milliseconds [16].
These features were obtained from the Free Music Archive (FMA) small and features
datasets. The FMA is a large dataset created for music analysis, containing about 100,000
songs from 160 genres; the audio files and features of the songs are provided [17]. The
FMA small subset contains 8000 songs divided equally among 8 genres. Due to the balance
of this dataset and its relation to the dataset used for emotional classification, it was
chosen by the authors [8]. Table 1 shows the division of genres in the FMA small dataset.
Due to the difficulty of achieving better accuracy using the FMA dataset alone, a new
dataset mixing data from the FMA and GTZAN datasets was used for classification.
The GTZAN dataset (refer to Table 2) contains about 1000 music samples (10 different
genres) of 30 s each [18]. Audio labels are also provided, making it similar to the FMA
dataset.
The new dataset contains 200 songs for each of the genres present in the GTZAN dataset
(blues, classical, country, disco, hip hop, jazz, metal, pop, reggae, rock).
The features serve as input to our neural network to classify songs into genres. First,
each feature is normalized to [0, 1] to reduce processing time and increase accuracy.
Since the dataset is small, a simple deep neural network (DNN) was developed to
classify genres.
To train the DNN, the dataset was split 70:30 for training and testing.
Based on experimentation, ReLU is used instead of the sigmoid function, leading to
faster convergence, and the Adam optimizer gave the best results. The parameters of
the model are listed below.
• Batch size of 64
• 1000 epochs
• ReLU activation function
• Learning rate of 0.001 and momentum of 0.9
• Adam solver
The output layer of the model gives 10 different genres; this layer is discarded after
training, and each song is represented by a vector of 64 values.
4 Experimental Results/Analysis
In the emotional classification model, two different neural networks were used: the
pre-trained mobilenet_v2_130_224 (referred to as v2_net) and the created neural network
(referred to as EmoCNN). The results were obtained with the k-fold cross-validation
method, and the accuracy rate was computed as the average over three folds. Table 3
shows the accuracy from the training, validation, and testing stages.
With its higher accuracy, the EmoCNN model was chosen to produce the
valence-arousal values for the final recommendation model. Higher accuracy could
likely be achieved with a larger and more balanced dataset than the one used, but that
is left for further research. The neural network used to classify songs by genre, on the
other hand, performed better. The root mean square error (RMSE) was used to evaluate
the model. It is defined by
RMSE = √( (1/N) Σ_{i=1}^{N} (r_i − r_i′)² )    (10)
where r_i and r_i′ are the true and predicted ratings, respectively. Overall, this genre
classification model attained an RMSE of 0.53. The accuracy was lower than expected;
however, considering that genre classifiers on similar datasets do not reach accuracies
above 70%, this is not surprising. Moreover, the accuracy observed in the model satisfied
the requirement for genre prediction.
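Eq. (10) in NumPy, for reference (a generic implementation, not tied to the authors' pipeline):

```python
import numpy as np

def rmse(r_true, r_pred):
    # Eq. (10): square root of the mean squared difference between
    # true and predicted values.
    r_true = np.asarray(r_true, dtype=float)
    r_pred = np.asarray(r_pred, dtype=float)
    return float(np.sqrt(np.mean((r_true - r_pred) ** 2)))

print(round(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]), 3))  # prints 0.577
```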
With both models built and tested, a vector combining the valence and arousal values
and the 64 latent values from genre classification is formed. This vector is attached to
each song and will be used to recommend songs once a collaborative filtering model is
employed; this is to be done in future research. Overall, the accuracy of the
recommendations should be high based on the results of the emotional and feature
classification.
6 Conclusion
The influence music has on the population is immense and cannot be overlooked; by
examining music and finding out why particular features influence our musical taste,
we gain new ways to understand music. This paper describes preliminary research on
creating a hybrid music recommendation system. Neural networks are leveraged to obtain
emotional values and to classify genres for feature extraction. This model differs from
other approaches by incorporating emotion as a context and adding content-based and
collaborative filtering methods to recommend songs. Emotions are extracted using
Mel-spectrograms and represented on the valence-arousal scale; songs are classified
into genres using audio data, and the genres are used to derive interesting song features.
Due to the limitations of the dataset for genre classification, more research is necessary
to create a better feature database and use it for analysis. By doing so, a greater variety
of genres, tracks, and features will be incorporated into the dataset. This will lead to
better recommendation model designs that classify music accurately and benefit the
learning and teaching of music. In addition, the prototype model should be tested by
people to identify errors, as accuracy alone is not an exact predictor of human
recommendations. By creating such a model and integrating it into music
recommendation systems, its digital profiling and therapeutic effects can be realized.
References
1. Chou, P.-W., Lin, F.-N., Chang, K.-N., Chen, H.-Y.: A simple score following system for
music ensembles using chroma and dynamic time warping. In: Proceedings of the 2018 ACM
on International Conference on Multimedia Retrieval, pp. 529–532 (2018)
2. Luo, S.: Intro to Recommender System: Collaborative Filtering. Towards Data Science (2018). https://towardsdatascience.com/intro-to-recommender-system-collaborative-filtering-64a238194a26
3. Hassen, A.K., Janßen, H., Assenmacher, D., Preuss, M., Vatolkin, I.: Classifying music genres
using image classification neural networks. In: Archives of Data Science, Series A (Online
First), 5(1), 20. KIT Scientific Publishing (2018)
4. Cano, P., Koppenberger, M., Wack, N.: Content-based music audio recommendation. In: Proceedings of the 13th Annual ACM International Conference on Multimedia, pp. 211–212 (2005)
5. Yoshii, K., Goto, M., Komatani, K., Ogata, T., Okuno, H.G.: Hybrid collaborative and content-based music recommendation using probabilistic model with latent user preferences. In: ISMIR, pp. 296–301 (2006)
6. Wang, X., Wang, Y.: Improving content-based and hybrid music recommendation using deep
learning. In: Proceedings of the 22nd ACM international conference on Multimedia, pp. 627–
636 (2014)
7. Mandapaka, J.S., Omowonuola, V., Kher, S.: Estimating musical appreciation using neural
network. In: Proceedings of the Future Technologies Conference FTC 2021, pp. 415–430.
Springer, Cham (2021). https://doi.org/10.1007/978-3-030-89880-9_32
8. Malik, M., Adavanne, S., Drossos, K., Virtanen, T., Ticha, D., Jarina, R.: Stacked convolutional and recurrent neural networks for music emotion recognition. arXiv preprint arXiv:1706.02292 (2017)
9. Aljanaki, A., Yang, Y.-H., Soleymani, M.: Developing a benchmark for emotional analysis
of music. PloS one 12(3), e0173392 (2017)
10. Akella, R.: Music Mood Classification Using Convolutional Neural Networks. San Jose State
University, Master’s project (2019)
11. O’Shaughnessy, D.: Speech Communication: Human and Machine. Addison-Wesley (1987)
12. Roberts, L.: Understanding the Mel Spectrogram. Medium (2020)
13. Roberts, L.: Understanding the Mel Spectrogram. Medium (2020). https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53. Accessed 21 Jun 2022
14. Soleymani, M., Caro, M.N., Schmidt, E.M., Sha, C.-Y., Yang, Y.-H.: 1000 songs for emotional
analysis of music. In: Proceedings of the 2nd ACM international workshop on Crowdsourcing
for multimedia, pp. 1–6 (2013)
15. McFee, B., Raffel, C., Liang, D., Ellis, D.P., McVicar, M., Battenberg, E., Nieto, O.: librosa:
Audio and music signal analysis in python. In: Proceedings of the 14th Python in Science
Conference 8, 18–25 (2015)
16. Spotify: Web API Reference | Spotify for Developers (2019). https://developer.spotify.com/documentation/web-api/reference/
17. Defferrard, M., Benzi, K., Vandergheynst, P., Bresson, X.: FMA: a dataset for music analysis. arXiv preprint arXiv:1612.01840 (2016)
18. Olteanu, A.: GTZAN Dataset - Music Genre Classification. Kaggle.com (2019). https://www.kaggle.com/andradaolteanu/gtzan-dataset-music-genre-classification
Development of Portable Crack Evaluation
System for Welding Bend Test
Abstract. This paper describes a system to evaluate crack severity in welding
bend test fragments. The examination in the Japanese welding qualification test is
conducted by human visual inspection, and its burden is a concern. The
authors constructed equipment to photograph the fragment specimens under
stable optical conditions. The proposed system is also designed for portability,
to assist the evaluator in the field. We employed Resnet18 to evaluate the given
image. The image input layer of the original Resnet18 was remodeled from 224-
by-224 to 500-by-500 to capture crack features in detail. The output layer is
replaced with three classification nodes, "Bad," "Good," and "Neutral,"
expressing the crack severity levels. Experiments showed that 83% accuracy was
obtained, confirming that the CNN adequately captured the surface crack conditions.
The experimental details, remarks, and future work are discussed.
1 Introduction
In developed countries, skilled welding technicians are in short supply. This leads to
a deficit of veterans to nurture young welders. Given this issue, several studies investigate
nurturing methods and construct educational systems for welding [1–3]. In Japan, the
welding qualification is certified by passing a practical examination stipulated by JWES
(Japanese Welding Engineering Society) policy [4]. A veteran judge conducts
visual inspection of the bend test fragments shown in Fig. 1: a
welding plate produced by a beginner is cut to a specific width and then bent at the
welding joint [5].
As Fig. 2 illustrates, the veteran (skilled welder) visually inspects the surface cracks
of the bend test fragments. However, the evaluators must carefully check many bend
test specimens, and this burden is a concern.
This study aims to build an automatic evaluation system using the Convolutional
Neural Network (CNN) model [6] to address the above problem. The schematic of the
proposed system is shown in Fig. 3.
The input to the CNN is the specimen's image, and the CNN outputs the classification
result as "Bad", "Good", or "Neutral". Such a system will support the evaluators' decisions.
We therefore intend to construct the system in a compact size so that it can be carried
to the inspection site. If the computer judges "Bad" and "Good" specimens
correctly, the evaluator's workload is reduced; as a result, the evaluator will be able to
concentrate on evaluating "Neutral" specimens.
Various CNN models have been developed, such as VGG [7], Googlenet [8], Resnet
[9], Densenet [10], and Efficientnet [11]. These CNNs achieve high accuracy on the task
of classifying 1000 different object categories in RGB color images, so these models
are compatible with the bend test fragment's RGB color image. As our initial attempt,
we chose Resnet [9] in the present paper because its structure is the simplest among
Directed Acyclic Graph (DAG) networks [8–11]. VGG [7], in contrast, is a classical style
compared with DAG networks because all its layers are arranged one after the other.
We found excellent prior work on evaluating welding defects using CNNs [12–17].
However, our study differs from these similar studies in that it focuses on the evaluation
of specific crack severity features on the bend test fragment.
2 System Construction
Figure 4 shows the equipment used to photograph bend test fragments. The digital camera
is fixed to the top end of the arm frame of the photographing stand. The photographing
stand is placed inside a compact box to keep the optical conditions stable; the optical box
can be folded and is portable. Through the hole in the ceiling of the optical box, we can
confirm the camera's view. LED lamps are attached to the ceiling to maintain stable
optical conditions.
136 S. Kato et al.
Fig. 4. Photographing stage is inside the shooting box to take the bend test fragment’s picture
using the fixed digital camera on the top of the arm.
As shown in Fig. 5, the specimen is placed at the center of the markers so that only
the specimen part is extracted. The extracted image is input to the CNN.
The bend test is a roller bending test defined by JIS (Japanese Industrial Standards)
[18]. The plate thickness is mainly 9 mm; however, a few 10 mm specimens are
included. The material is SS400 steel, as specified by JIS. Note that we photographed bend
test fragments made from flat plates or pipes; the fragments are parts of the flat plates or
pipes, bent toward the back or the face side. Since the focus of the present study
is on surface crack severity, the thickness and bending directions are not important.
Figure 6 shows the structure of the proposed CNN based on Resnet-18 [9]. The image
input layer of the original Resnet-18 has [224,224,3] resolution, and the output layer
comprises 1000 classification nodes. In contrast, the proposed CNN is remodeled so that
the image input layer has [500,500,3] resolution and the output layer has three
classification nodes representing the probability of “Bad,” “Good,” and “Neutral” crack
severity. The class with the highest probability becomes the classification result. In the
following experiment to validate the proposed CNN, the kernel values of all convolutional
layers were initialized to the same values as the original Resnet-18 before training.
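The three-node output head described above can be sketched in numpy. This is a minimal illustration, not the authors’ MATLAB implementation: the 512-dimensional feature size, the random weights, and the label order are assumptions for demonstration only.

```python
import numpy as np

LABELS = ["Bad", "Good", "Neutral"]

def softmax(z):
    # Numerically stable softmax over a 1-D vector
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def classify(features, W, b):
    """Map a backbone feature vector to a crack-severity label.

    The three output nodes give class probabilities; the class with the
    highest probability becomes the classification result.
    """
    probs = softmax(features @ W + b)
    return LABELS[int(np.argmax(probs))], probs

# Toy example: a 512-dim feature vector and a randomly initialized 3-way head
rng = np.random.default_rng(0)
features = rng.normal(size=512)
W = rng.normal(size=(512, 3))
b = np.zeros(3)
label, probs = classify(features, W, b)
```

In the actual system this head is trained end to end together with the convolutional layers.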
3 Experiment
The photographed specimens were converted into [500,500] resolution RGB images
compatible with the proposed CNN input layer size. In total, 105 pictures were obtained.
The images were classified into “Bad”, “Good”, and “Neutral” depending on the
crack severity level. Figure 8 shows classification examples. The numbers of “Bad”,
“Good”, and “Neutral” images were 29, 59, and 17, respectively (105 = 29 + 59 + 17).
To validate the CNN, we carried out repeated random subsampling validation, similar
to bootstrap validation [19, 20], which is reliable even when only a small amount of data
is available. First, 5 “Bad” and 5 “Good” images were taken out at random from all 105
images comprising 29 “Bad”, 59 “Good”, and 17 “Neutral” images (105 = 29 +
59 + 17). These 10 images (10 = 5 “Bad” + 5 “Good”) were held out to test
CNN accuracy. The remaining 95 (95 = 105 − 10) images were used for
training the CNN. The training was performed under the conditions shown in Table 1.
After training the CNN, the 10 images not used for training were input into the trained
CNN to validate its accuracy. This routine was performed over 20 trials, as shown in Table 2.
Table 2 enumerates the test data set and accuracy for each trial.
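The subsampling routine described above can be sketched as follows. The image-ID naming is hypothetical, and the actual CNN training and testing are elided; the sketch only reproduces the split sizes (10 test, 95 training) over 20 trials.

```python
import random

def repeated_random_subsampling(bad_ids, good_ids, neutral_ids, trials=20, seed=0):
    """Repeatedly hold out 5 "Bad" + 5 "Good" images for testing and keep
    the remaining 95 images for training, returning one split per trial."""
    rng = random.Random(seed)
    splits = []
    for _ in range(trials):
        test_bad = rng.sample(bad_ids, 5)
        test_good = rng.sample(good_ids, 5)
        test_set = test_bad + test_good                       # 10 test images
        train_set = ([i for i in bad_ids if i not in test_bad]
                     + [i for i in good_ids if i not in test_good]
                     + neutral_ids)                           # 95 training images
        splits.append((train_set, test_set))
    return splits

# 29 "Bad", 59 "Good", 17 "Neutral" image IDs (105 = 29 + 59 + 17)
splits = repeated_random_subsampling(
    [f"bad_{i}" for i in range(1, 30)],
    [f"good_{i}" for i in range(1, 60)],
    [f"neutral_{i}" for i in range(1, 18)])
```

Each trial then trains the CNN on `train_set` and measures accuracy on `test_set`.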
As Table 1 shows, the images were augmented with random left-right and up-down
reflections during training to avoid overfitting. Figure 9 shows the loss and accuracy
changes for all 20 trials. Since the loss and accuracy improved as the iterations proceeded,
the training went well. Figure 10 shows the accuracy in each trial and the confusion matrix
for all 20 trials.
Method                Value
Solver                SGDM (Stochastic Gradient Descent with Momentum)
Learn Rate            10^-4
Max Epochs            150
Mini Batch Size       32
Total Iterations      300
Augmentation          Left-Right and Top-Down Reflection
CPU                   Intel Core i9 10980XE
Main Memory           98 GB
OS                    Windows 10 64bit
Development Language  MathWorks MATLAB (R2022a)
GPU                   Nvidia RTX A6000 (VRAM 48 GB, 10752 CUDA cores)
Trial  Test “Bad” image IDs (Bad 1–29)  Test “Good” image IDs (Good 1–59)  Accuracy
1 4 9 12 24 26 21 47 49 53 58 0.8000
2 1 9 13 26 29 10 16 23 24 38 0.9000
3 9 18 23 27 28 10 15 36 40 49 0.9000
4 1 7 11 16 23 13 14 19 47 57 0.8000
5 6 14 17 27 29 2 6 23 36 40 0.7000
6 2 4 10 17 18 1 5 24 51 59 0.9000
7 7 9 17 18 28 5 18 24 38 41 1
8 7 15 19 23 27 13 18 24 41 46 0.6000
9 3 7 13 23 29 10 30 38 58 59 0.9000
10 2 14 20 27 28 33 35 42 50 56 0.7000
11 9 16 18 22 23 1 17 23 45 47 0.9000
12 10 17 20 24 29 11 18 28 38 49 0.9000
13 14 15 17 26 28 41 45 54 56 57 0.8000
14 4 6 10 25 26 4 24 38 52 56 1
15 6 16 18 27 29 25 29 32 46 51 0.8000
16 14 17 18 21 28 1 24 39 53 56 0.8000
17 3 16 22 24 27 3 21 36 52 55 0.8000
18 1 4 7 24 29 7 10 17 40 58 0.8000
19 6 12 21 23 26 19 26 34 37 49 0.9000
20 6 7 8 13 14 6 15 17 29 58 0.7000
- - - Average 0.8300
The mean accuracy was 83% overall. Therefore, the proposed CNN captured the
crack severity features. However, 22 “Bad” images were misjudged as “Good,” as
Fig. 10(b) shows. In the future, we will experiment not only with the Resnet18-CNN but
also with various high-performance CNNs such as VGG [7], Googlenet [8], Densenet
[10], and Efficientnet [11].
4 Conclusions
This paper proposed an automatic evaluation method for welding bend test crack
severity to assist human visual inspection. We constructed equipment to photograph
the fragment specimens under stable optical conditions and employed the Resnet18-CNN
to evaluate the given image. The input layer of the Resnet18-CNN was customized from
224-by-224 to 500-by-500 to capture the crack features in detail. The output layer consists
of three nodes expressing the crack severity levels “Bad”, “Good” and “Neutral”. In the
experiment, the mean accuracy was 83%. In the future, we will experiment with other
CNNs and compare their results.
Acknowledgments. The authors would like to thank Ueno in MathWorks for technical advice.
This work was supported by a Grant-in-Aid from JWES (The Japan Welding Engineering Society).
References
1. Asai, S., Ogawa, T., Takebayashi, H.: Visualization and digitation of welder skill for education
and training. Welding in the world 56, 26–34 (2012)
2. Byrd, A.P., Stone, R.T., Anderson, R.G., Woltjer, K.: The use of virtual welding simulators
to evaluate experimental welders. Weld. J. 94(12), 389–395 (2015)
3. Hino, T., et al.: Visualization of gas tungsten arc welding skill using brightness map of backside
weld pool. Trans. Mat. Res. Soc. Japan 44(5), 181–186 (2019)
4. The Japanese Welding Engineering Society. http://www.jwes.or.jp/en/. Accessed 28 Mar 2022
5. Wan, Y., Jiang, W., Li, H.: Cold bending effect on residual stress, microstructure and
mechanical properties of Type 316L stainless steel welded joint. Engineering Failure Analysis
117, 104825 (2020)
6. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document
recognition. Proc. IEEE 86(11), 2278–2324 (1998)
7. Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image
Recognition. arXiv preprint arXiv:1409.1556 (2014)
8. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778
(2016)
10. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely Connected Convolutional
Networks. In: CVPR, 1(2), p. 3 (2017)
11. Tan, M., Le, Q.V.: EfficientNet: rethinking model scaling for convolutional neural
networks. arXiv preprint arXiv:1905.11946 (2019)
12. Park, J.-K., An, W.-H., Kang, D.-J.: Convolutional neural network based surface inspection
system for non-patterned welding defects. Int. J. Precision Eng. Manufacturing 20(3), 363-374
(2019)
13. Dung, C.V., Sekiya, H., Hirano, S., Okatani, T., Miki, C.: A vision-based method for crack
detection in gusset plate welded joints of steel bridges using deep convolutional neural
networks. Automation in Construction 102, 217-229 (2019)
14. Zhang, Z., Wen, G., Chen, S.: Weld image deep learning-based on-line defects detection
using convolutional neural networks for Al alloy in robotic arc welding. J. Manuf. Process.
45, 208–216 (2019)
15. Dai, W., et al.: Deep learning assisted vision inspection of resistance spot welds. J. Manuf.
Process. 62, 262–274 (2021)
16. Abdelkader, R., Ramou, N., Khorchef, M., Chetih, N., Boutiche, Y.: Segmentation of x-ray
image for welding defects detection using an improved Chan-Vese model. Materials Today:
Proceedings 42(5), 2963–2967 (2021)
17. Zhu, H., Ge, W., Liu, Z.: Deep learning-based classification of weld surface defects. Appl.
Sci. 9(16), 3312 (2019)
18. The Japanese Industrial Standards Committee. https://www.jisc.go.jp/eng/index.html.
Accessed 29 Mar 2022
19. Priddy, K.L., Keller, P.E.: Artificial Neural Networks - An Introduction, Chapter 11, pp. 101–
105. Dealing with Limited Amounts of Data. SPIE Press, Bellingham, WA, USA (2005)
20. Ueda, N., Nakano, R.: Estimating expected error rates of neural network classifiers in small
sample size situations: a comparison of cross-validation and bootstrap. In: Proceedings of
ICNN’95 - International Conference on Neural Networks, 1, pp.101–104 (1995)
CVD: An Improved Approach of Software
Vulnerability Detection for Object
Oriented Programming Languages Using
Deep Learning
1 Introduction
about 4,600 by Common Vulnerabilities and Exposures (CVE) [1]. In 2016 the
number rose to about 6,500 [1]. An automatic detection system can reduce this effort,
so Machine Learning and Deep Learning approaches are being used in this area
to solve the problem. Several works address vulnerability detection using
deep learning, but detection performance is still limited. Manual human labor and high
false-negative rates are also significant drawbacks of existing solutions. Despite the
effort expended in the pursuit of safe programming, software vulnerabilities remain,
and will continue to be, a significant concern. Object-Oriented Programming (OOP)
languages are now popular among all types of developers, and much of the software
we use in our daily lives is written in an OOP language.
There are many popular Object-Oriented Programming (OOP) languages, such as
C-Sharp (C#), Java, Python, Ruby, PHP, C++, VB.Net, and JavaScript. One of the
core objectives of OOP languages is data security through encapsulation. A single line
of vulnerable code can ruin the software’s security, causing various kinds of damage to
the user. For example, Banshee is a well-known cross-platform media player written in
the OOP language C# [36]. It was the default media player in the Linux Mint operating
system for a long time. Versions 1.8.0 and earlier had a vulnerability, CVE-2010-3998,
that allowed any local user to gain privileges using a Trojan horse [33]. Banshee was
later replaced by Rhythmbox, which also had a vulnerability, CVE-2012-3355, in versions
0.13.3 and earlier that allowed any local user to execute arbitrary code via a symlink
attack [31]. So, for the safety of users, vulnerability detection is a crucial task: users
need to know what they are using, and developers need to understand what they are
missing. Moreover, it should be possible to detect a large number of vulnerabilities in
a short time without much manual human labor.
We introduce CVD (Common Vulnerability Detector), a complete deep-learning-based
model to detect vulnerabilities in the source code of OOP-based programs and software.
Specifically, we present a deep learning model able to detect and pinpoint common
vulnerabilities in OOP source code without any manual labor, with high accuracy and
a low false-negative rate. The code analyzers used with deep learning models so far
are mainly based on text classifiers, which are not fully efficient for this particular
purpose. We present a model that works directly with a code analyzer, analyzing code
using a code classifier. The major contributions made in this paper are summarized
below:
• A fully optimized OOP parser is applied to the OOP languages. All OOP languages
have an almost similar structure, so this OOP parser can break any OOP source code
into small pieces and tokenize them so that they can be processed accordingly.
• The Common Vulnerability Detector (CVD) is designed as an intelligent vulnerability
detector that detects common vulnerabilities in any OOP source code with high
accuracy and a low false-negative rate. To do so, we present a fully
CVD - Common Vulnerability Detector 147
The rest of the article is organized as follows: the most related works are reviewed in
Sect. 2. The details of the dataset are given in Sect. 3. Section 4 describes our research;
the deep learning-based models’ methods are explained in Subsect. 4.2, where our
proposed CVD model with convolutional recurrent neural networks is presented. The
performance, results, and accuracies are evaluated in Sect. 5, and Sect. 6 concludes.
2 Related Works
Several experiments have been conducted on source code analysis. They fall into two
main branches: static analysis tools and dynamic machine-learning-based analysis.
Static Application Security Testing (SAST) [20], Attackflow [4], and dotTEST by
Parasoft [3] are examples of static source code vulnerability analysis tools.
Using machine learning algorithms to analyze source code and detect bugs and
vulnerabilities is a relatively new concept in computer science. TAP is a token- and
deep-learning-based source code analysis model for PHP [16]. Its authors designed a
custom tokenizer and a new method to perform data flow analysis. They used deep
learning technology (LSTM) to solve the problem, and TAP became the only machine
learning model dealing with 7 vulnerability categories of PHP. To achieve their goal,
they collected their datasets from SARD, with 42,212 total samples comprising 29,258
good and 12,954 vulnerable samples. That dataset contains 12 categories of CWE
vulnerabilities. After tokenizing all the source codes, they used Word2vec, which
transforms the tokens into a vector space using a neural network. In the main experiment,
they used LSTM over RNN because LSTM is an improved version of RNN that deals
with long-term dependencies, and thus outperforms RNN. They then used the created
vector spaces as the input of the LSTM layer, with the output layer decoding the
vulnerability labels. Though TAP achieved good results, it is not perfect: many targeted
labels from CWE-862 were unspecified in the results, and nearly all targeted labels from
CWE-95 were unidentified. The lack of samples in these categories may be a potential
explanation.
Yongcheng Liu and his team proposed a new model that uses neural networks to embed
vulnerability-related text and derive its implicit meaning [17]. Feature merging, support
for text features, and other implicit features are the significant components of the
proposed vulnerability classification model. To get a more accurate tacit representation
of a vulnerability, lemmatization of the related text, stop-word elimination, and
embedding neural networks are used. The output of the trained model, i.e., the embedded
representation of the text feature produced by the neural network, is used to determine whether the bug
148 S. Siddique et al.
is exploitable or not. Finally, the obtained results are joined with other robust features,
and a two-class ML model is used to estimate the possibility of attackers exploiting the
vulnerabilities. Open-source intelligence data from databases such as NVD,
SecurityFocus, and Exploits are used to extract the FastEmbed system’s characteristics.
Prior work on predicting exploitation has been replicated and used as a benchmark.
The proposed model beats the baseline model in both generalization and prediction
ability without temporal intermixing, by training the vulnerability-related text
classification on highly biased datasets. Moreover, it also beat the baseline model’s
performance for predicting vulnerabilities with a 33.577% improvement [17].
A system named ‘VulDeePecker’, short for Vulnerability Deep Pecker, was built [28].
The first vulnerability dataset for deep learning approaches was presented there.
VulDeePecker was later able to find four new vulnerabilities in three different software
programs that were not listed in the National Vulnerability Database (NVD). To
accomplish this, plain Recurrent Neural Networks (RNN), Bidirectional Recurrent
Neural Networks (BRNN), and Gated Recurrent Units (GRU) were ruled out because
Long Short-Term Memory (LSTM) outperforms all of them; Bidirectional LSTM
(BLSTM) was chosen because a unidirectional network is not sufficient in this case. The
process assumes that the programs’ source code is available and written in C/C++. To
train the BLSTM, the library/API calls and related code slices are first extracted and
turned into code gadgets, groups of code lines or instructions that are semantically
related to one another. The code gadgets are then labeled 1 or 0 to indicate whether they
contain a vulnerability. After that, the code gadgets are transformed into symbolic
representations and converted into vectors, which serve as the input of the BLSTM
neural network. The program was tested with 19 well-known C/C++ products; after
testing, VulDeePecker detected 4 new vulnerabilities in 3 different known products that
were not listed in the NVD. There are some limitations, too: it can only analyze C/C++
source code, the implementation is tied to the BLSTM (a better approach may be
possible), and the dataset only contains buffer error and resource management error
vulnerabilities.
A group of Chinese researchers proposed a new automatic vulnerability classification
model named TFI-DNN [24]. This model combines Term Frequency-Inverse Document
Frequency (TF-IDF), Information Gain (IG), and a Deep Neural Network (DNN).
Applied to a dataset from NVD, it achieved an accuracy of 87%, with a recall of 0.82,
a precision of 0.85, and an F1-score of 0.81. Another research team presented three deep
learning models for software vulnerability detection and compared their performance
with a traditional model. The three models are a Convolutional Neural Network (CNN),
Long Short-Term Memory (LSTM), and a combination of both named CNN-LSTM.
After implementation, they compared these with the traditional Multi-Layer Perceptron
(MLP) method [35]. The prediction accuracy of their proposed method is 83.6%, which
outperformed the MLP. A study was performed by Chakraborty et al. in which they
used CNN+RF, BLSTM, BGRU, and
For each of the six CWEs, there are also safe source code samples. These invulnerable
source codes are labeled with a good target class, so overall we have seven unique target
classes to predict from source code. Each vulnerable source code contains a different
type of flaw.
Two examples of source code files are shown here. Listing 1.1 is a vulnerable source
file: in line 5 there is no filtering, but direct use of the input variable in an SQL query.
It is possible to inject SQL queries here, which is labeled as CWE-89 (SQL Injection).
Along with the vulnerable code samples, there are also good code samples for the
corresponding CWE. Listing 1.2 shows a sample of safe code for CWE-89. Before the
SQL query is applied, the input variable is checked against a digits-only pattern; if it
does not match, the tainted value is replaced with an empty string so that no characters
capable of forming an SQL injection can reach the query. This preprocessing is added
in code lines 6 to 13.
Listing 1.2. Safe source code sample
1  public static void Main(string[] args) {
2      string tainted_2 = null;
3      string tainted_3 = null;
4      tainted_2 = Console.ReadLine();
5      tainted_3 = tainted_2;
6      string pattern = @"/^[0-9]*$/";
7      Regex r = new Regex(pattern);
8      Match m = r.Match(tainted_2);
9      if (!m.Success) {
10         tainted_3 = "";
11     } else {
12         tainted_3 = tainted_2;
13     }
14     string query = "SELECT * FROM '" + tainted_3 + "'";
15     string connectionString = "Server=localhost; port=1337; User Id=postgre_user; Password=postgre_password; Database=dbname";
16     NpgsqlConnection dbConnection = null;
17     try {
18         dbConnection = new NpgsqlConnection(connectionString);
19         dbConnection.Open();
20         NpgsqlCommand cmd = new NpgsqlCommand(query, dbConnection);
21         NpgsqlDataReader dr = cmd.ExecuteReader();
22         while (dr.Read()) {
23             Console.Write("{0}\n", dr[0]);
24         }
25         dbConnection.Close();
26     } catch (Exception e) {
27         Console.WriteLine(e.ToString());
28     }
29 }
The simpler the parser, the more generally it supports different types of (object-oriented
structured) programming languages. In this step, all the source codes are cleaned and
comments are removed. We designed a custom source code parser to analyze the object
definitions of the source code files. We try to identify the structural patterns of the source
code files to make a simple parser; each source code file is parsed as a single test case,
and its CWE id is the targeted class of that sample.
All of the source code files share the same structure, as defined in Fig. 1. Our
object-oriented model parser unifies the tokens to extract features from the dataset.
For the method-implementation section of the figure, the parser parses word by word
and removes all special characters (code syntax). The dataset is treated as natural
language so that the source codes can be analyzed with deep learning algorithms. Using
regular expressions, the OOP parser is kept simple enough to support OOP-based source
code datasets of different languages.
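A minimal sketch of such a regex-based tokenizer follows. The comment and token patterns are illustrative assumptions; the authors’ exact parser rules are not given in the paper.

```python
import re

# Hypothetical patterns: C-style comments are stripped first, then word-like
# tokens (identifiers, keywords, numbers) are kept while code syntax
# (special characters) is dropped.
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+")

def parse_source(code):
    """Clean an OOP source file and break it into word tokens."""
    code = COMMENT_RE.sub(" ", code)   # remove comments
    return TOKEN_RE.findall(code)      # keep words, drop special characters

sample = """
// read user input
string tainted_2 = Console.ReadLine(); /* unsafe */
"""
tokens = parse_source(sample)
```

Here `tokens` becomes `["string", "tainted_2", "Console", "ReadLine"]`; each tokenized file is one sample with its CWE id as the target class.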
4 Methods
Different types of classification algorithms are applied to find the most suitable
algorithm for the source code dataset. Depending on the algorithm structure, the samples
are trained and tested on both static classification prediction algorithms and deep
learning-based algorithms.
size is equal to the maximum length of each sequence. The dataset now takes the form
of sequences for multiclass classification. Text classification algorithms always place
extra emphasis on sequence order, but a bug in code can be anywhere, so classification
cannot rely on sequence order alone. We propose an optimized convolutional recurrent
neural network model for source code analysis and name the overall system the Common
Vulnerability Detector (CVD). We also applied four popular deep learning models, Long
Short-Term Memory (LSTM) [22], Gated Recurrent Unit (GRU), Recurrent Neural
Network (RNN) for text classification, and Convolutional Neural Network (CNN), to
compare against our CVD model’s source code analyzer. All five models are shaped
with equal training parameters and similarly shaped hidden layers for simplicity of
performance evaluation.
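The fixed-size sequence inputs mentioned above (padded to the maximum sequence length) can be sketched as follows; padding with 0 on the right is an assumption for illustration.

```python
def pad_sequences(seqs, pad_value=0):
    """Right-pad token-ID sequences to the maximum sequence length so they
    can be fed to the models with a fixed input size."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_value] * (max_len - len(s)) for s in seqs]

padded = pad_sequences([[5, 3, 9], [7], [2, 8]])
```

Here every sequence is extended to length 3, the longest sequence in the batch.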
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \frac{e^{2x} - 1}{e^{2x} + 1} \qquad (2)
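A quick numerical check that the two equivalent forms of Eq. (2) agree with the standard tanh:

```python
import math

def tanh_form1(x):
    # First form of Eq. (2)
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_form2(x):
    # Second form of Eq. (2), obtained by multiplying through by e^x
    return (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)

for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    assert abs(tanh_form1(x) - tanh_form2(x)) < 1e-12
    assert abs(tanh_form1(x) - math.tanh(x)) < 1e-12
```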
Later, these two parts are combined to create an update to the cell state, after which
the output is produced. The output is based on the cell state, but in a filtered version:
a sigmoid layer decides which parts will be output, the cell state is put through a tanh
layer, and the two are multiplied to obtain the decided output.
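A minimal numpy sketch of the single LSTM step described above. The stacked gate layout, dimensions, and random weights are illustrative assumptions, not the paper’s trained configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step: forget/input gates update the cell state, and the
    output gate emits a filtered tanh of the new cell state."""
    z = W @ x + U @ h_prev + b                 # all four gate pre-activations
    f, i, o, g = np.split(z, 4)                # forget, input, output, candidate
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)   # cell-state update
    h = sigmoid(o) * np.tanh(c)                         # filtered output
    return h, c

rng = np.random.default_rng(1)
n_in, n_hidden = 4, 3
W = rng.normal(size=(4 * n_hidden, n_in))
U = rng.normal(size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hidden), np.zeros(n_hidden), W, U, b)
```

Because `h` is a sigmoid-gated tanh, every component of the output lies strictly between −1 and 1.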
function is used to activate the Dense layer. Another Dense layer is used for the output
of the model, with a softmax activation function. Overall, inputs are given to the input
layer and passed through all the layers of the model in a sequential architecture.
According to the position and meaning of the words, the inputs are parsed and tokenized;
the tokens then pass through all the layers and finish at the output layer. Thus, the RNN
is used as an effective text classifier.
\mathrm{Input} = \begin{bmatrix} a & b & c & d \\ e & f & g & h \\ i & j & k & l \end{bmatrix}, \qquad
\mathrm{Kernel} = \begin{bmatrix} w & x \\ y & z \end{bmatrix}

\mathrm{Filtered\ output} = \begin{bmatrix}
aw + bx + ey + fz & bw + cx + fy + gz & cw + dx + gy + hz \\
ew + fx + iy + jz & fw + gx + jy + kz & gw + hx + ky + lz
\end{bmatrix}
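The symbolic example can be checked numerically with a small “valid” 2-D cross-correlation; the concrete numbers below are illustrative stand-ins for a..l and w, x, y, z.

```python
import numpy as np

def conv2d_valid(inp, kernel):
    """'Valid' 2-D cross-correlation: slide the kernel over the input and
    sum the element-wise products, as in the symbolic example above."""
    H, W = inp.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(inp[r:r + kh, c:c + kw] * kernel)
    return out

inp = np.arange(12, dtype=float).reshape(3, 4)   # plays the role of a..l
kernel = np.array([[1.0, 2.0], [3.0, 4.0]])      # plays the role of w, x, y, z
out = conv2d_valid(inp, kernel)
# Top-left entry: a*w + b*x + e*y + f*z = 0*1 + 1*2 + 4*3 + 5*4 = 34
```

A 3×4 input and 2×2 kernel produce the 2×3 filtered output shown symbolically above.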
The output is downsampled by taking the maximum value within each window using
a MaxPooling layer. Once we have the super-feature matrix (after the CNN is applied),
we reshape it and apply a SimpleRNN with 32 units to maintain the sequence order.
This composite structure distinguishes our model: the CNN can identify the super
features (the exact bug), as shown in Fig. 5, while the RNN maintains the sequence
order. Another dense layer generates the output layer of predicted classes, with size
equal to the number of targeted classes. Finally, the output is passed through the softmax
activation function to make it smooth and normalized.
\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j} \exp(x_j)} \qquad (3)
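Equation (3) in code, with the usual max-shift for numerical stability (an implementation detail not stated in the paper):

```python
import numpy as np

def softmax(x):
    """Softmax of Eq. (3); subtracting max(x) leaves the result unchanged
    but avoids overflow in exp for large logits."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
```

The result sums to 1, and the largest logit keeps the largest probability.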
In the loss function of Eq. 4, y is the actual label and ŷ is the predicted label of the
output. To minimize the loss, we use Adam [25] as the optimization algorithm in each
epoch.
The Hamming loss [9] is calculated for the test data samples. The Jaccard score,
introduced by Paul Jaccard, measures the overlap between the actual and the predicted
data [8]. The Cohen kappa score [13] is also evaluated to measure our designed model’s
inter-rater reliability in the testing phase.
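Minimal reference implementations of the three scores, assuming single-label multiclass predictions; the paper’s exact evaluation setup may differ.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of samples whose predicted label is wrong."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def jaccard_score(y_true, y_pred, positive):
    """|intersection| / |union| of the sample sets labeled/predicted as `positive`."""
    a = {i for i, t in enumerate(y_true) if t == positive}
    b = {i for i, p in enumerate(y_pred) if p == positive}
    return len(a & b) / len(a | b) if a | b else 1.0

def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n          # observed
    p_e = sum((y_true.count(l) / n) * (y_pred.count(l) / n)        # expected
              for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels mixing CWE classes and the "good" class
y_true = ["CWE-89", "CWE-89", "good", "CWE-22", "good"]
y_pred = ["CWE-89", "good",   "good", "CWE-22", "good"]
```

With one mismatch out of five samples, the Hamming loss here is 0.2.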
Table 3 compares the accuracy, loss, AUC, and cosine similarity on the training
samples. From Table 3 we can see that GRU has the worst accuracy, 78.23%, and CVD
has the best vulnerability detection performance, 96.64%. Among the other algorithms,
CNN comes closest. We see the same pattern in the loss: CVD has the lowest loss, 0.067,
and GRU has the highest, 0.452; CNN, with a loss of 0.086, is nearest to our designed
CVD model. The same holds for the AUC and cosine similarity: CVD is the best among
all of them.
The F1 score of CVD is about 96.40%, which summarizes our model’s overall
performance; the nearest F1 score, 92.60%, is obtained by the LSTM model. According
to Cohen’s kappa reliability score, CNN achieves 91.80%, surpassed by CVD at about
93.20%. The recall of CVD is about 97.30%; a high recall means that the algorithm
returns most of the relevant results.
rates indicate the risks of using our model in practice: the maximum true-negative (TN)
rate is about 5.6% for both CWE-22 and CWE-89. The false-positive rate covers samples
that do not contain bugs but are marked as vulnerable by our system; the maximum
false-positive (FP) rate is 1.20% for CWE-22.
References
1. Common Vulnerabilities Exposures (CVE) (2017). https://cve.mitre.org. Accessed
18 Oct 2020
2. Common Weakness Enumeration (CWE) (2017). https://cve.mitre.org. Accessed
18 Oct 2020
3. Efficiently Achieve Compliance With C# Testing Tools for.NET Development
(2020). https://www.parasoft.com/products/parasoft-dottest. Accessed 18 Oct
2020
4. Identify all vulnerabilities in your source code (2020). https://www.parasoft.com/
products/parasoft-dottest. Accessed 18 Oct 2020
5. Bengio, Y., LeCun, Y., Henderson, D.: Globally trained handwritten word rec-
ognizer using spatial representation, convolutional neural networks, and hidden
Markov models. In: Advances in Neural Information Processing Systems, pp. 937–
944 (1994)
6. Black, P.E.: A software assurance reference dataset: thousands of programs with
known bugs. J. Res. Nat. Instit. Stand. Technol. 123, 1 (2018)
7. Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Pro-
ceedings of COMPSTAT 2010, pp. 177–186. Springer (2010). https://doi.org/10.
1007/978-3-7908-2604-3 16
8. Bouchard, M., Jousselme, A.-L., Doré, P.-E.: A proof for the positive definiteness
of the Jaccard index matrix. Int. J. Approximate Reason. 54(5), 615–626 (2013)
9. Butucea, C., Ndaoud, M., Stepanova, N.A., Tsybakov, A.B., et al.: Variable selec-
tion with hamming loss. Ann. Stat. 46(5), 1837–1875 (2018)
10. Chakraborty, S., Krishna, R., Ding, Y., Ray, B.: Deep learning based vulnerability
detection: are we there yet. IEEE Trans. Softw. Eng. 1 (2021)
11. Chernis, B., Verma, R.: Machine learning methods for software vulnerability detec-
tion. In: Proceedings of the Fourth ACM International Workshop on Security and
Privacy Analytics, pp. 31–39 (2018)
12. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recur-
rent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
13. Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Measur.
20(1), 37–46 (1960)
14. Conneau, A., Schwenk, H., Barrault, L., Lecun, Y.: Very deep convolutional net-
works for natural language processing. arXiv preprint arXiv:1606.01781, 2 (2016)
15. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297
(1995)
16. Fang, Y., Han, S., Huang, C., Runpu, W.: TAP: a static analysis model for PHP
vulnerabilities based on token and deep learning technology. PLoS ONE 14(11),
e0225196 (2019)
17. Fang, Y., Liu, Y., Huang, C., Liu, L.: FastEmbed: predicting vulnerability exploita-
tion possibility based on ensemble machine learning algorithm. PLoS ONE 15(2),
e0228439 (2020)
18. Friedl, M.A., Brodley, C.E.: Decision tree classification of land cover from remotely
sensed data. Remote Sens. Environ. 61(3), 399–409 (1997)
19. Fukunaga, K., Narendra, P.M.: A branch and bound algorithm for computing k-
nearest neighbors. IEEE Trans. Comput. C-24(7), 750–753 (1975)
20. Guaman, D., Sarmiento, P.A., Barba-Guamán, L., Cabrera, P., Enciso, L.: Sonar-
qube as a tool to identify software metrics and technical debt in the source code
through static analysis. In: 7th International Workshop on Computer Science and
Engineering, WCSE, pp. 171–175 (2017)
21. Ho, T.K.: Random decision forests. In: Proceedings of 3rd International Conference
on Document Analysis and Recognition, vol. 1, pp. 278–282. IEEE (1995)
22. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8),
1735–1780 (1997)
23. Hosmer Jr, D.W., Lemeshow, S., Sturdivant, R.X.: Applied Logistic Regression,
vol. 398. John Wiley & Sons (2013)
24. Huang, G., Li, Y., Wang, Q., Ren, J., Cheng, Y., Zhao, X.: Automatic classification
method for software vulnerability based on deep neural network. IEEE Access 7,
28291–28298 (2019)
25. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
26. Le, T.H.M., Chen, H., Babar, M.A.: Deep learning for source code modeling and
generation: models, applications, and challenges. ACM Comput. Surveys (CSUR)
53(3), 1–38 (2020)
27. LeCun, Y.: Deep learning & convolutional networks. In: 2015 IEEE Hot Chips 27
Symposium (HCS), pp. 1–95. IEEE Computer Society (2015)
28. Li, Z., et al.: VulDeePecker: a deep learning-based system for vulnerability detec-
tion. arXiv preprint arXiv:1801.01681 (2018)
29. Lin, G., Wen, S., Han, Q.-L., Zhang, J., Xiang, Y.: Software vulnerability detection
using deep neural networks: a survey. Proc. IEEE 108(10), 1825–1848 (2020)
30. Loper, E., Bird, S.: NLTK: the natural language toolkit. arXiv preprint cs/0205028,
cs.CL/0205028 (2002)
31. Manadhata, P.K., Wing, J.M.: An attack surface metric. IEEE Trans. Softw. Eng.
37(3), 371–386 (2010)
32. Pendleton, M., Garcia-Lebron, R., Cho, J.-H., Shouhuai, X.: A survey on systems
security metrics. ACM Comput. Surv. (CSUR) 49(4), 1–35 (2016)
33. Sharma, V.: An analytical survey of recent worm attacks. Int. J. Comput. Sci.
Network Secur. (IJCSNS) 11(11), 99–103 (2011)
34. Siddique, S., Ahmed, T., Talukder, M.R.A., Uddin, M.M.: English to Bangla
machine translation using recurrent neural network. Int. J. Future Comput. Com-
mun. 9(2) (2020)
35. Wu, F., Wang, J., Liu, J., Wang, W.: Vulnerability detection with deep learning.
In: 2017 3rd IEEE International Conference on Computer and Communications
(ICCC), pp. 1298–1302. IEEE (2017)
36. Xinogalos, S.: Studying students’ conceptual grasp of OOP concepts in two interac-
tive programming environments. In: Lytras, M.D., et al. (eds.) WSKS 2008. CCIS,
vol. 19, pp. 578–585. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-
540-87783-7 73
37. Zagane, M., Abdi, M.K., Alenezi, M.: Deep learning for software vulnerabilities
detection using code metrics. IEEE Access 8, 74562–74570 (2020)
A Survey of Reinforcement Learning
Toolkits for Gaming: Applications,
Challenges and Trends
Abstract. The gaming industry has become one of the most exciting
and creative industries. The annual revenue has crossed $200 billion in
recent years and has created many jobs globally. Many games use
Artificial Intelligence (AI), and techniques such as Machine Learning (ML)
and Reinforcement Learning (RL) have gained popularity among researchers and
the game development community as a way to enable smart games involving AI-based
agents at a faster rate. Although many toolkits are available for use,
a framework to evaluate, compare and advise on these toolkits is still
missing. In this paper, we present a comprehensive overview of ML/RL
toolkits for games with an emphasis on their applications, challenges,
and trends. We propose a qualitative evaluation methodology, discuss
the obtained analysis results, and conclude with future work and per-
spectives.
1 Introduction
Computer gaming has positioned itself as an important source of audiovisual
education and entertainment due to its dynamism and accessibility, in addition
to increasing the imagination of players. It is a fast-growing market: global
revenue grew by 8.7% between 2019 and 2021, and is projected to reach $218.7 billion
in 2024 [38]. Many games have multiple non-player characters (NPCs) which
play with the player, against them or take a neutral position within the game.
NPCs play an essential part in video games, enhancing the player experience, and
should therefore be given fitting behavior by creating a suitable Artificial
Intelligence (AI) for them [57]. They can take multiple roles, such as providing a
challenge for the player to fight against or representing a trusted ally with whom
the player has fought many battles [62]. It is therefore important that the field of game
design and development finds new ways to build their intelligence and let them
play their role inside the game [64].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 165–184, 2023.
https://doi.org/10.1007/978-3-031-18461-1_11
166 C. S. Jayaramireddy et al.
Tasks are usually described in terms of how ML should process a data item
(i.e. an example). If the desired behavior is to assign the input data item to one
category among several, this is a classification task, e.g. object recognition. Other
examples of tasks are machine translation, transcription, anomaly detection, etc.
[11,33].
R0 = Eπ [ Σt=0..∞ γ^t rt ]

where γ is a discount factor, and the expectation is over all the possible runs
(or traces) allowed by the policy π. R0 is the total reward for t = 0. Among
the most popular algorithms to reach an optimal policy in this context are value
iteration and Q-learning.
Value iteration assumes that the reward model and the transition model are
known a priori or passively learned [32]. The reinforcement learning problem is
then reduced to a Markovian Decision Process (MDP) that can be represented
using Bellman equations. Policy iteration builds on similar ideas but exploits the fact
that convergence to the optimal policy typically happens long before the convergence of
the utility values.
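The Bellman update behind value iteration can be sketched in a few lines. The three-state MDP below (states, transitions, and rewards) is invented for illustration; it assumes the reward and transition models are known, as the text describes:

```python
# Value iteration on a toy 3-state MDP with known transition and reward models.
# States: 0, 1, 2 (state 2 is terminal); actions: 0 = "stay", 1 = "advance".
GAMMA = 0.9

transitions = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 2}}   # transitions[s][a] -> next state
rewards     = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}

def value_iteration(tol=1e-8):
    U = {0: 0.0, 1: 0.0, 2: 0.0}        # the terminal state's utility stays 0
    while True:
        delta = 0.0
        for s in transitions:
            # Bellman update: best immediate reward plus discounted successor utility
            best = max(rewards[s][a] + GAMMA * U[transitions[s][a]]
                       for a in transitions[s])
            delta = max(delta, abs(best - U[s]))
            U[s] = best
        if delta < tol:
            return U

U = value_iteration()
# Converges to U(1) = 1.0 (advance to the terminal reward) and U(0) = 0.9 * U(1) = 0.9
```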
Q-learning is called off-policy learning and actively learns a utility function
for (State, Action) pairs by alternating exploitation and exploration actions [51]:

Q(s, a) ← Q(s, a) + α [ r + γ max_a′ Q(s′, a′) − Q(s, a) ]
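A tabular sketch of this exploration/exploitation loop; the four-state chain environment, the learning rate, and the epsilon-greedy schedule are illustrative assumptions, not drawn from the survey:

```python
import random

# Tabular Q-learning on a toy 4-state chain: action 1 moves right, action 0 moves
# left; reaching state 3 yields reward 1 and ends the episode.
N_STATES, ACTIONS = 4, (0, 1)
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.2

def env_step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

random.seed(0)
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
for _ in range(500):                              # episodes
    s = 0
    done = False
    while not done:
        # Exploration with probability EPSILON, exploitation otherwise.
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = env_step(s, a)
        target = r if done else r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])  # Q-learning update
        s = s2
# Q approaches Q*(2, right) = 1.0, Q*(1, right) = 0.9, Q*(0, right) = 0.81
```

Unlike value iteration, no transition or reward model is consulted: each update uses only an observed (s, a, r, s′) sample, which is what makes Q-learning model-free.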
By combining these ideas from reinforcement learning with the recently revived
neural networks, a new set of algorithms emerged, dubbed DRL.
One of the seed contributions in this area is value learning. In [36] a Con-
volutional Neural Network (CNN) was trained to play Atari with a variant of
Q-learning. The CNN approximates the utility function of Q-learning based on
raw pixels for input and an estimation of future reward as output. The loss
function for value learning is

L = E [ (Qreal(s, a) − Qpredicted(s, a))^2 ]

where Qpredicted(s, a) is the output of the neural network and Qreal(s, a) is the
actual Q value:

Qreal(s, a) = r + γ max_a′ Q(s′, a′)
Domain Knowledge

DRL                Go  Chess  Shogi  Atari  Human Play  Domain Knowledge  Known Rules
AlphaGo [52]       ✓   –      –      –      ✓           ✓                 ✓
AlphaGo Zero [54]  ✓   –      –      –      –           –                 ✓
AlphaZero [53]     ✓   ✓      ✓      –      –           –                 ✓
MuZero [47]        ✓   ✓      ✓      ✓      –           –                 –
4.2 OpenAI
OpenAI is a research lab whose mission is to ensure that artificial general intel-
ligence benefits all of humanity [4]. OpenAI provides various tools to support
applications of RL and ML in scientific research and game design and develop-
ment.
OpenAI Safety Gym. Safety Gym is a suite of environments and tools for
RL agents with safety constraints implemented while training. Safety is often not
a primary focus when training RL agents, but in certain applications it is an
important concern that must be taken into account. To address the safety challenges of
training RL agents and to accelerate safe exploration research, OpenAI
introduced Safety Gym. It consists of two components:
– An environment builder for creating a new environment by choosing from a
wide range of physics elements, goals and safety requirements.
– A suite of pre-configured benchmark environments to choose from.
Safety Gym uses the OpenAI Gym for instantiating and interfacing with
the RL environments and MuJoCo physics simulator to construct and forward-
simulate each environment [43].
OpenAI Gym Retro. OpenAI Gym Retro enables the conversion of classic
retro games into OpenAI Gym compatible environments and has integration for
around 1000 games. The emulators used in OpenAI Gym Retro support the Libretro
API, which makes it possible to create games and support various emulators [39]. It is
useful primarily as a means to train RL agents on classic video games, though it can
also be used to control those video games through Python scripts.
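Both tools expose environments through the classic Gym interface (reset/step). The loop below sketches that interaction with a toy stand-in class, since an installed Gym or Retro game cannot be assumed here; with the real libraries the environment would come from gym.make(...) or retro.make(game=...) instead:

```python
import random

class ToyEnv:
    """Stand-in with the classic Gym interface: reset() -> obs and
    step(action) -> (obs, reward, done, info)."""
    GOAL = 5
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):                      # action: 0 = left, 1 = right
        self.pos = min(self.GOAL, self.pos + 1) if action == 1 else max(0, self.pos - 1)
        done = (self.pos == self.GOAL)
        return self.pos, (1.0 if done else 0.0), done, {}

# With the real libraries this would be gym.make("CartPole-v1") or
# retro.make(game="..."); the interaction loop itself is identical.
env = ToyEnv()
random.seed(1)
obs, total_reward, done = env.reset(), 0.0, False
while not done:
    action = random.choice((0, 1))               # random policy for illustration
    obs, reward, done, info = env.step(action)
    total_reward += reward
```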
5 Evaluation Methodology
In this study, we propose a qualitative evaluation methodology that uses a set of
eleven specific technical criteria (See the following subsections). Each candidate
ML/RL toolkit introduced in Sect. 4 is evaluated based on the following qualita-
tive data collection techniques: (1) interviews with game design and development
experts; (2) technical experimentation and observations; and (3) documentation,
including scientific publications and technical reports. Moreover, the outcomes of
the common ML, RL, and DRL algorithms implemented in the identified toolkits are
summarized in Table 2.
5.1 Portability
5.2 Interoperability
5.3 Performance
Instead of training independent models for each task, we allow a single model to learn
to complete all of the tasks at once. In this process, the model uses all of the
available data across the different tasks to learn generalized representations of
the data that are useful in multiple contexts.
5.6 Usability
Usability is a measure of how well a specific user in a specific context can use
a ML toolkit to design and develop games effectively, efficiently and satisfacto-
rily. Game designers usually measure a toolkit design’s usability throughout the
development process, from wireframes to the final deliverable, to ensure maximum
usability.
The learning strategies are the different techniques ML/RL toolkits and frame-
works use to train the agents in game design and development. These strategies
are translated through machine learning algorithms including:
– Naïve Bayes Classifier Algorithm (Supervised Learning - Classification) is based
on Bayes’ theorem and treats every feature as independent of the others. It makes it
possible to predict a class/category for a given set of features using
probability.
– K Means Clustering Algorithm (Unsupervised Learning - Clustering) is a type
of unsupervised learning, which is used to categorise unlabelled data, i.e. data
without defined categories or groups. The algorithm works by finding groups
within the data, with the number of groups represented by the variable K. It
then works iteratively to assign each data point to one of K groups based on
the features provided.
– Support Vector Machine Algorithm (Supervised Learning - Classification)
analyses data used for classification and regression analysis. It essentially
filters data into categories, which is achieved by providing a set of training
examples, each set marked as belonging to one or the other of the two cate-
gories. This algorithm then works to build a model that assigns new values
to one category or the other.
– Linear Regression (Supervised Learning/Regression) is the most basic type of
regression. Simple linear regression allows us to understand the relationships
between two continuous variables.
– Logistic Regression (Supervised learning - Classification) focuses on estimat-
ing the probability of an event occurring based on the previous data provided.
It is used to model a binary dependent variable, that is, one where only two values,
0 and 1, represent the outcomes.
– Artificial Neural Networks (Reinforcement Learning) comprises ‘units’
arranged in a series of layers, each of which connects to layers on either side.
ANNs are essentially a large number of interconnected processing elements,
working in unison to solve specific problems.
– Random Forests (Supervised Learning - Classification/Regression) is an
ensemble learning method, combining multiple algorithms to generate better
results for classification, regression and other tasks. Each individual classifier
is weak, but when combined with others, can produce excellent results. The
algorithm starts with a ‘decision tree’ (a tree-like graph or model of decisions)
and an input is entered at the top. It then travels down the tree, with data
being segmented into smaller and smaller sets, based on specific variables.
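As a concrete instance of one strategy from the list above, the assign-then-update loop of K-Means can be sketched in a few lines; the one-dimensional points and K = 2 are invented for illustration:

```python
# Minimal K-Means on 1-D points: assign each point to its nearest centroid,
# then move each centroid to the mean of its assigned points, and repeat.
points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centroids = [0.0, 10.0]                          # K = 2 initial guesses

for _ in range(10):                              # a few iterations suffice here
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    centroids = [sum(c) / len(c) if c else centroids[i]
                 for i, c in enumerate(clusters)]
# The centroids settle at roughly 1.0 and 8.0, one per group of points.
```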
6 Evaluation Analysis
Table 3 illustrates the outcomes of the proposed qualitative evaluation analysis
with respect to the technical criteria detailed in Sect. 5. It is important to note
that OpenAI is an open-source platform and Unity is a commercial platform.
Nevertheless, Unity offers its ML-Agents as an open-source toolkit. With respect to
the proposed set of eleven technical criteria, it is clear that the Unity ML-Agents
toolkit provides full support for most of these criteria, with some limitations with
regard to Multitask Learning and Learning Strategies. On the other hand, OpenAI,
including its various tools, Petting Zoo, and Google Dopamine suffer from
a critical lack of Visual Observations support. Moreover, OpenAI and its tools
fail to fully support Multi-Agent Environments.
OpenAI Gym and Unity ML-Agents underlying software architectures are
very similar and both provide comparable functionalities to game developers. In
the scientific community, OpenAI is more popular than Unity ML-Agents,
as it was developed with the intent of developing, analysing and comparing
reinforcement learning algorithms, whereas Unity’s main purpose is to develop
and produce enterprise-level games. OpenAI Gym and Unity ML-Agents have
been widely used to implement RL algorithms in recent years. OpenAI
Gym does not restrict itself to gaming, and has been used in various domains
such as telecommunications, optical networks and other engineering fields. Because
of its wide range of options, OpenAI Gym has been used more widely than Unity ML-
Agents to perform research and establish benchmarking results for ML/RL models.
The training of the game agents can be performed both in Gym and Unity.
However, Gym only supports reinforcement learning for training the agents,
whereas with ML-Agents, it is possible to train the games using reinforcement
learning, imitation learning, and curriculum learning. A comparison between
Unity ML-Agents’ PPO and OpenAI Baselines’ PPO2 has shown that the latter
scored 50% higher while training 14% slower. The Actor Critic using Kronecker-
Factored Trust Region (ACKTR) algorithm and the Advantage Actor Critic
(A2C) algorithm of the OpenAI Baselines trained 33% faster than Unity ML-
Agents [12].
Unity has a rich visual platform which is most helpful in building the environ-
ments even with a little programming experience. It has components designed
for each asset and can be easily configured. On the other hand, OpenAI Gym
is compatible with TensorFlow and provides rich graphs. To train more robust
agents that react in real time to dynamic variations of the environment, such as
changes to the objects’ attributes, Unity can randomly sample parameters
of the environment during training (also called Environment Parameter Ran-
domization).
7 Discussion
Unity ML-Agents offers a rich visual interface to create environments and place
assets. Consequently, it offers more usability for game developers. Moreover,
the abundant technical and functional documentation, case studies, tutorials,
and technical support increase the popularity of this platform among the game
design and development community. The OpenAI Gym platform allows users to
compare the performance of their ML/RL algorithms. In fact, the aim of the
OpenAI Gym scoreboards is not to design and develop games, but rather to
foster collaboration in the scientific community by sharing projects and enabling
meaningful ML/RL algorithm benchmarks [16].
On one hand, Unity ML-Agents toolkit allows multiple cameras to be used for
observations per agent. This enables agents to learn to integrate information from
multiple visual streams. This technique leverages CNN to learn from the input
images. The image information from the visual observations that are provided
by the CameraSensor are transformed into a 3D Tensor which can be fed into
the CNN of the agent policy. This allows agents to learn from spatial regularities
and terrain topology in the observation images. In addition, it is possible to use
visual and vector observations with the same agent with Unity ML-Agents. This
powerful feature provides access to vector observations such as raycasting, along with
real-time visualization and parallelization. Such a feature is designed with the intent
of rapid AI agent implementation in video games, not for scientific research. This
hinders its application to more realistic, complex, real-world use cases and
serious games.
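The image-to-tensor step described above can be sketched with NumPy. The 84x84 frame size and the [0, 1] normalization are illustrative assumptions, not the actual internals of Unity's CameraSensor:

```python
import numpy as np

# A fake 84x84 RGB frame standing in for a camera-sensor image (uint8 pixels).
frame = np.random.randint(0, 256, size=(84, 84, 3), dtype=np.uint8)

# Scale pixel values to [0, 1] and add a batch axis, producing the
# (batch, height, width, channels) tensor layout a CNN policy typically consumes.
tensor = frame.astype(np.float32) / 255.0
batch = tensor[np.newaxis, ...]                  # shape (1, 84, 84, 3)
```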
OpenAI Gym lacks the ability to configure the simulation for multiple agents.
In contrast, Unity ML-Agents supports dynamic multi-agent interaction where
agents can be trained using RL models through a straightforward Python API. It
also provides MA-POCA (MultiAgent POsthumous Credit Assignment), which
is a novel multi-agent trainer.
PettingZoo provides a multi-agent policy gradient algorithm where agents
learn a centralized critic based on the observations and actions of all agents.
However, it suffers from a performance limitation when dealing with large-scale
multi-agent environments. In fact, the input space of Q grows linearly with the
number of agents N [30].
Finally, Unity ML-Agents does not support multitask learning. However,
it offers multiple interacting agents with independent reward signals sharing
common Behavior Parameters. This technique offers game developers the ability
to mimic multitask learning by implementing a single agent model and encode
multiple behaviors using HyperNetworks.
References
1. AlphaZero: shedding new light on chess, shogi, and go. https://deepmind.com/
blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go
2. Facebook, Carnegie Mellon build first AI that beats pros in 6-player poker. https://
ai.facebook.com/blog/pluribus-first-ai-to-beat-pros-in-6-player-poker/
3. MIT 6.S191: Introduction to deep learning. https://introtodeeplearning.com/
4. OpenAI
5. OpenAI five defeats dota 2 world champions. https://openai.com/blog/openai-
five-defeats-dota-2-world-champions/
6. Unity machine learning agents
7. Arulkumaran, K., Cully, A., Togelius, J.: AlphaStar: an evolutionary computa-
tion perspective. In: Proceedings of the Genetic and Evolutionary Computation
Conference Companion, pp. 314–315 (2019)
8. Baby, N., Goswami, B.: Implementing artificial intelligence agent within connect 4
using unity3D and machine learning concepts. Int. J. Recent Technol. Eng. 7(6S3),
193–200 (2019)
9. Barth-Maron, G., et al.: Distributed distributional deterministic policy gradients.
arXiv preprint arXiv:1804.08617 (2018)
10. Bellemare, M. G., Dabney, W., Munos, R.: A distributional perspective on rein-
forcement learning. In: International Conference on Machine Learning, pp. 449–
458. PMLR (2017)
11. Bertens, P., Guitart, A., Chen, P. P., Periáñez, Á.: A machine-learning item recom-
mendation system for video games. In: 2018 IEEE Conference on Computational
Intelligence and Games (CIG), pp. 1–4. IEEE (2018)
12. Booth, J., Booth, J.: Marathon environments: multi-agent continuous control
benchmarks in a modern video game engine. arXiv preprint arXiv:1902.09097
(2019)
13. Bornemark, O.: Success factors for e-sport games. In: Umeå’s 16th Student Con-
ference in Computing Science, pp. 1–12 (2013)
14. Borovikov, I., Harder, J., Sadovsky, M., Beirami, A.: Towards interactive training
of non-player characters in video games. arXiv preprint arXiv:1906.00535 (2019)
15. Borowy, M., et al.: Pioneering eSport: the experience economy and the marketing
of early 1980s arcade gaming contests. Int. J. Commun. 7, 21 (2013)
16. Brockman, G., et al.: OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016)
17. Cao, Z., Lin, C. -T.: Reinforcement learning from hierarchical critics. IEEE Trans.
Neural Netw. Learn. Syst. (2021)
18. Castro, P. S., Moitra, S., Gelada, C., Kumar, S., Bellemare, M. G.: A Research
framework for deep reinforcement learning, dopamine (2018)
19. Dabney, W., Ostrovski, G., Silver, D., Munos, R.: Implicit quantile networks for dis-
tributional reinforcement learning. In: International Conference on Machine Learn-
ing, pp. 1096–1105. PMLR (2018)
20. Dhariwal, P., Sidor, S., et al.: OpenAI Baselines (2022)
21. Frank, A.B.: Gaming AI without AI. J. Defense Model. Simul., article
15485129221074352 (2022)
22. Moreno, S. E. G., Montalvo, J. A. C., Palma-Ruiz, J. M.: La industria cultural
y la industria de los videojuegos. JUEGOS Y SOCIEDAD: DESDE LA INTER-
ACCIÓN A LA INMERSIÓN PARA EL CAMBIO SOCIAL, pp. 19–26 (2019)
23. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maxi-
mum entropy deep reinforcement learning with a stochastic actor. In: International
Conference on Machine Learning, pp. 1861–1870. PMLR (2018)
24. Hessel, M., et al.: Rainbow: combining improvements in deep reinforcement learn-
ing. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
25. Ho, J., Ermon, S.: Generative adversarial imitation learning. Adv. Neural Info.
Proc. Syst. 29 (2016)
26. Juliani, A., et al.: Unity: a general platform for intelligent agents. arXiv preprint
arXiv:1809.02627 (2018)
27. Lanham, M.: Learn Unity ML-Agents-Fundamentals of Unity Machine Learning:
Incorporate New Powerful ML Algorithms Such as Deep Reinforcement Learning
for Games. Packt Publishing Ltd., Birmingham (2018)
28. Li, R.: Good Luck Have Fun: The Rise of eSports. Simon and Schuster, New York
(2017)
29. Lillicrap, T. P.: Continuous control with deep reinforcement learning. arXiv
preprint arXiv:1509.02971 (2015)
30. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., Mordatch, I.: Multi-
agent actor-critic for mixed cooperative-competitive environments. arXiv preprint
arXiv:1706.02275 (2017)
31. Lyle, D., et al.: Chess and strategy in the age of artificial intelligence. In: Lai, D.
(eds) US-China Strategic Relations and Competitive Sports, pp. 87–126. Pal-
grave Macmillan, Cham (2022). https://doi.org/10.1007/978-3-030-92200-9 5
32. Mekni, M.: An artificial intelligence based virtual assistant using conversational
agents. J. Softw. Eng. Appl. 14(9), 455–473 (2021)
33. Mekni, M., Jayan, A.: Automated modular invertebrate research environment using
software embedded systems. In: Proceedings of the 2nd International Conference
on Software Engineering and Information Management, pp. 85–90 (2019)
34. Mitchell, T. M., et al.: Machine learning (1997)
35. Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: Inter-
national Conference on Machine Learning, pp. 1928–1937. PMLR (2016)
36. Mnih, V., et al.: Playing atari with deep reinforcement learning. arXiv preprint
arXiv:1312.5602 (2013)
37. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature
518(7540), 529–533 (2015)
38. Newzoo: Global games market report (2021)
39. Nichol, A., Pfau, V., Hesse, C., Klimov, O., Schulman J.: Gotta learn fast: a new
benchmark for generalization in RL. arXiv preprint arXiv:1804.03720 (2018)
40. Nowé, A., Vrancx, P., De Hauwere, Y. M.: Game theory and multi-agent reinforce-
ment learning. In: Wiering, M., van Otterlo, M. (eds) Reinforcement Learning, pp.
441–470. Springer, Berlin (2012). https://doi.org/10.1007/978-3-642-27645-3 14
41. O’Donoghue, B., Munos, R., Kavukcuoglu, K., Mnih, V.: Combining policy gradi-
ent and Q-learning. arXiv preprint arXiv:1611.01626 (2016)
42. Palma-Ruiz, J. M., Torres-Toukoumidis, A., González-Moreno, S. E., Valles-Baca,
H. G.: An overview of the gaming industry across nations: using analytics with
power bi to forecast and identify key influencers, p. e08959. Heliyon (2022)
43. Ray, A., Achiam, J., Amodei, D.: Benchmarking safe exploration in deep reinforce-
ment learning, p. 7. arXiv preprint arXiv:1910.01708 (2019)
44. Saiz-Alvarez, J.M., Palma-Ruiz, J.M., Valles-Baca, H.G., Fierro-Ramı́rez, L.A.:
Knowledge management in the esports industry: sustainability, continuity, and
achievement of competitive results. Sustainability 13(19), 10890 (2021)
45. Samara, F., Ondieki, S., Hossain, A. M., Mekni, M.: Online social network inter-
actions (OSNI): a novel online reputation management solution. In: 2021 Interna-
tional Conference on Engineering and Emerging Technologies (ICEET), pp. 1–6.
IEEE (2021)
46. Scholz, T.M.: eSports is Business. Springer (2019)
47. Schrittwieser, J., et al.: Mastering atari, go, chess and shogi by planning with a
learned model. Nature 588(7839), 604–609 (2020)
48. Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy
optimization. In: International Conference on Machine Learning, pp. 1889–1897.
PMLR (2015)
49. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov O.: Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
50. Shabbir, J., Anwer, T.: Artificial intelligence and its role in near future (2018)
51. Shao, K., Tang, Z., Zhu, Y., Li, N., Zhao, D.: A survey of deep reinforcement
learning in video games. arXiv preprint arXiv:1912.10944 (2019)
52. Silver, D., et al.: Mastering the game of go with deep neural networks and tree
search. Nature 529(7587), 484–489 (2016)
53. Silver, D., et al.: A general reinforcement learning algorithm that masters chess,
shogi, and go through self-play. Science 362(6419), 1140–1144 (2018)
54. Silver, D., et al.: Mastering the game of go without human knowledge. Nature
550(7676), 354–359 (2017)
55. Silver, T., Chitnis, R.: PDDLGym: gym environments from PDDL problems.
arXiv preprint arXiv:2002.06432 (2020)
56. Sweetser, P., Wiles, J.: Current AI in games: a review. Australian J. Intell. Info.
Proc. Syst. 8(1), 24–42 (2002)
57. Tazouti, Y., Boulaknadel, S., Fakhri, Y.: Design and implementation of ImA-
LeG serious game: behavior of non-playable characters (NPC). In: Saeed, F., Al-
Hadhrami, T., Mohammed, E., Al-Sarem, M. (eds.) Advances on Smart and Soft
Computing. AISC, vol. 1399, pp. 69–77. Springer, Singapore (2022). https://doi.
org/10.1007/978-981-16-5559-3 7
58. Terry, J., et al.: PettingZoo: gym for multi-agent reinforcement learning. Adv. Neu-
ral Inf. Proc. Syst. 34 (2021)
59. Tucker, A., Gleave, A., Russell, S.: Inverse reinforcement learning for video games.
arXiv preprint arXiv:1810.10593 (2018)
60. Wang, Z., et al.: Sample efficient actor-critic with experience replay. arXiv preprint
arXiv:1611.01224 (2016)
61. Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., Ba, J.: Scalable trust-region method
for deep reinforcement learning using Kronecker-factored approximation. Adv.
Neural Inf. Proc. Syst. 30 (2017)
62. Yannakakis, G. N.: Game AI revisited. In: Proceedings of the 9th Conference on
Computing Frontiers, pp. 285–292 (2012)
63. Yannakakis, G.N., Togelius, J.: A panorama of artificial and computational intel-
ligence in games. IEEE Trans. Comput. Intell. AI in Games 7(4), 317–335 (2014)
64. Yohanes, D.N., Rochmawati, N.: Implementasi algoritma collision detection dan
a*(a star) pada non player character game world of new normal. J. Inf. Comput.
Sci. (JINACS) 3(03), 322–333 (2022)
Pre-trained CNN Based SVM Classifier for Weld
Joint Type Recognition
1 Introduction
Since its inception, welding using robots has always been an essential part of Advanced
Manufacturing techniques [1]. Robot-assisted welding holds the key to making the oper-
ation more precise, cutting turnaround time, increasing worker safety, and lowering
resource waste [2]. Welders are currently required to monitor and supervise the opera-
tion while equipment executes welding. They must adapt themselves to the procedure
in specific ways based on the geometry of the parts to be connected. Four labels classify
practical methods of parts connection: Butt, Vee, Lap, or Tee joint [3]. Determining the
form of the weld joint is necessary before extracting weld joint features and directing
the robot to follow it without human involvement [4]. Furthermore, the algorithms used
to determine location data for different weld joints vary; also, the relevant parameters
guiding the process differ depending upon the type of weld joint [5]. Manually updat-
ing the criteria and system elements according to the weld joint just before a welding
session is very inefficient. As a result, the system’s productivity and ease of use
will be significantly enhanced if it can recognize the weld joint autonomously before
beginning the welding.
Welders still oversee and monitor robotic welding, which is a more advanced variant
of mechanized welding in which machines perform the welding. In some cases, robots
find it difficult, if not impossible, to adapt to complex circumstances as precisely and
quickly as humans. Automated robotic welding strives to solve this issue and is now an important
field of research in industrial automation [6]. In the area of Weld joint type identification,
a few methods have been suggested, such as SVM [7, 8], deep learning classifiers [9],
Ensemble learning [10], Local Thresholding [11] and Hausdorff distance [12]. However,
even though such techniques have exhibited satisfactory detection performance, weld
joint identification remains an ongoing research subject that requires the development
of novel strategies and processes for enhancing recognition performance, run-time, and
computing cost.
A few researchers have presented single weld joint recognition algorithms
in recent years. Shah et al. [11] studied the detection and identification of butt weld joints
using image segmentation and local thresholding. Fan et al. [13] researched the recog-
nition of narrow butt weld joints using a structured light vision sensor. Zou et al. [14]
studied feature recognition of lap joints using the TLD object tracking algorithm and
CCOT algorithm. In later works, Fan et al. [15] considered image-based recognition of
fillet joints. The authors employed weld likelihood calculation using a custom convo-
lutional kernel. Because their scope is limited to a single weld joint type, these methods
can identify just one type. However, several weld joint types may occur in practical
welding situations. As a result, recognizing different weld joints is critical in a
real welding scenario.
Recently, some researchers investigated multiple kinds of weld joint identification.
Xiuping et al. [3] recognized 4 types of weld joints using features extracted from
Laser structured light vision and Probabilistic Neural Network (PNN). The authors
pre-processed images using reduction, Wavelet Transform (WT) and binarization. The
positional element connection of the extremities and the junctions of weld joints was
used for weld joint categorization [9]. This method works; however, it has difficulty dis-
criminating amongst different weld joints with similar elemental features. Wang et al.
[10] used laser stripe features with BP-Adaboost and KNN-Adaboost to recognize six
types of weld joints. The method suffers from the fact that it requires costly hardware and
is computationally and mathematically intensive. Li et al. [12] used the Hausdorff distance
to measure how well laser stripes match a standard template. This
technique, however, suffers from high computation cost and is not flexible.
Work has been done in academia to extract handcrafted characteristics. However,
actively extracting characteristics is resource-hungry and needs manual expertise. Fur-
thermore, the handcrafted feature extraction technique involves a compromise between
economy and precision, since computing unimportant features may raise computing
cost. The main contributions of this paper are as follows:
1. A weld joint type recognition approach is proposed to enhance the robotic welding
automation level based on extracted features from the ResNet18 Pool5 layer.
2. An SVM-based weld joint type classification model is built to detect different types
of weld joints.
2 Proposed Methodology
2.1 Pre-Trained CNN
For the categorization of weld joint type datasets, a Pre-Trained CNN based SVM Classi-
fier is developed. The presented hybrid model incorporates the essential characteristics
of both the CNN and the classifier. Figure 1 shows the approach considered in this
paper. A CNN comprises numerous fully connected layers and uses supervised learning.
A CNN learns invariant local characteristics quite effectively, much as humans do, and
can therefore extract the most distinguishing information from raw weld joint images.
Convolutional neural networks are deep learning network architectures that learn
features directly from data.
188 S. Sonwane et al.
2.2 ReLU
The ReLU function is the most often utilized activation function these days. ReLU
is preferred over tanh and sigmoid because it accelerates stochastic gradient descent
convergence compared to the other two functions [20]. Furthermore, unlike tanh and
sigmoid, which need substantial calculation, ReLU is computed by simply clamping
negative matrix values to zero. Figure 3 shows the activation of the ReLU function as an example.
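The element-wise clamping described above amounts to one line of code (the sample matrix is invented for illustration):

```python
# ReLU zeroes out negative entries, element-wise; positives pass through unchanged.
def relu(matrix):
    return [[max(0.0, x) for x in row] for row in matrix]

m = [[-1.5, 2.0], [0.0, -0.3]]
# relu(m) -> [[0.0, 2.0], [0.0, 0.0]]
```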
2.3 ResNet18
The pre-trained CNN architecture used in this study is ResNet18 [21]. ResNet18 comprises
72 layers, of which 18 are deep (weighted) layers. This network’s architecture was designed
to operate a high number of convolutional layers efficiently. The main principle behind ResNet
is the use of skip links, also known as bypass connections [22]. These bypasses generally
work by jumping across several layers and establishing shortcuts between them. Creating
these bypass connections solved the common problem of vanishing gradients that deep
networks face: by reusing the activations from a prior layer, the bypass connections keep
gradients flowing. The link-skipping technique shrinks the effective network, allowing it to
learn faster. Because of its complicated, layered design, and because layers receive input
from several layers and feed output to numerous layers, the network is classified as a
Directed Acyclic Graph (DAG) network.
2.4 SVM
Vapnik [23] developed the binary SVM classifier. Its goal is to find the best-fitting
hyperplane f(w, x) = w · x + b that separates two categories in a target dataset with
features x ∈ Rm. The SVM parameters are learned by solving Eq. 1:
    min (1/p) z^T z + P Σ_{i=1}^{p} max(0, 1 − z_i (w^T x_i + b))        (1)

where
z^T z is the Manhattan (L1) norm term,
P is the penalty parameter,
z_i is the actual label of sample i, and
w^T x_i + b is the predictor function.
Pre-trained CNN Based SVM Classifier for Weld Joint 191
In the L2 variant of Eq. 1, the Manhattan (L1) norm is replaced by the Euclidean (L2)
norm and the hinge loss is squared.
Each data item in the SVM is represented as a point in an n-dimensional space (where
n is the number of features). SVM seeks to map multi-dimensional datasets into a space
where the data items are separated into distinct classes by a hyperplane. On unseen
data, the SVM classifier minimizes the generalization error. SVM is effective for binary
classification but less robust to noisy data. However, multiple SVMs can be combined
in a one-versus-one scheme to achieve multi-class classification. We have used the
error-correcting output codes (ECOC) multi-class model classifier in this study [24].
This paper employs SVM as the classifier, replacing the CNN's final classification
layers, while the CNN serves as a feature extractor. The SVM classifier takes as input
the features of the weld joint images obtained at the 'pool5' layer and is trained on
these newly produced image features. Finally, the trained SVM classifier is used to
recognize the weld joints.
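A minimal sketch of training a linear SVM by subgradient descent on an objective in the spirit of Eq. 1 (NumPy only; the study itself uses MATLAB's SVM/ECOC tooling, and the 2-D synthetic data below is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated clusters with labels z in {-1, +1}
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
z = np.array([-1.0] * 50 + [1.0] * 50)

w, b, P, lr = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(200):
    margins = z * (X @ w + b)
    viol = margins < 1                       # samples violating the margin
    # Subgradient of (1/p)||w||^2 + P * sum of hinge losses (Eq. 1, loosely)
    grad_w = 2 * w / len(X) - P * (z[viol, None] * X[viol]).sum(axis=0)
    grad_b = -P * z[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

pred = np.sign(X @ w + b)
print("training accuracy:", (pred == z).mean())  # high on this easy data
```

In the hybrid model, the rows of `X` would be pool5 feature vectors rather than raw 2-D points.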
All experiments in this study were conducted on a PC with an Intel i5 12th-gen CPU @ 2.50
GHz and 8 GB of DDR4 RAM. The experimental setup includes the following stages:
The dataset used in this study combines the Kaggle dataset1 [25] with a custom dataset
generated by the authors. Table 2 shows the dataset distribution across the various weld
joint types. The images from both the Kaggle dataset and the authors' dataset are in
.png format at 640 × 480 pixels. Figure 4 shows sample weld joint images from the dataset.
1 https://www.kaggle.com/datasets/derikmunoz/weld-joint-segments.
This approach begins with a pre-trained network and, when performing feature extraction,
changes only the parameters of the final layer, from which we obtain predictions. Because
we use the pre-trained CNN as a fixed feature extractor and simply replace the output
layer, the technique is called feature extraction; there is no need to train the entire
network. Figure 5 depicts the first 25 features acquired by the final pooling layer
('pool5') for illustration purposes.
The suggested approach derives features from input images of weld joints. First, a
weld joint image is fed into the proposed hybrid model. The network accepts images
with a resolution of 224 × 224 × 3. The function computes the output probability for
every supplied specimen joint image. The input to each successive layer is formed by
combining the previous hidden layer's outputs with trainable weights and a bias
term [21]. Similarly, a feature map is generated at the pool5 layer. Finally, the
feature map is flattened into a single vector, and this vector of distinct features
is supplied to the SVM classifier for the recognition task.
After completing the feature extraction steps, the SVM classifier is used to classify
the weld joint images. The SVM classifier was trained on feature vectors recorded in
matrix form, and the joints were then tested using the trained model. In the hybrid
CNN-SVM model, the automatically produced features are supplied to the SVM classifier
for training and evaluation on the weld joint dataset. An error-correcting output
codes (ECOC) model reduces a three-class classification problem to a collection of
binary classification problems [26]. The coding scheme is a matrix whose elements
determine which classes each binary learner is trained on, i.e., how the multi-class
problem is converted into a sequence of binary problems.
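As an illustrative sketch (NumPy; the matrix below is a one-versus-one coding for three classes, one plausible ECOC design, not necessarily the exact scheme produced by the authors' toolbox), decoding picks the class whose codeword best agrees with the binary learners' outputs:

```python
import numpy as np

# One-vs-one ECOC coding matrix for 3 classes and 3 binary learners.
# Rows = classes, columns = learners; +1/-1 = class used as the learner's
# positive/negative class, 0 = class ignored by that learner.
M = np.array([[ 1,  1,  0],
              [-1,  0,  1],
              [ 0, -1, -1]])

def ecoc_decode(learner_outputs):
    """Pick the class whose codeword row has maximum agreement with
    the vector of binary learner predictions in {-1, +1}."""
    scores = (M * learner_outputs).sum(axis=1)  # agreement score per class
    return int(np.argmax(scores))

# Learner 1 (class 0 vs 1) votes +1, learner 2 (class 0 vs 2) votes +1,
# learner 3 (class 1 vs 2) votes -1: most consistent with class 0.
print(ecoc_decode(np.array([1, 1, -1])))  # 0
```

The redundancy in the coding matrix is what gives ECOC its error-correcting property: a single wrong binary learner usually cannot flip the decoded class.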
on the figure’s right show a True positive rate (TPR) and a False-negative rate (FNR).
Figure 8 shows the sample output of the program.
The pre-trained CNN based SVM classifier study is in its early stages and may be
developed further. In the future, the suggested model can be extended to recognize
additional weld joints such as corner, single-V, U, and double-U joints. Furthermore,
specific optimization strategies may be examined to improve classification performance.
The method can also be cross-validated using different materials under different
lighting conditions. As a further extension of the study, the model's performance
under splash noise will be tested and improved. In addition, the findings could be
enhanced if the image preprocessing applied to the datasets and the underlying CNN
were more sophisticated than those used in this research.
References
1. Reisgen, U., Mann, S., Middeldorf, K., Sharma, R., Buchholz, G., Willms, K.: Connected,
digitalized welding production—industrie 4.0 in gas metal arc welding. Welding in the World
63(4), 1121–1131 (2019). https://doi.org/10.1007/s40194-019-00723-2
2. Mahadevan, R., Jagan, A., Pavithran, L., Shrivastava, A., Selvaraj, S.K.: Intelligent welding
by using machine learning techniques. Mater. Today Proc. 46, 7402–7410 (2021). https://doi.
org/10.1016/j.matpr.2020.12.1149
3. Xiuping, W., Fan, X., Ying, F.: Recognition of the Type of Welding Joint Based on Line
Structured Light Vision, pp. 4403–4406 (2015)
4. Chen, X., Chen, S., Lin, T., Lei, Y.: Practical method to locate the initial weld position using
visual technology. Int. J. Adv. Manuf. Technol. 30(7–8), 663–668 (2006). https://doi.org/10.
1007/s00170-005-0104-z
5. Hong, T.S., Ghobakhloo, M., Khaksar, W.: Robotic Welding Technology 6. Elsevier (2014).
https://doi.org/10.1016/B978-0-08-096532-1.00604-X
6. Zhang, Y.M., Feng, Z., Chen, S.: Trends in intelligentizing robotic welding processes. J.
Manuf. Process. 63, 1 (2021). https://doi.org/10.1016/j.jmapro.2020.11.012
7. Zeng, J., Cao, G.Z., Peng, Y.P., Huang, S.D.: A weld joint type identification method for
visual sensor based on image features and SVM. Sensors (Switzerland) 20(2), 471 (2020).
https://doi.org/10.3390/s20020471
8. Fan, J., Jing, F., Fang, Z., Tan, M.: Automatic recognition system of welding seam type based
on SVM method. Int. J. Advanced Manufacturing Technol. 92(1–4), 989–999 (2017). https://
doi.org/10.1007/s00170-017-0202-8
9. Tian, Y., et al.: Automatic identification of multi-type weld seam based on vision sensor with
silhouette-mapping. IEEE Sens. J. 21(4), 5402–5412 (2021). https://doi.org/10.1109/JSEN.
2020.3034382
10. Wang, Z., Jing, F., Fan, J.: Weld seam type recognition system based on structured light vision
and ensemble learning. In: Proceedings 2018 IEEE International Conference Mechatronics
Autom. ICMA 2018, no. 61573358, pp. 866–871 (2018). https://doi.org/10.1109/ICMA.2018.
8484570
11. Shah, H.N.M., Sulaiman, M., Shukor, A.Z., Kamis, Z., Rahman, A.A.: Butt welding joints
recognition and location identification by using local thresholding. Robot. Comput. Integr.
Manuf. 51, 181–188 (2018). https://doi.org/10.1016/j.rcim.2017.12.007
12. Li, Y., Xu, D., Tan, M.: Welding joints recognition based on Hausdorff distance. Gaojishu
Tongxin/Chinese High Technol. Lett. 16(11), 1129–1133 (2006)
13. Fan, J., Jing, F., Yang, L., Long, T., Tan, M.: A precise seam tracking method for narrow
butt seams based on structured light vision sensor. Opt. Laser Technol. 109, 616–626 (2019).
https://doi.org/10.1016/j.optlastec.2018.08.047
14. Zou, Y., Chen, T.: Laser vision seam tracking system based on image processing and contin-
uous convolution operator tracker. Opt. Lasers Eng. 105(January), 141–149 (2018). https://
doi.org/10.1016/j.optlaseng.2018.01.008
15. Chen, S., Liu, J., Chen, B., Suo, X.: Universal fillet weld joint recognition and positioning
for robot welding using structured light. Robot. Comput. Integr. Manuf., 74, 102279 (2021).
https://doi.org/10.1016/j.rcim.2021.102279
16. Tang, Y.: Deep Learning using Linear Support Vector Machines (2013). https://doi.org/10.
48550/ARXIV.1306.0239
17. Agarap, A.F.: An Architecture Combining Convolutional Neural Network (CNN) and Support
Vector Machine (SVM) for Image Classification, pp. 5–8 (2017). http://arxiv.org/abs/1712.
03541
18. Jiang, S., Hartley, R., Fernando, B.: Kernel support vector machines and convolutional neural
networks. 2018 Int. Conf. Digit. Image Comput. Tech. Appl. DICTA, pp. 1–7 (2019). https://
doi.org/10.1109/DICTA.2018.8615840
19. Ahlawat, S., Choudhary, A.: Hybrid CNN-SVM classifier for handwritten digit recognition.
Procedia Comput. Sci. 167(2019), 2554–2560 (2020). https://doi.org/10.1016/j.procs.2020.
03.309
20. Kaggle: Rectified Linear Units (ReLU) in Deep Learning. https://www.kaggle.com/dansbe
cker/rectified-linear-units-relu-in-deep-learning
21. Hantos, N., Iván, S., Balázs, P., Palágyi, K.: Binary image reconstruction from a small number
of projections and the morphological skeleton. Ann. Math. Artif. Intell. 75(1–2), 195–216
(2014). https://doi.org/10.1007/s10472-014-9440-8
22. MathWorks: ResNet-18 convolutional neural network - MATLAB resnet18 - MathWorks
India. https://in.mathworks.com/help/deeplearning/ref/resnet18.html
23. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
1 Introduction
Caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), coronavirus
disease 2019 (COVID-19) has become an ongoing pandemic after it was found at the
N. F. Li—Independent Scholar.
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 198–216, 2023.
https://doi.org/10.1007/978-3-031-18461-1_13
A Two-Stage Federated Transfer Learning Framework 199
end of the year 2019. Due to the fast spread and infection rate of COVID-19, the
World Health Organization (WHO) designated it a pandemic [51]. It is critical to
find tools, processes, and resources to rapidly identify infected individuals.
According to several previous studies [9,20,28,53,65,67], computed tomography (CT)
offers high diagnostic and prognostic value for COVID-19. CT scans of individuals
with COVID-19 often reveal bilateral lung lesions comprised of ground-glass opacity
[2], and in some cases further abnormalities and changes were observed [41]. Since CT
is a popular diagnostic technique that is simple and quick to obtain without
significant expense, incorporating CT imaging into the development of a sensitive
diagnostic tool may expedite the diagnosis process while also serving as a complement
to RT-PCR [2,21,72]. Moreover, utilizing CT imaging to forecast a patient's
individualized prognosis may identify prospective high-risk individuals who are more
likely to develop severe disease and need immediate medical attention. Researchers
have realized that developing effective methods to assist diagnosis has become
critical.
As a key machine learning method, deep learning has evolved in recent years
and has achieved astonishing success in the field of medical image processing
[25,44,61]. Because of the superior capability of convolutional neural networks
(CNNs) in medical image classification, researchers have begun to concentrate on
applying CNNs to an increasing number of medical image processing problems, and
previous studies have demonstrated the great capability of CNNs in computer-assisted
diagnosis [5,16,49,57,63,64,76]. Previous studies have also achieved exciting results
in COVID-19 classification [11,18,22,23,30–32,35,47,55,59]. However, since patients'
medical records have always been deemed private and are protected by laws such as the
GDPR in the European Union (EU) and HIPAA in the US, collecting the data needed for
building high-quality classifiers, such as CT scans, is extremely difficult. In other
studies [10,15,19,38,48,56,73,77], the authors built their Covid detection or
classification techniques on federated learning. Federated learning is a decentralized
computation approach for training a neural network [6,24,46,68] that can address
privacy concerns during training. In federated learning, rather than gathering and
keeping data in one place for centralized training, participating clients process
their own data and communicate model updates to the server, where the server collects
and combines weights from clients to create a global model [6,24,46,68]. Although
federated learning can handle privacy concerns, since data are distributed across
clients' devices, the data size accessible to each client may be limited, which may
compromise overall model performance. Transfer learning is designed to address the
issue of limited data by transferring knowledge learned from a source task to a
target task [52,66,70]; with transfer learning, previous studies have also achieved
decent Covid classification or detection performance [3,4,14,17,29,33,34,42,50,54,69,75].
Even though all of the papers
200 A. S. Zhang and N. F. Li
mentioned above have proposed different methods and frameworks for Covid
classification, there is still a lack of a framework that achieves both privacy
and accuracy by integrating the two-stage transfer learning introduced in [76]
with federated learning [6,24,46,68]. Therefore, in this paper, we make a further
attempt to achieve both accuracy and privacy in classifying COVID-19 CT images by
combining two-stage transfer learning [76] with federated learning techniques
[6,24,46,68].
In this paper, the datasets used are obtained from the COVID-19 Radiography
Database [13,58] and Chest X-Ray Images (Pneumonia) [36]. From these two public
databases, we obtained 10192 healthy CT scans, 4273 CT scans of bacterial or viral
pneumonia not caused by Covid, and 3616 Covid CT scans, as shown in Table 1, Fig. 1
and Fig. 2.
2 Our Contributions
We propose a novel two-stage federated transfer learning approach for classifying
lung CT scans, with the first stage classifying Pneumonia from Healthy and the
second stage classifying Covid Pneumonia from Non-Covid Pneumonia.
3 Paper Organization
The rest of this paper is organized as follows. In Sect. 4, we discuss the deep
learning, transfer learning, and federated learning methodologies, and present the
algorithm used in this paper. In Sect. 5, we present our experimental results and
discussion. We then conclude in Sect. 6 and discuss future directions in Sect. 7.
CNNs; in CNNs, each neuron takes input, performs a weighted calculation, and passes
the result to other neurons through an activation function [43,60]. To construct a
decent classifier, CNNs are trained on previously collected data [8,37,39,71].
Although a large amount of high-quality data is an essential factor in achieving
better training and testing performance, collecting and labeling data is always
resource-consuming, which makes the model training process less efficient; transfer
learning [52,70] can handle this issue. According to [52,70,76], transfer learning
attempts to learn knowledge from source tasks and apply it to a target task, in
contrast to the conventional training process, which attempts to learn each new task
from scratch [52,70,76]. In this paper, we utilize deep learning and transfer
learning techniques to assist in the classification task.
(Figure: CNN architecture — input layer, convolutional layers, fully connected layers, output layer.)
are sent to a set of clients at the beginning of each training round; clients then
start training local models, based on the received weights, with their locally
accessible data. In the particular case of t = 0, all clients start from the same
weights obtained from the server, which have either been randomly initialized or
pre-trained on other data, depending on the configuration.
The two-stage transfer learning method was first proposed by the authors in [76],
where it achieved very high performance in classifying lung nodules. We further
propose our two-stage federated transfer learning framework, which closely follows
and builds on the algorithms proposed in [46,68], as shown in Algorithm 1. In the
first stage, CT scans are classified into Healthy and Pneumonia; in the second stage,
we further classify Pneumonia into Covid Pneumonia and Non-Covid Pneumonia. At the
beginning of the framework, we first conduct stage-one model training in a federated
format, and the weights are saved as a loadable file for transfer learning use in
stage two.
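The hand-off between the two stages can be sketched as saving the stage-one weights and loading them to initialize stage two (NumPy sketch with toy weight names; an in-memory buffer stands in for the saved weights file, whose actual format is not specified here):

```python
import io
import numpy as np

# Stage one finishes: save the federated-averaged weights (toy example).
stage_one_weights = {"conv1": np.arange(4.0).reshape(2, 2), "fc1": np.ones(3)}
buf = io.BytesIO()                       # stand-in for the loadable weights file
np.savez(buf, **stage_one_weights)

# Stage two starts: initialize w(0) from the stage-one weights instead of randomly.
buf.seek(0)
loaded = np.load(buf)
w0_stage_two = {name: loaded[name] for name in loaded.files}
print(sorted(w0_stage_two))  # ['conv1', 'fc1']
```

This is the transfer-learning step: stage two starts from knowledge learned in stage one rather than from scratch.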
As shown in Algorithm 1, te is the number of global federated training rounds given
by the user, and the federated averaging interval τe is the number of local training
epochs each client runs before sending its local weights to the GlobalServer for the
federated-averaged weight calculation in each training round. w(t) is the
federated-averaged weight vector obtained at the end of training round t. Before the
first training round, at t = 0, we initialize w(0) to a vector of random values for
stage one; for stage two, we initialize w(0) to the pre-trained weights obtained from
stage one. At the beginning of each training round, the GlobalServer sends the
federated-averaged weights from the previous round, w(t − 1), to all clients, and
each client i starts local training on its local data Di, based on the received
weights, in Procedure TrainClient. At training round t, after finishing local
training for τe epochs, each client sends its weights w^i_{τe}(t) to the GlobalServer
for the federated-averaged weight calculation. As discussed in [46,68], we also take
the size of each client's local dataset into consideration and perform a weighted
average when calculating the federated-averaged weights w(t). If the currently
running task is stage one, then after all training rounds end, w(te) is saved as a
loadable file to be used in stage two. Please note that in this two-stage federated
transfer learning approach, stage one must be run before stage two in order to
generate the pre-trained weights for transfer learning in stage two.
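The weighted federated average described above can be sketched in NumPy (a simplified illustration of the FedAvg-style update, not the authors' exact code): each client's weights are scaled by its share of the total data before summation.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Weighted average of client weight vectors:
    w(t) = sum_i (n_i / n) * w_i, where n_i is client i's dataset size."""
    sizes = np.asarray(client_sizes, dtype=float)
    shares = sizes / sizes.sum()                    # each client's data share
    return sum(s * w for s, w in zip(shares, client_weights))

# Three clients with different data sizes and toy weight vectors:
clients = [np.array([1.0, 1.0]), np.array([3.0, 3.0]), np.array([5.0, 5.0])]
sizes = [100, 100, 200]
print(federated_average(clients, sizes))  # [3.5 3.5]
```

The client holding 200 samples contributes half of the averaged weights, which is why unbalanced client distributions can matter.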
As the proposed federated transfer learning framework contains two stages, two
different datasets with overlapping data need to be prepared. To create the dataset
for stage one, we combined the aforementioned 3616 Covid CT scans and 4273 CT scans
of non-Covid bacterial or viral pneumonia into a new category named Pneumonia, which
contains 7889 CT scans in total; the other category, Healthy, consists of 10192 CT
scans. For stage two, the 3616 Covid CT scans form the category Covid Pneumonia, and
the other category, Non-Covid Pneumonia, contains the 4273 CT scans of pneumonia not
caused by Covid, as shown in Table 2. All CT scans are pre-processed into grayscale
and resized to 28 × 28 pixels when creating the datasets, in order to be usable by
the LeNet model [40].
Table 3. Dataset used in stage one (classifying Healthy and Pneumonia) and stage two
(classifying Covid Pneumonia and Non-Covid Pneumonia)

            Training set (80%)                  Testing set (20%)
Stage one   14464: Healthy 8108,                3617: Healthy 2084,
            Pneumonia 6356                      Pneumonia 1533
Stage two   6311: Non-Covid Pneumonia 3408,     1578: Non-Covid Pneumonia 865,
            Covid Pneumonia 2903                Covid Pneumonia 713
In this paper, we utilize LeNet as the model, first to classify CT scans into Healthy
and Pneumonia in stage one, and then to classify Pneumonia into Covid Pneumonia and
Non-Covid Pneumonia in stage two. LeNet is one of the most classic CNN architectures,
developed by Yann LeCun [40] and originally used to classify data from the MNIST
dataset [40]. The LeNet architecture we used contains two convolutional layers, two
max-pooling layers, and two fully connected layers, with softmax [7] used in the
output layer.
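A quick sanity check of the feature-map sizes for the 28 × 28 input (pure Python; the kernel sizes and padding below are assumptions in the spirit of the classic LeNet, not the authors' stated hyperparameters):

```python
def conv_out(size, kernel, stride=1, pad=0):
    # Standard convolution/pooling output-size formula
    return (size + 2 * pad - kernel) // stride + 1

size = 28                        # grayscale 28x28 input
size = conv_out(size, 5, pad=2)  # conv1: 5x5 kernel, 'same' padding -> 28
size = conv_out(size, 2, 2)      # max-pool 2x2, stride 2 -> 14
size = conv_out(size, 5)         # conv2: 5x5 kernel, no padding -> 10
size = conv_out(size, 2, 2)      # max-pool 2x2, stride 2 -> 5
print(size)  # 5: the 5x5 maps are flattened for the fully connected layers
```

Tracing shapes like this is a cheap way to confirm that a resized 28 × 28 input survives both conv/pool stages before the fully connected layers.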
After training models in the centralized setting, we then train models using the
proposed federated transfer learning framework. In federated learning, the weights of
all clients are sent to the GlobalServer for federated averaging [46] in each training
round, after being trained at the client side for a certain number of local epochs;
we therefore take the effect of the federated averaging interval into consideration.
As shown in Algorithm 1, in our proposed framework, the federated averaging interval
controls the number of epochs a local model is trained at the client side. In our
experiments, we create five clients and train our models with the federated averaging
interval set from 1 to 10, in order to explore how it relates to performance. Data
distribution at each client may also be a key factor for overall performance;
therefore, we explore the influence of data distribution by training models in the
two scenarios shown in Table 4: (1) distributing the training set to the clients
evenly, with 20% of the data for each client, marked as balanced, and (2)
distributing the data unevenly, with the five clients having access to 30%, 25%, 20%,
15%, and 10% of the data, respectively, marked as unbalanced. Please note that in the
federated model training process, the number of training rounds is set to 20 in stage
one and 10 in stage two, the learning rate is set to 0.001, and the batch size is set
to 32 in both stages, corresponding to the parameters in the aforementioned
centralized training.
Client     Balanced   Unbalanced
Client 1   20%        30%
Client 2   20%        25%
Client 3   20%        20%
Client 4   20%        15%
Client 5   20%        10%
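The two splits can be sketched as follows (NumPy index-partitioning illustration, not the authors' data pipeline; 14464 is the stage-one training-set size):

```python
import numpy as np

def partition(n_samples, fractions, seed=0):
    """Shuffle sample indices and split them among clients
    according to the given fractions (which must sum to 1)."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    cuts = np.cumsum([int(f * n_samples) for f in fractions[:-1]])
    return np.split(idx, cuts)

balanced = partition(14464, [0.2] * 5)
unbalanced = partition(14464, [0.30, 0.25, 0.20, 0.15, 0.10])
print([len(c) for c in balanced])    # five clients, ~2893 samples each
print([len(c) for c in unbalanced])  # [4339, 3616, 2892, 2169, 1448]
```

Each client then trains only on its own index set, which is what limits the data any single participant ever sees.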
To evaluate performance, we tested our models on the testing set. Due to the
imbalance between categories, traditional accuracy may be biased by dataset size, so
we decided to use the ROC curve and AUC value for a more robust evaluation of model
performance. ROC curves of models trained with the balanced data distribution are
shown in Fig. 4, Fig. 5, Fig. 6 and Fig. 7, and ROC curves of models trained with the
unbalanced data distribution are shown in Fig. 8, Fig. 9, Fig. 10 and Fig. 11. All
AUC values were recorded, and we also calculated precision, sensitivity, and
specificity. When calculating AUC, precision, and sensitivity, we consider Pneumonia
as positive and Healthy as negative in stage one; in stage two, Covid Pneumonia is
considered positive and Non-Covid Pneumonia negative.
Precision is calculated using Eq. 1,

    Precision = TruePositive / (TruePositive + FalsePositive)        (1)

while sensitivity is calculated as shown in Eq. 2,

    Sensitivity = TruePositive / (TruePositive + FalseNegative)      (2)

and Eq. 3 calculates specificity:

    Specificity = TrueNegative / (TrueNegative + FalsePositive)      (3)
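The three metrics follow directly from the confusion-matrix counts (NumPy-free sketch; the counts below are illustrative, not values from the paper's Table 5):

```python
def metrics(tp, fp, fn, tn):
    """Precision, sensitivity, and specificity from confusion-matrix
    counts, matching Eqs. 1-3."""
    precision = tp / (tp + fp)      # of predicted positives, how many are right
    sensitivity = tp / (tp + fn)    # of actual positives, how many are found
    specificity = tn / (tn + fp)    # of actual negatives, how many are found
    return precision, sensitivity, specificity

# Toy counts for a binary positive/negative task:
p, sen, spe = metrics(tp=90, fp=10, fn=5, tn=95)
print(round(p, 3), round(sen, 3), round(spe, 3))  # 0.9 0.947 0.905
```

Because each metric normalizes by a different row or column of the confusion matrix, they remain informative even when the class sizes are imbalanced.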
The recorded confusion-matrix values are shown in Table 5, and the AUC, precision,
sensitivity, and specificity values are shown in Table 6. Please note that the values
in Table 6 are rounded to four decimals, so some entries that appear identical in
Table 6 may not be equal before rounding.
Fig. 5. Upper left zoom-in of ROC curves of stage one with balanced data distribution
Fig. 7. Upper left zoom-in of ROC curves of stage two with balanced data distribution
5.4 Discussion
The results of the experiments show that our proposed two-stage federated transfer
learning framework achieves excellent accuracy in both stages. Comparing the balanced
and unbalanced data distributions at the client side, we can see that the dataset
distribution at the client side may not affect overall model performance in the
current two-stage classification task. Additionally, we observed that the models
achieved very high classification performance in stage two even with the federated
averaging interval set to 1. However, the results of the stage-one classification
showed that an increase in the federated averaging interval might help
Fig. 9. Upper left zoom-in of ROC curves of stage one with unbalanced data distribution
Fig. 10. ROC curves of stage two with unbalanced data distribution
Fig. 11. Upper left zoom-in of ROC curves of stage two with unbalanced data distribution
the model achieve better performance, which can be observed from the sensitivity
values. However, performance may not always be positively correlated with the
federated averaging interval, as too many local training epochs could result in
overfitting.
Table 6. AUC, pre. (precision), sen. (sensitivity) and spe. (specificity) values of all
models
Stage Training setting Data dist. Fed. averaging interval AUC Pre. Sen. Spe.
Stage one Centralized N/A N/A 0.9801 0.8995 0.9282 0.9237
Stage two Centralized N/A N/A 0.9992 0.9901 0.9818 0.9919
Stage one Federated Balanced 1 0.9626 0.9130 0.8219 0.9424
2 0.9761 0.9434 0.8584 0.9621
3 0.9790 0.9300 0.9008 0.9501
4 0.9812 0.9451 0.8865 0.9621
5 0.9829 0.9490 0.8982 0.9645
6 0.9807 0.9199 0.9367 0.9400
7 0.9836 0.9466 0.9022 0.9626
8 0.9818 0.9384 0.9145 0.9559
9 0.9793 0.9290 0.9054 0.9491
10 0.9820 0.9289 0.9211 0.9482
Stage two Federated Balanced 1 0.9986 0.9790 0.9790 0.9827
2 0.9987 0.9887 0.9832 0.9908
3 0.9980 0.9887 0.9804 0.9908
4 0.9972 0.9915 0.9762 0.9931
5 0.9977 0.9943 0.9776 0.9954
6 0.9981 0.9832 0.9860 0.9861
7 0.9987 0.9887 0.9790 0.9908
8 0.9987 0.9901 0.9818 0.9919
9 0.9992 0.9845 0.9818 0.9873
10 0.9986 0.9943 0.9818 0.9954
Stage one Federated Unbalanced 1 0.9655 0.9199 0.8395 0.9463
2 0.9779 0.9421 0.8708 0.9607
3 0.9816 0.9368 0.9191 0.9544
4 0.9808 0.9295 0.9113 0.9491
5 0.9815 0.9362 0.9087 0.9544
6 0.9796 0.9264 0.9035 0.9472
7 0.9821 0.9309 0.9132 0.9501
8 0.9847 0.9326 0.9295 0.9506
9 0.9808 0.9290 0.9217 0.9482
10 0.9825 0.9354 0.9256 0.9530
Stage two Federated Unbalanced 1 0.9988 0.9885 0.9663 0.9908
2 0.9981 0.9887 0.9776 0.9908
3 0.9978 0.9944 0.9874 0.9954
4 0.9983 0.9915 0.9804 0.9931
5 0.9974 0.9901 0.9790 0.9919
6 0.9972 0.9943 0.9776 0.9954
7 0.9981 0.9914 0.9748 0.9931
8 0.9994 0.9929 0.9776 0.9942
9 0.9979 0.9915 0.9846 0.9931
10 0.9979 0.9901 0.9846 0.9919
6 Conclusion
In this paper, we proposed the two-stage federated transfer learning framework
to address privacy concerns while achieving high accuracy. We also explored the
relationship between the performance and the number of epochs local models are
trained. The results of our experiments showed that the performance in terms
of accuracy of the proposed framework is surprisingly good compared to the
centralized learning.
7 Future Direction
In our current work, due to hardware limitations, the simulation experiments of
our proposed framework were only run on the LeNet model. Future endeavors
may be focusing on running the proposed framework on other much more com-
plicated CNNs, such as AlexNet [37], VGG [62], and ResNet [27]. In the future,
we may further explore the time or other resources consumed when increasing
the number of local training epochs at the client-side and focus on achieving
high accuracy in a resource-constrained environment.
References
1. Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous sys-
tems (2015). http://tensorflow.org/
2. Ai, T., et al.: Correlation of chest CT and RT-PCR testing for coronavirus disease
2019 (COVID-19) in China: a report of 1014 cases. Radiology 296(2), E32–E40
(2020)
3. Altaf, F., Islam, S., Janjua, N.K.: A novel augmented deep transfer learning for
classification of COVID-19 and other thoracic diseases from X-rays. Neural Com-
put. Appl. 33(20), 14037–14048 (2021)
4. Aslan, M.F., Unlersen, M.F., Sabanci, K., Durdu, A.: CNN-based transfer learning-
BiLSTM network: a novel approach for COVID-19 infection detection. Appl. Soft
Comput. 98, 106912 (2021)
5. Barbu, A., Lu, L., Roth, H., Seff, A., Summers, R.M.: An analysis of robust
cost functions for CNN in computer-aided diagnosis. Comput. Methods Biomech.
Biomed. Eng. Imaging Vis. 6(3), 253–258 (2018)
6. Bonawitz, K., et al.: Towards federated learning at scale: system design. Proc.
Mach. Learn. Syst. 1, 374–388 (2019)
7. Bridle, J.: Training stochastic model recognition algorithms as networks can lead
to maximum mutual information estimation of parameters. In: Advances in Neural
Information Processing Systems, vol. 2 (1989)
8. Çalik, R.C., Demirci, M.F.: Cifar-10 image classification with convolutional neural
networks for embedded systems. In: 2018 IEEE/ACS 15th International Conference
on Computer Systems and Applications (AICCSA), pp. 1–2. IEEE (2018)
9. Carotti, M., et al.: Chest CT features of coronavirus disease 2019 (COVID-19)
pneumonia: key points for radiologists. Radiol. Med. 125(7), 636–646 (2020)
10. Cetinkaya, A.E., Akin, M., Sagiroglu, S.: A communication efficient federated learn-
ing approach to multi chest diseases classification. In: 2021 6th International Con-
ference on Computer Science and Engineering (UBMK), pp. 429–434. IEEE (2021)
11. Chen, J.I.-Z.: Design of accurate classification of COVID-19 disease in X-ray images
Graph Emotion Distribution Learning
Using EmotionGCN
1 Introduction
very much suitable to work with image-based data. For emotion recognition, the CNN is trained on a set of images containing faces that display different emotions. However, the selected features are hidden, and the network layers are assigned by trial and error [19]. This is a supervised learning approach, since training requires labels for the different emotions in the dataset. The distribution formed after training the CNN is hidden and cannot be reused, so as the data grows the model tends to overfit.
Traditional algorithms for emotion recognition focus mainly on personalized emotion categories derived from psychological models such as Mikels' wheel. Facial emotion recognition is used in digital forensics to detect suspicious activity, in psychology research, and elsewhere. The CNN is a high-accuracy, state-of-the-art model for image processing [6]: it performs automatic feature selection and outputs a one-dimensional vector that is easily passed through an activation. Because emotions are highly subjective, a CNN is a natural choice. The graph convolutional network (GCN) applies the same idea to generic graphs, followed by a linear transformation: for each node it aggregates the features of all adjacent nodes, and it is used in particular to identify categorical relationships and semantic embeddings [11]. It automatically selects a fixed number of neighboring nodes for each feature. As a matter of human behavior, expressing a personalized emotion differs from person to person, and when the number of classes is large, high correlation between two or more emotions leads to confusion. The traditional CNN used for image emotion (personality) recognition has multiple layers, and the number of layers must be high to obtain good performance, which incurs a high computational cost [20]. Hence it is also difficult to fine-tune for a specific application. Moreover, the distribution obtained from a trained CNN is not in a form suitable for analysis or comparison with other, unspecified emotions. A model is therefore needed that considers the whole probability distribution of each emotion, so that it can be compared easily with the distribution of the training data.
The main motivation of this research is to ease image emotion recognition by determining graph probability ground-truth distributions, both for psychological analysis and for training other models built for emotion recognition [23]. These graph probability distributions are a general representation of each emotion in psychological models such as Mikels' wheel. They are represented as undirected graphs with different degrees, nodes, and edges, and are analyzed further with a graph analysis tool to extract additional information.
2 Related Work
As this paper targets finding the emotion correlation graph and its analysis, we briefly discuss the following architectures:
Two-Model Architecture. The CNN-GCN architecture consists of a convolutional neural network for feature extraction from the images and a graph convolutional network for weight generation from the annotated text [3]. The problem with this model is that it combines two different models, one of which performs the feature selection whose output is fed back as the input features of the CNN. Training time is therefore long, and if each emotion must be trained on separately, the same process is repeated for every one, which consumes a lot of time [2]. If the psychological model changes, retraining is needed to keep up with the new pattern. The major drawback of this model is that the dataset must contain both images and text captions, with an established relation between each caption and its image, and such datasets are difficult to scale for further training [1].
Text-Based Models. Most emotion recognition models rely on text-based annotations to determine the emotion in an image, which calls for text-based models. The most widely used word-embedding models are word2vec and GloVe (Global Vectors); both provide pretrained word vectors [8]. Word-embedding models capture the context of words, whereas earlier NLP models treat each word as a separate feature, so a very large vocabulary inflates the number of features. In word embedding, each word is represented as a vector, so any word in the corpus can be represented by a single vector of, say, ten features, although individual features can no longer be mapped back to words. Every NLP pipeline must first convert each word to a number. The word2vec model can be used as a continuous bag-of-words (CBOW) model or a Skip-Gram model [5]: in CBOW the context words are given and the target word is predicted, while in Skip-Gram the target word is given and the context words are predicted from it. word2vec considers only the local properties of words, whereas GloVe also considers global properties [4]. For implementing these models, the graph convolutional network is used [7].
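The difference between CBOW and Skip-Gram described above can be made concrete by the training pairs each variant generates; a minimal sketch (the `training_pairs` helper and the window size are illustrative, not from the paper):

```python
def training_pairs(tokens, window=2, mode="skipgram"):
    """Generate word2vec-style training pairs from a token list.

    CBOW:      input = the context words, target = the center word.
    Skip-Gram: input = the center word, target = one context word.
    """
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "cbow":
            pairs.append((context, center))
        else:  # skipgram
            pairs.extend((center, c) for c in context)
    return pairs
```

For the sentence "the cat sat" with a window of 1, CBOW predicts "cat" from ["the", "sat"], while Skip-Gram predicts "the" and "sat" from "cat".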
Graph Convolutional Network. Graph convolutional networks are architectures for graphs based on message passing. The algorithm is analogous to label propagation, which passes labels via message passing, whereas the GCN algorithm passes input features [8]; the GCN therefore provides feature smoothing. The GCN algorithm first gathers the attribute vectors of a node's neighbors and applies an aggregation function; the feature size remains the same after aggregation [9]. The aggregated feature vector is then passed through a dense neural network layer, whose output is the new vector representation of that node. This process is repeated for every other node of the graph. If the GCN has more than one layer, the updated vector is again aggregated with its neighbors and passed through a dense layer to form the replacement vector of that node [10]. The dense-layer processing is the same as in a convolutional neural network; the only difference is the pre-processing step before each convolution, which applies an aggregation function. Here, the size
220 A. Revanth and C. P. Prathibamol
of the node vectors coming out of the GCN layer is determined by the number
of units in the neural network layer [13].
The document-modelling architecture uses a GCN to extract features from text. Its major drawback is, again, that the dataset must contain both images and text captions with an established relation between them, and such datasets are difficult to scale for further training [14]. Here, three separate models form a continuous pipeline, so the training time is high and the dataset must contain captions related to the image dataset [15].
The number of GCN layers should not be large, even when an optimization algorithm is used. Since the probability distribution is expressed in terms of linear data, when the dataset becomes large the distribution is effectively treated as a Gaussian and the differences between personalities cannot be determined [16]. A word corpus is also required to train the network, which is not available in most raw image datasets. Moreover, this model does not involve human emotions directly [17].
3 Method
3.1 Preprocessing
Pre-processing the data is necessary for this model: since the architecture consists only of GCNs, it accepts only graph-structured features. In the pre-processing step, the given dataset of images is therefore converted into a network-graph data structure by converting the pixel intensity values into graph nodes, forming edges between them, and removing nodes with self-loops during the feature selection process [18,25]. Figure 1 gives a pictorial representation of how an image is converted.
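A minimal sketch of the pre-processing described above, under stated assumptions: the luminance weights for the color-to-grayscale conversion and the 4-neighbourhood edge rule are illustrative choices, as the paper does not specify them:

```python
import numpy as np

def image_to_graph(img, threshold=0.0):
    """Convert an image to (nodes, edges) in the spirit of Sect. 3.1 (sketch).

    img: HxW grayscale or HxWx3 color array. Color images are first reduced
    to a single channel (standard luminance weights -- an assumption; the
    paper only states that a conversion is applied).
    Pixels with intensity <= threshold are dropped, mimicking the removal of
    background nodes; self-loops are never created.
    """
    img = np.asarray(img, dtype=float)
    if img.ndim == 3:  # RGB -> single channel
        img = img @ np.array([0.299, 0.587, 0.114])
    h, w = img.shape
    keep = img > threshold
    nodes = {(r, c): img[r, c] for r in range(h) for c in range(w) if keep[r, c]}
    edges = []
    for (r, c) in nodes:
        for dr, dc in ((0, 1), (1, 0)):  # 4-neighbourhood, each edge once
            nb = (r + dr, c + dc)
            if nb in nodes:
                edges.append(((r, c), nb))
    return nodes, edges
```

On a 2 × 2 image with one zero pixel and threshold 0, this yields three nodes and the two edges connecting them.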
3.2 Algorithm
The architecture of this algorithm consists of a set of single-layer GCNs used to determine the graph distribution of each emotion. This approach considers four emotions: angry, happy, sad, and surprise. By splitting the data by label and training separately, a final graph distribution is obtained for each emotion, and the four models are evaluated separately to analyze their performance. A separate two-layer GCN model is trained on the whole dataset of all emotions and compared against the ground-truth distribution to determine how well the dataset performs with the GCN model. The ground-truth distribution is simply the graph trained on the consecutive dense layer [21]. The single-layer and two-layer GCN algorithms are explained below:
1. For an image $I_i$, its feature $f_i$ is determined by $f_i = F(I_i)$.
2. The emotion correlation matrix $g$ is obtained from the sparse diagonal matrix of the graph and fed to the GCN.
3. The operation performed by each GCN layer is

$$f\left(H^{(l)}, A\right) = \sigma\left(A H^{(l)} W^{(l)}\right) \qquad (1)$$
Fig. 1. The pre-processing stages of EmotionGCN. The dataset used for training consists only of monochrome images, but if a color image is encountered, an exception-handling routine converts the three-channel picture (red, green, blue) to a single-channel (black-and-white) image. Each pixel is then replaced by a node, and the arrangement of nodes and edges forms a three-dimensional graph representation of the image; distances between objects in the image are represented by real distances in the graph. Edges and nodes with little correlation to the face (the surroundings) are removed using a threshold [22].
graph structure is given as a sparse diagonal matrix. This produces a node-level matrix in which each row represents the output feature $f_i$. The operation performed by every GCN layer is

$$f\left(H^{(l)}, A\right) = \sigma\left(A H^{(l)} W^{(l)}\right) \qquad (2)$$

where
– $H^{(l)}$ is the input node feature matrix,
– $W^{(l)}$ is the transformation matrix with learnable weights,
– $\sigma$ is a nonlinear activation function.
In practice, the performance of a single-layer GCN is not up to the mark, so multiple GCN layers can be stacked to form a multilayer GCN, where the output feature of the $i$-th GCN layer is normalized and fed to the $(i+1)$-th layer. The output feature vectors of the final GCN layer are passed through a nonlinear activation function such as softmax or ReLU to perform the specified task [24]. The output of the two-layer GCN has the form

$$Z = \operatorname{softmax}\left(A\,\sigma\left(A X W^{(0)}\right) W^{(1)}\right) \qquad (3)$$
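Equations (1)–(3) can be sketched in a few lines of NumPy. This is an illustration, not the authors' implementation: the symmetric normalization of the adjacency matrix (the standard GCN pre-processing) and the ReLU choice for σ are assumptions, and the random weights stand in for learned ones.

```python
import numpy as np

def normalize_adjacency(A):
    """D^{-1/2} (A + I) D^{-1/2}: adjacency with self-loops, symmetrically
    normalized (standard GCN pre-processing; assumed here)."""
    A = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return d_inv_sqrt @ A @ d_inv_sqrt

def gcn_layer(H, A_hat, W):
    """One layer, Eqs. (1)/(2): f(H, A) = ReLU(A_hat @ H @ W)."""
    return np.maximum(0.0, A_hat @ H @ W)

def two_layer_gcn(X, A_hat, W0, W1):
    """Eq. (3): Z = softmax(A_hat @ ReLU(A_hat @ X @ W0) @ W1), row-wise."""
    logits = A_hat @ gcn_layer(X, A_hat, W0) @ W1
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Each output row is a probability distribution over classes for one node, so the rows of Z sum to one.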
After the graph convolution operation is done, the next step is to find the emotion correlation graph. This graph can be constructed by following one of
Fig. 3. The evaluation metrics used in the models: a) confusion matrix of the two-layer GCN model trained on all four emotions (angry, happy, sad, surprise); b) confusion matrix of the fully connected neural network model trained on all four emotions; c) confusion matrix of the multi-layer convolutional neural network model trained on all four emotions.
repeated for all other emotions in the set [26,27]. This method is similar to finding the probability of the $i$-th emotion given the $j$-th emotion, $p(i \mid j)$, defined as

$$p(i \mid j) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{d(i,j)^2}{2\sigma^2}\right) \qquad (4)$$
where,
Fig. 4. Degree distributions of the four emotions: angry, happy, sad, and surprise.
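The Gaussian similarity of Eq. (4) can be evaluated directly; the sketch below is illustrative, with the distance d(i, j) supplied by the caller (in the paper it would come from distances on the psychological model):

```python
import math

def emotion_similarity(d, sigma=1.0):
    """Eq. (4): p(i|j) = 1/(sqrt(2*pi)*sigma) * exp(-d(i,j)^2 / (2*sigma^2)).

    d: distance between emotions i and j (caller-supplied; illustrative).
    sigma: bandwidth of the Gaussian.
    """
    return math.exp(-d * d / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)
```

As expected, the similarity is maximal at zero distance and decays monotonically as d(i, j) grows.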
4 Experimentation
4.1 Dataset
The experiments are conducted on the FER-2013 dataset, which consists of 35,685 examples of 48 × 48-pixel grayscale face images. The dataset covers seven emotions; owing to resource constraints, only four are considered in this research: angry, happy, sad, and surprise. Of this data, 80% is used for training and 20% for testing [28]. All images in the dataset are monochromatic, but an explicit color-to-grayscale conversion is included in the pre-processing algorithm to avoid errors.
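A sketch of preparing the data as described above. The label indices (0 = angry, 3 = happy, 4 = sad, 5 = surprise) follow the standard FER-2013 convention, and the `(label, pixel_string)` row format mirrors the dataset's CSV file; both are assumptions, since the paper does not give these details.

```python
import random
import numpy as np

# Four of FER-2013's seven classes, using the dataset's standard label
# indices (an assumption -- the paper does not list the indices it used).
KEEP = {0: "angry", 3: "happy", 4: "sad", 5: "surprise"}

def parse_fer_rows(rows, train_frac=0.8, seed=0):
    """rows: iterable of (label, pixel_string) pairs, where pixel_string holds
    48*48 space-separated grayscale values as in fer2013.csv.
    Keeps only the four emotions used in the paper and returns
    (train, test) lists of (image, label), split train_frac / 1 - train_frac."""
    samples = []
    for label, pixels in rows:
        if label in KEEP:
            img = np.array(pixels.split(), dtype=np.uint8).reshape(48, 48)
            samples.append((img, label))
    random.Random(seed).shuffle(samples)  # shuffle before splitting
    cut = int(train_frac * len(samples))
    return samples[:cut], samples[cut:]
```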
References
1. Borth, D., Ji, R., Chen, T., Breuel, T., Chang, S.F.: Large-scale visual sentiment
ontology and detectors using adjective noun pairs. In: Proceedings of the 21st ACM
International Conference on Multimedia, pp. 223–232 (2013)
2. Chiang, W.L., Liu, X., Si, S., Li, Y., Bengio, S., Hsieh, C.J.: Cluster-GCN: an
efficient algorithm for training deep and large graph convolutional networks. In:
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining, pp. 257–266 (2019)
3. Miranda-Correa, J.A., Abadi, M.K., Sebe, N., Patras, I.: Amigos: a dataset for
affect, personality and mood research on individuals and groups. IEEE Trans.
Affect. Comput. 12(2), 479–493 (2018)
4. Farnadi, G., et al.: Computational personality recognition in social media. User
Model. User-Adap. Inter. 109–142 (2016). https://doi.org/10.1007/s11257-016-9171-0
5. Gao, H., Zhengyang, W., Shuiwang, J.: Large-scale learnable graph convolutional
networks. In: Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining, pp. 1416–1424 (2018)
6. Gautam, K.S., Senthil Kumar, T.: Video analytics-based facial emotion recognition
system for smart buildings. Int. J. Comput. Appl. 43(9), 858–867 (2021)
7. Giannopoulos, P., Perikos, I., Hatzilygeroudis, I.: Deep learning approaches for
facial emotion recognition: a case study on FER-2013. In: Hatzilygeroudis, I.,
Palade, V. (eds.) Advances in Hybridization of Intelligent Methods. SIST, vol.
85, pp. 1–16. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-66790-4_1
8. Grattarola, D., Alippi, C.: Graph neural networks in tensorflow and keras with
spektral. arXiv preprint arXiv:2006.12138 (2020)
9. Jonathon, S.H., Paul, H.L.: Automatically annotating the mir flickr dataset: exper-
imental protocols, openly available data and semantic spaces. In: Proceedings of
the International Conference on Multimedia Information Retrieval, pp. 547–556
(2010)
10. He, T., Xiaoming, J.: Image emotion distribution learning with graph convolutional
networks. In: Proceedings of the 2019 on International Conference on Multimedia
Retrieval, pp. 382–390 (2019)
11. Keshari, T., Palaniswamy, S.: Emotion recognition using feature-level fusion of
facial expressions and body gestures. In: 2019 International Conference on Com-
munication and Electronics Systems (ICCES), pp. 1184–1189. IEEE (2019)
12. Kumar, M.P., Rajagopal, M.K.: Facial emotion recognition system using entire
feature vectors and supervised classifier. In: Deep Learning Applications and Intel-
ligent Decision Making in Engineering, pp. 76–113. IGI Global (2021)
13. Li, G., Zhang, M., Li, J., Lv, F., Tong, G.: Efficient densely connected convolutional
neural networks. Pattern Recognit. 109, 107610 (2021)
14. Majumder, N., Poria, S., Gelbukh, A., Cambria, E.: Deep learning-based document
modeling for personality detection from text. IEEE Intell. Syst. 32(2), 74–79 (2017)
15. Melekhov, I., Juho, K., Esa, R.: Siamese network features for image matching. In:
2016 23rd International Conference on Pattern Recognition (ICPR), pp. 378–383.
IEEE (2016)
16. Pennington, J., Richard, S., Christopher, D.M.: Glove: global vectors for word
representation. In: Proceedings of the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
17. Pinson, M.H., Choi, L.K., Bovik, A.C.: Temporal video quality model accounting
for variable frame delay distortions. IEEE Trans. Broadcast. 60(4), 637–649 (2014)
18. Prathibhamol, C.P., Ashok, A.: Solving multi label problems with clustering and
nearest neighbor by consideration of labels. In: Advances in Signal Processing
and Intelligent Recognition Systems. AISC, vol. 425, pp. 511–520. Springer, Cham
(2016). https://doi.org/10.1007/978-3-319-28658-7_43
19. Raj, K.S., Kumar, P.: Automated human emotion recognition and analysis using
machine learning. In: 2021 12th International Conference on Computing Commu-
nication and Networking Technologies (ICCCNT), pp. 1–9. IEEE (2021)
20. Sachin Saj, T.K., Babu, S., Reddy, V.K., Gopika, P., Sowmya, V., Soman, K.P.:
Facial emotion recognition using shallow CNN. In: Thampi, S., Trajkovic, L., Li,
KC., Das, S., Wozniak, M., Berretti, S. (eds.) Machine Learning and Metaheuristics
Algorithms, and Applications. SoMMA 2019. Communications in Computer and
Information Science, vol. 1203, pp. 144–150. Springer, Singapore (2019). https://doi.org/10.1007/978-981-15-4301-2_12
21. Subramanian, R., Julia, W., Abadi, M.K., Vieriu, R.L., Winkler, S., Sebe, N.:
Ascertain: emotion and personality recognition using commercial sensors. IEEE
Trans. Affect. Comput. 9(2), 147–160 (2016)
22. Thushara, S., Veni, S.: A multimodal emotion recognition system from video.
In: 2016 International Conference on Circuit, Power and Computing Technologies
(ICCPCT), pp. 1–5. IEEE (2016)
23. Sai Prathusha, S., Suja, P., Tripathi, S., Louis, R.: Emotion recognition from facial
expressions of 4D videos using curves and surface normals. In: Basu, A., Das,
S., Horain, P., Bhattacharya, S. (eds.) IHCI 2016. LNCS, vol. 10127, pp. 51–64.
Springer, Cham (2017). https://doi.org/10.1007/978-3-319-52503-7_5
24. Wang, M., et al.: Deep graph library: a graph-centric, highly-performant package
for graph neural networks. arXiv preprint arXiv:1909.01315 (2019)
25. Wang, X., Yufei, Y., Abhinav, G.: Zero-shot recognition via semantic embeddings
and knowledge graphs. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 6857–6866 (2018)
26. Wang, Y., Yanzhao, X., Yu, L., Lisheng, F.: G-cam: graph convolution network
based class activation mapping for multi-label image recognition. In: Proceedings
of the 2021 International Conference on Multimedia Retrieval, pp. 322–330 (2021)
27. Wang, Y., Yanzhao, X., Yu, L., Ke, Z., Xiaocui, L.: Fast graph convolution network
based multi-label image recognition via cross-modal fusion. In: Proceedings of the
29th ACM International Conference on Information & Knowledge Management,
pp. 1575–1584 (2020)
28. Yang, J., Dongyu, S., Ming, S.: Joint image emotion classification and distribution
learning via deep convolutional neural network. In: IJCAI, pp. 3266–3272 (2017)
On the Role of Depth Predictions for 3D
Human Pose Estimation
1 Introduction
3D Human Pose Estimation (HPE) is the process of producing 3D body landmarks from sensor input that match the spatial position and configuration of the individuals of interest. In single-view HPE, the sensor input is one image or camera view containing human subjects, and the goal is to predict the 3D coordinates of the joints of the subjects' skeletons. There are many approaches to single-view human pose estimation; among them we highlight two important dichotomies: top-down vs. bottom-up and frame-by-frame vs. sequence-to-sequence.
Top-down approaches [4,25,27,32,35,39,40,48] first create a bounding box for each subject in an image and then apply a pose estimation module to the bounded image. Bottom-up approaches [12,18,26,29,49,53] start by predicting 3D joint locations and then assign these keypoints to individual actors using clustering algorithms. Top-down approaches are more common for 3D HPE, as most publicly available datasets contain only one person per frame.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 230–247, 2023.
https://doi.org/10.1007/978-3-031-18461-1_15
Hypothesis 2. Depth estimations that are more strongly correlated with the
z-coordinates of joint locations lead to a lower average mm error. A higher cor-
relation between depth values and joint z-coordinates indicates a lower degree
polynomial can be used to model this relationship. In this case, a lower capacity
network can be used to model this dependence, which is displayed in our investi-
gations of hypothesis one. When compared to the network developed in [27], our
network which utilizes depth values is less complex, yet produces more accurate
results.
We test and validate this hypothesis by calculating correlation and statistical significance levels for individual joints sampled by camera and action. Subsequently, we compare the same joints' average mm error for high-correlation sub-samples versus low-correlation sub-samples. This statistical analysis lends
232 A. Diaz-Arias et al.
credence to our hypothesis. We discuss shortcomings of the data, such as occlusions, and suggest directions to further explore this theory. Our experiments provide evidence for the idea that the community should use correlation as a metric to gauge the efficacy of depth estimators, so as not to lose sight of the desired utility of depth maps.
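The correlation analysis described above amounts to computing a Pearson coefficient between estimated depth values and joint z-coordinates; a minimal sketch (the function name is ours, not from the paper):

```python
import numpy as np

def pearson_r(depth, z):
    """Pearson correlation between predicted depth values and joint
    z-coordinates. A value near 1 indicates the depth estimate is close to
    an affine function of z, so a lower-capacity network suffices to model
    the relationship (cf. Hypothesis 2)."""
    d = depth - depth.mean()
    w = z - z.mean()
    return float((d @ w) / np.sqrt((d @ d) * (w @ w)))
```

An affinely rescaled copy of z yields r = 1, while a sign flip yields r = -1.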
The results of our investigation emphasize the role of depth in 3D reconstruc-
tion. Our contributions can be summarized as follows:
2 Methodology
Our goal is to estimate three-dimensional body joint locations given a three-
dimensional input where the third dimension of the input is “orthogonal” to the
pixel coordinate joint locations. We will argue later that this increased dimen-
sionality is critical to our network’s success. Additionally, we show this extra
dimension is correlated with the z-coordinate, thus justifying its utility. Formally,
let $x \in \mathbb{R}^{3J}$ denote our input and denote the output as $y \in \mathbb{R}^{3J}$, where $J$ is the
number of joints to be estimated. We seek to learn a function $f : \mathbb{R}^{3J} \to \mathbb{R}^{3J}$
that minimizes the joint reconstruction error:

$$\min_{f} \; \frac{1}{J} \sum_{i=1}^{J} \left\| f(x_i) - y_i \right\|_2^2 .$$
Let $(u, v)$ denote the pixel coordinates of a joint and let $D$ denote the depth map;
then $x_i = (u, v, D(u, v))$. We estimate the 3D joint locations in the camera reference
frame. We aim to approximate the reconstruction function $f$ using a neural network
as a function approximator.
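As an illustrative sketch (not the authors' implementation), the objective above can be evaluated in NumPy; the array names `pred` and `target` are our own:

```python
import numpy as np

def reconstruction_error(pred, target):
    """Mean squared joint reconstruction error, (1/J) * sum_i ||f(x_i) - y_i||_2^2.

    pred, target: arrays of shape (J, 3) holding the estimated and
    ground-truth 3D joint locations for J joints.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Squared Euclidean distance per joint, averaged over the J joints.
    return np.mean(np.sum((pred - target) ** 2, axis=1))
```

Identical skeletons yield an error of zero; training drives this quantity down over the training set.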
Figure 1 provides a diagram with the basic building blocks of our network.
Our network is modeled after [25]. In this direction, the network uses low dimen-
sional input (i.e. the Cartesian coordinates of the skeleton) and is based on their
deep multi-layer neural network with batch-normalization (BN) and dropout
(Drop) to reduce overfitting. Our network has approximately 7 million parameters,
but due to the low-dimensional input it is easily integrated into a real-time system.
We further note that there are four intrinsic parameters required for the
reconstruction (assuming only one camera is used), but the z-value varies depend-
ing on the pixel coordinate inputs. This is the major hurdle, in our opinion, to
achieve high quality reconstruction. We believe independent of the complexity
or novelty of the neural network topology these networks will under-perform any
counterpart using additional input that is “orthogonal” to the pixel level infor-
mation. In the presence of the true depth value the problem is still “ill-posed”,
but significantly more tractable. Furthermore, what we exploit is a noisy depth
value that at most has varying correlation for differing joints when considered
across our sampling procedure (moderately high to weak). Nonetheless we still
achieve state-of-the-art results on the H36M dataset.
We train the depth estimation network with respect to the following losses:

$$\mathcal{L}_{\mathrm{MSE}}(y, \hat{y}) = \frac{1}{n} \sum_{p=1}^{n} \left\| y_p - \hat{y}_p \right\|_2^2$$
and the gradient loss defined over the depth image, i.e.,

$$\mathcal{L}_{\mathrm{grad}}(y, \hat{y}) = \frac{1}{n} \sum_{p=1}^{n} \left\| \nabla_x (y_p, \hat{y}_p) \right\|_2^2 + \left\| \nabla_y (y_p, \hat{y}_p) \right\|_2^2 ,$$
where y is the ground truth, ŷ is the predicted output, n is the number of joints,
and ∇x , ∇y are the gradients in the direction of x, y, respectively.
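The two losses can be sketched in NumPy as follows. The source does not spell out how $\nabla_x(y_p, \hat{y}_p)$ acts on the pair, so this sketch assumes the common reading of a difference between the ground-truth and predicted image gradients:

```python
import numpy as np

def l_mse(y, y_hat):
    """Mean squared error between ground-truth and predicted values."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

def l_grad(y, y_hat):
    """Gradient loss over a depth image: penalizes mismatches between
    the horizontal and vertical gradients of y and y_hat."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    gy, gx = np.gradient(y)          # gradients of the ground truth
    gy_h, gx_h = np.gradient(y_hat)  # gradients of the prediction
    return np.mean((gx - gx_h) ** 2 + (gy - gy_h) ** 2)
```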
3 Results
Before moving to the numerical results, we first provide examples of our recon-
struction relative to the ground truth from differing camera angles and distinct
poses.
Fig. 3. Blue skeletons are ground truth poses from H36M while the corresponding red
skeletons are the predicted poses produced by our network. (Color figure online)
We report errors for all joints using rigid alignment (RA) and without rigid alignment (W/O RA). We
note that when no rigid alignment is used, the root joint is centered for both the
prediction and the ground truth 3D pose.
We perform rigid alignment by finding $R \in SO(3)$ and $t \in \mathbb{R}^3$ such that

$$\sum_{i=1}^{J} \left\| (R p_i + t) - y_i \right\|_2^2$$

is minimized, where $p_i$ is the predicted joint location and $y_i$
is the ground truth.
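This is the classical orthogonal Procrustes problem; a minimal NumPy sketch using the Kabsch algorithm (our own illustration, not the authors' code):

```python
import numpy as np

def rigid_align(pred, target):
    """Find R in SO(3) and t in R^3 minimizing sum_i ||(R p_i + t) - y_i||^2.

    pred, target: arrays of shape (J, 3). Returns (R, t).
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mu_p, mu_y = pred.mean(axis=0), target.mean(axis=0)
    P, Y = pred - mu_p, target - mu_y       # center both point sets
    U, _, Vt = np.linalg.svd(P.T @ Y)       # SVD of the 3x3 covariance
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_y - R @ mu_p
    return R, t
```

Applying the recovered $(R, t)$ to the predictions before computing the per-joint error yields the RA protocol numbers.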
Results on H36M. Qualitative results on H36M are shown in Fig. 3, while
Table 1 provides a quantitative comparison between previous methods’ errors
on the H3.6M data set subdivided by each action. Amongst the frame-by-frame
methods, our network outperforms in a majority of the action sequences and
achieves the lowest average error by over 1mm. We also compare our network
to Sequence-to-Sequence methods. This approach is becoming more prevalent
as it uses temporal information, i.e., neighboring video frames, to improve 3D
reconstruction results. Intuitively, the data of neighboring frames should provide
extra depth information to the network, but it is not as clear through which
mechanisms this is achieved. Given the advantage of temporal networks, our
frame-by-frame approach compares favorably to the state-of-the-art Sequence-to-
Sequence methods. Lastly, we note that our network architecture is largely based
on [25], yet with the inclusion of depth information provides a significant decrease
in 3D reconstruction error, namely over 10mm reduction in reconstruction error.
Next, we move to the correlation analysis that led us to using depth values of
specific joints as additional input. While noisy depth has intuitive utility and
Table 1. Detailed results on the H3.6M data set [16] under protocol 1. We report with
Rigid Alignment (RA) and without Rigid Alignment (W/O RA). MPJPE is subdivided
by action and the average across all actions is provided. We boldface the values that are
the best among frame-to-frame, while underlined values denote the best results among
all methods. We indicate the sequence-to-sequence methods using input window size
greater than 1 with †.
Methods Dir Dis Eat Gre Phon Pose Pur Sit SitD Smo Phot Wait Walk WalkD WalkP Avg
RA
Zhao [52] 37.8 49.4 37.6 40.9 45.1 41.4 40.1 48.3 50.1 42.2 53.5 44.3 40.5 47.3 39.0 43.8
Ours 23.5 27.2 27.6 27.2 26.2 27.9 24.8 27.9 41.4 32.9 39.4 28.6 20.1 37.2 25.0 29.1
W/O RA
Zhou [55] 54.8 60.7 58.2 71.4 62.0 65.5 53.8 55.6 75.2 112 64.2 66.1 51.4 63.2 55.3 64.9
Moreno [28] 53.5 50.5 65.8 62.5 56.9 60.6 50.8 56.0 79.6 63.7 80.8 61.8 59.4 68.5 62.1 62.2
Chen [9] 53.3 46.8 58.6 61.2 56.0 58.1 48.9 55.6 73.4 60.3 76.1 62.2 35.8 61.9 51.1 57.5
Pavllo [33] 47.1 50.6 49.0 51.8 53.6 61.4 49.4 47.4 59.3 67.4 52.4 49.5 55.3 39.5 42.7 51.8
Martinez [25] 37.7 44.4 40.3 42.1 48.2 54.9 44.4 42.1 54.6 58.0 45.1 46.4 47.6 36.4 40.4 45.5
Zheng [54] † 49.2 49.7 38.7 42.7 40.0 40.9 50.7 42.2 47.0 46.1 43.4 46.7 39.8 36.4 38.0 43.5
Hossain [15] 35.2 40.8 37.2 37.4 43.2 44.0 38.9 35.6 42.3 44.6 39.7 39.7 40.2 32.8 35.5 39.2
Lee [19] † 32.1 36.6 34.3 37.8 44.5 49.9 40.9 36.2 44.1 45.6 35.3 35.9 30.3 37.6 35.5 38.4
Pavllo [33] † 35.2 40.2 32.7 35.7 38.2 45.5 40.6 36.1 48.8 47.3 37.8 39.7 38.7 27.8 29.5 37.8
Cai [6] † 32.9 38.7 32.9 37.0 37.3 44.8 38.7 36.1 41.0 45.6 36.8 37.7 37.7 29.5 31.6 37.2
Xu [45] 35.8 38.1 31.0 35.3 35.8 43.2 37.3 31.7 38.4 45.5 35.4 36.7 36.8 27.9 30.7 35.8
Shan [36] † 34.8 38.2 31.1 34.4 35.4 37.2 38.3 32.8 39.5 41.3 34.9 35.6 32.9 27.1 28.0 34.8
Liu [23] † 34.5 37.1 33.6 34.2 32.9 37.1 39.6 35.8 40.7 41.4 33.0 33.8 33.0 26.6 26.9 34.7
Zhan [51] † 31.2 35.7 31.4 33.6 35.0 37.5 37.2 30.9 42.5 41.3 34.6 36.5 32.0 27.7 28.9 34.4
Zeng [50] † 34.8 32.1 28.5 30.7 31.4 36.9 35.6 30.5 38.9 40.5 32.5 31.0 29.9 22.5 24.5 32.0
Ours 29.9 33.1 32.3 30.8 33.5 37.1 27.1 34.5 48.7 35.7 49.3 37.3 23.5 40.6 27.8 34.8
ground truth depth maps have been used in the literature, it was unclear to
what extent noisy depth maps could aid in the reconstruction problem.
We conducted a statistical analysis to assess the extent of possible correlation
between depth values and the z-coordinate of joints in the camera reference
frame. We chose to sub-sample our data by camera and action as this allows us
to observe the extremes of correlation values. This method also better indicates
the effect of joint occlusion on the reconstruction. Thus, we believe this is a
natural sampling procedure for obtaining an accurate picture of how much
correlation is present.
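Spearman's rank correlation, used in the tables that follow, can be sketched in plain Python (an illustration; the analysis presumably used a standard statistics package):

```python
def _ranks(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```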
First, we explored descriptive statistics of the distributions to check the
normality assumption. If a distribution is non-normal, we must apply
non-parametric methods. Since the accuracy of different normality tests can
depend on sample size and other characteristics of the distribution, we decided
to perform several tests, namely the Shapiro-Wilk, Anderson-Darling, and
D'Agostino tests [2,11,37] for normality. The tests were done for data samples
of sizes 500, 1,000, 5,000 and 100,000.
Tables 2 and 3 present the results of the tests for a distribution of 5,000 depth
values by specific joints in an attempt to achieve the most accurate p-values. As
shown in the Shapiro-Wilk and D’Agostino tests, the reported p-values war-
ranted rejection of the null hypothesis of the normality assumption, i.e., the
p-values of almost all tests were lower than the significance level α = 0.05. There
were only two cases when the D’Agostino test produced p-values that indicated
that the distribution was normal (see underlined values in Table 2 and 3). All
values of the test statistic in the Anderson-Darling tests were greater than the
corresponding critical values, which indicated violation of the normality assumption. The
consensus conclusion was that the depth values were not normally
distributed. This conclusion was also supported by skewness and kurtosis. All
kurtosis tests confidently rejected the null hypothesis that the shape of the dis-
tribution matched the shape of the normal distribution (e.g. peaked shape with
light tails). It can also be seen from histograms in Fig. 5 that some distributions
were multi-modal.
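The skewness and kurtosis checks mentioned above can be sketched as plain-Python sample moments (an illustration, not necessarily the exact estimators used):

```python
def skewness(xs):
    """Sample skewness: the third standardized central moment (0 for symmetric data)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(xs):
    """Excess kurtosis: the fourth standardized moment minus 3
    (0 for a normal distribution; positive means heavy tails)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3.0
```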
Table 2. Results of statistical tests for normality for depth values by joints. The upper
value for all tests is the test statistic. The lower value for the Shapiro-Wilk and
D'Agostino tests is the p-value, and for the Anderson-Darling test it is the critical value.
Table 3. Results of statistical tests for normality for depth values by joints (continued).
Table 4. Examples of moderately high Spearman and Kendall tau rank correlations.
RK, LK, and LA denote right knee, left knee, and left ankle, respectively, and (C, A)
denotes camera and action.
Table 5. Examples of weak Spearman and Kendall tau rank correlations that
are statistically insignificant. RA, LA, RW, LW, and LE denote right ankle, left ankle,
right wrist, left wrist, and left elbow, respectively, and (C, A) denotes the camera and
action.
Table 6. Examples of high negative Spearman and Kendall tau rank correlations
that are statistically significant. TH, RH, LE, RK, MH, and RW denote top of
head, right hip, left elbow, right knee, mid-head, and right wrist, respectively, and (C, A)
denotes the camera and action.
Fig. 4. Correlation vs. average mm error example plots, with best-fit lines demonstrating
the desired negative correlation.
4 Discussion
This work has begun to uncover the role of depth in the 3D human pose estima-
tion problem. Yet, from both an analytic and statistical perspective more work
is needed to fully understand the extent to which depth can improve
reconstruction. While we study the amount of correlation present between the
depth value and a joint’s z-coordinate, further investigation is required to under-
stand what is occurring in individual frames to cause certain joints to have lower
correlation as compared to others. We hypothesize that a large contributor to
an individual joint having vastly different correlation values is the result of joint
occlusions occurring due to camera location in relation to the action. It is well
known that depth maps are imprecise with respect to occlusions. The presence
of additional inputs, e.g. temporal information, can further shrink the search
space of the network leading to more robust 3D pose estimation.
We found that in the event of perfect correlation, i.e., taking the depth
values to be the ground truth z-coordinates, our network can achieve 11 mm
average joint reconstruction error on the validation set. This value acts as an
absolute minimum for reconstruction error. Thus there is room for up to a 70%
improvement for our network. We are optimistic that a better understanding of
the relationship between depth values and z-coordinates will help close this gap.
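As a quick arithmetic check of the stated headroom, using the 34.8 mm average of our method without rigid alignment from Table 1 as the baseline:

```python
baseline_mm = 34.8  # our average error, protocol 1 without rigid alignment (Table 1)
floor_mm = 11.0     # error when ground-truth z-coordinates are used as the depth input
headroom = (baseline_mm - floor_mm) / baseline_mm
print(f"{headroom:.0%}")  # prints 68%, i.e. "up to a 70%" improvement
```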
5 Conclusions
To the best of our knowledge, it is atypical to report the average millimeter
error on a per joint basis, so it is difficult to compare our network in these terms
to our predecessors. However, given the substantial reduction we have seen,
it is relatively safe to assume that the per-joint errors were also substantially
lowered. We note that the joint with the highest correlation value across the
entire data set, e.g., the right ankle (RA) with a value of 0.345, does not
necessarily have the lowest millimeter error: RA has an error of 48.79 mm under
protocol 1, while the minimum of 9.99 mm is observed for the right and left hips,
which had a correlation of 0.276. This is easily explained by the variances of the
joint positions, which are naturally much larger for the extremities than for
joints located near the root.
The proposed system outperforms previous frame-by-frame top-down 3D
pose estimators by a significant margin. It furthermore compares favorably to
state-of-the-art Sequence-to-Sequence methods when using input sequences of
length one frame. We hope that this study invigorates researchers to improve
upon the state-of-the-art for depth map estimation so that the full potential of
human pose estimation can be realized.
References
1. Abadi, M., et al.: Tensorflow: large-scale machine learning on heterogeneous dis-
tributed systems. arXiv preprint arXiv:1603.04467 (2016)
18. Kundu, J.N., Revanur, A., Waghmare, G.V., Venkatesh, R.M., Babu, R.V.: Unsu-
pervised cross-modal alignment for multi-person 3D pose estimation. In: Vedaldi,
A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12358, pp.
35–52. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58601-0 3
19. Lee, K., Lee, I., Lee, S.: Propagating LSTM: 3d pose estimation based on joint
interdependency. In: Proceedings of the European Conference on Computer Vision
(ECCV), pp. 119–135 (2018)
20. Li, S., Chan, A.B.: 3D human pose estimation from monocular images with deep
convolutional neural network. In: Cremers, D., Reid, I., Saito, H., Yang, M.-H.
(eds.) ACCV 2014. LNCS, vol. 9004, pp. 332–347. Springer, Cham (2015). https://
doi.org/10.1007/978-3-319-16808-1 23
21. Li, S., Zhang, W., Chan, A.B.: Maximum-margin structured learning with deep
networks for 3d human pose estimation. In: Proceedings of the IEEE International
Conference on Computer Vision, pp. 2848–2856 (2015)
22. Li, W., Liu, H., Ding, R., Liu, M., Wang, P., Yang, W.: Exploiting temporal con-
texts with strided transformer for 3d human pose estimation. IEEE Trans. Multi-
media (2022)
23. Liu, R., Shen, J., Wang, H., Chen, C., Cheung, S.C., Asari, V.: Attention mecha-
nism exploits temporal contexts: real-time 3d human pose reconstruction. In: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion, pp. 5064–5073 (2020)
24. Marin-Jimenez, M.J., Romero-Ramirez, F.J., Munoz-Salinas, R., Medina-Carnicer,
R.: 3d human pose estimation from depth maps using a deep combination of poses.
J. Vis. Commun. Image Representation 55, 627–639 (2018)
25. Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for
3d human pose estimation. In: Proceedings of the IEEE International Conference
on Computer Vision, pp. 2640–2649 (2017)
26. Mehta, D., et al.: Xnect: real-time multi-person 3d motion capture with a single
RGB camera. ACM Trans. Graph. (TOG) 39(4), 1–82 (2020)
27. Moon, G., Chang, J.Y., Lee, K.M.: Camera distance-aware top-down approach
for 3d multi-person pose estimation from a single RGB image. In: Proceedings of
the IEEE and CVF International Conference on Computer Vision, pp. 10133–10142
(2019)
28. Moreno-Noguer, F.: 3d human pose estimation from a single image via distance
matrix regression. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 2823–2832 (2017)
29. Nie, X., Feng, J., Zhang, J., Yan, S.: Single-stage multi-person pose machines. In:
Proceedings of the IEEE and CVF International Conference on Computer Vision,
pp. 6951–6960 (2019)
30. Parameswaran, V., Chellappa, R.: View independent human body pose estimation
from a single perspective image. In: Proceedings of the 2004 IEEE Computer Soci-
ety Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004,
vol. 2, pp. II–II. IEEE (2004)
31. Park, S., Hwang, J., Kwak, N.: 3D human pose estimation using convolutional
neural networks with 2d pose information. In: Hua, G., Jégou, H. (eds.) ECCV
2016. LNCS, vol. 9915, pp. 156–169. Springer, Cham (2016). https://doi.org/10.
1007/978-3-319-49409-8 15
32. Pavlakos, G., Zhou, X., Derpanis, K.G., Daniilidis, K.: Coarse-to-fine volumetric
prediction for single-image 3d human pose. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 7025–7034 (2017)
33. Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.: 3d human pose estimation in
video with temporal convolutions and semi-supervised training. In: Proceedings of
the IEEE and CVF Conference on Computer Vision and Pattern Recognition, pp.
7753–7762 (2019)
34. Peng, B., Luo, Z.: Multi-view 3d pose estimation from single depth images. Technical
report, Stanford University, USA, Course ... (2016)
35. Rogez, G., Weinzaepfel, P., Schmid, C.: LCR-net++: multi-person 2d and 3d pose
detection in natural images. IEEE Trans. Pattern Anal. Mach. Intell. 42(5), 1146–
1161 (2019)
36. Shan, W., Lu, H., Wang, S., Zhang, X., Gao, W.: Improving robustness and accu-
racy via relative information encoding in 3d human pose estimation. In: Proceed-
ings of the 29th ACM International Conference on Multimedia, pp. 3446–3454
(2021)
37. Shapiro, S.S., Wilk, M.B.: An analysis of variance test for normality (complete
samples). Biometrika 52(3/4), 591–611 (1965)
38. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support
inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y.,
Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg
(2012). https://doi.org/10.1007/978-3-642-33715-4 54
39. Sun, X., Shang, J., Liang, S., Wei, Y.: Compositional human pose regression. In:
Proceedings of the IEEE International Conference on Computer Vision, pp. 2602–
2611 (2017)
40. Sun, X., Xiao, B., Wei, F., Liang, S., Wei, Y.: Integral human pose regression.
In: Proceedings of the European Conference on Computer Vision (ECCV), pp.
529–545 (2018)
41. Tekin, B., Katircioglu, I., Salzmann, M., Lepetit, V., Fua, P.: Structured prediction
of 3d human pose with deep neural networks. arXiv preprint arXiv:1605.05180
(2016)
42. Véges, M., Lőrincz, A.: Multi-person absolute 3d human pose estimation with
weak depth supervision. In: Farkaš, I., Masulli, P., Wermter, S. (eds.) ICANN
2020. LNCS, vol. 12396, pp. 258–270. Springer, Cham (2020). https://doi.org/10.
1007/978-3-030-61609-0 21
43. Wang, C., Li, J., Liu, W., Qian, C., Lu, C.: HMOR: hierarchical multi-person
ordinal relations for monocular multi-person 3d pose estimation. In: Vedaldi, A.,
Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12348, pp.
242–259. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58580-8 15
44. Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines.
In: Proceedings of the IEEE conference on Computer Vision and Pattern Recog-
nition, pp. 4724–4732 (2016)
45. Xu, T., Takano, W.: Graph stacked hourglass networks for 3d human pose esti-
mation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 16105–16114 (2021)
46. Yang, W., Ouyang, W., Wang, X., Ren, J., Li, H., Wang, X.: 3d human pose esti-
mation in the wild by adversarial learning. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 5255–5264 (2018)
47. Yasin, H., Iqbal, U., Kruger, B., Weber, A., Gall, J.: A dual-source approach for
3d pose estimation from a single image. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 4948–4956 (2016)
48. Zanfir, A., Marinoiu, E., Sminchisescu, C.: Monocular 3d pose and shape estima-
tion of multiple people in natural scenes-the importance of multiple scene con-
straints. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 2148–2157 (2018)
49. Zanfir, A., Marinoiu, E., Zanfir, M., Popa, A.I., Sminchisescu, C.: Deep network
for the integrated 3d sensing of multiple people in natural images. Adv. Neural Inf.
Process. Syst. 31 (2018)
50. Zeng, A., Sun, X., Huang, F., Liu, M., Xu, Q., Lin, S.: SRNet: improving gener-
alization in 3d human pose estimation with a split-and-recombine approach. In:
Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol.
12359, pp. 507–523. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-
58568-6 30
51. Zhan, Y., Li, F., Weng, R., Choi, W.: Ray3d: ray-based 3d human pose estimation
for monocular absolute 3d localization. arXiv preprint arXiv:2203.11471 (2022)
52. Zhao, L., Peng, X., Tian, Y., Kapadia, M., Metaxas, D.N.: Semantic graph convo-
lutional networks for 3d human pose regression. In: Proceedings of the IEEE and
CVF Conference on Computer Vision and Pattern Recognition, pp. 3425–3435
(2019)
53. Zhen, J., et al.: SMAP: Single-Shot Multi-person Absolute 3D Pose Estimation.
In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol.
12360, pp. 550–566. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-
58555-6 33
54. Zheng, C., Zhu, S., Mendieta, M., Yang, T., Chen, C., Ding, Z.: 3d human pose
estimation with spatial and temporal transformers. In: Proceedings of the IEEE
and CVF International Conference on Computer Vision, pp. 11656–11665 (2021)
55. Zhou, X., Huang, Q., Sun, X., Xue, X., Wei, Y.: Weakly-supervised transfer for 3d
human pose estimation in the wild. In: IEEE International Conference on Com-
puter Vision, ICCV, vol. 3, p. 7 (2017)
56. Zhou, X., Sun, X., Zhang, W., Liang, S., Wei, Y.: Deep kinematic pose regression.
In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9915, pp. 186–201. Springer,
Cham (2016). https://doi.org/10.1007/978-3-319-49409-8 17
AI-Based QOS/QOE Framework
for Multimedia Systems
1 Introduction
The increased availability of the Internet and new technologies such as LTE have enabled
the use of all types of services, including video streaming, which come with stringent
requirements in terms of network performance and capacity demands. Streaming appli-
cations have grown exponentially in recent years to be the most dominant contributor to
global internet traffic, and this is because they satisfy a wide variety of customer needs
such as video conferencing, video surveillance, e-learning and stored-video streaming.
With the proliferation of streaming applications, users are continuously raising their
expectation on better services while telecommunication industries jostle to ensure good
quality of experience (QoE), which is a principal measure of users’ perceived quality of
mobile Internet service. Consequently, the issue of adequately supporting all those users
and all those services is a nontrivial one for network operators, as users expect certain
levels of quality to be upheld. This motivates improving both the Quality of Service (QoS) that
operators monitor and manage and the Quality of Experience (QoE) that users perceive,
through the implementation of deep learning models that optimize the overall network
quality as experienced by telecommunication users [7]. QoS defines the
overall characteristics of a service, such as communication, that affect the satisfaction
of a user's service needs [9]. QoS deals with performance aspects of physical systems
and must be ensured by the network providers. QoE is the overall acceptabil-
ity of an application or service, as perceived subjectively by the end user including the
complete end-to-end system effect [5].
Recent studies show that the telecommunication industry loses millions of dollars due
to poor QoE experienced by end users. Accenture also carried out a recent survey
showing that about 82% of customer defection was due to poor QoE. Hence, with
185 million active subscribers in Nigeria, as recorded in December 2019 by the Nigerian
Communications Commission (NCC), the task of improving the QoE of consumers is
imperative. An improved QoE would generate huge revenue for the country and further
sustain the trust and satisfaction of consumers. The aim of this work is to quantitatively
measure users' quality of experience (QoE) on multimedia using different variables
that affect quality of service (QoS), that is, to predict QoE from QoS. The benefit of
this work, when deployed, is that it will help the telecommunication industry determine
the factors that give customers satisfaction while using multimedia systems.
In the literature, some work has been done to improve the performance of streaming
applications using deep learning. The research in [8], on data analysis of video streaming QoE
over mobile networks, used K-means clustering and logistic regression to improve QoE
prediction accuracy. The authors in [10] used a genetic algorithm (GA) and random neural
networks (RNN) to maximize the QoE of video traffic streaming over LTE networks;
they used QoS parameters (SINR, delay, and throughput) and a QoE metric (MOS), with
minimum and maximum delay as input. In their work titled Effect of network QoS on user
QoE for a mobile video streaming service using H.265/VP9 codec, [3] introduced two new
QoS factor variables, initial delay and buffering delay; other QoS parameters used include
packet loss, jitter, and bandwidth. In [7], modelling QoE of internet video streaming by
controlled experimentation and machine learning, bandwidth and latency parameters were
used, and machine learning was used to reduce the number of experiments while
maintaining accuracy.
The authors in [12], in their research evaluating the quality of service in VOIP and
comparing various encoding techniques, used throughput, load, delay, MOS, jitter, and
packet loss parameters on small- and large-scale networks. The encoding techniques used
are G.711, G.723, and G.729; these were used to evaluate the QoS of VOIP technology.
250 L. N. Onyejegbu et al.
In his work [1], QOS for Multimedia Applications with Emphasize on Video
Conferencing, the author used packet loss, E2E delay, and packet delay variation metrics
to analyse QoS performance and its effects when video is streamed over GBR (guaranteed
bit rate) and non-GBR bearers over the LTE network standard.
Implementation of Basic QoS Mechanisms on Videoconferencing Network Model
[2] used packet loss, delay, and jitter to create a model for quality-of-service performance
measurement based on the CARNet network technique; a scheduling algorithm was used
for the implementation.
The author in [11] performed a multi-dimensional prediction of QoE based on network-
related SIFs (System Influence Factors) using machine learning techniques, in the work
titled Machine learning-based QoE prediction for video streaming over LTE network.
The QoS parameters used are delay, jitter, and loss.
The authors in [6] measured and predicted the quality of experience of
DASH-based video streaming over LTE, using two machine learning models to improve
and design a DASH adaptive bitrate control algorithm over the LTE network.
The authors in [4], Subjective and Objective QoE Measurement for H.265/HEVC
Video Streaming over LTE, examined the impact of media-related system on QoE for
video streaming and compared the results of subjective and objective QoE measurements.
Fig. 1. Deep belief networks architecture (Yulita, Fanany, and Arymurthy 2017)
4 Experimental Methodology
Quantitative measurement was used in this work.
1. Data Collection Method: The datasets used for both QoS and QoE, were collected
online.
2. Sampling Method: Active sampling of the experimental space is adopted to ensure
lower training cost and uncompromised accuracy of the QoS-QoE Model.
Fig. 3. First few rows of the datasets (QoS and QoE). Source http://jeremie.leguay.free.fr/qoe/
Figure 3 shows the first few rows of the QoS and QoE datasets used in training the
proposed model.
Scatter plots of each QoS feature against the QoE features considered were plotted, as
shown in Figs. 4 and 5. The scatter plots helped to visualize the relationships between
the variables and the targets.
AI-Based QOS/QOE Framework for Multimedia Systems 253
The correlation matrix was plotted to visualize and detect multicollinearity, as seen
in Fig. 6. Some multicollinearity was found, and since there were many features, the
Variance Inflation Factor (VIF) approach was employed to address it.
The rectified linear (ReLU) activation function was used for the prediction since it is
a regression problem. The regression of the model was plotted for all the QoS features
on each of the selected QoE metrics. The actual versus predicted values for StartUpDelay
and AvgBufferLevel are shown in Figs. 7 and 8, respectively.
From Figs. 7 and 8, it can be seen that multicollinearity exists among
the QoS features.
Using VIF requires that any feature whose VIF is above 5, or whose tolerance is
less than 0.2, be removed (note that tolerance = 1/VIF). The VIF analysis
was carried out, the features whose VIF exceeded 5 were identified, and they were
removed to obtain a more accurate prediction.
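The VIF screening described above can be sketched in NumPy (our own illustrative helper, not the authors' code):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of the design matrix X:
    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on the remaining columns (plus an intercept)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other features
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return out
```

Features with VIF above 5 (tolerance below 0.2) would then be dropped before retraining.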
In Fig. 9, we predicted the value of StartUpDelay using our test data and compared
the predicted values on the right with the actual (ground truth) values on the left.
The predicted values are close to the actual ones.
From Fig. 10 after removing multicollinearity, the model behaved better than when
there was multicollinearity. From the prediction using some QoE measurements, we
were able to get an accuracy of 99% as shown in Fig. 10 and 11 for StartUpDelay and
AVGDownLoadBitRate, respectively using DBN.
A Multilayer Perceptron (MLP) was also built with two dense layers, and all its
hyperparameters were fine-tuned before fitting. The result was visualized alongside
that of the DBN in Figs. 10 and 11. The observed result is equally good, though there
were varying degrees of accuracy with different QoE measurements. From the results
and the plots, as shown in Figs. 12 and 13 for the MLP, greater accuracy was achieved
with the DBN than with the MLP.
Figures 12 and 13 show the MLP plots of actual versus predicted values for
StartUpDelay and AvgVideoBitRate.
Fig. 10. Plot of actual versus predicted value for StartUpDelay using DBN
Fig. 12. Plot of actual versus predicted value for StartUpDelay using MLP
7 Discussion of Result
In this work, we predicted QoE using the required QoS variables that gave good pre-
diction as seen in Fig. 10 and 11. This result is significant because multimedia service
providers need to work with the right variables for better outputs, thereby improving
quality of experience (QoE). The observed results are equally good, as we obtained
varying degrees of accuracy with different QoE measurements. We achieved greater
accuracy using the DBN algorithm than the MLP algorithm.
8 Conclusion
This research explored two deep learning techniques, the Deep Belief Network
(DBN) and the Multilayer Perceptron (MLP), to optimize QoS and QoE in multimedia
applications. A novel predictive QoE model identifying relevant QoS features and how
they influence QoE was presented, and finally a rich QoS-QoE dataset was presented,
which can further be used as a framework to ensure responsible AI in the area of
multimedia systems. The Deep Belief Network algorithm performed better than the
Multilayer Perceptron algorithm.
Snatch Theft Detection Using Deep Learning Models
1 Introduction
Crime manifests in many forms, such as drive-by killings, murder, drug trafficking,
money laundering, black markets, and fraud. There is a need to understand these
categories of crime to reveal the most prevalent crimes and the states with the highest
crime frequencies. This is a necessary precursor to further research into the prediction
and detection of crime in urban cities [1–3].
Crime analysis is a function that involves the systematic identification and analysis of
patterns and trends in crime. Information on crime patterns helps law enforcement
agencies deploy resources more effectively and helps officers identify and apprehend
suspects. Crime analysis also plays a part in formulating solutions to crime problems
and developing crime prevention schemes. Crime analysts review crime reports, arrest
reports, and police calls for service to quickly diagnose emerging patterns, series, and
trends [4]. They analyze these phenomena for all significant factors, occasionally predict
or forecast future circumstances, and produce documents such as reports, newsletters,
and bulletins alerting authorities and related bureaus [5]. This is also one of the strategies
or action plans for combating crime and law-breaking.
Conversely, it is also important to understand why crimes tend to happen in specific
regions. Crime does not happen randomly; it is intended, prearranged, or purposeful.
Crime transpires when the victim's activity space intersects with the criminal's activity
space. The activity space includes daily locations such as the office, workplace, home,
shopping vicinity, school, and entertainment areas [7]. Crime will occur if a site provides
the opportunity to commit a felony and it exists within an offender's awareness space;
shopping malls, recreation areas, and restaurants therefore have higher crime rates. Thus,
analysts can apply crime pattern theory in a more systematic approach and further
investigate and analyze behavior patterns.
One of the most common crimes in urban cities is street crime, that is, criminal acts
committed in public spaces. Examples of street crime are pickpocketing, mugging, and
snatch theft [8]. Snatch theft is the criminal act of forcefully stealing a pedestrian's
personal belongings, such as a necklace, mobile phone, or handbag, using run-and-rob
tactics. Typically, two men work together: one drives the motorcycle and the other does
the stealing [9]. This is a worrisome problem for pedestrians because it can cause
accidents as well as fear, trauma, and anxiety. Pedestrians are sometimes inattentive to
their belongings, which creates the criminal's opportunity to commit a crime.
Government authorities have made several efforts to raise awareness of street crime, but
the crime rate shows otherwise, and as technology proliferates, the style of crime is also
advancing. One approach is surveillance, observing an area using CCTV cameras. Many
regions have already installed CCTV that allows users to monitor or record daily
activities, but manually watching the whole prolonged CCTV video is tiresome, and
incidents can easily be overlooked or missed. Hence it can be quite challenging to detect
a crime scene when it happens.
Recently, numerous studies have been conducted in crime-related areas. Each study used
its own features, which can be considered physical traits. The most commonly used
features are facial information, as in Y. Xia et al. [10], and the scene of the incident, as
in R. Mandal et al. [11]. Classifying the incident area can be advantageous because it
often does not require the subject's awareness and attention. Since snatch theft offenders
usually wear helmets while committing the crime, extracting facial information cannot
be applied.
As for intelligent techniques, Artificial Intelligence (AI) [6] is one of the methods that
can be used for crime detection and pattern prediction. AI can readily learn to classify
snatch theft videos or images, and the crime pattern of snatch theft can be used as a
feature in snatch theft classification. This paper is organized as follows: Sect. 1 presents
the introduction to crime activities; Sect. 2 elaborates on previous research related to
crimes and recognition; Sect. 3 explains the theories of the transfer learning methods
proposed for snatch theft detection and each algorithm used in this study; Sect. 4 details
the experimental protocol along with data collection, optimization, and classification;
discussion of the experimental analysis and results is given in Sect. 5; finally, Sect. 6
concludes our findings.
262 N. F. M. Zamri et al.
2 Related Work
Traditionally, police officers conduct routine patrols in random areas; however, some
studies have been conducted to assist authorities in planning strategic routine patrol
areas [12–19]. As mentioned earlier, CCTV is extensively utilized for monitoring public
areas with a high potential for crime. CCTV images are used for probing after a crime
has happened, specifically as part of a post-incident investigation, since the CCTV
records all activities, including crimes, which can be used as evidence. Additionally, a
CCTV system comprises many cameras and could thus also deter crime. CCTV allows
users to monitor activities at several locations at one time, and the footage can also be
accessed remotely. However, this method of surveillance demands substantial human
resources and is not economical: the extensive videos recorded by the CCTV must be
inspected to detect any anomalous activities, which is time-consuming. Hence, an
intelligent technique for monitoring, identifying, or detecting subjects is vital to replace
this conventional CCTV approach.
Firstly, Norazlin et al. [20] utilized an optical flow vector to analyze the perpetrator's
movement. In their study, the total optical flow is computed for every frame sequence.
If two people move independently, the flow vector of each subject contributes to the
full optical flow; however, the optical flow vector value gradually decreases at the
intersection point between the two subjects. The flow of the movement skeleton before
the juncture is higher than during the intersection. Meanwhile, Suriani et al. [21] used
the sum of vector (SOV) flow movement to observe the interaction or overlapping. The
snatching action occurs as the subjects move toward one another, not while the subjects
are separated or scattered. For a normal movement, the flow is usually constant before
and after the intersection; it is recognizably different once snatching activity occurs,
with the flow abruptly disrupted at the intersection. Here, the Kalman filter approach
was utilized to determine the frames in which the subjects started to crisscross and those
where the intersection ended [20]. As for classification, a supervised SVM was used to
validate the snatch and non-snatch activity accuracy based on the extracted features.
The accuracy obtained was 91% [20], while [21], which utilized the motion vector flow
(MVF), obtained a classification accuracy of 89.43%.
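The dip in total flow at the subjects' intersection, which both [20] and [21] exploit, can be sketched in a few lines of Python. This is a toy illustration over hypothetical per-frame flow magnitudes, not the implementation used in either study:

```python
import math

def total_flow_magnitude(flow_vectors):
    """Sum of optical-flow vector magnitudes for one frame."""
    return sum(math.hypot(dx, dy) for dx, dy in flow_vectors)

def flag_intersection_frames(per_frame_flows, drop_ratio=0.5):
    """Flag frames whose total flow falls below drop_ratio of a slowly
    adapting baseline -- a crude stand-in for the abrupt flow drop
    reported at the point where the two subjects intersect."""
    flags = []
    baseline = per_frame_flows[0]
    for flow in per_frame_flows:
        flags.append(flow < drop_ratio * baseline)
        baseline = 0.9 * baseline + 0.1 * flow  # adapt the baseline slowly
    return flags
```

In practice the per-frame magnitudes would come from a dense optical-flow estimator, and the flagged frames would then be passed to a classifier such as the SVM used in [20].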
Conversely, deep learning (DL) is a promising method for predicting and detecting
crimes. Umair Muneer et al. [22] implemented DL to detect and predict snatch theft
crimes. The proposed method used a Convolutional Neural Network (CNN) model along
with the VGG19 architecture. The image dataset collected was categorized as snatch
and non-snatch, and the proposed model obtained an accuracy of 81% in detecting snatch
theft. The authors found that large numbers of moving targets that imitate the movement
and speed seen in anomalous images are more challenging to detect. The present study
focuses on detecting and predicting snatch theft events by developing a new DL-based
model using crime event videos. Therefore, based on the limitations and research gap
identified in previous work, several CNN models will be evaluated and validated:
AlexNet, GoogleNet, ResNet18, ResNet50, ResNet101, and InceptionV3. The best DL
model will be determined for snatch theft detection purposes.
Transfer learning is a machine learning technique that utilizes the knowledge learned in
one setting to enhance performance when modelling an associated or new set of tasks
[23]. With transfer learning, a dense machine learning model can be established with a
reasonable number of training samples, since the model has already gone through a
pre-training session; much information is transferred from the earlier task to the trained
model. In this study, pre-trained series networks and directed acyclic graph (DAG)
networks are used with both fine-tuning and feature extraction algorithms. The
fine-tuning algorithm complements the feature extraction algorithm by updating the
weights on the new dataset while maintaining the convolution base of the pre-trained
CNN during training. The pre-trained DAG networks use both fine-tuning and feature
extraction algorithms, although a DAG network has a distinctive hierarchical
architecture; the hierarchical architecture of a series network makes it easier to train and
generalize.
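The distinction between feature extraction (training only the new head while the pre-trained base is frozen) and fine-tuning (also updating the base weights) can be illustrated with a deliberately tiny one-weight-per-part "network" in Python. This is a conceptual sketch only, not the CNNs used in this study:

```python
def train_step(base_w, head_w, x, y, lr=0.1, fine_tune=False):
    """One gradient step on squared error for pred = head_w * (base_w * x).

    Feature extraction: only head_w is updated (fine_tune=False).
    Fine-tuning: base_w is updated as well (fine_tune=True).
    """
    feature = base_w * x
    err = head_w * feature - y
    grad_head = err * feature      # dLoss/dhead_w
    grad_base = err * head_w * x   # dLoss/dbase_w
    head_w -= lr * grad_head
    if fine_tune:
        base_w -= lr * grad_base   # the pre-trained weight moves only here
    return base_w, head_w
```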
Originally, the last three layers of AlexNet consist of fc8, prob, and an output layer,
which can classify up to 1000 classes. These three layers are therefore removed and three
new layers added, since the network will be used to classify two categories: 'anomaly',
representing a snatch theft occurrence, and 'normal' otherwise. Figure 2(a) shows the
original layers of AlexNet. The three layers labelled fc8, prob, and output are removed,
and a new fully connected layer (FullyCon), softmax layer, and classification layer
(classoutput) are added, as in Fig. 2(b).
Fig. 2. (a) Original layer of AlexNet; (b) New three last layers for AlexNet
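The replacement of the last three layers can be pictured as simple list surgery over the layer names used in the text; this is a schematic Python illustration, not the actual network object:

```python
# Schematic layer list for AlexNet; "..." stands for the omitted middle layers.
alexnet_layers = ["conv1", "...", "fc7", "fc8", "prob", "output"]

# Drop the 1000-class head and append a two-class head,
# using the layer names given in Fig. 2(b).
new_head = ["FullyCon", "softmax", "classoutput"]
modified_layers = alexnet_layers[:-3] + new_head
```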
4 Experiment Protocol
In this study, the experimental analysis was implemented on a Lenovo Legion 720T
desktop with 16 GB RAM and an 8 GB graphics card. This includes the data acquisition
stage and the training and testing stages for all the CNN models.
The data collection process was divided into three parts: data searching, conversion into
images, and data sorting. YouTube and Google were the online platforms used in this
study to find videos related to snatch theft, with a thorough search done over every page.
Overall, 120 videos related to snatch theft or otherwise (normal) were identified and
compiled, ranging from six to eight seconds in length. The condition for including a
snatch theft video was that the scene must show an act of snatching carried out by
motorbike or on foot. Figure 4 shows some examples of these two categories of
activities. Overall, a total of 13000 images were obtained after conversion into frames.
Next, the data was sorted into two categories: normal and anomaly. For example, as
depicted in Fig. 5, 'normal' is categorised as a person riding a motorcycle along a busy
road, while 'anomaly' covers the snatch theft incident, depicting the perpetrator trying
to snatch. Overall, there are 6500 images for each category.
As stated earlier, data augmentation can be used to prevent overfitting in DL models. In
this study, the augmentation methods used were rotation, Y reflection, translation, and
scaling. Figure 6 shows some samples of the data augmentation used during training of
the DL models.
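Representing an image as a 2-D list of pixel values, two of the listed augmentations (Y reflection and translation) plus a simple rotation can be sketched in Python; real augmentation pipelines operate on full images with interpolation and random parameters, so this is only a minimal illustration:

```python
def reflect_y(img):
    """Y reflection: mirror the image about its vertical axis."""
    return [row[::-1] for row in img]

def translate_right(img, dx, fill=0):
    """Shift the image dx pixels to the right, padding with `fill`."""
    return [[fill] * dx + row[:len(row) - dx] for row in img]

def rotate_90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]
```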
Table 2. Performance measure of each DL model with random data augmentation method during
training
Further, Fig. 7 shows examples of the training and testing accuracy and loss plots for
AlexNet and GoogleNet. Although data augmentation was utilized, overfitting was still
observed in these two models as well as in the other four. The models were tested using
the 30% unseen portion of the database. Hence, to overcome the overfitting, each model
was re-trained using an early stopping criterion, which involves pausing the training
process before the model starts learning the noise in the data. Upon completion of
re-training, the results obtained are as tabulated in Table 3. The highest testing accuracy
was achieved by ResNet50, with 100% sensitivity, while the highest specificity of 97.7%
still belonged to ResNet101, which showed similar results before re-training.
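The early stopping criterion described above can be sketched as follows; the patience value and the per-epoch validation losses are hypothetical, for illustration only:

```python
def train_with_early_stopping(val_losses, patience=3):
    """Return the epoch at which training stops: either the epoch where the
    validation loss has not improved for `patience` epochs, or the last one."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # stop before the model starts fitting noise
    return len(val_losses) - 1
```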
Next, examples of plots of the re-train models, namely Alexnet and Googlenet, as in
Fig. 8, showed that upon re-train, the overfitting no longer occurred. Note that all models
have showed similar properties of testing lines that indicated a good fit of the testing
accuracy. Thus, it can be proven that using both data augmentation and early stopping
criterion successfully overcome the overfitting issue in these CNN models.
Table 3. Performance measure of each DL model with early stopping criterion to combat
overfitting

DLNN model  | Iteration during early stopping | Training accuracy (%) | Testing accuracy (%) | Sens (%) | Spec (%)
AlexNet     | 701 | 89.1 | 89.1 | 100  | 82.5
GoogleNet   | 451 | 99.9 | 89.4 | 100  | 82.5
InceptionV3 | 803 | 95.4 | 93.9 | 100  | 89.1
ResNet18    | 721 | 97.9 | 92.4 | 100  | 86.8
ResNet50    | 930 | 96.2 | 98.9 | 100  | 96.9
ResNet101   | 930 | 99.8 | 98.7 | 99.8 | 97.7
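The sensitivity and specificity columns in Table 3 follow the usual confusion-matrix definitions, which can be computed as below; the counts in the example are hypothetical, not taken from this study:

```python
def sens_spec(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical counts: all 50 anomaly frames caught, 5 of 50 normal
# frames misclassified.
sensitivity, specificity = sens_spec(tp=50, fn=0, tn=45, fp=5)
```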
Based on the experimental analysis, the accuracy and performance of each DL model
trained with the random data augmentation method alone still suffered from overfitting;
hence, the early stopping criterion was utilized to combat this issue.
Fig. 7. Training and testing accuracy of AlexNet (top) and GoogleNet (bottom), showing
overfitting even though data augmentation was utilized
Fig. 8. AlexNet (top) and GoogleNet (bottom) upon completion of the re-training process,
showing a good fit of the testing accuracy based on the early stopping criterion
6 Conclusion
In conclusion, snatch theft detection using six DL models was conducted in this study
to evaluate and validate the ability of each model to classify snatch theft and non-snatch
activities. Datasets of snatch theft and non-snatch theft activities were extracted and
pre-processed from Google and YouTube, comprising 120 videos that generated 13000
images across both categories. These images served as the database to train and test the
DL models. Based on the original training and testing accuracy, ResNet101 showed
perfect accuracy during training and testing. In addition, all models excluding
InceptionV3 also obtained perfect sensitivity, indicating that these CNN models are
capable of classifying all non-snatch samples as 'normal'. However, the plots of training
and testing accuracy showed that overfitting occurred for all models. To overcome this,
all models were re-trained with an early stopping criterion, and the results upon
re-training showed that the overfitting was overcome. Finally, it can be concluded that
the highest testing accuracy of 98.9% was obtained by ResNet50, along with perfect
sensitivity, while ResNet101 attained the highest specificity of 97.7%, similar to its
result before re-training. Future work includes validating and testing these CNN models
in a real-time environment.
Acknowledgment. This research was funded by the Ministry of Higher Education (MOHE)
Malaysia, Grant No: 600-IRMI/FRGS 5/3 (394/2019), Sponsorship File No: FRGS/1/2019/
TK04/UITM/01/3. The authors would like to thank the College of Engineering, Universiti
Teknologi MARA (UiTM), Shah Alam, Selangor, Malaysia for the facilities provided in this
research.
References
1. Department of Statistics Malaysia Official Portal. https://www.statistics.gov.my/index.php?r=
column/cone&menu_id=dDM2enNvM09oTGtQemZPVzRTWENmZz09. Accessed 11 Apr
2020
2. Crime in England and Wales - Office for National Statistics. https://www.ons.gov.uk/people
populationandcommunity/crimeandjustice/bulletins/crimeinenglandandwales/yearendingma
rch2019%0A. Accessed 11 Apr 2020
3. Crime Rates in the United States, 2020 — Best and Worst States – SafeHome. https://www.
safehome.org/resources/crime-statistics-by-state-2020/?msclkid=de9f2788c3a211ecbf018
f3304be6d50. Accessed 11 Apr 2020
4. Rudin, C.: Predictive policing: Using machine learning to detect patterns of crime. In: ECML
PKDD, pp. 515–530 (2013)
5. Crime Analysis_ Defined - Threat Analysis Group. https://www.threatanalysis.com/2020/05/
13/crime-analysis-defined/?msclkid=1b000502c3a611ec8302b3f6443f5834. Accessed 24
Apr 2022
6. What Is Prediction, Detection, And Forecasting In Artificial Intelligence? https://www.ana
lyticsinsight.net/prediction-detection-forecasting-artificial-intelligence/?msclkid=25af24adc
3a711ec9370812fb338cf16. Accessed 13 Oct 2020
7. Crime Pattern Theory - Crime and intelligence analysis_ an integrated real-time approach
8. The Crime Analyst’s Blog_ Crime Patterns, Crime Sprees, and Crime Series.
9. Truntsevsky, Y.V., Lukiny, I.I., Sumachev, A.V., Kopytova, A.V.: A smart city is a safe city:
the current status of street crime and its victim prevention using a digital application. In:
MATEC Web of Conferences 2018, vol. 170 (2018)
10. Xia, Y., Zhang, B., Coenen, F.: Face occlusion detection based on multi-task convolution
neural network. In: 2015 12th International Conference on Fuzzy Systems and Knowledge
Discovery, FSKD 2015, pp. 375–379 (2016). https://doi.org/10.1109/FSKD.2015.7381971
11. Mandal, R., Choudhury, N.: Automatic Video Surveillance for theft detection in ATM
machine: an enhanced approach. In: 2016 3rd International Conference on Computing for
Sustainable Global Development (INDIACom), pp. 2821–2826 (2016)
12. Md Sakip, S.R.B., Mohd Salleh, M.N.: Linear street pattern in urban cities in Malaysia
influence snatch theft crime activities. In: Asia-Pacific International Conference, vol. 3, no.
8, p. 189 (2018)
13. Khalidi, S., Shakeel, M.: Spatio-temporal analysis of the street crime hotspots in faisalabad
city of Pakistan. In: 23rd International Conference on Geoinformatics, Wuhan, China, pp. 3–6
(2015). https://doi.org/10.1109/GEOINFORMATICS.2015.7378693
14. Lee, I., Jung, S., Lee, J., Macdonald, E.: Street crime prediction model based on the physical
characteristics of a streetscape: analysis of streets in low-rise housing areas in South Korea.
Environ. Plann. B Urban Anal. City Sci. 46(5), 862–879 (2019). https://doi.org/10.1177/239
9808317735105
15. Takizawa, A., Koo, W., Katoh, N.: Discovering distinctive spatial patterns of snatch theft in
Kyoto City with CAEP. J. Asian Archit. Build. Eng. 9(1), 103–110 (2010). https://doi.org/10.
3130/jaabe.9.103
16. Laouar, D., Mazouz, S., Van Nes, A.: Space and crime in the North-African city of Annaba.
In: Proceedings of the 11th Space Syntax Symposium, pp. 196.1–196.9 (2017)
17. Lu, J., Tang, G.A.: The spatial distribution cause analysis of theft crime rate based on GWR
Model. In: 2011 International Conference on Multimedia Technology, ICMT 2011, pp. 3761–
3764 (2011). https://doi.org/10.1109/ICMT.2011.6002711
18. Zhuang, Y., Almeida, M., Morabito, M., Ding, W.: Crime hot spot forecasting: a recurrent
model with spatial and temporal information. In: 2017 IEEE International Conference on
Big Knowledge on Proceedings, ICBK pp. 143–150 (2017). https://doi.org/10.1109/ICBK.
2017.3
19. Hanaoka, K.: New insights on relationships between street crimes and ambient population:
use of hourly population data estimated from mobile phone users’ locations. Environ. Plann.
B Urban Anal. City Sci. 45(2), 295–311 (2018). https://doi.org/10.1177/0265813516672454
20. Ibrahim, N., Mokri, S.S., Siong, L.Y., Marzuki Mustafa, M., Hussain, A.: Snatch theft detec-
tion using low level features. In: World Congress on Engineering 2010 on Proceedings,
London, UK, pp. 862–866 (2010)
21. Suriani, N. S., Hussain, A., Zulkifley, M. A.: Multi-agent event detection system using k-
nearest neighbor classifier. In: 2014 International Conference on Electronics, Information
and Communications, ICEIC, Kota Kinabalu, Malaysia, pp. 1–2 (2014). https://doi.org/10.
1109/ELINFOCOM.2014.6914382
22. Butt, U.M., Letchmunan, S., Hassan, F.H., Zia, S., Baqir, A.: Detecting video surveillance
using VGG19 convolutional neural networks. Int. J. Adv. Comput. Sci. Appl. 11(2), 674–682
(2020). https://doi.org/10.14569/ijacsa.2020.0110285
23. Razak, H.A., Almisreb, A.A., Tahir, N.M.: Detection of anomalous gait as forensic gait in
residential units using pre-trained convolution neural networks. In: Arai, K., Kapoor, S.,
Bhatia, R. (eds.) FICC 2020. AISC, vol. 1130, pp. 775–793. Springer, Cham (2020). https://
doi.org/10.1007/978-3-030-39442-4_57
24. Krizhevsky, A., Sutskever, I., Hinton, G. E.: ImageNet classification with deep convolutional
neural networks. In: Advances in Neural Information Processing Systems 25 (NIPS 2012) on
Proceedings (2012). https://doi.org/10.1201/9781420010749
25. Guo, Z., Chen, Q., Wu, G., Xu, Y., Shibasaki, R., Shao, X.: Village building identification
based on ensemble convolutional neural networks. Sensors 17(11), 1–22 (2017). https://doi.
org/10.3390/s17112487
26. What is Overfitting in Deep Learning and How to Avoid It. https://www.v7labs.com/blog/ove
rfitting. Accessed 02 Mar 2022
27. Li, H., Li, J., Guan, X., Liang, B., Lai, Y., Luo, X.: Research on overfitting of deep learn-
ing. In: 2019 15th International Conference on Computational Intelligence and Security on
Proceedings, CIS, Macao, China, pp. 78–81 (2019). https://doi.org/10.1109/CIS.2019.00025
28. Bilbao, I., Bilbao, J.: Overfitting problem and the over-training in the era of data. In: The 8th
IEEE International Conference on Intelligent Computing and Information Systems, ICICIS,
Cairo, Egypt, pp. 173–177 (2018)
29. Almisreb, A.A., Tahir, N.Md., Turaev, S., Saleh, M.A., Al Junid, S.A.M.: Arabic handwriting
classification using deep transfer learning techniques. Pertanika J. Sci. Technol. 30(1), 641–
654 (2022). https://doi.org/10.47836/pjst.30.1.35
Deep Learning and Few-Shot Learning in the Detection of Skin Cancer: An Overview
Abstract. Skin cancer is a severe condition that should be detected early. The
two most prevalent types of skin cancer are melanoma and non-melanoma, and
melanoma has been identified as the most dangerous skin cancer. Yet,
discriminating melanoma lesions from non-melanoma lesions has proven challenging.
Several artificial intelligence-based strategies have been introduced in the litera-
ture to handle skin cancer detection, including deep learning and few-shot learning
strategies. According to the evidence in the literature, deep learning algorithms
are reported to perform well when trained on large datasets. However, they are
only effective when the target domain has enough labeled samples; they do not
ensure adequate network activation variables to adjust to new target regions rapidly
when the target domain has insufficient data. Consequently, few-shot learning
paradigms have been presented in the literature to promote learning from such
limited amounts of labeled data. A search on PubMed from inception to 7 June
2022 for studies investigating the review of the application of deep learning and
few-shot learning in the detection of skin cancer was performed via the use of title
terms “Deep Learning” AND “Few-Shot Learning” AND “Skin Cancer Detec-
tion” AND “Review,” combined with title terms or MeSH terms “Deep Learning”
AND “Few-Shot Learning” AND “Skin Cancer Detection” AND “Review,” with
no limits on language or date of publication. We found no paper that has reviewed
the application of deep learning and few-shot learning in detecting skin cancer.
This paper, therefore, presents a brief overview of some of the most critical appli-
cations of deep learning and few-shot learning schemes in the detection of skin
cancer lesions from skin image data.
1 Introduction
Cancer, a medical disease characterized by unregulated cell proliferation in bodily
tissues, is considered one of the foremost healthcare burdens worldwide [1]. Because
the skin is the broadest organ in the body, skin cancer is among the most common and
hazardous of the several cancer types [2]. Unrepaired deoxyribonucleic acid (DNA)
damage in skin cells causes genetic defects or skin changes, leading to skin cancer [3].
Skin cancer is grouped into two broad classes: melanoma and non-melanoma [4]. While
the majority of skin cancer cases fall into the non-melanoma category and have minimal
likelihood of spreading to other regions of the body, melanoma is an uncommon,
dangerous, and fatal form of skin cancer that develops in surface cells called
melanocytes. According to the American Cancer Society, melanoma accounted for only
1% of all skin cancer cases, yet it is connected with a greater mortality rate since it can
only be treated when detected early [5]. As a result, it is preferable to recognize it from
the onset in order to ensure it is curable and to reduce the cost of therapy. According to
Esteva [6], around 5.4 million new cases of skin cancer are documented per year in the
United States, and the speed at which it is expanding is deeply troubling on a global
scale.
Considering the gravity of the situation, scientists have devised technologies for the
detection of skin cancer that can aid early diagnosis with good accuracy and sensitivity.
Dermoscopy imaging is used to identify melanoma in human skin: experts
(dermatologists) examine the dermoscopic images to determine whether skin lesions are
present. Like many other medical fields, dermatology has benefitted immensely from
advances in the computer vision sub-field of artificial intelligence, which focuses on
enabling computer systems to extract information from digital images and automate the
identification tasks that human visual systems can do [6]. The performance of deep
learning algorithms on computer vision tasks has improved over the years since the first
convolutional neural network emerged [7], and deep learning has recently been applied
in computer-aided skin cancer recognition tasks [5]. The availability of vast amounts of
labeled image data has helped deep learning algorithms achieve immense improvements
in computer vision tasks on medical images over the last decade [8].
However, when there are only limited amounts of labeled image data available, deep
networks trained on such small datasets tend to fail because they overfit and are less
likely to generalize appropriately. As a result, few-shot classification techniques have
emerged to aid learning from small amounts of labeled data [9–12]. With only a few
training instances, few-shot learning approaches give computer vision models the ability
to quickly adjust to novel tasks and settings.
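A common instance of this idea is prototypical-network-style classification: average the few labelled support embeddings of each class into a prototype, then assign a query to the nearest prototype. The sketch below, in Python with hypothetical two-dimensional embeddings and class names, illustrates the mechanism only; the works cited above may use different few-shot schemes:

```python
import math

def prototype(support_vectors):
    """Class prototype: the mean of that class's few support embeddings."""
    n = len(support_vectors)
    dim = len(support_vectors[0])
    return [sum(v[i] for v in support_vectors) / n for i in range(dim)]

def classify(query, support_sets):
    """Assign the query embedding to the class with the nearest prototype."""
    protos = {label: prototype(vecs) for label, vecs in support_sets.items()}
    return min(protos, key=lambda label: math.dist(query, protos[label]))
```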
A search on PubMed [13] from inception to 7 June 2022 for studies investigating
the review of the application of deep learning and few-shot learning in the detection
of skin cancer was performed via the use of title terms “Deep Learning” AND “Few-
Shot Learning” AND “Skin Cancer Detection” AND “Review”, combined with title
terms or MeSH terms “Deep Learning” AND “Few-Shot Learning” AND “Skin Cancer
Detection” AND “Review”, with no limits on language or date of publication. We found
no paper that has reviewed the application of deep learning and few-shot learning in the
detection of skin cancer. This paper, therefore, presents a review on the application of
deep learning and few-shot learning approaches in the identification and detection of
skin cancer (melanoma) which has a very small amount of data available. Section 2 of
this paper presents a brief overview of the general application of deep learning in skin
Deep Learning and Few-Shot Learning in the Detection of Skin Cancer 277
cancer detection while Sect. 3 highlights the basis of the few-shot learning concept and
Sect. 4 presents a review of the application of few-shot in detecting skin cancer diseases.
Artificial intelligence (AI), a field of computer science that employs technologies and
programs to simulate human cognitive abilities, has a variety of applications in health
care, dermatology included. Machine learning is an AI paradigm that permits computers
to do data-driven learning without explicit programming. In other words, the purpose of
machine learning is to create systems that automatically learn from observations of the
real world (referred to as "training data"), without the need for people to explicitly
define rules or reasoning [14]. Machine learning (ML) plays a vital role in skin cancer
diagnosis. A subclass of machine learning is deep learning, which builds deep neural
network models with neurons organized in many parameterized layers between input
and output. The Convolutional Neural Network (CNN) is a deep learning method that
may be used to solve difficult computer vision problems like skin lesion analysis and
overcomes the limitations of traditional machine learning approaches [15]. Deep
learning models conveniently learn patterns well on large amounts of data and have
been well suited to learning inherent patterns on large clinical imaging datasets,
including cancer lesion images [16].
Melanoma or probable skin lesions are identified primarily using dermoscopy imaging,
which detects pigmented skin lesions. The procedure is non-invasive and can detect
lesions early on. Dermatologists can evaluate skin lesions with their own eyes,
as dermoscopic images have good resolution and great visual quality [17]. The
examination process by dermatologists, however, takes time, requires a high level of
professional expertise, and is sometimes subjective. Convolutional neural networks are
just as good as dermatologists at detecting melanoma or skin lesions, if not better [18].
There are various deep learning architectures, including AlexNet, the VGGNets,
the MobileNets, and ResNet, among others. Figure 1 depicts the ResNet-50 deep
learning architecture designed for the diagnosis of skin cancer by Medhat et al. [19].
In the ResNet-50 design, each stage has two kinds of blocks: a convolution
block and an identity block, each containing three convolution layers.
When transfer learning is used, the fully connected layer and the classification output layer
are replaced by three new layers for the two classes.
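The transfer-learning step described above, keeping the pre-trained layers fixed and training only the replacement classification layers, can be illustrated with a minimal NumPy sketch. This is not Medhat et al.'s implementation: the "backbone" here is a stand-in fixed random projection, and all sizes, data, and names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen, pre-trained backbone (e.g. ResNet-50 up to global
# average pooling): a fixed random projection from flattened pixels to a
# 64-dimensional feature vector. Its weights are never updated.
W_backbone = rng.standard_normal((64, 32 * 32))

def extract_features(images):
    # images: (n, 32*32) flattened patches -> (n, 64) ReLU features
    return np.maximum(W_backbone @ images.T, 0).T

# New trainable head for two classes (benign / malignant), replacing the
# original fully connected and classification layers.
W_head = np.zeros((2, 64))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_head(images, labels, lr=0.1, epochs=200):
    global W_head
    feats = extract_features(images)      # backbone stays frozen
    onehot = np.eye(2)[labels]
    for _ in range(epochs):
        probs = softmax(feats @ W_head.T)
        grad = (probs - onehot).T @ feats / len(images)
        W_head -= lr * grad               # only the head is updated

# Toy data: two separable clusters standing in for the two lesion classes.
benign = rng.standard_normal((50, 32 * 32)) + 1.0
malignant = rng.standard_normal((50, 32 * 32)) - 1.0
X = np.vstack([benign, malignant])
y = np.array([0] * 50 + [1] * 50)

train_head(X, y)
preds = softmax(extract_features(X) @ W_head.T).argmax(axis=1)
print("training accuracy:", (preds == y).mean())
```

Because only the small head is trained, this converges with very little data, which is the motivation for transfer learning on small dermoscopic datasets.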
Several existing artificial intelligence (AI) algorithms, especially deep learning algo-
rithms, have been proposed in the literature for the identification and detection of skin
cancer. Some of these works are presented here: first, works that simply applied
deep learning to the diagnosis of skin cancer, and then studies showing
that deep learning methods are capable of outperforming human
dermatologists in the detection of skin cancer.
For melanoma detection, a very deep CNN was proposed in [20]. To
improve efficiency, a fully convolutional residual network (FCRN) with 16 residual
blocks was utilized in the segmentation stage. For the classification stage,
the proposed scheme averaged the outputs of SVM and softmax classifiers. It
278 O. Akinrinade et al.
achieved a melanoma classification accuracy of 85.5% with the segmentation stage and 82.8%
without segmentation.
In [21], vision-based classification of melanoma based on deep learning was
proposed using VGG-16, a pre-trained deep convolutional neural network archi-
tecture comprising 5 convolutional blocks and 3 fine-tuned layers. With 78% accu-
racy, the VGG-16 model successfully recognized lesion images as melanoma skin
cancer. 1200 photos of normal skin and 400 photographs of skin lesions were used
to train the deep learning model. With 86.67% accuracy, the suggested system catego-
rized the input pictures into two elementary categories: normal skin
images and lesion images.
Fig. 1. A diagram of the ResNet-50 CNN architecture for skin cancer diagnosis (Source: Medhat
et al. [19])
For skin lesion categorization, [22] suggested a technique for extracting deep fea-
tures from several pre-trained convolutional neural networks. The deep feature genera-
tors employed in the study were pre-trained VGG16, AlexNet, and ResNet-18,
whose features were then fed into a multi-class Support Vector Machine model. Finally,
to carry out classification, the classifiers' outputs were pooled. On the ISIC 2017 dataset,
the suggested technique classified seborrheic keratosis (SK) and melanoma with 97.55%
and 83.83% AUC, respectively.
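The fusion step in [22], pooling the outputs of several classifiers built on different pre-trained feature extractors, amounts to simple late fusion of per-model class scores. A hedged sketch follows; the probability values and class set are entirely made up for illustration:

```python
import numpy as np

# Hypothetical per-image class probabilities from three feature-extractor +
# classifier pipelines (e.g. VGG16-, AlexNet-, and ResNet-18-based), over
# three illustrative classes: melanoma, seborrheic keratosis, nevus.
scores_vgg = np.array([[0.70, 0.20, 0.10],
                       [0.30, 0.60, 0.10]])
scores_alex = np.array([[0.55, 0.25, 0.20],
                        [0.20, 0.70, 0.10]])
scores_resnet = np.array([[0.65, 0.15, 0.20],
                          [0.25, 0.55, 0.20]])

# Late fusion: average the classifiers' probability outputs per image,
# then predict the most probable class.
pooled = (scores_vgg + scores_alex + scores_resnet) / 3
pred = pooled.argmax(axis=1)

classes = ["melanoma", "seborrheic_keratosis", "nevus"]
print([classes[i] for i in pred])
```

Averaging probabilities rather than hard votes lets a confident model outweigh two uncertain ones, which is one common rationale for this kind of score-level ensembling.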
The study in [23] proposed a deep learning strategy for automatically recognising and
segmenting melanoma lesions that circumvents certain constraints of earlier methods.
Each network contained two new interconnected subnetworks, which eased feature
learning and extraction by bringing the semantic depth of the encoder feature maps
closer to the space of the decoder. The technique used softmax
classifiers to categorize melanoma tumors at the pixel level using a multi-stage, multi-
scale methodology. The authors also developed a novel component, the lesion classifier, that
divides skin lesions into melanoma and non-melanoma categories based on the results
of pixel classification. The proposed technique clearly outperformed several
advanced techniques, as demonstrated on two popular datasets from the Hospital Pedro
Hispano (PH2) and the International Symposium on Biomedical Imaging (ISBI) 2017.
The method achieved 95% accuracy on the ISIC 2017 dataset and 92% on the PH2
dataset, with dice coefficients of 95% and 93%, respectively.
The work in [24] focused on deep learning techniques for the detection and diag-
nosis of skin cancer. The purpose of the work was to improve a CNN model for
skin cancer diagnosis that could distinguish between separate types of skin cancer and
aid in timely detection. The segmentation and recognition approaches were implemented
in Python using Keras and TensorFlow as the foundation. The model was created and
validated using a range of network topologies and layer sizes, including
convolutional layers, dropout layers, pooling layers, and dense layers. Transfer learn-
ing techniques were employed to allow rapid convergence. The data was compiled
from the archives of an International Skin Imaging Collaboration (ISIC) competition
and was used to generate and evaluate the model. By merging the ISIC 2018 and ISIC
2019 databases, a new dataset was generated; it was cleaned up and the
most common types of skin lesions were kept: Melanocytic Nevus, Basal Cell
Carcinoma, Benign Keratosis, Melanoma, Dermatofibroma, Actinic Keratosis,
Vascular Lesion, and Squamous Cell Carcinoma. The models were capable of
learning more rapidly as a consequence of the greater range of instances provided per
class. The input pictures were modified to make the model more resilient
to unknown data, which led to a large improvement in testing accuracy, and the images
were normalised using an image normalisation technique. The CNN was built using
transfer learning and ImageNet classification weights. The model was trained and eval-
uated on advanced CNNs, such as Inception V3, ResNet50, VGG16, Inception ResNet,
and MobileNet, to accomplish seven-class categorization of skin lesion photos. The
Inception V3 and Inception ResNet CNN models obtained favorable assessments, with
90% and 91% accuracy, respectively, and proved sufficiently robust to classify lesion
pictures into one of the seven categories [24].
The research in [25] utilised a fine-grained classification concept to develop a clas-
sifier model addressing the difficulties inherent in the automatic recognition of der-
moscopy image lesions caused by complex classification backgrounds and lesion fea-
tures. The model was constructed using MobileNet and DenseNet and incorporated two
standard feature extraction components derived from a lesion classification approach,
as well as a feature discriminating network. In the suggested technique, two types of
training images were fed into the identification model's feature extraction module. The
resulting two sets of features were utilised to construct both binary classification
networks and feature discriminating networks for the detection task. Using this strategy,
the new identification methodology can extract more discriminative lesion characteris-
tics and increase the model's performance with a small number of model parameters.
With suitable model parameter settings, the proposed method was shown to pro-
duce better segmentation results, with the suggested model achieving 96.2% accuracy.
Research work in [26] suggested a multi-scale Convolutional Neural Network based
on an ImageNet-trained Inception V3 CNN. They fine-tuned the pre-trained Inception V3 in
order to perform the classification of skin cancer on coarse and fine scales of the lesion
images. The coarse-scale image resolution was used to capture the shape attributes
of lesions as well as their overall context. The finer-scale image resolution, on the
other hand, captured textural information about the lesion in order to distinguish among
diverse forms of skin lesions.
In order to compare how deep learning techniques perform side-by-side with human
dermatologists, researchers have carried out a number of studies. Using the International
Skin Imaging Collaboration (ISIC) 2016 dataset, Codella et al. [27] constructed an ensemble
of deep learning techniques and contrasted them with the abilities of 8 dermatologists in
identifying whether 100 skin lesions were malignant or benign. The ensemble, which included
convolutional neural networks, deep residual networks, and a fully convolutional U-Net archi-
tecture, was able to segment skin lesions and detect melanoma in the identified area
and surrounding tissue. The aggregated deep learning algorithms' performance
exceeded that of the dermatologists, with 76% accuracy and 62% specificity against
the respective 70.5% and 59% for the dermatologists [27].
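The reader studies summarized in this section all report sensitivity and specificity, which follow directly from the binary confusion matrix. A small sketch (the labels below are toy data, not from any cited study):

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (true-positive rate) and specificity (true-negative
    rate) for binary labels, where 1 = malignant and 0 = benign."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Toy reader study: 10 lesions, of which 4 are truly malignant.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
reader = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

sens, spec = sensitivity_specificity(truth, reader)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Note that sensitivity and specificity trade off against each other as the decision threshold moves, which is why the studies below compare the CNN's specificity at a fixed, matched sensitivity.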
Haenssle et al. [28] trained a deep learning model using the popular InceptionV4
deep learning architecture on a huge dermoscopic dataset containing over a hundred
thousand images of benign and malignant lesions, and 58 dermatologists were used to
evaluate the deep learning model's performance. Diagnosis was carried out at two
levels: only dermoscopy was employed at the first level, while der-
moscopy was employed in conjunction with medical data and patient photographs at
the second level. In the study, dermatologists achieved 86.6% sensitivity and 71.3%
specificity at the first level. At the second level, the sensitivity and specificity climbed to
88.9% and 75.7%, respectively. The specificity enhancement was significant (p =
0.05), but the increase in sensitivity was not statistically significant (p =
0.19). The CNN model had vastly greater specificity than the dermatologists at the
first level (p = 0.01) and the second level (p = 0.01). The CNN surpassed several
dermatologists in the trial, implying a possible role in the identification of melanoma
from dermoscopic images [28].
In a similar study, Haenssle et al. [29] evaluated an InceptionV4-based deep learning
architecture against dermatologists on a dermoscopic test set of 100 cases. There were
two levels to this study: level 1 included a dermoscopy photo only, while level 2 added a
clinical close-up photo and patient information. The sensitivity and specificity of the
dermatologists at level 1 were 89% and 80.7%, respectively, compared with 95% and
76.7% for the CNN system. Dermatologists' mean sensitivity climbed to 94.1% with
the additional information at level 2, while their average specificity remained
unchanged [29].
Similar findings were reported in another investigation by Brinker et al. [30].
The researchers employed ResNet-50, a CNN architecture, and evaluated it against
157 dermatologists on 100 dermoscopy photos. The dermatologists achieved
74.1% overall sensitivity and 60% specificity, while the deep
learning model achieved 84.2% sensitivity and 69.2% specificity. It was reported that the deep
learning model performed better than 86.6% of the dermatologists in the study in this
head-to-head matchup. As a result, it was concluded that the deep learning model has
huge potential to help physicians make accurate melanoma diagnoses [30].
Research in [31] utilized a heterogeneous dataset of 7895 dermoscopic images and
5829 close-up lesion images to detect non-pigmented skin cancers via InceptionV3
and ResNet50 CNNs. The findings of the CNNs were evaluated against those of 95
dermatologists who were divided into three categories based on their level of expertise.
The CNN algorithms outperformed the beginner and intermediate rater categories and
achieved accuracy comparable to human experts. The findings of the study revealed
that the CNNs made proper diagnoses in a greater number of instances than the
dermatologists overall, but not when matched against professionals with over a decade
of experience [31].
With the participation of 112 German dermatologists, [32] investigated the specificity
and sensitivity of a ResNet50 CNN for multi-class classification of skin lesions. For the
primary end-point, dermatologists detected skin lesions with about 74.4% sensitivity and
59.8% specificity; the deep learning algorithm's specificity was 91.3% at a comparable
level of sensitivity. For the secondary end-point of correctly assigning a particular
picture to one of the five considered classes, dermatologists achieved 56.5% sensitivity
and 89.2% specificity, while the deep learning algorithm had 98.8% specificity at the
same sensitivity level. The deep learning system outperformed the dermatologists
significantly (p < 0.001). The deep learning system outperformed the 112 dermatologists
in all classes except basal cell carcinoma, where the system performed similarly to the
112 dermatologists [32].
From the evidence in the literature, it can be seen that deep learning algorithms
perform well when trained on big data. However, they are only effective when
there are enough labeled samples in the domain of interest; they do not guarantee
network configurations that can swiftly adjust to new domains of interest
in situations where there is insufficient data in the target domain [8].
Focusing on the use of few-shot learning for the diagnosis of dermatological dis-
ease, Prabhu et al. [39] stated that conventional off-the-shelf methods of identifying
a dermatological disease from photographs confront two major obstacles: real-world
dermatological data distributions are frequently long-tailed, and there is
a great deal of intra-class heterogeneity. They characterized the first problem as low-shot
learning, wherein a base classifier must adapt quickly to detect new conditions after being
deployed, with only a minimal number of labeled examples. They proposed Prototypical
Clustering Networks (PCN), an extension of Prototypical Networks [38], which successfully
captures intra-class variability by learning a mixture of "prototypes" for each class. Proto-
types for every class were initialized using clustering and then refined using an online
update scheme. Images were classified by matching their similarity, within
each class, to a weighted mixture of prototypes, with the weights representing the estimated
cluster assignments. They trained a 50-layer ResNet-v2, an advanced CNN architecture
for image categorization, using ImageNet pre-training, and demonstrated the capability
of the suggested method for effective identification on a real dataset of dermatolog-
ical illnesses. They reported the mean class accuracy (MCA) on the test set for the top
200 classes for two low-shot setups, with 5 shots and 10 shots in the training set
and 5 shots in the test set. Their strategy, PCN, achieved an MCA
of 30.0 ± 2.8 on novel classes in the 5-shot task, and an MCA of 49.6 ± 2.8 on novel
classes in the 10-shot task. Since the approach leverages episodic training to build dis-
criminative visual features that can be extrapolated to new classes with minimal sample
sizes, it achieves substantially better results (around nine percentage points of improvement
over previous approaches) in generalizing to new categories.
In low-data and heavy-tailed data distribution domains, [8] investigated meta-
learning-based algorithms, namely Prototypical Networks, a distance-metric-based learn-
ing approach, and Reptile, a gradient-based approach, for identifying skin lesions from
clinical images. The suggested network, called "Meta-DermDiagnosis," uses a
meta-learning approach that allows deep neural networks trained on datasets of com-
mon diseases to adapt quickly to rare conditions with considerably less labeled data.
It comprises a meta-learner that trains the neural network on a series of few-shot pic-
ture classification tasks using a baseline set of class labels. To obtain good network
initialization weights, class labels are sampled from the head of the class distribution,
and the model is then adapted to classify images from a new set of unseen classes
with only a few examples. Additionally, they showed that for skin lesion
image classification, utilizing Group Equivariant convolutions (G-convolutions) in Meta-
DermDiagnosis considerably enhances the network's performance, because orientation is
often not a significant aspect of such images. Using three publicly accessible skin lesion
classification datasets, Derm7pt, SD-198, and ISIC 2018, they evaluated the per-
formance of the proposed technique employing Reptile and Prototypical Networks and
compared it to a pre-trained transfer learning baseline. The study's findings showed
that Reptile with G-convolutions outperformed the other techniques (pre-training and
prototypical networks) for skin lesion classification in low-data circumstances, with
82.1% average accuracy on the ISIC 2018 dataset, 76.9% on the Derm7pt
dataset, and 83.7% on the SD-198 dataset.
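The episodic training underlying both Reptile and Prototypical Networks consumes N-way, K-shot "episodes" sampled from the class distribution. A minimal, dataset-agnostic sampler might look like the sketch below; the dataset layout and all class/image names are hypothetical.

```python
import random

def sample_episode(dataset, n_way=3, k_shot=5, k_query=5, seed=None):
    """Sample one N-way, K-shot episode from {class_name: [image_ids]}.

    Returns (support, query) lists of (image_id, class_name) pairs,
    the basic unit of training in episodic meta-learning."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), n_way)       # pick N classes
    support, query = [], []
    for c in classes:
        picks = rng.sample(dataset[c], k_shot + k_query)
        support += [(img, c) for img in picks[:k_shot]]  # K labeled shots
        query += [(img, c) for img in picks[k_shot:]]    # held-out queries
    return support, query

# Hypothetical skin-disease dataset: class name -> list of image ids.
data = {f"disease_{i}": [f"img_{i}_{j}" for j in range(20)] for i in range(8)}

support, query = sample_episode(data, n_way=3, k_shot=5, k_query=5, seed=0)
print(len(support), len(query))
```

During meta-training the learner adapts to the support set and is scored on the query set of each episode, so that test-time adaptation to an unseen disease with K examples matches the training conditions.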
In the work "Few-shot learning for skin lesion image classification", [40] used an
improved Relation Network for metric learning to categorize skin
disease from a limited amount of annotated skin lesion image data. The technique
employed a relative position network (RPN) and a relative mapping network (RMN), in
which the RPN collects and extracts feature representations via an attention mechanism,
and the RMN uses a weighted sum of attention mapping distances to determine image
similarity for categorization. On the public ISIC melanoma dataset, the average classification
accuracy obtained was 85%, demonstrating the technique's efficacy and practicability.
For skin lesion segmentation, [41] proposed a few-shot segmentation network that
needs only minimal pixel-level labeling. First, the co-occurrence area between the
support image and the query image was extracted and used as a prior
mask to remove extraneous background areas. Second, the findings were combined and
passed to the inference module, which predicts the query image segmentation. Third,
exploiting the symmetrical structure, the network was retrained with the support and
query roles inverted. Extensive tests on ISIC-2017, ISIC-2019, and PH2 show that the method
provides a promising framework for few-shot skin lesion segmentation.
Lastly, relying on the Internet of Medical Things, [42] developed a few-shot prototype net-
work to alleviate the paucity of annotated samples. First, a contrastive learning branch
was designed to improve the feature extractor's capabilities. Second, a novel technique
for creating positive and negative sample pairs for contrastive learning was pro-
posed, which was reported to remove the need to explicitly maintain a sample queue.
Third, the dissimilarity learning branch was utilized to correct corrupted data and
develop the category prototype. Finally, to increase classification accuracy and conver-
gence speed, a hybrid loss combining prototype and contrastive losses was
applied. On the mini-ISIC-2i and mini-ImageNet datasets, their technique was reported
to have performed considerably well [42].
5 Conclusion
This paper has provided a brief overview of relevant applications of deep
learning and few-shot learning techniques in the diagnosis of skin cancer lesions using
skin imaging data. As evidenced in the literature and in practice, deep learning
algorithms perform admirably when trained on huge datasets. Deep learning algorithms
have, however, mostly been successful only when the target domain has enough
annotated instances; they do not ensure network configurations that can swiftly adapt
to novel target domains when the target domain lacks data. This makes deep learning
unsuitable for model building in circumstances where data is limited and thus
necessitates the development of appropriate techniques for such data-scarce situations.
In this regard, this paper has explored the application of few-shot learning methods to
the detection of various classes of melanoma, banking on the ability of few-shot
learning techniques to learn from limited amounts of labeled data per class. The models
trained in these few-shot-based approaches have been shown to achieve considerably
good detection performance. In future work, attempts will be made to combine the
potential of few-shot learning with the feature extraction capabilities of deep learning
in the skin cancer detection domain, with a view to improving detection performance.
References
1. Das, K., et al.: Machine learning and its application in skin cancer. Int. J. Environ. Res. Public
Health 18, 1–10 (2021)
2. Ferlay, J., et al.: Cancer statistics for the year 2020: an overview. Int. J. Cancer (2021). https://
doi.org/10.1002/ijc.33588
3. Ashraf, R., et al.: Region-of-interest based transfer learning assisted framework for skin cancer
detection. IEEE Access 8, 147858–147871 (2020)
4. Elgamal, M.: Automatic skin cancer images classification. Int. J. Adv. Comput. Sci. Appl. 4
(2013)
5. Dildar, M., et al.: Skin cancer detection: a review using deep learning techniques. Int. J.
Environ. Res. Public Health 18 (2021)
6. Li, C.X., et al.: Artificial intelligence in dermatology: past, present, and future. Chin. Med. J.
132, 2017–2020 (2019)
7. Voulodimos, A., Doulamis, N., Doulamis, A., Protopapadakis, E.: Deep learning for computer
vision: a brief review. Comput. Intell. Neurosci. 2018 (2018)
8. Mahajan, K., Sharma, M., Vig, L.: Meta-dermdiagnosis: few-shot skin disease identification
using meta-learning. In: IEEE Computer Society Conference on Computer Vision and Pattern
Recognition Workshops, pp. 3142–3151, June 2020
9. Koch, G.: Siamese neural networks for one-shot image recognition (2011)
10. Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K.: Matching networks for one shot
learning
11. Santoro, A., Botvinick, M., Lillicrap, T., et al.: One-shot learning with memory-augmented
neural networks (2016)
12. Snell, J., Swersky, K., Zemel, R.S.: Prototypical networks for few-shot learning
13. National Center for Biotechnology Information (NCBI) [Internet]. Bethesda (MD):
National Library of Medicine (US), National Center for Biotechnology Information
14. Khan, S., Rahmani, H., Shah, S.A.A., Bennamoun, M.: A guide to convolutional neural
networks for computer vision. Synth. Lect. Comput. Vis. 8, 1–207 (2018)
15. Indolia, S., Goswami, A.K., Mishra, S.P., Asopa, P.: Conceptual understanding of convo-
lutional neural network-a deep learning approach. Procedia Comput. Sci. 132, 679–688
(2018)
16. Oyetade, I.S., Ayeni, J.O., Ogunde, A.O., Oguntunde, B.O., Olowookere, T.A.: Hybridized
deep convolutional neural network and fuzzy support vector machines for breast cancer
detection. SN Comput. Sci. 3(1), 1–14 (2021). https://doi.org/10.1007/s42979-021-00882-4
17. Usmani, U.A., Watada, J., Jaafar, J., Aziz, I.A., Roy, A.: A reinforcement learning algorithm
for automated detection of skin lesions. Appl. Sci. 11 (2021)
18. Esteva, A., et al.: Dermatologist-level classification of skin cancer with deep neural networks.
Nature 542, 115–118 (2017)
19. Medhat, S., Abdel-Galil, H., Aboutabl, A.E., Saleh, H.: Skin cancer diagnosis using convo-
lutional neural networks for smartphone images: a comparative study. J. Radiat. Res. Appl.
Sci. 15, 262–267 (2022)
20. Yu, L., Chen, H., Dou, Q., Qin, J., Heng, P.-A.: Automated melanoma recognition in der-
moscopy images via very deep residual networks. IEEE Trans. Med. Imaging 36, 994–1004
(2017)
21. Kalouche, S.: Vision-based classification of skin cancer using deep learning. Stanford’s
machine learning course (CS 229) (2016)
22. Mahbod, A., Schaefer, G., Wang, C., Ecker, R., Ellinger, I.: Skin lesion classification using
hybrid deep neural networks. In: ICASSP, IEEE International Conference on Acoustics,
Speech and Signal Processing - Proceedings, pp. 1229–1233, May 2019
23. Adegun, A.A., Viriri, S.: Deep learning-based system for automatic melanoma detection.
IEEE Access 8, 7160–7172 (2020)
24. Nahata, H., Singh, S.P.: Deep learning solutions for skin cancer detection and diagnosis. In:
Jain, V., Chatterjee, J.M. (eds.) Machine Learning with Health Care Perspective. LAIS, vol.
13, pp. 159–182. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-40850-3_8
25. Wei, L., Ding, K., Hu, H.: Automatic skin cancer detection in dermoscopy images based on
ensemble lightweight deep learning network. IEEE Access 8, 99633–99647 (2020)
26. DeVries, T., Ramachandram, D.: Skin lesion classification using deep multi-scale convolu-
tional neural networks (2017)
27. Codella, N.C.F., et al.: Deep learning ensembles for melanoma recognition in dermoscopy
images. IBM J. Res. Dev. 61, 1–28 (2017)
28. Haenssle, H.A., et al.: Man against machine: diagnostic performance of a deep learning
convolutional neural network for dermoscopic melanoma recognition in comparison to 58
dermatologists. Ann. Oncol. 29, 1836–1842 (2018)
29. Haenssle, H.A., et al.: Man against machine reloaded: performance of a market-approved
convolutional neural network in classifying a broad spectrum of skin lesions in comparison
with 96 dermatologists working under less artificial conditions. Ann. Oncol. 31, 137–143
(2020)
30. Brinker, T.J., et al.: Deep learning outperformed 136 of 157 dermatologists in a head-to-head
dermoscopic melanoma image classification task. Eur. J. Cancer 113, 47–54 (2019)
31. Tschandl, P., et al.: Expert-level diagnosis of nonpigmented skin cancer by combined
convolutional neural networks. JAMA Dermatol. 155, 58–65 (2019)
32. Maron, R.C., et al.: Systematic outperformance of 112 dermatologists in multiclass skin cancer
image classification by convolutional neural networks. Eur. J. Cancer 119, 57–65 (2019)
33. Garcia, S.I.: Meta-learning for skin cancer detection using deep learning techniques, pp. 1–7
(2021)
34. Wong, S.C., Gatt, A., Stamatescu, V., McDonnell, M.D.: Understanding data augmentation for
classification: when to warp? In: 2016 International Conference on Digital Image Computing:
Techniques and Applications, DICTA 2016 (2016). https://doi.org/10.1109/DICTA.2016.7797091
35. Mikołajczyk, A., Grochowski, M.: Data augmentation for improving deep learning in image
classification problem. In: 2018 International Interdisciplinary PhD Workshop, IIPhDW 2018,
pp. 117–122 (2018). https://doi.org/10.1109/IIPHDW.2018.8388338
36. Kumar, V., Glaude, H., de Lichy, C., Campbell, W.: A closer look at feature space data augmen-
tation for few-shot intent classification. In: DeepLo@EMNLP-IJCNLP 2019 - Proceedings
of the 2nd Workshop on Deep Learning Approaches for Low-Resource Natural Language
Processing, pp. 1–10 (2021). https://doi.org/10.18653/v1/d19-6101
37. Duan, R., et al.: A survey of few-shot learning: an effective method for intrusion detection.
Secur. Commun. Netw. 2021 (2021)
38. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: Advances
in Neural Information Processing Systems, pp. 4078–4088, December 2017
39. Prabhu, V., et al.: Few-shot learning for dermatological disease diagnosis. Proc. Mach. Learn.
Res. 106, 1–15 (2019)
40. Liu, X.J., Li, K., Luan, H., Wang, W., Chen, Z.: Few-shot learning for skin lesion image
classification. Multimedia Tools Appl. (2022). https://doi.org/10.1007/s11042-021-11472-0
41. Xiao, J., Xu, H., Zhao, W., Cheng, C., Gao, H.: A prior-mask-guided few-shot learning for
skin lesion segmentation. Computing (2021)
42. Xiao, J., Xu, H., Fang, D., Cheng, C., Gao, H.: Boosting and rectifying few-shot learning
prototype network for skin lesion classification based on the internet of medical things. Wirel.
Netw. (2021)
Enhancing Artificial Intelligence Control
Mechanisms: Current Practices, Real Life
Applications and Future Views
Abstract. The popularity of Artificial Intelligence has grown lately with the
potential it promises for revolutionizing a wide range of different sectors. To
achieve this change, the whole community must overcome the explainability barrier
of Machine Learning (ML), an inherent obstacle of current sub-symbolism-based
approaches, e.g. Deep Neural Networks, which did not exist during the previous
AI hype around expert and rule-based systems. Due to a lack of trans-
parency, privacy concerns, biased systems, and a lack of governance and accountability,
our society demands toolsets for creating responsible AI solutions that enable unbiased
AI systems. These solutions will help business owners create AI applications
that are trust-enhancing, open, transparent, and explainable. Properly
made systems will enhance trust among employees, business leaders, customers,
and other stakeholders. The process of overseeing artificial intelligence usage and
its influence on related stakeholders belongs to the context of AI governance.
Our work gives a detailed overview of a governance model for Responsible AI,
emphasizing fairness, model explainability, and accountability in large-scale AI
technology deployment in real-world organizations. Our goal is to provide model
developers in an organization with an understanding of Responsible AI via a com-
prehensive governance framework that outlines the details of the different roles
and their key responsibilities. The results serve as a reference for future research and
are aimed at encouraging experts from other disciplines to embrace
AI in their own business sectors, without interpretability shortcomings and biases.
1 Introduction
From business to healthcare, sustainability, product design, and industrial and educational contexts alike, innovations in AI and Industry 4.0 are delivering new opportunities to improve people's lives all across the globe [1, 3, 10, 19, 23, 26, 32]. This does, however, raise problems of fairness, inclusion, and system security, and of how to bring privacy into these systems effectively [2]. User-centered AI systems should consider both general issues and those specific to machine learning (ML) [4–8, 37]. Understanding the genuine impact of a system's estimates, suggestions, operational models, process principles, and decisions depends on how users interact with systems and operations [8, 9, 52, 53]. Also, design attributes such as adequate disclosure, clarity, and control are required for a good user experience and are common parameters in estimating the revenue streams and value of AI solutions [11]. Considering augmentation and assistance, a single solution is suitable if it is designed to serve a broad range of users and use cases. In certain circumstances, giving the user a limited set of alternatives is beneficial to the system. Achieving precision across many answers is significantly more challenging than achieving accuracy for a single solution [9]. Metrics such as click-through rate and customer lifetime value, as well as subgroup-specific false positive and false negative rates, help evaluate overall system performance and short- and long-term product health [12]. It should be ensured that the metrics are suitable for the context and purpose of the system; for example, a fire alarm system should have high recall, even at the cost of some false alarms [13, 14]. Since ML models reflect the data they are trained on, our comprehension of the raw data should be double-checked. If this is not possible, as with sensitive raw data, it is advisable to make the most of the information while maintaining privacy by distributing aggregated summaries.
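To make the metric discussion concrete, the following sketch computes overall recall together with subgroup-specific false positive and false negative rates; the group labels and data are invented for illustration.

```python
from collections import defaultdict

def error_rates(y_true, y_pred, groups):
    """Overall recall plus per-group FPR/FNR for binary labels."""
    stats = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        key = ("tp" if t and p else
               "fn" if t and not p else
               "fp" if not t and p else "tn")
        stats[g][key] += 1
        stats["overall"][key] += 1

    def rates(s):
        fpr = s["fp"] / (s["fp"] + s["tn"]) if s["fp"] + s["tn"] else 0.0
        fnr = s["fn"] / (s["fn"] + s["tp"]) if s["fn"] + s["tp"] else 0.0
        return {"fpr": fpr, "fnr": fnr}

    report = {g: rates(s) for g, s in stats.items()}
    o = stats["overall"]
    recall = o["tp"] / (o["tp"] + o["fn"]) if o["tp"] + o["fn"] else 0.0
    return recall, report

# Illustrative labels, predictions, and user subgroups.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
groups = ["A", "A", "A", "B", "B", "B", "B", "A"]
recall, report = error_rates(y_true, y_pred, groups)
```

Here recall is 0.75 overall, but group A carries all of the false negatives, exactly the kind of subgroup disparity worth tracking alongside aggregate product metrics.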
The simplest model that satisfies the performance objectives should be used [15]. Users also need to be made aware of system limits: for example, an ML application designed to identify selected bird species should disclose that its model was trained on a small sample of images from a particular region of the globe [16]. Better educating people about the product or application increases the quantity of feedback. Well-established testing and quality engineering practices should be adopted from software engineering to ensure that the AI system operates as expected and can be trusted. To include a wide variety of consumer needs in development cycles, iterative user testing should be done [17]. The quality engineering approach then builds quality checks into the system, ensuring that unplanned errors are prevented or handled swiftly (for example, if a critical input feature is suddenly absent, the AI system will not generate a prediction). The model should be constantly monitored against real-world performance and user input (e.g., happiness-tracking surveys, the HEART framework) [13, 18].
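The quality check described above, refusing to predict when a critical feature is absent, can be sketched as a thin guard around the model; the feature names and the `toy_model` stand-in are hypothetical.

```python
CRITICAL_FEATURES = {"age", "income"}  # hypothetical required inputs

def guarded_predict(model, features):
    """Return a prediction only when all critical features are present."""
    present = {k for k, v in features.items() if v is not None}
    missing = CRITICAL_FEATURES - present
    if missing:
        # Fail safely instead of predicting on incomplete input.
        return {"status": "rejected", "missing": sorted(missing)}
    return {"status": "ok", "prediction": model(features)}

# Hypothetical stand-in for a trained model.
def toy_model(f):
    return f["income"] > 50000 and f["age"] > 30

rejected = guarded_predict(toy_model, {"age": 40, "income": None})
accepted = guarded_predict(toy_model, {"age": 40, "income": 60000})
```

The guard turns a silent misprediction into an explicit, monitorable rejection, which is the behavior the quality engineering step aims for.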
By definition, all models of the world are flawed; there is no such thing as a 100% perfect system. It is recommended to set aside time in the product strategy for troubleshooting. Both short- and long-term solutions should be taken into account: while a fast remedy can temporarily solve an issue, it is usually not a long-term answer, so long-term learning solutions should be linked with the quick ones. Before altering a deployed model, the variation between the candidate and the deployed model must be considered, along with how the change will affect overall system quality and user experience [20, 21]. The critical point is that AI success is contingent on a group, not a single person or position [22]. According to Collins, these disciplines increase awareness of the scientific process and of thinking in complicated, interacting systems; their practitioners generally have the critical thinking abilities needed to conduct good experiments and analyze the results of ML applications. Having a varied staff has several advantages [14].
Enhancing Artificial Intelligence Control Mechanisms 289
Sitting back and hoping for diversity to arrive on its own is not a realistic team-building approach [7, 24, 25]. The following are the contributions of this manuscript:
2 Literature Review
AI’s ethical deployment and governance are essential for permitting the large-scale
deployment of AI systems required to improve people’s welfare and safety. Yet, AI
development typically outpaces regulatory development in many areas. The technique
also enables AI developers to address typical challenges in automated systems, such as
reducing social bias reinforcement, keeping people’s jobs and talents, resolving respon-
sibility to ensure confidence in an algorithm’s results, and more [27]. Commercial AI
systems in radiation clinics have just lately been developed, in contrast to the aerospace
sector, with efforts concentrated on showing performance in academic or clinical set-
tings, as well as product approval [28, 29]. Until previously, commercial AI systems
for radiation were only available as static goods, allowing cancer specialists to analyze
their effectiveness. Although an agile lifecycle management strategy, where AI-based
290 U. A. Usmani et al.
segmentation models are updated with new patient data regularly, sounds appealing, it
is unlikely to be accessible anytime soon.
Linear accelerators and the vital software systems for treatment planning and radiotherapy department operations benefit from the same continuous quality assurance and monitoring. In general, it is good practice to study the monitoring practices of different fields [38, 47, 49], in addition to the specific field where AI is applied. Before an AI can be deployed, it must first go through a thorough and transparent examination of the ethical implications of its proposed activities, particularly in terms of social impact but also in terms of safety and bias, before moving on to a five-layer high-frequency checking system to ensure that the AI's decisions are correct and trustworthy. These layers are expectation confinement, synthetic data exercise, independence, comprehensiveness, and data corruption assurance. Dose calculation in therapeutic decision support systems, atlas-based auto-segmentation, and magnetic resonance imaging benefit from similar methodologies [30].
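Purely as an illustration of the five-layer checking idea, the layers can be organized as an ordered pipeline of named predicates; the check functions below are placeholders, not the actual criteria from [30].

```python
def run_checks(system, checks):
    """Run named checks in order; report the first failure, if any."""
    for name, check in checks:
        if not check(system):
            return {"passed": False, "failed_at": name}
    return {"passed": True, "failed_at": None}

# Placeholder predicates named after the five layers in the text.
LAYERS = [
    ("expectation confinement",   lambda s: s["outputs_in_range"]),
    ("synthetic data exercise",   lambda s: s["passes_synthetic"]),
    ("independence",              lambda s: s["independent_review"]),
    ("comprehensiveness",         lambda s: s["covers_cases"]),
    ("data corruption assurance", lambda s: s["corruption_tested"]),
]

candidate = {
    "outputs_in_range": True, "passes_synthetic": True,
    "independent_review": False, "covers_cases": True,
    "corruption_tested": True,
}
result = run_checks(candidate, LAYERS)
```

Ordering the layers and stopping at the first failure keeps the review auditable: here the candidate halts at the independence layer.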
In recent years, (deep) neural networks and machine learning (ML) approaches have complemented and, in some cases, surpassed symbolic AI solutions. As a consequence, their social importance and impact have skyrocketed, bringing the ethical discussion to a far larger audience. The argument has focused on AI ethical principles (nonmaleficence, autonomy, fairness, beneficence, and explainability) rather than on actions, the "how". Even if the AI community is becoming more aware of potential issues, it is still in the early stages of being able to take steps to mitigate the risks. The purpose of this study is to bridge the gap between principles and practices by developing a typology that can help developers apply ethics at each level of the machine learning development pipeline while also alerting researchers to areas where further research is needed. Although the research is limited to machine learning, the findings are expected to be readily transferable to other
AI domains. The following is the difference between ethics by design and pro-ethical design: a nudge is less paternalistic than pro-ethical design, since it does not rule out a course of action but rather steers agents toward choosing one. A simple illustration may help clarify the distinction. A speed camera is both pro-ethical and a nudge, because it still enables drivers to speed and pay a fine in the event of an emergency. Speed bumps, on the other hand, are a kind of traffic-calming device used to slow down automobiles and increase safety. They may seem a good concept, but they impose a lasting route adjustment, leaving motorists with few options: even emergency vehicles such as a medical ambulance, police car, or fire engine must slow down while responding to an emergency.
3 AI Governance Development
While AI moral governance offers promise, it also has limits, and it risks becoming ineffectual if not applied correctly. The restrictions on AI must be understood and followed. For example, who has the final say on what constitutes "ethical" AI? Companies are coming up with their own ideas and methodologies for establishing what it means to use these technologies ethically and what "ethical" AI implies for society. Those at the top of organizations, mostly white males, set the tone [4, 31]. The AI ethics board should be diverse, reflecting the views of the people whom AI systems may impact. Bias reduction suffers when the company's and leadership's aims are at odds: being first to market is highly prized by businesses, while eradicating prejudice and building responsible AI conflict with this aim, necessitating extra procedures and stop points that lengthen the time it takes to bring a product to market. Traditional goals thus clash with ethical and responsible AI goals, placing the company in jeopardy of losing money [33].
In addition to governance within companies, governments and unions of countries should also step in and put their views on the table regarding ethics, governance, and methodologies for regulating AI, just as they have done in manufacturing and waste management, to prevent events like the Boeing 737 MAX groundings, effectively a result of failed industry self-regulation. Examining current goals and making sure that responsible AI is a significant focus is critical. Finally, the interplay between ethical and economic gains demonstrates a fundamental understanding of market success.
Cutting corners on ethics has long-term and legal ramifications, which is particularly significant given AI's rapidly shifting regulatory framework. With inadequate accountability and training, there is often a lack of precise instruction when it comes to putting ideas into practice. Furthermore, there is no accountability when a company's ideals are violated. In many circumstances, corporate culture and the market in general [34] prioritize efficiency above fairness and prejudice reduction, making it difficult to put the ideas into practice. When combining bias and fairness criteria, it is critical to establish what is "fair" for a particular AI system and who defines "fair".
The EGAL brief on ML fairness dives further into the subject, laying out methods and roadblocks. Guarding against purely technical solutions and "ethical washing" should be a priority: most principles and associated ideas are based on the assumption that technology can solve problems, and they tend to have a technical bias. The variety of qualitative techniques included in the EGAL brief on ML fairness can therefore be helpful [35]. Particularly for higher-risk applications, initial high-level assessments of the technology's potential for harm, as well as a record of decisions made throughout the AI system's construction, are critical. The terms management and governance are not interchangeable: governance is responsible for supervising how decisions are made, while management is in charge of making them. By extending the same concept to AI governance, we arrive at the following definition of AI governance for a company. The more dangerous an AI application is (for a description of some of these threats, see AI hazards), the more critical AI governance becomes [36, 37]. Because AI-enabled systems collect data and information continuously, biases and unwanted outcomes, such as a chatbot that learns inappropriate or violent language over time, are quite probable. If the data a system receives is biased, the system will produce skewed results. Individual subject-matter experts must first inspect the data that enters these systems for biases, and then maintain governance to guarantee that biases do not arise over time.
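A minimal sketch of such ongoing data inspection might compare the subgroup composition of newly arriving data against a reference distribution and flag drift; the 10% threshold and group labels are assumptions chosen for illustration.

```python
from collections import Counter

def subgroup_shares(records, key):
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def drift_alert(reference, incoming, key="group", threshold=0.10):
    """Flag subgroups whose share moved more than `threshold` vs the reference."""
    ref = subgroup_shares(reference, key)
    new = subgroup_shares(incoming, key)
    return {g: round(new.get(g, 0.0) - share, 3)
            for g, share in ref.items()
            if abs(new.get(g, 0.0) - share) > threshold}

# Illustrative data: subgroup A grows from 50% to 80% of incoming records.
reference = [{"group": "A"}] * 50 + [{"group": "B"}] * 50
incoming  = [{"group": "A"}] * 80 + [{"group": "B"}] * 20
alerts = drift_alert(reference, incoming)
```

In practice such alerts would feed the second line's continuous monitoring rather than block deployment outright.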
Companies may use additional visibility, a better understanding of their data, and
AI governance to assess the machine’s business rules or learnt patterns before adopting
and spreading them to staff and consumers. When it comes to ethical AI applications,
it’s all about trust. Customers, companies, and government authorities all want to know
that these smart systems are assessing data and making the best judgments they can, and that the business outcomes produced by these machines are in everyone's best interests. Some of the tactics recommended in this article may assist businesses in becoming more trustworthy. They can also enhance how AI offers options to customers, aid with regulatory compliance, improve command and control, and provide transparency and the ability to make the best decisions possible. In this section, we explain the Three Lines Model (shown in Fig. 2), describing the role of the different execution lines and how governance can be made effective using an end-to-end governance model for Responsible AI.
A few of the key "red flag" AI applications include facial recognition, AI for recruitment, and bias in AI-based assessments and recommendations. There should be a way to keep track of everything that is going on. If AI falls under any current law or under expert bodies for autonomous self-governance, it is desirable to control the relevant development and approval methods. If there is no precedent to follow, there may be additional risks that have not been considered. Corporate governance and risk management are ideas that have been around for a long time [44]: standard procedures, norms, and conventions are typically used to guarantee that businesses run smoothly. The Federation of European Risk Management Associations (FERMA) and the European Confederation of Institutes of Internal Auditing (ECIIA) developed the three lines of defense concept in 2008–10 as guidance on Article 41 of the 8th EU Company Law Directive.
In a position paper titled The Three Lines of Defense in Effective Risk Management and Control, released in 2013, the Institute of Internal Auditors (IIA) endorsed this strategy. Risk analysis and management, as well as governance adoption, have all become industry standards [39]. In June 2020, the IIA revised its recommendations in a position paper on the IIA's Three Lines Model. This model covers the six basic governance concepts, the essential roles in the three-line model, the connections between roles, and how to apply the model. It includes information on management's tasks and obligations, the internal audit function, and the governing body. The three-line model has been revised to cover a wider variety of financial services organizations, including technology and model risk. Banks often use these three lines of defense to manage credit risk, market risk, and operational risk models. We adapted this paradigm for AI governance by creating a new governance organization, methodology, roles, and responsibilities. The first line comprises the creators, executors, and operations teams, who develop, build, deploy, and run the data, AI/ML models, and software; the operations team maintains and monitors the data, software, and models. In the second line, managers, supervisors, and quality assurance staff are responsible for identifying and managing the risks connected to data, AI/ML models, automation, and software, and for continuous monitoring. The second line is also responsible for ensuring that the first line's systems are configured correctly.
The Auditors are responsible for ensuring that the organization’s rules, regulations,
and goals are followed and that technology is utilized responsibly and ethically [40].
The ethics board consists of a diverse group of corporate leaders and workers. Specific
organizations can appoint external members to the Board of Directors. Companies will
have to work with external auditors, other assurance providers, and regulators, in addition
to their internal duties. In the diagram below, the features of the numerous roles and the
critical tasks of each function are shown.
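To make the division of duties concrete, the roles of the adapted Three Lines Model described above can be captured in a simple lookup table; the wording of each duty is our paraphrase of this section, not an official taxonomy.

```python
# Paraphrase of the adapted Three Lines Model for AI governance.
THREE_LINES = {
    1: {"who": ["creators", "executors", "operations"],
        "duties": ["develop, build, deploy and run data, AI/ML models and software",
                   "maintain and monitor data, software and models"]},
    2: {"who": ["managers", "supervisors", "quality assurance"],
        "duties": ["identify and manage risks in data, models, automation and software",
                   "continuously monitor and verify first-line configuration"]},
    3: {"who": ["auditors", "ethics board"],
        "duties": ["verify compliance with rules, regulations and goals",
                   "ensure technology is used responsibly and ethically"]},
}

def who_oversees(line):
    """Roles on the next line of defense, if any."""
    return THREE_LINES.get(line + 1, {}).get("who", [])
```

Encoding the model this way lets tooling answer questions such as which roles oversee a given line; for the first line, that is the second-line staff.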
It is critical to monitor the IEEE Global Initiative on Ethics of Autonomous and
Intelligent Systems. The next stage is to devise a strategy [44]. Software and AI models,
as previously said, need entirely distinct techniques. While some of the initiatives may not be beneficial, a portfolio strategy increases the likelihood that at least some of them will succeed. At any one time, firms with established AI adoption have a portfolio of models in different stages of development, including conception, testing, deployment, production, and retirement. The ROI must be monitored and adjusted as required across the portfolio to ensure the best mix of business use cases, efficiency versus effectiveness initiatives, and so on. Ten human talents and four intelligences are necessary to profit from human-centered AI. The distribution strategy must be carefully evaluated because of the convergence of data, software, and AI models.
To deliver AI-embedded software, or Software 2.0, both waterfall and agile software
development methodologies must be updated and interlaced. The essential indicators
that must be recorded and monitored for supervision will be identified in the specific
delivery plan. The next level of governance is the ecosystem as a whole: the ecosystem in which AI models will be incorporated, and the context in which they will be used by personnel both within and outside the company. The societal impact of the company's AI should be evaluated; in this area, IEEE's Well-being Measures are a strong contender [41].
The governing board is in charge of setting the company’s vision, mission, values, and
organizational appetite for risk. Following that, management is given the duty of fulfilling
the organization’s objectives and obtaining the necessary resources. The governing body
receives management reports on planned, actual, and projected performance, as well as risk and risk management information. The degree of overlap and separation between the
governing body’s and management’s tasks varies depending on the organization. The
governing body can be more or less “hands on” when it comes to strategic and operational
matters. The strategic plan can be created solely by the governing body or management,
or it can be a shared effort. Also, the CEO can be a member of the board of directors and
even the chairman. Effective communication between management and the governing
body is required in every circumstance. Although the CEO is often the principal point
of contact for this communication, other senior executives also interact regularly with
the governing body.
Second-line executives, such as a Chief Risk Officer (CRO) and a Chief Compliance Officer (CCO), are sought and required by organizations and authorities. This is entirely compatible with the concepts of the Three Lines Model. Management encompasses roles on the first and second lines, while internal audit forms the third. Internal audit's independence from management control allows it to plan and carry out its activities without fear of being influenced or interfered with. It has unrestricted access to the people, resources, and information it requires, and it makes recommendations to the governing body. Independence, however, does not imply isolation. Internal audit and management must communicate for internal audit's work to be relevant and consistent with the organization's strategic and operational goals. Internal audit broadens its expertise and understanding of the firm through all its activities, increasing the assurance and direction it provides as a trusted and strategic partner [26]. Coordination and communication between the first and second lines of management and internal audit are essential to minimize excessive duplication, overlap, or gaps. Because it reports to the governing body, internal audit is frequently referred to as the organization's "eyes and ears".
The governing body is in charge of internal audit oversight, which includes hiring
and firing the Chief Audit Executive (CAE), approving and resourcing the audit plan,
receiving and considering CAE reports, and providing the CAE with unrestricted access
to the governing body, including private sessions without management present. Second-line roles can be created by delegating essential responsibilities, with reporting lines to the governing body, to give them a degree of autonomy from first-line employees and senior management. The Three Lines Model allows for as many reporting lines between management and the governing body as are required: for compliance or risk management, for example, as many persons may report directly to the board as necessary, organized to provide a degree of independence. Second-line roles offer much the same advice, monitoring, analysis, reporting, and assurance as third-line roles, but with less independence. Lower-level employees who make risk management choices, devising and implementing policies, setting boundaries, establishing targets, and so on, are obviously "in the kitchen making sausages" and part of management's actions and responsibilities. Some organizations, most notably regulated financial institutions, must have these arrangements in place to maintain true independence.
Risk management remains the duty of first-line management in these instances. Second-line responsibilities include monitoring, counseling, directing, testing, assessing, and reporting on risk management. Second-line jobs are a component of management's responsibilities: regardless of reporting lines, they are never fully independent of management, since they assist and challenge those in first-line positions, and they are integral to management decisions and actions. Third-line occupations are distinguished by their independence from management. Internal audit's independence, which distinguishes it from other activities and allows it to provide distinct assurance and recommendations, is embedded in the principles of the Three Lines Model. Internal audit preserves its independence by refusing to make decisions or take actions that are part of management's responsibilities, such as risk management, and by refusing to provide assurance over activities for which internal audit is currently or was previously responsible. The CAE may, however, be expected to take on decision-making responsibilities for jobs that require similar skills, such as statutory compliance or enterprise risk management (ERM), especially to uphold best practices for company performance [42].
These arrangements help to emphasize the Three Lines Model's principles. Stakeholders understand that the governing body is in charge of organizational monitoring.
Second-line posts help with risk management. It is possible to combine or split the
first and second lines. Specialists are appointed to specialized second-line occupations to
give extra knowledge, supervision, and challenge to first-line activities. Second-line risk
management jobs may concentrate on internal control, information, technology security,
sustainability, and quality assurance, among other risk management goals. Internal audit offers unbiased assurance and advice on the appropriateness and effectiveness of governance and risk management. This is accomplished by the skillful use of rigorous and disciplined methodologies, knowledge, and insight [50]. To encourage and support continuous improvement, it reports its findings to management and the governing body. Examples of second-line jobs include the formulation, execution, and continuous improvement of risk management procedures, including internal control at the process, system, and entity levels [48].
Risk management goals are achieved through adherence to laws, norms, and accepted ethical behavior; internal control; information and technology security; sustainability; and quality assurance. The second line analyzes and reports on the effectiveness and appropriateness of risk management, including internal control. Internal audit is separate from management and reports to the governing body. In a university, for example, internal audit may report functionally to the Audit and Risk Management Committee of the Board of Governors and administratively to the Executive Director, University Governance, and University Secretary. Assurance and advice on the adequacy and usefulness of governance, risk management, and internal control are communicated to management and the governing body independently and objectively, to support the achievement of organizational objectives and to promote and facilitate continuous improvement. Any threats to impartiality or independence are brought to the governing body's attention, which takes appropriate action. Further assurance is provided to meet legal and regulatory duties that safeguard stakeholders' interests, and internal assurance sources are augmented to meet the needs of management and the governing body.
Data architects are also vital in the governance of AI systems. In order to model AI,
businesses must have a solid data or metadata pipeline. Keep in mind that the success
of AI is contingent on well-organized data architecture devoid of mistakes and noise.
There will be a need for data standards, data governance, and business analytics. The
development of the AI governance function also requires human resources. HR may, for example, seek employees who "fit" into the company's present AI framework and provide existing staff with training tools to help them learn how to build ethical AI applications. When AI technology is deployed, it is critical to guarantee that no legal boundaries are breached and that AI solutions meet organizational and industry-specific regulatory standards. There is no one-size-fits-all plan that takes into account all legal and regulatory issues; customers' perceptions of ethical behavior in the financial services industry, for example, may differ significantly from corporate ethics. Integrating legal and regulatory teams into the AI governance function provides a diverse set of decision-making inputs. Marketing, sales, human resources, supply chain, and finance efforts all realize the advantages of AI. As a consequence, subject knowledge is required not just for application creation but also for application evaluation, and having a strong business presence on the core AI governance council may help improve outcomes. People from various backgrounds should be represented on a company's
governing board. It also contributes to inclusive and smooth governance by considering
all of the company’s issues. Product-based businesses provide a diverse variety of AI-
enabled products. When a business purchases a product that isn’t primarily based on AI,
it often falls beyond the purview of the AI regulatory agency. But what if the business
introduces an AI-assisted process, service, or product? Procurement and finance depart-
ments should ideally have AI professionals on staff to help with product onboarding. A
well-functioning AI governance position will provide a framework for monitoring AI
algorithms and products more effectively. In addition, developing an agile and cross-
functional AI governance committee would bring a diverse set of perspectives to the
table and help in the spread of AI knowledge.
4 Future Views
Despite the development of ethical frameworks, AI systems are nevertheless being
quickly implemented across a wide range of vital areas in the public and commer-
cial sectors—including healthcare, education, criminal justice, and many more—with
no protections or accountability procedures in place. There are a number of challenges that must be addressed, and no single endeavor, nation, or corporation will be able to address them alone. Emerging technologies are increasingly cross-border, and if the norms and practices shaping technical development and implementation in various countries do not coincide, significant possibilities can be missed (WTO, 2019) [46].
New conflicts can erupt both within and between states in a divided globe. In terms of economic prosperity, the development of certain technical systems may grow more costly, delaying innovation. This can lead to injustice and new divides between technologically advanced and technologically disadvantaged nations or regions.
Additionally, major differences in how new technologies (particularly AI) are handled
and utilized in terms of human rights can make guaranteeing people’s equal access to
rights and opportunities across borders more difficult. New technologies can be used
as new digital surveillance tools, allowing governments to automate citizen monitoring
and tracking; they can also help policymakers allocate public goods and resources more
efficiently; and they can even be powerful mechanisms for private companies to forecast
our behavior [50].
Personal data can be retained and used for AI in an open or hidden manner. It can be
voluntarily offered as a kind of remuneration, or it can be taken without the agreement or
knowledge of the owner. Overall, arguments about who has access to our data, who has
the right to make decisions about it, and who has the instruments to enforce that authority
haunt the path to the digital future. This isn’t to say that all technological governance
should be done at the global level. It is critical for regions, states, and cities to be able
to adapt to their residents’ social, economic, and cultural needs. While the majority of
research has focused on wealthy nations, there is a need for additional information about
the geographically specific effect of AI systems on developing countries, as well as how
new technology can perpetuate historical inequalities in these areas.
Global processes, on the other hand, are essential even if they do not result in integrated systems, since inequity thrives in the absence of universal laws. To manage the digital transition and achieve social inclusion, it will be necessary to create internationally consistent ethical, humanitarian, legal, and political normative frameworks. Furthermore,
while taking into account geopolitical and cultural disparities, there will be a growing
need to focus on algorithmic criteria rather than ethical principles. The G20’s role in
aligning interests and organizing such projects will be critical in the coming years.
The G20 brings together some of the most powerful political and economic forces on the planet. It spans the globe and includes some of the world's strongest economies. As a crucial venue for dialogue and involvement, both executive and legislative, it is the ideal place to examine the future of digital governance and respond to some of the most significant contemporary difficulties and concerns facing our world [51]. Right now, there is no one-size-fits-all solution for the best AI technique, but there
are lots of options. We must all work together to determine which choice will benefit the
most people. By participating in and leading this discussion, the G20 has the potential to
become the spinal column of a new architecture for the 21st century, ensuring a brighter
future for everyone.
Although recognition and classification are not the only tasks given to AI systems, they are the most popular. The flexibility of AI approaches might be seen as a sign of variety. However, when research focuses on small, local gains in well-known, well-suited tasks like identification and classification and then applies the effort to comparable difficulties over a wide variety of domains, the predominance of a few activities may pose a risk. It is also feasible that the consequences of a system failure will be completely unexpected: the systems may produce a wide variety of harms, from mild discomfort to death.
This accords with prior research showing that AI piques people's interest in a broad variety of topics, regardless of whether they are beneficial. On the other hand, mechanisms that protect people's privacy or remove the potential for discrimination are few and far between. Furthermore, not every system requires extensive testing: a failure that causes discomfort is far more forgivable than one that causes death. As a result, the severity of a system failure should be considered while designing a system. Despite their significant dispersion throughout a broad range of categories, the systems are primarily defined by their limited application within those categories. In each facet, just a few systems, if any, represented several categories; for example, several domains were considerably underrepresented, and inconsistently so.
In a variety of industries, e.g., agriculture and medical applications, robots have lately
overtaken humans. The most critical system tasks were recognition and classification.
This might be due to researchers' access to resources such as robots, the overwhelming
popularity of particular applications such as self-driving vehicles, or technology's increasing
capacity to apply itself to these more difficult fields. However, less well-known challenges
should not be overlooked in future research, since less well-known does not always
imply lower value, and AI might be useful in situations such as crop maturity assessment,
disease diagnosis, and MRI scan assistance. This is particularly significant since practical
research in these potentially highly beneficial areas may aid in the dissemination
of findings and should be included in studies. Software engineering research has not always
prospered in this discipline. Similarly, the majority of the tasks entrusted to the
systems are of moderate complexity.
300 U. A. Usmani et al.
The most significant category, 'recognition & classification,' for example, is a difficult
task, yet it is often required due to the problems it solves: Is there any mold on
the product? Is the person in both pictures the same person? Is it now time to reap the
benefits of your labor? Assembly, as compared to recognition, is a combination of the
two: identifying and actively assembling key components. As a result, it seems that only
a small portion of genuine AI system development is concentrated on very difficult
challenges. Many model-centered validations, as well as data-driven ML testing in general,
are plagued by data problems [61]. Model-centered methods, on the other hand, allow
more efficient and perhaps faster validation of model-centered systems, or at least the
models used in the systems, if sufficient high-quality data is available. If a system has
several components, we suggest carefully assessing whether the model-centered approach
is the optimal validation technique. In general, the research seems to place a higher
priority on first validation than on ongoing validation. This is not surprising; it has always
been done this way: a system is tested before being released, when it seems to be functional.
Given the demand discrepancies between AI and traditional systems, however, this may not
be appropriate [62].
The first validation of a self-configuring system often includes validating self-configurability
and the first configuration. If the system reconfigures itself during deployment, it
should almost certainly be assessed to ensure that it continues to satisfy the system's needs.
The same can be said for a video streaming platform's recommendation algorithm, which
may have a constantly shifting user base: without continuing validation, the system
may fail to meet its requirements, and developers and users may be unaware of it. Our
validation and continuing validation categories are based on original research; as a
consequence, the completeness of the categories will not be examined in this study.
Different taxonomies can be beneficial for recognizing and grasping AI validation, so
researching, expanding, and improving these taxonomies for validity and utility will be
future work. Monitoring the system's outputs, thresholds, and other factors
on a continuous basis may help AI systems improve their accuracy and efficiency. The
first step in the oversight process could be to create a list of all AI systems in use at the
company, along with their specific uses, techniques used, names of developers/teams and
business owners, and risk ratings – such as estimates of the potential social and financial
risks that could arise if such a system fails. Examining the AI system's inputs and outputs,
as well as the AI system itself, may need a different methodology. Although data quality
standards aren't exclusive to AI/ML, they do have an influence on AI systems that learn
from data and provide output depending on what they've learned. Training data may be
used to assess a data collection's quality and biases.
If practical and appropriate, benchmarking against other models and existing approaches
to improve model interpretability can be incorporated into AI system assessment. Understanding
the elements influencing AI system outputs helps to boost confidence in AI systems.
Drift in AI systems might cause a plethora of problems: a shifting link between
target variables and independent variables can lead to poor model accuracy. As such,
drift detection is a useful tool in AI problems, e.g., for the security, privacy, and fairness of
a model, as an avoidance measure. By evaluating whether input data varies considerably
from the model's training data, monitoring may help discover "data drift".
Enhancing Artificial Intelligence Control Mechanisms 301
Accounting for the model's data collected in production and analyzing the model's correctness
is one way of acquiring insight into the model's "accuracy drift". In lending institutions,
compliance, fair lending, and system governance teams are prevalent, and they look
for signs of bias in input variables and procedures. As a consequence of technological
advancements and the deployment of de-biasing AI, a portion, if not the majority, of
this labor can be automated and simplified in the future. Fair AI, on the other hand, may
need a human-centered approach. The generalist knowledge and experience of a well-trained
and varied group probing for discriminatory bias in AI systems is unlikely to
be completely replaced by an automated procedure. As a result, human judgment might
be utilized as a first line of defense against biased artificial intelligence. According to recent
research, discrimination-reducing approaches can minimize disparities in a class-control
context while still keeping good predictive quality. In order to reduce inequities, mitigation
algorithms design the "optimal" system for a certain degree of quality and discriminating
measures.
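As an illustration of the drift monitoring described above, the following sketch flags "data drift" when production inputs deviate from the training distribution. The summary-statistic test and all names here are illustrative assumptions, not from the paper; real deployments typically use richer distributional tests.

```python
# A minimal sketch of "data drift" monitoring (illustrative, not from the paper):
# flag drift when the mean of production inputs moves more than `k` training
# standard deviations away from the training mean.
from statistics import mean, stdev

def detect_data_drift(train_values, live_values, k=3.0):
    """Return True if the live feature distribution drifted from training."""
    mu, sigma = mean(train_values), stdev(train_values)
    if sigma == 0:
        return mean(live_values) != mu
    shift = abs(mean(live_values) - mu) / sigma   # standardized shift of live mean
    return shift > k

train = [10.0, 10.5, 9.5, 10.2, 9.8, 10.1]
```

In this toy setting, live inputs near the training mean pass, while a large shift in the live mean is reported as drift; an analogous check on model accuracy over labeled production data would sketch "accuracy drift".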
The algorithms look for alternatives when there isn't another system with a greater
degree of quality for a certain level of discrimination. However, no solution has been
devised that completely removes bias for any given level of quality. Before such algorithms
are utilized in a production environment, more testing and validation studies are needed.
There are two sorts of methodologies: traditional searches over algorithms and feature
specifications for valid and less discriminating systems, and more modern approaches that
adjust the input data or the algorithms' optimization functions themselves. To
reduce disparate impact, feature selection may be used, which involves removing one or
two disparate-effect components from the system and replacing them with a few additional
variables. In complicated AI/ML systems, these tactics have been demonstrated to
be ineffective. Bias reduction requires new strategies in pre-processing, inside
the decision-making phase of the algorithm, and continuing into the output post-processing
phase. The legal context in which technology is used, as well as how it is used, affects
whether specific tactics are allowed in a given circumstance.
Accuracy drift detection might be useful in the business sector since it can detect
a decrease in model accuracy before it has a major effect on the company; accuracy
drift leads to a loss of precision in the model. Data drift detection, on the other hand, aids
companies in determining how data quality varies over time. It may be challenging for
many businesses to guarantee that AI/ML explanations are both accurate and useful
(explainability). AI/ML explanations, like the underlying AI/ML systems, may be poor
approximations, wrong, or inconsistent. In the financial services industry, consistency
is crucial, especially when it comes to adverse action letters for credit lending decisions.
To lessen explainability issues, explanatory procedures may be tested for accuracy
and stability in human assessment studies or on simulated data, depending on the individual
implementation. According to a recent study, providing explanations and forecasts about
how AI systems function may aid criminal actors.
To prevent security concerns, businesses should share information with customers only
when they directly request it or when it is mandated by law. Traditional security
techniques such as real-time anomaly detection, user authentication, and API throttling
may be employed to secure AI/ML systems that are trained on sensitive data and make
predictions available to end users, depending on the implementation and management.
5 Conclusions
Sound governance is needed in order for AI to be used ethically and effectively. We infer
that the model generated can aid businesses in the task of identifying the structures,
processes, and responsibilities that best support goal attainment while simultaneously
ensuring robust governance and risk management. The concept of our proposed model can
help manage and regulate risks effectively.
References
1. Collins, C., Dennehy, D., Conboy, K., Mikalef, P.: Artificial intelligence in information sys-
tems research: a systematic literature review and research agenda. Int. J. Inf. Manag. 60,
102383 (2021)
2. Smuha, N.A.: Beyond a human rights-based approach to AI governance: promise, pitfalls,
plea. Philos. Technol. 34(1), 91–104 (2021)
3. Ghoreishi, M., Happonen, A.: New promises AI brings into circular economy accelerated
product design: a review on supporting literature. In: E3S Web Conference, vol. 158, pp. 1–10
(2020). https://doi.org/10.1051/e3sconf/202015806002
4. Tigard, D.W.: Responsible AI and moral responsibility: a common appreciation. AI Ethics
1(2), 113–117 (2020). https://doi.org/10.1007/s43681-020-00009-0
5. Shneiderman, B.: Responsible AI: bridging from ethics to practice. Commun. ACM 64(8),
32–35 (2021)
6. Berlin, S.J., John, M.: Particle swarm optimization with deep learning for human action recog-
nition. Multimedia Tools Appl 79(25–26), 17349–17371 (2020). https://doi.org/10.1007/s11
042-020-08704-0
7. Rakova, B., Yang, J., Cramer, H., Chowdhury, R.: Where responsible AI meets reality:
practitioner perspectives on enablers for shifting organizational practices. Proc. ACM Hum.
Comput. Interact. 5(CSCW1), 1–23 (2021)
8. Wearn, O.R., Freeman, R., Jacoby, D.M.: Responsible AI for conservation. Nat. Mach. Intell.
1(2), 72–73 (2019)
9. Arrieta, A.B., et al.: Explainable artificial intelligence (XAI): concepts, taxonomies, oppor-
tunities and challenges toward responsible AI. Inf. Fusion 58, 82–115 (2020)
10. Ghoreishi, M., Happonen, A., Pynnönen, M.: Exploring industry 4.0 technologies to enhance
circularity in textile industry: role of Internet of Things. In: Twenty-first International Working
Seminar on Production Economics, Austria, 24–28 February 2020, pp. 1–16 (2020). https://
doi.org/10.5281/zenodo.3471421
11. Metso, L., Happonen, A., Rissanen, M.: Estimation of user base and revenue streams for novel
open data based electric vehicle service and maintenance ecosystem driven platform solution.
In: Karim, R., Ahmadi, A., Soleimanmeigouni, I., Kour, R., Rao, R. (eds.) IAI 2021. LNME,
pp. 393–404. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-93639-6_34
12. Usmani, U.A., Haron, N.S., Jaafar, J.: A natural language processing approach to mine online
reviews using topic modelling. In: Chaubey, N., Parikh, S., Amin, K. (eds.) COMS2 2021.
CCIS, vol. 1416, pp. 82–98. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-767
76-1_6
13. Trocin, C., Mikalef, P., Papamitsiou, Z., Conboy, K.: Responsible AI for digital health: a
synthesis and a research agenda. Inf. Syst. Front., 1–19 (2021)
14. Peters, D., Vold, K., Robinson, D., Calvo, R.A.: Responsible AI—two frameworks for ethical
design practice. IEEE Trans. Technol. Soc. 1(1), 34–47 (2020)
15. Clarke, R.: Principles and business processes for responsible AI. Comput. Law Secur. Rev.
35(4), 410–422 (2019)
16. Usmani, U.A., Watada, J., Jaafar, J., Aziz, I.A., Roy, A.: A reinforcement learning based
adaptive ROI generation for video object segmentation. IEEE Access 9, 161959–161977
(2021)
17. Sambasivan, N., Holbrook, J.: Toward responsible AI for the next billion users. Interactions
26(1), 68–71 (2018)
18. Butler, L.M., Arya, V., Nonyel, N.P., Moore, T.S.: The Rx-HEART framework to address
health equity and racism within pharmacy education. Am. J. Pharm. Educ. 85(9) (2021)
19. Ghoreishi, M., Happonen, A.: Key enablers for deploying artificial intelligence for circu-
lar economy embracing sustainable product design: three case studies. In: AIP Conference
Proceedings 2233(1), 1–19 (2020). https://doi.org/10.1063/5.0001339
20. Usmani, U.A., Watada, J., Jaafar, J., Aziz, I.A., Roy, A.: A reinforcement learning algorithm
for automated detection of skin lesions. Appl. Sci. 11(20), 9367 (2021)
21. Dignum, V.: The role and challenges of education for responsible AI. Lond. Rev. Educ. 19(1),
1–11 (2021)
22. Leslie, D.: Tackling COVID-19 through responsible AI innovation: five steps in the right
direction. Harv. Data Sci. Rev. (2020)
23. Ghoreishi, M., Happonen, A.: The case of fabric and textile industry: the emerging role of
digitalization, Internet-of-Things and industry 4.0 for circularity. In: Yang, X.-S., Sherratt,
S., Dey, N., Joshi, A. (eds.) Proceedings of Sixth International Congress on Information
and Communication Technology. LNNS, vol. 216, pp. 189–200. Springer, Singapore (2022).
https://doi.org/10.1007/978-981-16-1781-2_18
24. Wang, Y., Xiong, M., Olya, H.: Toward an understanding of responsible artificial intelligence
practices. In: Proceedings of the 53rd Hawaii International Conference on System Sciences,
pp. 4962–4971. Hawaii International Conference on System Sciences (HICSS), January 2020
25. Cheng, L., Varshney, K.R., Liu, H.: Socially responsible AI algorithms: issues, purposes, and
challenges. J. Artif. Intell. Res. 71, 1137–1181 (2021)
26. Happonen, A., Ghoreishi, M.: A mapping study of the current literature on digitalization and
industry 4.0 technologies utilization for sustainability and circular economy in textile indus-
tries. In: Yang, X.-S., Sherratt, S., Dey, N., Joshi, A. (eds.) Proceedings of Sixth International
Congress on Information and Communication Technology. LNNS, vol. 217, pp. 697–711.
Springer, Singapore (2022). https://doi.org/10.1007/978-981-16-2102-4_63
27. Ashok, M., Madan, R., Joha, A., Sivarajah, U.: Ethical framework for artificial intelligence
and digital technologies. Int. J. Inf. Manag. 62, 102433 (2022)
28. Usmani, U.A., Watada, J., Jaafar, J., Aziz, I.A., Roy, A.: A reinforced active learning algorithm
for semantic segmentation in complex imaging. IEEE Access 9, 168415–168432 (2021)
29. Maree, C., Modal, J.E., Omlin, C.W.: Towards responsible AI for financial transactions.
In: 2020 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 16–21. IEEE,
December 2020
30. Rockall, A.: From hype to hope to hard work: developing responsible AI for radiology. Clin.
Radiol. 75(1), 1–2 (2020)
31. Constantinescu, M., Voinea, C., Uszkai, R., Vică, C.: Understanding responsibility in responsi-
ble AI. Dianoetic virtues and the hard problem of context. Ethics Inf. Technol. 23(4), 803–814
(2021). https://doi.org/10.1007/s10676-021-09616-9
32. Happonen, A., Santti, U., Auvinen, H., Räsänen, T., Eskelinen, T.: Digital age business model
innovation for sustainability in university industry collaboration model. In: E3S Web of Con-
ferences, vol. 211, Article no. 04005, pp. 1–11 (2020). https://doi.org/10.1051/e3sconf/202
02110400
33. Al-Dhaen, F., Hou, J., Rana, N.P., Weerakkody, V.: Advancing the understanding of the role
of responsible AI in the continued use of IoMT in healthcare. Inf. Syst. Front., 1–20 (2021)
34. McDonald, M.L., Keeves, G.D., Westphal, J.D.: One step forward, one step back: white
male top manager organizational identification and helping behavior toward other executives
following the appointment of a female or racial minority CEO. Acad. Manag. J. 61(2), 405–439
(2018)
35. Usmani, U.A., Roy, A., Watada, J., Jaafar, J., Aziz, I.A.: Enhanced reinforcement learning
model for extraction of objects in complex imaging. In: Arai, K. (ed.) Intelligent Computing.
LNNS, vol. 283, pp. 946–964. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-
80119-9_63
36. Lee, M.K., et al.: Human-centered approaches to fair and responsible AI. In: Extended
Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–8,
April 2020
37. Yang, Q.: Toward responsible AI: an overview of federated learning for user-centered privacy-
preserving computing. ACM Trans. Interact. Intell. Syst. (TiiS) 11(3–4), 1–22 (2021)
38. Hirvimäki, M., Manninen, M., Lehti, A., Happonen, A., Salminen, A., Nyrhilä, O.: Evaluation
of different monitoring methods of laser additive manufacturing of stainless steel. Adv. Mater.
Res. 651, 812–819 (2013). https://doi.org/10.4028/www.scientific.net/AMR.651.812
39. Sen, P., Ganguly, D.: Towards socially responsible ai: cognitive bias-aware multi-objective
learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 03,
pp. 2685–2692, April 2020
40. de Laat, P.B.: Companies committed to responsible AI: from principles towards implementation
and regulation? Philos. Technol. 34(4), 1135–1193 (2021)
41. Happonen, A., Tikka, M., Usmani, U.: A systematic review for organizing hackathons and
code camps in COVID-19 like times: literature in demand to understand online hackathons
and event result continuation. In: 2021 International Conference on Data and Software
Engineering (ICoDSE), pp. 7–12 (2021). https://doi.org/10.1109/ICoDSE53690.2021.964
8459
42. Wangdee, W., Billinton, R.: Bulk electric system well-being analysis using sequential Monte
Carlo simulation. IEEE Trans. Power Syst. 21(1), pp. 188–193 (2006)
43. Usmani, U.A., Usmani, M.U.: Future market trends and opportunities for wearable sensor
technology. IACSIT Int. J. Eng. Technol. 6(4), 326–330 (2014)
44. Dignum, V.: Ensuring responsible AI in practice. In: Dignum, V. (ed.) Responsible Artifi-
cial Intelligence. Artificial Intelligence: Foundations, Theory, and Algorithms, pp. 93–105.
Springer, Cham (2019). https://doi.org/10.1007/978-3-030-30371-6_6
45. Amershi, S.: Toward responsible AI by planning to fail. In: Proceedings of the 26th ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining, p. 3607, August
2020
46. Cath, C.: Governing artificial intelligence: ethical, legal and technical opportunities and
challenges. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 376(2133), 20180080 (2018)
47. Eskelinen, T., Räsänen, T., Santti, U., et al.: Designing a business model for environmental
monitoring services using fast MCDS innovation support tools. TIM Rev. 7(11), 36–46 (2017).
https://doi.org/10.22215/timreview/1119
48. Truby, J.: Governing artificial intelligence to benefit the UN sustainable development goals.
Sustain. Dev. 28(4), 946–959 (2020)
49. Happonen, A., Salmela, E.: Automatic & unmanned stock replenishment process using scales
for monitoring. In: Proceedings of the Third International Conference on Web Information
Systems and Technologies - (Volume 3), Barcelona, Spain, 3–6 March 2007, pp. 157–162
(2007). https://doi.org/10.5220/0001282801570162
50. Braun, B.: Governing the future: the European central bank’s expectation management during
the Great moderation. Econ. Soc. 44(3), 367–391 (2015)
51. Nitzberg, M., Zysman, J.: Algorithms, data, and platforms: the diverse challenges of governing
AI. J. Eur. Public Policy (2021)
52. Salmela, E., Santos, C., Happonen, A.: Formalisation of front end innovation in supply net-
work collaboration. Int. J. Innov. Reg. Dev. 5(1), 91–111 (2013). https://doi.org/10.1504/
IJIRD.2013.052510
53. Piili, H., et al.: Digital design process and additive manufacturing of a configurable product.
Adv. Sci. Lett. 19(3), 926–931 (2013). https://doi.org/10.1166/asl.2013.4827
54. Metso, L., Happonen, A., Rissanen, M., Efvengren, K., Ojanen, V., Kärri, T.: Data openness
based data sharing concept for future electric car maintenance services. In: Ball, A., Gelman,
L., Rao, B.K.N. (eds.) Advances in Asset Management and Condition Monitoring. SIST, vol.
166, pp. 429–436. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-57745-2_36
55. Happonen, A., Siljander, V.: Gainsharing in logistics outsourcing: trust leads to success in
the digital era. Int. J. Collab. Enterp. 6(2), 150–175 (2020). https://doi.org/10.1504/IJCENT.
2020.110221
56. Kärri, T., Marttonen-Arola, S., Kinnunen, S-K., Ylä-Kujala, A., Ali-Marttila, M., et al.: Fleet-
based industrial data symbiosis, title of parent publication: S4Fleet - service solutions for fleet
management, DIMECC Publications series No. 19, 06/2017, pp. 124–169 (2017)
57. Kinnunen, S.-K., Happonen, A., Marttonen-Arola, S., Kärri, T.: Traditional and extended
fleets in literature and practice: definition and untapped potential. Int. J. Strateg. Eng. Asset
Manag. 3(3), 239–261 (2019). https://doi.org/10.1504/IJSEAM.2019.108467
58. Metso, L., Happonen, A., Ojanen, V., Rissanen, M., Kärri, T.: Business model design ele-
ments for electric car service based on digital data enabled sharing platform, Cambridge.
In: International Manufacturing Symposium, Cambridge, UK, 26–27 September 2019, p. 6
(2019). https://doi.org/10.17863/CAM.45886
59. Palacin, V., Gilbert, S., Orchard, S., Eaton, A., Ferrario, M.A., Happonen, A.: Drivers of
participation in digital citizen science: case studies on Järviwiki and safecast. Citiz. Sci.
Theory Pract. 5(1), Article no. 22, pp. 1–20 (2020). https://doi.org/10.5334/cstp.290
60. Palacin, V., et al.: SENSEI: harnessing community wisdom for local environmental monitoring
in Finland. In: CHI Conference on Human Factors in Computing Systems, Glasgow, Scotland
UK, pp. 1–8 (2019). https://doi.org/10.1145/3290607.3299047
61. Zhang, D., Yin, C., Zeng, J., Yuan, X., Zhang, P.: Combining structured and unstructured data
for predictive models: a deep learning approach. BMC Med. Inform. Decis. Mak. 20(1), 1–11
(2020)
62. Vassev, E., Hinchey, M.: Autonomy requirements engineering. In: Vassev, E., Hinchey, M.
(eds.) Autonomy Requirements Engineering for Space Missions. NASA Monographs in Sys-
tems and Software Engineering, pp. 105–172. Springer, Cham (2014). https://doi.org/10.
1007/978-3-319-09816-6_3
A General Framework of Particle Swarm Optimization
The particle swarm optimization (PSO) algorithm was developed by James Kennedy (a social
psychologist) and Russell C. Eberhart (an electrical engineer). This section is guided
by the article "Particle swarm optimization: an overview" by Riccardo Poli, James
Kennedy, and Tim Blackwell. The main idea of PSO is based on social intelligence: it
simulates how a flock of birds searches for food. Given a target function f (x), known as
the cost function, the optimization problem is to find the minimum point x*, known
as the minimizer or optimizer, so that f (x*) is minimal. In PSO theory, f (x) is also called
the fitness function, and thus, when f (x) is evaluated at x0, the value f (x0) is called a fitness
value, which represents the best food source for which the flock of birds searches. If x* is an
optimizer, f (x*) is called the optimal value, best value, or best fitness value. As a convention,
the optimization problem is a global minimization problem when x* is searched over the entire
domain of f (x); for global maximization, it suffices to change our viewpoint a little bit:

x* = argmin_x f (x)
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 307–316, 2023.
https://doi.org/10.1007/978-3-031-18461-1_20
308 L. Nguyen et al.
pg ≈ argmin_x f (x)
1. The cost function evaluated at pg, f (pg), is small enough; for example, f (pg) is smaller
than a small threshold.
2. Or PSO has run over a large enough number of iterations.
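These two stopping conditions can be sketched as a simple loop guard. The names `run_pso` and `step` are hypothetical, assuming `step()` performs one PSO iteration and returns the current global best value f (pg):

```python
# Hypothetical sketch of the two PSO stopping conditions: stop when f(p_g)
# is small enough, or when the iteration budget is exhausted.
def run_pso(step, threshold=1e-3, max_iters=1000):
    best = float("inf")
    iters = 0
    while iters < max_iters and best >= threshold:
        best = step()          # one PSO iteration -> current global best value
        iters += 1
    return best, iters

# Toy usage: a "swarm" whose best value shrinks each iteration
values = iter([1.0, 0.5, 0.0005])
best, iters = run_pso(lambda: next(values))
```

Here the loop stops on whichever condition triggers first: the fitness threshold (condition 1) or the iteration budget (condition 2).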
Function U(0, φ1) generates a random vector whose elements are random numbers
in the range [0, φ1]. Similarly, function U(0, φ2) generates a random vector whose
elements are random numbers in the range [0, φ2].
Note, the superscript "T" indicates the transposition operator of vectors and matrices. The
operator ⊗ denotes component-wise multiplication of two points [2, p. 3]. For example,
given the random vector U(0, φ1) = (r11, r12, …, r1n)T and position xi = (xi1, xi2, …, xin)T,
we have U(0, φ1) ⊗ xi = (r11xi1, r12xi2, …, r1nxin)T.
The larger the inertial weight ω is, the faster particles move because their inertia is high,
which leads PSO to explore for the global optimizer. Note that moving fast does not imply fast
convergence. In contrast, a smaller ω leads PSO to exploit the local optimizer. In general,
large ω expresses exploration and small ω expresses exploitation. The quantity 1 − ω is
known as the friction coefficient. A popular value of ω is 0.7298, given φ1 = φ2 = 1.4962.
Pioneers in PSO [2, p. 5] recognized that if the velocities vi of particles are not restricted,
their movements can depart from convergence trajectories to unacceptable degrees. Therefore,
they proposed a so-called constriction coefficient χ to damp the dynamics of particles. Note,
χ is also called the constriction weight or damping weight, where 0 < χ ≤ 1. With the support
of the constriction coefficient, Eq. 1 becomes [2, p. 5]:

vi = χ(vi + U(0, φ1) ⊗ (pi − xi) + U(0, φ2) ⊗ (pg − xi)) (4)
It is easy to recognize that Eq. 3 is a special case of Eq. 4 in which the expression χvi
plays the role of ωvi. A popular value of the constriction coefficient is χ =
0.7298, given φ1 = φ2 = 2.05 and ω = 1. Note, the inertial weight ω is also a parameter that
damps the dynamics of particles. This is the reason that ω = 1 when χ = 1; however, the
constriction of χ is stronger than that of ω, because χ affects the previous velocity and the
two attraction forces, whereas ω affects only the previous velocity.
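As a minimal sketch (not the authors' code), Eq. 3 and Eq. 4 can be written as one update rule: Eq. 3 is recovered by setting χ = 1 and Eq. 4 by setting ω = 1. The list-based vectors and function names are assumptions for brevity:

```python
import random

def U(phi, n):
    """Random vector with elements uniform in [0, phi]."""
    return [random.uniform(0.0, phi) for _ in range(n)]

def update_velocity(v, x, p_i, p_g, phi1=2.05, phi2=2.05, omega=1.0, chi=0.7298):
    """v <- chi*(omega*v + U(0,phi1) (*) (p_i - x) + U(0,phi2) (*) (p_g - x)),
    where (*) is the component-wise product."""
    n = len(x)
    r1, r2 = U(phi1, n), U(phi2, n)
    return [chi * (omega * v[j] + r1[j] * (p_i[j] - x[j]) + r2[j] * (p_g[j] - x[j]))
            for j in range(n)]
```

When a particle sits at both its personal best p_i and the global best p_g, the attraction terms vanish and the velocity is simply damped by χω, which illustrates the damping role of the two coefficients.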
The structure of the swarm, which is determined by defining the neighbors and neighborhood
of every particle, is called the swarm topology or population topology. Because pg is the best
position of the entire swarm, the attraction force U(0, φ2) ⊗ (pg − xi) indicates that the
movement of each particle is affected by all other particles, which means that every particle
connects to all remaining particles. In other words, the neighbors of a particle are all other
particles, which is known as a fully connected swarm topology. For an easily understandable
explanation, suppose particles are vertices of a graph; a fully connected swarm topology
implies that such a graph is fully connected, in which all vertices are connected together.
Alternately, the swarm topology can be defined in a different way so that each particle i only
connects with a limited number Ki of other particles. In other words, each particle has only
Ki neighbors. With a custom-defined swarm topology, Eq. 4 is written as follows [2, p. 6]:
vi = χ(vi + (1/Ki) Σ_{k=1}^{Ki} U(0, φ) ⊗ (qk − xi)) (5)
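A corresponding sketch of Eq. 5, under the same illustrative list-vector assumption, averages the attraction toward the Ki neighbor best positions qk:

```python
import random

def update_velocity_local(v, x, neighbor_bests, phi=2.05, chi=0.7298):
    """Eq. 5 sketch: v <- chi*(v + (1/K_i) * sum_k U(0,phi) (*) (q_k - x))."""
    n, K = len(x), len(neighbor_bests)
    acc = [0.0] * n
    for q in neighbor_bests:               # attraction from each neighbor best q_k
        r = [random.uniform(0.0, phi) for _ in range(n)]
        for j in range(n):
            acc[j] += r[j] * (q[j] - x[j])
    return [chi * (v[j] + acc[j] / K) for j in range(n)]
```

If all neighbor bests coincide with the particle's own position, the averaged attraction vanishes and only the damped previous velocity χv remains.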
the exploitation aspect aims to motivate PSO to converge as fast as possible. Besides, the
exploitation property can help PSO converge more accurately, regardless of local or global
optimizer. These two aspects are equally important. Consequently, the two
problems corresponding to exploration and exploitation are the premature problem and
the dynamic problem. Solutions to the premature problem improve exploration,
and solutions to the dynamic problem improve exploitation. Inertial weight
and constriction coefficient are common solutions to the dynamic problem. Currently,
solutions to the dynamic problem often relate to tuning coefficients, which are PSO
parameters. Solutions to the premature problem relate to increasing the dynamic ability of
particles, such as:
– Dynamic topology.
– Change of fitness function.
– Adaptation includes tuning coefficients, adding particles, removing particles, and
changing particle properties.
– Diversity control.
The proposed general framework of PSO, called GPSO, aims to balance exploration
and exploitation, which solves both the premature problem and the dynamic problem.
If we focus on the fact that the attraction force issued by particle i itself is equivalent
to the attraction force from the global best position pg plus the other attraction forces
from its neighbors qk, Eq. 5 is modified as follows:

vi = χ(ωvi + U(0, φ1) ⊗ (pi − xi) + U(0, φ2) ⊗ (pg − xi)
       + (1/Ki) Σ_{k=1}^{Ki} U(0, φ) ⊗ (qk − xi)) (6)
In Eq. 6, the set of Ki neighbors does not include particle i, and so the three parameters
φ1, φ2, and φ are co-existent. The inertial weight ω is kept intact too. It is easy to recognize
that Eq. 6 is the general form of the velocity update rule. In other words, GPSO is specified
by Eq. 6, which balances the local best topology and the global best topology with the
expectation that convergence speed is improved while convergence to a local optimizer is
avoided. In other words, Eq. 6 aims to achieve both exploration and exploitation. The topology
in Eq. 1, Eq. 3, Eq. 4, and Eq. 5 is static [2, p. 6] because it is kept intact over all iterations
of PSO; that is, neighbors and neighborhoods in a static topology are established
fixedly. However, in the GPSO specified by Eq. 6, it is possible to relocate the neighbors of a
given particle at each iteration. Therefore, dynamic topology can be achieved in GPSO,
depending on the individual application. This implies that the premature problem can
be solved with GPSO so that PSO is not trapped at a local optimizer.
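Combining the global-best and local-best attractions gives a sketch of the GPSO update in Eq. 6; again this is an illustrative reading with list-based vectors, not the authors' implementation:

```python
import random

def update_velocity_gpso(v, x, p_i, p_g, neighbor_bests,
                         phi1=2.05, phi2=2.05, phi=2.05, omega=1.0, chi=0.7298):
    """Eq. 6 sketch: personal-best, global-best, and averaged neighbor
    attractions combined under one constriction coefficient chi."""
    n, K = len(x), len(neighbor_bests)
    r1 = [random.uniform(0.0, phi1) for _ in range(n)]
    r2 = [random.uniform(0.0, phi2) for _ in range(n)]
    acc = [0.0] * n
    for q in neighbor_bests:               # neighbor attractions, averaged over K_i
        r = [random.uniform(0.0, phi) for _ in range(n)]
        for j in range(n):
            acc[j] += r[j] * (q[j] - x[j])
    return [chi * (omega * v[j]
                   + r1[j] * (p_i[j] - x[j])
                   + r2[j] * (p_g[j] - x[j])
                   + acc[j] / K)
            for j in range(n)]
```

Setting φ = 0 (or an empty attraction from neighbors) recovers the global-best rule of Eq. 4, while dropping the p_i and p_g terms recovers the local-best rule of Eq. 5, which is exactly the balance Eq. 6 is designed to express.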
In PSO theory, solutions to the dynamic problem improve exploitation so that
PSO can converge as fast as possible. Inertial weight and constriction coefficient are
common solutions to the dynamic problem; hence, GPSO supports tuning coefficients.
Concretely, the constriction coefficient is tuned in GPSO. However, tuning a parameter
does not mean that the parameter is simply modified at each iteration, because the
modification must be solid and based on valuable knowledge. Fortunately, James Kennedy
and Russell C. Eberhart [2, p. 13], [3, p. 3], [4, p. 51] proposed bare bones PSO (BBPSO),
in which they asserted that, given xi = (xi1, xi2, …, xin)T, pi = (pi1, pi2, …, pin)T, and pg =
(pg1, pg2, …, pgn)T, the jth element xij of xi follows a normal distribution with mean
(pij + pgj)/2 and variance (pij − pgj)². Based on this valuable knowledge, we tune the
constriction parameter χ with the normal distribution at each iteration.
Let zi = (zi1, zi2, …, zin)T be a random vector corresponding to each position xi of particle
i. Every jth element zij of zi is randomized according to the normal distribution N(μi, σi²)
with mean μi = (pij + pgj)/2 and variance σi² = (pij − pgj)²:

zij ~ N((pij + pgj)/2, (pij − pgj)²) (7)

where

N(μi = (pij + pgj)/2, σi² = (pij − pgj)²) = (1/√(2πσi²)) exp(−(1/2)(zij − μi)²/σi²)

Note, N denotes the normal distribution. Let g(zij) be the pseudo probability density
function of zij:

g(zij) = exp(−(1/2)(zij − μi)²/σi²) = exp(−(1/2)(zij − (pij + pgj)/2)²/(pij − pgj)²) (8)

Of course, we have:

g(zij) ~ N(μi = (pij + pgj)/2, σi² = (pij − pgj)²)
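The sampling of Eq. 7 and the pseudo density of Eq. 8 can be sketched directly. Since g drops the normalizing constant, it peaks at 1 at the mean; the sketch assumes pij ≠ pgj so the variance is non-zero (function names are illustrative):

```python
import math
import random

def sample_z(p_i, p_g):
    """Eq. 7: z_ij ~ N((p_ij + p_gj)/2, (p_ij - p_gj)^2).
    random.gauss takes the standard deviation, hence abs(p_ij - p_gj)."""
    return [random.gauss((pij + pgj) / 2.0, abs(pij - pgj))
            for pij, pgj in zip(p_i, p_g)]

def g(z_ij, pij, pgj):
    """Eq. 8: unnormalized ("pseudo") normal density, equal to 1 at the mean."""
    mu = (pij + pgj) / 2.0
    var = (pij - pgj) ** 2
    return math.exp(-0.5 * (z_ij - mu) ** 2 / var)
```

Because g(zij) lies in (0, 1], it is directly usable as a multiplicative damping factor, which matches the idea of a probabilistic constriction coefficient derived from BBPSO knowledge.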
aims at exploration for converging to the global optimizer. The farther the local best
position pi is from the global best position pg, the less dynamic the position xi is, which
aims at exploitation for fast convergence. This is the purpose of adding the probabilistic
constriction coefficient X to Eq. 6 for solving the dynamic problem. As a convention, the GPSO
specified by Eq. 10 is called probabilistic GPSO. Source code of GPSO and probabilistic GPSO
is available at https://github.com/ngphloc/ai/tree/main/3_implementation/src/net/ea/pso.
The lower bound and upper bound of positions in the initialization stage are lb = (–10,
–10)^T and ub = (10, 10)^T. The termination condition is that the bias between the current
global best value and the previous global best value is less than ε = 0.01. Parameters of
GPSO are φ1 = φ2 = φ = 2.05, ω = 1, and χ = 0.7298. Parameters of probabilistic GPSO are
φ1 = φ2 = φ = 2.05. Parameters of basic PSO are φ1 = φ2 = 2.05 and χ = 0.7298. The
swarm size is 50. For the three PSO variants, a dynamic topology is established at each
iteration by a so-called fitness distance ratio (FDR). Specifically, Peram et al. [2, p. 8]
defined the topology dynamically at each iteration by FDR. Given a target particle i and
another particle j, their FDR is the ratio of the difference between f(xi) and f(xj) to
the Euclidean distance between xi and xj.
    FDR(xi, xj) = (f(xi) − f(xj)) / ||xi − xj||                              (12)
Given the target particle i, if FDR(xi, xj) is larger than a threshold (>1), particle j
is a neighbor of the target particle i. Alternatively, the top K particles whose FDRs
with xi are largest are the K neighbors of particle i.
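The FDR-based dynamic neighborhood (Eq. 12 together with the top-K rule) can be sketched as follows; this is our own Python illustration, with function names of our choosing:

```python
import numpy as np

def fdr(f_i, f_j, x_i, x_j):
    """Fitness distance ratio of Eq. 12: fitness difference divided by
    the Euclidean distance between the two positions."""
    return (f_i - f_j) / np.linalg.norm(np.asarray(x_i) - np.asarray(x_j))

def top_k_neighbors(i, positions, fitnesses, k):
    """Top-K rule: the K particles with the largest FDR relative to the
    target particle i become its neighbors for this iteration."""
    ratios = sorted(
        ((fdr(fitnesses[i], fitnesses[j], positions[i], positions[j]), j)
         for j in range(len(positions)) if j != i),
        reverse=True,
    )
    return [j for _, j in ratios[:k]]

# toy swarm: particle 1 offers the best fitness improvement per unit distance
positions = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
fitnesses = [3.0, 1.0, 2.0]
neighbors = top_k_neighbors(0, positions, fitnesses, k=1)
```

Recomputing the neighborhood at every iteration is what makes the topology dynamic: a particle's informants change as the swarm moves.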
From the experiment, basic PSO, GPSO, and probabilistic GPSO converge to best
values –0.9842, –0.9973, and –0.9999 with global best positions (3.0421, 3.1151)T ,
(3.1837, 3.1352)T , and (3.1464, 3.1485)T after 6, 18, and 18 iterations, respectively.
The true best value of the target function specified by Eq. 11 is −1 whereas the true
global optimizer is x* = (3.1416, 3.1416)T . Therefore, the biases in best values (fitness
biases) of basic PSO, GPSO, and probabilistic GPSO are 0.0158, 0.0027, and 0.0001,
respectively and the biases in best positions (optimizer biases) of basic PSO, GPSO, and
probabilistic GPSO are (0.0995, 0.0265)T , (0.0421, 0.0064)T , and (0.0048, 0.0069)T ,
respectively.
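The reported biases follow directly from these numbers; a small arithmetic check (our own sketch, reproducing the figures quoted above):

```python
# Reported converged values versus the true optimum of Eq. 11
true_value, true_opt = -1.0, (3.1416, 3.1416)
results = {
    "basic PSO":          (-0.9842, (3.0421, 3.1151)),
    "GPSO":               (-0.9973, (3.1837, 3.1352)),
    "probabilistic GPSO": (-0.9999, (3.1464, 3.1485)),
}
for name, (best, pos) in results.items():
    fitness_bias = round(abs(best - true_value), 4)
    optimizer_bias = tuple(round(abs(p - t), 4) for p, t in zip(pos, true_opt))
    print(name, fitness_bias, optimizer_bias)
```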
A General Framework of Particle Swarm Optimization 315
From Table 2, the fitness bias and optimizer bias of probabilistic GPSO are smallest.
Therefore, probabilistic GPSO is the preeminent one. Basic PSO converges soonest, after
6 iterations, but it suffers from the premature problem, as indicated by its lowest
converged fitness value, whereas both GPSO and probabilistic GPSO solve the premature
problem with better converged fitness values (−0.9973 and −0.9999) but require more
iterations (18). The reason that GPSO is better than basic PSO is the combination of the
local best topology and the global best topology in GPSO. The fact that probabilistic
GPSO is better than GPSO proves that the probabilistic constriction coefficient can
solve the dynamic problem. Regarding fitness bias, probabilistic GPSO is 27 times better
than normal GPSO, which implies that the exploitation is as important as the exploration.
In some situations where there are many local optimizers, reaching a good enough local
optimizer can be acceptable and more feasible than reaching the global optimizer absolutely.
Practical PSO attracts researchers' attention because it solves the complexity problem
of pure mathematics in global optimization, which gets stuck in how to find the global
optimizer assuredly. Therefore, the fact that probabilistic GPSO improves convergence
speed is meaningful. Moreover, it does not restrict the dynamics of particles. Indeed,
it keeps the dynamics of particles towards optimal trends with the support of a
probabilistic distribution. Thus, it also balances the two PSO properties, exploration
and exploitation.
4 Conclusions
The first purpose of GPSO, to aggregate important parameters and to generalize important
variants, is completed with the general form of the velocity update rule. The second
purpose, to balance the two PSO properties of exploration and exploitation, is reached
at a moderate rate, although experimental results showed that GPSO and probabilistic
GPSO are better than basic PSO due to the combination of local best topology and global
best topology along with the definition of the probabilistic constriction coefficient,
which proved the improvement of global convergence. The reason for the balance at a
moderate rate is that the dynamic topology in GPSO is supported only indirectly via the
general form of the velocity update rule, which is impractical because researchers must
modify the source code of GPSO in order to define a dynamic topology. Moreover, the
premature problem is solved by many other solutions such as dynamic topology, change of
fitness function, adaptation (tuning coefficients, adding particles, removing particles,
changing particle properties), and diversity control over iterations. In future work, we
will implement dynamic solutions with the support of other evolutionary algorithms like
the artificial bee colony algorithm and the genetic algorithm. Moreover, we will research
how to apply PSO to training neural networks.
316 L. Nguyen et al.
References
1. Wikipedia: Particle swarm optimization. (Wikimedia Foundation), 7 March 2017. https://en.
wikipedia.org/wiki/Particle_swarm_optimization. Accessed 8 Apr 2017
2. Poli, R., Kennedy, J., Blackwell, T.: Particle swarm optimization. In: Dorigo, M. (ed.) Swarm
Intelligence, vol. 1, no. 1, pp. 33–57, June 2007. https://doi.org/10.1007/s11721-007-0002-0
3. Pan, F., Hu, X., Eberhart, R., Chen, Y.: An analysis of bare bones particle swarm. In: IEEE
Swarm Intelligence Symposium 2008 (SIS 2008), St. Louis, MO, US, pp. 1–5. IEEE, 21
September 2008. https://doi.org/10.1109/SIS.2008.4668301
4. al-Rifaie, M.M., Blackwell, T.: Bare bones particle swarms with jumps. In: Dorigo, M., et al.
(eds.) ANTS 2012. LNCS, vol. 7461, pp. 49–60. Springer, Heidelberg (2012). https://doi.org/
10.1007/978-3-642-32650-9_5
5. Sharma, K., Chhamunya, V., Gupta, P.C., Sharma, H., Bansal, J.C.: Fitness based particle
swarm optimization. Int. J. Syst. Assur. Eng. Manag. 6(3), 319–329 (2015). https://doi.org/10.
1007/s13198-015-0372-4
How Artificial Intelligence and Videogames Drive Each Other Forward
1 Introduction
Some level of artificial intelligence has been an integral feature of video games
since the earliest days. Pong, which is commonly perceived as the first game
ever made (although it is not), featured what could be interpreted as an intel-
ligent opponent. Early Mario games had enemies with unique characteristics.
The green koopas would walk forward until they hit a wall or fell off a cliff,
whereas red koopas would turn away from cliffs. While not a real intelligence,
the red koopas were perceived as more intelligent. The perception or illusion
of intelligence has become a staple characteristic of video games. Modern video
games have become more complex, requiring an equally more complex illusion of
intelligence. In games such as the Fable franchise, non-player characters perceive
the player as good or evil based on their deeds. In Elder Scrolls V: Skyrim, non-
player characters discuss the player’s exploits, before they inevitably go on about
their day. While these tricks have been good enough to satisfy gamers thus far,
how long will a simple illusion of intelligence be satisfactory in games? Artificial
intelligence has already become a common topic of discussion, and while game AI may not
be on the same level as artificial intelligence research, the gap is closing. More games
feature some level of AI implementation. Players and researchers are
creating artificial intelligence to play games. Studies are preparing for artificial
intelligence to take part in the production process.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 317–327, 2023. https://doi.org/10.1007/978-3-031-18461-1_21
318 N. Fawcett and L. Ngalamoum
Video games and artificial
intelligence even help other fields of study. These combinations of subjects are
used in education and healthcare. Simulations are used for training purposes in
a variety of fields, when failure would be too much of a risk, including military
and aerospace.
2 The Problem
1. How video games and AI are used in conjunction for competitions and education.
2. How AI can be used in game development as non-player characters.
3. How AI can be used at various stages in the production of games.
4. How AI and video games can function together to create tools in different fields.
5. How AI can benefit from being used in video games.
This study proposes that video games and AI are a combination that will
continue to benefit each other as well as other fields.
is already a feature in games like Left 4 Dead and Metal Gear Solid 5. Dynamic
difficulty and player analysis would make it possible for the player to receive tips
and strategies to improve at a game. A game director AI would additionally be
able to provide more dynamic responses to negative player actions and redirect
them towards a more productive activity, instead of simply relying on a sweeping
ban; a point made by M. O. Riedl and A. Zook [12].
M. Guzdial, B. Li, and M. O. Riedl [7] created a first-of-its-kind artificial
intelligence that replicates a game engine, or predicts the game's backend rules. To
accomplish this task, the model is provided with a spritesheet, a 2D image of artwork
used in the game, and videos of gameplay. In the case of this project, the
team chose to use a classic: Super Mario Bros (shown in Fig. 2).
The project had some shortcomings, as it could not make conclusions regard-
ing level transitions and losing a life. Because automated understanding is a new
field, flaws are to be expected, but as this technology matures, implementation in
game development should follow. Once significantly improved, automated
understanding would allow game developers to spend less time writing
backend rules. As an added benefit, losing a game engine would be a thing of
the past. M. J. Nelson and M. Mateas [9] detailed a prototype that would cre-
ate micro games with user-requested themes, similar to those of WarioWare, a
popular Nintendo game franchise that consists of many minigames that have the
player do short tasks, including dodging something or filling a gauge. This area
of research has significant room for growth. While neither study mentioned
attempting to interact with an advanced gameplay engine, it can be inferred that
significant strides will lead to more complex engines.
6.4 AI in Art
Artificial intelligence does not currently have the most profound position in art.
Still, strides are being made to generate 3D models. These models are normally
generated from a static image of a real-life object; however, work is being done to
improve them. The author in [5] demonstrates an extension to existing software that
uses a variety of images to improve the 3D model. The method in [6] demonstrates
high-quality reconstruction of facial features from a still partial image of a face.
These technologies do not illustrate artistic ability in a natural sense; however,
advancement in these fields can lead to generating environments and character models
for use in video games.
Video games incorporated with artificial intelligence can be used to create
personalized rehabilitation programs for patients. S. Sadeghi Esfahlani, J. Butt, and
H. Shirvani [4] used a combination of an armband, a Kinect, and foot pedals to
create an engaging fruit-grabbing game that serves to provide patients with
extended program duration and to increase the number of movements performed by
the patients per session. The implementation of video games and AI would serve
to provide individuals requiring rehabilitation sessions with a method that is
Artificial Intelligence and Videogames 325
engaging, personalized, and potentially both cheaper and performed from home.
The video game aspect keeps the patients engaged, and the AI aspect evolves
the workouts, analyzes the progress, and continues to challenge the patient until
they are deemed to have completed the treatment.
Game development is not the only field that is benefited by AI. The reverse
can be true as well. While game AI is an essential part of game development,
the utilization of AI in video games can additionally be applicable to real life
AI. Lample and Chaplot [15] augmented a deep recurrent Q-network (DRQN)
with game information in order to implement an AI agent that can operate
in a partially observable scenario. Deep reinforcement learning has done well
in 2D games, but this is considered a fully observable scenario. Lample and
Chaplot applied deep learning methods to a copy of the Doom game engine,
where it outperformed human players. The agent’s tasks were separated into
two categories: navigation and action. Lample and Chaplot make a point that
the former is applicable to robotics, as deciphering a 3D environment from a
partially observable perspective is similarly a challenge in robotics.
Players will replay the same games to see how different actions interact with and
modify the AI agents in the game.
Game directors and long-term AI agents can help to improve a player's skill at
a game. Players could participate at a competitive level that they normally could
not, thanks to the training provided by AI agents assisting in their gameplay.
The presence of AI in video games is increasing all the time in all forms.
AI player competitions will continue not just to bring people into the fusion of
high level AI and video games, but to indicate that AI presence in video games
is an ever growing milestone that needs to be further acknowledged by game
developers. Games that have a greater perception of intelligence and immersion
are generally both well received and anticipated. Moving forward, more exper-
imental games with high functioning AI will begin getting published, and with
that popularity, larger companies will begin incorporating these features. Similar
to how a game feature or genre has a wave of popularity. This is tracaeble to
the introduction of quick time events, or the popular that was gained in horror
games after the release of Slenderman.
AI in production is a feature that will continue to gain steam. Anyone who has
dabbled in developing a game can see that popular game engines like
Unity and Unreal have incorporated numerous tools that have already started
automating some of the processes of game development. Like every industry,
automation will inevitably take a significant hold. It is not a reach to assume
that Unity and Unreal will begin featuring more automation. Automated testing
will likely become a feature included in popular game engines, as it is becoming
an essential part of game testing for some game development studios.
References
1. Kim, M., Kim, K., Kim, S., Dey, A.K.: Performance evaluation gaps in a real-
time strategy game between human and artificial intelligence players. IEEE
Access 6, 13575–13586 (2018). https://ieeexplore.ieee.org/stamp/stamp.jsp?tp&
arnumber=8276283&isnumber=8274985. Accessed 28 Jun 2022. https://doi.org/
10.1109/ACCESS.2018.2800016
2. Ram, A., Ontañón, S., Mehta, M.: Artificial intelligence for adaptive computer
games. In: Twentieth International FLAIRS Conference on Artificial Intelligence,
7–9 May 2007, Key West, FL. Palo Alto, CA, AAAI Press (2007). https://www.
aaai.org/Papers/FLAIRS/2007/Flairs07-007.pdf. Accessed 28 Jun 2022
3. Ponsen, M., Muñoz-Avila, H., Spronck, P., Aha, D.W.: Automatically acquiring
domain knowledge for adaptive game AI using evolutionary learning. In: The
Seventeenth Annual Conference on Innovative Applications of Artificial Intelli-
gence: IAAI-05, 9–13 July 2005, Pittsburg, PA. Palo Alto, CA, AAAI Press (2005).
https://www.aaai.org/Papers/IAAI/2005/IAAI05-012.pdf. Accessed 28 Jun 2022
4. Sadeghi Esfahlani, S., Butt, J., Shirvani, H.: Fusion of Artificial Intelli-
gence in neuro-rehabilitation video games. IEEE Access 7, 102617–102627
(2019). https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8752216&
isnumber=8600701. Accessed 28 Jun 2022. https://doi.org/10.1109/ACCESS.
2019.2926118
Bezier Curve-Based Shape Knowledge Acquisition and Fusion
Peng An1(B) , Wenbin Ye1 , Zizhao Wang1 , Hua Xiao2 , Yongsong Long2 , and Jia Hao1
1 Beijing Institute of Technology, Beijing 100089, China
[email protected], [email protected]
2 Jiang Nan Design & Research Institute of Machinery & Electricity, Guiyang 550009, China
1 Introduction
Surrogate models, as an effective optimization technique, are widely used in engineering
design fields such as materials and aerospace. The core idea is to use data-driven models
to replace the time-consuming and high-cost physical simulation engines [1–4]. How-
ever, in the design of complex equipment, because of multidisciplinary coupling and
high-dimensional design variables, it is difficult to build high-precision surrogate mod-
els with limited data. There are two main ideas to solve this problem. On the one hand,
some scholars proposed methods such as Transfer Learning [5–7] and Data Augmenta-
tion [8, 9] to increase the amount of data to improve the accuracy; On the other hand,
some scholars proposed that increasing the constraints of the model itself can improve
the accuracy. Fortunately, experts have accumulated a lot of design experience in the
design process for years and have a deep understanding of how design variables affect
product performance, which gives great potential to integrating design knowledge
[4, 10–14] into surrogate models. However, design knowledge has the characteristics of
multiple types, large quantities and heterogeneous representation. Not all knowledge can
be integrated into surrogate models, and there is a lack of a unified approach to
integration. Therefore, it is very important to carry out research on the acquisition,
modeling and fusion methods of design knowledge for the construction of surrogate models.
Research on knowledge has been conducted for many years. In the 1960s, scholars began
to study the acquisition and integration of design knowledge. Booker et al. [15] studied
the way of knowledge extraction and the application of fuzzy theory in expert
information; Keeney [16] discussed the knowledge extraction method and process based on the
use of knowledge in nuclear reaction systems and determined the knowledge acquisi-
tion process. Gruber [17] sorted out the three principles that must be followed in the
process of knowledge acquisition, and gradually developed the main knowledge acqui-
sition technologies such as the Interview method [18], Observation method [19] and the
Knowledge Graph [20]. The probabilistic representation of knowledge information is
an important method of knowledge fusion. John [21] summarized the method of converting
statistical feature information given by experts into probability distributions and
discussed the basic steps of fusing multiple pieces of knowledge to form a probability
distribution; Marcello [22] predicted the probability distribution of knowledge fusion
information by introducing Steiner points of the knowledge distribution among experts.
However, when dealing with complex equipment design problems, design knowledge is both
numerous and miscellaneous: numerous means that the number of design experts is large,
which makes the amount of knowledge large; miscellaneous means that the understanding of
knowledge is biased due to the subjective cognitive differences of experts. This makes it
difficult to effectively apply the acquisition and representation methods of structured
knowledge to shape knowledge. Furthermore, it is very important to consider that experts'
subjective cognition is not always consistent with curve shape recognition.
Therefore, how to identify the knowledge with large deviation among the numerous pieces
of design knowledge and how to find the consensus information of the knowledge to the
greatest extent are the core problems of multi-knowledge fusion in the engineering field.
For the problems above, this paper proposes a shape knowledge acquisition and fusion
technology for surrogate model construction, including knowledge acquisition based on
the Bezier curve, knowledge filtering based on a Hausdorff Distance index, and knowledge
fusion based on Fermat Points for Finite Point Sets. The acquisition of knowledge is
realized based on the Bezier curve, and then the Hausdorff Distance and Fermat Points
for Finite Point Sets are used to reduce the amount of knowledge and effectively
represent the information of the knowledge.
The rest of the paper is structured as follows: in Sect. 2, the proposed method is
explained in detail. In Sect. 3, several experiments are conducted to verify the proposed
method. A discussion based on empirical results is presented in Sect. 4, and this work
is summarized in Sect. 5.
2 Method
Product design knowledge has the characteristics of multiple types, large quantities and
heterogeneous representation. Not all knowledge can be integrated into the surrogate model
to improve the accuracy. Our previous work [12] defined design knowledge as the map-
ping relationship between design variables and key performances from the perspective of
knowledge-assisted surrogate models construction, and divided design knowledge into
types of monotonic and shape (Table 1). Among them, monotonic knowledge describes
the monotonic relationship between performance variables and design variables; shape
knowledge is based on monotonic knowledge, and the variables satisfy a more complex
curve relationship (such as a parabola). This paper mainly studies the shape knowledge.
The Bezier curve has the advantage of accurately drawing curves, so this paper develops
an interactive tool based on its principle to acquire knowledge (step 1). Due to the
large number of experts involved in the design of complex equipment, there is a large
amount of knowledge with large cognitive bias, which means the knowledge needs to be
filtered. This paper proposes a design knowledge filtering technology based on the
average of Hausdorff Distance (step 2). After the knowledge is filtered by the indicator,
the amount of retained knowledge is still large. It is very important to find the maximum
consensus of knowledge among multiple experts. This paper proposes a design knowledge
fusion method based on Fermat Points for Finite Point Sets (step 3) to get the final
fusion knowledge curve. The overall technical process is shown in Fig. 1.
    B(t) = Σ (i = 0 to n) Cni · (1 − t)^(n−i) · t^i · Pi, 0 ≤ t ≤ 1          (1)

where B(t) is the position of discrete points at time t (the point on the Bezier curve
moves from the start point to the end point while t changes from 0 to 1), Pi is the
position of the control points, and Cni is the binomial coefficient of the fitting curve.
When using a set of B(t) to define the shape knowledge, the shape knowledge can be
defined by several control points. In this paper, an interactive knowledge acquisition
tool directly operated by experts was developed using this characteristic. Fig. 2 shows
the knowledge acquisition tool interface. When dragging control point 2 from the position
of Fig. 2(a) to that of Fig. 2(b), the fitting curve changes from Fig. 2(a) to (b);
dragging control point 4 in turn forms Fig. 2(c), and dragging point 6 forms the fitting
curve of Fig. 2(d).
Experts can then define the knowledge in mind and check the fitting curve in real time
through the Bezier curve formula by increasing or decreasing the number of control
points and dragging them to appropriate locations. Therefore, the shape knowledge can
be represented by a set of point definitions.
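Assuming Eq. 1 is the standard Bernstein form of the Bezier curve, the discretization of a knowledge curve into a point set can be sketched in Python as follows (function names are ours, not the paper's tool):

```python
import math

def bezier_point(control_points, t):
    """Evaluate B(t) = sum_i C(n, i) * (1 - t)^(n - i) * t^i * P_i,
    the Bernstein form of the Bezier curve over its control points."""
    n = len(control_points) - 1
    x = y = 0.0
    for i, (px, py) in enumerate(control_points):
        b = math.comb(n, i) * (1 - t) ** (n - i) * t ** i
        x += b * px
        y += b * py
    return (x, y)

def bezier_curve(control_points, samples=100):
    """Discretize the curve into `samples` points as t runs from 0 to 1;
    this point set is how a piece of shape knowledge is stored."""
    return [bezier_point(control_points, k / (samples - 1)) for k in range(samples)]

# a quadratic (3-control-point) curve: the curve's endpoints coincide with
# the first and last control points
curve = bezier_curve([(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)], samples=5)
```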
Hausdorff Distance
Given two finite point sets A = {a1, a2, . . . , ap} and B = {b1, b2, . . . , bq} (Fig. 3):

    H(A, B) = max(h(A, B), h(B, A))
    h(A, B) = max over a ∈ A of (min over b ∈ B of ||a − b||)

where ||a − b|| is the Euclidean distance between a and b. The function h(A, B) is called
the directed Hausdorff Distance from A to B, which identifies the point a ∈ A that is
farthest from any point of B and measures the distance from a to its nearest neighbor
in B.
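A direct transcription of h(A, B) and H(A, B) into Python (our own sketch):

```python
import math

def directed_hausdorff(A, B):
    """h(A, B): for each a in A take the distance to its nearest neighbor
    in B, then keep the largest of those nearest-neighbor distances."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """Symmetric Hausdorff Distance H(A, B) = max(h(A, B), h(B, A))."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (3.0, 0.0)]
# h(A, B) = 1.0 (point (1,0) -> (0,0)) and h(B, A) = 2.0 (point (3,0) -> (1,0)),
# so H(A, B) = 2.0
```

The asymmetry of h is why both directions must be taken: two curves can be close in one direction but not the other.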
Measure Index
The similarity between any two pieces of knowledge can be characterized by the Hausdorff
Distance between their knowledge curves. Therefore, the overall similarity level for a
single piece of knowledge can be defined by the average Hausdorff Distance of its curve
with all the remaining curves (Eq. 5):

    Similarity_Expi = (1 / (n − 1)) Σ (j = 1 to n − 1) H(Expi, Expj)         (5)

where n is the number of acquired pieces of knowledge and H(Expi, Expj) is the Hausdorff
Distance between the knowledge curves Expi and Expj.
Filtering Method
After the measurement process of the above steps, each piece of knowledge has an
indicator that characterizes its overall similarity value. The filtering of knowledge
revolves around these values, and the knowledge with large values is filtered out. The
Three Sigma principle (3σ) is a commonly used error judgment principle, where σ
represents the standard deviation and μ represents the mean; Fig. 4 shows the 3σ
principle:
The probability of a value falling in (μ − 1σ, μ + 1σ) is 0.6826; the probability of
falling in (μ − 2σ, μ + 2σ) is 0.9544; the probability of falling in (μ − 3σ, μ + 3σ)
is 0.9974.
It can be considered that the values are almost all concentrated in (μ − 3σ, μ + 3σ);
the probability of falling beyond this range is less than 0.3%, and such values should
be eliminated as abnormal. This paper proposes a filtering method based on the 3σ
principle, and the main process is divided into two steps:
Step 1: For the overall similarity of knowledge Similarity_Expi, calculate the mean μ
and the standard deviation σ, and then get a probability distribution N(μ, σ).
Step 2: Calculate the difference between the overall similarity value of each piece of
knowledge and the mean, and remove knowledge greater than 3 times the standard
deviation (Similarity_Expi − μ > 3σ).
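Eq. 5 and the two filtering steps can be sketched together; this is our own Python illustration, where the toy 1-D "curves" and the absolute-difference stand-in for H are assumptions chosen purely for demonstration:

```python
import statistics

def similarity_scores(curves, distance):
    """Eq. 5: the score of curve i is the mean pairwise distance
    H(Exp_i, Exp_j) to every other acquired knowledge curve."""
    n = len(curves)
    return [
        sum(distance(curves[i], curves[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

def filter_3sigma(curves, scores):
    """Step 2 of the filtering method: keep only curves whose score does
    not exceed the mean by more than three standard deviations."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    return [c for c, s in zip(curves, scores) if s - mu <= 3 * sigma]

# toy demonstration: ten agreeing "curves" and one strong outlier
curves = [0.0] * 10 + [100.0]
scores = similarity_scores(curves, lambda a, b: abs(a - b))
kept = filter_3sigma(curves, scores)   # the outlier is filtered out
```

In practice `distance` would be the Hausdorff Distance between two point sets rather than a scalar difference; the filtering logic is unchanged.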
2.3 Fusion Knowledge Based on Fermat Points for Finite Point Sets
After all knowledge is measured and filtered by the measure index and the 3σ principle,
the abnormal knowledge is eliminated, but the amount of retained knowledge is still
large. It is important to find a fusion curve Expcombine from the set of knowledge
curves Expi, which can not only reduce the amount of knowledge but also preserve its
effective information. Since the design knowledge gained in this paper is derived from a
series of points fitted by the Bezier formula (Eq. 1), this paper proposes a knowledge
fusion method based on Fermat Points for Finite Point Sets, which makes the average
distance between the points on the fusion curve and the filtered knowledge curves the
shortest and approximates the shape of the curves by making the point coordinates as
close as possible. The main steps are as follows:
Step 1: Divide the Filtered Knowledge Curves into Several Sets of Points
The number of points on a knowledge curve is uniformly determined by the parameter t
according to Eq. 1. We divide the points on the filtered knowledge curves according to
the value of each t to form a data set Pointsi, which has m coordinate points from the
m knowledge curves. Finally, we obtain several data sets Pointsfermat =
[Points1, Points2, . . . , Pointst] with the number of t samples.
Step 2: Find the Fermat Point of Each Point Set
The Fermat point of a point set minimizes the summed Euclidean distances:

    f(x, y) = Σ (i = 1 to n) sqrt((x − xi)^2 + (y − yi)^2)                   (6)

where x and y are the targets we want, and xi and yi are the coordinates of the point
on the i-th knowledge curve.
Derivation of the objective function
The equation shows that f(x, y) is a convex function, and the zero point of its first
derivative is the solution. The problem is converted to finding the Fermat Point_j(x, y)
that makes the derivative of f(x, y) equal to 0 (Eq. 7); the y and x expressions are
consistent.

    ∂f(x, y)/∂x = Σ (i = 1 to n) (x − xi) / sqrt((x − xi)^2 + (y − yi)^2) = 0   (7)
Iterative calculation
We define a function g(x, y) from the zero condition of the derivative, so that the
solution satisfies the fixed-point equation (x, y) = g(x, y); the solution is then
obtained by repeated iteration, (x, y) = g(. . . g(g(x, y))).
The expression of the finite point set Fermat Point_j(x, y) is:

    x = [Σ (i = 1 to n) xi / sqrt((x − xi)^2 + (y − yi)^2)] / [Σ (i = 1 to n) 1 / sqrt((x − xi)^2 + (y − yi)^2)]
    y = [Σ (i = 1 to n) yi / sqrt((x − xi)^2 + (y − yi)^2)] / [Σ (i = 1 to n) 1 / sqrt((x − xi)^2 + (y − yi)^2)]   (8)
Step 3: Combine the Fermat Points Point_j(x, y) to Form the Fusion Curve Expcombine
The t Fermat points fitted from the t finite point sets constitute the final fusion
knowledge curve Expcombine = [Point_1, Point_2, . . . , Point_t].
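The fixed-point iteration of Eq. 8 is essentially the Weiszfeld algorithm for the geometric median; a minimal sketch (our own implementation, with a simple guard for the case where the iterate lands exactly on a data point):

```python
import math

def fermat_point(points, iters=200):
    """Weiszfeld-style fixed-point iteration of Eq. 8: repeatedly replace
    (x, y) by the inverse-distance-weighted average of the input points,
    which converges to the point minimizing the summed distances (Eq. 6)."""
    x = sum(p[0] for p in points) / len(points)   # start from the centroid
    y = sum(p[1] for p in points) / len(points)
    for _ in range(iters):
        num_x = num_y = den = 0.0
        for px, py in points:
            d = math.hypot(x - px, y - py)
            if d == 0.0:              # guard: iterate coincides with a data point
                return (px, py)
            num_x += px / d
            num_y += py / d
            den += 1.0 / d
        x, y = num_x / den, num_y / den
    return (x, y)

# for a symmetric point set the Fermat point is the center of symmetry
center = fermat_point([(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)])
```

Running this once per value of t and collecting the results yields the fusion curve Expcombine of Step 3.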
3 Experiments
The proposed method is tested with two benchmark functions and one engineering problem
to verify whether the method can reduce the amount of knowledge while ensuring the
effective information of the knowledge. The knowledge for the benchmark functions is
obtained by deriving the test functions and is acquired by an experimenter with the
interactive knowledge acquisition tool. The knowledge of the engineering problem is the
subjective judgment obtained by asking the designer. This section details these
experimental cases and the experimental design.
Fig. 5 shows the shape knowledge, which is the relationship between f(x) and x1 when
x2 = 0.5 and x1 is set to [−10, 10]. The curve shape is simple, but the curve
coordinates cover a wide range.
Function Branin
The Branin function is also used as a two-dimensional test function, and its expression
is:

    f(x) = a(x2 − b·x1^2 + c·x1 − r)^2 + s(1 − t)cos(x1) + s                 (10)

Fig. 6 shows the shape knowledge, which is the relationship between f(x) and x1 when
x2 = 5 and x1 is set to [−5, 10]. The shape of the curve is complex and the coordinates
of the curve cover a wide range.
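For reference, both benchmark functions can be written down directly. The Branin parameter values below (a = 1, b = 5.1/4π², c = 5/π, r = 6, s = 10, t = 1/8π) are the commonly recommended ones, assumed here because the excerpt does not list them; the Matyas expression is the standard benchmark definition, since its Eq. 9 is not reproduced in this excerpt:

```python
import math

def matyas(x1, x2):
    """Standard Matyas function; global minimum 0 at (0, 0)."""
    return 0.26 * (x1 ** 2 + x2 ** 2) - 0.48 * x1 * x2

def branin(x1, x2, a=1.0, b=5.1 / (4 * math.pi ** 2), c=5.0 / math.pi,
           r=6.0, s=10.0, t=1.0 / (8 * math.pi)):
    """Branin function of Eq. 10:
    f(x) = a*(x2 - b*x1^2 + c*x1 - r)^2 + s*(1 - t)*cos(x1) + s."""
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1 - t) * math.cos(x1) + s

# the shape-knowledge slices used in the experiments: f as a function of x1
# with x2 held fixed
matyas_slice = [matyas(x1, 0.5) for x1 in range(-10, 11)]   # x2 = 0.5
branin_slice = [branin(x1, 5.0) for x1 in range(-5, 11)]    # x2 = 5
```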
Fig. 8. Shape knowledge of maximum stress performance with thickness of the shell in UVT.
where μi_predict is the predicted mean value of the abscissa of point_i on the baseline
for the Gaussian process model, and σi is its standard deviation; yi_baseline is the
ordinate of point_i. The penalty term operator wi_penalty indicates that the error is
additionally multiplied if the predicted value exceeds the 95% confidence interval of
the Gaussian process mean.
Step 5: Repeating Steps 2–4, we complete 3 noise experiments for each case, with 20
rounds of experiments per noise level, and calculate the average of the 20 groups of
prediction errors as the final error of each noise experiment.
As shown in Tables 2, 3, and 4, the prediction error values of the unmanned vehicle
truss body (UVT) case, the Matyas function, and the Branin function are shown in order.
Fusion knowledge based on Fermat Points for Finite Point Sets has low error under
various noise levels and various curve shape complexities, relative to the range covered
by the knowledge curve.
Table 2. The mean of error for 20 rounds in case UVT with different noise.

                        N(0,0.5)   N(0,1)    N(0,2)
Hausdorff distance      0.0128     0.0096    0.0216
No filtering            0.0098     0.0146    0.0242
Error reduction (%)     −30.61     34.25     10.74

Table 3. The mean of error for 20 rounds in case Matyas with different noise.

                        N(0,0.5)   N(0,1)    N(0,2)
Hausdorff distance      0.5188     0.7710    1.2970
No filtering            0.7151     1.0186    2.1728
Error reduction (%)     27.45      24.31     40.31

Table 4. The mean of error for 20 rounds in case Branin with different noise.

                        N(0,0.5)   N(0,1)    N(0,2)
Hausdorff distance      1.942      2.848     5.519
No filtering            2.686      4.104     8.184
Error reduction (%)     27.70      30.60     32.56
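The "Error reduction (%)" rows are consistent with (no filtering − Hausdorff) / no filtering × 100; a quick arithmetic cross-check (our own sketch of the table values):

```python
# Cross-check of the "Error reduction (%)" rows in Tables 2-4:
# reduction = (no_filtering - hausdorff) / no_filtering * 100
tables = {
    "UVT":    [(0.0128, 0.0098), (0.0096, 0.0146), (0.0216, 0.0242)],
    "Matyas": [(0.5188, 0.7151), (0.7710, 1.0186), (1.2970, 2.1728)],
    "Branin": [(1.942, 2.686), (2.848, 4.104), (5.519, 8.184)],
}
for case, rows in tables.items():
    reductions = [round((nf - hd) / nf * 100, 2) for hd, nf in rows]
    print(case, reductions)
```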
The filter index based on the mean of the Hausdorff Distance can reduce the error by at
least 10%. Although in the UVT case the indicator is not as good as retaining all
knowledge under lower noise, the errors of both methods are extremely low. Therefore,
if the coordinates of the shape knowledge cover a small range, keeping all knowledge
for fusion is still a good way.
As shown in Fig. 9, the uncertainty range of the Gaussian process prediction effectively covers the baseline at the 95% confidence level, preserving the effective information of the baseline in all cases.
340 P. An et al.
Fig. 9. Gaussian process baseline prediction plot (panels (a), (b) and (c)).
5 Conclusion
This paper proposes a shape-based design knowledge acquisition and fusion technique for surrogate model construction, addressing the lack of a technical basis for the acquisition, representation and fusion of design knowledge within the framework of surrogate model construction with fused knowledge. The type of knowledge targeted is mainly shape design knowledge represented by curve shapes. This paper develops an interactive knowledge acquisition tool for experts, which can help experts quickly acquire
Path Planning and Landing
for Unmanned Aerial Vehicles Using AI
1 Introduction
environments such as parks and open fields, there are cases where they will have
to fly indoors or in restricted environments, such as within urban or construction environments [13]. Drones' ability to fly higher and move above obstacles has made them great solutions for open-air tasks such as patrolling wide and hardly accessible areas (e.g. forests), road traffic monitoring, field spraying, etc.
However, there is still room for improvement in tasks that impose a flight height limit, as in more complex environments, such as the shipping of items within a city, or the inspection of a building under construction.
The correct positioning of the UAV, as well as the generation and continuous
update of its trajectory are crucial for its correct navigation, especially in the
case of dynamic environments, where the landscape composition is not known
in advance and obstacles may appear as the planned trajectory is executed. An integral part of this trajectory is the landing of the drone at its last segment, after the drone arrives in the predefined landing area [9]. The aforementioned tasks require the drone to frequently re-evaluate the situation (position and scene perception) and re-adjust the planned trajectory, in order to avoid obstacles and land safely.
With the vision to solve the above problems, the present study aims to find
the appropriate trajectory that safely leads a UAV from its initial position to
its final destination. Specifically, the present analysis examines the optimization
problem of obtaining the optimal trajectory for a UAV in an environment with
obstacles.
In order to validate our claim and demonstrate how the problem of
autonomous UAV navigation can be solved using a combination of well-
established path planning algorithms for simplicity and deep neural network-based approaches for scene perception and fine-grained navigation, we perform
our experiments on the virtual flight simulation environment of AirSim1 . This
virtual verification allows algorithm testing with minimum cost and complete
safety for UAVs, and provides useful feedback for the actual testing of their
navigation in the real environment.
The contributions of this study comprise:
In Sect. 2 that follows we summarize the main research areas related to the
problems that we study. Section 3 provides details on the simulation platform,
the algorithms we employ for drone navigation and the solution we used for
properly locating a clear landing site. Section 4 demonstrates the results from
the experimental evaluation of our approach, in the direction of comparing the
1 https://microsoft.github.io/AirSim/.
Path Planning and Landing for Unmanned Aerial Vehicles Using AI 345
efficiency of the two algorithms in quickly finding the shortest navigation path, following this dynamic update approach. It also presents two alternative landing scenarios. Finally, Sect. 5 provides a discussion on the results achieved so far and describes our next steps.
2 Related Work
UAVs are revolutionizing data collection and environmental exploration tasks. They enhance remote monitoring capabilities, increase efficiency and lower costs, opening up numerous applications in various areas, such as aerial photography, infrastructure inspection, search and rescue, commercial delivery, and surveillance for law enforcement [29].
Efficient perception of the UAV's surroundings, and safe and fast navigation, are critical in BVLOS operations. With respect to their autonomy, UAVs should be able to dynamically revise their path planning strategy according to the environmental constraints [22]. For this purpose, a UAV utilizes a wide range of technologies to generate informed trajectories for the exploration of unknown scenes [36], which include, as shown in Fig. 1:
Fig. 1. The Different Tasks that Relate to the Autonomous Navigation of UAVs.
– Sensing and sensor fusion: This mainly refers to the combined use of visual
(e.g. simple or stereoscopic cameras) and non-visual (e.g. LiDAR, proximity)
sensors.
– Scene perception: Signal processing and computer vision algorithms that pro-
vide perception of the surrounding environment and obstacle detection.
346 E. Politi et al.
– Map generation and path planning: They refer to techniques for scene repre-
sentation, with the use of volumetric or 2-D maps, and the continuous update
of the map or path in real time, using the output of the perception module.
– Localization: Estimating the position of the UAV itself. This can typically be treated jointly with mapping as a simultaneous localization and mapping problem.
– Path Planning: This involves the generation of a navigation path based on a
given map and the actual position of the UAV, the location of its target and
the detected obstacles.
already equipped with a LiDAR, this is facing forward and thus does not provide
proper information for detecting the landing site and positioning the UAV above
it. In addition, replacing the LiDAR with cameras is also among our objectives,
and we are currently investigating depth detection with the use of camera input.
permanently. In the opposite case, the repetitive step takes place whenever the
drone detects that the next cell of its route is not empty. The simple environment
setup shown from top view in Fig. 2 is the one used in our experiments. It consists
of a large building block and a second smaller block on top of it. The drone could rise above the two blocks and move from start to end, but we chose to test our navigation algorithm in 2-D and fly around the two blocks.
The method get_path in Algorithm 1 can be either A-star or Dijkstra's shortest-path algorithm.
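This interchangeable get_path method can be sketched as follows; the grid representation and unit move costs are illustrative assumptions, not the authors' code. With the Manhattan heuristic the search behaves as A-star; with the heuristic set to zero it reduces to Dijkstra's shortest path:

```python
import heapq

def get_path(grid, start, goal, use_heuristic=True):
    """A-star on a 2-D occupancy grid (0 = free, 1 = obstacle), 4-connected.

    With use_heuristic=False the heuristic is zero and the search
    reduces to Dijkstra's shortest-path algorithm.
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance: admissible for 4-connected, unit-cost moves
        if not use_heuristic:
            return 0
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]      # (f = g + h, g, cell)
    came_from = {start: None}
    cost = {start: 0}
    while frontier:
        _, g, cur = heapq.heappop(frontier)
        if cur == goal:
            break
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                new_cost = g + 1
                if nxt not in cost or new_cost < cost[nxt]:
                    cost[nxt] = new_cost
                    came_from[nxt] = cur
                    heapq.heappush(frontier, (new_cost + h(nxt), new_cost, nxt))
    if goal not in came_from:
        return None  # no route around the obstacles
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = came_from[node]
    return path[::-1]
```

Both variants return a shortest path of the same length; the heuristic only changes how many cells are expanded before the goal is reached.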
The last segment of a drone trajectory is the safe landing in a designated landing
area. In order to break down this task into smaller tasks and take advantage of
software solutions, we assume that the drone safely approaches the landing area,
but due to several reasons (e.g. deviations in the drone position due to wind or
other external conditions, the landing site has moved) the exact position of the
landing site is not determined.
For this purpose, the drone takes advantage of a down facing camera that
covers the landing area from a certain height, and a deep learning model for
image analysis and perception. As shown in Fig. 3, the landing site detection module finds the exact position of the landing site, navigates the drone right above it, and begins the landing. The same module keeps checking at all times whether the landing site is clear for landing. Otherwise, the drone can change altitude and try to approach the landing site again, or notify the drone operator when there is no clear solution for the situation.
Fig. 3. The Three Stages of the Landing Sequence: i) The Drone Enters the Landing Area (Left), ii) the Drone Approaches the Landing Site (Middle) and iii) Activates the Down Facing Camera in Order to Detect the Exact Position of the Landing Site (Right)
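The positioning step of this module amounts to a small geometric computation: the offset between the detected landing site and the image centre, converted from pixels to metres. A sketch under assumed conventions (a YOLO-style bounding box and a fixed pixels-to-metres scale are illustrative, not the authors' implementation):

```python
def centering_offset(bbox, image_w, image_h, metres_per_pixel):
    """Return the (dx, dy) move, in metres, that places the drone
    directly above the detected landing site.

    bbox: (x_min, y_min, x_max, y_max) in pixels, as a YOLO-style
          detector might return for the landing-site class (assumption).
    metres_per_pixel: scale derived from camera height and field of view.
    """
    box_cx = (bbox[0] + bbox[2]) / 2.0
    box_cy = (bbox[1] + bbox[3]) / 2.0
    # offset of the detection centre from the image centre, in pixels
    dx_px = box_cx - image_w / 2.0
    dy_px = box_cy - image_h / 2.0
    return dx_px * metres_per_pixel, dy_px * metres_per_pixel

# Example: a 640x480 down-facing frame with the pad detected right of centre
dx, dy = centering_offset((400, 220, 480, 300), 640, 480, 0.01)
```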
AirSim is an open-source, cross-platform simulator for drones, cars, boats, etc., which allows a connection between simulation and reality when autonomous vehicles are examined. The platform is built as an Unreal Engine plugin and supports navigation with popular autopilots, such as the PX4 open-source autopilot2, which is also used in real drones. AirSim offers high-fidelity physical and visual simulation, which allows realistic scenarios and conditions to be generated. It is thus
possible to generate large quantities of training data without cost and risk, in
order to better train and evaluate the performance of various artificial intelli-
gence and machine learning techniques for the different tasks related to UAV
autonomy [26].
The configuration of AirSim is fairly simple, and is performed by describing, in a JSON file (as shown in Fig. 4), the different types of sensors that are mounted on the vehicle, and their characteristics. The same JSON file can be used
to choose simulation environments of various complexity and load them to the
Unreal engine. Such environments range from simple block-based setups to more
complex environments comprising whole cities, like the CiThruS environment
[8]. Finally, AirSim exposes APIs that allow external code to interact with the
vehicle within the simulation environment, to collect data from its sensors as
well as the state of the vehicle and its environment.
2 https://px4.io/.
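A configuration of the kind Fig. 4 refers to can be sketched as a small script that writes the JSON settings file. The key names below follow the AirSim settings schema (SimMode, Vehicles, Sensors), but the vehicle name and the concrete sensor values are illustrative assumptions:

```python
import json

# Minimal AirSim-style settings: one multirotor with a LiDAR sensor.
# Key names follow the AirSim settings schema; values are illustrative.
settings = {
    "SettingsVersion": 1.2,
    "SimMode": "Multirotor",
    "Vehicles": {
        "Drone1": {                       # hypothetical vehicle name
            "VehicleType": "SimpleFlight",
            "Sensors": {
                "LidarFront": {           # hypothetical sensor name
                    "SensorType": 6,      # 6 = LiDAR in the AirSim schema
                    "Enabled": True,
                    "NumberOfChannels": 16,
                    "Range": 50,
                }
            },
        }
    },
}

with open("settings.json", "w") as f:
    json.dump(settings, f, indent=2)
```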
For the navigation scenario, we split the environment using a square grid of 15 × 15 cells. The environment, as shown in Fig. 5, has a length of 87 m on each side. There are four obstacles located in the environment with
known position and dimension. Two large cubic blocks were placed one on top of
the other in the middle of the environment, thus allowing the drone to navigate
around the large block. Another obstacle has the shape of a cone and it is located
at the far left of the environment. The last obstacle is a sphere positioned at
the far right of the environment. At the beginning of each episode, the drone
(quadrotor) takes off from a starting point in the environment, denoted with a
capital letter and has to navigate through the obstacles to another designated
point as depicted in Fig. 5.
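With an 87 m side split into 15 cells, each cell spans 87/15 = 5.8 m. A minimal helper for converting between world coordinates and grid cells (the origin convention and function names are assumptions of this sketch):

```python
GRID_CELLS = 15
SIDE_M = 87.0
CELL_M = SIDE_M / GRID_CELLS  # 5.8 m per cell

def world_to_cell(x, y):
    """Map a world position (metres, origin at the grid corner) to a cell index."""
    col = min(int(x // CELL_M), GRID_CELLS - 1)
    row = min(int(y // CELL_M), GRID_CELLS - 1)
    return row, col

def cell_to_world(row, col):
    """Return the centre of a grid cell in world coordinates (metres)."""
    return (col + 0.5) * CELL_M, (row + 0.5) * CELL_M
```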
A-star Dikstra’s
Path length (m) Execution time (s) Path length (m) Execution time (s)
A-T 168.2 111.5 168.2 114.6
T-A 168.2 69 168.2 75.2
A-T’ 110.2 71.5 110.2 72.9
T’-A 110.2 49.6 110.2 47
B-E 139.2 94.8 139.2 99.1
E-B 139.2 59.5 139.2 64.1
B-E’ 110.2 72.8 110.2 73.9
E’-B 110.2 49.1 110.2 49.1
The second experiment aims to test the ability of the landing site detection module once the drone approaches the landing area. In this case we test two different scenarios. The first scenario relates to a landing spot that randomly spawns each time at a slightly different position within the greater landing area. Using the YOLO object detection model, it was easy to detect the landing spot when it was not covered by other objects, or even when it was partially covered.
The drone managed to fly above the landing spot and start landing. In order
to validate the ability of tiny-YOLO to efficiently detect the landing spot, we
performed more than twenty experiments in which the landing spot was ran-
domly spawned at different positions within the predefined landing cell. In the
experiments we tried to distort the landing spot (shaped as shown in Fig. 3), to
add some synthetic noise or to partially cut it, in order to increase the difficulty
of its detection. However, the landing spot has been correctly detected in all
experiments.
The second scenario examined a way to land on the detected spot even if it
was partially covered by a static obstacle (e.g. by a shed at a certain height).
In this case, the drone has to gradually change its height until it manages to
find a clear way to land in the spot. For this purpose it repeatedly moves away
from the landing spot, lowers in height in order to get below the obstacle, and
tries to move again above the detected landing spot. The LiDAR is employed in
every loop to detect whether the position right above the landing spot is empty
or not, and to retrieve the obstacle height if possible. The process repeats until
the drone manages to position above the landing site and get a clear view to it.
Depending on the height of the obstacle that covers the spot and the step by which the drone lowers its height, the duration of this process may vary.
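The retry procedure just described can be sketched as a simple loop; the sensor check and motion commands below are stubs standing in for the LiDAR query and autopilot calls, and the 1 m descent step and minimum altitude are illustrative values, not the paper's parameters:

```python
def land_under_obstacle(is_clear_above_spot, move_aside, descend, move_over_spot,
                        land, start_alt, step=1.0, min_alt=2.0):
    """Repeatedly lower the approach altitude until the landing spot is clear.

    is_clear_above_spot(alt) stands in for the LiDAR emptiness check; the
    motion callbacks stand in for autopilot commands. Returns the altitude
    at which landing started, or None if no clear approach was found.
    """
    alt = start_alt
    while alt >= min_alt:
        move_over_spot(alt)               # try to position above the spot
        if is_clear_above_spot(alt):
            land()                        # clear view: start landing
            return alt
        # obstacle above the spot: back off, drop below it, try again
        move_aside(alt)
        alt -= step
        descend(alt)
    return None                           # notify the operator instead
```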
5 Conclusions
In this work, we examined the problem of UAV path planning, route execution
and landing, with the use of camera and LiDAR sensor input. We performed an
overview of the various techniques that exist in the literature for the navigation
of autonomous vehicles in various environments and highlighted the pros and
cons of each group of techniques. We selected two popular search techniques
that model the navigation space using a grid and represent the search space as
a graph. The two techniques, namely A-star and Dijkstra’s shortest path, have
been embedded in a path planning strategy that updates the UAV path when it is uncertain whether the vehicle can move to a new cell, or when a cell in the path is detected to be occupied by an obstacle.
The experimental results showed that the proposed techniques can find an optimal path for the vehicle, for a given granularity of the grid space, which is the same for both path algorithms. The A-star algorithm uses a heuristic function, which gives priority to cells that are likely to compose a shorter path than others, while Dijkstra's simply explores all possible paths. For this reason, A-star's best-first strategy performs faster than Dijkstra's.
The simplicity and efficiency of the developed solution have been demonstrated experimentally in the simulated environment of AirSim. Successful execution of the path planning and navigation within AirSim, with the use of its sensors and its autopilot, indicates that minimum effort will be needed to port the algorithms to a real UAV case. Our work also performed an initial study on the task of UAV landing, employing a computer vision approach for locating the exact landing spot and navigating the drone to it, and examined various scenarios. For this we used the Tiny-YOLO object detector module, which can easily run on edge devices with minimal resource requirements.
References
1. Alam, M.S., Oluoch, J.: A survey of safe landing zone detection techniques for autonomous unmanned aerial vehicles (UAVs). Expert Syst. Appl. 179, 115091 (2021)
2. Azar, A.T., et al.: Drone deep reinforcement learning: a review. Electronics 10(9),
999 (2021)
3. Cabreira, T.M., Brisolara, L.B., Ferreira, P.R.: Survey on coverage path planning
with unmanned aerial vehicles. Drones 3(1), 4 (2019)
4. Cai, Y., Xi, Q., Xing, X., Gui, H., Liu, Q.: Path planning for UAV tracking tar-
get based on improved a-star algorithm. In: 2019 1st International Conference on
Industrial Artificial Intelligence (IAI), p. 1–6 (2019)
5. Deng, Y., Chen, Y., Zhang, Y., Mahadevan, S.: Fuzzy dijkstra algorithm for short-
est path problem under uncertain environment. Appl. Soft Comput. 12(3), 1231–
1237 (2012)
6. Dijkstra, E.W.: A note on two problems in connexion with graphs. Numer. Math. 1(1), 269–271 (1959)
7. Feng, Y., Zhang, C., Baek, S., Rawashdeh, S., Mohammadi, A.: Autonomous land-
ing of a UAV on a moving platform using model predictive control. Drones 2(4),
34 (2018)
8. Galazka, E., Niemirepo, T. T., Vanne, J.: CiThruS2: Open-source photorealistic
3D framework for driving and traffic simulation in real time. In: 2021 IEEE Inter-
national Intelligent Transportation Systems Conference (ITSC), pp. 3284–3291.
IEEE (2021)
9. Gautam, A., Sujit, P. B., Saripalli S.: A survey of autonomous landing tech-
niques for UAVs. In: 2014 International Conference on Unmanned Aircraft Systems
(ICUAS), pp. 1210–1218. IEEE (2014)
10. Gupta, G., Dutta, A.: Trajectory generation and step planning of a 12 DoF biped
robot on uneven surface. Robotica 36(7), 945–970 (2018)
11. Hart, P.E., Nilsson, N.J., Raphael, B.: A formal basis for the heuristic determi-
nation of minimum cost paths. IEEE Trans. Syst. Sci. Cybernet. 4(2), 100–107
(1968)
12. Hodge, V.J., Hawkins, R., Alexander, R.: Deep reinforcement learning for drone
navigation using sensor data. Neural Comput. Appl. 33(6), 2015–2033 (2020).
https://doi.org/10.1007/s00521-020-05097-x
13. Kawabata, S., Lee, J. H., Okamoto, S.: Obstacle avoidance navigation using hor-
izontal movement for a drone flying in indoor environment. In: 2019 Interna-
tional Conference on Control, Artificial Intelligence, Robotics & Optimization
(ICCAIRO), pp. 1–6. IEEE (2019)
14. Raza Khan, M.T., Saad, M.M., Ru, Y., Seo, J., Kim, D.: Aspects of unmanned
aerial vehicles path planning: overview and applications. Int. J. Commun Syst
34(10), e4827 (2021)
15. Khatib, O.: Real-time obstacle avoidance for manipulators and mobile robots. In:
Autonomous Robot Vehicles, pp. 396–404. Springer, New York (1986). https://doi.
org/10.1007/978-1-4613-8997-2 29
16. LaValle, S. M.: Planning Algorithms. Cambridge University Press, Cambridge
(2006)
17. LaValle, S. M., Kuffner, J. J., Donald, B. R., et al.: Rapidly-exploring random
trees: progress and prospects. Algorithmic and Computational Robotics, vol. 5,
pp. 293–308 (2001)
18. Li, F., Zlatanova, S., Koopman, M., Bai, X., Diakité, A.: Universal path planning
for an indoor drone. Autom. Constr. 95, 275–283 (2018)
19. Meng, H., Xin, G.: UAV route planning based on the genetic simulated annealing
algorithm. In: 2010 IEEE International Conference on Mechatronics and Automa-
tion, pp. 788–793. IEEE (2010)
20. Mirzaeinia, A., Shahmoradi, J., Roghanchi, P., Hassanalian, M.: Autonomous rout-
ing and power management of drones in GPS-denied environments through dijkstra
algorithm. In: AIAA Propulsion and Energy 2019 Forum, p. 4462 (2019)
21. Panda, M., Das, B., Subudhi, B., Pati, B.B.: A comprehensive review of path
planning algorithms for autonomous underwater vehicles. Int. J. Autom. Comput.
17(3), 321–352 (2020)
22. Politi, E., Panagiotopoulos, I., Varlamis, I., Dimitrakopoulos, G.: A survey of UAS
technologies to enable beyond visual line of sight (BVLOS) operations. In VEHITS,
pp. 505–512 (2021)
23. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified,
real-time object detection. In: Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 779–788 (2016)
24. Roberge, V., Tarbouchi, M., Labonté, G.: Comparison of parallel genetic algorithm
and particle swarm optimization for real-time UAV path planning. IEEE Trans.
Industr. Inform. 9(1), 132–141 (2012)
25. Sethian, J.A.: A fast marching level set method for monotonically advancing fronts.
Proc. National Acad. Sci. 93(4), 1591–1595 (1996)
26. Shah, S., Dey, D., Lovett, C., Kapoor, A.: AirSim: high-fidelity visual and physical
simulation for autonomous vehicles. In: Hutter, M., Siegwart, R. (eds.) Field and
Service Robotics. SPAR, vol. 5, pp. 621–635. Springer, Cham (2018). https://doi.
org/10.1007/978-3-319-67361-5 40
27. Shivgan, R., Dong, Z.: Energy-efficient drone coverage path planning using genetic
algorithm. In: 2020 IEEE 21st International Conference on High Performance
Switching and Routing (HPSR), pp. 1–6. IEEE (2020)
28. Souissi, O., Benatitallah, R., Duvivier, D., Artiba, A., Belanger, N., Feyzeau, P.:
Path planning: a 2013 survey. In: Proceedings of 2013 International Conference on
Industrial Engineering and Systems Management (IESM), pp. 1–8. IEEE (2013)
29. Tan, L.K.L., Lim, B.C., Park, G., Low, K.H., Yeo, V.C.S.: Public acceptance of
drone applications in a highly urbanized environment. Technol. Soc. 64, 101462
(2021)
30. Tang, G., Tang, C., Claramunt, C., Xiong, H., Zhou, P.: Geometric a-star algo-
rithm: an improved a-star algorithm for AGV path planning in a port environment.
IEEE Access 9, 59196–59210 (2021)
31. Tsintotas, K.A., Bampis, L., Taitzoglou, A., Kansizoglou, I., Gasteratos, A.: Safe UAV landing: a low-complexity pipeline for surface conditions recognition. In: 2021 IEEE International Conference on Imaging Systems and Techniques (IST), pp. 1–6. IEEE (2021)
32. Turker, T., Sahingoz, O. K., Yilmaz, G.: 2D path planning for UAVs in radar
threatening environment using simulated annealing algorithm. In: 2015 Interna-
tional Conference on Unmanned Aircraft Systems (ICUAS), pp. 56–61. IEEE
(2015)
33. Yang, Q., Yoo, S.-J.: Optimal UAV path planning: sensing data acquisition over
IoT sensor networks using multi-objective bio-inspired algorithms. IEEE Access 6,
13671–13684 (2018)
34. Zhang, Z., Zhao, Z.: A multiple mobile robots path planning algorithm based on
a-star and dijkstra algorithm. Int. J. Smart Home 8(3), 75–86 (2014)
35. Zhang, Z., Tang, C., Li, Y.: Penetration path planning of stealthy UAV based on
improved sparse a-star algorithm. In: 2020 IEEE 3rd International Conference on
Electronic Information and Communication Technology (ICEICT), pp. 388–392
(2020)
36. Zhou, X., Yi, Z., Liu, Y., Huang, K., Huang, H.: Survey on path and view planning
for UAVs. Virtual Reality Intell. Hardware 2(1), 56–69 (2020)
Digital Ticketing System for Public Transport
in Mexico to Avoid Cases of Contagion Using
Artificial Intelligence
1 Introduction
At the beginning of the pandemic, several prevention strategies were implemented to
avoid contagion of the virus, such as the “Stay at home” initiative, the use of face masks,
constant hand washing, and the disinfection of commonly used items, in addition to
prone areas to the virus. To reduce contagion, a new way of continuing to work had to
be found so that the economy would not suffer many consequences, which led to remote
work and the temporary or permanent closure of schools [1].
In recent months, essential businesses have reopened as the epidemiological traffic light has gone down. This increases the flow of people on public transport, which is where a problem lies: the low or nonexistent hygiene measures for the prevention of contagion. Even with the vaccination campaigns, the contingency situation is still in force, since the virus has not been 100% eradicated. This causes insecurity in a certain part of the population, as they must go out and use public transport daily [2]. Digital fare collection systems have been implemented around the world, making transport more efficient and convenient for users. With the use of a digital system for public transport, the contact between driver and passenger could be reduced.
The objective is to develop a web system of digital tickets for public transportation in the Tijuana area, as an alternative to physical tickets, to reduce the number of infections due to the health contingency of recent months.
The goals are: creation of digital tickets through the web system, seeking to reduce physical tickets; serving as an alternative solution for reducing infections through physical contact between driver and passenger; reducing the use of paper, contributing to less soil contamination; and using the web system through electronic tablets. The reason that led us to develop this project is the need for a digital ticket system for public transport that operates in an agile way and without the need for interaction between users and drivers.
The main reason to carry it out at present is the contingency due to the pandemic, which is still active, and the increased risk it entails for the staff and users of these services. Another reason is to reduce soil contamination, since these tickets are often discarded by passengers, thrown out the windows or even inside the vehicle. Through this system of electronic tickets, the aim is to avoid the handling of physical tickets in public transport, since contact with tickets, like cash, can lead to contagion (cash is left as optional), in addition to avoiding physical contact by reducing the interaction between passenger and driver.
It involves the supervision of the validity of the tickets used in the transport; normally these are contactless cards, but control of other types of tickets can be implemented, for example payment through NFC, SMS tickets, 2D codes on mobile phone screens, tickets with barcodes or QR codes, among others. The system is designed to make the work of the inspectors easier and more efficient and to eliminate attempts to avoid payment for the corresponding trip. With the passage of time and the different needs that have arisen, electronic ticket collection systems in public transport have evolved, reducing effort, time, and margin of error. Reliability and efficiency have been among the most important characteristics in the design of this type of system, which is why the intervention of digital devices facilitates the task.
Various technologies and service models have been developed to provide mobile
payment service in different contexts, including e-wallets, operator billing (SMS) and
contactless payments, for example with NFC (Near Field Communication) technology
[3, 4].
In the context of public transport, different technologies have been implemented and
adapted to create new service models that respond to the challenges of this sector [5, 6].
Technology is also closely related to the structure of the service, some of the options
360 J. S. Magdaleno-Palencia et al.
that can be found are pay per use, fixed tickets (from one specific station to another) and
subscriptions.
Self-ticketing: This is one of the most popular technologies, given its ease of implementation. Using an app, the user can buy tickets from a specific origin station to a specific final station, which means that the travel route is fixed. The result of the transaction is a QR code or barcode that can be viewed on the phone; there are two options for the execution of this service model. Both in Europe and in other countries around the world, different digital fare collection and ticketing systems for public transport have been created and implemented [7]. When stations do not have gates, or when boarding vehicles, the ticket must be activated prior to boarding. At that moment, the ticket begins to show an animated background, which can be represented by a QR code, which the driver can easily check during boarding through the device the user is carrying (his cell phone, for example). On most services there is an extra check done by an inspector on board [8].
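One way such a QR ticket can be realized, sketched here under the assumption of a key shared between issuer and inspector devices (the field names and the HMAC signature scheme are illustrative, not the system described in the text), is a compact signed payload that an inspector's device can verify offline:

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"demo-operator-key"  # illustrative shared key, not a real one

def issue_ticket(route, origin, destination, issued_at):
    """Build a signed ticket payload that could be encoded as a QR code."""
    payload = json.dumps(
        {"route": route, "from": origin, "to": destination, "ts": issued_at},
        sort_keys=True,
    ).encode()
    sig = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest().encode()
    return base64.urlsafe_b64encode(payload + b"." + sig).decode()

def verify_ticket(token):
    """Return the ticket fields if the signature checks out, else None."""
    raw = base64.urlsafe_b64decode(token.encode())
    payload, _, sig = raw.rpartition(b".")  # hex signature contains no dots
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest().encode()
    if hmac.compare_digest(sig, expected):
        return json.loads(payload)
    return None
```

Any tampering with the encoded fields invalidates the signature, which addresses the fare-evasion concern without requiring the inspector's device to be online.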
Closed stations: In stations with gates, the challenge is to open the doors with the mobile device; for that purpose the doors are usually equipped with QR scanners. NFC (Near Field Communication): This is a wireless data transfer method that detects and then enables nearby technology to communicate without the need for an internet connection. It is easy, fast and works automatically [9]. It means that two devices can transfer data to each other without being connected to Wi-Fi, or using a pairing code as in Bluetooth. Due to the encryption protocol, the chips embedded in most high-tech smartphones are secure enough to be used for payments like a contactless card.
NFC technology is today one of the most used technologies in mobile phones in general, mainly for payments in physical stores and other services. However, the implementation of this technology faces great challenges that directly affect the opportunities in public transport. According to RFID Journal [10, 11], the biggest challenge is the slow adoption process due to lack of infrastructure and a complex ecosystem of stakeholders and standards. Several NFC service models have been developed to adapt the technology to the specifications and needs of public transport. Some of the NFC applications are NFC + SIM card [12]. Most gates today are not equipped with NFC readers, as they were deployed before NFC emerged as a standard. In addition, there is a wide variety of encryption systems across the different mobile models. For this reason, this solution proposes a SIM card plus an NFC chip, which emulates the protocol of the chip integrated in the doors and posts.
The gates can then be opened and the user's transactions tracked as with a transport card, with top-ups through an app. This model is currently used in Hong Kong. NFC to scan the card: in this case, given the limited reader infrastructure, NFC is used to scan the service card (for example, the OV-chipkaart or Octopus card) and to top it up via an app transaction.
Mobile wallet. - This is by far the most widespread use of NFC for mobile payment, thanks to the launch of Apple Pay and Android Pay in most countries. By using an application that stores credit and debit card information, a smartphone can be used as a means of payment instead of the physical card.
There are other technologies that are not widely used. Hop-on: this technology was developed by an Israeli start-up as an alternative way of paying for public transport
Digital Ticketing System for Public Transport 361
with a mobile phone. The technology sends information through ultrasonic sound waves transmitted from the phone to the reader. It is said to be safe, low-cost and fast.
Projects such as Route 664 have also contributed. Route 664 emerged as a student project within the Communication program of the Faculty of Humanities and Social Sciences at UABC. The purpose of the platform is to make the routes of both buses and taxis known to the citizens of Tijuana through an interactive map. Other tools such as videos, photographs and graphics are also used; the latter consist of a photograph of the vehicle, the schedule, the fare and the place where the route can be taken.
A field study was carried out, in which several streets of the city center were visited. Although there are more than 170 routes, the interactive map project covers 70, and it is the users themselves who contribute the information on the routes still missing. The idea is to build on projects of this type. Figure 1 shows the current map of transportation routes taken from the project “Route 664: Public Transportation Platform in Tijuana”, and Fig. 2 shows Tijuana’s current transport routes.
Fig. 1. Map taken from the project “route 664: public transportation platform in Tijuana”, https://www.sandiegored.com/es/noticias/98691/Ruta-664-Plataforma-de-Transporte-Publico-en-Tijuana.
GPS, the Global Positioning System, is a satellite navigation system that makes it possible to locate the position of an object, vehicle, person or ship anywhere in the world with a precision of up to a few centimeters, although a margin of error of meters is usual. GPS works through a network of 32 satellites, 28 operational and 4 backups, orbiting 20,200 km above the planet with synchronized trajectories. To determine a position, the receiver locates at least 3 satellites of the network and uses the identification signals and time delivered by each one [13]. With this, the device synchronizes its GPS clock and calculates the time it takes for the signals to reach the equipment,
362 J. S. Magdaleno-Palencia et al.
which makes it possible to know the distances to the satellites through triangulation. Triangulation consists of determining the distance of each satellite from the measurement point [14]. The useful data that allow the GPS receiver to determine its position are called ephemerides. Each satellite emits its own ephemeris, which includes its position in space, whether it should be used or not, its atomic time, Doppler information, etc.
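The distance-based position fix described above can be sketched in a simplified two-dimensional form. The following Python fragment is our own illustration (the reference points and distances in the example are invented): subtracting the circle equations pairwise yields two linear equations in (x, y), solved here with Cramer's rule.

```python
def trilaterate(p1, d1, p2, d2, p3, d3):
    """2-D position fix from three reference points and measured distances.

    Subtracting the circle equation of point 1 from those of points 2 and 3
    eliminates the quadratic terms, leaving a 2x2 linear system in (x, y).
    """
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    a1, b1 = 2 * (x2 - x1), 2 * (y2 - y1)
    c1 = d1 ** 2 - d2 ** 2 + x2 ** 2 - x1 ** 2 + y2 ** 2 - y1 ** 2
    a2, b2 = 2 * (x3 - x1), 2 * (y3 - y1)
    c2 = d1 ** 2 - d3 ** 2 + x3 ** 2 - x1 ** 2 + y3 ** 2 - y1 ** 2
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("reference points are collinear")
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)
```

For instance, reference points (0, 0), (10, 0) and (0, 10) with measured distances 5, √65 and √45 give the fix (3.0, 4.0). A real GPS fix works in three dimensions and also solves for the receiver clock offset, which is why a fourth satellite is used in practice.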
A QR code (Quick Response code) is a method of representing and storing information in a two-dimensional dot matrix [15]. This 2D symbology originated in 1994 in Japan [16], when Denso Wave, a subsidiary of Toyota, developed it to improve the traceability of the vehicle manufacturing process. It was designed with the main objective of achieving simple and fast decoding of the information it contains. QR codes are common in Japan and increasingly widespread worldwide (thanks to their use to encode Internet URLs and to the decoding applications available for camera phones); they are characterized by three squares in the corners, which facilitate the reading process.
The Secretariat of Infrastructure, Urban Development and Territorial Reorganization (SIDURT) is carrying out studies on the Tijuana-Tecate railway, as well as works and track replacement. After these studies, the Tijuana-Tecate interurban train is expected to start operating in 2024, mobilizing an average of 40 thousand people, reordering transportation routes and thus reducing traffic in Tijuana.
Figure 2 shows the current public transportation network in Tijuana, with 114 routes for mass transportation and 125 routes for en-route taxi transportation. As can be seen in Fig. 2, there are 34 routes operating in parallel in a single corridor.
There are other systems in use in Europe, such as the London Oyster card. This card uses RFID and is valid on buses, the subway and suburban trains; it took a few years to be adopted by millions of users. In Asia, South Korea introduced a smart-card system for public transport called T-money, which is used throughout the country. The card pays for public transport services on buses, taxis and trams, and is also used for other activities, such as gas stations and vending machines. In Hong Kong, the Octopus card is used as a means of electronic payment for public transport. Like the T-money card, the Octopus card can be used to pay for public transport services, as well as to make payments in supermarkets, restaurants and other businesses. In Germany, a service called “Touch and Travel” was launched, in which GPS is used to track the user’s journey; users are required to sign in and out on the app when getting on and off. The service started in 2008 in Berlin and could later be used throughout Germany, but it was discontinued at the end of 2016 [17].
Artificial intelligence (AI) is increasingly present in our lives. A good definition of it would be: a combination of algorithms that try to simulate some human actions or, better yet, go beyond human intelligence. Best of all, it is open to many fields and can provide solutions, as in the case of customer service using chatbots, which according to the consulting firm Gartner would by 2020 be implemented in 84% of the companies surveyed, increasing investment in this type of technology [18].
For this and many other reasons, this research focuses on the development of an artificial intelligence: we investigate how such systems work and how to create one.
3 Methodology
3.1 Applications
The sample will be obtained in a probabilistic way, selecting a portion of the population in certain areas of the city that meet the characteristics of a transport route or a transport stop and that are users of public transport. For faster data collection and greater accuracy, stratified random sampling will be used, proportional to the population of the area near the transport route or main stops that normally uses transport. In other words, the universe is taken to be all users of public transportation in the city of Tijuana. Because it is not possible to locate or interview all users of public transportation in the city, a sample of the population will be chosen at the most accessible points.
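The proportional allocation step of the stratified design can be sketched as follows. This is a minimal illustration (the stratum populations in the example are invented, not survey data): each stratum receives a share of the total sample proportional to its population, with largest-remainder rounding so the allocations sum exactly to the sample size.

```python
def proportional_allocation(strata_sizes, sample_size):
    """Allocate a total sample across strata proportionally to their
    populations, using largest-remainder rounding so sizes sum exactly."""
    total = sum(strata_sizes)
    quotas = [sample_size * n / total for n in strata_sizes]
    alloc = [int(q) for q in quotas]
    # Hand out the remaining units to the strata with the largest remainders.
    by_remainder = sorted(range(len(quotas)),
                          key=lambda i: quotas[i] - alloc[i], reverse=True)
    for i in by_remainder[: sample_size - sum(alloc)]:
        alloc[i] += 1
    return alloc
```

For example, strata of 5000, 3000 and 2000 residents with a total sample of 100 questionnaires yield allocations of 50, 30 and 20.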
It was determined that this will be an experimental investigation: it works from the existing means, and an analysis will be made to obtain a result for the problem, in order to assess the possible effects of the changes to be made. The tool used will be questionnaires, composed of closed questions, so that the people surveyed can respond briefly and specifically, yielding the information to be worked with. The survey will be carried out before and after the changes, in order to obtain a point of comparison and to determine whether there was an improvement after
applying the changes, and what percentage of improvement was obtained, based on the efficiency of all the processes involved in getting around on public transport.
3.2 Fieldwork
The survey technique will be used, since with this collection technique there will be communication, through questionnaires, with the people selected for the collection of information. It will be carried out by people previously trained and informed of what is to be executed.
This person, whom we will call the “field worker”, must have contact with the population close to the problem in question; it is the worker himself who asks the questions, conducts the survey and records the answers, from which more general results are later obtained. The initial contact is of the utmost importance, since it will be necessary to convince the population that their participation matters. When posing the questions, the order in which they appear in the survey must be respected; it is also advisable to read the questions slowly, for better understanding by the interlocutor, and to repeat them if necessary. When recording the answers, the respondents’ responses will not be summarized or paraphrased.
Once the survey data have been obtained, they will be analyzed and classified for better understanding and to take the user’s needs into account. The surveys will have to be validated to confirm that what was initially planned has been done as established, in order to detect fraud or failures committed by the interviewer (field worker), such as: recording the answers badly, not following the order of the questions, or paraphrasing the answers obtained. Afterwards, the answers will be listed in order to work more quickly. Once the data have been analyzed and processed, this information will be converted into a graphic presentation.
For this, a tabulation system will be used for the questions, which allows graphs to be generated and a value to be given to each answer.
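A tabulation of closed-question answers of the kind described can be sketched in a few lines of Python (the response values below are hypothetical, chosen only for illustration):

```python
from collections import Counter

def tabulate(responses):
    """Count each answer option and compute its percentage share,
    ready to be fed into a charting tool."""
    counts = Counter(responses)
    total = len(responses)
    return {option: (n, round(100 * n / total, 1))
            for option, n in counts.items()}
```

For example, `tabulate(["yes", "no", "yes", "yes"])` returns `{"yes": (3, 75.0), "no": (1, 25.0)}`, i.e. a count and a percentage per answer option.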
The algorithm used in the Q-Learning method is quite involved, but it is recommended for what we need. Its update rule is

Q_{new}(s_t, a_t) ← (1 − α) · Q(s_t, a_t) + α · (r_t + τ · max_a Q(s_{t+1}, a))   (1)

where α is the learning rate, τ is the discount factor, and r_t is the reward received on the transition from state s_t to state s_{t+1}.
The agent’s goal is to maximize its total reward. It does this by adding the maximum attainable reward in future states to the reward for reaching its current state, so that the potential future reward effectively influences the current action. This potential reward is a weighted sum of the expected rewards of future steps starting from the current state.
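As an illustration of update rule (1), the following self-contained Python sketch trains a tabular Q-Learning agent on a toy five-state corridor. The environment, reward scheme and hyperparameter values are our own choices for demonstration, not those of the final system; the symbols alpha and tau match Eq. (1).

```python
import random

def train_corridor(n_states=5, episodes=500, alpha=0.5, tau=0.9, eps=0.3, seed=0):
    """Tabular Q-Learning on a 1-D corridor: start in state 0, reward 1 for
    reaching the last state. Actions: 0 = step left, 1 = step right."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # Epsilon-greedy action selection: explore with probability eps.
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda act: Q[s][act])
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == n_states - 1 else 0.0
            # Equation (1): blend the old estimate with the bootstrapped target.
            Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + tau * max(Q[s_next]))
            s = s_next
    return Q
```

After training, stepping right has the higher Q-value in every non-terminal state, so the greedy policy walks straight to the reward, which is exactly the "maximize total reward" behavior described above.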
Our population is the set of results produced by the artificial-intelligence tests each time the system is sent to “train”; our sample is not a fixed number of these results, but rather selected results chosen depending on the variables used in the training and on how long the training is allowed to run.
For the analysis of the results, we will document the data produced by the artificial-intelligence training. These data will be grouped by training run, and based on the results of each run it will be possible to determine which set of variables is best for training the artificial intelligence on the given route.
5 Conclusions
Artificial intelligences are complex to develop. The theory that must be understood before starting to build an AI is quite abstract, because it involves a great deal of information and many complex mathematical calculations; developing the artificial intelligence itself is even harder, because the theory and calculations must be implemented in code. But once we understand, to a certain extent, how the learning method works (in this case, Q-Learning), we can see more clearly how everything works and why. Although the development know-how can be expressed in a few ideas, the truth is that the specific functioning of artificial-intelligence learning is not so easy to express, due to its intrinsic complexity.
References
1. Hamidi, S., Zandiatashbar, A.: Compact development and adherence to stay-at-home order
during the COVID-19 pandemic: a longitudinal investigation in the United States. Landsc.
Urban Plan. 205, 103952 (2021)
2. Hernández Bringas, H.: COVID-19 en México: un perfil sociodemográfico. Notas Poblacion
(2021)
3. Coskun, V., Ozdenizci, B., Ok, K.: A survey on near field communication (NFC) technology.
Wirel. Pers. Commun. 71(3), 2259–2294 (2013)
4. Rahul, A., Rao, S., Raghu, M.E.: Near field communication (NFC) technology: a survey. Int.
J. Cybern. Inform. 4(2), 133 (2015)
5. Pelletier, M.-P., Trépanier, M., Morency, C.: Smart card data use in public transit: a literature
review. Transp. Res. Part C Emerg. Technol. 19(4), 557–568 (2011)
6. Połom, M., Wiśniewski, P.: Implementing electromobility in public transport in Poland in 1990–2020. A review of experiences and evaluation of the current development directions. Sustainability 13(7), 4009 (2021)
7. LeRouge, C., Nelson, A., Blanton, J.E.: The impact of role stress fit and self-esteem on the
job attitudes of IT professionals. Inf. Manag. 43(8), 928–938 (2006). https://doi.org/10.1016/
j.im.2006.08.011
8. Finžgar, L., Trebar, M.: Use of NFC and QR code identification in an electronic ticket sys-
tem for public transport. In: SoftCOM 2011, 19th International Conference on Software,
Telecommunications and Computer Networks, pp. 1–6 (2011)
9. Faulkner, C.: Secure commissioning for ZigBee home automation using NFC. Jan 23, 1–3
(2015)
10. Chen, J., Hines, K., Leung, W., Ovaici, N., Sidhu, I.: NFC mobile payments. Cent. Entrep.
Technol. Technical report, vol. 28 (2011)
11. Du, H.: NFC technology: today and tomorrow. Int. J. Futur. Comput. Commun. 2(4), 351
(2013)
12. Aziza, H.: NFC technology in mobile phone next-generation services. In: 2010 Second
International Workshop on Near Field Communication, pp. 21–26 (2010)
13. Bahmani, K., Nezhadshahbodaghi, M., Mosavi, M.R.: Optimisation of doppler search space
to improve acquisition speed of GPS signals. Surv. Rev., 1–17 (2022)
14. Peng, H., et al.: Analysis of precise orbit determination for the HY2D satellite using onboard
GPS/BDS observations. Remote Sens. 14(6), 1390 (2022)
15. Julham, M.L., Lubis, A.R., Al-Khowarizmi, I.K.: Automatic face recording system based on
quick response code using multicam. Int. J. Artif. Intell. 11(1), 327–335 (2022)
16. Pan, J.-S., Liu, T., Yan, B., Yang, H.-M., Chu, S.-C.: Using color QR codes for QR code secret
sharing. Multimedia Tools Appl., 1–19 (2022). https://doi.org/10.1007/s11042-022-12423-z
17. Gerpott, T.J., Meinert, P.: Who signs up for NFC mobile payment services? Mobile network
operator subscribers in Germany. Electron. Commer. Res. Appl. 23, 1–13 (2017)
18. Bosch-Sijtsema, P., Claeson-Jonsson, C., Johansson, M., Roupe, M.: The hype factor of digital
technologies in AEC. Constr. Innov. (2021)
To the Question of the Practical Implementation
of “Digital Immortality” Technologies: New
Approaches to the Creation of AI
Abstract. On the basis of the principle of dialectical symmetry, put forward within the framework of the philosophy of dialectical positivism, it is shown that Jung’s scheme of personality structure should be refined: it should include an element that makes the scheme symmetrical, the collective conscious (a term formed by analogy with the collective unconscious). This approach allows the concepts of intellect, mind and consciousness to be distinguished. In particular, the intellect is interpreted as the structural component of the personality most closely adjacent to the collective conscious. It is shown that it is this structural component of the personality that can already be converted into digital form at the present stage of research, by using methods for decoding the operating algorithms of convolutional and similar neural networks, which we previously proposed on the basis of new digital-signal-processing methods built on the use of non-binary Galois fields. It is shown that the digital reconstruction of a single component of the personality, the intellect, can be considered the first step towards the implementation of digital-immortality technologies.
1 Introduction
Currently, a number of technologies have already been implemented that provide an imitation of digital immortality. Thus, neural networks make it possible to synthesize video messages from people who have already died; popular show-business stars give concerts and appear in films after death, etc. Microsoft has patented a technology for creating an interactive chatbot of a specific person.
There is no doubt that technologies of this kind are far from digital immortality proper; they are really nothing more than a kind of imitation. But their very appearance and wide distribution not only demonstrate once again a person’s desire for individual immortality, but also show the direction of further development of information technologies in general and artificial intelligence (AI) in particular.
The vector of AI development is obvious: it will increasingly approach human intelligence. Consequently, all research (philosophical, neurophysiological, psychological, etc.) that can, to one degree or another, contribute to understanding the essence of intelligence as such will gain relevance. From the most general considerations of practical philosophy, it follows that the construction of artificial-intelligence systems approaching human intelligence is inseparable from the problem of digital immortality. If we, mankind, understand what intelligence is, at the level of a consistent philosophical interpretation and a correct mathematical description, we will understand how to transfer it to a computer or other non-biological carrier.
As noted in [1], published at the very beginning of this century, estimates of the information performance of computers, even at the level of the ideas of that time, made it possible to assume that it was already sufficient to “transfer a personality to a non-biological carrier”, which resulted, in particular, in numerous discussions about the possibility of creating an “e-creature”.
Consequently, the point is no longer the level of development of computer technology; the point is to comprehend the essence of such information objects as the human intellect. This factor makes the thesis about the convergence of natural-science, technical and humanitarian knowledge more than relevant.
An excellent illustration of this thesis is the judgment expressed in the review article [2], written by one of the most prominent experts in the field of mathematical logic and the philosophy of logic: “Gabbay predicts that the day is not far off when the computer scientist will wake up with the realization that his professional line of work belongs to formal philosophy.”
The further development of artificial-intelligence systems, which no one doubts, even apart from the issue of digital immortality, already raises and will continue to raise questions for “techies” that were previously predominantly within the competence of the social sciences and humanities. The main one, obviously, is the question of the essence of intelligence as such. Without an answer to it, all discussions about whether a particular system can be considered artificial intelligence become pointless [3, 4].
This paper proposes a new, non-trivial approach to the development of artificial-intelligence systems, one most closely related to the problems of digital immortality and of Jungian psychology.
Specifically, the problem of “digital immortality” cannot be solved all at once; we aim to show that it can be solved step by step. The basis for this approach is that the structure of personality, as demonstrated in this work, is very complex. There is no point in trying to transfer the “personality” to a non-biological carrier entirely and immediately, especially since modern science has not reached the level of understanding of the essence of the intellect, mind and consciousness of a person that would allow an adequate technical specification to be written for programmers.
We argue that there are already prerequisites for transferring to a non-biological carrier individual components of the personality that are mainly associated with the concept of “intelligence”. It should be emphasized that the concepts of “intelligence”,
370 A. Bakirov et al.
“mind” and “consciousness”, although they have overlapping semantic spectra, are by no means identical.
2 Literature Review
As shown in [3, 4], the intellect, consciousness and mind of a person should first of all be considered as information-processing systems. However, these concepts are by no means identical. A distinction between them can be made starting from the conclusion about the dual nature of the intellect, consciousness and mind of a person [5]. In the cited work it was shown, in particular, that the intellect and consciousness of a person are only relatively independent; in fact, their nature is dual, i.e. human intelligence simultaneously has both a collective and an individual component.
This conclusion, as well as the principle of dialectical symmetry put forward in [4],
makes it possible to overcome some of the methodological contradictions inherent in
the views of Jung and his followers on the collective unconscious.
Namely, as was shown in the cited works, the collective unconscious, understood according to Jung, is a consequence of the formation of transpersonal information structures, which arise because the exchange of signals between neurons takes place not only within the brain of an individual [3, 4].
Any interpersonal communication de facto comes down to an exchange of signals between neurons belonging to relatively independent fragments (localized within the brain of each person) of a common neural network. Consequently, along with such information objects as the intellect, mind and consciousness of a person (the individual level), transpersonal information structures are also formed, likewise generated by the exchange of signals between neurons, which are part of a global neural network that can be identified with the noosphere as understood by V.I. Vernadsky.
This mechanism, in particular, makes it possible to reveal the essence of the collective unconscious as an objectively existing information system. Moreover, it radically changes the view of what should be understood as the structure of personality.
It should be noted that at present there are quite a few psychological schools (not only Jungian) whose representatives have proposed many different schemes of personality structure and interpretations of the phenomenon of the collective unconscious [6, 7]. Moreover, the question has been raised of the practical use of Jung’s ideas, for example in politics [8] and in marketing [9].
However, the dual nature of human intellect and consciousness is reflected in them inconsistently. In our opinion, this is because psychologists created these models on a purely empirical basis. The theory of neural networks has so far found only limited application in psychology, which, as emphasized in [5], is due to a lack of understanding of the operating algorithms of neural networks themselves. Recall that the vast majority of neural networks currently in use are de facto the result of computer experiments: the algorithms by which artificial neural networks are trained are known, but it is most often impossible to predict the result of training, and even harder to reveal the actual algorithm by which the trained network functions.
It is this factor (the logical opacity of neural networks) that led to the emergence of the thesis about the need to develop explainable neural networks [10, 11] and explainable AI.
3 Research Methodology
To substantiate new approaches to the implementation of digital immortality (in a limited format at the first stage), this paper uses the method of bringing Jung’s scheme of personality structure to a form that satisfies the principle of dialectical symmetry [3, 4].
From this point of view, the collective unconscious certainly cannot be considered
as an element of the personality structure (at least in the full sense of the term).
Personality should be considered only as something relatively independent; in fact, what is called a personality is the result of a complex interaction of an individual with society, or rather with the noosphere. It is in this respect that the intellect and consciousness of a person are interpreted as entities of a dual nature, containing both collective and individual components; let us emphasize this again.
In this case, the collective component is generated by transpersonal information structures; more precisely, the collective component of intelligence is a certain projection of these transpersonal structures.
4 Results
Based on the principle of dialectical symmetry and the methodological conclusion above, it should be recognized that along with the collective unconscious, understood in Jung’s sense, there is also a collective conscious. It is formed, in particular, by scientific theories, political doctrines and everything that a person can be taught and learns throughout life. Consequently, the well-known scheme of Jung’s personality structure (presented in Fig. 1 in simplified form) should be supplemented to make it symmetrical (Fig. 2).
From this conclusion, in turn, it follows that a quite definite distinction should be made between the intellect, consciousness and mind of a person; these are by no means synonymous terms. A detailed consideration of this issue is beyond the scope of this work; it is only important to note that the intellect is the component of the structure of the personality (considered as a subsystem of the noosphere) that is most closely in contact with the collective conscious.
It is precisely this part of the personality structure that can already be “digitized”, i.e. transferred to a non-biological storage medium in the foreseeable future.
Fig. 2. Bringing the personality structure scheme according to Jung to a form that meets the
principle of dialectical symmetry (simplified version).
We should emphasize once again that all the prerequisites now exist, associated mainly with the present-day development of the theory of artificial neural networks (ANNs) and AI. In particular, as noted above, considerable attention is currently being paid to research in the field of explainable neural networks [10, 11].
Note that ANNs are often contrasted with systems whose algorithms are explicitly prescribed. This is because the algorithm of an ANN, formed in the process of learning, most often remains uninterpretable, which is expressed by the thesis of the logical opacity of ANNs.
Obviously, overcoming the logical opacity of neural networks is the basis for further progress in understanding the essence of intelligence, including human intelligence. From a general methodological point of view this seems almost obvious, especially if both trained neural networks and human intelligence are considered as “black boxes”. Having revealed the algorithms of the functioning of neural networks in this vein, one can come closer to understanding the functioning of more complex systems.
An important step in this direction was made in [25], where a “digital” analogue of the convolution theorem was applied to the description of signals modeled by functions taking values in Galois fields.
It is appropriate to emphasize, firstly, that this theorem is a direct analogue of the convolution theorem widely used in applications (in Fourier optics, for example [26]) and, secondly, that algebraic structures of this kind are currently actively used in coding theory [27, 28]. The theorem considered in [25] exhaustively describes the functioning of a certain type of ANN (convolutional neural networks [29, 30]), just as the apparatus of transfer functions exhaustively describes the behavior of any linear system that is invariant with respect to time shift.
Consequently, the thesis about the logical opacity of ANNs has been overcome for at least one, and rather important, type of network.
This allows one to raise the question of at least partially preserving the human intellect on some non-biological carrier by decoding those biological neural-network operations that are associated with this particular component of the personality.
Simplifying: intelligence is not only memory; it is primarily an algorithm (in the broad sense of the term), a product of processes of informational self-organization. More precisely, the intellect is an information-processing system, but one built on certain rules, and it is these rules that need to be revealed.
From this point of view, intelligence can be considered a “black box” whose real structure (the neurophysiological processes) remains unknown. However, its algorithm can be reconstructed from an array of data reflecting its response to external influences. In particular, as applied to convolutional ANNs, such decoding can be carried out already now, based on the digital analogue of the convolution theorem [19].
Recall that the equivalent electronic circuit of any linear, time-shift-invariant system that processes time-dependent signals can be established from an analysis of its amplitude-frequency characteristic, using the classical convolution theorem. In a completely similar way, the logic of a convolutional neural network can be decoded on the basis of the digital convolution theorem, given a sufficient data array in which each set of values characterizing the state of the inputs of the ANN is associated with a set of data characterizing the state of the outputs.
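The decoding scheme just described can be illustrated with a toy example. The sketch below is our own construction, not the actual formalism of [25]: it uses a small prime field GF(17) and length-4 sequences purely for demonstration. It implements a naive number-theoretic transform (NTT), checks the finite-field analogue of the convolution theorem, and then recovers an unknown convolution kernel, the “black box”, from a single input-output pair by dividing the transforms.

```python
P, N = 17, 4            # prime modulus and block length
W = 13                  # 13 has multiplicative order 4 mod 17 (13^2 = 16, 13^4 = 1)
W_INV = pow(W, -1, P)   # inverse root of unity
N_INV = pow(N, -1, P)   # 1/N in GF(17)

def ntt(a, root=W):
    """Naive O(N^2) number-theoretic transform over GF(P)."""
    return [sum(a[i] * pow(root, i * k, P) for i in range(N)) % P
            for k in range(N)]

def intt(A):
    """Inverse transform: forward transform with the inverse root, scaled by 1/N."""
    return [(N_INV * x) % P for x in ntt(A, W_INV)]

def cyclic_conv(a, b):
    """Direct cyclic convolution over GF(P)."""
    return [sum(a[i] * b[(k - i) % N] for i in range(N)) % P
            for k in range(N)]

# Convolution theorem over the finite field:
# transform of the convolution = pointwise product of the transforms.
x, h = [1, 2, 3, 4], [5, 6, 7, 8]
y = cyclic_conv(x, h)
assert ntt(y) == [(u * v) % P for u, v in zip(ntt(x), ntt(h))]

# Black-box decoding: knowing only the input x and the observed output y,
# the kernel h is recovered as INTT(NTT(y) / NTT(x)), provided every entry
# of NTT(x) is invertible (true for this test input).
h_rec = intt([(v * pow(u, -1, P)) % P for u, v in zip(ntt(x), ntt(y))])
assert h_rec == h
```

The last two lines are the analogue of identifying a linear system from its frequency response: with enough input-output pairs whose transforms cover the spectrum, the convolution kernel of the “black box” is determined exactly.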
This example clearly shows that for any technology aimed at even a partial transfer of intelligence to a non-biological carrier (the first clearly visible step towards digital immortality), data that merely record the user’s behavior (say, an array of video information, etc.) are obviously insufficient. In any case, at this stage of research it is not obvious how such a data array could be used for the purposes under consideration.
The technologies of digital reconstruction of intelligence must obviously record not
only the behavior of the user, but also the circumstances that cause this or that reaction,
these or those judgments.
The most convenient area in which such reactions can be tracked is obviously the
educational process (more broadly, professional activity).
Already a whole range of technical solutions can be offered here that are quite
realizable at the present level of programming.
One of them involves recording users' reactions to digital educational resources
and/or specialized (scientific, popular-science, etc.) literature, which makes the
external stimulus (the input signals) known.
In fact, the prerequisites already exist for recreating a professional in a particular
field (for example, a teacher) using the appropriate data array, which is largely formed
in the course of his or her professional activity.
There is no need to create a completely artificial intelligence system designed to
teach students if one can instead reconstruct the digital image of a real teacher
(choosing, of course, the most talented and experienced ones) who will conduct the
classes. Such an image can only partly be interpreted as digital immortality; however,
this step is already realistically in view. In addition, it can stimulate further research
into the essence of intelligence, i.e. into an area that de facto remains the borderline
between information technology and applied philosophy.
To the Question of the Practical Implementation 375
5 Conclusion
Thus, on the basis of the principle of dialectical symmetry, put forward within the
framework of the philosophy of dialectical positivism, Jung's scheme of the structure of
personality should be modernized by adding a new component, the collective conscious.
This component of the structure of personality most closely adjoins the intellect; it is
formed by all those concepts, theories, views, etc. which a person is able to consciously
explore throughout his life.
The present report also shows that the question of digital immortality no longer lies
in the plane of the unattainable.
Moreover, this problem can be solved sequentially; the basis for such an approach is
the existence of a complex structure of personality as an element of the enclosing
system, the noosphere.
The first step corresponds to deciphering the component of the personality that most
closely adjoins the collective conscious, and even limited success of technologies for
this purpose will inevitably give impetus to further work in this direction.
References
1. Bostrom, N.: How long before superintelligence? Int. J. Futures Stud., 2 (1998)
2. Karpenko, A.S.: Modern research in philosophical logic. Quest. Philos. 9, 54–75 (2003)
3. Suleimenov, I.E., Vitulyova, Y.S., Bakirov, A.S., Gabrielyan, O.A.: Artificial intelligence:
what is it?. In: ACM International Conference Proceeding Series, pp. 22–25 (2020). https://
doi.org/10.1145/3397125.3397141
4. Vitulyova, Y.S., Bakirov, A.S., Baipakbayeva, S.T., Suleimenov, I.E.: Interpretation of the
category of complex in terms of dialectical positivism. IOP Conf. Ser. Mater. Sci. Eng. 946(1),
012004 (2020). https://doi.org/10.1088/1757-899X/946/1/012004
5. Bakirov, A.S., Vitulyova, Y.S., Zotkin, A.A., Suleimenov, I.E.: Internet users’ behavior from
the standpoint of the neural network theory of society: prerequisites for the meta-education
concept formation. In: The International Archives of the Photogrammetry, Remote Sensing
and Spatial Information Sciences, vol. XLVI-4/W5–2021, pp. 83–90 (2021). https://doi.org/
10.5194/isprs-archives-XLVI-4-W5-2021-83-2021
6. Hunt, H.T.: A collective unconscious reconsidered: Jung’s archetypal imagination in the light
of contemporary psychology and social science. J. Anal. Psychol. 57(1), 76–98 (2012)
7. Mills, J.: Jung’s metaphysics. Int. J. Jungian Stud. 5(1), 19–43 (2013)
8. Odajnyk, V.W.: Jung and politics: the political and social ideas of CG Jung. iUniverse (2007)
9. Woodside, A.G., Megehee, C.M., Sood, S.: Conversations with (in) the collective unconscious
by consumers, brands, and relevant others. J. Bus. Res. 65(5), 594–602 (2012)
10. Assaf, R., Schumann, A.: Explainable deep neural networks for multivariate time series
predictions. In: IJCAI, pp. 6488–6490 (2019)
11. Angelov, P., Soares, E.: Towards explainable deep neural networks (xDNN). Neural Netw.
130, 185–194 (2020)
12. Arrieta, A.B., et al.: Explainable artificial intelligence (XAI): concepts, taxonomies, oppor-
tunities and challenges toward responsible AI. Inf. Fus. 58, 82–115 (2020)
13. Gunning, D., et al.: XAI—explainable artificial intelligence. Sci. Robot. 4(37) (2019)
14. Došilović, F.K., Brčić, M., Hlupić, N.: Explainable artificial intelligence: a survey. In: 2018
41st International Convention on Information and Communication Technology, Electronics
and Microelectronics (MIPRO), pp. 0210–0215. IEEE (2018)
15. Suleimenov, I.E., Bakirov, A.S., Matrassulova, D.K.: A technique for analyzing neural
networks in terms of ternary logic. J. Theor. Appl. Inf. Technol. 99(11), 2537–2553 (2021)
16. Vitulyova, Y.S., Bakirov, A.S., Shaltykova, D.B., Suleimenov, I.E.: Prerequisites for the anal-
ysis of the neural networks functioning in terms of projective geometry. IOP Conf. Ser. Mater.
Sci. Eng. 946(1), 012001 (2020)
17. Yang, Z., et al.: Understanding retweeting behaviors in social networks. In: Proceedings
of the 19th ACM International Conference on Information and Knowledge Management,
pp. 1633–1636 (2010)
18. Benevenuto, F., et al.: Characterizing user behavior in online social networks. In: Proceedings
of the 9th ACM SIGCOMM Conference on Internet Measurement, pp. 49–62 (2009)
19. Roblek, V., Meško, M., Bach, M.P., Thorpe, O., Šprajc, P.: The interaction between internet,
sustainable development, and emergence of society 5.0. Data 5(3), 80 (2020)
20. Rutter, J.: From the sociology of trust towards a sociology of ‘e-trust.’ Int. J. New Prod. Dev.
Innov. Manag. 2(4), 371–385 (2001)
21. Hossain, S.: The Internet as a tool for studying the collective unconscious. Jung J. 6(2),
103–109 (2012)
22. Luria, A.: Language and Consciousness. Publishing House Peter, 336 p. St. Petersburg (2020)
23. Kravtsova, E.E.: Non-classical psychology L.S. Vygotsky. Natl. Psychol. J. 1, 61–66 (2012)
24. Klochko, V.E.: The Problem of Consciousness in Psychology: A Post-non-Classical Per-
spective. Bulletin of the Moscow University. Series 14. Psychology, vol. 4, pp. 20–35
(2013)
25. Vitulyova, E.S., Matrassulova, D.K., Suleimenov, I.E.: Application of non-binary galois fields
Fourier transform for digital signal processing: to the digital convolution theorem. Indones.
J. Electr. Eng. Comput. Sci. 23(3), 1718–1726 (2021)
26. Goodman, J.W.: Introduction to Fourier Optics. Roberts and Company Publishers (2005)
27. Hla, N.N., Aung, D., Myat, T.: Implementation of finite field arithmetic operations for large
prime and binary fields using Java BigInteger class. Int. J. Eng. Res. Technol. (IJERT) 6(08)
(2017)
28. Shah, D., Shah, T.: Binary Galois field extensions dependent multimedia data security scheme.
Microprocess. Microsyst. 77, 103181 (2020)
29. Afridi, M.J., Ross, A., Shapiro, E.M.: On automated source selection for transfer learning in
convolutional neural networks. Pattern Recogn. 73, 65–75 (2018)
30. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556 (2014)
Collaborative Forecasting Using
“Slider-Swarms” Improves Probabilistic
Accuracy
1 Introduction
In the field of Collective Intelligence (CI), it is well known that the aggregated
estimations, evaluations, and forecasts of a large group can significantly outperform
those of its individual members. For well over a century, a wide variety of aggregation
techniques have been explored for harnessing the intelligence of human populations to
enable more accurate decisions [1–3]. Artificial Swarm Intelligence (ASI) is a recent
real-time technique that has been shown to significantly amplify the decision-making
accuracy of networked human groups using intelligence algorithms modeled on biological
swarms. Unlike votes, polls,
surveys, or prediction markets, which treat each participant as a separable datapoint for
statistical processing, the ASI process treats each individual as an active member of a
real-time dynamic system, enabling the full group to efficiently converge on solutions
as a unified intelligence [4, 5, 9].
For example, a recent study conducted at the Stanford University School of Medicine
showed that small groups of radiologists, when connected by real-time ASI algorithms,
could diagnose chest X-rays with 33% fewer errors than traditional methods of aggre-
gating human input [6, 7]. Researchers at Boeing and the U.S. Army recently showed
that small groups of military pilots, when using ASI technology, could more effec-
tively generate subjective insights about the design of cockpits than current methods [8].
Researchers at California Polytechnic published a study showing that networked busi-
ness teams increased their accuracy on a standard subjective judgment test by over 25%
when deliberating as real-time ASI swarms [9–11]. Also, researchers at Unanimous AI,
Oxford University, and MIT showed that small groups of financial traders, when fore-
casting the price of oil, gold, and stocks, increased their predictive accuracy by over
25% when using the ASI method [12–14]. And researchers at Unanimous AI showed that
networked human teams collaboratively responding to standard IQ tests could
increase their collective IQ score by 14 points when working together using the Swarm®
software platform vs. traditional Wisdom of the Crowd (WoC) voting [15].
Groupwise decision-making is an increasingly important area of research as more
and more teams work remotely. In addition, the rise of Decentralized Autonomous
Organizations (DAOs) increases the need for powerful and precise tools for amplified
groupwise decision-making. While the ability of swarm-based systems to amplify group
intelligence has been validated across many disciplines, current methods do not allow
individuals to freely control their own input, which is desirable for some probabilistic
forecasting tasks. In this paper, we introduce a new swarming method, called a "slider-
swarm", designed for probabilistic forecasting. We first explain the mechanics of this
new interface, and then examine the effectiveness of groups using it to answer a set of
partial-knowledge probabilistic forecasting questions.
2 Slider-Swarm
In a slider-swarm, each participant first registers an individual forecast during a
Personal Deliberation phase; the group then enters a real-time dynamic system that
converges on a final result. In practice, this Groupwise Deliberation phase lasts
between 20 and 40 s, during which time participants continuously adjust their forecasts
based on the behaviors of other participants in the real-time groupwise process. After
this window, each individual's Final Answer is recorded, and an aggregated groupwise
forecast is generated algorithmically using the dynamic data collected during the
two-step process.
Figure 2 shows a slider-swarm during a real-world probabilistic forecasting task
in which a group predicts which of two films is more likely to win an Oscar. In Fig. 2(a),
each user is presented with a forecasting prompt: "Which will win Best Documentary?"
and is asked to set their own individual forecast. Each user, working in isolation, must
move their probabilistic slider out of a highlighted deadband region in order for their
answer to be registered. Once they do, their slider turns green (Fig. 2(b)). After this
initial phase is complete, all users are simultaneously shown the distribution of
responses from the full population of participants (Fig. 2(c)). Each user is then asked
to adjust their forecast in light of the responses from other users. This is a simultaneous
swarming process in which all users can see the changing input from other users in real
time, creating a system in which users are acting, reacting, and interacting as a unified
whole. To ensure that all users provide some degree of change, they are again required
to move out of a 2% deadband region around their initial answer.
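The two registration rules described above can be sketched as simple predicates. This is our own illustrative sketch, not Unanimous AI's implementation; the width of the initial central deadband is not stated in the text, so the value below is an assumption:

```python
# Illustrative sketch of the two deadband rules described above: an initial
# answer must leave a highlighted central deadband before it registers, and
# the adjusted answer must move at least 2% away from the initial answer.
# Slider positions are probabilities in [0, 100] percent.

CENTER = 50.0
INITIAL_DEADBAND = 5.0   # assumed half-width of the central region (not stated)
ADJUST_DEADBAND = 2.0    # stated in the text: 2% around the initial answer

def initial_answer_registered(p):
    """True once the slider has left the central deadband."""
    return abs(p - CENTER) > INITIAL_DEADBAND

def adjustment_registered(p_initial, p_adjusted):
    """True once the user has moved at least 2% away from their first answer."""
    return abs(p_adjusted - p_initial) > ADJUST_DEADBAND

print(initial_answer_registered(52.0))    # False: still inside central deadband
print(initial_answer_registered(58.0))    # True
print(adjustment_registered(58.0, 59.0))  # False: moved only 1%
print(adjustment_registered(58.0, 64.0))  # True
```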
The group is given 25 s for the swarming phase, during which time the real-time
movements of each user influence the behaviors of other users, often creating cascades
of change within the system. Figure 2(d) shows the final responses from this group:
notice the net leftward movement of the group, and that the mean answer changed from
an initial collective forecast of a 55% probability of the film Summer of Soul winning
to a final collective forecast of a 64% probability. The real-time swarming process
thereby led this group to collectively shift its final collaborative answer by 9
percentage points, a move that was ultimately in the correct direction, as Summer of
Soul indeed won this category in 2022.
Fig. 2. A view of a user in a slider-swarm answering a question (panels a–d).
3 Experimental Design
On each question, participants were randomly assigned to either the minority Group A,
with 40% of participants, or the majority Group B, with 60% of participants.
Participants were neither aware of the random assignment into groups nor of the
minority/majority structure of the task at hand. Group A and Group B were each shown a
different set of marbles, with every member of the same group seeing the same thing.
The marbles shown to each group were structured so that one group always saw more blue
marbles than red marbles, and vice versa. This created a split in the population in
which the majority Group B supported one color while the minority Group A supported
the other.
Additionally, on each question we controlled the confidence level of each group
by changing the composition of marbles shown (the difference between the numbers of
red and blue marbles), thereby creating one confident group and one less confident,
flexible group. The "correct" answer was always the answer favored by the confident
group. The majority was designed to be more confident on 5/15 (33%) of the questions
in this experiment, while the minority was designed to be more confident on the other
10/15 (67%). For analysis purposes, question type was further broken down by the
confidence level assigned to each group: a group was initialized with High confidence
if it saw a three- or four-marble color difference between red and blue (e.g. six red,
two blue), Medium confidence if it saw a two-marble difference, and Low confidence if
it saw a one-marble difference. No group was ever shown a set of marbles with equal
numbers of red and blue. Both question types (Majority-correct and Minority-correct)
therefore consisted of three subcategories based on the confidence levels assigned to
Groups A and B: High vs. Medium, High vs. Low, and Medium vs. Low.
Figure 3 shows two examples of the online interface used in this study, each from a
different participant's screen. The leftmost user, part of Group A, saw six marbles
(five red and one blue) and therefore favored "red" as the correct answer. The
rightmost user, part of Group B, saw seven marbles (three red and four blue) and
therefore slightly favored "blue" as the correct answer.
Fig. 3. Two participants see different random draws from the same bag of marbles.
Using this methodology, three groups of between 30 and 36 Mechanical Turk users
were convened to answer a set of 15 probabilistic forecasting questions of this type, each
featuring a different simulated “bag” of marbles.
Participants were given the following instructions before the experiment began:
(1) Each question focuses on one bag of marbles hidden from view. You know three
things about each bag: (i) each marble in the bag is either RED or BLUE; (ii) there
are always 19 marbles in the bag; (iii) the fraction of RED and BLUE marbles in the
bag is randomly selected before each question.
(2) While we are all using the same bag of 19 marbles on each individual question,
everyone will see a random selection of marbles from this bag. Some people may see
more marbles than others, but no one will see more than half of the marbles.
(3) We will swarm as a group to forecast whether the bag contains more RED or BLUE
marbles.
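The bag setup given to participants can be sketched as a small simulation. This is our own illustration, assuming uniformly distributed view sizes, which the paper does not specify:

```python
import random

# Illustrative simulation (our own sketch, not the study's code) of the bag
# setup described in the instructions above: 19 marbles, each RED or BLUE;
# every participant sees a random selection of at most half (9) of them.

def make_bag(n_red, n_total=19):
    """A bag of n_total marbles with a known number of RED marbles."""
    bag = ["RED"] * n_red + ["BLUE"] * (n_total - n_red)
    random.Random(0).shuffle(bag)
    return bag

def participant_view(bag, rng, max_seen=9):
    """One participant's private view: a random sample, never more than half."""
    return rng.sample(bag, rng.randint(1, max_seen))

rng = random.Random(42)
bag = make_bag(n_red=11)                    # ground truth: majority RED
views = [participant_view(bag, rng) for _ in range(30)]
print(len(bag), bag.count("RED"))           # 19 marbles, 11 of them red
print(max(len(v) for v in views) <= 9)      # no one sees more than half
```

Note that in the actual study the views were not independent samples: each group saw one fixed, experimenter-chosen set, so that the two groups favored opposite colors with controlled confidence.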
On each question, after privately observing their marbles and entering their initial
responses, the group worked together using the slider-swarm interface to produce a
collective probabilistic forecast: "Are the majority of the marbles in this bag RED
or BLUE?"
To motivate participants to give reasonable forecasts and to pay attention to one
another's forecasts, a $2 bonus was given to each group that answered more than 80%
of questions correctly.
4 Results
Results were computed for each of three methods of probabilistic forecasting: (i)
individuals, (ii) WoC, and (iii) slider-swarm. Individual answers were taken as the
final responses registered during the Personal Deliberation phase; as previously
discussed, in this phase each participant answered on the slider without seeing other
users' answers, essentially acting as a blind survey. Next, representing traditional
group aggregation methods, the WoC answers were computed as the mean of all
individuals' initial answers. Finally, the slider-swarm answers were computed as the
mean of all individuals' final answers at the end of the Group Deliberation phase.
Over the course of the slider-swarm, individuals reduced their mean Brier scores from
0.238 during the Personal Deliberation phase to 0.214 during the Group Deliberation
phase, a 10% reduction. Moreover, the group forecast had lower errors when working
together in a slider-swarm than with traditional WoC aggregation, significantly
reducing the mean Brier score from 0.212 (WoC) to 0.189 (slider-swarm), an 11%
reduction.
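The Brier score used throughout is the mean squared difference between probabilistic forecasts (in [0, 1]) and binary outcomes. A minimal sketch, with made-up forecasts and a check of the relative reduction quoted above:

```python
# The Brier score for binary outcomes: mean of (forecast - outcome)^2, with
# forecasts in [0, 1] and outcomes in {0, 1}. Lower is better. The example
# forecasts are made up; the final check uses the mean scores reported in
# the text for the marble study.

def brier_score(forecasts, outcomes):
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Made-up example: three forecasts of P(RED majority) and the true outcomes.
print(round(brier_score([0.9, 0.4, 0.7], [1, 0, 1]), 4))  # 0.0867

# Relative error reduction as reported: WoC 0.212 -> slider-swarm 0.189.
reduction = (0.212 - 0.189) / 0.212
print(round(100 * reduction))  # 11
```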
To compute the statistical significance of Brier score differences across the three
categories (individual, WoC, slider-swarm), a bootstrapping analysis was performed to
generate a confidence interval for the mean Brier score of each category, in which the
observed Brier scores for each forecasting method were resampled with replacement 1000
times (Fig. 4). As outlined in Table 1, across the full question set the slider-swarm
method achieved significantly lower error than both individuals (p < 0.001) and the
WoC (p < 0.001).
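The bootstrapping procedure can be sketched as follows; the per-question scores here are synthetic stand-ins, not the study's data:

```python
import numpy as np

# Sketch of the bootstrap procedure described above: resample the observed
# per-question Brier scores with replacement 1000 times and take percentile
# bounds of the resampled means as a confidence interval for the mean.

def bootstrap_mean_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_resamples)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Synthetic stand-in for 45 per-question Brier scores (15 questions x 3 groups).
scores = np.clip(np.random.default_rng(1).normal(0.21, 0.05, size=45), 0, 1)
lo, hi = bootstrap_mean_ci(scores)
print(lo < scores.mean() < hi)
```

Two methods can then be compared by checking whether their bootstrap intervals overlap, or by bootstrapping the difference in means directly.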
To better understand how the composition of the group impacts the performance of the
slider-swarm, we next analyzed the confident-majority and confident-minority question
types separately. As described in Table 2 and illustrated in Fig. 5, on questions with
a more confident majority, the slider-swarm yielded 22% lower Brier scores than the
WoC (p < 0.001). On questions with a confident minority, the slider-swarm moderately
outperformed traditional WoC aggregation, yielding a 3.7% reduction in error, but this
result was not statistically significant (p = 0.12). We can therefore conclude that
slider-swarms produce better probabilistic forecasts than WoC aggregation on questions
where most people are confident and correct; the result may also hold for questions
where only a minority are confident and correct, but the effect size is likely smaller,
so a larger number of trials is needed to confirm it.
384 C. Domnauer et al.
Fig. 4. Results of bootstrap analysis for mean Brier score across three categories:
individual, WoC, and slider-swarm. Slider-swarm achieved a significant reduction in
error compared to both individuals and WoC aggregation.
Table 1. Slider-swarm offers a significant reduction in error compared to both
individual answers and traditional WoC aggregation.
Exploring this issue further, we analyzed the unique subset of questions in which the
majority group was shown a set of marbles with only a one-marble color differential,
thereby inspiring Low confidence, while the minority group had either Medium or High
confidence. For these questions, the slider-swarm produced significantly lower Brier
scores than a WoC survey (slider-swarm Brier = 0.207, WoC = 0.223, p = 0.025).
In other words, when the majority holds relatively low confidence in its answer
compared to the minority, the slider-swarm method enables the minority to influence
the population, causing the majority to switch to the minority position. Thus, the
slider-swarm system allows a correctly confident minority to more successfully steer
the group to converge upon the correct answer (rather than the most popular answer),
even when a large majority holds the opposing (incorrect) belief. This is an important
result.
Fig. 5. Slider-swarm reduced the individual and WoC error both in cases of a confident
majority (left) and in cases of a confident minority (right).
Why can a slider-swarm aggregate confidence better than the WoC on these types of
questions? The key difference is that the slider-swarm does not merely ask participants
to report their confidence; it requires each participant to behave in real time while
being exposed to the beliefs and behaviors of other members of the group. In doing so,
the slider-swarm acts as a dynamic system that empowers the group to converge upon the
answer in which it is collectively most confident.
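One way to see why such a system favors the collectively most confident answer is a toy model, our own illustration rather than the actual Swarm® algorithm, in which each participant drifts toward the group mean with a step size that shrinks with confidence:

```python
# Toy dynamics (our own illustration, not the Swarm(R) algorithm): each
# participant repeatedly moves toward the current group mean with a step
# size inversely related to their confidence. Confident members move less,
# so the collective answer drifts toward them.

def deliberate(answers, confidences, steps=50, rate=0.2):
    answers = list(answers)
    for _ in range(steps):
        mean = sum(answers) / len(answers)
        answers = [a + rate * (1 - c) * (mean - a)
                   for a, c in zip(answers, confidences)]
    return answers

# A confident minority at 30% vs. a less confident majority at 60%.
initial = [30, 30, 60, 60, 60]
conf = [0.9, 0.9, 0.2, 0.2, 0.2]
final = deliberate(initial, conf)
group_mean = sum(final) / len(final)
print(group_mean < 48)   # consensus pulled below the initial mean of 48
```

In this toy model, the less confident majority concedes ground each round while the confident minority barely moves, so the group mean migrates toward the minority position, mirroring the behavior reported for confident-minority questions.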
To examine how user behaviors allow slider-swarms to converge upon the answers the
group is collectively most confident in, we examined the "flipping behavior" of
individuals, i.e. when they choose to switch from one side of 50% to the other. On
questions with a confident minority, the slider-swarm permitted a large number of
individuals who were initially incorrect to switch to the correct side of 50%. In
fact, if a simple majority vote had been taken at the end of the Personal Deliberation
phase (i.e., a traditional WoC survey) on these questions, the group would have
answered only 1/27 (3.7%) of them correctly. If the vote was instead taken at the end
of the Group Deliberation phase, the group would have answered 15/27 (55.6%) correctly,
a fifteen-fold increase in voting accuracy enabled by the use of the slider-swarm.
Finally, using the mean answer from the end of the Group Deliberation phase, the
slider-swarm probabilities were on the correct side of 50% on 17 out of 27 questions
(62.9%). In other words, not only do individuals themselves become more
accurate through real-time dynamic Group Deliberation using slider-swarms, but also
the collective intelligence becomes even more accurate.
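The voting-accuracy comparison above is simple arithmetic on the 27 confident-minority questions:

```python
# Arithmetic behind the vote-accuracy comparison above, using the counts
# reported in the text for the 27 confident-minority questions.

pre_correct, post_correct, n = 1, 15, 27
pre_acc = pre_correct / n          # majority vote after Personal Deliberation
post_acc = post_correct / n        # majority vote after Group Deliberation
print(round(100 * pre_acc, 1))     # 3.7
print(round(100 * post_acc, 1))    # 55.6
print(round(post_acc / pre_acc))   # 15-fold increase
```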
How do individuals collectively become more accurate during the Group Deliberation
phase? As illustrated in Fig. 6, individuals belonging to the more confident group on a
given question move their slider significantly less, averaging 1.4% less motion (p =
7 × 10^-4), than individuals belonging to the less confident group. This result
indicates how the slider-swarm works in practice: individuals who are less confident in
their initial answers are more likely to concede their position, driving the dynamic
system towards collective answers that are biased towards the more confident
sub-populations in the group.
Fig. 6. Individuals who are assigned into the more confident sub-population move significantly
less than individuals of the less confident sub-population.
Results were again computed for each of the three methods of probabilistic forecasting:
(i) individuals, (ii) WoC, and (iii) slider-swarm. In each of three real-world
experiments, the slider-swarm achieved a significant reduction in error compared to
both the average individual and the WoC aggregation.
In the Smile Test, over the course of the slider-swarm, individuals reduced their mean
Brier scores from 0.237 during the Personal Deliberation phase to 0.209 during the
Group Deliberation phase, a 12% error reduction using the slider-swarm interface.
Moreover, the group forecasts were 9.6% more accurate (p = 2.5 × 10^-3) when the
group worked together as a slider-swarm, down to a Brier score of 0.191 (slider-swarm)
from 0.211 (WoC). These results are shown in Table 3, and a bootstrap analysis is shown
in Fig. 7.
Table 3. Slider-swarm offers a reduction in error compared to both individual answers
and traditional WoC aggregation in the subjective-judgment "Smile Test."

Forecast method | Brier score | Error vs. slider-swarm (p-value)
Individual      | 0.237       | 19.7% greater error (2.5 × 10^-8)
WoC             | 0.211       | 9.65% greater error (2.5 × 10^-3)
Slider-swarm    | 0.191       | n/a
Fig. 7. For the Smile Test: results of bootstrap analysis for mean Brier score across
three categories: individual, WoC, and slider-swarm. Slider-swarm achieved a
significant reduction in error compared to both individuals and WoC aggregation.
In the Faces Test, over the course of the slider-swarm, individuals reduced their mean
Brier scores from 0.218 during the Personal Deliberation phase to 0.188 during the
Group Deliberation phase, a 14% reduction in forecast error. Moreover, when working
together in a slider-swarm, the group forecast error was reduced by 13.6% (p = 3.8 ×
10^-4), from 0.191 (WoC) to 0.165 (slider-swarm). A bootstrap analysis is shown in
Fig. 8 and these results are tabulated in Table 4.
Table 4. Slider-swarm offers a reduction in error compared to both individual answers
and traditional WoC aggregation in the Real/Fake Faces test.
Fig. 8. Results of bootstrap analysis for mean brier score across three categories: individual,
WoC, and slider-swarm. Slider-swarm achieved a significant reduction in error compared to both
individuals and WoC aggregation.
Finally, in the 2022 Academy Awards Test, individuals again reduced their mean Brier
scores, from 0.211 during the Personal Deliberation phase to 0.184 during the Group
Deliberation phase, a 13% error reduction. Moreover, the group forecast Brier score was
reduced by 11.1% (p = 0.038) using the slider-swarm, from 0.171 (WoC) to 0.152
(slider-swarm). A bootstrap analysis is shown in Fig. 9 and these results are tabulated
in Table 5.
In total, all the real-world experiments revealed slider-swarms to be the most accurate
method of group forecasting, significantly reducing the group's Brier scores across
multiple datasets. As outlined in Table 6 and depicted in Fig. 10, we find that
slider-swarms produce significant reductions in Brier score not only in each experiment
in isolation, but also when all data is combined across multiple real-world scenarios.
Table 5. Slider-swarm offers a reduction in error compared to both individual answers
and traditional WoC aggregation in predicting the results of the 2022 Academy Awards.
Fig. 9. Results of bootstrap analysis for mean Brier score across three categories:
individual, WoC, and slider-swarm. Slider-swarm achieved a reduction in error compared
to both individuals and WoC aggregation.
Table 6. Slider-swarms achieve the lowest error rate across real-world forecasting
tests.
Fig. 10. Results of bootstrap analysis for mean Brier score across all real-world
experiments in three categories: individual, WoC, and slider-swarm. With all datasets
combined, slider-swarm achieved a significant reduction in error compared to both
individuals and WoC aggregation.
6 Conclusion
In this paper we introduced a novel ASI method called slider-swarms for collaborative
probabilistic forecasting in networked groups and showed that this method improves
group accuracy on a challenging limited-information forecasting task by over 10%
compared to a traditional Wisdom of the Crowd aggregation (p < 0.001). We further
showed that this improvement was not limited to questions where the majority was more
confident than the minority: slider-swarms also produced better probabilistic
forecasts on questions where a 40% minority was more confident than a 60% majority, a
much harder domain, although this result was not statistically significant.
To explain how the slider-swarm allows groups to improve their collective accuracy, we
showed that the more confident individuals in slider-swarms tend to concede their
position less frequently than lower-confidence individuals. This means that individuals
in this system are not just reporting their confidence level at the outset; they are
actively adjusting their beliefs in real-time response to the displayed beliefs and
behaviors of other individuals in the group, enabling the swarming system to produce
collective answers that the group can better agree upon.
We then examined the performance of slider-swarms on three real-world forecasting
tasks: the Smile Test, the Real/Fake Faces Test, and forecasting the winners of the
2022 Academy Awards. We observed significant decreases in group errors, as indicated by
superior Brier scores, on each of these forecasts when the groups used slider-swarms
rather than a standard Wisdom of the Crowd aggregation. The results showed between
9.7% and 14% superior accuracy on each task when using slider-swarms, and a
statistically significant 11% increase in accuracy overall.
These initial results suggest that the slider-swarm method is a viable and effective
tool for amplifying groupwise forecasting accuracy across a range of conditions, and that
slider-swarm can be successfully applied to real-world forecasting tasks to significantly
reduce group forecasting errors.
References
1. De Condorcet, N.: Essai sur l’application de l’analyse à la probabilité des décisions rendues
à la pluralité des voix. Cambridge University Press, Cambridge (2014)
2. Boland, P.J.: Majority systems and the Condorcet Jury theorem. Statistician 38, 181 (1989).
https://doi.org/10.2307/2348873
3. Larrick, R.P., Soll, J.B.: Intuitions about combining opinions: misappreciation of the averaging
principle. Manag. Sci. 52(1), 111–127 (2006)
4. Rosenberg, L.: Artificial Swarm Intelligence, a human-in-the-loop approach to A.I. In: Pro-
ceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI 2016), Phoenix,
Arizona, pp. 4381–4382. AAAI Press (2016)
5. Rosenberg, L.: Human Swarms, a real-time method for collective intelligence. In: Proceedings
of the European Conference on Artificial Life 2015, ECAL 2015, pp. 658–659. MIT Press,
York (2015). ISBN 978-0-262-33027-5
6. Halabi, S., et al.: Radiology SWARM: novel crowdsourcing tool for CheXNet algorithm
validation. In: SiiM Conference on Machine Intelligence in Medical Imaging (2018)
7. Rosenberg, L., Willcox, G., Halabi, S., Lungren, M., Baltaxe, D., Lyons, M.: Artificial swarm
intelligence employed to amplify diagnostic accuracy in radiology. In: 2018 IEEE 9th Annual
Information Technology, Electronics and Mobile Communication Conference (IEMCON),
Vancouver, BC (2018)
8. Befort, K., Baltaxe, D., Proffitt, C., Durbin, D.: Artificial swarm intelligence technology
enables better subjective rating judgment in pilots compared to traditional data collection
methods. Proc. Hum. Factors Ergon. Soc. Ann. Meet. 62(1), 2033–2036 (2018)
9. Askay, D., Metcalf, L., Rosenberg, L., Willcox, D.: Enhancing group social perceptive-
ness through a swarm-based decision-making platform. In: Proceedings of 52nd Hawaii
International Conference on System Sciences (HICSS-52). IEEE (2019)
10. Rosenberg, L., Willcox, G.: Artificial swarm intelligence. In: Bi, Y., Bhatia, R., Kapoor, S.
(eds.) IntelliSys 2019. AISC, vol. 1037, pp. 1054–1070. Springer, Cham (2020). https://doi.
org/10.1007/978-3-030-29516-5_79
11. Metcalf, L., Askay, D.A., Rosenberg, L.B.: Keeping humans in the loop: pooling knowledge
through artificial swarm intelligence to improve business decision making. Calif. Manag. Rev.
61(4), 84–109 (2019)
12. Rosenberg, L., Pescetelli, N., Willcox, G.: Artificial Swarm Intelligence amplifies accuracy
when predicting financial markets. In: 2017 IEEE 8th Annual Ubiquitous Computing, Elec-
tronics and Mobile Communication Conference (UEMCON), New York City, NY, pp. 58–62
(2017)
13. Willcox, G., Rosenberg, L., Schumann, H.: Group sales forecasting, polls vs. swarms. In:
Arai, K., Bhatia, R., Kapoor, S. (eds.) FTC 2019. AISC, vol. 1069, pp. 46–55. Springer,
Cham (2020). https://doi.org/10.1007/978-3-030-32520-6_5
14. Schumann, H., Willcox, G., Rosenberg, L., Pescetelli, N.: “Human Swarming” amplifies
accuracy and ROI when forecasting financial markets. In: 2019 IEEE International Conference
on Humanized Computing and Communication (HCC), Laguna Hills, CA, USA, pp. 77–82
(2019). https://doi.org/10.1109/HCC46620.2019.00019
15. Willcox, G., Rosenberg, L.: Short paper: swarm intelligence amplifies the IQ of collaborating
teams. In: 2019 Second International Conference on Artificial Intelligence for Industries
(AI4I), pp. 111–114 (2019). https://doi.org/10.1109/AI4I46381.2019.00036
16. Bernstein, M.J., Young, S.G., Brown, C.M., Sacco, D.F., Claypool, H.M.: Adaptive responses
to social exclusion: social rejection improves detection of real and fake smiles. Psychol. Sci.
19(10), 981–983 (2008). https://doi.org/10.1111/j.1467-9280.2008.02187.x
17. CIPLAB, Yonsei University: Real and Fake Face Detection (2019). https://www.kaggle.com/
datasets/ciplab/real-and-fake-face-detection. Accessed 26 Apr 2022
Learning to Solve Sequential Planning
Problems Without Rewards
Chris Robinson(B)
1 Introduction
This paper presents an algorithm created specifically to solve arbitrary planning problems without requiring a reward function or a pre-defined transition model. The impetus for such agents is based on two principles: (1) crafting reward functions introduces a risk of bias; and (2) objective-based learning models reduce the potential for knowledge re-use.
The Goal Agnostic Planning (GAP) algorithm applies these principles by combining an MDP-like planner with an RL-based learning mechanism, integrated with a composite datastructure combining a hypergraph, pointer arrays, and linked lists. This datastructure is populated and updated throughout learning so that Dijkstra's algorithm may be used to find an optimal maximum-probability path between an observed current state and any reachable goal state. GAP agents therefore require no modification to re-use already-learned domain knowledge when presented with an alternate goal state, and they do not require manual construction of a transition graph or reward function.
With thanks to Joshua and Ellen Lancaster.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 393–413, 2023.
https://doi.org/10.1007/978-3-031-18461-1_27
2 GAP Algorithm
In this section, we discuss the construction and operation of the GAP algorithm,
including the composite augmented hypergraph datastructure, use of Dijkstra’s
algorithm in the context of this algorithm, and the learning mechanism employed
in training.
Fig. 1. Array/linked list showing the indexed cell locations within the array, containing pointers to the corresponding elements in the sorted linked list, which itself contains the data component associated with each array cell, and is organized into columns containing the same number of observed instances.
However, as s_i is fixed but a_l and s_f are not, this presents two possibilities for probability models: one referenced against resultant states and one referenced against actions taken. The first model chooses actions based on the most probable outcome of taking an action from a given state; the second chooses the action most likely to cause a given transition:

$$P(a_l(s_i) \rightarrow s_j) = \frac{INC[s_i, s_j, a_l]}{\sum_{\forall s} INC[s_i, s, a_l]}; \quad \text{or} \quad P(a_l(s_i) \rightarrow s_j) = \frac{INC[s_i, s_j, a_l]}{\sum_{\forall a} INC[s_i, s_j, a]} \quad (2)$$
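The two normalizations of Eq. 2 can be sketched directly from the count array; the state and action names and counts below are illustrative stand-ins, not values from the paper:

```python
from collections import defaultdict

# Toy stand-in for the |S| x |S| x |A| count array INC[si, sj, al]: the number
# of times occasion (si, al, sj) has been observed.
INC = defaultdict(int)
INC.update({(0, 1, 'a'): 8, (0, 2, 'a'): 2, (0, 1, 'b'): 3, (0, 2, 'b'): 7})

STATES, ACTIONS = [0, 1, 2], ['a', 'b']

def p_apriori(si, sj, al):
    """First model of Eq. 2: normalize over resultant states s for fixed al."""
    total = sum(INC[(si, s, al)] for s in STATES)
    return INC[(si, sj, al)] / total if total else 0.0

def p_aposteriori(si, sj, al):
    """Second model of Eq. 2: normalize over actions a for the fixed (si, sj)."""
    total = sum(INC[(si, sj, a)] for a in ACTIONS)
    return INC[(si, sj, al)] / total if total else 0.0

print(p_apriori(0, 1, 'a'))      # 8 / (8 + 2) = 0.8
print(p_aposteriori(0, 1, 'a'))  # 8 / (8 + 3) ≈ 0.727
```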
In the qualitative sense, the a priori probability model selects the actions most likely to result in goal achievement, and the a posteriori policy selects the state changes most likely to reach the goal. This model is simple, but allows an elegant learning system to be designed around it.
It is possible to define subgraphs (denoted AFI) embedded within the hypergraph which contain all edges of the optimal solution. One such subgraph contains transitions (s_i, s_j), stored as an |S| × |S| × 2 array. In this array, the component ⟨s_i, s_j, 0⟩ is the maximum probability associated with the s_i → s_j transition, and component ⟨s_i, s_j, 1⟩ is the index of the corresponding action.
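A minimal sketch of this AFI component, populated from a priori probabilities (the probability values below are illustrative, not from the paper):

```python
# AFI[si][sj] stores (max transition probability, index of the action that
# achieves it), mirroring the |S| x |S| x 2 array described above.
S = 3
P = {  # (si, sj, al) -> illustrative a priori probability P(al(si) -> sj)
    (0, 1, 0): 0.8, (0, 1, 1): 0.3,
    (0, 2, 0): 0.2, (0, 2, 1): 0.7,
    (1, 2, 0): 0.9, (1, 2, 1): 0.5,
}

AFI = [[(0.0, None) for _ in range(S)] for _ in range(S)]
for (si, sj, al), p in P.items():
    if p > AFI[si][sj][0]:
        AFI[si][sj] = (p, al)   # component 0: max probability; component 1: action index

print(AFI[0][1])  # (0.8, 0): action 0 is the most reliable way to force 0 -> 1
```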
3 Analytical Evaluation
Phrased in terms of Markov Decision Processes, the GAP algorithm produces a policy π(s_i) such that the action taken at any step is the first action in the maximal probability sequence between s_i and s_g, or:

$$\pi(s_i) = \underset{\sigma_{ig}}{\operatorname{argmax}} \left( \prod_{\forall o_j \in \sigma_{ig}} P(o_j) \right) \quad (3)$$

the first action in the most-probable sequence σ_ig from state i to state g. We can show that this policy is globally optimal, using the known optimality of Dijkstra's algorithm and the properties of the AFI array/linked lists.
Theorem 1. The policy illustrated by Eq. 3 produces the maximum likelihood sequence for achieving a given goal state s_g.
Proof. We proceed by contradiction. Presume that there exists an optimal solution sequence σ_og which contains an occasion not allocated in AFI. By either form of Eq. 2, AFI must be sorted in descending order. Because the probabilities are in [0, 1], Eq. 1 is monotonically decreasing. The first node in the sequence will have the maximum probability edge of all those leading from s_i to s_{i+1}, and thus any alternate path to this node is bounded by that single probability. Because probabilities are monotonically decreasing, the sequence selected from AFI will have probability greater than or equal to the assumed solution σ_og, and thus either σ_og is not optimal, or both paths are.
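The maximum-probability search underlying this policy can be sketched with a standard reduction: since Dijkstra's algorithm minimizes sums, maximizing a product of probabilities is equivalent to minimizing summed negative log-probabilities. The graph and probabilities below are illustrative stand-ins for AFI entries:

```python
import heapq
import math

def max_prob_path(edge_prob, start, goal):
    """Dijkstra over edge weights -log(p): a shortest path in summed -log
    probabilities is a maximum-probability transition sequence. edge_prob maps
    (si, sj) -> best single-step probability (stand-in for AFI entries)."""
    adj = {}
    for (si, sj), p in edge_prob.items():
        adj.setdefault(si, []).append((sj, -math.log(p)))
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, math.inf):
            continue
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, math.inf):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    path, node = [goal], goal
    while node != start:           # walk predecessors back to the start
        node = prev[node]
        path.append(node)
    return path[::-1], math.exp(-dist[goal])

# The two-hop route 0 -> 1 -> 2 (0.9 * 0.9 = 0.81) beats the direct 0.5 edge.
edges = {(0, 1): 0.9, (1, 2): 0.9, (0, 2): 0.5}
path, prob = max_prob_path(edges, 0, 2)
print(path, round(prob, 2))  # [0, 1, 2] 0.81
```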
To analyze the behavior of GAP agents, we can use Markov process analysis techniques, with Eq. 3 as the chosen policy. We build the tree of maximal probability paths rooted in an arbitrary goal state, derived from AFI and denoted T_P(g), illustrated in Fig. 3.
We then take the state distribution s_k, where k is the step time. The state occupation distribution as a function of time is given by s_k = P_g^k · s_0, which represents the stochastic vector of probable states evolved from s_0.
Here P_g is written in block form, where T_s is the transition matrix internal to only the non-goal states, t_g is the vector of transition probabilities from {s_i ∈ S | i ≠ g} to the goal, and the final column is the stochastic vector of s_g. Then:

$$P_g^k = \begin{pmatrix} T_s^k & 0 \\ t_g \cdot \sum_{l=1}^{k-1} T_s^l + t_g & 1 \end{pmatrix} \quad (5)$$
From this, we can see that the probability of reaching the goal state at step k is given by $P(s_i \rightarrow s_g \mid k) = t_g \cdot \sum_{m=1}^{k-1} T_s^m + t_g$. Because T_s is strictly positive definite, T_s^k is as well, and consequently P(s_i → s_g | k) is monotonically increasing in k, so P_g has no steady state other than the goal. Thus s_g is an attractor state, as it is identical to its own start-state distribution, and no other state can be an attractor unless there is a zero probability of transitioning out from that state.
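This attractor behavior is easy to verify numerically. Below is a toy three-state chain written column-stochastically in the block form P_g = [[T_s, 0], [t_g, 1]]; the transition values are illustrative, not from the paper:

```python
# Toy chain with goal state g = 2; each column sums to 1, and the last
# column is the goal's unit vector (the goal transitions only to itself).
P = [[0.6, 0.2, 0.0],
     [0.3, 0.5, 0.0],
     [0.1, 0.3, 1.0]]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

s = [1.0, 0.0, 0.0]       # start distribution concentrated in state 0
goal_probs = []
for k in range(1, 20):
    s = matvec(P, s)      # s_k = P_g^k . s_0
    goal_probs.append(s[2])

# P(si -> sg | k) increases monotonically in k: the goal acts as an attractor
assert all(a < b for a, b in zip(goal_probs, goal_probs[1:]))
print(round(goal_probs[-1], 3))  # close to 1
```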
It is also possible for sets of states which are not attractor states, but which present no path to the goal once reached, to be non-steady attractors (as illustrated in Fig. 4). We can define a subset of states, tnet, to represent such 'trap nets'. We can re-cast P_g in the following form, noting that states in tnet can transfer between one another, but not to states outside tnet:
$$P_g = \begin{pmatrix} T_{s \notin tnet} & 0 & 0 \\ T_{s \in tnet} & T_{tnet} & 0 \\ t_{g \mid i \notin tnet} & 0 & 1 \end{pmatrix}; \quad P_g^k = \begin{pmatrix} T_{s \notin tnet}^k & 0 & 0 \\ \sum_{j=0}^{k-1} T_{tnet}^j \, T_{s \in tnet} \, T_{s \notin tnet}^{k-1-j} & T_{tnet}^k & 0 \\ \sum_{j=0}^{k-1} t_{g \mid i \notin tnet} \, T_{s \notin tnet}^j & 0 & 1 \end{pmatrix} \quad (6)$$
Fig. 4. Illustration of a subgraph segment from which no path to the goal exists, yet which contains multiple transition-state cycles. Such regions can present non-steady-state attractors from which the agent cannot progress to the goal, hence being considered 'trapped' in the subgraph.
From this we can see that P(s_{i∈tnet} → s_g | k) = 0 for all k. Further, we will define a system parameter L_max: the longest minimum-length path between any two states. For any reachable state s_i, P(s_i → s_g | L_max) > 0. In any graph the maximum path length is |S|, so it suffices to check P_g^{|S|}: any state i for which P(s_g | s_i, k = |S|) = 0 is necessarily a member of a trap net. We can then use Eq. 6 to determine the probability at any point in time that the system has become stranded in a trap net:
$$P(s_t \in tnet \mid k) = \mathbb{1}_{1 \times |tnet|} \cdot \begin{pmatrix} \sum_{j=0}^{k-1} T_{tnet}^j \, T_{s \in tnet} \, T_{s \notin tnet}^{k-1-j} & T_{tnet}^k & 0 \end{pmatrix} \cdot s_0 \quad (7)$$
Attractor states and trap nets together represent all the 'dead ends' for a GAP algorithm, analytically identifiable from the form of P_g, making dead-end removal algorithms, such as FRET [10] or reachability analysis [15], unnecessary for GAP agents. This yields a convergent behavior model in which the long-term behavior of the agent can be statistically parametrized, fully defining the goal-convergent behavior of the agent and resolving the problem discussed in [16].
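The trap-net test described above (checking P_g^{|S|}) can be sketched as follows; the chain is illustrative: states 0 and 1 can reach the goal (state 3), while state 2 only cycles with itself and therefore forms a trap net:

```python
# Raise P_g to the power |S| and flag any non-goal state whose goal-reach
# probability is still exactly zero.
S = 4
P = [[0.5, 0.2, 0.0, 0.0],
     [0.3, 0.5, 0.0, 0.0],
     [0.0, 0.2, 1.0, 0.0],
     [0.2, 0.1, 0.0, 1.0]]   # column-stochastic; column 2 is the trap, column 3 the goal

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

Pk = P
for _ in range(S - 1):        # compute P_g^{|S|}
    Pk = matmul(Pk, P)

# goal row of P_g^{|S|}: zero entries mark trap-net members
trap_states = [i for i in range(S - 1) if Pk[3][i] == 0.0]
print(trap_states)  # [2]
```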
Further, we may examine GAP behavior in terms of the L_1 norm of T_s^k. Because all columns are stochastic, the maximum absolute column sum is paired with the minimum-probability single-step goal transition. Often, min_{∀a,i} P_g[s_i, g, a] = 0; however, at k = L_max, all states from which the goal is reachable have a non-zero transition probability:
$$\|T_s^k\|_1 = 1 \quad k < L_{max}; \qquad \|T_s^k\|_1 \leq \|T_s^{L_{max}}\|_1^{\,k - L_{max}} \quad \text{otherwise} \quad (8)$$

The minimum number of steps k_p needed to reach the goal with probability P_thresh is then bounded by:

$$k_p \geq \frac{\log(1 - P_{thresh})}{\log(\|T_s^{L_{max}}\|_1)} + L_{max} \quad (9)$$
Rewriting the relation as $1 - \|T_s^{L_{max}}\|_1^{\,k_p - L_{max}} \geq P_{thresh}$, we can see that the probability of transition to the goal is bounded by an exponential growth rate: the minimum probability threshold reached is limited by an exponential asymptotic function approaching unity.
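The bound of Eq. 9 is straightforward to evaluate numerically; the norm, L_max, and threshold values below are illustrative:

```python
import math

def kp_bound(Ts_Lmax_norm1, Lmax, p_thresh):
    """Minimum step count k_p guaranteeing goal arrival with probability
    p_thresh, per Eq. 9 (assumes ||T_s^Lmax||_1 < 1)."""
    return math.log(1 - p_thresh) / math.log(Ts_Lmax_norm1) + Lmax

# Example: if ||T_s^Lmax||_1 = 0.8 and Lmax = 5, the goal is reached with
# 99% probability within 26 steps.
print(math.ceil(kp_bound(0.8, 5, 0.99)))  # 26
```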
Perturbation Model. To examine the effect of error and state abstractions, we use a probability transform acting on transition values. Consider a mapping α(·) which transforms a state space S into a more compact space α(S). This probabilistic mapping model is similar to that used by [8], but benefits from the structure of AFI. Presume that we have an |α| × |S| transformation matrix α_T which contains in each cell α[j, i] the probability P(α(s_i) = α(j)) that the i-th 'true' state is mapped onto the j-th abstracted state. For a state vector s_t, the corresponding abstracted probability vector is s_{αt} = α_T · s_t, or, for general time propagation: s_{αt} = α_T · P_g^t · s_0.
Given a learned AFI subgraph for the abstracted space, P_α, we also have s_{αt} = P_α^t s_{α0}, and since s_{α0} = α_T s_0 we can construct a relation from the equivalence α_T P_g^t = P_α^t α_T:

$$P_\alpha^t = \alpha_T P_g^t \alpha_T^+; \qquad P_g^t = \alpha_T^+ P_\alpha^t \alpha_T$$
where α_T^+ is the pseudoinverse of α_T. It is notable that this transform does not allow for conversion into the true state space, even if α_T is known perfectly, as α_T^+ cannot unmix states which are combined. Recognizing that both arrays must be stochastic transforms, due to the action on s:

$$\alpha_T = \begin{pmatrix} \alpha_{Ts} & \alpha_{Tg} \\ 1 - \mathbb{1}\alpha_{Ts} & 1 - \mathbb{1}\alpha_{Tg} \end{pmatrix}; \qquad \alpha_T^+ = \begin{pmatrix} \alpha_{Ts}^+ & \alpha_{Tg}^+ \\ 1 - \mathbb{1}\alpha_{Ts}^+ & 1 - \mathbb{1}\alpha_{Tg}^+ \end{pmatrix}$$
Expanding P_g^k lets us calculate the probability of goal transition in the true space, using the relations $\mathbb{1}T_{\alpha s}^k = 1 - V_p$ and $\mathbb{1}\alpha_{Tg}^+ = \|\alpha_{Tg}^+\|$. We presumed that P_α is convergent, and thus we can note the limiting behavior of T_{αs}^k and V_p: lim_{k→∞} V_p = 1 and lim_{k→∞} T_{αs}^k = 0, from which the limiting behavior of P(s_i → s_g | k) can be determined: lim_{k→∞} P(s_i → s_g | k) = 1 − ||α_{Tg}^+||_1. Convergence of P_g can be expressed as P(s_i → s_g | k) → 1. This shows that the convergence of the true system to the goal, given convergence of the abstracted state, is predicated on the transform between the true goal states and the abstracted goal states being onto; this is analogous to, but distinct from, the convergence conditions derived in [15], and mirrors the observability model utilized in [7].
Given this condition on the abstraction function, we can also determine the performance impact of the transform. Beginning with the relation $\|T_s^k\|_1 \leq \|T_s\|_1^k$ for the true state system:

$$\|T_s\|_1^k \geq \|\alpha_{Ts}^+ T_\alpha^k \alpha_{Ts}\|_1 + \|\alpha_{Tg}^+\|_1 \left(1 - \|T_\alpha^k\|_1 \|\alpha_{Ts}\|_1\right)$$

$\|T_\alpha^k\|_1$ and $\|\alpha_{Ts}\|_1$ are strictly in [0, 1], but $\|\alpha_{Ts}^+\|_1$ is not, and so this derivation applies only to the direction P_α → P_g: convergence of the abstracted model implies convergence of the true model, but not the converse. From this inequality, we can then replicate the prior analysis for the abstracted case:

$$k_{p\alpha} \geq \frac{\log(1 - P_{thresh})}{\log(\|\alpha_{Ts}^+\|_1 \cdot \|T_\alpha^{L_{max}}\|_1 \cdot \|\alpha_{Ts}\|_1)} + L_{max} \quad (11)$$
This describes how the inclusion of the abstraction modifies the minimum expected time to achieving the goal state. By examining this expression, we can make some inferences about the impact of α_T on convergence performance:

$$k_{p\alpha} > k_p \;\text{ when }\; \|\alpha_{Ts}\|_1 \cdot \|\alpha_{Ts}^+\|_1 < 1; \qquad k_{p\alpha} \leq k_p \;\text{ when }\; \|\alpha_{Ts}\|_1 \cdot \|\alpha_{Ts}^+\|_1 \geq 1 \quad (12)$$
We can use the product above as a rough measure of the 'quality' of an abstraction, the degree to which it affects performance: $Q(\alpha_T) = \frac{1}{\|\alpha_{Ts}\|_1 \cdot \|\alpha_{Ts}^+\|_1}$, so that Q(α_T) is directly correlated to the impact α_T has on performance, resolving the metric problem discussed in [13]. Empirically, we can also approximate this using k_p and the measured k_{pα}:

$$\log(\|\alpha_{Ts}^+\|_1 \cdot \|\alpha_{Ts}\|_1) \approx \frac{k_{p\alpha} - k_p}{k_{p\alpha} - L_{max}} \quad (13)$$
Using this metric, we can measure the impact of the perturbation model, underwriting the effectiveness of the GAP algorithm for operating under an abstraction or uncertainty.
Learning Model. We can model learning as an abstraction which becomes more accurate as learning progresses, starting with the initial assumption of a uniform random distribution: $\alpha_{T1} = \frac{1}{|\alpha|} \cdot \mathbb{1}$ and $\alpha_{T1}^+ = \frac{1}{|S|} \cdot \mathbb{1}$. We can approximate expected learning curves with an amortized update at each step k: after k steps each state has been visited $\frac{k}{|S|}$ times on average, and total counts can be expressed as $\frac{k}{|S|} s_{\alpha i}$. Combining the prior occasions with the new, for $\frac{k+1}{|S|}$ steps, gives:

$$s_{\alpha i} = \frac{\frac{k}{|S|} s_{\alpha i} + s_i \cdot \frac{1}{|S|}}{\frac{k}{|S|} + \frac{1}{|S|}} = \frac{k s_{\alpha i} + s_i}{k + 1}$$

which, in aggregate, gives the expression across the full transition array as a recurrence relation, $P_\alpha^{k+1} = \frac{k P_\alpha^k + P_g}{k+1}$ with $P_\alpha^1 = \frac{1}{|S|^2} \mathbb{1} \cdot P_g \cdot \mathbb{1}$, or:

$$P_\alpha^k = \frac{\mathbb{1} \cdot P_g \cdot \mathbb{1}}{k|S|^2} + \frac{k-1}{k} P_g = \alpha_{Tk} P_g \alpha_{Tk}^+ \quad (14)$$
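The recurrence and the closed form of Eq. 14 can be cross-checked numerically; the P_g below is illustrative, and for simplicity the abstracted space is taken to be the same size as the true space:

```python
# The recurrence P_{k+1} = (k*P_k + P_g)/(k+1), started from the uniform
# P_1 = (1/|S|) * ones, should match P_k = (1/k)*(ones/|S| - P_g) + P_g
# (the amortized form derived from Eq. 14) at every step.
S = 3
Pg = [[0.6, 0.2, 0.0],
      [0.3, 0.5, 0.0],
      [0.1, 0.3, 1.0]]

P = [[1.0 / S] * S for _ in range(S)]     # P_1: uniform assumption before learning
for k in range(1, 50):
    closed = [[(1.0 / k) * (1.0 / S - Pg[i][j]) + Pg[i][j] for j in range(S)]
              for i in range(S)]
    assert all(abs(P[i][j] - closed[i][j]) < 1e-9 for i in range(S) for j in range(S))
    # recurrence update from step k to step k + 1
    P = [[(k * P[i][j] + Pg[i][j]) / (k + 1) for j in range(S)] for i in range(S)]

print("recurrence matches closed form")
```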
which we can express in similar block fashion as above:

$$\alpha_{Tk} = \frac{1}{k|S|} \begin{pmatrix} \mathbb{1} + |S|(k-1)T_s & \mathbb{1} \\ \mathbb{1} + |S|(k-1)P(g) & \mathbb{1} + |S|(k-1) \end{pmatrix}$$

$$(1 - \alpha_{Tg}\mathbb{1})\,\alpha_{Ts}^{+} = 1 - \alpha_{Tg}\mathbb{1}; \qquad \alpha_{Ts}^{+} = I \rightarrow \alpha_{Ts} = I$$
This demonstrates that as P_α is learned, GAP agent training will be convergent. Equation 14 also allows us to determine the amortized form of the transition array over time; we can express it as:

$$P_\alpha^k = \frac{1}{k}\left(\frac{\mathbb{1}}{|S|} - P_g\right) + P_g$$
in which the terms $\frac{\mathbb{1}}{|S|} - P_g$ and $P_g$ are clearly time invariant; thus the average learning curve will follow a reciprocal pattern $k_{p\alpha}(k) = A\frac{1}{k} + B$. B is naturally the asymptotic average path-to-goal length, k_p. We can evaluate the initial behavior of the system given the form for α_T1 and Eq. 14:

$$A = (k_p - L_{max}) \frac{2\log(|S|) - 2\log(|S|-1)}{\log(\|T_{\alpha 1}\|_1)}$$
$\|T_{\alpha 1}\|_1$ can be directly calculated from $\alpha_{T1} P_g \alpha_{T1}^+$ as $\frac{|S|-1}{|S|}$, and thus:

$$k_{p\alpha}(k) = \frac{2(L_{max} - k_p)}{k} + k_p \quad (15)$$

This establishes the average form of the learning curve for GAP agents as an offset reciprocal function of step number.
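Fitting the reciprocal form of Eq. 15 can be sketched with the linearization used in the experiments below: regress steps-to-goal against 1/epoch. The data here is synthetic, generated from Eq. 15's shape with assumed L_max = 12 and k_p = 4 plus small alternating noise:

```python
# Least-squares fit of k_p(k) = A*(1/k) + B from (1/k, steps-to-goal) pairs.
data = [(k, 2 * (12 - 4) / k + 4 + (-1) ** k * 0.05) for k in range(1, 31)]

xs = [1.0 / k for k, _ in data]
ys = [y for _, y in data]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
# ordinary least-squares slope and intercept on the linearized data
A = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
B = ybar - A * xbar

# A estimates 2*(Lmax - kp) = 16 and B estimates kp = 4
print(round(A, 1), round(B, 1))
```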
4 Empirical Experiments
In this section we demonstrate the effectiveness of the GAP algorithm in learning
across a diverse array of archetypal learning and planning domains.
Training Process. To train the agent, the AFI datastructure is initialized with uniform random values. Upon observation of occasions, the corresponding INC cells are updated and the links in AFI re-sorted. We artificially induce error in some trials by a random threshold process which executes a random non-planned action. Simulation models are designed to output string states when polled for information, and a simple hash algorithm generates a lookup table for the agent to use.
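The loop described above can be sketched as follows; the toy world model, state and action spaces, and re-ranking scheme are illustrative stand-ins, not the paper's implementation:

```python
import random

# Observe occasions (si, al, sj), update INC counts, and keep each AFI entry
# pointing at the action with the highest current a priori probability.
random.seed(0)
STATES, ACTIONS = list(range(4)), list(range(2))
INC = {(si, sj, al): 0 for si in STATES for sj in STATES for al in ACTIONS}
AFI = {(si, sj): (0.0, None) for si in STATES for sj in STATES}

def world_step(si, al):
    """Toy stochastic world: the intended successor with 80% probability,
    otherwise a uniformly random state (simulating induced error)."""
    intended = (si + al + 1) % len(STATES)
    return intended if random.random() < 0.8 else random.choice(STATES)

for _ in range(2000):
    si, al = random.choice(STATES), random.choice(ACTIONS)
    sj = world_step(si, al)
    INC[(si, sj, al)] += 1
    # re-rank this transition: best (probability, action) across all actions
    best = (0.0, None)
    for a in ACTIONS:
        total = sum(INC[(si, s, a)] for s in STATES)
        if total:
            p = INC[(si, sj, a)] / total
            if p > best[0]:
                best = (p, a)
    AFI[(si, sj)] = best

print(AFI[(0, 1)])  # best (probability, action) learned for the 0 -> 1 transition
```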
We demonstrate the effectiveness of the GAP algorithm by measuring parameters related to performance characteristics. We calculate best-fit equations ("Ak^{-1}") and a measure of their accuracy: the percentage off-linear ("%OL") average of linear regressions on the plots of (1/k, k_p): $\frac{1}{N}\sum_{\forall n \in N} \frac{|k_p[n] - (A\frac{1}{n} + B)|}{k_p[n]}$. We also calculate and compare approximations of k_p and L_max. For k_p, we use the fit curve for Eq. 15 ("k_p I") and the average performance after convergence ("k_p II"). L_max comparisons are made between Eq. 13 ("L_max I") and Eq. 9 ("L_max II"). We
ground performance with comparisons to Q-Learning and MDP policies, using the reward function $R(s_i, a_l) = \log(P(a_l(s_i) \rightarrow s_{i+1})) + \log(P(\sigma(i+1, g)))$ to mimic the probability-maximizing function. In our results, "QL k_p" is the average number of steps to reach the goal for the trained QL agent, "QL Ep." is the number of epochs for the QL agent to converge (where "NC" indicates failure to converge after 1000 epochs), and "MDP k_p" is the average shortest path to the goal found by an MDP planner using Value Iteration, to set a performance floor.
STRIPS-Type Problems. First, we implement a STRIPS-type planning problem, as schematically represented in Fig. 5. The agent is in a world with move operations that move it through space, a pair of world-manipulating actions, and a state space including possession of an item, location, and status of the door, for a total state space of size |S|_max = 52.

Fig. 5. Illustration of a STRIPS world, containing linked location states (L_i) and multiple independent actionable states (V_1 and D_1)

Fig. 6. Learning curves for the STRIPS problem across levels of induced error from 0% to 50%
Figure 6 showcases the learning curves of the GAP algorithm on this problem at each induced error level independently. Each curve is the average performance over 50 trials. We can see from these curves that the learning tends to follow the same reciprocal pattern as the general curve, with the asymptotic performance shifting as the increasing error rate elevates the expected number of steps to reach the goal.
To reinforce the reciprocal relationship, we also plot the linearization of these curves, labeled with the off-linear percentage. For each plot but one, the deviation from the linear fit is in the single digits, the greatest deviation being for the 30% curve, with a 15% average off-linear error. These measurements serve to validate the prediction of Eq. 15 that the GAP algorithm will express reciprocal learning curves.
In addition to these linearized plots showing the correlation between 1/k and steps to goal, we also highlight the correlations between k_p as predicted by the asymptotic behavior of the data itself and by the fit reciprocal curve, and calculate L_max from Eq. 15 and as predicted by the threshold in Eq. 8. Both measures are presented in Table 1, along with the corresponding percent errors. Here, we can see that the differences between the asymptotic k_p and the fit function are small, ranging from 0.87% to 7.12% for induced error rates up to 40%, and the difference between the measured and predicted L_max is 7.8%, indicating very close correspondence between the observed performance and the predictions of Eqs. 15 and 8. These successive curves illustrate effective learning at high levels of error: convergence occurs within 20 epochs even at 50% error.
Of note is the 50% error case, with error roughly twice that of the next largest. However, introducing 50% error into the action of the agent is extremely substantial, and it is reasonable to expect that learning performance will degrade. Qualitatively speaking, as the induced error rate increases, P_α behaves more and more like a uniform random stochastic process. Referring back to Eq. 13, we can see that the limit of k_{pα}(k) will grow until the difference between k_{pα}(0) and the asymptotic performance is negligible. In more rigorous terms, lim_{k→∞} k_{pα}(k) → L_max, and so the function k_{pα}(k) no longer properly behaves as a reciprocal, but as a constant function, exactly the expected behavior of an attempt to learn a uniform random process.
Maze/TAXI Domain. The TAXI and Maze problems are canonical study cases for machine learning systems. In the TAXI problem, the agent must visit a list of locations, pick up a 'passenger', and deliver it to a specific destination. We complicate the problem by performing navigation in a maze. For the agent, actions are cardinal-direction movements plus pickup and drop-off actions. States include local observations of the maze topography, the direction to the target 'passenger', and whether a passenger is currently carried. Additionally, note that we do not perform training for fixed TAXI destinations and mazes, but rather generate a random maze and passengers for each training epoch. We use a relative measure, requiring the agents to learn broader patterns rather than a rote problem. Rather than restricting ourselves to simple mazes, we allow non-uniform spacing. Such a maze is illustrated in Fig. 7. As a result, the maximal state space size is variable; however, for the maze generation parameters used, it averages |S|_max = 18,432.
Table 1. Comparison of measured and predicted values for analysis, calculated from the performance on the STRIPS problem learning curves, with comparisons to QL and MDPs
Table 2. Comparison of measured and predicted Lmax across abstractions for the com-
plex Maze/TAXI domain with joint abstractions, along with QL and MDP performance
baselines.
In Fig. 8, we plot the learning curves for the Maze/TAXI problem across levels of induced error ranging from 5% to 30%. We observe two trends: the asymptotic k_p's proportionality to the error rate, and the correlation between initial performance and long-term performance. We also note the presence of 'adaptation bumps' between epochs 4 and 8. These correlate with changes in effectiveness as the agent encounters large changes in the random maze, and indicate adaptive learning. Note that GAP agent learning consistently converges within 10 epochs.
To investigate abstraction performance, we use three versions of the state definition: AI, representing the 8 neighborhood cells; AII, similar to AI but including only the four cardinal directions; and wA, or 'with Action', which adds the most recent action the agent has taken. We produce four different state generation methods from these: 'AI wA', using AI and wA together; 'AII wA'; and AI and AII both without wA (nominally 'AI w/oA' and 'AII w/oA'). By joining the different models in this way, we can compare the relative impact of each transform using Eq. 11.
We measure the same indicators as before, tested at all six error levels and displayed in Table 2, which additionally presents the calculated values for L_max. We find that the pairs of values are typical for the GAP algorithm thus far, and on the appropriate scale for the performance values observed. Further, the QL agent fails to learn in either the AII wA or the AI w/oA case, indicating that the GAP agent can effectively learn problems which QL cannot.
Fig. 8. Performance of the GAP algorithm across multiple levels of induced error on
the Maze/TAXI problem space
Table 3. Measured k_pα and corresponding |α+ T α| estimates, with quadrants representing pairs of composed abstractions.

              AI                      AII
P_Thresh      k_pα      |α+ T α|      k_pα      |α+ T α|
1%            30.19     1.05          23.40     1.15
5%            54.16     1.09          24.58     1.11
10%           52.49     1.06          31.72     1.21
15%           67.30     1.57          35.43     1.77
20%           42.12     1.02          31.66     1.14
25%           58.45     1.05          29.70     1.08
wA            1.144 (±12.5%)          1.249 (±14.1%)
1%            396.17    1.01          38.00     0.99
5%            272.13    1.00          30.88     0.99
10%           285.07    1.00          37.76     1.00
15%           418.00    1.01          35.83     0.99
20%           729.29    0.99          36.64     1.00
25%           464.91    1.01          51.31     0.00
w/oA          1.004 (±0.3%)           0.996 (±0.2%)
Table 4. Calculated |α+ α| ratios across abstractions and predicted transform measure, derived from the entries in Table 3 and Eq. 11
Fig. 9. Illustration of a traditional Tower of Hanoi (ToH) problem. This graphic shows the 3-peg, 4-disk variant of the problem, ToH_{3,4}

Fig. 10. Average learning curves for the GAP algorithm over the three investigated ToH domains, ToH_{3,3}, ToH_{3,5}, and ToH_{4,5}, at varying error levels, along with reciprocal fit curves
Table 5. Chart of the correlation measures for the GAP algorithm learning the Tower of Hanoi problem, across error level and problem complexity class

Table 6. k_p and L_max comparisons for the GAP algorithm learning the ToH problem with various abstractions and across complexity classes
AIV cases for the ToH_{3,5} problem unilaterally converge to the optimal number of steps, presumably incidentally.
5 Conclusions
In this paper we have presented the GAP algorithm, which uses an elegant datastructure and a carefully chosen action policy to efficiently learn solutions to sequential planning problems without requiring the design of a reward function or world model. We highlighted the relationships among extant reward- and model-based systems which indicate the detriments of using rewards to drive solution finding. We proposed an approach that fills this gap, additionally allowing for planning between arbitrary states using the same training data, and operating in low-order polynomial time thanks to the use of the augmented hypergraph datastructure.
We showed how the design of the algorithm creates useful properties, which enable analytic proofs for several valuable characteristics, including global optimality of the action policy, exponentially bounded goal achievement rates, precise identification of dead-end state probabilities, conditions for convergence under abstracted, perturbed, and error transforms, a measure for the performance impact of said transforms, learning convergence, and the form of the average learning curve for the agent.
Batteries of experiments on three demonstration domains highlighted the efficient learning and convergence properties of GAP agents, which consistently learned an order of magnitude faster than QL agents, solved problems which the QL agents failed to, and reached performance levels comparable to MDP agents despite not being provided with a transition model or reward function. We used the STRIPS domain problem to establish the fundamental effectiveness of the algorithm. The Maze/TAXI domain was used to illustrate the power of the GAP algorithm in a complex hierarchical, relativistic, and dynamic domain with over 18,000 states, and demonstrated the validity of the L_1 norm-based abstraction performance analysis by comparing multiple composed transforms. We used the Tower of Hanoi domain to illustrate effective performance over multiple levels of single-domain complexity, across a range of error rates, and with multiple state space transforms.
The GAP algorithm has outstanding limitations we would like to address. Firstly, though the cubic-order hypergraph is less extensive than many world models, which grow exponentially, it is still relatively inefficient; a dynamically allocated structure would improve performance. Additionally, planning performance can be improved, especially by implementing heuristics such as A*. Such an algorithm using non-biased, structural heuristics is a goal for future development. Also, the familiarization phase (Eq. 14) can lead to bias during initial learning, but the introduction of an implicit learning rate may ameliorate this.
There are also some other topics we would like to investigate going forward. The abstraction mechanism allows opportunities to develop unsupervised hierarchical decomposition functions for state spaces. Alternatively, the action policy can be altered to use a statistical selection of actions, rather than argmax. Finally, the effectiveness on a dynamic and relative domain suggests that a rigorous model for adaptation and learning re-use can be constructed.
References
1. Bertsekas, D.P., Tsitsiklis, J.N.: An analysis of stochastic shortest path problems.
Math. Oper. Res. 16(3), 580–595 (1991)
2. Blum, A.L., Furst, M.L.: Fast planning through planning graph analysis. Artif.
Intell. 90(1–2), 281–300 (1997)
3. Blum, A.L., Langford, J.C.: Probabilistic planning in the graphplan framework. In: Biundo, S., Fox, M. (eds.) ECP 1999. LNCS (LNAI), vol. 1809, pp. 319–332. Springer, Heidelberg (2000). https://doi.org/10.1007/10720246_25
4. Dimitrov, N.B., Morton, D.P.: Combinatorial design of a stochastic Markov decision process. In: Operations Research and Cyber-Infrastructure (2009)
1 Introduction
Change in Society. Today we feel the necessity of change in society, although the words are yet missing for the phenomena that are rapidly arising around us. Past technologies have led to an unprecedented growth in technology, communication, knowledge and population. Yet one thing has remained the same: the terrain of the earth. Humanity now has at least two problems with technology: the power that individual humans and small groups have relative to each other, and the task of distributing that power among individuals. Each problem is also increasing as a consequence of technologies.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 414–425, 2023.
https://doi.org/10.1007/978-3-031-18461-1_28

Analysis and Concept of Democracies Human-Technological Development 415
This means we have a power control problem among individuals, which is usually called politics, and which historically has a multifaceted space of solutions.
Industry 4.0. Industry 4.0 can be regarded as one ruling paradigm of current industry and society (cf. [2–4]). As it affects the whole breadth of society and technology, it also entails technologies in the digitalization area, which is an essential part of the current transformation of humanity. In this paper we focus on the widely applicable aspect of personalization, which is represented in industry and society by personalized products, services, machines and algorithms, to mention only a few, in each case increasing the value of the underlying offering by giving choice or power to the affected individual.
A New Democratic Principle. In this paper we introduce for the first time a new democratic principle that further divides power into mainly (a) territorial and (b) personalized membership. In the infinite regress, this leads to the personalized state 'in the pocket'. This will solve some basic human problems in a new fashion and, above all, put human rights on a new standard for humanity that, from the future perspective, would never have been possible without current and future technologies. An optimistic outlook remains: with the fundamental basics of logic in every aspect of our daily lives as well as in our systems, law, ecology and economy, we will reach the next steps of necessary human progress.
Scientific Motivation. The scientific motivation of this paper can be seen from the systemic perspective. As systems become larger and larger, some system properties change, as each system in the world is subject to size scaling, as one universal law. We see the globalization of all sorts of effects in economy and society, e.g. [19,21], and big data, e.g. [16,17]. In a globalized world this means that interconnectivity increases on the one side, while on the other side the velocity of roll-out of models or prototypes, such as businesses, technologies or products, accelerates, which can be identified, with regard to the basic law of mechanics, as an acceleration increase of this "model roll-out" in the transcribed understanding. This means that together with the driving forces, all dependent systems accelerate, and with this negative trends are enhanced, as well as positive ones. For this reason, the negative trends have to be better controlled globally. As for the global trend of democracy: although it has been documented that the democracy index is going down, and Karin Schmidlechner [20] seems to be pessimistic with regard to the global development of democracy, my opinion is that democracy is globally increasing in the long run, as this is the only possibility to increase order, and hence global economic and social or human efficiency, by means of cooperation. For this sort of
416 B. Heiden and B. Tonino-Heiden
processes Bianca shaped the term to be conficient [12]. So what are today defi-
ciencies of democracy or other of less decentralized state-, company- or people’s-
organizations, will increase in urgency of an efficiency reshape. So in this paper
we focus on one of the root causes of democracy, and of decentral organization
in general, which will increasingly be the new paradigm of the future and of
future technologies, as central systems reach their limits. These limits are
manifold, but as the corresponding system limits come nearer, we need concepts
for reshaping all sorts of systems (and, in the technological sense, force-driven
systems) in the direction of highly dense osmotic functional systems: in
production [5,11,23], in information [26], and in organizational, societal and
political systems. In this work we focus on the further development of democracy,
but the concept can be applied, when adapted specifically, to all the mentioned
sorts of systems. In this sense we introduce the close connection between
democratic and decentral concepts, which is the systemic generalization that
then applies to these systems. Hence, with this new approach, it may be valid to
speak of democratic production, information, organizations, societies and also
politics. In this systemic perspective, democratization can be regarded as the
partitioning of forces, their orgitonization according to origiton theory, which
means that the system is analyzed and partitioned into smaller coupled and
decoupled cybernetic units that potentially increase overall system order by
means of an emergence contraction process (see also [6]); that is, information
density is increased by simultaneously emerging properties of integrated
higher-order power control, in the case applied here.
Method. In this paper we use the method of natural-language logical
argumentation and axiomatization to form a theory set for the goal that we are
aiming at.
Goal and Research Question. The goal of this paper is to translate the principle
of personalization, as a core principle or paradigm of Industry 4.0, into the more
general context of democracy and the requirements of future technologies. With
this, we briefly sketch how and why, with the help of current and future
technology, this may, as a projection, lead to a more highly ordered, dynamically
stabilizing and self-organizing world. The dominating research question of this
article is: How can power be divided further in democratic or decentral systems,
so as to decrease absolute individual power in the system and potentially increase
order beyond the previous system state?
Analysis and Concept of Democracies Human-Technological Development 417
Limits of the Work. As this work tries to enlarge the frame for future
technologies, it is naturally limited to the focused results, which can be further
generalized on the one side but also need to be tested in different possible
systems, as they should be widely applicable. The most challenging limit is
therefore how much room the generalization provided by this model should take
in order to sufficiently "control" the system in question. A technical limit is that
only the most important ideas of these principles can be given here; technical
details will need future follow-up applications and research.
Content of the Work. In this work we first give, in Sect. 3, a systemic analysis of
democracy and formulate it with basic Axioms, which we later enhance by
further Axioms in the human-rights direction of a new democratic paradigm. In
Sect. 4 we then give an application example of a new democratic implementation,
including the basic power structure of the newly shaped nation, state or system
and the individual as an overall system of "trade units". We further give a short
sketch of how current technological developments can support the implementation
of this potential democratic paradigm in all sorts of systems, from the individual,
via industrial production, to society. In Sect. 5 we give a conclusion and an
outlook on applications and future research in this field of order-enhanced
democracy.
According to Luhmann [14, p. 1022], "Freedom and equality are initially still
'natural' attributes of human individuals. Since they are not found realised in
civil society, they are upgraded to 'human rights', the observance of which can be
demanded - up to the human rights fundamentalism of these days." (translated
from German). This points out, on the one side, the development of human rights,
leading to the formulation of the UN Charter of Human Rights after WWII [24,25].
This frozen state, which leads to an absolutism, indicates the border of today's
human rights. According to self-organization theory, the far-from-"equilibrium"
state of order can, in conjunction with chaos theory, be reformulated as an
order trajectory, which can potentially bifurcate into order increase or decrease.
It cannot stay fixed, as a fundamentalist approach or development would
suggest. So the question is in which direction the future development of humanity
will go: in the direction of higher order, which means an effective increase
in human rights standards, or in the direction of a decrease, which means the
opposite, a weakening of the latter. In fact, today's advance in weakening these
rights has effectively brought back the state of military confrontation in Europe,
which means that we have a partial, globally oscillating order backshift to behind
the WWII status, while at the same time having highly advanced technologies in
all knowledge areas. The effective order decrease can be observed, e.g., in the
economic decline, which can be regarded as a rough integrating function of all
human activities on earth. We mention this greater problem only as a side view.
There are several reasons for the increase in the power of humans. One main
reason is that the effectively available decision room is growing through
technological enhancements such as Artificial Intelligence (AI) and ambient
intelligence solutions.
With regard to the human-right property of possible autonomy as part of
the specific living condition, we can formulate the following basic Axiom:
Axiom 7. The state or system can be divided in the functional parts: territory,
state-contract and individual.
The division of the state or system according to Axiom 7 into several parts
makes a general decoupling and fractionalization possible. This means that each
of the components can be distributed over the world. Hence this is a decentral
process that is intrinsically democratic in nature, a second-order cybernetics or a
higher-ordered origiton. According to Axiom 4, the uniting value, which is also
the self-organizing element, is due to the personalized decision of the individuals
for the nation, state or system. According to Axiom 1 this potentially increases
order and leads to a decoupling and thereby to a stabilisation of processes (see
also Fig. 2). This complexity growth according to Fig. 2, as an increasing
potential ordering, is also depicted by an increasing multidirectionality (cf. also [9]).
When we now reshape the arguments and sum them up in an overall
information-dense process, we can give the following integrating Axiom:
Axiom 8. The ethics of the world and of individuals increases by applying
increasing personalization of decisions.
Axiom 8 can be made plausible as follows: (a) to fulfil the condition of separation
of territory and state-contract, a higher order is needed, as the organization has
to be standardized, e.g. for the executive forces, and to be accountable in the
whole world due to generally globally consented or universally applicable laws.
The contracts then guarantee the higher order, and the construction of a state
based on these fundamentals necessarily has a higher order. Further, (b) world
ethics is a meta-function of individual ethics. This can hence be regarded as
a multicriterial functional optimization problem leading to a more integrated
solution, in which the individual, as the central state-stakeholder of the nation,
state or system, is at the same time making the decisions. Altogether this is
also a higher-order back-coupling process, as the individual is, in a controllable
fashion, self-affected by its previously made decisions and will therefore easily
be motivated to take on responsibility for its own actions. This also builds the
bridge to the Axiom that naturally follows from the previous argumentation:
Axiom 9. Legal or system mandates have to be avoided as human understanding
has to be increased.
With Luhmann, this Axiom can be interpreted insofar as the explicit has to
be increased over the implicit, which is conveyed by the communication
dimension [13]. In Luhmann's communication theory the triangle of (1) understanding,
(2) information and (3) communication forms one system. The elements (1-3)
are autopoietic systems, which means that they are cybernetically closed or
decoupled and structurally interwoven or coupled. So Axiom 9 states that the
personal dimension in communication is of fundamental importance and priority
in society. The choice to enforce rules by mandatory legislative obligations could
over-control a running system by dictating abstract rules over living beings that
could better decide locally and personally for their own good in life. So the enforcing
Fig. 1. World nation trade map - basic elements and composition principle.
composing factors, to make decisions possible that are increasingly rational, e.g.
with computationally enhanced logic AI, the lambda computatrix (cf. e.g. [10]).
References
1. Aristoteles. Organon. Hofenberg (2016)
2. Bauernhansl, T., ten Hompel, M., Vogel-Heuser, B.: Industrie 4.0 in Produktion,
Automatisierung und Logistik. Springer, Wiesbaden (2014). https://doi.org/10.
1007/978-3-658-04682-8
3. Granig, P., Hartlieb, E., Heiden, B. (eds.): Mit Innovationsmanagement zu Indus-
trie 4.0. Springer, Wiesbaden (2018). https://doi.org/10.1007/978-3-658-11667-5
4. Heiden, B.: Wirtschaftliche Industrie 4.0 Entscheidungen - mit Beispielen - Praxis
der Wertschöpfung. Akademiker Verlag, Saarbrücken (2016)
5. Heiden, B., Knabe, T., Alieksieiev, V., Tonino-Heiden, B.: Production organiza-
tion: some principles of the central/Decentral dichotomy and a witness application
example. In: Arai, K. (eds.) Advances in Information and Communication. FICC
2022. Lecture Notes in Networks and Systems, vol. 439, pp. 517–529. Springer,
Cham (2022). https://doi.org/10.1007/978-3-030-98015-3_36
6. Heiden, B., Tonino-Heiden, B.: Diamonds of the orgiton theory. In: 11th Interna-
tional Conference on Industrial Technology and Management (ICITM). Oxford,
UK (2022). Online
7. Heiden, B., Tonino-Heiden, B.: Emergence and solidification-fluidisation. In: Arai,
K. (ed.) IntelliSys 2021. LNNS, vol. 296, pp. 845–855. Springer, Cham (2022).
https://doi.org/10.1007/978-3-030-82199-9_57
8. Heiden, B., Tonino-Heiden, B.: Lockstepping conditions of growth processes: some
considerations towards their quantitative and qualitative nature from investiga-
tions of the logistic curve. In: Arai, K. (eds.) Intelligent Systems and Applications.
IntelliSys 2022. Lecture Notes in Networks and Systems, vol. 543, pp. 695–705.
Springer, Cham (2023). https://doi.org/10.1007/978-3-031-16078-3_48
9. Heiden, B., Tonino-Heiden, B., Alieksieiev, V.: System ordering process based on
Uni-, Bi- and multidirectionality – theory and first examples. In: Hassanien, A.E.,
Xu, Y., Zhao, Z., Mohammed, S., Fan, Z. (eds.) BIIT 2021. LNDECT, vol. 107, pp.
594–604. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-92632-8_55
10. Heiden, B., Tonino-Heiden, B., Alieksieiev, V., Hartlieb, E., Foro-Szasz, D.:
Lambda computatrix (LC)—towards a computational enhanced understanding of
production and management. In: Yang, X.-S., Sherratt, S., Dey, N., Joshi, A. (eds.)
Proceedings of Sixth International Congress on Information and Communication
Technology. LNNS, vol. 236, pp. 37–46. Springer, Singapore (2022). https://doi.
org/10.1007/978-981-16-2380-6_4
11. Heiden, B., Volk, M., Alieksieiev, V., Tonino-Heiden, B.: Framing artificial intelli-
gence (AI) additive manufacturing (AM). In: Procedia Computer Science, vol. 186,
pp. 387–394. Elsevier B.V. (2021). https://doi.org/10.1016/j.procs.2021.04.161
12. Heiden, B., Walder, S., Winterling, J., Perez, V., Alieksieiev, V., Tonino-Heiden,
B.: Universal Language Artificial Intelligence (ULAI), chapter 3. Nova Science
Publishers, Incorporated, New York (2020)
13. Luhmann, N.: Die Wissenschaft der Gesellschaft, 3rd edn. Suhrkamp Verlag, Berlin
(1994)
14. Luhmann, N.: Die Gesellschaft der Gesellschaft, 10th edn. Suhrkamp Verlag, Frank-
furt/Main (2018)
15. Machiavelli, N.: Der Fürst / Il Principe. Philipp Reclam jun. Verlag GmbH (1986)
16. Pentland, A.: Building a New Economy: Data as Capital, MIT Press, Cambridge
(2021)
17. Pentland, A., Lipton, A., Hardjono, T.: Building the New Economy Data as Cap-
ital. MIT Press, Cambridge (2021)
18. Russell, B.: Philosophie des Abendlandes - Ihr Zusammenhang mit der politischen
und der sozialen Entwicklung. Europa Verlag Zürich, 3rd edn., History of Western
Philosophy (Routledge Classics) (2011). (englisch)
19. Scharmer, O., Käufer, K.: Leading from the Emerging Future - From Ego-System
To Eco-System Economies - Applying Theory U to Transforming Business, Society,
and Self. Berrett-Koehler Publishers Inc., San Francisco (2013)
20. Schmidlechner, K.: Überlegungen zur Geschichte und aktuellen Situation von
demokratischen Gesellschaften. Institut für Kinderphilosophie. 14.-17. Oktober
2021
21. Senge, P., Scharmer, C.O., Jaworski, J., Flowers, B.S.: Presence - Exploring Pro-
found Change in People, Organizations and Society. Nicholas Brealey Publishing,
London (2007)
22. smartfactory. http://www.smartfactory.de/. (Accessed 04 Apr 2014)
23. Tonino-Heiden, B., Heiden, B., Alieksieiev, V.: Artificial life: investigations about
a universal osmotic paradigm (UOP). In: Arai, K. (ed.) Intelligent Computing.
LNNS, vol. 285, pp. 595–605. Springer, Cham (2021). https://doi.org/10.1007/
978-3-030-80129-8_42
24. UN: Statement of essential human rights presented by the delegation of
Panama (1946). https://digitallibrary.un.org/record/631107?ln=en. (Accessed 28
Sept 2022)
25. UN. Allgemeine Erklärung der Menschenrechte (1948). https://www.humanrights.
ch/de/ipf/grundlagen/rechtsquellen-instrumente/aemr/. (Accessed 28 Sept 2022)
26. Villari, M., Fazio, M., Dustdar, S., Rana, O., Ranjan, R.: Osmotic computing: a
new paradigm for edge/cloud integration. IEEE Cloud Comput. 3, 76–83 (2016)
Using Regression and Algorithms
in Artificial Intelligence to Predict
the Price of Bitcoin
1 Introduction
In today's era of rapidly developing technology, cryptocurrencies have become an
inevitable part of society. The trend of using cryptocurrencies has appeared only
in recent years but is gradually dominating and promises to replace cash in the
future. Cryptocurrencies that are becoming familiar to investors, such as
Ethereum [24], Ripple [20], and especially Bitcoin [12], introduced by Satoshi
Nakamoto in October 2008, have made crypto money stand out more than ever.
Bitcoin was born and became famous thanks to blockchain technology, which
allows direct transactions without an intermediary organization. As a result,
Bitcoin can secure transactions better at a lower cost. The value of Bitcoin has
been proliferating in recent years, attracting large volumes of transactions and
investments into the sector worldwide. This makes investors willing to invest in
Bitcoin to get significant profits from this digital currency. Big tech companies
have gradually accepted Bitcoin as payment. https://www.cnbc.com/
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 426–438, 2023.
https://doi.org/10.1007/978-3-031-18461-1_29
Predict the Price of Bitcoin 427
In February 2021, Tesla announced that it had purchased $1.5 billion in Bitcoin
and would accept Bitcoin as payment for its cars. This has made the
cryptocurrency investment market, especially for Bitcoin, more exciting in recent
years. Bitcoin price prediction is always a hot topic, attracting many researchers.
The use of regression and algorithmic artificial intelligence aims to solve the
following problems:
– First, find the best prediction algorithm among the single models.
– Second, use the best-performing single algorithm to hybridize with other
algorithms to improve prediction performance and accuracy.
– Finally, compare the hybrid and single models to find the model with the
best accuracy and performance for predicting the future Bitcoin price.
Artificial intelligence algorithms and hybrid models will be tested to predict
the opening price of Bitcoin. The prediction results of the models will be
compared to find the model with the best accuracy for predicting the Bitcoin
price. This study uses Bitcoin transaction history data for the prediction.
The rest of the paper is organized as follows: Sect. 2 describes related works;
Sect. 3 presents the algorithms in artificial intelligence used in this study; Sect. 4
presents the hybrid method; Sect. 5 covers the experiments and evaluation of
results; Sect. 6 gives the conclusion and future development directions.
2 Related Works
Many studies on Bitcoin price prediction have been conducted using methods
such as regression, machine learning, and deep learning to predict the Bitcoin
price trend. These predictions are made on datasets of Bitcoin's transaction
history. Machine learning methods are widely used and achieve promising
results, especially in short-term prediction of the Bitcoin market trend.
Zheshi Chen et al. [2] used RF, XGBoost, Quadratic Discriminant Analysis,
SVM, and Long Short-Term Memory models to predict the Bitcoin price at
5-minute intervals. The best accuracy obtained was 67.2%, which outperformed
the statistical methods. The SVM model was also used in the research of Dennys
C.A. Mallqui and Ricardo A.S. Fernandes [11], who further used an ANN model
to predict the maximum, minimum, and closing price direction of Bitcoin. Their
results show that the best prediction model improves accuracy by more than
10%, and the study reports a mean absolute percentage error of 1% to 2%. In
addition, the authors of [7] used time series prediction models such as ARIMA,
FBProphet, and XGBoost; the best predictive model was ARIMA, with an RMSE
of 322.4 and an MAE of 227.3. The authors of [9] used SVM and LR models to
predict Bitcoin prices on a time series of daily Bitcoin closing prices from 2012
to 2018. As a result, the SVM model performed better than the LR model in
Bitcoin price prediction.
428 N. D. Thuan and N. T. V. Huong
used. The methods used in this study include SVM, Deep Learning, and RF. As
a result, RF is the best performing algorithm with 70.5% accuracy.
Hakan Pabuccu et al. [14] (2020) predicted Bitcoin price movements by applying
machine learning algorithms such as SVM, ANN, Naive Bayes (NB), RF, and LR.
The results show that on continuous data RF has the highest and NB the lowest
predictive performance, while on discrete data ANN performs best and NB worst.
Haerul Fatah et al. [3] (2020) used data mining to predict cryptocurrency
prices. The three cryptocurrencies used in this study are Bitcoin, Ethereum,
and NEO. The machine learning algorithms used are KNN, SVM, RF, DT, NN,
and LR. Experimental results show that the most accurate prediction algorithm
is SVM.
Most studies on the Bitcoin price use single models, but some suggest using
hybrid models. In this study, we therefore predict the next day's opening price
of Bitcoin with single models and propose hybrid models to improve accuracy.
Regression describes the relationship between a dependent variable and one or
more independent variables. Regression includes many types of problems, among
them Linear Regression. Linear regression [1] was invented around 200 years
ago, so it is considered one of the classic models among regression problems. It
describes the linear relationship between a dependent variable and one or more
independent variables and has the following form:

Y = A + BX    (1)

In Eq. (1), Y is the dependent variable, X is the independent variable, A is the
intercept, and B is the slope coefficient. The objective of this study is to use a
linear regression model to predict prices from the respective independent
variables.
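As an illustration, Eq. (1) can be fitted by ordinary least squares. The sketch below uses a synthetic series with known coefficients (placeholder data, not the paper's Bitcoin dataset):

```python
import numpy as np

# Minimal sketch of Eq. (1), Y = A + B*X, fitted by ordinary least squares.
# The series is synthetic with known A and B, not the paper's Bitcoin data.
X = np.arange(10, dtype=float)          # e.g. a day index
Y = 100.0 + 2.5 * X                     # "price" with A = 100.0, B = 2.5

B, A = np.polyfit(X, Y, deg=1)          # polyfit returns slope first, then intercept
print(A, B)                             # recovers A ≈ 100.0, B ≈ 2.5
```

In practice the independent variables would be the dataset attributes described in Sect. 5 rather than a plain index.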
Support Vector Machines (SVM) were proposed by Vapnik et al. in the 1970s
and became famous in the 1990s because SVMs performed exceptionally well in
multidimensional spaces. SVM belongs to the class of supervised learning
methods. Its aim is to find the optimal hyperplane, where the margin is the
distance from the nearest point of a class to that hyperplane. The idea of SVM
is that the margins of the two classes must be equal and as large as possible; this
is illustrated in the equation below:
min_{w,b,ξ,ζ*}  (1/2)‖w‖² + C Σ_{i=1}^{n} (ξ_i + ζ_i*)    (2)

subject to:
y_i − ⟨w, x_i⟩ − b ≤ ε + ξ_i
b + ⟨w, x_i⟩ − y_i ≤ ε + ζ_i*
ξ_i, ζ_i* ≥ 0
i = 1, ..., n
Support Vector Regression (SVR) [15] is a regression model that uses the
Support Vector Machine algorithm to predict the value of a continuous variable,
which in this study is the price of the cryptocurrency Bitcoin.
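As a hedged illustration of the ε-insensitive formulation in Eq. (2), the sketch below fits scikit-learn's SVR (a common implementation choice, assumed here; the paper does not name its library) to a toy linear series:

```python
import numpy as np
from sklearn.svm import SVR

# Sketch of Support Vector Regression (cf. Eq. (2)): an epsilon-insensitive
# tube with slack variables penalised by C.  Toy data, not the Bitcoin series.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0               # perfectly linear target

svr = SVR(kernel="linear", C=100.0, epsilon=0.1).fit(X, y)
pred = svr.predict([[10.0]])[0]         # true value is 21.0
print(pred)
```

With `epsilon=0.1`, deviations smaller than the tube width are not penalised, so the fit tracks the line only up to that tolerance.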
The main idea of this algorithm is to predict based on the combination of many
decision trees, averaging all the predictions of the individual trees. Each of these
trees is very simple but randomized and grows differently, depending on the
choice of training data and attributes.
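The averaging idea can be made concrete with scikit-learn's `RandomForestRegressor` (an assumed implementation, using synthetic data): the forest prediction is exactly the mean of the individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Sketch of the Random Forest idea: many randomized decision trees whose
# predictions are averaged.  Synthetic data, not the paper's dataset.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.1, size=300)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# For regression, the ensemble prediction equals the mean over all trees.
x_new = np.array([[5.0]])
per_tree = [tree.predict(x_new)[0] for tree in rf.estimators_]
print(rf.predict(x_new)[0], np.mean(per_tree))
```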
4 Hybrid Methodology
4.1 Hybrid Model Based on LR with SVM, KNN, NN, DT, and RF
G. Peter Zhang [23], in his research, proposed a hybrid method combining two
models, ARIMA and ANN, to predict time series, in which the time series data
consist of two components, a linear and a non-linear one. The idea of the hybrid
approach is to combine these two components, which are represented by the
following equation:
y_t = L_t + N_t    (3)

In (3), y_t is the time series value, L_t is the linear component, and N_t is the
non-linear component. The ARIMA model is first used to predict the linear
component of the time series. The ANN model then predicts the non-linear
component from the error values obtained from the ARIMA model.
The error values of the ARIMA predictions are determined by the following
equation:

e_t = y_t − L̂_t    (4)

In (4), e_t is the error value after using the predictive ARIMA model at time t,
y_t is the value of the time series at time t, and L̂_t is the predicted value of the
ARIMA model at time t. The ANN model is then used to predict the value of
e_t, the error value obtained from the ARIMA prediction, and the final forecast
is illustrated by the following equation:

ŷ_t = L̂_t + N̂_t    (6)
From the above idea, we propose hybridizing the LR model to predict the
linear part and then using the SVM, KNN, NN, DT, and RF models to predict
the non-linear part of the remaining data. The test results compare the hybrid
models with the single models to find the best model for forecasting the Bitcoin price.
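A minimal sketch of this proposal, assuming scikit-learn and a synthetic series (a linear trend plus a sine term) in place of the Bitcoin data: LR fits the linear component L_t, a regressor fitted on the LR residuals e_t supplies the non-linear component N̂_t, and the forecast is their sum.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

# Hedged sketch of the proposed hybrid (cf. Eqs. (3)-(6)):
#   LR predicts L_t, a second model fitted on e_t = y_t - L̂_t predicts N̂_t,
#   and the final forecast is ŷ_t = L̂_t + N̂_t.  Toy data, not Bitcoin prices.
X = np.linspace(0.0, 10.0, 200).reshape(-1, 1)
y = 3.0 * X.ravel() + np.sin(X.ravel())         # linear trend + non-linear part

lr = LinearRegression().fit(X, y)
residuals = y - lr.predict(X)                   # e_t
svr = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(X, residuals)

y_hat = lr.predict(X) + svr.predict(X)          # L̂_t + N̂_t
rmse_lr = float(np.sqrt(np.mean((y - lr.predict(X)) ** 2)))
rmse_hybrid = float(np.sqrt(np.mean((y - y_hat) ** 2)))
print(rmse_lr, rmse_hybrid)
```

The same residual-fitting step can be repeated with KNN, NN, DT, or RF regressors in place of the SVR to obtain the other hybrid variants.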
4.2 Deployment
The input is the time series value to be forecasted, which consists of two parts: a
linear component and a non-linear component. After the data goes through the
preparation step (selecting the necessary attributes and preprocessing), the LR
model is used for forecasting; its output is the linear component of the time
series. The error values (the differences between the values predicted by the LR
model and the actual values) are then modelled with the SVM, KNN, NN, DT,
and RF models, respectively, to predict the non-linear component of the time
series. The linear and non-linear results obtained from LR and from the SVM,
KNN, NN, DT, and RF models are combined to give the final prediction result
(Fig. 1).
The process of the hybrid model goes through the following steps:
Step 1: Prepare and preprocess the data and find the best single model for
the forecasted time series.
Step 2: Train the LR model on the training dataset, then make predictions
on the test dataset and calculate the error values between the predictions and
the actual results.
Step 3: Use the SVM, KNN, NN, DT, and RF models, respectively, to predict
the error values of the results from Step 2.
Step 4: Combine the value predictions from Step 2 with the error-value
predictions from Step 3 to obtain the predictions of the combined model.
Step 5: Evaluate the models based on two metrics, RMSE and MAE, to
find the model with the best prediction results.
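The Step 5 metrics can be written down directly; the values below are made-up placeholders purely for illustration:

```python
import numpy as np

# Sketch of the Step 5 evaluation: RMSE and MAE between actual and predicted
# values.  The series here are invented numbers, not experimental results.
def rmse(actual, predicted):
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((a - p) ** 2)))

def mae(actual, predicted):
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(a - p)))

actual = [100.0, 102.0, 101.0, 105.0]
predicted = [101.0, 101.0, 103.0, 104.0]
print(rmse(actual, predicted), mae(actual, predicted))
```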
5 Experiment
Attribute   Description
Date        Cryptocurrency trading day
Open        Opening price/initial price of the cryptocurrency at a given time
High        The highest price of the cryptocurrency in the trading period
Low         The lowest price of the cryptocurrency in the trading period
Close       Closing price/last price of the cryptocurrency at a given time
Model   RMSE (k = 5)   MAE (k = 5)   RMSE (k = 10)   MAE (k = 10)
SVM     11659.529      9981.240      9107.910        8200.751
DT      206.610        46.298        142.754         36.700
NN      24.327         13.286        20.218          12.990
KNN     232.066        70.294        169.977         58.278
RF      200.119        42.189        141.032         36.034
LR      8.415          0.527         6.390           0.518
Test Single Models: LR gives the best forecast in terms of RMSE and MAE. For
k = 5, LR has an RMSE of 8.415 and an MAE of 0.527; at k = 10, an RMSE of
6.390 and an MAE of 0.518.
Hybrid model   RMSE (k = 5)   MAE (k = 5)   RMSE (k = 10)   MAE (k = 10)
LR + SVM       8.410          0.506         6.381           0.480
LR + DT        16.109         1.171         15.618          1.191
LR + NN        25.216         15.406        12.081          6.566
LR + KNN       10.224         0.954         8.081           0.879
LR + RF        13.231         1.415         10.264          1.263
Test Hybrid Models: LR + SVM gives the best forecast in terms of RMSE and
MAE. For k = 5, LR + SVM has an RMSE of 8.410 and an MAE of 0.506; at
k = 10, an RMSE of 6.381 and an MAE of 0.480.
In both cases, k = 5 and k = 10 give the same ranking, and the hybrid models
give better results than the single models. The hybrid LR model with SVM,
KNN, NN, DT, and RF uses the LR method to predict the linear component
of the time series and the remaining algorithms to predict the non-linear
component. This greatly improves performance compared to the single models,
and the results show that the hybrid models give smaller errors. In particular,
LR+SVM gives the best results in this experiment: LR is the best predictive
algorithm among the single models, and SVM has good predictive ability on
small and less volatile data, so it does a good job of predicting the error values.
Conversely, SVM is the worst algorithm among the single models, because its
predictive ability on large-scale raw data is poor.
We can see that the hybrid model gives excellent hourly Bitcoin price predictions;
the hybrid models improve accuracy and performance over the single models.
The graph in Fig. 4 below, obtained from the experiment with k = 5, compares
the actual values, the values predicted by LR, and the values predicted by the
hybrid model LR+SVM.
Fig. 4. The graph illustrates the actual value, LR model, and hybrid model LR+SVM
6 Conclusion
Cryptocurrencies in general, and Bitcoin in particular, are a promising investment
area, but they also carry many risks, as their fluctuations are unpredictable.
Regression and artificial intelligence algorithms have helped create prediction
methods with high accuracy. The experimental results show that the hybrid
model has higher accuracy and better performance than the single models.
Therefore, the hybrid model is very promising for predicting the future price of
Bitcoin and other cryptocurrencies.
References
1. Ali, M., Swakkhar, S.: A data selection methodology to train linear regression
model to predict bitcoin price. In: 2020 2nd International Conference on Advanced
Information and Communication Technology (ICAICT), pp. 330–335. IEEE (2020)
2. Chen, Z., Li, C., Sun, W.: Bitcoin price prediction using machine learning: an
approach to sample dimension engineering. J. Comput. Appl. Math. 365, 112395
(2020)
3. Fatah, H., et al.: Data mining for cryptocurrencies price prediction. J. Phys. Conf.
Ser. 1641, 012059 (2020)
4. Guo, Q., Lei, S., Ye, Q., Fang, Z., et al.: MRC-LSTM: a hybrid approach of multi-
scale residual CNN and LSTM to predict bitcoin price. In: 2021 International Joint
Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2021)
5. Huang, J.-P., Depari, G.S.: Forecasting bitcoin return: a data mining approach.
Rev. Integr. Bus. Econ. Res. 10(1), 51–68 (2021)
6. Inamdar, A., Aarti, B., Suraj, B., Pooja, M.S.: Predicting cryptocurrency value
using sentiment analysis. In: 2019 International Conference on Intelligent Com-
puting and Control Systems (ICCS), pp. 932–934. IEEE (2019)
7. Iqbal, M., Iqbal, M.S., Jaskani, F.H., Iqbal, K., Hassan, A.: Time-series prediction
of cryptocurrency market using machine learning techniques. EAI Endorsed Trans.
Creative Technol. 8(28), e4–e4 (2021)
8. Ji, S., Kim, J., Im, H.: A comparative study of bitcoin price prediction using deep
learning. Mathematics 7(10), 898 (2019)
9. Karasu, S., Altan, A., Saraç, Z., Hacioğlu, R.: Prediction of bitcoin prices with
machine learning methods using time series data. In: 2018 26th Signal Processing
and Communications Applications Conference (SIU), pp. 1–4. IEEE (2018)
10. Li, Y., Jiang, S.: Hybrid data decomposition-based deep learning for bitcoin pre-
diction and algorithm trading. Available at SSRN 3614428 (2020)
11. Mallqui, D.C.A., Fernandes, R.A.S.: Predicting the direction, maximum, minimum
and closing prices of daily bitcoin exchange rate using machine learning techniques.
Appl. Soft Comput. 75, 596–606 (2019)
12. Nakamoto, S.: Bitcoin: a peer-to-peer electronic cash system. In: Decentralized
Business Review, p. 21260 (2008)
13. Nguyen, D.-T., Le, H.-V.: Predicting the price of bitcoin using hybrid ARIMA
and machine learning. In: Dang, T., Küng, J., Takizawa, M., Bui, S. (eds.) Future
Data and Security Engineering. FDSE 2019. Lecture Notes in Computer Science,
vol. 11814, pp. 696–704. Springer, Cham (2019) https://doi.org/10.1007/978-3-
030-35653-8 49
14. Pabuçcu, H., Ongan, S., Ongan, A.: Forecasting the movements of bitcoin prices:
an application of machine learning algorithms. Quant. Finan. Econ. 4(4), 679–692
(2020)
15. Peng, Y., Albuquerque, P.H.M., de Sá, J.M.C., Padula, A.J.A., Montenegro, M.R.:
The best of two worlds: forecasting high frequency volatility for cryptocurrencies
and traditional currencies with support vector regression. Exp. Syst. Appl. 97,
177–192 (2018)
16. Phaladisailoed, T., Numnonda, T.: Machine learning models comparison for bitcoin
price prediction. In: 2018 10th International Conference on Information Technology
and Electrical Engineering (ICITEE), pp. 506–511. IEEE (2018)
17. Poongodi, M.: Prediction of the price of Ethereum blockchain cryptocurrency in
an industrial finance system. Comput. Electr. Eng. 81, 106527 (2020)
18. Rathan, K., Sai, S.V., Manikanta, T.S.: Crypto-currency price prediction using
decision tree and regression techniques. In: 2019 3rd International Conference on
Trends in Electronics and Informatics (ICOEI), pp. 190–194. IEEE (2019)
19. Rizwan, M., Narejo, S., Javed, M.: Bitcoin price prediction using deep learning
algorithm. In: 2019 13th International Conference on Mathematics, Actuarial Sci-
ence, Computer Science and Statistics (MACS), pp. 1–7. IEEE (2019)
20. Saadah, S., Whafa, A.A.A.: Monitoring financial stability based on prediction of
cryptocurrencies price using intelligent algorithm. In: 2020 International Confer-
ence on Data Science and Its Applications (ICoDSA), pp. 1–10. IEEE (2020)
21. Singh, H., Parul, A.: Empirical analysis of bitcoin market volatility using supervised
learning approach. In: 2018 Eleventh International Conference on Contemporary
Computing (IC3), pp. 1–5. IEEE (2018)
22. Wu, Z.: Predictions of cryptocurrency prices based on inherent interrelationships.
In: 2022 7th International Conference on Financial Innovation and Economic Devel-
opment (ICFIED 2022), pp. 1877–1883. Atlantis Press (2022)
23. Zhang, G.P.: Time series forecasting using a hybrid ARIMA and neural network
model. Neurocomputing 50, 159–175 (2003)
24. Zoumpekas, T., Houstis, E., Vavalis, M.: Eth analysis and predictions utilizing
deep learning. Exp. Syst. Appl. 162, 113866 (2020)
Integration of Human-Driven
and Autonomous Vehicle: A Cell
Reservation Intersection Control Strategy
Overview
The format of this paper begins with an introduction to mixed-traffic management in Sect. 1. Section 2 follows with an assessment of the state of the art in modelling combined human-driven and autonomous vehicle traffic. Section 3 explains the suggested mixed-traffic control approach via the cell reservation strategy in a 4-way intersection and a merging T-junction with a priority lane. In that section, the behavior of autonomous vehicles (AVs) with gentle driving characteristics and human-driven vehicles (HVs) with aggressive driving characteristics is modeled, together with a thorough discussion of the underlying nonlinear traffic flow characteristics relating to the car-following model and the safe distance under study. Section 5 presents the experiments carried out, a discussion, and an assessment of the outcomes; the findings that corroborate the research premise are presented in Table 2, together with the outcomes of the three traffic control strategies compared against the suggested alternative. Section 8 concludes with some final thoughts and a summary of our findings.
1 Introduction
To minimize the delay and lower the likelihood of accidents at a road intersection, a new traffic control model for a mixed environment at a merging road is developed using the cell reservation model. In general, traffic intersections are regarded as a major source of congestion. As a result, controlling
and optimizing mix-traffic flow at road crossings is essential as a foundation
for the integration of autonomous and human-driven vehicles. Additionally, as
autonomous vehicles have grown in popularity, issues with mixed traffic have
drawn academics to create a variety of related technologies to aid in the inte-
gration of these vehicles. Autonomous vehicles have recently been considered
as an alternate solution to several issues with road traffic, from lowering travel
time to providing convenient and safe driving. In order to improve traffic flow,
autonomous cars can exchange information like position and velocity in real-time
with one another or with a centralized controller. While human-driven vehicles rely on traffic signals and exhibit stochastic driver behavior, this communication feature enables the prediction of AV speeds in managing traffic at the intersection.
The unpredictable nature of human driving behavior contributes to a delay in
decision-making when in motion. As a solution to traffic issues, autonomous
vehicles can exchange real-time car movement parameters and enhance HV per-
formance.
The suggested 4-way road intersection model is shown in Fig. 1, with vehicle
trajectories denoted by green arrows and the cross collision site, also known as
the reservation node, shown by a red dot. Given that AVs and HVs occupy the
same junction space, AVs are vehicles with wireless communication signs. The
wireless communication sign is outside the intersection, the control unit is the
box outside the circle, and L denotes the lane identification. The green trajectory
lines from various road lanes or trajectories cross one another at these intersection cross-collision sites before continuing on to their final destination after passing
through the intersection region. The main duty of junction control is to assign
reservation nodes to vehicles in a seamless and sequential manner without caus-
ing a collision. Figure 2 presents a 3-way intersection model extracted from the 4-way model, in which the merging segment of the route connects to the main road, which carries a continuous one-way flow of traffic. The type of intersection, which
is determined by the number of road systems and lanes involved, heavily influ-
ences the traffic intersection management strategy. The inquiry for this study is
centered on two different kinds of intersections: 3-way and 4-way intersections.
The number of vehicle trajectories taken into account while deciding which intersection management tactics to use varies. In this scenario, drivers must choose
a trajectory based on the junction management rule depending on the goal or
destination of the vehicle. A small error in path trajectory judgment at the inter-
section location carries a substantial danger of many accidents. Depending on
the road and junction control strategy model, there are delays for vehicles at the
intersection.
Research Question. Since AVs are developing and HVs are not going away any-
time soon, it seems clear that AVs and HVs will have to coexist for a while. This
442 E. F. Ozioko et al.
study’s investigation and analysis are based solely on simulated traffic data that
was parameterized using the suggested methodology and conducted based on
traffic theories. Research on the integration of human-driven and autonomous
vehicles is still grappling with a few open concerns, some of which are listed below.
Hypothesis. When the road intersection cells are reserved in order, traffic flows
smoothly. Creating an approach to represent the 2-dimensional lateral and longitudinal driving behavior required for a realistic mixed traffic flow model is the main challenge of the drivers' behavioural model. The interactions between cars on the road are governed by elements such as lateral vehicle displacement, driver
behavior, and the environmental impact of adjacent vehicles. The idea behind
vehicle collaboration is to use data obtained by using vehicle-to-vehicle com-
munication links to modify the movements of the vehicle, decrease idling time,
and minimize the fuel consumption rate at a road intersection. In most cases, it is presumed that autonomous vehicles approaching the intersection can interface
with the infrastructure and get data from the incoming traffic flow.
– Only AVs, not HVs, are capable of adjusting the inter-vehicle space, which can be done to improve traffic flow after passing through the merging zone. What happens if the automobile ahead breaks down? The following vehicle must not collide with it. To align the braking pattern of HVs with the merging of AVs, a longer response delay and greater braking force are needed after the merge. In both situations, an AV and an HV each slowed down to 1 unit less than the speed of the car in front of them.
Based on the aforementioned arguments, studying the effects of autonomous vehicles (AVs) on human-driven vehicles (HVs) at a merging single-lane road with a priority lane under a car-following model may be regarded as reasonable. The aim is to improve traffic flow while examining the effects of our approach on various traffic control measures. It is well known that HVs are composed of radical drivers who frequently
behave aggressively when they interact with AVs. By compelling AVs to stop and taking the right of way rather than waiting to pass, HVs effectively claim priority access over AVs. At intersections where a minor street (non-priority route) meets a major roadway (priority road), the inter-vehicle distance is typically taken into account. A priority road vehicle may roll
through an intersection if it has just arrived; otherwise, depending on the type
of automobile, it may start the movement from rest. A space between cars in
a conflicting traffic pattern is presented to human drivers who wish to perform
merging manoeuvre.
Depending on the vehicle mix, the pattern of signalised street vehicle arrivals
generates varied time intervals of different values. From [7], the inter-vehicle spac-
ing is often measured in seconds, and it is the distance between the first car’s
back bumper and the next one’s front bumper. The period between vehicles
arriving at a stop line on a non-priority route and the first vehicle pulling up to
the priority road is the “space” being discussed here. The earlier study by [2]
indicates that modelling delays for homogeneous traffic show a linear connection
for the same type of vehicle. The coexistence of mixed traffic and non-uniform car behaviour, which such linear models cannot predict, can result in traffic collisions. Homogeneous traffic, by contrast, because of its uniform vehicle behaviour, can result in smaller inter-vehicle spaces being accessible. Additionally, a discernible rise in the occupation time of low-priority movements has been observed.
Cell reservations at intersection collision spots, on the other hand, ensure
safe and ideal intersection management and may even be a less expensive way to
increase productivity in a mixed-traffic situation. At every time instant (0.1 s), all vehicles in the intersection environment are checked in order to update the intersection status. This kind of traffic management will effectively relieve and manage traffic congestion at road intersections. The underlying presumption is that
autonomous agents solely control how a vehicle navigates. The sophisticated
traffic simulator calculates the various delays that occur when moving vehicles
through an intersection. For the performance evaluation of the method for com-
parison, the intersection performance measures were defined. It is anticipated
that with the addition of autonomous vehicle capabilities like cruise control,
GPS-based route planning, and autonomous steering, it will be easier to govern
multi-agent behavior in mixed traffic, which will boost HV performance.
3 Methodology
The road model Fig. 2 outlines a single lane merging road system with its physical
properties.
How can coexistence occur when there are cars with diverse driving styles
using the same road system without significantly reducing the effectiveness of
traffic flow? In this situation, a merging T-intersection at an angle of 45◦ is
being considered for cars sharing space to test the hypothesis. As seen in Fig. 2,
this situation involves vehicles coming from a different road merging onto a pri-
ority road at a junction in between them. The likelihood of a collision cannot
always be determined from the distance between the two objects in this sce-
nario alone; rather, vehicles must consider whether they would eventually collide if they continued traveling along their trajectories. The main issue will inevitably come from
human-controlled vehicles since they lack the ability to be self-aware of their sur-
roundings and because their behavior is stochastic and more prone to error than
other types of vehicles. The proposed model took into account combining the
two approaches to traffic management, centralization and decentralization. In
the centralized approach, drivers and vehicles interact with a central controller
and the traffic signal to designate the intersection's right-of-way access priority. In contrast, in the decentralized model, drivers and vehicles interact and
bargain for priority right-of-way access. The impact of autonomous vehicles on
intersection traffic flow has been the subject of several research studies [6]. The
author in [11] proposes the optimisation of traffic intersection using connected
and autonomous vehicles. Also, [5] considered the impact of autonomous cars on
traffic, considering two vehicle types distinguished by their maximum velocities, slow (Vs) and fast (Vf), along with the respective fractions of slow and fast vehicles.
The driving agents can “phone ahead” and book the spaces they require along
their route using the reservation node system. Within the intersection region,
the SIR is divided into an n × n grid of reservation tiles, where n determines the granularity of the reservation node (RN) system. At each time step, only one automobile may reserve each RN. The following information is included in a request to utilize an RN:
– Vehicle type
– Vehicle arrival time
– Vehicle current velocity, though in the simulation, we maintain a maximum
velocity of 10 m/s for optimal performance.
– vmax and vmin
– amax and amin
– Vehicle trajectory with details of the requested RNs
– A request to schedule the car is issued to the CU after the vehicle has arrived. The CU verifies that the present reservation does not conflict with any RN time duration along the vehicle trajectory on the SIR.
– Let RNk be the k-th RN in the SIR, where 1 ≤ k ≤ n and n is the number of RNs along the vehicle route. Let tak be the requested time of arrival at RNk and tdk the departure time from RNk. At every time instance, the CU checks whether the requested RN is available for reservation.
– If the RN is accessible, the CU will verify the availability of the following RN
along the vehicle’s trajectory at the following time step, and so on until the
vehicle exits the SIR.
– If the road camera detects HVs from the intersection, HVs are given prefer-
ential access to the reservation based on their vehicle type.
– If the request is unsuccessful or the requested RNk is not available in the time interval [tak, tdk], the CU performs a search for the next available time for the first requested RN. The CU iterates this process for each subsequent RN request.
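The request-and-check protocol above can be sketched in code. This is a minimal illustration, not the authors' implementation: it assumes each RN keeps a boolean array of the next 100 time-steps of 0.1 s each (the 10 s horizon associated with Fig. 4), and that a trajectory is a list of (RN id, arrival step, departure step) triples; all names are hypothetical.

```python
# Illustrative sketch of the CU's reservation check. Assumptions: each RN
# stores a boolean array of the next 100 time-steps (0.1 s each, a 10 s
# horizon); a trajectory is a list of (rn_id, arrival_step, departure_step).

HORIZON = 100  # 100 slots of 0.1 s = 10 s reservable per RN

class ControlUnit:
    def __init__(self, rn_ids):
        # False = slot free, True = slot already reserved
        self.slots = {rn: [False] * HORIZON for rn in rn_ids}

    def window_free(self, rn, ta, td):
        return all(not self.slots[rn][t] for t in range(ta, td + 1))

    def reserve_trajectory(self, trajectory):
        # All-or-nothing: every RN along the route must be free in its window.
        if all(self.window_free(rn, ta, td) for rn, ta, td in trajectory):
            for rn, ta, td in trajectory:
                for t in range(ta, td + 1):
                    self.slots[rn][t] = True
            return True
        return False

    def next_free_start(self, rn, ta, td):
        # On a failed request, search for the next start time at which the
        # first requested RN window fits, as described in the text.
        width = td - ta
        for start in range(ta, HORIZON - width):
            if self.window_free(rn, start, start + width):
                return start
        return None

cu = ControlUnit(rn_ids=[1, 2, 3])
assert cu.reserve_trajectory([(1, 0, 3), (2, 4, 6)])      # granted
assert not cu.reserve_trajectory([(2, 5, 7), (3, 8, 9)])  # RN2 slots 5-6 taken
print(cu.next_free_start(2, 5, 7))  # next start where the window fits
```

The all-or-nothing grant mirrors the rule that the reservation must not conflict with any RN time duration along the whole trajectory.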
The reservation node framework that coordinates traffic flow at the intersec-
tion could be explained using Fig. 3. This scenario only uses one RN to explain
the background of the reservation node process. From Fig. 3, it is assumed that
car A has the minimum time to the reservation node, so it is allowed to keep
moving. Cars B and C, which are further away than car A, are next checked to
see if the distance between them and another automobile is below the minimum
distance; if it is, the brakes are applied; if not, the vehicle is free to continue
travelling. For cars B and C, the method described above is repeated. Car B is located closer to the reservation node than car C, hence in this straightforward example car B proceeds next and car C passes last. The procedure is repeated until only one car remains to pass the reservation node.
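The single-RN arbitration for cars A, B, and C can be sketched as follows. This is a hedged illustration under stated assumptions (the distance, speed, and minimum-gap values are invented for the example, not taken from the paper): the car with the minimum time to the reservation node keeps moving, and each following car brakes only if its gap to the car ahead falls below the minimum distance.

```python
# Hedged sketch of the arbitration described for cars A, B, C: the car with
# the minimum time to the reservation node keeps moving; every other car
# brakes only if its gap to the car ahead is below a minimum safe distance.
# Names, distances, speeds, and the 5 m threshold are illustrative.

def arbitrate(cars, min_gap=5.0):
    """cars: list of dicts with 'name', 'dist_to_rn' (m), 'speed' (m/s)."""
    # Order by time to reach the reservation node (distance / speed).
    order = sorted(cars, key=lambda c: c["dist_to_rn"] / c["speed"])
    actions = {order[0]["name"]: "go"}  # winner keeps moving
    for ahead, behind in zip(order, order[1:]):
        gap = behind["dist_to_rn"] - ahead["dist_to_rn"]
        actions[behind["name"]] = "brake" if gap < min_gap else "go"
    return [c["name"] for c in order], actions

order, actions = arbitrate([
    {"name": "A", "dist_to_rn": 20.0, "speed": 10.0},
    {"name": "B", "dist_to_rn": 35.0, "speed": 10.0},
    {"name": "C", "dist_to_rn": 38.0, "speed": 10.0},
])
print(order)    # ['A', 'B', 'C']
print(actions)  # C trails B by 3 m < 5 m, so C must brake
```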
Data Structure Scenario for the Reserve Node Approach Involves the
Reservation of Future Time Cells: The following is a description of the
structure of the gathered data, their connections, and the storage, actions, or
functions that can be performed on the data in the algorithm: As represented in
Fig. 4, in essence, one can reserve 10 s because the control unit (CU) stores an
array of the upcoming 100 time-steps (100 ms intervals) for each RN.
The mixed behaviour creates a highly complex traffic scenario that has a
considerable negative influence on the capacity and effectiveness of intersection
traffic flow. Since each vehicle type behaves differently, uses a distinct communication medium, and abides by a straightforward regulation at intersection zones, vehicles with one behavioural pattern are handled with a consistent protocol at intersections. However, research reveals that car accidents near merging highways, frequently caused by human error, are among the most prevalent subjects of traffic studies. This work has extensively researched the coexistence of
mixed behaviour, safe distance, and cell reservation within the suggested traffic
model framework. The majority of models created to analyse how traffic behaves
at intersections with mixed vehicle types are built to prevent collisions between
autonomous and human-driven vehicles, but they degrade traffic flow efficiency.
The influence of autonomous automobiles on merging roadways in a mixed envi-
ronment is explored in this article using various car mix proportions in addition
to researching mixed traffic behaviour. The crucial element for this strategy is
the vehicle occupation time, which is the amount of time a vehicle or group of
vehicles takes to cross at the intersection. At these intersections, the occupation
time of the vehicle mix ratio was investigated and analysed. Additionally, the
information on mixed-traffic volume and the percentage of occupation time by
each vehicle as shown in Table 2 were extracted from the simulation.
The two road systems being taken into consideration are the United Kingdom's left-side-of-the-road driving policy and the system that requires drivers to keep to the right. The issues manifest themselves uniformly across all of the model's scenarios, regardless of the type of vehicle. Due to the merging angle,
model’s scenarios, regardless of the type of vehicle. Due to the merging angle,
this model typically has trouble detecting and treating the car that is following
from the other lane. By making sure that the car is looking at an angle that
covers the merging space, this issue is solved. Fig. 2 shows a merging T-junction whose non-priority road feeds into the priority road at node 8, not at 90◦ but at 45◦, and this road bends away from the lane leading up to it at node 12. At this T-junction, the autonomous vehicles rely on communication with other vehicles and the road infrastructure, while human drivers utilise their vision to detect signals from traffic lights and other vehicles.
According to their communication capability, vehicles are classified as HV or
AV within the junction zone. The vehicle presence detector detects the presence
of vehicles, and the intersection controller classifies and specifies the vehicle type
based on communication capability. There is a control zone at the intersection
where the central controller gathers information from the connected vehicles
(lane id, time, position, velocity, and the number of vehicles in a platoon). The
sample data structure below represents the data from the RN algorithm:
Cars 1 and 2 are initially driven from road nodes 11 and 7, respectively, in
order to go from node 8 to their final destination at node 9:
– Car1: RN data structure = [0, …, 0] (100 zeros)
– Car2: RN data structure = [0, …, 0] (100 zeros)
While the HVs rely on the traffic light signal control, all the AVs interact with one another and with the central controller as they approach the intersection to be assigned an RN. For efficiency and central control, the two forms of vehicle control media are in sync. The traffic schedule is determined by a number of predetermined choices, including the access protocol, the distance/position of the cars from the merging point, the kind of cars, and the number of cars in the platoon or queue.
The contribution of the two types of mixed vehicle behavior to the occupancy time and traffic flow in a merging single-lane road system is described mathematically. The total road length is

lroad = 600 + 600 + 600 + 600 + 49.5 + 106.1 + 49.5 + 106.1 + 106.1 + 49.5 + 106.1

Therefore: lroad = 2972.9 m (approximately), lcar = 4.5 m (average), v = 10 m/s

ncars = lroad / (S + lcar)    (2)

where the safe distance S is 5 m for AVs, 7 m for HVs during platooning, and 3 m after merging. The moment after a merge occurs is when the vehicles are traveling straight ahead at a constant speed.
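Eq. (2) can be checked numerically with the road length and safe-distance values quoted above; the script below is only a worked confirmation of the arithmetic.

```python
# Worked check of Eq. (2), n_cars = l_road / (S + l_car), using the road
# length, vehicle length, and safe-distance values quoted in the text.

l_road = 600 + 600 + 600 + 600 + 49.5 + 106.1 + 49.5 + 106.1 + 106.1 + 49.5 + 106.1
l_car = 4.5  # m, average vehicle length

def n_cars(safe_distance):
    # Each vehicle occupies its own length plus the safe gap ahead of it.
    return l_road / (safe_distance + l_car)

print(round(l_road, 1))   # 2972.9 m, matching the text
print(int(n_cars(5)))     # AVs (S = 5 m): ~312 vehicles
print(int(n_cars(7)))     # HVs platooning (S = 7 m): ~258 vehicles
print(int(n_cars(3)))     # after merging (S = 3 m): ~396 vehicles
```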
Vehicle Model
There are two (2) types of vehicles being considered: the AV, which has communication capability, and the HV, whose driving was developed as a standard driving system.
Due to the inherent disparities in vehicle behavior between AVs and HVs, it is difficult to construct a safe-distance model that results in collision-free traffic flow. Human drivers are less accurate, more prone to errors, and exhibit stochastic behavior that makes them unpredictable. According to [22], given that autonomous vehicles react almost instantly, whereas human drivers need roughly 6 s to react to unexpected occurrences, the following space should be maintained between autonomous vehicles:
sr = v · tr (3)
While that for human-driven vehicles will be:
sr = v · tr + 6 (4)
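Eqs. (3) and (4) can be illustrated numerically. This is a hedged sketch: the AV reaction-time value below is an assumption (one simulation step), and Eq. (4) is applied exactly as printed in the text.

```python
# Sketch of Eqs. (3) and (4): reaction distance s_r = v * t_r for AVs, with
# an extra term for human drivers, who the text says need roughly 6 s more
# to react. The AV reaction-time value t_r is an assumption.

def reaction_distance_av(v, t_r):
    return v * t_r        # Eq. (3)

def reaction_distance_hv(v, t_r):
    return v * t_r + 6    # Eq. (4), as printed in the text

v = 10.0   # m/s, the simulation's maximum velocity
t_r = 0.1  # s, assumed near-instant AV reaction (one simulation step)
print(reaction_distance_av(v, t_r))  # 1.0 m
print(reaction_distance_hv(v, t_r))  # 7.0 m
```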
1. Longitudinal Control for Car Following model (Fig. 5): One of the core tenets
of the car-following model is that for a given speed, “V” (mi/hr), vehicles
follow one another with an average distance, "S" (m). In order to assess the car-following model's throughput, this parameter is important. The average
speed-spacing relation in Eq. 5 proposed by [19] deals with the longitudinal
characteristics of the road and is related to the assessment of the single-lane
road capacity “C” (veh/hr) in the following way:
C = 100 · (V / S)    (5)
where the number 100 denotes the intersection’s default maximum carrying
capacity.
However, the average spacing relations could be represented as:
S = α + βV + γV 2 (6)
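Eqs. (5) and (6) can be combined in a short numeric sketch. The coefficient values for α, β, and γ below are illustrative assumptions for demonstration, not values from the paper.

```python
# Illustrative use of Eqs. (5) and (6): capacity C = 100 * V / S and the
# speed-spacing relation S = alpha + beta*V + gamma*V^2. The coefficient
# values are assumptions for demonstration only.

def spacing(v, alpha=4.5, beta=0.9, gamma=0.05):
    # Eq. (6): average spacing grows with speed (reaction + braking terms).
    return alpha + beta * v + gamma * v ** 2

def capacity(v):
    # Eq. (5): single-lane capacity, with 100 as the default maximum
    # carrying-capacity scale used in the text.
    return 100 * v / spacing(v)

for v in (5, 10, 20):
    # Capacity first rises with speed, then falls as the quadratic
    # spacing term dominates.
    print(v, round(spacing(v), 1), round(capacity(v), 1))
```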
vx = v · cos θ (7)
In addition, the steering angle λi of the vehicle's front wheel is given by the angle of the steering wheel βv and the steering ratio iu, as represented in Eq. 8.

λi = βv / iu    (8)
To handle longitudinal and lateral driving behaviour successfully at road
crossings, a combination of these two approaches is essential. The optimal veloc-
ity function was applied by the longitudinal car-following model to relax the equi-
librium value of the distance between vehicles. Additionally, after a vehicle cuts
in front, there are still issues with high acceleration and deceleration; however, the Intelligent Driver Model corrects this issue. The lateral model determines
if lateral vehicle control is possible, necessary, and desirable by maintaining the
safe distance braking procedure. According to [12], the lateral approach model
is targeted on a streamlined decision-making process using acceleration.
Vehicle Queue
Queue calculates the amount of time a vehicle must wait before receiving the
right of way at a junction from another vehicle. Transportation planning rule
(TPR) may be necessary for operational analysis, depending on the intersection-
specific conditions and at the city’s discretion. TPR is used in queuing assess-
ments for transportation system plans. Stop-and-go traffic, slower speeds, longer
travel times, and increased vehicular queuing are features of traffic congestion.
The quantity of cars waiting at any intersection can be used to measure this.
It is the result of an intersection's cumulative effects over a period of time. The difference in average vehicle occupancy times was used to analyze the vehicle delay time for the three intersection control systems.
– Since there is no other vehicle to affect its speed, the leading car can increase
its speed to the desired level.
– Because drivers aim to maintain a fair distance or time between vehicles, the
speed of the leading vehicle mostly determines the status of the following
vehicle.
– To avoid the collision, the braking procedure uses changing amounts of brak-
ing force.
– The collision avoidance strategy mandates that a driver keep a safe distance
from other moving vehicles, as seen in Fig. 6.
– The distance between the cars is directly correlated with their speed.
vn^t − vn+1^t = T · an+1^t    (11)

an+1^t = (1/T) · (vn^t − vn+1^t)    (12)
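Eqs. (11) and (12) say that the follower's acceleration is proportional to its speed difference with the leader. The minimal sketch below assumes a value for the sensitivity/relaxation time T, which is not given in the text.

```python
# Minimal sketch of the car-following relation in Eqs. (11)-(12):
# a_{n+1} = (v_n - v_{n+1}) / T. The value of T is an assumption.

def follower_acceleration(v_leader, v_follower, T=1.5):
    return (v_leader - v_follower) / T   # Eq. (12)

# One 0.1 s simulation step for a follower closing on a slower leader:
dt = 0.1
v_leader, v_follower = 8.0, 10.0
a = follower_acceleration(v_leader, v_follower)
v_follower += a * dt
print(round(a, 3))           # negative: the follower decelerates
print(round(v_follower, 3))  # follower speed after one step
```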
Based on UK transport authority guidance, random values in the range 0.3 to 1.7 were selected in the model prototype for the human safe driving distance [21].
According to [14], many researchers have devoted their time to modelling driving behaviour, analysing conflict processes, and enhancing traffic safety. The S.I. units of metres, seconds, and kilograms constitute the foundation for all values. To aid in the prediction of the car's motion, consideration is centred on differentiating between a conservative and an optimistic driving style. In order to drive cautiously, a vehicle must be able to stop totally when the vehicle in front of it abruptly or completely stops, as would occur in a crash. In this scenario, the leading vehicle should maintain a minimum distance difference of 30 m. In [13], when driving with an optimistic attitude, it is presumed that the vehicle in front would brake as well, and maintaining
a safe distance will take care of the issue. During the reaction time, the automobile travels:

sr = v · tr    (14)
The safe separation between vehicles is set to be variable for the HVs and
constant for the AVs based on the aforementioned hypotheses. The safe distance
figures, which are expressed in seconds, accurately capture the distance corre-
sponding to the current vehicle speed. The implication of Condition 1 is that
the leading vehicle’s required stopping distance is provided by
s = v1² / (2 · a)    (15)
From condition 2 it follows that to come to a complete stop, the driver of the considered vehicle needs not only the braking distance v² / (2 · b), but also an additional reaction distance v · δt travelled during the reaction time (the time needed to decode and execute the braking instruction).
Consequently, the stopping distance is given by

δx = v · δt + v² / (2 · b)    (16)
Finally, condition 3 is satisfied if, taking the stopping distance into account, the gap s exceeds the necessary minimum value s0, where

δx = v · δt + v² / (2 · b) − v1² / (2 · b)    (17)

The "safe speed" is determined by the speed v for which the equal sign is valid (the maximum speed):

vsafe(s, v1) = −b · δt + √(b² · δt² + v1² + 2 · b · (s − s0))    (18)
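The safe-speed relation (Eq. 18, solved from Eq. 17) can be probed numerically. The parameter values below (deceleration b, reaction time δt, standstill gap s0) are illustrative assumptions, not values from the paper.

```python
import math

# Hedged numeric check of the safe-speed relation reconstructed from
# Eq. (17): v_safe = -b*dt + sqrt(b^2*dt^2 + v1^2 + 2*b*(s - s0)).
# Parameter values are illustrative assumptions.

def v_safe(s, v1, b=2.0, dt=1.0, s0=3.0):
    # s: current gap to the leader (m), v1: leader speed (m/s),
    # b: comfortable deceleration (m/s^2), dt: reaction time (s),
    # s0: minimum standstill gap (m).
    return -b * dt + math.sqrt(b**2 * dt**2 + v1**2 + 2 * b * (s - s0))

# A stopped leader 30 m ahead (the cautious-driving distance from the text):
print(round(v_safe(30.0, 0.0), 2))
# A moving leader allows a higher safe speed at the same gap:
print(round(v_safe(30.0, 8.0), 2))
```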
What occurs when the vehicle in front automatically applies the brakes? There must be enough time (response time) to deploy an automated brake and stop within the available space without a collision. Using the 2 s rule proposed by [20], a distance of 20 m to begin braking is ideal at the simulation speed of v = 10 m/s.
Condition for the Minimum Distance y[m] from the Lead Vehicle
The merging AV chooses to enter the intersection if the gap between the lead
vehicle and the following one exceeds the computed value of y.
For AV
y =v·t (19)
where
– t[s] = Transit time of the T-junction
– v[km/h]= velocity of coming vehicle
c=v·y (20)
and
y = lcar + treaction · v + a · v² · t    (21)
where
– lcar = vehicle length
– t = reaction time
– a= deceleration rate
– v = speed
According to the analysis equations shown before, the following formula can
be used to determine the inter-vehicle distance for the various types of cars:
For HV
y = v · (t + 1.8) (22)
where the value 1.8 is the inter-vehicle time of transit for HV.
Nevertheless, taking into account the human anxiety caused by AVs, we have added a stopping distance d to ensure safety. We have

y = v · t + d    (23)
where d is the safe distance.
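The gap-acceptance rules above can be sketched together. This is a hedged illustration: the transit time t, speed v, and margin d values are assumptions chosen for the example, and Eq. (23) is read as adding the stopping distance d to the basic AV gap.

```python
# Sketch of the merge gap-acceptance rules: a merging AV accepts the gap if
# it exceeds y = v*t (Eq. 19); an HV needs y = v*(t + 1.8) (Eq. 22), 1.8 s
# being the HV inter-vehicle transit time; adding the stopping distance d
# gives the cautious AV variant (Eq. 23). Numeric values are assumptions.

def gap_needed_av(v, t):
    return v * t              # Eq. (19)

def gap_needed_hv(v, t):
    return v * (t + 1.8)      # Eq. (22)

def gap_needed_av_cautious(v, t, d):
    return v * t + d          # Eq. (23), with safety margin d

v = 10.0   # m/s, speed of the oncoming priority-road vehicle
t = 2.0    # s, assumed T-junction transit time
print(gap_needed_av(v, t))                # 20.0 m
print(gap_needed_hv(v, t))                # HVs need a larger gap than AVs
print(gap_needed_av_cautious(v, t, 5.0))  # cautious AV adds the margin d
```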
In the interest of transparency, the times required for halting, braking, and
reacting have been itemized.
ss = v0 · tl + v0² / (2 · aF)    (24)
The principle of “first in, first out” dictates that from Fig. 8, the right of
way belongs to the human driver of the first vehicle, then to the driver of the
first autonomous vehicle, and so on, in ascending order of time of arrival. Note also the similarities in the plots, which show that cars 1 and 2 driving aggressively as they approach a curve have a velocity pattern similar to that of cars 1 and 2 driving gently as they slow down to maintain a safe distance.
q(k) = vf · (k − k² / kmax)    (27)
– Populating the road network with autonomous and human-driven vehicles: for the sake of simplicity, suppose that HVs are on road A and AVs are on road B. Road A is a direct route, thus the HVs do not need to negotiate any turns or bends as they travel along it.
– While road B links or combines with road A in the middle at node 8, which
is located after a turn and at a crossroads.
– The AVs slow down significantly as they approach the curve in order to gauge
how far the other vehicle (HV) may be from the closest RVC server or node
based on how close they are to intersection node 8.
– The RVC server determines, and this is of the utmost importance, how far
apart both vehicles are from one another.
– The RVC then makes use of this information in order to award an RN to each vehicle. It shows a traffic signal to the human driver of the HV that prompts them to move, slow down, or stop, and it sends a signal to the autonomous vehicle that tells it to decelerate, keep driving, or stop.
– As a consequence, other cars following a car that slows down, whether while communicating with an RVC node, due to traffic, or while arriving at an intersection, will also slow down to obey the safe-distance model by judging how far they are from the car in front of them (which is where inter-vehicle communication applies).
– When they reach this location, two cars coming from opposite roads must
first comply with the rule of the merging algorithm before they can combine
into a platoon.
Vehicle Routing
This refers to a series of procedures that are carried out sequentially to guide cars effectively from the starting point to the destination. When a vehicle departs the node that served as its origin, it has the
option of taking any one of a number of alternative routes to reach its destination
or objective. The road node system is utilized in the process of selecting traffic
trajectories in the planned road intersection in order to transport vehicles from
the starting node to the destination nodes.
The developed traffic routing system maps the vehicle's path efficiently, beginning from the start node and linking all joining nodes as well as the destination. Determining the best route depends on the type or design of the road intersection as well as the prevailing traffic rules. The edges indicate the vehicle trajectory, and the nodes stand for lanes of traffic on the road. The reservation nodes are situated along the edges. The routing algorithm directs the vehicle path from its origin node to its destination node. Table 1 shows the routing process table dictionary, where the routing algorithm begins updating the nodes of the intersection configuration using the road node catalogue in order to compute the node routing requirement for each vehicle trajectory. A practical routing algorithm is devised following UK traffic regulations to move cars from the starting point to the destination in a realistic traffic scenario. The road-node system is used to develop the intersection-based routing protocols designed for the vehicular communication process of picking a traffic path.
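The routing-pattern mechanism guided by estimated traffic latency can be sketched as a shortest-path search over the road-node graph. The node labels and latency weights below are illustrative assumptions, not values from the paper.

```python
import heapq

# Sketch: road nodes form a weighted graph (edge weight = estimated
# traffic latency); each vehicle takes the lowest-latency path from its
# origin node to its destination node (Dijkstra's algorithm).
def route(graph, origin, destination):
    dist = {origin: 0.0}
    prev = {}
    heap = [(0.0, origin)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == destination:
            break
        if d > dist.get(node, float("inf")):
            continue                      # stale heap entry
        for nxt, latency in graph.get(node, []):
            nd = d + latency
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    # walk back from destination to origin
    path, node = [], destination
    while node != origin:
        path.append(node)
        node = prev[node]
    return [origin] + path[::-1]

# hypothetical road-node catalogue with latencies in seconds
graph = {"N1": [("N8", 2.0), ("N3", 5.0)],
         "N8": [("N3", 1.0), ("N12", 4.0)],
         "N3": [("N12", 1.0)]}
assert route(graph, "N1", "N12") == ["N1", "N8", "N3", "N12"]
```

Re-running the search as edge latencies change gives the dynamic re-routing behaviour the text describes.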
460 E. F. Ozioko et al.
In a 4-way intersection, the road node layout in Table 1 defines the road edges that identify all possible route connections from the starting node to the destination node for a given trajectory. In the simulation, the associated modules, the Layout List and the Road System, draw the road design by connecting lines between the nodes. The layout list specifies all central nodes in the road diagram, together with their coordinates. In this scenario, a dictionary is created to record the navigation pattern according to the road layout and the traffic regulations. The system that leads each vehicle from the start node to the target node is called the routing pattern mechanism, and it is guided by an estimate of the routing traffic latency. The ICU uses the road nodes to compute the following parameters: car position and the individual and total delays of each vehicle in each lane; it then determines the routes that vehicles will take. Each vehicle defines its own itinerary, beginning at the start node and ending at the destination node.
Intersection State: This is defined as a column vector of all road-lane delays. From the traffic model in Fig. 2, with vertices (road lanes) L_1 and L_2 and their corresponding connecting edges 7, 8, 9, 11, and 12, the traffic state at time t is described by

I_t = [L_{1t}, L_{2t}]^T    (28)

where L_{i-j}(t) is the delay from L_i to L_j as a function of time, representing the dynamic nature of the traffic flow.
Car Physics for Curved Movement: To simulate car movement on the curve described in Fig. 9, one needs some geometry and kinematics and must consider forces and mass. The curve movement model, illustrated in Fig. 9, describes how vehicles move to their coordinate positions. Without the curve movement model the experiment would fail, because the vehicles must keep to their lane track at all times.
Based on the above calculations, the road capacity for the different categories of cars is calculated as follows:
When the front wheels turn at an angle θ while the car maintains a constant speed, the vehicle traces a circular curved path. Maintaining a constant speed while simulating the mechanics of turning at both low and high speeds gives the best possible performance. The wheels of a car can have a velocity that does not correspond with the orientation of the wheel: at high speed, a wheel may be travelling in one direction while the body of the car is still moving in another. This means a velocity component is perpendicular to the wheel, which generates frictional forces.
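The curved movement at steering angle θ and constant speed can be sketched with a kinematic bicycle model, a common simplification that ignores the tyre slip and friction forces mentioned above. The wheelbase, speed, and steering values are illustrative assumptions.

```python
import math

# Kinematic bicycle-model sketch: with front wheels at steering angle
# theta and constant speed v, a car of wheelbase L traces a circle of
# radius R = L / tan(theta).
def step(x, y, heading, v, theta, L, dt):
    """Advance the vehicle pose by one Euler time step."""
    x += v * math.cos(heading) * dt
    y += v * math.sin(heading) * dt
    heading += (v / L) * math.tan(theta) * dt
    return x, y, heading

x, y, h = 0.0, 0.0, 0.0
v, theta, L, dt = 10.0, 0.1, 2.5, 0.01   # assumed values
for _ in range(1000):                    # simulate 10 s
    x, y, h = step(x, y, h, v, theta, L, dt)

R = L / math.tan(theta)                  # ideal turning radius
# the trajectory should stay close to the circle of radius R
# centred at (0, R) for the initial heading of 0
assert abs(math.hypot(x, y - R) - R) < 0.5
```

A smaller `dt` tightens the agreement with the ideal circle; at higher speeds a dynamic model with tyre forces would be needed, as the text notes.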
After Merging: This is when all of the vehicles have converged onto a single roadway, which they now share. From this point on, the minimum safe distance between the two types of vehicles is kept at 3 m for the following reasons: we assume that there is no overtaking and that the vehicles maintain the same relative and constant speed.
Therefore, n_cars = 2972.9/(3 + 4.5) ≈ 396.39, so n_cars = 396 for both AVs and HVs. The route can accommodate a maximum of 396 vehicles at once, including both AVs and HVs. This technique, on the other hand, is primarily focused on the safety of intersections, specifically on the question
Integration of Human-Driven and Autonomous Vehicle 463
Because of this, the car following the one in front will have a lower velocity, and as a result it will be less likely to collide with the car in front of it.
– s_safe = 3 m (converted from screen pixels as the default safe distance)
– Q_AV = s_safe + 3 = 6 m (queueing distance for autonomous vehicles)
– Q_HV = s_safe + 5 = 8 m (queueing distance for human-driven vehicles)
Note that these values pertain to the pixel representation displayed on a computer screen. The following reaction thresholds and braking forces are used for the two vehicle types:
For AVs,
– S_reaction,AV ≤ s_safe + 1 m ≤ 4 m (AV reaction threshold)
– F_reaction,AV = 60000 N (braking force)
For HVs,
– S_reaction,HV ≤ s_safe + 3 m ≤ 6 m (HV reaction threshold)
– F_reaction,HV = 72000 N (braking force)
For HVs, the lane capacity is

C_h = q_{max} = \frac{v}{v T_h + L}    (32)

and for AVs,

C_a = \frac{v}{v T_a + L}.    (33)
When HVs and AVs are coupled, one can derive the expected influence of AVs on HVs and plot it for different parameters, since the two vehicle types operate together:

\frac{C_a}{C_h} = \frac{v T_h + L}{v T_a + L}.    (34)
For a traffic mix, let n represent the ratio of AVs integrated into the road; the capacity c_m then depends on n:

c_m = \frac{v}{n v T_a + (1 - n) v T_h + L_{pkw}}.    (35)
A greater gap is maintained between an autonomous vehicle and a human-driven vehicle in order to prevent driver annoyance:

c_m = \frac{1}{n^2 v T_{aa} + n(1 - n) v T_{ah} + (1 - n) v T_{hx} + L}.    (36)
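Eqs. (32)–(35) can be checked numerically. The speed, vehicle length, and headway times below are illustrative assumptions, not the paper's calibrated values.

```python
# Numeric sketch of Eqs. (32)-(35): lane capacity as v over the
# space headway, and mixed capacity as a function of the AV ratio n.
v = 15.0      # speed (m/s), assumed
L = 4.5       # vehicle length (m), assumed
T_h = 1.5     # human headway time (s), assumed
T_a = 0.5     # AV headway time (s), assumed

c_h = v / (v * T_h + L)              # Eq. (32), human-only capacity
c_a = v / (v * T_a + L)              # Eq. (33), AV-only capacity

def c_mixed(n):
    """Eq. (35): capacity for an AV ratio n in [0, 1]."""
    return v / (n * v * T_a + (1 - n) * v * T_h + L)

# the mixed formula interpolates between the pure cases
assert abs(c_mixed(0.0) - c_h) < 1e-12
assert abs(c_mixed(1.0) - c_a) < 1e-12
assert c_mixed(0.5) > c_h            # adding AVs raises capacity
```

Because T_a < T_h, c_mixed(n) increases monotonically with n, which is the qualitative effect the surrounding text attributes to AV integration.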
4 Experiments
This scientific approach explores the various management strategies of mixed
traffic (AVs and HVs) at a road intersection in order to provide an alternative
control strategy that allows for the movement of traffic to be conducted in a
manner that is both safe and efficient. This inquiry procedure included an evaluation of the existing state-of-the-art strategies for managing mixed traffic.
The operation of the city’s road traffic system is mostly based on the control
strategy, capabilities, and the traffic lights that are in use at the junctions. The
use of a method for managing traffic is dependent on the drivers and other vehi-
cles on the road, as well as the control signals. There are a variety of different
types of traffic control media, some of which include traffic light signals, roadside
traffic signs, wireless communication, and road markings. The management of
traffic necessitates clear and effective communication between the vehicles that use the roads and the infrastructure that supports them. The plan is
to apply intersection cell reservation to mixed traffic management on the proto-
type simulator and analyze their performance in comparison to the most recent
developments in the field. The effectiveness of each control method is judged
according to the impact that its strategies have on the performance of the rele-
vant traffic parameters. The prototypical model for a road intersection consists
of the crossing over of two streets or roads that are perpendicular to one another.
When two or more automobiles drive up to a four-way stop simultaneously, the road segments maintain the same crossing angle, with traffic coming from the right having priority to go first by default. In this paradigm,
the control of the traffic light signals is shared with the control of the wireless
communication between vehicles and between vehicles and infrastructure. The
following methods of controlling traffic at intersections are utilized during the
experiments: the Traffic Light, the Collision Avoidance System, and the Cell
Reservation System.
The following traffic control strategies were evaluated for their effectiveness and safety against the research criteria, using the traffic control framework and method.
primary source of the problem. The consideration is based on two kinds of vehicles, each with a different maximum speed: slow (V_s) and fast (V_f), which denote the slow and fast vehicles, respectively.
The purpose of this experiment is to test the hypothesis that a road intersection cell with reserved space yields more effective vehicle movement. When the AVs' inter-vehicle distance is changed, the performance of HVs improves, and the vehicle occupancy time grows as the proportion of human-driven vehicles rises. An analysis of variance of the time analysis for the different ratio simulation tests is given in Table 2, which provides statistics for the variation in time occupancy with the vehicle mix ratio. This is because of the behavioural differences between cars driven by humans and those driven by autonomous systems.
Stability:
In the context of this research, traffic flow stability, as represented in Fig. 14, is analysed through the number of braking events in response to traffic volume for the different control methods under the same conditions. At road intersections, the effectiveness of traffic flow depends in part on its stability, which can be evaluated by counting the number of times a control method causes vehicles to brake. The consistency of the flow speed is a metric for traffic stability: a condition in which all vehicles move at the same optimal speed and maintain the same safe distance from one another. Speed fluctuation impacts vehicle flow stability in motion, as shown in Fig. 14. The various approaches to traffic control are associated with varying degrees of predictability. Maintaining a safe distance between vehicles requires both deceleration and acceleration, which disturbs the flow stability of the entire system.
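The stability measure described above can be sketched as counting braking events and measuring speed variance over a speed trace. The traces below are synthetic, for illustration only.

```python
# Sketch of the stability metric: number of braking events (speed drops)
# and the variance of the speed trace; lower values indicate a more
# stable flow. Speed traces are synthetic.
def stability(speeds):
    brakes = sum(1 for a, b in zip(speeds, speeds[1:]) if b < a)
    mean = sum(speeds) / len(speeds)
    var = sum((s - mean) ** 2 for s in speeds) / len(speeds)
    return brakes, var

steady = [14.0, 14.0, 14.1, 14.0, 14.1, 14.0]    # smooth flow
stop_go = [14.0, 8.0, 12.0, 5.0, 11.0, 6.0]      # disturbed flow

b1, v1 = stability(steady)
b2, v2 = stability(stop_go)
assert b1 < b2 and v1 < v2   # fewer brakes, lower variance: more stable
```

Applying the same function to the speed traces produced under each control method gives the per-method comparison the text describes.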
Discussions
The methodology that has been proposed for analysing the impact of combining
AVs and HVs will assist in determining the integration pattern of an autonomous
vehicle for the transition period between the two types of vehicles. In addition,
traffic engineers can estimate the capacity of a road intersection in a mixed-traffic environment by using the models developed in this study (Fig. 10 and Fig. 11). According to the findings of this
investigation, autonomous vehicles are not only significantly safer but also more
time-efficient and contribute to the reduction of road congestion. It is evident from Fig. 11 that intersection efficiency increases as the ratio of autonomous vehicles increases. This is because AVs combine and interpret the sensory data they receive from their surroundings to identify appropriate navigation paths, obstacles, and relevant signage. The performance metrics relating to throughput and delay described in Fig. 13 are used to measure intersection efficiency using traffic parameters. The
performance of various traffic control strategies is analyzed by using a variety of
parameter values based on simulations to see how the different parameter values
affect the throughput performance of the system.
In each simulation, the values of the vehicle mix ratio were increased in order to establish the impact of ratio variation on the integration pattern and to guide it. Under each of the three approaches to traffic control, the performance of a variety of ratio cases is analysed and compared. Because of this trend, the HV will benefit from the AV's efficiency in a scenario in which they co-exist.
6 Contributions to Knowledge
As a result of the work that was done, new knowledge founded on the knowledge that was already available has been created, as follows:
Vehicle Models
– Model varying vehicle lengths to reflect the real city traffic situation.
– Investigate the interaction between safe distance and reaction-time distribution.
– Use machine learning to manage traffic and provide realistic physics.
– Investigate non-compliance during emergencies.
8 Conclusion
supported by this body of work. It will make mixed traffic more efficient, it will
help alleviate traffic congestion at road intersections, and it will provide technical
support for future research in traffic control systems. The use of hybrid vehicles
that combine human and automated driving is gradually becoming the standard
across the globe. The widespread development and implementation of innovative
technologies in the management of vehicles and traffic will significantly advance
urban traffic control systems and provide support for the implementation of
intelligent transportation on a broad scale.
The cell reservation method was used to investigate the effect that driverless
cars would have on human-driven cars at a road intersection with merging lanes
by measuring the distance between vehicles using the inter-vehicle distance. The
vehicle occupation time was observed at a merging road as reflected in Fig. 12,
and mixed mathematical relations relating to occupation time of different vehicle
types were developed. A vehicle ratio occupancy pattern was developed as a
valuable tool for evaluating the process of integrating autonomous cars onto
public roads as a result of our findings. This pattern will serve as a basis for
future research.
The following are the most important takeaways from this research:
1. When cells at road intersections are reserved, the efficiency of the flow of
traffic is increased.
2. It has been demonstrated that the introduction of autonomous vehicles onto
public roads will have a beneficial effect on the operational effectiveness of
vehicles driven by humans.
3. The length of time that a vehicle is occupied depends on the traffic mix
ratio.
8.1 Summary
The process of integrating autonomous vehicles into existing traffic systems has
been supported by the development of related traffic technologies. This process
is essential for making full use of the advantages offered by autonomous vehicles.
In a merging T-junction, a mathematical model describes the mixed behaviour of the two vehicle types with respect to occupation time and traffic flow.
It has been observed that the ratio of autonomous vehicles to other types of
vehicles in a mixed traffic flow has an effect on the amount of time that a
vehicle spends occupied in that flow. Additionally, increasing the distance that
separates vehicles results in a higher throughput. The methodology that has been
proposed will be useful in determining the integration pattern of an autonomous
vehicle for the transition period involving mixed vehicle types. Additionally, the
models that were developed as a result of this research can be utilised by traffic
engineers in order to estimate the capacity of a merging road intersection in an
environment with mixed traffic.
According to the findings of the investigation, autonomous cars are not only
significantly safer but also more time-efficient and contribute to the reduction of
road congestion.
The work done so far represents steps toward safe and efficient mixed-traffic management schemes that will assist the implementation of a mixed-traffic environment. The objectives of this project have been identified: autonomous cars are here to stay, and it is inevitable that they will co-exist with human-driven cars. This is an essential goal because our reliance on autonomous cars is growing at an ever-increasing rate. To this end, there is a potentially fruitful method of managing mixed traffic that is amenable to implementation. The experimental results hold out hope for a traffic schedule that maintains the state of the art in the management of mixed-traffic environments.
The findings are based on an intersection that can accommodate a total
of one hundred vehicles and has a variable proportion of both driver-less and
human-operated vehicles. Looking at the results in Table 2, the research hypothesis is supported: the results demonstrate that an increase in the ratio of autonomous cars leads to a proportional decrease in simulation time. We therefore conclude that intersection efficiency increases with the ratio of autonomous cars to human-driven cars, which demonstrates that autonomous cars improve the efficiency of traffic flow. We have investigated the possible repercussions that could result from
allowing driver-less cars and vehicles driven by humans to coexist on the road.
Our evaluation was carried out using parameters that are consistent with the
actual operating environment of the city’s traffic flow system. This ensured that
our findings are as accurate as possible. However, despite their use of real-time event-driven control models, modern traffic lights are built to simulate a homogeneous traffic system. The AVHV control model, on the other hand, allows wireless communication for controlling AVs in addition to supporting a traffic schedule that includes a traffic signal light to control HVs. This control method dynamically represents a mixed-traffic system at road intersections to help plan, design, and operate traffic systems as they evolve through time, thereby improving their efficiency. The utilisation of reservation cells to improve the performance
of the traffic flow was selected as the direction for the research. Compared with other methods, such as using traffic lights or collision avoidance, the traffic flow throughput can be increased by reserving one of the twelve intersection reservation cells for a vehicle at every instance. The findings indicate that the cell reservation strategy yields a performance margin of approximately 18.2%.
Equivalence Between Classical Epidemic Model
and Quantum Tight-Binding Model
Krzysztof Pomorski1,2(B)
1 Faculty of Computer Science and Telecommunications, Technical University of Cracow,
ul. Warszawska 24, 31-155 Cracow, Poland
[email protected], [email protected]
2 Quantum Hardware Systems, ul. Babickiego 10/195, 94-056 Lodz, Poland
https://www.quantumhardwaresystems.com
The epidemic model can describe sickness propagation and various phenomena in sociology, physics, and biology. Its most basic form relies on the co-dependence of the probabilities of occurrence of states 1 and 2, which can be identified with the states of being healthy and sick, as depicted in Fig. 1. It is expressed compactly as follows:
\frac{d}{dt}\left(p_1(t)\,|1\rangle + p_2(t)\,|2\rangle\right) = \left(s_{11}(t)\,|1\rangle\langle 1| + s_{22}(t)\,|2\rangle\langle 2| + s_{12}(t)\,|2\rangle\langle 1| + s_{21}(t)\,|1\rangle\langle 2|\right)\left(p_1\,|1\rangle + p_2\,|2\rangle\right),

\frac{d}{dt}\begin{pmatrix} p_1(t) \\ p_2(t) \end{pmatrix} = \begin{pmatrix} s_{11}(t) & s_{12}(t) \\ s_{21}(t) & s_{22}(t) \end{pmatrix}\begin{pmatrix} p_1(t) \\ p_2(t) \end{pmatrix}, \qquad \frac{d}{dt}\,|\psi\rangle_{classical} = \hat{S}_t\,|\psi\rangle_{classical}.    (1)
Fig. 1. Illustration of the epidemic model as a stochastic finite state machine: a 2-level system with 2 distinguished states 1 and 2. The 4 possible transitions are characterized by 4 time-dependent coefficients s_{1→1}(t) = s_{11}(t), s_{1→2}(t) = s_{12}(t), s_{2→1}(t) = s_{21}(t), s_{2→2}(t) = s_{22}(t).
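The two-state dynamics of Eq. (1) can be sketched by forward-Euler integration of d/dt (p_1, p_2) = S (p_1, p_2). The constant rate coefficients below are illustrative assumptions; with s_21 = −s_11 and s_12 = −s_22 the column sums of S vanish, so total probability is conserved.

```python
# Euler integration of the two-state epidemic model, Eq. (1),
# with constant, assumed rate coefficients.
s11, s12, s21, s22 = -0.3, 0.2, 0.3, -0.2   # illustrative rates

p1, p2 = 1.0, 0.0       # start fully in state 1 ("healthy")
dt = 1e-4
for _ in range(100_000):                    # evolve to t = 10
    dp1 = s11 * p1 + s12 * p2
    dp2 = s21 * p1 + s22 * p2
    p1, p2 = p1 + dp1 * dt, p2 + dp2 * dt

# columns of S sum to zero here, so p1 + p2 stays 1
assert abs((p1 + p2) - 1.0) < 1e-6
assert 0.0 < p1 < 1.0 and 0.0 < p2 < 1.0
```

For these rates the distribution relaxes toward the stationary point s_11 p_1 + s_12 p_2 = 0, i.e. (p_1, p_2) → (0.4, 0.6).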
Such a system naturally evolves in a statistical environment before a measurement is made. Once the measurement is done, the statistical system state changes from an undetermined state spanned by two probabilities into the case p_1 = 1 or p_1 = 0, corresponding to the two projections:

\hat{P}_{\rightarrow 1} = |1\rangle\langle 1| = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \quad \hat{P}_{\rightarrow 2} = |2\rangle\langle 2| = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}, \quad \hat{P}_{\rightarrow 1} + \hat{P}_{\rightarrow 2} = \hat{I} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},    (2)

\hat{P}_{\rightarrow 1}\,|\psi\rangle_{classical} = |1\rangle = |\psi_1\rangle_{after}, \qquad \hat{P}_{\rightarrow 2}\,|\psi\rangle_{classical} = |2\rangle = |\psi_2\rangle_{after},    (3)

where |\psi_1\rangle_{after} and |\psi_2\rangle_{after} occur with probabilities p_1(t_{measurement}) and p_2(t_{measurement}), respectively.
We notice that the matrix \hat{S} = \begin{pmatrix} s_{11}(t) & s_{12}(t) \\ s_{21}(t) & s_{22}(t) \end{pmatrix} has 2 eigenvalues

E_1(t) = \frac{1}{2}\left[-\sqrt{(s_{11}(t)-s_{22}(t))^2 + 4 s_{12}(t) s_{21}(t)} + s_{11}(t) + s_{22}(t)\right],
E_2(t) = \frac{1}{2}\left[+\sqrt{(s_{11}(t)-s_{22}(t))^2 + 4 s_{12}(t) s_{21}(t)} + s_{11}(t) + s_{22}(t)\right]    (4)
and we have the corresponding classical eigenstates

|\psi_{E1}\rangle = \frac{2 s_{21}}{2 s_{21} + \left(-\sqrt{(s_{11}-s_{22})^2 + 4 s_{12} s_{21}} + s_{11} - s_{22}\right)} \begin{pmatrix} \frac{-\sqrt{(s_{11}-s_{22})^2 + 4 s_{12} s_{21}} + s_{11} - s_{22}}{2 s_{21}} \\ 1 \end{pmatrix},    (5)

|\psi_{E2}\rangle = \frac{2 s_{21}}{2 s_{21} + \left(+\sqrt{(s_{11}-s_{22})^2 + 4 s_{12} s_{21}} + s_{11} - s_{22}\right)} \begin{pmatrix} \frac{+\sqrt{(s_{11}-s_{22})^2 + 4 s_{12} s_{21}} + s_{11} - s_{22}}{2 s_{21}} \\ 1 \end{pmatrix}.    (6)
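Eq. (4) and the eigenvector structure of Eq. (5) can be verified numerically for a concrete choice of rates; the values below are illustrative assumptions.

```python
import math

# Numeric check of Eq. (4): eigenvalues of the 2x2 rate matrix S,
# with illustrative rate values.
s11, s12, s21, s22 = -0.3, 0.2, 0.3, -0.2

root = math.sqrt((s11 - s22) ** 2 + 4 * s12 * s21)
E1 = 0.5 * (-root + s11 + s22)      # Eq. (4), lower eigenvalue
E2 = 0.5 * (+root + s11 + s22)      # Eq. (4), upper eigenvalue

# eigenvector of Eq. (5), up to the overall normalisation prefactor:
# first component (-root + s11 - s22) / (2 s21), second component 1
v = ((-root + s11 - s22) / (2 * s21), 1.0)
Sv = (s11 * v[0] + s12 * v[1], s21 * v[0] + s22 * v[1])
assert abs(Sv[0] - E1 * v[0]) < 1e-12
assert abs(Sv[1] - E1 * v[1]) < 1e-12
```

The analogous check with the +root branch confirms the eigenvector of Eq. (6) for E_2.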
We recognize that the two states |\psi_{E1}\rangle and |\psi_{E2}\rangle are orthogonal, so \langle\psi_{E1}|\psi_{E2}\rangle = \langle\psi_{E2}|\psi_{E1}\rangle = 0. We also recognize that

\langle\psi_{E1}|\psi_{E1}\rangle = \left[\frac{-\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}}{2s_{21}+\left(-\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}\right)}\right]^2 + \left[\frac{2s_{21}}{2s_{21}+\left(-\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}\right)}\right]^2 = 1 - \frac{4s_{21}\left(-\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}\right)}{\left(2s_{21}+\left(-\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}\right)\right)^2} = n_{E1}(t),    (7)

\langle\psi_{E2}|\psi_{E2}\rangle = \left[\frac{+\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}}{2s_{21}+\left(+\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}\right)}\right]^2 + \left[\frac{2s_{21}}{2s_{21}+\left(+\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}\right)}\right]^2 = 1 - \frac{4s_{21}\left(+\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}\right)}{\left(2s_{21}+\left(+\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}\right)\right)^2} = n_{E2}(t).    (8)
|\psi(t)\rangle_{classical} = p_I(t)\,|\psi_{E1}\rangle + p_{II}(t)\,|\psi_{E2}\rangle

= \begin{pmatrix} p_I(t)\,\frac{-\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}}{2s_{21}+\left(-\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}\right)} + p_{II}(t)\,\frac{+\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}}{2s_{21}+\left(+\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}\right)} \\ p_I(t)\,\frac{2s_{21}}{2s_{21}+\left(-\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}\right)} + p_{II}(t)\,\frac{2s_{21}}{2s_{21}+\left(+\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}\right)} \end{pmatrix} = \begin{pmatrix} p_1(t) \\ p_2(t) \end{pmatrix}.    (9)
We have a superposition of states from two statistical ensembles occurring with probabilities p_I(t) and p_{II}(t), which are encoded in the directly observable probabilities p_1(t) and p_2(t). We can extract p_I(t) and p_{II}(t) from |\psi(t)\rangle_{classical} in the following way:
p_I(t) = \frac{1}{n_{E1}(t)}\,\langle\psi_{E1}|\psi(t)\rangle_{classical} = \frac{1}{n_{E1}(t)}\left(\frac{-\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}}{2s_{21}+\left(-\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}\right)}\,p_1(t) + \frac{2s_{21}}{2s_{21}+\left(-\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}\right)}\,p_2(t)\right),    (10)

p_{II}(t) = \frac{1}{n_{E2}(t)}\,\langle\psi_{E2}|\psi(t)\rangle_{classical} = \frac{1}{n_{E2}(t)}\left(\frac{+\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}}{2s_{21}+\left(+\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}\right)}\,p_1(t) + \frac{2s_{21}}{2s_{21}+\left(+\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}\right)}\,p_2(t)\right),    (11)

with n_{E1}(t) and n_{E2}(t) given by Eqs. (7) and (8).
The probabilities p_I(t) and p_{II}(t) describe the occupancy of the energy levels E_1 and E_2 in the real-time domain of the simplistic epidemic model. We have the same superposition of two eigenenergies as in the case of the quantum tight-binding model. The same reasoning can be conducted for the N-state classical epidemic model expressed as
\frac{d}{dt}\left(p_1(t)\,|1\rangle + p_2(t)\,|2\rangle + \dots + p_N(t)\,|N\rangle\right) = \big(s_{11}(t)\,|1\rangle\langle 1| + s_{12}(t)\,|2\rangle\langle 1| + s_{13}(t)\,|3\rangle\langle 1| + \dots + s_{1N}(t)\,|N\rangle\langle 1| + s_{21}(t)\,|1\rangle\langle 2| + s_{22}(t)\,|2\rangle\langle 2| + \dots + s_{2N}(t)\,|N\rangle\langle 2| + \dots + s_{N1}(t)\,|1\rangle\langle N| + \dots + s_{NN}(t)\,|N\rangle\langle N|\big)\left(p_1\,|1\rangle + p_2\,|2\rangle + \dots + p_N\,|N\rangle\right),

\frac{d}{dt}\begin{pmatrix} p_1(t) \\ p_2(t) \\ \vdots \\ p_N(t) \end{pmatrix} = \begin{pmatrix} s_{11}(t) & s_{12}(t) & \dots & s_{1N}(t) \\ s_{21}(t) & s_{22}(t) & \dots & s_{2N}(t) \\ \vdots & & & \vdots \\ s_{N1}(t) & s_{N2}(t) & \dots & s_{NN}(t) \end{pmatrix}\begin{pmatrix} p_1(t) \\ p_2(t) \\ \vdots \\ p_N(t) \end{pmatrix}, \qquad \frac{d}{dt}\,|\psi\rangle_{classical} = \hat{S}_t\,|\psi\rangle_{classical}.    (12)
with

S_{11}(t,t_0) = \int_{t_0}^{t} s_{11}(t')\,dt', \quad S_{12}(t,t_0) = \int_{t_0}^{t} s_{12}(t')\,dt', \quad S_{21}(t,t_0) = \int_{t_0}^{t} s_{21}(t')\,dt', \quad S_{22}(t,t_0) = \int_{t_0}^{t} s_{22}(t')\,dt',    (14)
and

U_{1,1}(t,t_0) = e^{\frac{S_{11}+S_{22}}{2}}\left[\frac{(S_{11}-S_{22})\,\sinh\frac{1}{2}\sqrt{(S_{11}-S_{22})^2+4S_{12}S_{21}}}{\sqrt{(S_{11}-S_{22})^2+4S_{12}S_{21}}} + \cosh\frac{1}{2}\sqrt{(S_{11}-S_{22})^2+4S_{12}S_{21}}\right],    (15)

U_{2,2}(t,t_0) = e^{\frac{S_{11}+S_{22}}{2}}\left[-\frac{(S_{11}-S_{22})\,\sinh\frac{1}{2}\sqrt{(S_{11}-S_{22})^2+4S_{12}S_{21}}}{\sqrt{(S_{11}-S_{22})^2+4S_{12}S_{21}}} + \cosh\frac{1}{2}\sqrt{(S_{11}-S_{22})^2+4S_{12}S_{21}}\right],    (16)

where all S_{ij} are evaluated at (t,t_0).
p_2(t) = e^{\frac{S_{11}+S_{22}}{2}}\left[\frac{2S_{21}\,\sinh\frac{1}{2}\sqrt{(S_{11}-S_{22})^2+4S_{12}S_{21}}}{\sqrt{(S_{11}-S_{22})^2+4S_{12}S_{21}}}\,p_1(t_0) + \left(-\frac{(S_{11}-S_{22})\,\sinh\frac{1}{2}\sqrt{(S_{11}-S_{22})^2+4S_{12}S_{21}}}{\sqrt{(S_{11}-S_{22})^2+4S_{12}S_{21}}} + \cosh\frac{1}{2}\sqrt{(S_{11}-S_{22})^2+4S_{12}S_{21}}\right)p_2(t_0)\right].    (20)
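The closed-form solution can be checked numerically: for constant rates the integrals of Eq. (14) reduce to S_ij = s_ij (t − t_0), and U_{1,1} from Eq. (15) and p_2(t) from Eq. (20) should agree with a direct Euler integration of the model. The rate values are illustrative assumptions.

```python
import math

# Check of the closed-form propagator against Euler integration,
# for constant (assumed) rates.
s11, s12, s21, s22 = -0.3, 0.2, 0.3, -0.2
t = 2.0

# Eq. (14) for constant rates: S_ij = s_ij * t
S11, S12, S21, S22 = (s * t for s in (s11, s12, s21, s22))
root = math.sqrt((S11 - S22) ** 2 + 4 * S12 * S21)
pref = math.exp((S11 + S22) / 2)
U11 = pref * ((S11 - S22) * math.sinh(root / 2) / root
              + math.cosh(root / 2))                    # Eq. (15)
p2_closed = pref * 2 * S21 * math.sinh(root / 2) / root  # Eq. (20), p2(0)=0

# Euler integration from p = (1, 0)
p1, p2 = 1.0, 0.0
dt = 1e-5
for _ in range(int(t / dt)):
    p1, p2 = (p1 + (s11 * p1 + s12 * p2) * dt,
              p2 + (s21 * p1 + s22 * p2) * dt)

assert abs(p1 - U11) < 1e-3        # p1(t) = U11 when p(0) = (1, 0)
assert abs(p2 - p2_closed) < 1e-3
```

With p(0) = (1, 0) the first column of U is the full state, so this exercises both the (15) and (20) expressions at once.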
It is useful to express the ratio of the probabilities p_1(t) and p_2(t) analytically as

r_{12}(t) = \frac{p_1(t)}{p_2(t)} = \frac{\left[(S_{11}-S_{22})\,p_1(t_0) + 2S_{12}\,p_2(t_0)\right]\tanh\frac{1}{2}\sqrt{(S_{11}-S_{22})^2+4S_{12}S_{21}} + p_1(t_0)\,\sqrt{(S_{11}-S_{22})^2+4S_{12}S_{21}}}{\left[-(S_{11}-S_{22})\,p_2(t_0) + 2S_{21}\,p_1(t_0)\right]\tanh\frac{1}{2}\sqrt{(S_{11}-S_{22})^2+4S_{12}S_{21}} + p_2(t_0)\,\sqrt{(S_{11}-S_{22})^2+4S_{12}S_{21}}}.    (21)
\left(\frac{E_1}{n_{E1}}\,|\psi_{E1}\rangle\langle\psi_{E1}| + \frac{E_2}{n_{E2}}\,|\psi_{E2}\rangle\langle\psi_{E2}|\right)\left(p_I\,|\psi_{E1}\rangle + p_{II}\,|\psi_{E2}\rangle\right) = \frac{d}{dt}\left(p_I\,|\psi_{E1}\rangle + p_{II}\,|\psi_{E2}\rangle\right) = \left(\frac{d}{dt}p_I\right)|\psi_{E1}\rangle + \left(\frac{d}{dt}p_{II}\right)|\psi_{E2}\rangle,    (22)

since E_1, E_2, |\psi_{E1}\rangle and |\psi_{E2}\rangle are time independent. Applying \langle\psi_{E1}| and \langle\psi_{E2}| on the left side and using the orthogonality relation between |\psi_{E1}\rangle and |\psi_{E2}\rangle, we obtain the set of equations

\frac{E_1}{n_{E1}}\,p_I = \frac{d}{dt}p_I, \quad \frac{E_2}{n_{E2}}\,p_{II} = \frac{d}{dt}p_{II}, \quad p_I(t) = e^{\frac{E_1}{n_{E1}}(t-t_0)}\,p_I(t_0), \quad p_{II}(t) = e^{\frac{E_2}{n_{E2}}(t-t_0)}\,p_{II}(t_0).    (23)
The sum of the probabilities is not normalized. However, the physical significance lies in the ratio of p_1(t) and p_2(t), which is expressed by

r_{12}(t) = \frac{p_1(t)}{p_2(t)} = \frac{p_I(t_0)}{p_{II}(t_0)}\,\exp\!\left(\left(\frac{E_1}{n_{E1}} - \frac{E_2}{n_{E2}}\right)(t-t_0)\right).    (24)
It means that Rabi oscillations or more precisely change of occupancy among levels
is naturally build in classical epidemic model. Still superposition of two states is main-
tained so the analogy of classical epidemic model to quantum tight-binding model is
deep.
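The exponential ratio above can be cross-checked against direct matrix-exponential evolution of the two-level epidemic model: for constant rates, \(S_{ij}(t,t_0)=s_{ij}(t-t_0)\), and the ratio \(p_1(t)/p_2(t)\) obtained from \(e^{S(t,t_0)}\) must reproduce the closed-form tanh expression of Eq. (21). A minimal numerical sketch (numpy assumed; the rate values and initial probabilities are hypothetical):

```python
import numpy as np

# Hypothetical constant rates s_ij, so that S_ij(t, t0) = s_ij * (t - t0)
s = np.array([[0.2, 0.1], [0.3, -0.1]])
tau = 1.5                      # elapsed time t - t0
p0 = np.array([0.7, 0.3])      # p1(t0), p2(t0)

S = s * tau

def expm2(X):
    """Matrix exponential via eigendecomposition (valid for diagonalizable X)."""
    w, V = np.linalg.eig(X)
    return ((V * np.exp(w)) @ np.linalg.inv(V)).real

# Direct evolution: p(t) = e^{S(t,t0)} p(t0)
p = expm2(S) @ p0
r_direct = p[0] / p[1]

# Closed-form ratio in the tanh form of Eq. (21)
D = np.sqrt((S[0, 0] - S[1, 1]) ** 2 + 4 * S[0, 1] * S[1, 0])
th = np.tanh(D / 2)
num = ((S[0, 0] - S[1, 1]) * p0[0] + 2 * S[0, 1] * p0[1]) * th + p0[0] * D
den = (-(S[0, 0] - S[1, 1]) * p0[1] + 2 * S[1, 0] * p0[0]) * th + p0[1] * D
r_analytic = num / den
```

The agreement also fixes the placement of the off-diagonal rates: the numerator of the ratio carries \(2S_{12}p_2(t_0)\) while the denominator carries \(2S_{21}p_1(t_0)\), as dictated by the matrix exponential.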
\[
\left(E_1(t)|\psi_{E1}\rangle_t\,{}_t\langle\psi_{E1}|+E_2(t)|\psi_{E2}\rangle_t\,{}_t\langle\psi_{E2}|\right)\left(p_I|\psi_{E1}\rangle_t+p_{II}|\psi_{E2}\rangle_t\right)
=\frac{d}{dt}\left(p_I|\psi_{E1}\rangle_t+p_{II}|\psi_{E2}\rangle_t\right)
=\left(\frac{d}{dt}p_I\right)|\psi_{E1}\rangle_t+\left(\frac{d}{dt}p_{II}\right)|\psi_{E2}\rangle_t. \tag{25}
\]
We obtain the set of two equations
\[
E_1(t)p_I=\langle\psi_{E1}|\frac{d}{dt}\left(p_I|\psi_{E1}\rangle\right)+\langle\psi_{E1}|\frac{d}{dt}\left(p_{II}|\psi_{E2}\rangle\right),
\]
\[
E_2(t)p_{II}=\langle\psi_{E2}|\frac{d}{dt}\left(p_I|\psi_{E1}\rangle\right)+\langle\psi_{E2}|\frac{d}{dt}\left(p_{II}|\psi_{E2}\rangle\right). \tag{26}
\]
Consequently we obtain
\[
\left(E_1(t)|\psi_{E1}\rangle\langle\psi_{E1}|+E_2(t)|\psi_{E2}\rangle\langle\psi_{E2}|+e_{12}(t)|\psi_{E2}\rangle\langle\psi_{E1}|+e_{21}(t)|\psi_{E1}\rangle\langle\psi_{E2}|\right)\left(p_I|\psi_{E1}\rangle+p_{II}|\psi_{E2}\rangle\right)
=\frac{d}{dt}\left(p_I|\psi_{E1}\rangle+p_{II}|\psi_{E2}\rangle\right). \tag{28}
\]
This equation is equivalent to the set of two coupled ordinary differential equations given as
\[
E_1(t)p_I(t)+e_{21}(t)p_{II}(t)=\langle\psi_{E1}(t)|\frac{d}{dt}\left(p_I(t)|\psi_{E1}(t)\rangle\right)+\langle\psi_{E1}(t)|\frac{d}{dt}\left(p_{II}(t)|\psi_{E2}(t)\rangle\right),
\]
\[
E_2(t)p_{II}(t)+e_{12}(t)p_I(t)=\langle\psi_{E2}(t)|\frac{d}{dt}\left(p_I(t)|\psi_{E1}(t)\rangle\right)+\langle\psi_{E2}(t)|\frac{d}{dt}\left(p_{II}(t)|\psi_{E2}(t)\rangle\right). \tag{29}
\]
\[
\frac{p_I(t)}{p_{II}(t)}=\cdots \tag{33}
\]
where
\[
g_{1,1}(t,t_0)=\int_{t_0}^{t}dt'\left[E_1(t')-\langle\psi_{E1}(t')|\frac{d}{dt'}|\psi_{E1}(t')\rangle\right],\quad
g_{1,2}(t,t_0)=\int_{t_0}^{t}dt'\left[e_{21}(t')-\langle\psi_{E1}(t')|\frac{d}{dt'}|\psi_{E2}(t')\rangle\right], \tag{35}
\]
\[
g_{2,1}(t,t_0)=\int_{t_0}^{t}dt'\left[e_{12}(t')-\langle\psi_{E2}(t')|\frac{d}{dt'}|\psi_{E1}(t')\rangle\right],\quad
g_{2,2}(t,t_0)=\int_{t_0}^{t}dt'\left[E_2(t')-\langle\psi_{E2}(t')|\frac{d}{dt'}|\psi_{E2}(t')\rangle\right]. \tag{36}
\]
\[
\langle\psi_{E2}(t')|\left(\frac{d}{dt'}|\psi_{E1}(t')\rangle\right)=
\frac{1}{N_{+}}\Big(+\sqrt{\Delta_s}+s_{11}-s_{22},\;\;2s_{21}\Big)\cdot\frac{d}{dt'}\left[\frac{1}{N_{-}}\begin{pmatrix}-\sqrt{\Delta_s}+s_{11}-s_{22}\\ 2s_{21}\end{pmatrix}\right], \tag{37}
\]
\[
\langle\psi_{E2}(t')|\left(\frac{d}{dt'}|\psi_{E2}(t')\rangle\right)=
\frac{1}{N_{+}}\Big(+\sqrt{\Delta_s}+s_{11}-s_{22},\;\;2s_{21}\Big)\cdot\frac{d}{dt'}\left[\frac{1}{N_{+}}\begin{pmatrix}+\sqrt{\Delta_s}+s_{11}-s_{22}\\ 2s_{21}\end{pmatrix}\right], \tag{38}
\]
where \(\Delta_s=(s_{11}-s_{22})^2+4s_{12}s_{21}\) and \(N_{\pm}=\sqrt{(\pm\sqrt{\Delta_s}+s_{11}-s_{22})^2+4s_{21}^2}\) are the eigenvector normalization factors, with all \(s_{ij}\) evaluated at \(t'\),
and
\[
\int_{t_0}^{t}dt'\,E_1(t')=\int_{t_0}^{t}dt'\,\frac{1}{2}\left[-\sqrt{(s_{11}(t')-s_{22}(t'))^2+4s_{12}(t')s_{21}(t')}+s_{11}(t')+s_{22}(t')\right], \tag{39}
\]
\[
\int_{t_0}^{t}dt'\,E_2(t')=\int_{t_0}^{t}dt'\,\frac{1}{2}\left[+\sqrt{(s_{11}(t')-s_{22}(t'))^2+4s_{12}(t')s_{21}(t')}+s_{11}(t')+s_{22}(t')\right]. \tag{40}
\]
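Equations (39) and (40) identify \(E_1(t')\) and \(E_2(t')\) with the two instantaneous eigenvalues of the rate matrix \([s_{ij}(t')]\), which is quick to confirm numerically (numpy assumed; the rate values are hypothetical):

```python
import numpy as np

# Hypothetical instantaneous rates s_ij(t') at a fixed time t'
s = np.array([[0.2, 0.1], [0.3, -0.1]])

disc = np.sqrt((s[0, 0] - s[1, 1]) ** 2 + 4 * s[0, 1] * s[1, 0])
E1 = 0.5 * (-disc + s[0, 0] + s[1, 1])   # integrand of Eq. (39)
E2 = 0.5 * (+disc + s[0, 0] + s[1, 1])   # integrand of Eq. (40)

# Eigenvalues of the rate matrix, sorted ascending (E1 < E2 when disc > 0)
eigvals = np.sort(np.linalg.eigvals(s).real)
```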
and
\[
G_{1,1}(t,t_0)=e^{\frac{g_{11}(t,t_0)+g_{22}(t,t_0)}{2}}\left[+\frac{(g_{11}(t,t_0)-g_{22}(t,t_0))\sinh\frac{1}{2}\Delta_g(t,t_0)}{\Delta_g(t,t_0)}+\cosh\frac{1}{2}\Delta_g(t,t_0)\right], \tag{41}
\]
\[
G_{2,2}(t,t_0)=e^{\frac{g_{11}(t,t_0)+g_{22}(t,t_0)}{2}}\left[-\frac{(g_{11}(t,t_0)-g_{22}(t,t_0))\sinh\frac{1}{2}\Delta_g(t,t_0)}{\Delta_g(t,t_0)}+\cosh\frac{1}{2}\Delta_g(t,t_0)\right], \tag{42}
\]
where \(\Delta_g(t,t_0)=\sqrt{(g_{11}(t,t_0)-g_{22}(t,t_0))^2+4g_{12}(t,t_0)g_{21}(t,t_0)}\).
\[
|V_2(t)\rangle=\begin{pmatrix}
\dfrac{\sqrt{4(S-S_{12})(S-S_{21})+(S_{11}-S_{22})^2}+S_{11}-S_{22}}{2(S-S_{21})}\\[4pt]
-1\\[2pt]
-\dfrac{\sqrt{4(S-S_{12})(S-S_{21})+(S_{11}-S_{22})^2}+S_{11}-S_{22}}{2(S-S_{21})}\\[4pt]
1
\end{pmatrix} \tag{48}
\]
\[
|V_3(t)\rangle=\begin{pmatrix}
-\dfrac{\sqrt{4(S+S_{12})(S+S_{21})+(S_{11}-S_{22})^2}+S_{11}-S_{22}}{2(S+S_{21})}\\[4pt]
1\\[2pt]
-\dfrac{\sqrt{4(S+S_{12})(S+S_{21})+(S_{11}-S_{22})^2}+S_{11}-S_{22}}{2(S+S_{21})}\\[4pt]
1
\end{pmatrix} \tag{49}
\]
\[
|V_4(t)\rangle=\begin{pmatrix}
\dfrac{\sqrt{4(S+S_{12})(S+S_{21})+(S_{11}-S_{22})^2}+S_{11}-S_{22}}{2(S+S_{21})}\\[4pt]
1\\[2pt]
\dfrac{\sqrt{4(S+S_{12})(S+S_{21})+(S_{11}-S_{22})^2}+S_{11}-S_{22}}{2(S+S_{21})}\\[4pt]
1
\end{pmatrix} \tag{50}
\]
and \(|V_3(t)\rangle\) and \(|V_4(t)\rangle\) are physically justifiable in the framework of the epidemic model and can be used for classical entanglement. We have four projectors corresponding to the measurement of \(p_{1A}\), \(p_{2A}\), \(p_{1B}\) and \(p_{2B}\), represented as matrices
\[
\hat{P}_{p1A}=\begin{pmatrix}1&0&0&0\\0&0&0&0\\0&0&1&0\\0&0&0&1\end{pmatrix},\quad
\hat{P}_{p2A}=\begin{pmatrix}0&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&1\end{pmatrix}, \tag{51}
\]
\[
\hat{P}_{p1B}=\begin{pmatrix}1&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&0\end{pmatrix},\quad
\hat{P}_{p2B}=\begin{pmatrix}1&0&0&0\\0&1&0&0\\0&0&0&0\\0&0&0&1\end{pmatrix}. \tag{52}
\]
A measurement conducted on system A also changes the state of system B, due to the presence of non-diagonal matrix elements in the system evolution equations (generalized epidemic model). It is therefore analogous to the measurement of a quantum entangled state. After the measurements we obtain the following classical states
\[
\hat{P}_{p1A}|\psi\rangle_{classical}=\begin{pmatrix}1&0&0&0\\0&0&0&0\\0&0&1&0\\0&0&0&1\end{pmatrix}\begin{pmatrix}p_{A1}(t)\\p_{A2}(t)\\p_{B1}(t)\\p_{B2}(t)\end{pmatrix}=\begin{pmatrix}1\\0\\p_{B1}(t)\\p_{B2}(t)\end{pmatrix},\quad
\hat{P}_{p2A}|\psi\rangle_{classical}=\begin{pmatrix}0&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&1\end{pmatrix}\begin{pmatrix}p_{A1}(t)\\p_{A2}(t)\\p_{B1}(t)\\p_{B2}(t)\end{pmatrix}=\begin{pmatrix}0\\1\\p_{B1}(t)\\p_{B2}(t)\end{pmatrix},
\]
\[
\hat{P}_{p1B}|\psi\rangle_{classical}=\begin{pmatrix}1&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&0\end{pmatrix}\begin{pmatrix}p_{A1}(t)\\p_{A2}(t)\\p_{B1}(t)\\p_{B2}(t)\end{pmatrix}=\begin{pmatrix}p_{A1}(t)\\p_{A2}(t)\\1\\0\end{pmatrix},\quad
\hat{P}_{p2B}|\psi\rangle_{classical}=\begin{pmatrix}1&0&0&0\\0&1&0&0\\0&0&0&0\\0&0&0&1\end{pmatrix}\begin{pmatrix}p_{A1}(t)\\p_{A2}(t)\\p_{B1}(t)\\p_{B2}(t)\end{pmatrix}=\begin{pmatrix}p_{A1}(t)\\p_{A2}(t)\\0\\1\end{pmatrix}, \tag{53}
\]
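The measurement rule of Eqs. (51)–(53) can be sketched in a few lines. Note that the printed outcomes in Eq. (53) put a definite 1 in the measured slot, so in addition to the bare matrix product a renormalization of the measured pair is assumed here; that step is our reading of Eq. (53), not something stated explicitly in the text (numpy assumed; the probability values are hypothetical):

```python
import numpy as np

# The four projectors of Eqs. (51)-(52), written as diagonal matrices
P_p1A = np.diag([1, 0, 1, 1]).astype(float)
P_p2A = np.diag([0, 1, 1, 1]).astype(float)
P_p1B = np.diag([1, 1, 1, 0]).astype(float)
P_p2B = np.diag([1, 1, 0, 1]).astype(float)

p = np.array([0.6, 0.4, 0.3, 0.7])  # (p_A1, p_A2, p_B1, p_B2), hypothetical

def measure(P, state, subsystem):
    """Apply a projector, then renormalize the measured pair to sum to 1,
    reproducing the collapsed states printed in Eq. (53)."""
    q = P @ state
    i = 0 if subsystem == "A" else 2
    q[i:i + 2] = q[i:i + 2] / q[i:i + 2].sum()
    return q
```

Measuring subsystem A leaves the B-probabilities untouched but collapses the A-pair to a definite outcome, and vice versa, mirroring the entanglement-like behavior described above.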
where \(|x_{kA}\rangle|x_{lB}\rangle\) are time-independent, \(k\) and \(l\) are 1 or 2, and \(\gamma_1(t)+\gamma_2(t)+\gamma_3(t)+\gamma_4(t)=1\). We observe that \(\hat{H}_0(t)|\psi\rangle_t=\frac{d}{dt}|\psi\rangle_t\), explicitly given as
\[
\frac{d}{dt}\begin{pmatrix}p_{1A}(t)p_{1B}(t)\\p_{1A}(t)p_{2B}(t)\\p_{2A}(t)p_{1B}(t)\\p_{2A}(t)p_{2B}(t)\end{pmatrix}
=\begin{pmatrix}
s_{11A}(t)+s_{11B}(t)&s_{12B}(t)&s_{12A}(t)&0\\
s_{21B}(t)&s_{11A}(t)+s_{22B}(t)&0&s_{12A}(t)\\
s_{21A}(t)&0&s_{22A}(t)+s_{11B}(t)&s_{12B}(t)\\
0&s_{21A}(t)&s_{21B}(t)&s_{22A}(t)+s_{22B}(t)
\end{pmatrix}\begin{pmatrix}p_{1A}(t)p_{1B}(t)\\p_{1A}(t)p_{2B}(t)\\p_{2A}(t)p_{1B}(t)\\p_{2A}(t)p_{2B}(t)\end{pmatrix}
\]
\[
=\begin{pmatrix}
s_{11A}(t)+s_{11B}(t)&s_{12B}(t)&s_{12A}(t)&0\\
s_{21B}(t)&s_{11A}(t)+s_{22B}(t)&0&s_{12A}(t)\\
s_{21A}(t)&0&s_{22A}(t)+s_{11B}(t)&s_{12B}(t)\\
0&s_{21A}(t)&s_{21B}(t)&s_{22A}(t)+s_{22B}(t)
\end{pmatrix}\begin{pmatrix}p_{IQ}(t)\\p_{IIQ}(t)\\p_{IIIQ}(t)\\p_{IVQ}(t)\end{pmatrix}
=\frac{d}{dt}\begin{pmatrix}p_{IQ}(t)\\p_{IIQ}(t)\\p_{IIIQ}(t)\\p_{IVQ}(t)\end{pmatrix}, \tag{56}
\]
where \(p_{IQ}(t)\), \(p_{IIQ}(t)\), \(p_{IIIQ}(t)\) and \(p_{IVQ}(t)\) (with \(p_{IQ}(t)=p_{1A}(t)p_{1B}(t)\), \(p_{IIQ}(t)=p_{1A}(t)p_{2B}(t)\), \(p_{IIIQ}(t)=p_{2A}(t)p_{1B}(t)\), \(p_{IVQ}(t)=p_{2A}(t)p_{2B}(t)\)) describe the probabilities of four different states of the stochastic finite-state machine. A similar situation occurs in the case of the Schroedinger equation written for two non-interacting systems, but instead of probabilities we have square roots of probabilities times phase factors. In the general case of interaction between systems A and B in the classical epidemic model we have
\[
\begin{aligned}
\hat{H}_0(t)+\hat{H}_{A-B}(t)
&=E_I(t)\,|\psi_{E1A}(t)\psi_{E1B}(t)\rangle\langle\psi_{E1A}(t)\psi_{E1B}(t)|
+E_{II}(t)\,|\psi_{E1A}(t)\psi_{E2B}(t)\rangle\langle\psi_{E1A}(t)\psi_{E2B}(t)|\\
&\;\;+E_{III}(t)\,|\psi_{E2A}(t)\psi_{E1B}(t)\rangle\langle\psi_{E2A}(t)\psi_{E1B}(t)|
+E_{IV}(t)\,|\psi_{E2A}(t)\psi_{E2B}(t)\rangle\langle\psi_{E2A}(t)\psi_{E2B}(t)|\\
&\;\;+e_{(1A,1B)\to(1A,2B)}(t)\,|\psi_{E1A}(t)\psi_{E2B}(t)\rangle\langle\psi_{E1A}(t)\psi_{E1B}(t)|
+e_{(1A,2B)\to(1A,1B)}(t)\,|\psi_{E1A}(t)\psi_{E1B}(t)\rangle\langle\psi_{E1A}(t)\psi_{E2B}(t)|\\
&\;\;+e_{(1A,1B)\to(2A,1B)}(t)\,|\psi_{E2A}(t)\psi_{E1B}(t)\rangle\langle\psi_{E1A}(t)\psi_{E1B}(t)|
+e_{(2A,1B)\to(1A,1B)}(t)\,|\psi_{E1A}(t)\psi_{E1B}(t)\rangle\langle\psi_{E2A}(t)\psi_{E1B}(t)|\\
&\;\;+e_{(1A,2B)\to(2A,2B)}(t)\,|\psi_{E2A}(t)\psi_{E2B}(t)\rangle\langle\psi_{E1A}(t)\psi_{E2B}(t)|
+e_{(2A,2B)\to(1A,2B)}(t)\,|\psi_{E1A}(t)\psi_{E2B}(t)\rangle\langle\psi_{E2A}(t)\psi_{E2B}(t)|\\
&\;\;+e_{(2A,1B)\to(2A,2B)}(t)\,|\psi_{E2A}(t)\psi_{E2B}(t)\rangle\langle\psi_{E2A}(t)\psi_{E1B}(t)|
+e_{(2A,2B)\to(2A,1B)}(t)\,|\psi_{E2A}(t)\psi_{E1B}(t)\rangle\langle\psi_{E2A}(t)\psi_{E2B}(t)|\\
&\;\;+e_{(1A,1B)\to(2A,2B)}(t)\,|\psi_{E2A}(t)\psi_{E2B}(t)\rangle\langle\psi_{E1A}(t)\psi_{E1B}(t)|
+e_{(2A,2B)\to(1A,1B)}(t)\,|\psi_{E1A}(t)\psi_{E1B}(t)\rangle\langle\psi_{E2A}(t)\psi_{E2B}(t)|.
\end{aligned} \tag{57}
\]
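The non-interacting evolution matrix of Eq. (56) is exactly a Kronecker sum, \(s_A\otimes I + I\otimes s_B\), which is why product probabilities of the two subsystems solve the four-state equation. A minimal numerical sketch under the assumption of constant rates (numpy assumed; the rate values are hypothetical):

```python
import numpy as np

# Hypothetical constant rate matrices for subsystems A and B
sA = np.array([[0.1, 0.2], [0.4, -0.3]])
sB = np.array([[-0.2, 0.3], [0.1, 0.2]])

I = np.eye(2)
# 4x4 non-interacting evolution matrix of Eq. (56): a Kronecker sum
M = np.kron(sA, I) + np.kron(I, sB)

def expm(X):
    """Matrix exponential via eigendecomposition (assumes diagonalizable X)."""
    w, V = np.linalg.eig(X)
    return ((V * np.exp(w)) @ np.linalg.inv(V)).real

pA0 = np.array([0.6, 0.4])
pB0 = np.array([0.3, 0.7])
tau = 1.2

# For non-interacting systems exp(sA (+) sB) = exp(sA) (x) exp(sB), so the
# evolved product state equals the product of the separately evolved states.
lhs = expm(M * tau) @ np.kron(pA0, pB0)
rhs = np.kron(expm(sA * tau) @ pA0, expm(sB * tau) @ pB0)
```

The first row of `M` reproduces \((s_{11A}+s_{11B},\,s_{12B},\,s_{12A},\,0)\), and the final identity relies on \(e^{A\otimes I+I\otimes B}=e^{A}\otimes e^{B}\), valid because the two Kronecker terms commute.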
The presented analytical approach can be applied to system A with four distinct states, as well as to system B with four distinct states, since an isolated system A (or B) with four states is described by a four-by-four evolution matrix that has four analytical eigenvalues and eigenstates. It becomes non-analytical for five or more distinct states, because the roots of a polynomial of order higher than four are in general non-analytical and must be found numerically, with some limited exceptions. We can write the matrix
\[
\hat{H}_{E_{IQ},..,E_{IVQ}}=\begin{pmatrix}
E_{IQ}(t)&e_{(1A,2B)\to(1A,1B)}&e_{(2A,1B)\to(1A,1B)}&e_{(2A,2B)\to(1A,1B)}\\
e_{(1A,1B)\to(1A,2B)}&E_{IIQ}(t)&e_{(2A,1B)\to(1A,2B)}&e_{(2A,2B)\to(1A,2B)}\\
e_{(1A,1B)\to(2A,1B)}&e_{(1A,2B)\to(2A,1B)}&E_{IIIQ}(t)&e_{(2A,2B)\to(2A,1B)}\\
e_{(1A,1B)\to(2A,2B)}&e_{(1A,2B)\to(2A,2B)}&e_{(2A,1B)\to(2A,2B)}&E_{IVQ}(t)
\end{pmatrix} \tag{58}
\]
Let us be motivated by work on single-electron devices [1–4, 6, 12, 15–17, 21–24]. Instead of probabilities it will be useful to operate with square roots of probabilities, as they are present in quantum mechanics and in the Schroedinger or Dirac equation. Since \(\frac{d}{dt}(\sqrt{p_1}\sqrt{p_1})=2\sqrt{p_1(t)}\frac{d}{dt}\sqrt{p_1(t)}\) and \(\frac{d}{dt}(\sqrt{p_2}\sqrt{p_2})=2\sqrt{p_2(t)}\frac{d}{dt}\sqrt{p_2(t)}\), we can rewrite the epidemic equation as
\[
\begin{pmatrix}\frac{1}{2}s_{11}(t)&\frac{1}{2}\sqrt{\frac{p_2(t)}{p_1(t)}}\,s_{12}(t)\\[4pt]\frac{1}{2}\sqrt{\frac{p_1(t)}{p_2(t)}}\,s_{21}(t)&\frac{1}{2}s_{22}(t)\end{pmatrix}
\begin{pmatrix}\sqrt{p_1(t)}\\ \sqrt{p_2(t)}\end{pmatrix}
=\frac{d}{dt}\begin{pmatrix}\sqrt{p_1(t)}\\ \sqrt{p_2(t)}\end{pmatrix}, \tag{59}
\]
which can be compared with the quantum evolution \(i\hbar\frac{d}{dt}\big(\sqrt{p_1(t)}\,e^{i\Theta_1(t)},\ \sqrt{p_2(t)}\,e^{i\Theta_2(t)}\big)^{T}\) of the corresponding wavefunction components.
\[
e^{\int_{t_0}^{t}\hat{A}(t')dt'}
\begin{pmatrix}\sqrt{p_1(t_0)}\cos(\Theta_1(t_0))\\ \sqrt{p_1(t_0)}\sin(\Theta_1(t_0))\\ \sqrt{p_2(t_0)}\cos(\Theta_2(t_0))\\ \sqrt{p_2(t_0)}\sin(\Theta_2(t_0))\end{pmatrix}
=\begin{pmatrix}\sqrt{p_1(t)}\cos(\Theta_1(t))\\ \sqrt{p_1(t)}\sin(\Theta_1(t))\\ \sqrt{p_2(t)}\cos(\Theta_2(t))\\ \sqrt{p_2(t)}\sin(\Theta_2(t))\end{pmatrix} \tag{63}
\]
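The square-root substitution behind Eq. (59) can be verified pointwise: since \(\frac{d}{dt}\sqrt{p_i}=\frac{1}{2\sqrt{p_i}}\frac{dp_i}{dt}\) with \(\frac{dp}{dt}=s\,p\), the transformed matrix acting on \((\sqrt{p_1},\sqrt{p_2})^T\) must reproduce these derivatives. A minimal sketch (numpy assumed; the rate and probability values are hypothetical):

```python
import numpy as np

s = np.array([[0.2, 0.1], [0.3, -0.1]])  # hypothetical epidemic rate matrix
p = np.array([0.7, 0.3])                  # probabilities at some time t

# Left-hand side of Eq. (59): the transformed matrix acting on (sqrt(p1), sqrt(p2))
M = np.array([
    [0.5 * s[0, 0],                        0.5 * np.sqrt(p[1] / p[0]) * s[0, 1]],
    [0.5 * np.sqrt(p[0] / p[1]) * s[1, 0], 0.5 * s[1, 1]],
])
lhs = M @ np.sqrt(p)

# Right-hand side: d/dt sqrt(p_i) = (dp_i/dt) / (2 sqrt(p_i)), with dp/dt = s p
dpdt = s @ p
rhs = dpdt / (2.0 * np.sqrt(p))
```

The two sides agree identically, confirming that Eq. (59) is an exact rewriting of the linear epidemic equation rather than an approximation.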
8 Conclusions
There are various deep analogies between classical statistical mechanics and quantum mechanics, as given by [10, 11]. The obtained results show that quantum mechanical phenomena might be almost entirely simulated by a classical statistical model. This includes quantum-like entanglement [9, 19] and superposition of states. Therefore coupled epidemic models expressed by classical systems in terms of classical physics can be the base for a possible incorporation of quantum technologies, in particular for quantum-like computation and quantum-like communication. In the conducted computations Wolfram software was used [18]. All work presented in [12, 20] can be expressed by the classical epidemic model. It is expected that time crystals can also be described in the given framework [13, 14]. It is an open issue to what extent we can parameterize various condensed matter phenomena [3–5, 7, 8, 22, 25] by a stochastic finite state machine.
References
1. Likharev, K.K.: Single-electron devices and their applications. Proc. IEEE 87, 606–632
(1999)
2. Leipold, D.: Controlled Rabi oscillations as foundation for entangled quantum aperture
logic. Seminar at UC Berkeley Quantum Labs (2018)
3. Fujisawa, T., Hayashi, T., Cheong, H.D., Jeong, Y.H., Hirayama, Y.: Rotation and phase-
shift operations for a charge qubit in a double quantum dot. Physica E Low-Dimensional
Syst. Nanostruct. 21(2–4), 1046–1052 (2004)
4. Petersson, K.D., Petta, J.R., Lu, H., Gossard, A.C.: Quantum coherence in a one-electron
semiconductor charge qubit. Phys. Rev. Lett. 105, 246804 (2010)
5. Giounanlis, P., Blokhina, E., Pomorski, K., Leipold, D.R., Staszewski, R.B.: Modeling of
semiconductor electrostatic qubits realized through coupled quantum dots. IEEE Access 7,
49262–49278 (2019)
6. Bashir, I., et al.: A mixed-signal control core for a fully integrated semiconductor quantum
computer system-on-chip. In: Proceedings of IEEE European Solid-State Circuits Confer-
ence (ESSCIRC) (2019)
7. Spalek, J.: Wstep do fizyki materii skondensowanej [Introduction to the Physics of Condensed Matter]. PWN (2015)
8. Jaynes, E.T., Cummings, F.W.: Comparison of quantum and semiclassical radiation theories
with application to the beam maser. Proc. IEEE 51(1), 89–109 (1963)
9. Angelakis, D.G., Mancini, S., Bose, S.: Steady state entanglement between hybrid light-
matter qubits. arXiv:0711.1830 (2008)
10. Wetterich, C.: Quantum mechanics from classical statistics. arxiv:0906.4919 (2009)
11. Baez, J.C., Pollard, B.S.: Quantropy. http://math.ucr.edu/home/baez/quantropy.pdf
12. Pomorski, K., Staszewski, R.B.: Analytical solutions for N-electron interacting system con-
fined in graph of coupled electrostatic semiconductor and superconducting quantum dots in
tight-binding model with focus on quantum information processing (2019). https://arxiv.org/
abs/1907.03180
13. Wilczek, F.: Quantum time crystals. Phys. Rev. Lett. 109, 160401 (2012)
14. Sacha, K., Zakrzewski, J.: Time crystals: a review. Rep. Prog. Phys. 81(1), 016401 (2017)
15. Pomorski, K., Giounanlis, P., Blokhina, E., Leipold, D., Staszewski, R.B.: Analytic view on
coupled single-electron lines. Semicond. Sci. Technol. 34(12), 125015 (2019)
16. Pomorski, K., Staszewski, R.B.: Towards quantum internet and non-local communication
in position-based qubits. AIP Conf. Proc. 2241, 020030 (2020). https://doi.org/10.1063/5.0011369, arXiv:1911.02094
17. Pomorski, K., Giounanlis, P., Blokhina, E., Leipold, D., Peczkowski, P., Staszewski, R.B.:
From two types of electrostatic position-dependent semiconductor qubits to quantum univer-
sal gates and hybrid semiconductor-superconducting quantum computer. In: Proceedings of
SPIE, vol. 11054 (2019)
18. Wolfram Mathematica. http://www.wolfram.com/mathematica/
19. Wikipedia: Bell theorem
20. Pomorski, K.: Seminars on quantum technologies at YouTube channel: quantum hardware
systems (2020). https://www.youtube.com/watch?v=Bhj ZF36APw
21. Pomorski, K.: Analytical view on non-invasive measurement of moving charge by position
dependent semiconductor qubit. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) FTC 2020. AISC,
vol. 1289, pp. 31–53. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-63089-8_3
22. Pomorski, K.: Analytical view on tunable electrostatic quantum swap gate in tight-binding
model. arXiv:2001.02513 (2019)
23. Pomorski, K.: Analytic view on N body interaction in electrostatic quantum gates and deco-
herence effects in tight-binding model. Int. J. Quantum Inf. 19(04), 2141001 (2021)
24. Pomorski, K.: Analytical view on N bodies interacting with quantum cavity in tight-binding
model. arXiv:2008.12126 (2020)
25. Pomorski, K., Peczkowski, P., Staszewski, R.: Analytical solutions for N interacting elec-
tron system confined in graph of coupled electrostatic semiconductor and superconducting
quantum dots in tight-binding model. Cryogenics 109, 103117 (2020)
Effects of Various Barricades on Human Crowd
Movement Flow
Andrew J. Park1(B) , Ryan Ficocelli2 , Lee Patterson3 , Frank Dodich3 , Valerie Spicer4 ,
and Herbert H. Tsang1
1 Trinity Western University, Langley, BC V2Y 1Y1, Canada
{a.park,herbert.tsang}@twu.ca
2 Thompson Rivers University, Kamloops, BC V2C 0C8, Canada
[email protected]
3 Justice Institute British Columbia, New Westminster, BC V3L 5T4, Canada
[email protected]
4 Simon Fraser University, Burnaby, BC V5A 1S6, Canada
Abstract. Human crowd movement flow has been studied in various disciplines
such as computing science, physics, engineering, urban planning, etc., for many
decades. Some studies focused on the management of big crowds in public events
whereas others investigated the egress of the crowd in emergency cases. Optimal
flows of a human crowd have been a particular interest among many researchers.
This paper presents how various physical barricades affect human crowd move-
ment flow using a social force model. Simulation experiments of bidirectional
crowd flows were conducted with/without barricades in a straight-line street. The
barricades with various lengths and rotations were tested to discover optimal flows
of a crowd with various densities of the crowd. The experimental results show that
setting up barricades with a particular length and rotation generates a better flow
of the crowd compared to the situations without both or either of them. This study
can help the management of the crowd in public events by setting up physical
barricades strategically to produce optimal flows of a crowd.
1 Introduction
A traditional school of thought held that a crowd is homogeneous and irrational. However, contemporary studies support the view that a crowd can be heterogeneous and rational. The flow of human crowd movement differs from that of a fluid, since each member of the crowd can make their own rational decision, although they may have a tendency to stay coherent with the crowd and become part of the crowd flow. Emotions can
influence such crowd flow, which might lead to a disastrous result such as a panic situation
in the case of an emergency. Well-planned crowd management strategies can generate
an optimal flow of a crowd and mitigate potential harms and dangers in public events.
Crowd modeling and simulation has been a popular research topic in various disciplines
including computer graphics and animation, physics, engineering, urban planning, safety
science, transportation science, etc. Some studies were interested in generating realistic
crowd movement while others examined various crowd management strategies. Still
other studies investigated the crowd emergency egress in the case of fire or any disasters.
Although their applications might be different, optimal flows of a human crowd have
been a common interest among these researchers.
Various modeling techniques have been used to model a crowd or pedestrians. Some
common techniques for crowd/pedestrian modeling are agent-based modeling, social
force modeling, cellular automata modeling, and fluid dynamic modeling [13]. This
study has used the techniques of social force modeling.
Although there have been some studies of crowd flows with physical barricades,
it seems that systematic studies of the effects of barricades with various lengths and
rotations on crowd flows are lacking. This paper tries to fill such a gap by systematically
investigating how various settings of physical barricades affect the flow of a crowd. Sim-
ulation experiments were conducted with various lengths of barricades that are rotated
at various degrees in a straight-line street where two groups of a crowd were going to
the other side from each end. The experimental results show that barricade settings with
a specific length and rotation produce a better flow of a crowd than barricades with other
lengths and rotations or no barricade at all.
The paper is organized as follows: the background section surveys the concept and
model of a crowd and reviews various crowd modeling techniques including social force
modeling. Various studies of crowd or pedestrian flows are reviewed. The section on
simulation experiments introduces the experiments conducted with barricades of vari-
ous lengths and rotations and shows the results. The discussion section has comprehen-
sive analyses of the experimental results. The future plan is proposed followed by the
acknowledgement.
2 Background
The definition of a crowd has been debated for decades. Le Bon defined a crowd in
his seminal book, “The Crowd: A Study of the Popular Mind” as “a group of individu-
als united by a common idea, belief, or ideology” [14]. A crowd may show a collective
behavior such as a protest against authorities, which seems emotionally charged and irra-
tional. The understanding of a crowd as a group of homogeneous entities was prevalent
due to early sociological writings during the French Revolution in the 1790s [11]. Contemporary studies, however, regard a crowd as a group of heterogeneous entities who are
rational and make their own decisions [3, 12]. It can be argued that both kinds of a crowd
can exist depending on what kind of events they participate in and what kind of situations
they are in. Political or sporting events can create the former kind of a crowd (traditional)
such as protests against governments or hooliganism against rival fans whereas the latter
kind of a crowd (contemporary) is common in peaceful, public events such as events that
celebrate national holidays. This study considers a peaceful crowd at a celebratory event.
When a big crowd gathers at this kind of event, strategies for managing such a crowd
need to be implemented well in advance. In particular, making optimal flows of a crowd
to the event venue and out of the venue are of importance. Our previous paper shows the
crowd swirling at the intersection when they are not managed [17]. A simple strategy of
placing a police line (a barricade) diagonally at the intersection helps a better flow of a
crowd although some members of the crowd are forced to certain directions. This study
is an extension of the former study to investigate the effects of physical barricades on
the flow of a crowd.
Various methods and techniques have been used to study crowd behaviors. Field
observation of a crowd produces ground truth. However, it is time-consuming and labor-
intensive. Recent studies show that a crowd or pedestrian movement is tracked by ana-
lyzing video records or tracking GPS information from each member’s phone [1, 24].
Controlled experiments with human participants in artificial, physical settings can be an
alternative to field observation for crowd study in a systematic way [4, 10]. Developing
mathematical or computational models of a crowd has been popular among researchers
who study crowd behaviors. Some common mathematical or computational modeling
techniques are as follows:
• Agent-Based Modeling: Each agent can represent a pedestrian with simple and nec-
essary characteristics. Multiple agents are simulated to generate emergent behaviors
[15, 19, 21].
• Social Force Modeling: Social force modeling is a kind of agent-based modeling. It
is particularly used for human crowd modeling with three forces: a force that accel-
erates towards the desired velocity of motion; a force that keeps a certain distance
between pedestrians and between pedestrians and borders; and a force that attracts
other pedestrians [8].
• Cellular Automata Modeling: Colored cells of a grid can represent pedestrians which
evolve at each discrete time step with a set of rules [18, 22, 23].
• Fluid Dynamic Modeling: A crowd with high density behaves like fluid flows [2, 5,
9].
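As an illustration of the social-force idea in the list above, here is a minimal single-agent update combining a relaxation force toward the desired velocity with exponential pairwise repulsion. All parameter values are hypothetical and not taken from the present model or from [8]:

```python
import numpy as np

def social_force_step(pos, vel, goal, others, dt=0.1):
    """One Euler step of a minimal social-force update (hypothetical parameters).

    Forces: relaxation toward the desired velocity, plus exponential
    repulsion from nearby pedestrians, in the spirit of Helbing-style models.
    """
    v_desired, tau = 1.3, 0.5          # preferred speed (m/s), relaxation time (s)
    A, B = 2.0, 0.3                    # repulsion strength and range (hypothetical)

    direction = (goal - pos) / np.linalg.norm(goal - pos)
    f = (v_desired * direction - vel) / tau          # driving force
    for q in others:                                  # pairwise repulsion
        d = pos - q
        dist = np.linalg.norm(d)
        f += A * np.exp(-dist / B) * d / dist         # points away from neighbor
    vel = vel + f * dt
    return pos + vel * dt, vel

pos, vel = np.array([0.0, 0.0]), np.array([0.0, 0.0])
goal = np.array([10.0, 0.0])
pos, vel = social_force_step(pos, vel, goal, others=[np.array([1.0, 0.5])])
```

In a full simulation this step would be applied to every agent each frame, with additional border-repulsion and attraction terms as described by the model.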
Some emergent behaviors of a crowd are observed, which include lane formation
(channeling), self-organization, swirling, and bottleneck [16].
There have been studies of crowd (pedestrian) flows with physical barricades (barri-
ers or obstacles). Either unidirectional or bidirectional flows of a crowd (pedestrians) are
observed in a field or a controlled setting or computationally simulated in a straight-line
street (or corridor), a T-junction, or an intersection with barricades with various shapes
[6, 7, 20].
This study uses social force modeling techniques to simulate a crowd with a large
number of pedestrians and generate emergent behaviors. Barricades can be either static
(physical barricades) or dynamic (human cordons). This study focuses on the effects of static physical barricades of various lengths and rotations in a straight-line street with bidirectional flows of a crowd.
that did not collide with an obstacle but are further away from the center ray than a
ray that did collide with an obstacle are not included in the average calculation. This
allows for the calculated value to be farther from any obstacles, and hence leads to more
movement away from any obstacles.
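The ray-filtering rule just described can be sketched as follows; `filtered_heading` and its ray format are illustrative reconstructions, not the authors' code. Free rays lying farther from the center ray than the nearest colliding ray are discarded before averaging:

```python
import math

def filtered_heading(rays):
    """Average heading over collision-free rays, excluding free rays that lie
    farther from the center ray than the nearest colliding ray.

    `rays` is a list of (angle_offset_rad, collided) pairs; offset 0 is the
    center ray.  Returns None when every ray collides.
    """
    hit_offsets = [abs(a) for a, hit in rays if hit]
    cutoff = min(hit_offsets) if hit_offsets else math.inf
    kept = [a for a, hit in rays if not hit and abs(a) <= cutoff]
    return sum(kept) / len(kept) if kept else None

# Obstacle on the left: rays at -0.4 and -0.2 rad collide, so the free ray
# at -0.6 (beyond the obstacle edge) is discarded before averaging.
rays = [(-0.6, False), (-0.4, True), (-0.2, True),
        (0.0, False), (0.2, False), (0.4, False)]
heading = filtered_heading(rays)
```

With an obstacle to the left (colliding rays at negative offsets), the surviving free rays average to a positive heading, i.e. the computed direction steers the agent away from the obstacle.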
4 Simulation Experiments
The system was used to simulate groups of people moving down a street. In order to
observe the effect of setting up barricades on the flow of people through the street,
immovable barricades were present in some of the simulations. The environment, which
includes a blue barricade, can be seen in Fig. 1. The number of crowd agents that move through the street is counted each second of the simulation and written into an external file in order to quantify the movement of the crowd agents through the street (Fig. 2).
Fig. 2. The simulation street environment with the spawn and destination possible location
rectangles in red
The crowd agents in the system are represented by colored capsules, which when
viewed from the top-down perspective look like circles. These crowd agents can be seen
in Fig. 3. The dark blue agents on the left side of Fig. 3 began on the left side of the street
and are moving to the right side, while the yellow agents are moving from the right side
to the left side. The coloring of the agents remains constant throughout the simulations,
so the blue agents always move from the left to the right, while the yellow agents always
move from the right to the left.
The effect of the rotation of the barricade on the crowd movement was observed
by running the simulation multiple times with the barricade in different rotations. In
each of the simulations, the barricade had a length of 30 units. This batch of simulations
included scenarios where there was no barricade, where the barricade was straight up
and down the street, where the barricade had been rotated counterclockwise by 5°, and
where the barricade had been rotated counterclockwise by 10°. The environment where
the barricade had been rotated counterclockwise by 10° can be seen in Fig. 4.
The simulation was also run several times in order to observe the effect of the length
of the barricade on the flow of people through the street. The barricade was kept in the
orientation parallel to the direction of the street in all of these experiments. The possible
barricade lengths were: no barricade, 5-unit length barricade, 15-unit length barricade,
and 30-unit length barricade. A 5-unit length barricade can be seen in Fig. 5, a 15-unit
length barricade can be seen in Fig. 6, and a 30-unit length barricade can be seen in
Fig. 3.
Fig. 4. The simulation environment with the barricade rotated counterclockwise by 10°
In both sets of experiments, the density of the agents was also varied to see if the
density of the crowd agents in the street changed the effects of the barricades. This was
accomplished by changing the number of crowd agents spawned in each spawning cycle.
The number of agents spawned was set to either 10, 75, or 100.
5 Experimental Results
In the absence of barricades, the groups of crowd agents exhibited different macroscopic
behaviors based on the density of crowd agents in the simulation. In the low-density
scenarios, where only 10 agents were spawned each spawning cycle, the crowds tended
to form straight-line groups as they moved towards each other. As these straight-line
groups meet up with each other, the crowd agents at the front of the group move around
each other, both moving to different sides. These agents continue moving forward, but
avoid moving inside the group moving in the opposite direction. Agents that follow the
head of the group exhibit cohesion towards the leader and avoidance from the agents
moving in the opposite direction, leading to a natural channel forming between the two
groups. This channelization can be seen in Fig. 7.
In the high-density scenarios, where agents are spawned in groups of 75 or 100, the
groups tend to form more spherical shapes, which can be seen in Fig. 8. This is due to
both the large number of agents present and the cohesion behavior of the agents. As these
large, spherical groups collide with each other, the two groups form one giant sphere as
the crowds attempt to move past each other, which can be seen in Fig. 9. People near
the center of this mass are unable to make meaningful progress through the group, as
they keep colliding and avoiding each other. People that are located near the edges of the
sphere are able to move past each other, as there is more room to move. This leads to the
people around the outside of the group being able to move faster than those in the center,
and as they move, they create space for the more central agents to be able to move. This
in turn means that the large crowds move past each other with the most exterior agents
leaving first, then the more central agents leaving afterwards. This phenomenon is best
seen in Fig. 10, where the blue agents can be seen moving farther to the right (towards
their destination) around the edges, and the yellow agents can be seen moving farther
to the left around the edges. By looking at Fig. 8, Fig. 9, and then Fig. 10 in sequence,
one can see how the two groups move around each other. The fact that the centers of the
two groups must wait for the peripheral crowd agents to clear before being able to move
means that the group as a whole ends up moving slower.
The clustering of the groups can be seen in the low-density scenarios as well, albeit
at a smaller scale. Even though the two groups naturally form channels to avoid each
other, at some point some of the agents will need to cross the opposite direction group
in order to reach their destination. As they move across, they cause collisions with the
other group and cause them to slow down, forming smaller spherical clusters. Although
it does not take as long for these clusters to clear due to their smaller size, they do lead to
a slower overall movement of the group, just like in the high-density cluster case. These
small clusters can be seen in Fig. 11.
The graph of the results of the simulations where the crowd was spawned in groups
of 75 and where there were straight barricades of varying lengths can be seen in Fig. 12.
The simulation where there was no barricade was the worst for the flow of crowd agents,
while the best flow was achieved when there was a 30-unit length barricade present. The
5-unit length barricade scenario led to a better flow than in the no barricade scenario, and
the 15-unit length barricade scenario had a better flow than the 5-unit length barricade.
Fig. 12. The graph of the flow of crowd agents through the street where the crowd was spawned
in groups of 75, and the simulations varied the length of the central barricade.
When looking at the graph of the results from the highest density scenarios with the
straight barricades, as in Fig. 13, it can be seen that the efficacy of the longer barricades
is more pronounced.
The graph of the results of the low-density experiments, where crowd agents were
spawned in groups of 10, where there were straight barricades of varying lengths can be
seen in Fig. 14.
The other set of simulations looked at the effect of the rotation of the barricade on the
flow of the crowd through the street, where the barricade was either not present, straight
down the street, rotated counterclockwise by 5°, or rotated counterclockwise by 10°.
The graph of the results of these simulations, where the crowd was spawned in groups
of 10, can be seen in Fig. 15. In these low-density scenarios, it can be seen in Fig. 15
that the worst-performing group was when there was no barricade, and the flow of the
crowd increased as the rotation of the barricade became greater.
While the low-density scenarios have better flows with higher rotation of the barri-
cade, the opposite is true in the high-density scenarios. The graph of the results of the
simulations where the crowds were spawned in groups of 75 can be seen in Fig. 16, and
the graph of the results of the simulations where the crowds were spawned in groups of
100 can be seen in Fig. 17. In both of the high-density scenarios, the lower the rotation
of the barricade, the better the crowd flow was through the street.
Fig. 13. The graph of the flow of crowd agents through the street where the crowd was spawned
in groups of 100, and the simulations varied the length of the central barricade.
Fig. 14. The graph of the flow of crowd agents through the street where the crowd was spawned
in groups of 10, and the simulations varied the length of the central barricade.
6 Discussion
The graphical results of the simulation experiments where the crowd was spawned in
groups of 75 and the length of the barricade was varied between experiments are shown
Fig. 15. The graph of the flow of crowd agents through the street where the crowd was spawned
in groups of 10, and the simulations varied the rotation of the central barricade.
Fig. 16. The graph of the flow of crowd agents through the street where the crowd was spawned
in groups of 75, and the simulations varied the rotation of the central barricade.
in Fig. 12. Since the 30-unit length barricade scenario had better crowd flow than the
15-unit length barricade scenario, which in turn had a better crowd flow than the 5-unit
length barricade scenario, the conclusion is that longer barricades lead to a better flow
through the street at high crowd densities. This conclusion is also supported by the fact
Fig. 17. The graph of the flow of crowd agents through the street where the crowd was spawned
in groups of 100, and the simulations varied the rotation of the central barricade.
that the scenario with no barricade was the worst-performing scenario out of all of the
scenarios in Fig. 12. While the poor performance with no barricade can be explained by
the clustering of the groups, as seen in Fig. 9, the better performance due to a barricade
can be gleaned from looking at several images of the simulations with the barricades
present. As the two groups collide, they appear to form a cluster that has been bisected
by the barricade, as seen in Fig. 18. This is because the crowd agents tend to move
towards the center of the street due to cohesion, but then spread outwards as the two
groups meet in the middle. While this by itself does not explain the better performance
when the barricades are present, the reason can be found in Fig. 19. When the two groups
meet, their furthest-forward members encounter one another, and eventually one moves
away from the barricade. Without the barricade, the other leader would likely move in
the opposite direction, but with the barricade present, they simply press against it. As
one leader moves away from the barricade, the agents that follow move to that side as
well due to cohesion, which allows the barricade-side agents to progress further. This
effectively creates a wedge-like channel through the
group, as the barricade-side group tends to follow each other. In Fig. 19, this wedging
can be seen as the blue agents are forced away from the barricade, and the yellow agents
are allowed to progress further to the right along the barricade. If the barricade is long
enough, this wedge can get through the entire group and allow for easier movement of
the crowd. Conversely, if the barricade is not long enough, the structure of the wedge
does not hold as the crowd leaves the barricade, and thus they must wait for the exterior
crowd agents to clear before moving further. Although the shorter barricades do not see
the agents through the entire cluster, they do allow for some progress to be made, which
leads to a better flow that is proportional to the length of the barricade.
Fig. 18. Two high-density crowds colliding with each other with a barricade present
Fig. 19. The yellow group presses against the barricade, allowing them to wedge past the blue
group and progress to the left.
Similar to the simulations where the crowd was spawned in groups of 75, the graph
showing the results of the simulations where the crowd was spawned in groups of 100
and the effects of varying lengths of barricades, Fig. 13, shows that longer barricades
lead to better crowd flow through the street. This is a consequence of the large cluster of
agents that forms at the higher density, which the wedging helps to pierce, thereby
facilitating movement.
Likewise, in the low-density scenarios, the worst-performing group was the one with
no barricade, while the best-performing group was the one with the longest barricade
of 30-unit length. The differences in crowd flow can be seen in Fig. 14. The
effects of the small clusters on the flow of the overall group can be best seen in the no
barricade scenario. As the clusters form, the overall movement of the group slows down,
causing the graph to go more horizontal. Then, as the clusters move past each other, the
graph becomes more vertical. This process oscillates, causing the S-shaped curve in the
graph. While clusters do form in the scenarios where there is a barricade, the wedging
phenomenon helps lessen the effect of the clusters by allowing the crowds to move past
each other. Once the crowds move past the barricades, the small clusters have more of a
slowing effect on the group movement, and thus the shorter barricades perform slightly
worse than the longer barricades, albeit not as noticeably as in the high-density scenarios.
Whether it is a high-density scenario or a low-density scenario, the results of the
experiments lead to the recommendation of the use of long barricades to facilitate group
movements down a street.
The next set of experiments looked at the effects of rotating a 30-unit length barricade
on the crowd flow through the street. The results of the low-density scenarios, where
the crowd was spawned in groups of 10, can be seen in Fig. 15. In the graph, the worst-
performing group was when there was no barricade, and the flow of the crowd increased
as the rotation of the barricade became greater. The rotation of the barricade leads the
crowds to be naturally separated from each other, as the wider side of the street funnels
the crowd to a particular side. This separation can be seen in Fig. 20. This separation
works better than no barricade due to the minimal clustering that occurs. Similarly,
the rotated barricade leads to less clustering than the straight barricade, and while the
straight barricade allows for wedging, that process is slower than if the groups never
meet. The difference between the 10° rotation and the 5° rotation comes from the point
where the crowds have moved past the barricade. As the group nears their destination,
some of them have to move across the newly spawned group to reach their destination.
This cross-flow can be seen in Fig. 21. In the 5° rotation case, this can cause some of
the newly spawned crowd to move to the “wrong side” of the barricade, the side with
the smaller opening. This means that as they move towards their destination, they are
fighting against the flow of the more numerous group, which slows down both groups.
This “wrong side” movement can be seen in Fig. 22. This “wrong side” movement is
much more frequent in the 5° rotation scenario than in the 10° rotation scenario, as there
is a larger opening in the 5° rotation scenario than in the 10° rotation scenario, which
leads to a better flow in the 10° rotation scenario than in the 5° rotation scenario.
Fig. 20. The rotated barricade leads to a separation of the two crowds.
The next two sets of simulations were where the effect of rotating the barrier on
crowd flow was measured for crowds that were spawned in groups of 75 and 100, the
graphical results of which can be seen in Fig. 16 and Fig. 17, respectively. In both high-
density scenarios, the worst flow comes when there is no barricade, which is explained
from the clustering of the groups as they meet in the middle of the street. The best flow
Fig. 21. The crowds cross each other as they emerge from the rotated barricade.
Fig. 22. Some of the crowd agents are forced to the side of the barricade with more oncoming
crowd agents.
in both scenarios came from the case where the barricade had no rotation, and the rate
of flow was lessened as the rotation of the barricade increased. This negative impact of
rotating the barricade is more pronounced in the highest density simulations, where the
crowds were spawned in groups of 100, while it is less pronounced in the simulations
where the crowds were spawned in groups of 75. The reason for this negative impact can
be seen in the bottlenecking that occurs due to the rotation, which is visible in Fig. 23.
As the barricade is rotated more, the high-density crowds end up at a bottleneck as they
try to leave the part of the street with the barricade, which causes the crowd to cluster
tightly. Since the agents cannot move past one another, they must slow down and wait
for those ahead of them to exit first, which slows the whole group and reduces the flow
of the entire crowd.
While the more rotated barricade leads to better group flow in the low-density sce-
nario, the straight barricade performed the best in the high-density scenarios. This leads
to the recommendation of using rotated barricades when a low-density crowd is
expected, and a straight barricade when a high-density crowd is expected.
Fig. 23. The high-density crowds are forced into a bottleneck due to the rotated barricade.
References
1. Blanke, U., Tröster, G., Franke, T., Lukowicz, P.: Capturing crowd dynamics at large scale
events using participatory GPS-localization. In: 2014 IEEE Ninth International Conference
on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), pp. 1–7. IEEE
(2014)
2. Farooq, M.U., Saad, M.N.B., Malik, A.S., Salih Ali, Y., Khan, S.D.: Motion estimation of
high density crowd using fluid dynamics. Imaging Sci. J. 68(3), 141–155 (2020)
3. Granovetter, M.: Threshold models of collective behavior. Am. J. Sociol. 83, 1420–1443
(1978)
4. Guo, N., Hao, Q.Y., Jiang, R., Hu, M.B., Jia, B.: Uni-and bi-directional pedestrian flow in the
view-limited condition: experiments and modeling. Transp. Res. Part C Emerg. Technol. 71,
63–85 (2016)
5. Helbing, D.: A fluid dynamic model for the movement of pedestrians. arXiv preprint cond-
mat/9805213 (1998)
6. Helbing, D., Buzna, L., Johansson, A., Werner, T.: Self-organized pedestrian crowd dynamics:
experiments, simulations, and design solutions. Transp. Sci. 39(1), 1–24 (2005)
7. Helbing, D., Farkas, I.J., Molnar, P., Vicsek, T.: Simulation of pedestrian crowds in normal
and evacuation situations. Pedestr. Evacuation Dyn. 21(2), 21–58 (2002)
8. Helbing, D., Molnar, P.: Social force model for pedestrian dynamics. Phys. Rev. E 51(5),
4282 (1995)
9. Henderson, L.F.: On the fluid mechanics of human crowd motion. Transp. Res. 8(6), 509–515
(1974)
10. Hoogendoorn, S., Daamen, W.: Self-organization in pedestrian flow. In: Hoogendoorn, S.P.,
Luding, S., Bovy, P.H.L., Schreckenberg, M., Wolf, D.E. (eds.) Traffic and Granular Flow’03,
pp. 373–382. Springer, Heidelberg (2005). https://doi.org/10.1007/3-540-28091-X_36
11. Hughes, R.L.: The flow of human crowds. Annu. Rev. Fluid Mech. 35(1), 169–182 (2003)
12. Jager, W., Popping, R., Van de Sande, H.: Clustering and fighting in two-party crowds:
Simulating the approach-avoidance conflict. J. Artif. Soc. Soc. Simul. 4(3), 1–18 (2001)
13. Johansson, A., Kretz, T.: Applied pedestrian modeling. In: Heppenstall, A., Crooks, A., See,
L., Batty, M. (eds.) Agent-Based Models of Geographical Systems, pp. 451–462. Springer,
Dordrecht (2012). https://doi.org/10.1007/978-90-481-8927-4_21
14. Le Bon, G.: The Crowd: A Study of the Popular Mind. Fischer, London (1897)
15. Macal, C.M., North, M.J.: Agent-based modeling and simulation. In: Proceedings of the 2009
Winter Simulation Conference (WSC), pp. 86–98. IEEE (2009)
16. Manocha, D., Lin, M.C.: Interactive large-scale crowd simulation. In: Arisona, S.M.,
Aschwanden, G., Halatsch, J., Wonka, P. (eds.) Digital Urban Modeling and Simulation.
CCIS, vol. 242, pp. 221–235. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-
642-29758-8_12
17. Park, A.J., Ficocelli, R., Patterson, L.D., Spicer, V., Dodich, F., Tsang, H.H.: Modelling
crowd dynamics and crowd management strategies. In: 2021 IEEE 12th Annual Information
Technology, Electronics and Mobile Communication Conference (IEMCON), pp. 0627–0632.
IEEE (2021)
18. Peng, Y.C., Chou, C.I.: Simulation of pedestrian flow through a “t” intersection: a multi-floor
field cellular automata approach. Comput. Phys. Commun. 182(1), 205–208 (2011)
19. Rozo, K.R., Arellana, J., Santander-Mercado, A., Jubiz-Diaz, M.: Modelling building emer-
gency evacuation plans considering the dynamic behaviour of pedestrians using agent-based
simulation. Saf. Sci. 113, 276–284 (2019)
20. Severiukhina, O., Voloshin, D., Lees, M.H., Karbovskii, V.: The study of the influence of
obstacles on crowd dynamics. Procedia Comput. Sci. 108, 215–224 (2017)
21. Wagner, N., Agrawal, V.: An agent-based simulation system for concert venue crowd evac-
uation modeling in the presence of a fire disaster. Expert Syst. Appl. 41(6), 2807–2815
(2014)
22. Wolfram, S.: Statistical mechanics of cellular automata. Rev. Mod. Phys. 55(3), 601 (1983)
23. Yue, H., Guan, H., Zhang, J., Shao, C.: Study on bi-direction pedestrian flow using cellular
automata simulation. Phys. A 389(3), 527–539 (2010)
24. Zacharias, J.: Pedestrian dynamics on narrow pavements in high-density Hong Kong. J. Urban
Manag. 10(4), 409–418 (2021)
The Classical Logic and the Continuous
Logic
Xiaolin Li(B)
1 Introduction
The classical logic is bivalent: it permits propositions to take only the values of truth
or falsity, with no value in between. Consequently, given a proposition
P , P or not P is always true. In the real world, however, there exist certain
propositions with variable answers, such as asking various people to identify a
color. The notion of truth does not fall by the wayside; rather, a means of
representing and reasoning over partial knowledge is afforded by aggregating all
possible outcomes into a dimensional spectrum. This leads to the so-called many-
valued logic, a propositional calculus in which there are
more than two truth values [1–5,8,9,20].
The first known classical logician who did not fully accept the law of excluded
middle was Aristotle, who is also generally considered to be the first classical
logician and the “father of logic” [5]. However, Aristotle did not create a system
of multi-valued logic to explain this isolated remark. In 1920, the Polish logician
and philosopher Jan Lukasiewicz began to create systems of many-valued logic
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 511–525, 2023.
https://doi.org/10.1007/978-3-031-18461-1_33
[2], where he introduced a “possible” value in addition to false and true values.
In 1967, Kleene introduced a three-valued logic where an “unknown” value is
used in logic inference [8].
In 1965, Lotfi A. Zadeh introduced the fuzzy set theory and fuzzy logic [18],
[19]. Fuzzy logic is a form of many-valued logic. It deals with reasoning that is
approximate rather than fixed and exact. Compared to the classical logic where
variables may take on true or false values, fuzzy logic variables may have a truth
value that ranges in degree between 0 and 1. Fuzzy logic has been extended to
handle the concept of partial truth, where the truth value may range between
complete truth and complete falsity. Since then, fuzzy logic has been studied
by many researchers and applied to many fields, from applied mathematics to
control theory and artificial intelligence [6,7,10–17,21].
In this paper, we present the concepts of continuous logic. Similar to the
fuzzy logic, the truth value of a variable in the continuous logic is within the
closed interval [0, 1], where 0 stands for complete falsity and 1 for complete
truth. We define primitive logic operators and compound logic operators that
map variables or propositions from a domain [0, 1]n to the range [0, 1], where n
is an integer and n ≥ 1. We also present various laws and rules in logic inference.
We show that the classical logic is simply a special case of this continuous logic.
The remainder of this paper is divided into six sections. Section 2 reviews the
classical logic. Section 3 defines the continuous logic. Section 4 presents inference
rules in both the classical and the continuous logic. Section 5 discusses the con-
sistency between the classical logic and the continuous logic. And finally, Sect. 6
concludes this paper.
Table 1. Truth values of the primitive logic operators

x  y  x′  y′  xy  x + y
0  0  1   1   0   0
0  1  1   0   0   1
1  0  0   1   0   1
1  1  0   0   1   1

The truth values of the primitive logic operators are defined by Table 1,
where x, y ∈ B and (x, y) ∈ B 2. Notice that not is a unary operator.
Table 2 defines the above operators. We say they are compound operators
because they can be expressed by the primitive operators:
(xy)′ = x′ + y′
(x + y)′ = x′y′
x ⊕ y = xy′ + x′y
(x ⊕ y)′ = x′y′ + xy
x → y = x′ + y
Notice that all the above operators are commutative except for the implica-
tion →, since operators · and + are commutative. It is obvious that the domain
of operators ·, +, (·)′ and (+)′ can be extended to B n , where n > 1.
We have the following laws in classical logic:
1. Double negation
x′′ = x (1)
2. Annihilation
x0 = 0
(2)
x+1=1
3. Identity
x+0=x
(3)
x1 = x
4. Idempotence
xx = x
(4)
x+x=x
5. Commutativity
x+y =y+x
(5)
xy = yx
6. Associativity
x + (y + z) = (x + y) + z
(6)
x(yz) = (xy)z
7. Absorption
x(x + y) = x
(7)
x + xy = x
8. Complement
xx′ = 0
(8)
x + x′ = 1
9. Distributivity
x(y + z) = xy + xz
(9)
x + yz = (x + y)(x + z)
10. De Morgan law
(x + y)′ = x′y′
(10)
(xy)′ = x′ + y′
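Since B = {0, 1} is finite, the laws above can be verified mechanically by enumerating all assignments. The sketch below (our illustration, not part of the paper) encodes · as min, + as max, and ′ as 1 − x, which coincide with the classical operators on B:

```python
B = (0, 1)

def lnot(x): return 1 - x           # x'
def land(x, y): return min(x, y)    # xy
def lor(x, y): return max(x, y)     # x + y

for x in B:
    for y in B:
        assert land(x, lor(x, y)) == x                          # absorption: x(x + y) = x
        assert lor(x, land(x, y)) == x                          # absorption: x + xy = x
        assert land(x, lnot(x)) == 0 and lor(x, lnot(x)) == 1   # complement
        assert lnot(lor(x, y)) == land(lnot(x), lnot(y))        # (x + y)' = x'y'
        assert lnot(land(x, y)) == lor(lnot(x), lnot(y))        # (xy)' = x' + y'
        for z in B:
            assert land(x, lor(y, z)) == lor(land(x, y), land(x, z))  # distributivity
print("all laws hold on B")
```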
Let us consider the closed continuous truth interval C = [0, 1], where 0 stands
for complete falsity and 1 for complete truth. Similar to the case of the classical
logic, we can define three primitive operators from a domain C n to the range C,
where n ≥ 1:
Definition 1. Logic not
′ : C −→ C, i.e., ∀x ∈ C,
x′ = 1 − x (11)
Definition 2. Logic and
· : C 2 −→ C, i.e., ∀(x, y) ∈ C 2 ,
x · y = min{x, y} (12)
Definition 3. Logic or
+ : C 2 −→ C, i.e., ∀(x, y) ∈ C 2 ,
x + y = max{x, y} (13)
The operators · and + extend to n variables:
x1 x2 · · · xn = min{x1 , x2 , · · · , xn }
(14)
x1 + x2 + · · · + xn = max{x1 , x2 , · · · , xn }
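As an illustration, the primitive operators and their n-ary extension can be sketched in Python (the function names are ours, not the paper's):

```python
def lnot(x):
    """Logic not: x' = 1 - x."""
    return 1 - x

def land(*xs):
    """Logic and, extended to n arguments: the minimum of the truth values."""
    return min(xs)

def lor(*xs):
    """Logic or, extended to n arguments: the maximum of the truth values."""
    return max(xs)

# Restricted to B = {0, 1}, these reduce to the classical operators.
assert land(1, 0) == 0 and lor(1, 0) == 1 and lnot(0) == 1
print(lnot(0.3), land(0.3, 0.7, 0.5), lor(0.3, 0.7, 0.5))  # 0.7 0.3 0.7
```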
Based on the above primitive operators, we can define the following com-
pound operators:
Definition 4. Not-and
(·)′ : C 2 −→ C, i.e., ∀(x, y) ∈ C 2 ,
(xy)′ = 1 − min{x, y} (15)
Definition 5. Not-or
(+)′ : C 2 −→ C, i.e., ∀(x, y) ∈ C 2 ,
(x + y)′ = 1 − max{x, y} (16)
Definition 6. Exclusive-or
⊕ : C 2 −→ C, i.e., ∀(x, y) ∈ C 2 ,
x ⊕ y = xy′ + x′y
(17)
= max{min(x, 1 − y), min(1 − x, y)}
Definition 7. Not-exclusive-or
(⊕)′ : C 2 −→ C, i.e., ∀(x, y) ∈ C 2 ,
(x ⊕ y)′ = (x′y + xy′)′
= x′y′ + xy (18)
= max{min(1 − x, 1 − y), min(x, y)}
Definition 8. Implication
→ : C 2 −→ C, i.e., ∀(x, y) ∈ C 2 ,
x → y = x′ + y
(19)
= max{1 − x, y}
It is obvious that all the above operators are commutative except for the
implication →. And the operators (·)′ and (+)′ can also be extended onto the
domain C n , where n > 2.
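A sketch of the compound operators under the min/max semantics above follows; the function names are illustrative, and each closed form restates the corresponding definition:

```python
def nand_(x, y):
    # (xy)' = 1 - min{x, y}
    return 1 - min(x, y)

def nor_(x, y):
    # (x + y)' = 1 - max{x, y}
    return 1 - max(x, y)

def xor_(x, y):
    # x xor y = xy' + x'y = max{min(x, 1 - y), min(1 - x, y)}
    return max(min(x, 1 - y), min(1 - x, y))

def xnor_(x, y):
    # (x xor y)' = x'y' + xy = max{min(1 - x, 1 - y), min(x, y)}
    return max(min(1 - x, 1 - y), min(x, y))

def implies(x, y):
    # x -> y = x' + y = max{1 - x, y}
    return max(1 - x, y)

x, y = 0.2, 0.5
print(implies(x, y))  # max(0.8, 0.5) = 0.8
print(xor_(x, y))     # max(min(0.2, 0.5), min(0.8, 0.5)) = 0.5
```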
The continuous logic keeps all laws in the classical logic:
1. Double negation
x′′ = x (20)
2. Annihilation
x0 = 0
(21)
x+1=1
3. Identity
x+0=x
(22)
x1 = x
4. Idempotence
xx = x
(23)
x+x=x
5. Commutativity
x+y =y+x
(24)
xy = yx
6. Associativity
x + (y + z) = (x + y) + z
(25)
x(yz) = (xy)z
7. Absorption
x(x + y) = x
(26)
x + xy = x
8. Complement
xx′ = min(x, 1 − x)
(27)
x + x′ = max(x, 1 − x)
9. Distributivity
x(y + z) = xy + xz
(28)
x + yz = (x + y)(x + z)
10. De Morgan law
(x + y)′ = x′y′
(29)
(xy)′ = x′ + y′
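A numerical spot check on a grid of truth values (our own sketch, with · as min, + as max, ′ as 1 − x) confirms that everything except the complement law carries over verbatim:

```python
grid = [i / 10 for i in range(11)]  # a sample of C = [0, 1]

for x in grid:
    for y in grid:
        assert min(x, max(x, y)) == x                 # absorption: x(x + y) = x
        assert max(x, min(x, y)) == x                 # absorption: x + xy = x
        assert 1 - max(x, y) == min(1 - x, 1 - y)     # De Morgan: (x + y)' = x'y'
        assert 1 - min(x, y) == max(1 - x, 1 - y)     # De Morgan: (xy)' = x' + y'
        for z in grid:
            assert min(x, max(y, z)) == max(min(x, y), min(x, z))  # distributivity

# The complement law is the one that changes: for x = 0.3,
# xx' = min(0.3, 0.7) = 0.3 (not 0) and x + x' = max(0.3, 0.7) = 0.7 (not 1).
print(min(0.3, 1 - 0.3), max(0.3, 1 - 0.3))  # 0.3 0.7
```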
4 Logic Inference
Let us consider logic inference with the classical logic and the continuous logic.
1. Modus Ponens
∀P, Q ∈ B, we have
P (P → Q) = P (P′ + Q) = P P′ + P Q = P Q (31)
When P = 1, it follows that
P (P → Q) = Q
2. Modus Tollens
∀P, Q ∈ B, we have
P → Q = Q′ → P′ (32)
P → Q = P′ + Q
= Q + P′
= (Q′)′ + P′
= Q′ → P′
3. Disjunctive Syllogism
∀P, Q ∈ B, we have
P′(P + Q) = P′P + P′Q (33)
When P = 0, it follows that
P′(P + Q) = Q
4. Hypothetical Syllogism
∀P, Q, R ∈ B, we have
P (P → Q)(Q → R) = P (P′ + Q)(Q′ + R)
(34)
= P QR
When P = Q = 1, it follows that
P (P → Q)(Q → R) = R
5. Logic Equivalence
∀P, Q ∈ B, we have
P ↔ Q = (P → Q)(Q → P )
= (P′ + Q)(Q′ + P ) (35)
= P′Q′ + P Q
It follows that
P ↔ Q = 0 when P ≠ Q, and P ↔ Q = 1 when P = Q.
6. Constructive Dilemma
∀P, Q, R, S ∈ B, we have
(P → Q)(R → S)(P + R) = (P′ + Q)(R′ + S)(P + R) (36)
When P = R = 1, it follows that
(P → Q)(R → S)(P + R) = QS
7. Destructive Dilemma
∀P, Q, R, S ∈ B, we have
(P → Q)(R → S)(Q′ + S′) = (P′ + Q)(R′ + S)(Q′ + S′) (37)
When P = R = 1, it follows that
(P → Q)(R → S)(Q′ + S′) = 0
8. Bidirectional Dilemma
∀P, Q, R, S ∈ B, we have
(P → Q)(R → S)(P + S′) = (P′ + Q)(R′ + S)(P + S′) (38)
When P = R = 1, it follows that
(P → Q)(R → S)(P + S′) = QS
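The classical rules above can likewise be confirmed by enumeration over B; a minimal sketch (our illustration) for Modus Ponens, Modus Tollens, and Disjunctive Syllogism, with implication encoded as max{1 − x, y}:

```python
def imp(x, y):          # x -> y = x' + y
    return max(1 - x, y)

B = (0, 1)
for P in B:
    for Q in B:
        # Modus Tollens: P -> Q = Q' -> P'
        assert imp(P, Q) == imp(1 - Q, 1 - P)
        if P == 1:      # Modus Ponens: P(P -> Q) = Q when P = 1
            assert min(P, imp(P, Q)) == Q
        if P == 0:      # Disjunctive Syllogism: P'(P + Q) = Q when P = 0
            assert min(1 - P, max(P, Q)) == Q
print("classical inference rules verified on B")
```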
1. Modus Ponens
∀P, Q ∈ C, we have
P (P → Q) = P (P′ + Q)
= P P′ + P Q (39)
= max{min(P, 1 − P ), min(P, Q)}
When P = 1, it follows that
P (P → Q) = Q
2. Modus Tollens
∀P, Q ∈ C, we have
P → Q = Q′ → P′ (40)
because
P → Q = P′ + Q
= Q + P′
= (Q′)′ + P′
= Q′ → P′
3. Disjunctive Syllogism
∀P, Q ∈ C, we have
P′(P + Q) = P′P + P′Q
(41)
= max{min(1 − P, P ), min(1 − P, Q)}
When P = 0, it follows that
P′(P + Q) = Q
4. Hypothetical Syllogism
∀P, Q, R ∈ C, we have
P (P → Q)(Q → R) = P (P′ + Q)(Q′ + R)
= P P′Q′ + P P′R + P QQ′ + P QR
(42)
= max{min(P, 1 − P, 1 − Q), min(P, 1 − P, R),
min(P, Q, 1 − Q), min(P, Q, R)}
When P = Q = 1, it follows that
P (P → Q)(Q → R) = R
5. Logic Equivalence
∀P, Q ∈ C, we have
P ↔ Q = (P → Q)(Q → P )
= (P′ + Q)(Q′ + P )
= P′Q′ + P′P + QQ′ + P Q (43)
= P′Q′ + P Q
= max{min(1 − P, 1 − Q), min(P, Q)}
The above equation can be proved in a way similar to the proof of Eq. (18),
see Appendix A for details. It follows that
P ↔ Q = 0 when P = 1, Q = 0 or P = 0, Q = 1
P ↔ Q = 1 when P = Q = 1 or P = Q = 0
In all other cases, 0 < P ↔ Q < 1.
6. Constructive Dilemma
∀P, Q, R, S ∈ C, we have
(P → Q)(R → S)(P + R) = (P′ + Q)(R′ + S)(P + R)
= min{max(1 − P, Q), max(1 − R, S), (44)
max(P, R)}
When P = R = 1, it follows that
(P → Q)(R → S)(P + R) = QS
7. Destructive Dilemma
∀P, Q, R, S ∈ C, we have
(P → Q)(R → S)(Q′ + S′) = (P′ + Q)(R′ + S)(Q′ + S′)
= min{max(1 − P, Q), max(1 − R, S), (45)
max(1 − Q, 1 − S)}
When P = R = 1, it follows that
(P → Q)(R → S)(Q′ + S′) = min{Q, S, max(1 − Q, 1 − S)}
= QS(Q′ + S′)
Furthermore, when Q, S ∈ B, i.e., Q = 0 or S = 0 or Q = S = 1, it follows
that
(P → Q)(R → S)(Q′ + S′) = 0
8. Bidirectional Dilemma
∀P, Q, R, S ∈ C, we have
(P → Q)(R → S)(P + S′) = (P′ + Q)(R′ + S)(P + S′)
= min{max(1 − P, Q), max(1 − R, S), (46)
max(P, 1 − S)}
When P = R = 1, it follows that
(P → Q)(R → S)(P + S′) = QS
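With fractional truth values, the same rules yield min/max expressions rather than exact conclusions; a small numerical check (our sketch, with arbitrarily chosen values) illustrates two of them:

```python
def imp(x, y):
    return max(1 - x, y)   # x -> y = x' + y

# Modus Ponens: P(P -> Q) = max{min(P, 1 - P), min(P, Q)}; only P = 1 recovers Q.
P, Q = 0.8, 0.6
assert min(P, imp(P, Q)) == max(min(P, 1 - P), min(P, Q))
assert min(1.0, imp(1.0, Q)) == Q

# Constructive dilemma with P = R = 1 reduces to QS = min(Q, S).
Q, S = 0.4, 0.9
assert min(imp(1.0, Q), imp(1.0, S), max(1.0, 1.0)) == min(Q, S)
print("continuous inference checks pass")
```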
5 Consistency
Let us check the consistency between the classical logic and the continuous logic.
Firstly, consider the three primitive operators ′, ·, and + in the continuous logic.
∀x, y ∈ C, we have
x′ = 1 − x
xy = min{x, y}
x + y = max{x, y}
When x, y ∈ B, they become the classical ′, ·, and +, and we get Table 1.
Secondly, consider the compound logic operators (·)′, (+)′, ⊕, (⊕)′, and →.
∀x, y ∈ C, we have
(xy)′ = 1 − min(x, y)
(x + y)′ = 1 − max(x, y)
x ⊕ y = max{min(x, 1 − y), min(1 − x, y)}
(x ⊕ y)′ = max{min(1 − x, 1 − y), min(x, y)}
x → y = max(1 − x, y)
When x, y ∈ B, they become the classical logic operators (·)′, (+)′, ⊕, (⊕)′,
and →, and we get Table 2.
Thirdly, consider the laws in both the classical logic and the continuous
logic. All these laws are in the same forms except the complement law. In the
continuous logic, ∀x ∈ C, we have
xx′ = min{x, 1 − x}
x + x′ = max{x, 1 − x}
When x ∈ B, it leads to the classical complement law:
xx′ = 0
x + x′ = 1
Now consider the inference rules in both the classical logic and the continuous
logic. When the involved propositions are in B, each of the inference rules in the
continuous logic leads to a corresponding rule in the classical logic. For example,
consider the equivalence rule in the continuous logic:
P ↔ Q = (P → Q)(Q → P )
= (P′ + Q)(Q′ + P )
= P′Q′ + P′P + QQ′ + P Q
= P′Q′ + P Q
= max{min(1 − P, 1 − Q), min(P, Q)}
When P, Q ∈ B, it leads to the equivalence rule in the classical logic, and we
get Table 3.
From the above discussion, one can see that the continuous logic is consistent
with the classical logic. As a matter of fact, since B ⊂ C, the classical logic is
simply a special case of this continuous logic.
Table 3. The equivalence rule in the classical logic

P  Q  P′  Q′  P → Q  Q → P  P ↔ Q
0  0  1   1   1      1      1
0  1  1   0   1      0      0
1  0  0   1   0      1      0
1  1  0   0   1      1      1
6 Conclusion
In this paper, we have presented the continuous logic. The truth values of vari-
ables and propositions in this continuous logic are within the closed interval
C = [0, 1], where 0 stands for complete falsity and 1 for complete truth. We
have defined three primitive logic operators: 1) Logic not ′ : C −→ C, 2) Logic
and · : C 2 −→ C, and 3) Logic or + : C 2 −→ C. Based on these primitive oper-
ators, we have derived some compound operators: 4) Not-and (·)′ : C 2 −→ C,
5) Not-or (+)′ : C 2 −→ C, 6) Exclusive-or ⊕ : C 2 −→ C, 7) Not-exclusive-or
(⊕)′ : C 2 −→ C, and 8) Implication → : C 2 −→ C. Many of the above operators
can be extended to operators C n −→ C, where n is an integer and n > 2.
In addition to the above operators, we have presented some laws and inference
rules in this continuous logic. The laws include 1) Double negation, 2) Annihila-
tion, 3) Identity, 4) Idempotence, 5) Commutativity, 6) Associativity, 7) Absorp-
tion, 8) Complement, 9) Distributivity, and 10) De Morgan Law. The inference
rules include 1) Modus Ponens, 2) Modus Tollens, 3) Disjunctive Syllogism, 4)
Hypothetical Syllogism, 5) Logic Equivalence, 6) Constructive Dilemma, 7) Des-
tructive Dilemma, and 8) Bidirectional Dilemma.
Furthermore, we have also checked the consistency between the classical
logic and the continuous logic. We have shown that the classical logic is simply
a special case of this continuous logic because the truth value set of the classical
logic is a subset of the truth value set of the continuous logic, i.e., B = {0, 1} ⊂
C = [0, 1].
Appendix A
Proof of Not-exclusive-or Equation
(x ⊕ y)′ = (x′y + xy′)′
= (x′y)′(xy′)′
= (x + y′)(x′ + y)
= x′x + x′y′ + xy + yy′
1. Case of x ≥ y
In this case, we have
xy = y
x′y′ = x′
It follows that
(x ⊕ y)′ = x′x + x′ + y + yy′
= (x + 1)x′ + y(1 + y′)
= x′ + y
= x′y′ + xy
2. Case of x < y
In this case, we have
xy = x
x′y′ = y′
It follows that
(x ⊕ y)′ = x′x + x + y′ + y′y
= x(x′ + 1) + y′(1 + y)
= x + y′
= xy + x′y′
Therefore, in all cases, we have
(x ⊕ y)′ = x′y′ + xy
It is equivalent to say that ∀x, y ∈ C,
max{min(x, 1 − x), min(x, y), min(1 − x, 1 − y), min(y, 1 − y)}
= max{min(x, y), min(1 − x, 1 − y)}
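The case analysis above can also be double-checked numerically; this sketch (our illustration) tests the final max/min identity on a grid of values in [0, 1]:

```python
grid = [i / 20 for i in range(21)]

for x in grid:
    for y in grid:
        lhs = max(min(x, 1 - x), min(x, y), min(1 - x, 1 - y), min(y, 1 - y))
        rhs = max(min(x, y), min(1 - x, 1 - y))
        assert lhs == rhs, (x, y)
print("identity verified on a 21 x 21 grid")
```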
Appendix B
Proof of the Distributivity Law
1. x(y + z) = xy + xz
(a) Case of x ≥ y
if y ≥ z, we have
x(y + z) = xy = y
xy + xz = y + xz = y + z = y
otherwise, we have
x(y + z) = xz
xy + xz = y + xz = xz.
Appendix C
Proof of the De Morgan Law
1. (x + y)′ = x′y′
(a) Case of x ≥ y
In this case, we have x′ ≤ y′, and hence
(x + y)′ = x′ = x′y′
(b) Case of x < y
In this case, we have x′ > y′, and hence
(x + y)′ = y′ = x′y′
In all cases, we have (x + y)′ = x′y′.
2. (xy)′ = x′ + y′
This equation can be proved in a way similar to the proof of (x + y)′ = x′y′.
References
1. Biacino, L., Gerla, G.: Fuzzy logic, continuity and effectiveness. Arch. Math. Logic
41(7), 643–667 (2002)
2. Cignoli, R.: Proper n-valued Lukasiewicz algebras as S-algebras of Lukasiewicz
n-valued propositional calculi. Stud. Logica. 41(1), 3–16 (1982)
3. Gerla, G.: Effectiveness and multivalued logics. J. Symb. Log. 71(1), 137–162
(2006)
4. Hajek, P.: Fuzzy logic and arithmetical hierarchy. Fuzzy Sets Syst. 3(8), 359–363
(1995)
5. Hurley, P.: A Concise Introduction to Logic, 9th edn. Wadsworth, Belmont (2006)
6. Ibrahim, A.M.: Introduction to Applied Fuzzy Electronics. Prentice Hall, Upper
Saddle River (1997)
7. Klir, G.J., Yuan, B.: Fuzzy Sets and Fuzzy Logic: Theory and Applications. Pren-
tice Hall, Upper Saddle River (1995)
8. Kleene, S.C.: Mathematical Logic. Dover Publications, New York (1967)
9. Mundici, D., Cignoli, R., D’Ottaviano, I.M.L.: Algebraic Foundations of Many-
Valued Reasoning. Kluwer Academic, Dodrecht (1999)
10. Novak, V.: On fuzzy type theory. Fuzzy Sets Syst. 149(2), 235–273 (2005)
11. Passino, K.M., Yurkovich, S.: Fuzzy Control. Addison-Wesley, Boston (1998)
12. Pedrycz, W., Gomide, F.: Fuzzy Systems Engineering: Toward Human-Centered
Computing. Wiley-Interscience, Hoboken (2007)
13. Pelletier, F.J.: Review of metamathematics of fuzzy logics. Bull. Symb. Log. 6(3),
342–346 (2000)
14. Santos, E.S.: Fuzzy algorithms. Inf. Control 17(4), 326–339 (1970)
15. Seising, R.: The Fuzzification of Systems. The Genesis of Fuzzy Set Theory and Its
Initial Applications – Developments up to the 1970s. Springer, New York (2007).
https://doi.org/10.1007/978-3-540-71795-9
16. Tsitolovsky, L., Sandler, U.: Neural Cell Behavior and Fuzzy Logic. Springer, New
York (2008). https://doi.org/10.1007/978-0-387-09543-1
17. Yager, R.R., Filev, D.P.: Essentials of Fuzzy Modeling and Control. Wiley, New
York (1994)
18. Zadeh, L.A.: Fuzzy sets. Inf. Control 8(3), 338–353 (1965)
19. Zadeh, L.A.: Fuzzy algorithms. Inf. Control 12(2), 94–102 (1968)
20. Zaitsev, D.A., Sarbei, V.G., Sleptsov, A.I.: Synthesis of continuous-valued logic
functions defined in tabular form. Cybern. Syst. Anal. 34(2), 190–195 (1998)
21. Zimmermann, H.: Fuzzy Set Theory and Its Applications. Kluwer Academic Pub-
lishers, Boston (2001)
Research on Diverse Feature Fusion Network
Based on Video Action Recognition
College of Tourism and Culture, Yunnan University, Lijiang 674199, Yunnan Province, China
[email protected]
1 Introduction
Action recognition is widely used in video surveillance, human-computer interaction,
health care, and other fields, and is regarded as one of the most attractive topics in the
computer vision field. Hand-crafted features such as the local binary pattern [1] and the
histogram of oriented gradients extract spatial or temporal information from action
video data and are extensively adopted for simple-scene videos, but they can hardly
be extended to scenes with complex backgrounds due to various noises. As deep
network architectures [2] emerged, deep learning has made great strides. It is important
to note, however, that unlike static image recognition, video action recognition needs
not only spatial information but also temporal information.
The dual-stream convolution network [3] not only considers the optical flow, but
also provides a fixed-weight score fusion method. The network benefits from pre-training
on common still-image datasets: the spatial stream extracts spatial features from the
RGB frames, and the temporal stream extracts temporal features from the optical flow
images. The final prediction is determined by putting the prediction results of the two
single streams into an averaging function or an SVM classifier. The temporal segment
network (TSN) [4] improves the original dual-stream network through a new sparse
sampling approach. Specifically, it divides the whole video into several segments of
equal duration and then randomly selects a short snippet from each segment. The
dual-stream convolution network produces a snippet-level prediction for each segment.
The final video-level prediction is the fusion of all snippet-level predictions through a
segmental consensus function. TSN overcomes the problem that dual-stream networks
cannot capture temporal characteristics with long-range dependence.
However, these methods still have some shortcomings: (i) fusion before the global
average pooling layer relies on an average-score mechanism, which leads to a loss of
information; and (ii) the fusion stage uses only top-level features, which contain global
information but lack local details.
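The TSN-style sparse sampling described above can be sketched as follows; frame loading and the per-snippet two-stream CNN are stubbed out, and all names are illustrative rather than taken from the original implementation:

```python
import random

def tsn_predict(frames, k, snippet_net):
    """Divide the video into k equal-duration segments, sample one snippet
    per segment, and fuse the snippet-level predictions with an averaging
    segmental consensus (the simplest consensus function)."""
    seg_len = len(frames) // k
    scores = []
    for i in range(k):
        snippet = frames[i * seg_len + random.randrange(seg_len)]
        scores.append(snippet_net(snippet))
    n_classes = len(scores[0])
    # video-level prediction: element-wise average over the k segments
    return [sum(s[c] for s in scores) / k for c in range(n_classes)]

# toy usage: 30 "frames" and a stub network emitting constant class scores
video = list(range(30))
stub_net = lambda frame: [0.2, 0.8]
print(tsn_predict(video, k=3, snippet_net=stub_net))
```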
Accordingly, a three-stream network structure for video action recognition is proposed.
Two modules are designed in the fusion stream of the network. The first module uses a
variety of compact bilinear algorithms to integrate diverse features from the multi-layer
base networks, and effectively balances feature interaction and computational cost.
The feature transformation part of this module solves the problem of underfitting and
further reduces the computational cost, while the second module refines the
concatenated features.
2 Related Work
Feichtenhofer et al. [5] proposed a method to fuse spatio-temporal information from the
dual-stream network in the convolution layers, and tested several fusion methods
including summation fusion, 2D/3D convolution fusion, bilinear fusion, etc. Unfortunately,
replacing the temporal stream with a hybrid spatio-temporal network negatively affects
the extraction of temporal features. TSN [4] makes effective use of the whole video in the
time dimension and explores various data modalities such as RGB, RGB difference,
optical flow, and warped optical flow. Liu et al. [6] innovatively introduced an implicit
dual-stream convolution network to capture the motion information between adjacent
frames. Furthermore, some works generalize the dual-stream approach from 2D networks
to 3D networks. However, 3D CNNs [7] for human action recognition do not meet
expectations due to their large number of parameters and high training-data
requirements.
Deep action recognition network architectures built from stacked convolution layers,
for instance Inception, ResNet, and DenseNet [8], extract features in a bottom-up
manner. The early layers of the network are characterized by local information, while
the top-level features contain a large amount of global information. Inception extracts
features from multi-scale receptive fields and aggregates them in each Inception
module. The small-scale convolution kernels in the Inception module reduce the
computing cost and the number of network parameters. ResNet connects input and
output features through residual connections, which helps the training of deeper
network models. However, both only take advantage of top-level features in the end.
DenseNet densely connects the features of all preceding modules. DenseNet shows
good performance on still-image recognition, but its benefit is not obvious after merging
the early features of the network into the action recognition task [5]. Moreover, the
excess features involved in the convolution calculations are inefficient.
528 C. Bin and W. Yonggang
where Net_ST represents the fusion stream network and SF(·) represents the fusion function. The overall architecture of the proposed network is shown in Fig. 1. Diversified compact bilinear fusion (DCBF) modules 1, 2 and 3 fuse pairs of spatial and temporal features into compact spatio-temporal features. DCBF module 4 further integrates these spatio-temporal features into diversified compact spatio-temporal features. The following CSA module uses adaptive weights to refine the diversified compact spatio-temporal features. The purpose of the fusion stream is to obtain information that complements the other two streams. The final classification prediction is determined from the prediction outputs of all three streams by a weighted average function.
Z = X ⊗ Y (5)
where ⊗ represents the outer product of X and Y at each pixel position, that is, XY^T, and Z represents the output feature vector of bilinear fusion. However, two problems, the computational cost and the large amount of residual noise, challenge the application of bilinear fusion. Assuming that the dimension of both input features is 10^3, the dimension of the output fusion feature will be 10^6. Bilinear fusion of two features thus requires a large computational cost, while compact bilinear fusion takes into account both feature interaction and computational cost. Compact bilinear pooling is applied to
fuse features from multiple layers of a dual-stream network. In this work, the module
is named as diversified compact bilinear fusion module. The DCBF module consists
of a single diversified compact bilinear pooling part or a diversified compact bilinear
pooling part with additional feature conversion parts. These two types of fusion modules
are shown in Fig. 2. Applying the Count Sketch projection function [12] to diversified compact bilinear pooling helps to reduce the data dimensionality and to identify the more frequent patterns in the original input data. This function efficiently projects high-dimensional input data to low-dimensional projection data.
υ ∈ R^m and ω ∈ R^n are used to represent the input feature and the projection feature, and ω is first initialized to a zero vector. Two vectors s ∈ {−1, +1}^m and h ∈ {1, · · · , n}^m are sampled. Each element of s is −1 or +1, and h samples its elements between 1 and n. Both are drawn from uniform distributions and then kept fixed. h maps each index i of the input υ to an index t of the output ω. Using t = h(i) as the index into ω, ω is updated by adding s(i) · υ(i) to ω(t), for each i ∈ {1, · · · , m}. The Count Sketch projection function Ψ is then expressed as:
ω = Ψ(υ, s, h) (6)
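The scatter-add update just described is easy to express in NumPy. The following is an illustrative sketch (the dimensions m and n and all variable shapes are chosen for the example, not taken from the paper):

```python
import numpy as np

def count_sketch(v, s, h, n):
    """Count Sketch projection (Eq. 6): for each input index i, add
    s[i] * v[i] to the output position t = h[i]."""
    w = np.zeros(n)
    np.add.at(w, h, s * v)  # unbuffered scatter-add handles repeated indices
    return w

rng = np.random.default_rng(0)
m, n = 1000, 64
s = rng.choice([-1.0, 1.0], size=m)  # sign vector, sampled once, kept fixed
h = rng.integers(0, n, size=m)       # index map from {1..m} to {1..n}
v = rng.normal(size=m)               # input feature
w = count_sketch(v, s, h, n)         # low-dimensional projection
```

Since every signed input entry lands in exactly one output bucket, the projection preserves the signed total while shrinking the dimension from m to n.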
where υ1 and υ2 represent the two input features. For convenience of calculation, a convolution in the time domain can be transformed into an element-wise product in the frequency domain by the Fourier transform and the inverse Fourier transform. According to the fast Fourier transform (FFT) [13], the output feature of compact bilinear pooling can therefore be expressed as the inverse FFT of the element-wise product of the FFTs of the two sketched inputs.
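This FFT identity can be sketched directly, combining the two Count Sketch projections by element-wise multiplication in the frequency domain (an illustrative sketch; dimensions and names are assumptions for the example):

```python
import numpy as np

def compact_bilinear(v1, v2, s1, h1, s2, h2, n):
    """Compact bilinear pooling: sketch both inputs, then circularly
    convolve the sketches via a product in the frequency domain."""
    w1, w2 = np.zeros(n), np.zeros(n)
    np.add.at(w1, h1, s1 * v1)
    np.add.at(w2, h2, s2 * v2)
    return np.real(np.fft.ifft(np.fft.fft(w1) * np.fft.fft(w2)))

rng = np.random.default_rng(1)
m, n = 512, 128
s1 = rng.choice([-1.0, 1.0], size=m)
s2 = rng.choice([-1.0, 1.0], size=m)
h1 = rng.integers(0, n, size=m)
h2 = rng.integers(0, n, size=m)
v1, v2 = rng.normal(size=m), rng.normal(size=m)
z = compact_bilinear(v1, v2, s1, h1, s2, h2, n)
```

The output stays n-dimensional, rather than the m² dimensions a full outer product would require.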
Recently, channel attention and spatial attention have been widely used in the field of computer vision. The CSA module combines the two in parallel, with both submodules working at the same time. The within-channel weights form the channel attention weight matrix, and the across-channel weights form the multi-scale spatial attention weight matrix. These two sets of weights are calculated as:
CA = δ(FC_wt1(FC_wt0(f_A)) + FC_wt1(FC_wt0(f_M)))
f_A = AvgC(f_in)
f_M = MaxC(f_in)
S_A = δ(Conv_1×1(f_2))
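A plain-NumPy sketch of the channel attention branch above, reading δ as the sigmoid and FC_wt0, FC_wt1 as a shared two-layer MLP applied to the average- and max-pooled channel descriptors (the layer sizes, bottleneck ratio, and weight scaling are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(f_in, W0, W1):
    """CA = sigmoid(MLP(f_A) + MLP(f_M)), with f_A / f_M the channel
    descriptors from spatial average and max pooling."""
    f_a = f_in.mean(axis=(1, 2))                  # AvgC(f_in): shape (C,)
    f_m = f_in.max(axis=(1, 2))                   # MaxC(f_in): shape (C,)
    mlp = lambda f: W1 @ np.maximum(0.0, W0 @ f)  # shared FC_wt0 / FC_wt1
    return sigmoid(mlp(f_a) + mlp(f_m))           # one weight per channel

rng = np.random.default_rng(2)
C, H, W, r = 16, 8, 8, 4
W0 = 0.1 * rng.normal(size=(C // r, C))           # reduction layer
W1 = 0.1 * rng.normal(size=(C, C // r))           # expansion layer
ca = channel_attention(rng.normal(size=(C, H, W)), W0, W1)
```

The resulting per-channel weights lie strictly in (0, 1) and are broadcast over the spatial map to reweight each channel.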
4 Experimental Analysis
4.1 Public Datasets
UCF101: a dataset containing 13,320 short videos collected from YouTube. It is an extension of the UCF50 dataset and preserves realistic actions in video. With 101 classes, it is a typical dataset for evaluating action recognition models.
HMDB51: a dataset that includes 6,766 short videos from a variety of sources, such as movies, YouTube videos, and Google Video. It has 51 action classes, each containing at least 101 clips.
The task list of the dataset is set up before the experiment, as shown in Table 1. This work takes the videos of the databases as the research object, but mainly studies only four simple human actions, running, walking, jumping and gestures, because their scenes are relatively simple; these four actions are detected and classified.
Firstly, the spatial stream network and the temporal stream network are constructed and trained respectively. RGB frames are extracted from the original videos with OpenCV, and optical flow images are extracted with the CUDA implementation of the TV-L1 optical flow algorithm in OpenCV. For dual-stream network training, the learning rate is initialized to 10^-3. For the UCF101 dataset, the learning rate of the spatial stream network is divided by 10 at the 40th, 80th and 120th epochs, and training of the spatial stream completes at the 150th epoch. For the temporal stream network, the learning rate is divided by 10 at the 170th, 280th and 320th epochs, and training completes at the 350th epoch. Some adjustments were made for the HMDB51 dataset because it is harder to train on. There, the learning rate changes at epochs 60, 120 and 160, and training of the spatial stream ends at the 200th epoch. For the temporal stream, the learning rate changes at epochs 170, 280 and 320, and training ends at the 400th epoch.
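These step schedules ("divide the learning rate by 10 at given epochs") can be captured by a small helper; the default milestones below follow the UCF101 spatial-stream settings quoted above:

```python
def lr_at(epoch, base_lr=1e-3, milestones=(40, 80, 120)):
    """Step decay: divide the learning rate by 10 at each milestone epoch."""
    return base_lr * 0.1 ** sum(epoch >= m for m in milestones)
```

For example, `lr_at(0)` returns the initial 10^-3, while past the last milestone `lr_at(130)` has decayed to 10^-6.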
The training process of the fusion stream network is divided into two stages: a freezing phase and a thawing phase. In the freezing phase, all parameters of the dual-stream network are frozen and no gradients are back-propagated through it, which preserves the integrity of the feature extractor and accelerates fusion stream training. In the thawing phase, the parameters of the two streams are thawed and gradients are back-propagated through the whole three-stream network. The batch size is set to 60 for the freezing phase and 15 for the thawing phase. The dropout rate in both stages is 0.5. For the UCF101 dataset, the learning rate in the freezing phase is initialized to 10^-3 and divided by 10 at the 50th, 100th and 150th epochs, with a maximum of 200 epochs. The initial learning rate of the thawing phase is 10^-4, changing at epochs 50 and 100, with a maximum of 150 epochs. For HMDB51, the learning rate changes at epochs 80, 160 and 230, with training ending at the 300th epoch of the freezing phase, and at epochs 80, 160 and 200 of the thawing phase. The final predicted score is the weighted average of the scores from the three streams: the spatial stream weight is set to 0.5, the temporal stream to 2.0, and the fusion stream to 1.0. In the test phase, the scores of 25 evenly spaced segments sampled across the whole video are averaged to obtain the final score of each stream.
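The final score fusion is a weighted average over the three streams; a minimal sketch with the weights stated above (the class scores are made-up toy values):

```python
import numpy as np

STREAM_WEIGHTS = {"spatial": 0.5, "temporal": 2.0, "fusion": 1.0}

def fuse_scores(scores):
    """Weighted average of per-class score vectors from the three streams."""
    total = sum(STREAM_WEIGHTS[name] * s for name, s in scores.items())
    return total / sum(STREAM_WEIGHTS.values())

scores = {
    "spatial":  np.array([0.2, 0.8]),
    "temporal": np.array([0.6, 0.4]),
    "fusion":   np.array([0.3, 0.7]),
}
fused = fuse_scores(scores)
pred = int(np.argmax(fused))  # class 1 here: 1.9/3.5 > 1.6/3.5
```

Because the temporal stream carries the largest weight, its scores dominate ties between the other two streams.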
The proposed model is compared with current state-of-the-art models on the UCF101 and HMDB51 datasets. The number of segments of the model is set to 3, and the compared methods are divided into a hand-crafted feature group and a deep learning feature group. The comparison results are shown in Fig. 3. It can be seen that most deep learning feature methods are superior to hand-crafted feature methods in terms of accuracy. The proposed algorithm achieves accuracies of 94.96% and 95.27% in the freezing phase and thawing phase on the UCF101 dataset, respectively, which is better than the other algorithms. Correspondingly, on the HMDB51 dataset, the proposed model reaches 71.09% and 71.33% in the two stages, respectively.
model for this article. All three models retain the CSA module. The accuracy of Level
1 in the two stages is about 94.32% and 94.64%, respectively. Level 2 is 0.20% higher
than Level 1 in the freezing phase and 0.34% higher than Level 1 in the thawing phase.
The accuracy of Level 3 is 0.27% and 0.24% higher than that of the others.
5 Conclusion
In this paper, a new video-based action recognition network, a multi-feature fusion network, is proposed. A diversified compact bilinear algorithm is used to fuse the spatial and temporal features acquired before the global average pooling layers of the two single streams, forming what is called the fusion stream. In addition, to retain local detail information, it combines the spatio-temporal features corresponding to the multiple layers of the two streams, called diversified features. Channel attention and multi-scale spatial attention are connected in parallel to obtain a set of weights that better select the informative feature parts. Experiments on real datasets show that the proposed model performs well.
Video action recognition is an important research topic in the field of computer vision. This study improves video action recognition to some extent, but much remains to be done. Future research may focus on: (1) occlusion in complex videos; (2) the analysis of local joint movements; and (3) the fusion of joint features and hand-crafted appearance features.
References
1. Deng, H., Jun, K., Min, J., Liu, T.: Diverse Features Fusion Network for video-based action
recognition (2021)
2. Wei, X., Yu, X., Zhang, P., Zhi, L., Yang, F.: CNN hyperspectral image classification combined
with local binary mode. J. Remote Sens. 24(8), 107–113 (2020)
3. Lu, Y.: Research on Deep Neural Network Architecture Improvement and Training Perfor-
mance Improvement. North Central University
4. Li, Q., Li, A., Wang, T., et al.: Behavioral recognition combining ordered optical stream graph
and dual-stream convolutional network. J. Opt. 38(6), 7 (2018)
5. Wang, L., Xiong, Y., Wang, Z., et al.: Temporal Segment Networks: Towards Good Practices
for Deep Action Recognition. Springer, Cham (2016)
6. Zhang, J., Qinke, P., Sun, S., Liu, C.: Collaborative filtering recommendation algorithm based
on user preference derived from item domain features (2014)
7. Dean, J., Ghemawat, S., et al.: MapReduce: simplified data processing on large clusters.
Commun. ACM, 51(1), 107-113 (2008)
8. Liu, H., Tu, J., Liu, M.: Dual-stream 3D convolutional neural network for skeleton-based
action recognition (2017)
9. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action
recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2012)
10. Huang, G., Liu, Z., Laurens, V., et al.: Densely connected convolutional networks. IEEE
Comput. Soc. (2016)
11. Yin, W., Schütze, H., Xiang, B., et al.: ABCNN: attention-based convolutional neural network
for modeling sentence Pairs. Comput. Sci. (2015)
12. Lei, S., Yi, W., Ying, C., et al.: A review of attention mechanism research in natural language
processing. Data Anal. Knowl. Discov. 4(5), 14 (2020)
13. Subakan, C., Ravanelli, M., Cornell, S., et al.: Attention is all you need in speech separation.
In: ICASSP 2021 -2021 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE (2021)
14. Caracas, A., Kind, A., Gantenbein, D., et al.: Mining semantic relations using NetFlow. In:
IEEE/IFIP International Workshop on Business-driven It Management. IEEE (2008)
Uncertainty-Aware Hierarchical Reinforcement Learning Robust to Noisy Observations
1 Introduction
Reinforcement learning (RL) has regained attention in the last years with
achievements that range from learning to play video games from scratch [18]
to beating masters in the game of Go [22,27,28]. All of this was possible due to
the integration of Deep Neural Networks (DNNs) that can be efficiently used as
feature extractors and function approximators for high-dimensional problems.
However, despite these impressive accomplishments, RL is still overlooked as a
viable solution for controlling dynamic systems in real-world applications. The
lack of strong safety guarantees, insufficient robustness, and weak generalization can be indicated as some of the reasons for not choosing RL.
However, safety-related properties are not the only requirements for RL systems. The
challenges of learning from complex environments are diverse, ranging from dif-
ficulties in dealing with large state spaces and high-dimensional data encodings
to dealing with temporal abstraction and increasing sample efficiency. Research
in the field of hierarchical reinforcement learning (HRL) alongside evidence col-
lected from studies in neuroscience and behavioral psychology suggest that hier-
archical structures can help RL tackle these issues [3,4,21].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 538–547, 2023.
https://doi.org/10.1007/978-3-031-18461-1_35
2 Related Work
Hierarchical reinforcement learning is credited as a good representation of the
human biological brain structure and of our reasoning process [5,6,21]. HRL’s
capacity to excel in complex problems is also outlined in a myriad of publications
(e.g. [20,32]). One important HRL formulation is given by the options framework
[29], which still influences (directly or indirectly) most of the state-of-the-art
HRL models [2]. The importance of achieving temporal abstraction with RL
agents is not only highlighted by the options framework but by other works that
have also addressed this topic [13].
Two-level HRL architectures are quite popular due to their simple structure, which helps in designing and deploying the model. They consist of a high-level model responsible for breaking the problem down into sub-goals, and a low-level controller responsible for accomplishing the determined sub-goals. An issue with this approach comes from the policies working at different temporal abstractions, which makes learning a complicated task considering that the environment outputs a single feedback signal. The authors of [15] show how intrinsic motivation can be used to train the low-level controller while the high-level system learns directly through the extrinsic reward given by the environment. The authors of [19] present a similar HRL structure.
Uncertainty in deep learning models is now thoroughly studied, since there is a growing appeal for deploying such systems in real (and potentially safety-critical) environments. The authors of [1] present an overview of different methods used for uncertainty quantification in deep learning. The authors of [10] show a comparison between different methods available for uncertainty estimation. In RL, uncertainty is often associated with safety [12,14,17].
540 F. S. Roza
3 Preliminaries
In this section, the classical RL framework and the HRL variants are formalized.
In RL, we consider an agent that sequentially interacts with an environment
modeled as a Markov Decision Process (MDP) [19]. An MDP is a tuple M :=
(S, A, R, P, μ0), where S is the set of states, A is the set of actions, R : S × A × S → ℝ is the reward function, P : S × A × S → [0, 1] is the transition
probability function which describes the system dynamics, where P (st+1 |st , at )
is the probability of transitioning to state st+1 , given that the previous state
was st and the agent took action at , and μ0 : S → [0, 1] is the starting state
distribution. At each timestep the agent observes the current state st ∈ S, takes
an action at ∈ A, transitions to the next state st+1 drawn from the distribution
P (st , at ), and receives a reward R(st , at , st+1 ).
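The interaction protocol defined by the MDP tuple is just a loop; the sketch below is schematic, with `policy` and `step_fn` standing in for an agent and for the dynamics P together with the reward R:

```python
def rollout(policy, step_fn, s0, horizon=100):
    """At each timestep: observe s_t, take a_t = policy(s_t), transition
    to s_{t+1}, and receive the reward R(s_t, a_t, s_{t+1})."""
    s, ret = s0, 0.0
    for _ in range(horizon):
        a = policy(s)
        s_next, r, done = step_fn(s, a)
        ret += r
        s = s_next
        if done:
            break
    return ret

# Toy deterministic chain: moving right from state 0 reaches the goal state 3.
step = lambda s, a: (s + a, 1.0 if s + a == 3 else 0.0, s + a == 3)
total = rollout(lambda s: 1, step, s0=0)  # reward only on reaching the goal
```

The same loop structure underlies both flat RL and, with an extra goal level, the hierarchical variant formalized next.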
For the hierarchical model formulation, a two-level structure similar to the one presented by [31] will be used. The top-level policy μhi observes the state st and sets a high-level action (or goal) gt. The bottom-level policy μlo observes the state st and the goal gt and outputs a low-level action at, which should move the agent towards accomplishing the high-level action. The high-level agent receives the environment reward R(st, at, st+1) every time step, while an intrinsic reward R(st, gt, at, st+1) is given by the high-level agent to the low-level agent every time step. Temporal abstraction is provided by the high-level policy only deriving a new high-level action once the low-level agent finishes its task, either by completing it or failing at it.
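This two-level interaction can be sketched as nested loops: the high-level policy emits a goal only at goal boundaries, and the low-level policy acts until that goal terminates. The skeleton below is illustrative (the intrinsic low-level reward is omitted for brevity, and all callables are placeholders):

```python
def hrl_episode(mu_hi, mu_lo, env_step, goal_done, s0, max_steps=100):
    """Two-level rollout: mu_hi sets goal g, mu_lo acts until the goal is
    completed or failed, which provides the temporal abstraction."""
    s, hi_return, steps = s0, 0.0, 0
    while steps < max_steps:
        g = mu_hi(s)                   # new high-level action at boundaries
        while steps < max_steps:
            a = mu_lo(s, g)
            s, r = env_step(s, a)
            hi_return += r             # extrinsic reward to the high level
            steps += 1
            if goal_done(s, g):
                break
    return hi_return

# Toy 1-D chain: each goal is a waypoint two cells to the right.
ret = hrl_episode(
    mu_hi=lambda s: s + 2,
    mu_lo=lambda s, g: 1 if s < g else 0,
    env_step=lambda s, a: (s + a, 1.0),
    goal_done=lambda s, g: s >= g,
    s0=0, max_steps=10)
```

Note that the high-level policy is queried only at goal boundaries, while the environment and low-level policy run at every step.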
[Figure: framework overview, showing the high-level agent, the uncertainty estimator, and the low-level agent.]
observe the speed to decrease after braking (if causes like mechanical failures are disregarded). In other words, if an accurate model for the underlying dynamics of the environment is available, feeding the model with the history of previous states and actions and comparing the predicted states to the actually measured states will give good hints on both data and model quality.
One of the main subsystems that compose the framework is the uncertainty estimator. The ideal approach for capturing uncertainty in deep learning would be to use Bayesian networks, which inherently represent uncertainty by learning distributions over the network weights. However, due to the amount of computation necessary to process such models, approximations are usually used. Different techniques can be used to approximate Bayesian networks, such as Monte-Carlo Dropout, Deep Ensembles and deterministic uncertainty quantification [7,8,30].
Considering the good performance achieved by Deep Ensembles [10], this method was chosen for uncertainty quantification in the proposed framework. The uncertainty is approximated by the variance of the predictions given by the ensemble members for the same input state. The intuition is that, assuming training leads to near-optimal approximations, the models should converge to similar decisions when given a sample similar to those experienced during training, but will likely diverge when given out-of-the-ordinary inputs, an effect enforced by random weight initialization and the use of a dropout layer during training.
In the proposed method, uncertainty estimation is performed on top of the frame-stacked observations. The ensemble consists of m models. Each model predicts the next state observation based on a stack of N past observations and actions, i.e., the model input is {s0, ..., sN, a0, ..., aN} and it outputs sN+1. The uncertainty estimate is given by the variance over the m ensemble predictions at every time step.
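The ensemble-variance estimate can then be computed directly; in this sketch, `models` stands in for the m trained next-state predictors (toy linear models are used in place of real networks):

```python
import numpy as np

def ensemble_uncertainty(models, stacked_input):
    """Mean prediction and per-dimension variance across the m ensemble
    members for the same stacked observation/action input."""
    preds = np.stack([f(stacked_input) for f in models])  # (m, obs_dim)
    return preds.mean(axis=0), preds.var(axis=0)

# Toy members that agree at the training point x = 0 and diverge elsewhere.
models = [lambda x, w=w: w * x for w in (0.9, 1.0, 1.1)]
_, var_in = ensemble_uncertainty(models, np.zeros(4))
_, var_out = ensemble_uncertainty(models, 10.0 * np.ones(4))
```

Low variance signals familiar inputs; high variance flags out-of-the-ordinary (e.g. noisy) observations.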
5 Experiments
The experiments were conducted in the windy maze environment, proposed by
[16]. This environment consists of a discrete maze, as shown in Fig. 2. The goal
is to cross the maze from the starting position and reach the goal position. In
every time step, a wind randomly blows in one of the four directions (north,
south, east, west) and the agent can choose to be carried by the wind or stay in
its position. The observation consists of the [x,y] position of the agent and the
wind direction. The reward is sparse, with the agent receiving 100 when reaching
the goal and 0 otherwise. The episode is finished when the agent achieves the
goal or after 200 steps.
Fig. 2. The windy maze environment, with marked start and goal positions.
In the hierarchical setting, the high-level agent defines the direction the agent
should move next and the low-level decides if it should move or not. By not
moving, a reward of 0 is returned to the low-level agent. By deciding to move,
the low-level will receive +1 if the high-level goal is achieved and −1 otherwise.
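A minimal sketch of the windy-maze transition described above (walls and the maze layout are omitted, and the position/wind encoding and the stay/move action interface are assumptions for illustration):

```python
import random

WIND = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def step(pos, move, rng):
    """The wind blows in a uniformly random direction; the agent chooses
    to be carried by it (move=True) or to stay in place."""
    wind = rng.choice(list(WIND))
    if not move:
        return pos, wind
    dx, dy = WIND[wind]
    return (pos[0] + dx, pos[1] + dy), wind

rng = random.Random(0)
pos_moved, wind = step((0, 0), move=True, rng=rng)
pos_stay, _ = step((0, 0), move=False, rng=rng)
```

Because the agent's only choice is whether to ride the wind, the low-level policy reduces to a binary move/stay decision per step.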
This environment was modified to fit the uncertainty setting this paper focuses on. To reasonably mimic sensor noise, noise with a magnitude of 1
5.1 Results
Fig. 3. Results for the Vanilla HRL Model (without uncertainty estimation). The high-
level reward represents extrinsic reward and, therefore, reflects the ability to solve the
task. The low-level reward is the intrinsic reward and the combined is the sum of both.
The first set of experiments was done with an HRL model without any uncertainty estimation. The probability of adding noise to the observation was ρ = {0.0, 0.1, 0.2, 0.3, 0.4, 0.5}, ranging from 0.0 (no noise) to 0.5 (a 50% probability for each observation to be contaminated with noise). The results are shown in Fig. 3.
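The contamination itself can be sketched as follows; the exact perturbation, shifting each position coordinate by magnitude 1 with probability ρ while leaving the wind untouched, is an assumption based on the description above:

```python
import random

def noisy_obs(obs, rho, rng):
    """With probability rho, contaminate the position part of the
    observation with magnitude-1 noise; the wind stays noise-free."""
    x, y, wind = obs
    if rng.random() < rho:
        x += rng.choice([-1, 1])
        y += rng.choice([-1, 1])
    return (x, y, wind)

rng = random.Random(0)
clean = noisy_obs((3, 4, "N"), rho=0.0, rng=rng)
noisy = noisy_obs((3, 4, "N"), rho=1.0, rng=rng)
```

Keeping the wind component clean matches the experimental setup, where only the positions are affected by noise.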
In Fig. 3a we can see how the noise affects the agent's capacity to solve the given task. For ρ values above 0.3, the agent is not able to solve the problem at all. Figure 3c shows that the low-level agents can still learn something in most cases. This means that the HRL model fails to learn due to the inability of the high-level agent to correctly set goals that would move the agent towards the overall goal. This result is expected considering that in this environment only the positions are affected by noise, while the wind measurement remains free of noise.
Fig. 4. Results for UA-HRL with uncertainty estimator ensemble based on 5 models.
The high-level reward represents extrinsic reward and, therefore, reflects the ability to
solve the task. The low-level reward is the intrinsic reward and the combined is the
sum of both.
Figure 4 shows the results obtained for the same settings with UA-HRL, the framework proposed in this paper. For this experiment, the values of ρ were varied in the same way as in the previous experiment. Analyzing Fig. 4a, it becomes clear that the UA-HRL model is able to learn meaningful policies regardless of the amount of noisy data experienced during training. It is also evident how much slower the learning is. This can be explained by the fact that, in this case, the model must not only learn how to solve the task but also how to interpret and take advantage of the uncertainty estimates.
It is not clear why the system is performing slightly worse for low ρ values.
One hypothesis is that for low-noise data, the uncertainty estimates are not
necessary but variances are still produced and introduced into the model since
realistically the learned models will slightly diverge even for noise-free samples.
6 Conclusion
Acknowledgments. This work was funded by the Bavarian Ministry for Economic
Affairs, Regional Development and Energy as part of a project to support the thematic
development of the Institute for Cognitive Systems.
References
1. Abdar, M., et al.: A review of uncertainty quantification in deep learning: tech-
niques, applications and challenges. Inf. Fusion 76, 243–297 (2021)
2. Bacon, P.-L., Harb, J., Precup, D.: The option-critic architecture. In: Proceedings
of the AAAI Conference on Artificial Intelligence, vol. 31 (2017)
3. Badre, D., Hoffman, J., Cooney, J.W., D’esposito, M.: Hierarchical cognitive con-
trol deficits following damage to the human frontal lobe. Nat. Neurosci. 12(4),
515–522 (2009)
4. Botvinick, M., Ritter, S., Wang, J.X., Kurth-Nelson, Z., Blundell, C., Hassabis, D.:
Reinforcement learning, fast and slow. Trends Cogn. Sci. 23(5), 408–422 (2019)
5. Botvinick, M.M., Niv, Y., Barto, A.G.: Hierarchically organized behavior and its
neural foundations: a reinforcement learning perspective. Cognition 113(3), 262–
280 (2009)
6. Botvinick, M.M.: Hierarchical reinforcement learning and decision making. Curr.
Opinion Neurobiol. 22(6), 956–962 (2012)
7. Fort, S., Hu, H., Lakshminarayanan, B.: Deep ensembles: a loss landscape perspec-
tive. arXiv preprint arXiv:1912.02757 (2019)
8. Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: representing
model uncertainty in deep learning. In: International Conference on Machine Learn-
ing, pp. 1050–1059. PMLR (2016)
9. Haider, T., Roza, F.S., Eilers, D., Roscher, K., Günnemann, S.: Domain shifts in
reinforcement learning: identifying disturbances in environments (2021)
10. Henne, M., Schwaiger, A., Roscher, K., Weiss, G.: Benchmarking uncertainty esti-
mation methods for deep learning with safety-related metrics. In: SafeAI@ AAAI,
pp. 83–90 (2020)
11. Henne, M., Schwaiger, A., Weiss, G.: Managing uncertainty of AI-based perception
for autonomous systems. In: AISafety@ IJCAI (2019)
12. Hoel, C.-J., Wolff, K., Laine, L.: Tactical decision-making in autonomous driving
by reinforcement learning with uncertainty estimation. In: 2020 IEEE Intelligent
Vehicles Symposium (IV), pp. 1563–1569. IEEE (2020)
13. Jong, N.K., Hester, T., Stone, P.: The utility of temporal abstraction in reinforce-
ment learning. In: AAMAS (1), pp. 299–306. Citeseer (2008)
14. Kahn, G., Villaflor, A., Pong, V., Abbeel, P., Levine, S.: Uncertainty-aware rein-
forcement learning for collision avoidance. arXiv preprint arXiv:1702.01182 (2017)
15. Kulkarni, T.D., Narasimhan, K., Saeedi, A., Tenenbaum, J.: Hierarchical deep
reinforcement learning: integrating temporal abstraction and intrinsic motivation.
In: Advances in Neural Information Processing Systems, vol. 29 (2016)
16. Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J.E., Stoica, I.: Tune:
a research platform for distributed model selection and training. arXiv preprint
arXiv:1807.05118 (2018)
17. Lütjens, B., Everett, M., How, J.P.: Safe reinforcement learning with model uncer-
tainty estimates. In: 2019 International Conference on Robotics and Automation
(ICRA), pp. 8662–8668. IEEE (2019)
18. Mnih, V., et al.: Playing Atari with deep reinforcement learning. arXiv preprint
arXiv:1312.5602 (2013)
19. Nachum, O., Gu, S.S., Lee, H., Levine, S.: Data-efficient hierarchical reinforcement
learning. In: Advances in Neural Information Processing Systems, vol. 31 (2018)
20. Pertsch, K., Lee, Y., Lim, J.J.: Accelerating reinforcement learning with learned
skill priors. arXiv preprint arXiv:2010.11944 (2020)
21. Ribas-Fernandes, J.J.F., et al.: A neural signature of hierarchical reinforcement learning. Neuron 71(2), 370–379 (2011)
22. Schrittwieser, J., et al.: Mastering Atari, go, chess and shogi by planning with a
learned model. Nature 588(7839), 604–609 (2020)
23. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
24. Schwaiger, A., Sinhamahapatra, P., Gansloser, J., Roscher, K.: Is uncertainty quan-
tification in deep learning sufficient for out-of-distribution detection? In: AISafety@
IJCAI (2020)
25. Schwaiger, F., et al.: From black-box to white-box: examining confidence calibra-
tion under different conditions. arXiv preprint arXiv:2101.02971 (2021)
26. Sedlmeier, A., Gabor, T., Phan, T., Belzner, L., Linnhoff-Popien, C.: Uncertainty-
based out-of-distribution detection in deep reinforcement learning. arXiv preprint
arXiv:1901.02219 (2019)
27. Silver, D., et al.: Mastering the game of go with deep neural networks and tree
search. Nature 529(7587), 484–489 (2016)
28. Silver, D., et al.: Mastering the game of go without human knowledge. Nature
550(7676), 354–359 (2017)
29. Sutton, R.S., Precup, D., Singh, S.: Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artif. Intell. 112(1–2), 181–211 (1999)
30. Van Amersfoort, J., Smith, L., Teh, Y.W., Gal, Y.: Uncertainty estimation using a
single deep deterministic neural network. In: International Conference on Machine
Learning, pp. 9690–9700. PMLR (2020)
31. Vezhnevets, A.S., et al.: Feudal networks for hierarchical reinforcement learning.
In: International Conference on Machine Learning, pp. 3540–3549. PMLR (2017)
32. Yang, Z., Merrick, K., Jin, L., Abbass, H.A.: Hierarchical deep reinforcement learn-
ing for continuous action control. IEEE Trans. Neural Netw. Learn. Syst. 29(11),
5174–5184 (2018)
Resampling-Free Bootstrap Inference for Quantiles
1 Introduction
The use of randomized experiments in product development has seen an enor-
mous increase in popularity over the last decade. Modern tech companies now
view experimentation, often called A/B testing, as fundamental and have tightly
integrated practices around it into their product development. The vast major-
ity of A/B testing compares two groups, treatment and control, with respect to
average treatment effects through calculation of difference-in-means. These com-
parisons are operationalized through standard z-tests that are simple to perform.
With the rise of A/B testing, tests that do not compare average effects are also
gaining more and more interest. Difference-in-quantiles, where treatment and
control quantiles are compared, is one such test, where reasons for it might be
that effects are not expected, or that they are difficult to identify, on average.
For example, a change could be targeting users experiencing the largest amount
of buffering, which means a difference-in-quantiles comparison for, say, the 90th
percentile may be of more interest than the average buffering amount experi-
enced by users. These tests are, however, much more difficult to perform, with
non-standard sampling distributions, that severely complicate implementations.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 548–562, 2023.
https://doi.org/10.1007/978-3-031-18461-1_36
then the interval (Y(r) , Y(s) ) is a 1 − α CI for F −1 (q), where Y(i) refers to the
ith order statistic. There is not a unique pair (r, s) that satisfies the preceding
equation. Additional restrictions can be added to find a unique pair, like assign-
ing equal probability to each tail. See, e.g., [10, p. 158] for details. Throughout
this paper we will refer to the CI given above as analytical order-statistic-based
CIs (AOS-CI).
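As a concrete illustration of the AOS-CI construction, the following routine picks the order-statistic indices (r, s) from binomial tail probabilities, putting at most α/2 in each tail (an illustrative sketch using only the standard library; the index conventions are one reasonable choice, not necessarily those of [10]):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def aos_ci_indices(n, q, alpha):
    """Indices (r, s) with P(Y_(r) < F^{-1}(q) < Y_(s)) >= 1 - alpha,
    assigning at most alpha/2 probability to each binomial tail."""
    cdf = [binom_cdf(k, n, q) for k in range(n + 1)]
    r = max((k + 1 for k in range(n + 1) if cdf[k] <= alpha / 2), default=1)
    s = min((k + 1 for k in range(n + 1) if 1 - cdf[k] <= alpha / 2), default=n)
    return r, s

# 95% CI for the median with n = 100 observations.
r, s = aos_ci_indices(n=100, q=0.5, alpha=0.05)
```

The coverage, P(r ≤ Bin(n, q) ≤ s − 1), is then at least 1 − α by construction, independently of the outcome distribution.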
For the purposes of this paper, the most important aspect of the AOS-CI is
that they use the fact that the distribution of order-statistic indexes is indepen-
dent of the outcome data distribution. The binomial probabilities used above
are valid due to the properties of quantile definition rather than properties of
the outcome data. An important limitation of the AOS-CIs is that they are not
applicable to difference-in-quantiles; the relation between the population quan-
tiles and order statistics does not translate to the difference-in-quantiles and the
difference in order statistics. Nevertheless, in this paper we use similar argu-
ments in a bootstrap context to enable two-sample inference for quantiles. The
following section gives a brief introduction to bootstrap inference.
ordered such that the order statistic constituting the sample quantile estimate can be extracted. This is a memory- and computation-intensive exercise. In the following section, we exploit theoretical properties of τ̂q to provide substantial simplifications to Algorithm 1 that will ultimately enable bootstrap CIs for the difference-in-quantiles case.
We now aim to build intuition for this index distribution. Consider an exam-
ple where we are interested in a confidence interval for the median in a sample
of N = 10 observations. According to the quantile estimator in Definition 1, the
median is y(5) or y(6) with equal probability. It is intuitively apparent that the
middle order statistics y(4) , y(5) , y(6) , y(7) in the original sample are more likely
to be medians in the bootstrap sample. The first and the last order statistics
y(1) , y(10) have little chance of being the medians in the bootstrap sample. For
example, y(1) will be the median in a bootstrap sample only if p1 ≥ Σ_{i=2}^{N} p_i
is satisfied. If, e.g., p1 = 2, then the sum of all 9 remaining Poisson random
variables (p2, . . . , p10) must be smaller than or equal to p1 = 2, which is unlikely
given that Σ_{i=2}^{N} p_i ∼ Poi(9). Following this logic, it is clear that order statistics
that are closer in rank to the original sample quantile have higher probability
of being observed as the sample quantile in a bootstrap sample. It is easy to
simulate the distribution of what index in the original sample is observed as the
desired quantile across bootstrap samples. Figure 1 displays this distribution for
the 10th percentile in 1,000,000 Poisson bootstrap samples in a sample of size
N = 2000. If indexes could be generated directly from the index distribution,
Fig. 1. Distribution of the index of the order statistic from a sample of N = 2000 that
became the 10th percentile over 1M Poisson bootstrap samples.
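The simulation behind Fig. 1 can be sketched as follows (Python rather than the paper's Julia; the simple-rounding rank rule `round(q * (m + 1))` is our simplification of the probabilistic rounding in Definition 1): each original observation receives an independent Poi(1) weight, and we record which index of the sorted original sample is observed as the bootstrap sample quantile.

```python
import math
import random


def poisson1(rng):
    """Draw from Poi(1) via Knuth's product method."""
    threshold = math.exp(-1.0)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def index_distribution(n, q, n_boot, seed=1):
    """Poisson bootstrap: count how often each index of the sorted original
    sample is observed as the bootstrap sample q-quantile."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_boot):
        weights = [poisson1(rng) for _ in range(n)]  # Poi(1) weight per obs.
        m = sum(weights)
        if m == 0:
            continue  # degenerate resample with no observations
        rank = max(1, min(m, round(q * (m + 1))))  # simplified rank rule
        cum = 0
        # walk the sorted original sample until the cumulative weight
        # reaches the target rank
        for idx, w in enumerate(weights, start=1):
            cum += w
            if cum >= rank:
                counts[idx] = counts.get(idx, 0) + 1
                break
    return counts
```

For n = 2000 and q = 0.1, the resulting counts concentrate around index 200, mirroring Fig. 1.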
X<i ≤ q(n + 1) − 1
Pi > 0
X>i ≤ (1 − q)(n + 1) − 1.
If r ≠ 0, the index selected will be ⌈q(n + 1)⌉ with probability r and ⌊q(n + 1)⌋
with probability 1 − r. In the former case, the conditions that need to be satisfied
are
X<i ≤ q(n + 1) − r
Pi > 0
X>i ≤ (1 − q)(n + 1) + r − 2.
In the latter case, the conditions are
X<i ≤ q(n + 1) − r − 1
Pi > 0
X>i ≤ (1 − q)(n + 1) + r − 1.
The expression in the theorem then follows from the law of total probability.
Proof. Since C^L_{ψ,α/2} ≥ Y(i_L) and C^U_{ψ,α/2} ≤ Y(i_U),
Fig. 2. Two examples of index distributions and their respective binomial approxi-
mations. The area and dotted curves show the estimated densities for the binomial
approximation and index distribution, respectively.
can provide strong evidence that generalizes to all such data-generating pro-
cesses. The setup of the simulation is the following. The number of bootstrap
samples is B = 10^6 and the sample size is set to N ∈ {100, 200, 500, 1000,
5000, 10000}. The quantile of interest is q ∈ {0.01, 0.1, 0.25, 0.5}, which, due
to symmetry, generalizes also to q ∈ {0.75, 0.9, 0.99}. For each combination of
sample size and quantile, 10^6 bootstrap samples are realized, and it is recorded
which index from the ordered original sample is observed as the estimate of
the quantile. The empirical distribution function of the indexes across the boot-
strap samples is fitted. The Kolmogorov-Smirnov (KS) distance is calculated
comparing the empirical bootstrap distribution to a Bin(N + 1, q) distribution.
Figure 3 displays the Kolmogorov-Smirnov distance for each combination of
quantile and sample size. In support of Conjecture 1, the KS distance is decreas-
ing in sample size, indicating that the approximation of ψ using the Bin(N +1, q)
distribution is improving as the sample size increases. Perhaps surprisingly, the
approximation is not strictly improving as the quantile comes closer to 0.5. A
likely explanation is that although the skewness and boundedness of the index
distribution are less pronounced the closer the quantile is to 0.5, its variance
also increases.
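The KS distance used here can be computed directly from the exact binomial pmf. The sketch below (Python, with illustrative names; not the paper's Julia code) compares an empirical index distribution, given as a dict of counts, against the Bin(N + 1, q) reference:

```python
import math


def binom_pmf(k, n, p):
    """Exact Bin(n, p) probability mass at k."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)


def ks_distance(index_counts, n, q):
    """Max gap between the empirical CDF of bootstrap quantile indexes
    and the CDF of the Bin(n + 1, q) reference distribution."""
    total = sum(index_counts.values())
    emp = theo = dist = 0.0
    for k in range(n + 2):  # Bin(n + 1, q) is supported on 0..n+1
        emp += index_counts.get(k, 0) / total
        theo += binom_pmf(k, n + 1, q)
        dist = max(dist, abs(emp - theo))
    return dist
```

Feeding counts generated from Bin(N + 1, q) itself yields a distance near zero, which is the behavior Conjecture 1 predicts for large samples.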
[Fig. 3: Kolmogorov-Smirnov distance (vertical axis) against sample size for the
quantiles 0.01, 0.1, 0.25, and 0.5.]
In this section, the coverage of the confidence intervals resulting from Algorithm
3 is studied using Monte Carlo simulation. The algorithms are implemented in
Julia version 1.6.3 [1], and the code for the algorithms and the Monte Carlo
simulations can be found at
https://github.com/MSchultzberg/fast_quantile_bootstrap.
The data-generating process is similar to the previous simulation. The num-
ber of Monte Carlo replications is 10^4. For each replication, two samples of
size Nt = Nc = 10^5 are generated from a standard normal distribution.
The number of bootstrap samples for each Monte Carlo replication is B = 10^5
and the two-sided 95% confidence interval is returned. The study is repeated
for the quantiles 0.01, 0.1, 0.25, and 0.5. The coverage rate is the proportion
of the CIs that covered the true population difference-in-quantiles, i.e., zero.
To quantify the error due to a finite number of Monte Carlo replications, the
two-sided 95% confidence intervals of the coverage rate (using standard normal
approximation of the proportion) are again presented with the results.
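The Monte Carlo error bars can be reproduced with the standard normal approximation for a proportion; a one-function sketch (ours, for illustration):

```python
import math


def coverage_ci(p_hat, n_rep, z=1.96):
    """Normal-approximation CI for an empirical coverage rate p_hat
    estimated from n_rep Monte Carlo replications."""
    half = z * math.sqrt(p_hat * (1 - p_hat) / n_rep)
    return p_hat - half, p_hat + half
```

For example, coverage_ci(0.953, 10000) gives (0.949, 0.957) after rounding to three decimals, matching the first row of Table 1.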
Table 1 displays the results from the Monte Carlo simulation. Again, it is
clear that the coverage is close to the intended 95% for all quantiles, with no
observable systematic deviations.
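Under Conjecture 1, CI endpoints can be read off from draws of Bin(N + 1, q) without touching the data again. The sketch below is our simplified one-sample Python rendering of that idea, not the authors' Algorithm 3 (which additionally handles the difference-in-quantiles case): draw B indexes, take their α/2 and 1 − α/2 empirical quantiles, and return the corresponding order statistics.

```python
import random


def binom_draw(rng, n, p):
    """Naive Bin(n, p) sampler (sum of Bernoullis); adequate for a sketch."""
    return sum(rng.random() < p for _ in range(n))


def resampling_free_ci(sorted_sample, q, n_boot=2000, alpha=0.05, seed=7):
    """Approximate bootstrap CI for the q-quantile: draw the order-statistic
    index directly from Bin(N + 1, q) instead of resampling the data."""
    n = len(sorted_sample)
    rng = random.Random(seed)
    idxs = sorted(binom_draw(rng, n + 1, q) for _ in range(n_boot))
    lo = idxs[int(alpha / 2 * n_boot)]          # empirical alpha/2 quantile
    hi = idxs[int((1 - alpha / 2) * n_boot) - 1]

    def clamp(i):
        return min(max(i, 1), n)  # keep indexes inside 1..n

    return sorted_sample[clamp(lo) - 1], sorted_sample[clamp(hi) - 1]
```

Because only indexes are simulated, no bootstrap sample is ever materialized or sorted, which is the source of the memory and time savings quantified below.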
This section presents memory and time consumption comparisons to build intu-
ition for the impact of the reduction in complexity enabled by Theorem 1 and
Conjecture 1. The comparisons are between Algorithm 2 and 3 implemented in
Julia version 1.6.3 [1] and benchmarked using the BenchmarkTools package [3].
560 M. Schultzberg and S. Ankargren
Table 1. Empirical coverage rates of the confidence intervals produced by Algorithm 3
for the difference-in-quantiles for quantiles 0.01, 0.1, 0.25, and 0.5, for sample size
10^5 with 10^5 bootstrap samples over 10,000 replications.
q      Empirical coverage   95% CI lower   95% CI upper
0.01   0.953                0.949          0.957
0.10   0.949                0.944          0.953
0.25   0.950                0.946          0.955
0.50   0.949                0.945          0.954
The setup for the comparison is the following. Two samples of 1000 floats each
are generated. B is set to 10,000. The setup is selected to enable
100 evaluations of Algorithm 2 within around 200 s on a local machine. The
results are displayed in Table 2.
Table 2. Time and memory consumption comparison between a standard Poisson boot-
strap algorithm (Algorithm 2) for difference-in-quantiles CIs and the corresponding
proposed binomial-approximated Poisson bootstrap algorithm (Algorithm 3).
References
1. Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B.: Julia: a fresh approach to
numerical computing. SIAM Rev. 59(1), 65–98 (2017)
2. Chamandy, N., Muralidharan, O., Najmi, A., Naidu, S.: Estimating Uncertainty
for Massive Data Streams. Technical report, Google (2012)
3. Chen, J., Revels, J.: Robust benchmarking in noisy environments. arXiv e-prints,
arXiv:1608.04295 (2016)
4. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms,
3rd edn. The MIT Press (2009)
5. David, H.A., Nagaraja, H.N.: Order statistics. John Wiley & Sons (2004)
6. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters.
Commun. ACM 51(1), 107–113 (2008)
7. Efron, B.: Bootstrap methods: another look at the jackknife. Ann. Stat. 7(1), 1–26
(1979)
8. Falk, M., Reiss, R.-D.: Weak convergence of smoothed and nonsmoothed bootstrap
quantile estimates. Ann. Probab. 17(1), 362–371 (1989)
9. Ghosh, M., Parr, W.C., Singh, K., Babu, G.J.: A note on bootstrapping the sample
median. Ann. Stat. 12(3), 1130–1135 (1984)
10. Gibbons, J.D., Chakraborti, S.: Nonparametric statistical inference. CRC press
(2014)
11. Hanley, J.A., MacGibbon, B.: Creating non-parametric bootstrap samples using
poisson frequencies. Comput. Methods Programs Biomed. 83(1), 57–62 (2006)
12. Hutson, A.D.: Calculating nonparametric confidence intervals for quantiles using
fractional order statistics. J. Appl. Stat. 26(3), 343–353 (1999)
13. Kleiner, A., Talwalkar, A., Sarkar, P., Jordan, M.I.: A scalable bootstrap for mas-
sive data. J. Royal Stat. Soc.: Series B (Statistical Methodology) 76(4), 795–816
(2014)
14. Liu, M., Sun, X., Varshney, M., Xu, Y.: Large-Scale Online Experimentation with
Quantile Metrics. arXiv e-prints, arXiv:1903.08762 (2019)
15. Nyblom, J.: Note on interpolated order statistics. Stat. Probab. Lett. 14(2), 129–
131 (1992)
16. Rao, C.R., Statistiker, M.: Linear statistical inference and its applications vol. 2,
Wiley New York (1973)
17. Scheffe, H., Tukey, J.W.: Non-Parametric Estimation. I. Validation of Order Statis-
tics. Ann. Math. Stat. 16(2), 187–192 (1945)
Determinants of User's Acceptance of Mobile
Payment: A Study of Cambodia Context
Royal University of Phnom Penh, Russian Federation Blvd, Phnom Penh, Cambodia
[email protected]
Abstract. All transactions between buyers, sellers, and businesses are done
using payment systems. In the past, people made payments using traditional means
such as cash and checks. However, due to advanced technology, e-commerce, the
internet, and mobile devices, payment systems have transformed from cash-based
to digital-based transactions. Mobile payment refers to modern payment practices
via mobile devices such as cellphones, smartphones, or tablets. Mobile payment
allows consumers to reduce the use of cash and offers efficient and fast performance
as well as the secure transfer of information between consumers when conducting
payments and transactions. Even though mobile payment provides these benefits,
non-cash payment practices are still new among most consumers in Cambodia
due to limited knowledge of digital-based payment. Some companies succeeded,
while some failed due to limited insights into the success factors that predict
users’ intention to use mobile payment. Therefore, the researchers decided to
conduct this study by extending TAM with trust, innovativeness, and functionality,
aiming to explore the user’s acceptance of mobile payment services. This study
was conducted on 301 respondents who have experience using mobile payment
applications. The collected data was analyzed using SPSS 25 and AMOS 23. Based
on the results, perceived usefulness and perceived
ease of use have a positive effect on behavioral intention to use mobile payment
services. Perceived usefulness is positively predicted by perceived ease of use,
innovativeness, and functionality. Trust and innovativeness positively influence
perceived ease of use.
1 Introduction
From a business perspective, all transactions between buyers and sellers or banks and
financial institutions are made using payment systems, and no business activities can
be done without payments and financial transactions. Payments in the past were made
through traditional methods such as cash and checks, which were the basic payment
instruments in the period [1]. With e-commerce and mobile devices, the payment system
has gradually changed from traditional cash-based transactions to cashless transactions
[2]. Additionally, mobile commerce has also emerged and gained popularity due to fast-
growing technology, the internet, and mobile devices. As a result, digital, electronic,
and mobile payments were critical in facilitating mobile commerce payment processes.
Mobile commerce refers to the transaction of goods and services via mobile devices [3].
According to the National Bank of Cambodia [4], Cambodia’s payment landscape has
gradually transformed from cash-based to digital-based payment over the last decade due
to advanced technology, economic development, and the demand for fast and efficient
services.
In line with the trend of the digital economy and mobile commerce, there is immense
potential in mobile payment applications and services in Cambodia. As defined by Lerner
[5], mobile payment refers to the present payment practices using mobile devices such
as mobile phones and tablets. Moreover, mobile payments are becoming very popular in
the era of e-commerce and the digital economy, enabling consumers to reduce their use
of cash and offering efficient and fast performance as well as the secure transfer of
information when conducting payments and transactions. Further, with the significant growth
of mobile phone usage and the internet, the mobile payment platforms in Cambodia
have significantly developed over the past 5 years, and the market is currently crowded
with start-ups, international and domestic firms, and various digital companies all try-
ing to benefit from the electronic and mobile payments [6]. Based on the researcher’s
knowledge, there are many key mobile payment services and platforms in operation in
Cambodia, such as ABA Mobile, ACLEDA Mobile, Canadia Mobile Banking, PPCB
Mobile Banking, FTB Mohabot App, Wing, TrueMoney, Lyhour, Pi Pay, SmartLuy,
E-Money, and so forth.
Despite the benefits of adopting mobile payment services to perform payments and
financial transactions, these non-cash payment practices are still new among most con-
sumers in Cambodia. With limited knowledge and awareness of financial technology,
most consumers find it difficult to use mobile payments for their payments and money
transfers. Thus, some payment companies and providers succeeded, and some failed due
to these issues. Simply put, some consumers use mobile payment due to its usefulness
and ease of use, while others may use mobile payment because of the helpful functions
provided and trust in the company. Also, some consumers will use mobile payments
because they are innovative and willing to try sophisticated products such as financial
innovations. This implies that it is crucial to identify the success factors that affect the
consumer’s intention to use mobile payment.
There were several preceding studies that adopted different theories to investigate
the factors affecting consumers’ adoption of digital payment, internet banking, and
mobile devices in the context of Cambodia. Chav and Ou [7] found that attitude is a
critical predictor affecting the intention to use mobile banking, and attitude is influenced
by usefulness, ease of use, trust, and job relevance. In addition, Do et al. [8] noticed
that performance expectancy, effort expectancy, and transaction speed have a positive
impact on behavioral intention to use mobile payment. Consequently, the study adopted
the Technology Acceptance Model (TAM) extended with three prominent factors,
namely trust, innovativeness, and functionality, to identify the factors affecting the user's
intention to accept mobile payment services.
This study aims to explore the determinants of users’ acceptance of mobile payment
services. The determinant factors are represented in Fig. 1. To achieve the aim of this
study, five major objectives were formulated:
O1: To examine the relationships between perceived usefulness, perceived ease of
use, and behavioral intention.
O2: To examine the relationship between perceived ease of use and perceived
usefulness.
O3: To examine the relationships between innovativeness, perceived ease of use, and
perceived usefulness.
O4: To examine the relationship between trust and perceived ease of use.
O5: To examine the relationship between functionality and perceived usefulness.
The findings of this study contribute to major stakeholders such as future researchers,
entrepreneurs, marketers, and developers of mobile payment apps, banks, governments,
and policymakers. The literature, methodologies, and results of this study will be benefi-
cial to students and researchers who are interested in this area of mobile payment systems
and similar topics. Likewise, entrepreneurs, banks, developers, and mobile payment ser-
vice providers can benefit from the insights on the factors affecting users’ acceptance of
mobile payment services, so they can formulate strategies to improve their products or
services in the mobile payment area. Importantly, the findings of this study also benefit
the government and policymakers from relevant departments to capture clear landscapes
of what makes users use mobile payment services, so that they can develop effective
mechanisms to encourage users toward the use of digital and mobile payment, which is
a crucial component of the digital economy.
2.1 Perceived Usefulness (PU), Perceived Ease of Use (PEU), and Behavioral
Intention (BI)
According to Davis [10], PU refers to the degree to which a person believes that a particular
system improves his or her work, while PEU is defined as the degree to which a person
finds a particular system easy and effortless to use. Accordingly, the study found that PU
566 S. Soun et al.
and PEU have a significant impact on the BI to adopt new technology systems,
and PU is positively affected by PEU [10]. Further, the effects of PU and PEU on
BI were confirmed for social media transactions [16], mobile banking apps [11], mobile payment
[14], and wireless internet service through mobile technology [17]. Saprikis et al. [18]
also noticed that PU has a positive relationship with BI for using mobile shopping.
In addition, a positive effect of PEU on PU was also validated by previous findings
in different contexts of innovation adoption [7, 11, 13, 15–19]. Thus, the following
hypotheses were developed:
H1: Perceived usefulness has a positive effect on behavioral intention.
H2: Perceived ease of use has a positive effect on behavioral intention.
H3: Perceived ease of use has a positive effect on perceived usefulness.
2.3 Innovativeness (INN), Perceived Ease of Use (PEU), and Perceived Usefulness
(PU)
INN is one of the most prominent factors examined in several previous studies
of technology acceptance. In the terminology of information technology, INN refers to
a person’s willingness to undertake any modern information technology [22]. Further,
Agarwal and Prasad [23] defined INN as a person’s willingness to try sophisticated
systems and information technology. Simply put, innovative consumers have a higher
likelihood of trying new technology systems than less innovative consumers. INN was
used by several researchers to investigate users’ intention to adopt new information
systems. Previous studies validated that INN has a positive impact on PEU for mobile
shopping [18], NFC payment systems [13], and mobile payment [14]. In the study of
the adoption of wireless internet services via mobile technology, Lu et al. [17] also
confirmed that both PEU and PU were positively predicted by INN. Likewise, Zarmpou
et al. [19] also observed that INN is positively correlated with PEU and PU of mobile
service adoption. Therefore, the researchers proposed the following hypotheses.
H5: Innovativeness has a positive effect on perceived ease of use.
H6: Innovativeness has a positive effect on perceived usefulness.
[Fig. 1: The proposed research model. Trust (H4) and Innovativeness (H5) predict
Perceived Ease of Use; Perceived Ease of Use (H3), Innovativeness (H6), and
Functionality (H7) predict Perceived Usefulness; Perceived Usefulness (H1) and
Perceived Ease of Use (H2) predict Behavioral Intention.]
3 Methodology
3.1 Research Site
In this study, Phnom Penh, the capital of Cambodia, was selected as the study site due
to favourable access to targeted respondents to collect data for the study. According to
Southeast Asia Globe Magazine by Retka [24], more than 71% of Cambodia’s population
had access to financial services, with 59% using formal banking systems. In addition,
Phnom Penh has the largest population, with a statistic of 2,281,951, equivalent to 14.7
per cent of Cambodia’s total population as reported in the General Population Census of
Cambodia in 2019 [25]. Noticeably, Phnom Penh is the centre of economic development
and investment. Thus, Phnom Penh is where the use of e-banking, mobile payment and
e-commerce is concentrated.
In this study, the researchers administered online questionnaires via Google Forms to
target respondents in Phnom Penh who had experience using mobile payment services
to collect primary data for the study. Meanwhile, secondary data such as information
about the mobile payment services in Cambodia was obtained from reliable websites,
government publications, and previous studies. For unknown populations, Bowerman
et al. [26] suggested a sample of at least 196 respondents to produce reliable results
for a quantitative study. Given its quantitative design and the unknown population, the
research employed convenience sampling, snowball sampling, and purposive sampling
to collect data from target respondents.
In this study, there are a total of 22 questionnaire items adapted from prior studies
measured using a 5-Point Likert Scale. The BI consists of four items adopted from
reference [18]. PU has four items obtained from reference [19], and PEU also has three
items adapted from reference [19]. TR has 4 questionnaire items selected from reference
[18], while INN has 3 items adopted from reference [18]. F has four items adopted from
reference [19].
Data collected from target respondents was analyzed using the Statistical Package for
the Social Sciences version 25 (SPSS 25) and Analysis of Moment Structures version 23
(AMOS 23). While SPSS 25 was employed to analyze personal information, descriptive
statistics, factor analysis and reliability tests, and correlation matrix, AMOS 23 was
used to conduct confirmatory factor analysis (CFA) to check convergent reliability and
validity and structural equation modeling (SEM) to test the hypotheses.
4 Results
At the end of data collection, 301 valid responses were collected and used for data analysis.
Among the mobile payment app brands used, ABA Mobile accounted for more than 71%,
followed by ACLEDA Unity Toanchet (21%), Canadia Mobile (3%), FTB Mohabot App (2%),
and other mobile payment apps (2%). In terms of the reasons for using these mobile
payment applications, approximately 30% were for convenience; approximately 23%
were for online shopping; approximately 28% were for business; approximately 3%
were for discount; nearly 3% were for booking; and 13% were for other reasons. This
implies that most users among the 301 respondents used the ABA Mobile and ACLEDA
Unity Toanchet for convenience, business purposes, online shopping, and other purposes.
Related to gender, nearly 59 percent were female, and about 41 percent were male. The
majority of respondents (74.8 percent) were 21–25 years old, followed by 26–30 years
(12.6%), less than 20 years (10.6%), 31–35 years (1.3%), and over 35 years (0.7%).
For education, undergraduate and master’s degrees dominate other education levels as
more than 80 percent were undergraduate or bachelor’s degree holders, while nearly 14
percent were pursuing or holding a master’s degree, followed by high school (2.7%) and
Ph.D. (1 percent). Speaking of occupations, most respondents were students and from
the private sector because about 48 percent were students, and around 38 percent were
from the private sector, followed by the public sector (8.3 percent) and other occupations
(5.3 percent). In terms of income, 45.5 percent earned $301–500, followed by less than
$300 (35.5%), more than $700 (11.3%), and $501–700 (8%).
Table 1 presents the results of the mean and standard deviation of the research variables
for all research constructs in this study, measured using a 5-Point Likert Scale assessment.
The results illustrate that the mean score of all research variables ranges from 3.44 to
3.91, moving toward the “Agree” statement, showing a satisfactory level of agreement.
Further, the standard deviation score of all research variables ranges from 0.95 to 1.11,
providing good variability of the data collected from respondents.
Factor analysis was employed to examine the structure of the research constructs and
to sort out misfit variables that could otherwise produce poor results, and a reliability
test was adopted to examine the reliability of the research variables [27]. Hair et al.
[27] further stated some
specifications as follows: factor analysis (factor loading ≥ 0.6, Kaiser Meyer-Olkin or
KMO > 0.5, cumulative percentage ≥ 60%, eigenvalue value > 1) and reliability test
(item-total correlation ≥ 0.5 and Cronbach alpha ≥ 0.6). Table 2 shows that the scores
of the factor loading, KMO, cumulative percentage, eigenvalue, item-total correlation,
and Cronbach alpha (α) meet the requirement of the rule of thumb.
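For reference, Cronbach's alpha for a k-item scale is α = k/(k − 1) · (1 − Σs²ᵢ/s²_total). A minimal Python sketch (with hypothetical item scores, not this study's data):

```python
def variance(xs):
    """Sample variance (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def cronbach_alpha(items):
    """items: one list of respondent scores per questionnaire item."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent totals
    return (k / (k - 1)) * (1 - sum(variance(i) for i in items) / variance(totals))
```

Perfectly consistent items (every respondent scoring identically across items) yield α = 1, and α ≥ 0.6 is the threshold cited above from Hair et al. [27].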
In this study, Pearson’s coefficient (r) was computed to assess the direction and strength
of the relationship between research constructs [28]. The correlation matrix tested the
correlations between perceived usefulness, perceived ease of use, innovativeness, func-
tionality, trust, and behavioral intention by computing the mean score of the research
constructs and testing its relationship. Table 3 summarizes the results of the positive
correlation among research constructs.
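Pearson's r is the covariance of the two mean-score vectors scaled by both standard deviations; a small illustrative sketch (hypothetical data, not the study's):

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Values near +1 indicate the strong positive associations reported in Table 3.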
As shown in Fig. 2, the results of SEM indicated that model fit assessment was
achieved (χ2/d.f. = 1.604, GFI = 0.931, AGFI = 0.900, NFI = 0.933, CFI = 0.973,
RMSEA = 0.045, P = 0.000), so the model fit was deemed acceptable. In addition,
Hair et al. [27] recommended the following criteria for hypothesis testing: a critical
ratio (t-value) > |1.96| for a two-tailed test and a significance level (p-value) < 0.05.
According to Table 5, the results showed that PU has a positive effect on BI (β = 0.46, p
= 0.017), so H1 was supported. Further, the results confirmed that there was a positive
impact of PEU on BI (β = 0.45, p = 0.018) and PU (β = 0.95, p < 0.001). Therefore,
H2 and H3 were supported. The results also identified a positive influence of TR on
PEU (β = 0.28, p < 0.001) accepting H4. Further, the results confirmed that there is a
significant and positive influence of INN on PEU (β = 0.69, p < 0.001) and PU (β =
0.44, p = 0.049). Thus, H5 and H6 were accepted. Finally, the results found that F has
a positive effect on PU (β = 0.34, p = 0.03), supporting H7.
mobile payment [14], and wireless internet service [17]. This finding is also consistent
with the study of mobile shopping acceptance by Saprikis et al. [18] that confirmed a
positive effect of PU on BI. This suggests that users will have the intention to use mobile
payment because mobile payment is convenient, productive, and effective for dealing
with payments and transactions. Meanwhile, simple processes and ease of use of mobile
payment will also contribute to the user’s intention to accept mobile payment. Besides,
the results also identified a strong and positive association between PEU and PU, which
is in line with existing literature [7, 11, 13, 15–19].
The results further illustrated that TR has a positive influence on the PEU of mobile
payment. This finding is consistent with the findings of Muñoz-Leiva et al. [11], who
discovered a positive relationship between TR and PEU in mobile banking applications.
Besides, in the study of consumers’ use of social media for transactions, TR was proved
to have a positive impact on PEU [16]. Interestingly, the finding of this study is also
consistent with Zarmpou et al. [19], who noticed that TR has a very strong connection
with PEU of mobile services, as well as Pavlou [21], who confirmed a positive connection
between TR and PEU. If users have high confidence and trust in mobile payment, they
are likely to find mobile payment easy to use.
In addition, INN was noted to have a positive impact on PEU of mobile payment
services, which supports prior findings on mobile shopping adoption [18], NFC pay-
ment system adoption [13], mobile payment acceptance [14], and usage intention of
wireless internet services [17]. Users who are willing to try new things are more likely
than less innovative users to find mobile payment easy to use when dealing with pay-
ments and transactions. The results also revealed that INN also has a positive influence
on PU, which validates earlier studies [17, 19]. Thus, innovative consumers are likely
to perceive mobile payments as useful in their financial transactions.
Further, the results confirmed that F positively predicted the PU of mobile payment
services. This finding is consistent with the previous finding by Zarmpou et al. [19] that
identified F as a positive driver of PU for mobile services. This infers that usefulness is
influenced by the functionality of mobile payment systems, such as convenient interfaces,
quick response times, fast connection and transactions, and mobile infrastructure that
allows users to use mobile payment systems anywhere and anytime.
In conclusion, users will accept the use of mobile payment services due to two main
factors, which are usefulness and ease of use. Simultaneously, usefulness is affected by
the ease of use, innovativeness, and functionalities of mobile payment, such as trans-
action speed, response time, mobile infrastructure, and user-friendly interface. On the
other hand, ease of use is predicted by two important indicators, which are the user’s
innovativeness to accept sophisticated systems, including mobile payment, as well as
the user’s trust in mobile payment services.
of use regarding time and place. Developers also need to maintain system security to
make users feel safe and confident and to maintain their trust in mobile payment. Furthermore, the
results suggested that banks should consider these findings because of their reflection
on successful factors affecting mobile payment use to formulate effective marketing
strategies to encourage users to accept mobile payment. Besides, the government and
policymakers are also encouraged to employ the results of this study to support the banks
and developers in attracting citizens to use mobile payment services so that all stake-
holders can benefit from the digital economy. Likewise, policymakers should put great
effort into building knowledge and awareness of financial technology and its benefits,
and encourage citizens to try new technologies and mobile payment applications.
References
1. Evolution of digital payment industry, https://financebuddha.com/blog/evolution-digital-pay
ment-industry/. Accessed 04 Oct 2021
2. Bezhovski, Z.: The future of the mobile payment as electronic payment system. Eur. J. Bus.
Manage. 8(8), 127–132 (2016)
3. Tiwari, R., Buse, S.: The Mobile Commerce Prospects: A Strategic Analysis of Opportunities
in the Banking Sector. Hamburg University Press, Hamburg (2007)
4. National Bank of Cambodia: Project Bakong the next generation payment system. National
Bank of Cambodia, Phnom Penh (2020)
5. Lerner, T.: Mobile payment. 1st ed. Springer Vieweg, Mainz (2013)
6. The Top Mobile Payment Systems in Cambodia, https://cryptoasia.co/news/top-mobile-pay
ment-systems-cambodia/. Accessed 29 Sep 2021
7. Chav, T., Ou, P.: The factors influencing consumer intention to use internet banking and apps:
a case of banks in Cambodia. Int. J. Soc. Bus. Sci. 15(1), 92–98 (2021)
8. Do, N.H., Tham, J., Khatibi, A.A., Azam, S.M.F.: An empirical analysis of Cambodian
behavioral intention towards mobile payment. Manage. Sci. Lett. 9(12), 1941–1954 (2019)
9. Venkatesh, V.: Determinants of perceived ease of use: integrating control, intrinsic, motivation,
and emotion into the technology acceptance model. Inf. Syst. Res. 11(4), 342–365 (2000)
10. Davis, F.D.: Perceived usefulness, perceived ease of use, and user acceptance of information
technology. MIS Q. 13(3), 319–340 (1989)
11. Muñoz-Leiva, F., Climent-Climent, S., Liébana-Cabanillasa, F.: Determinants of intention to
use the mobile banking apps: an extension of the classic TAM model. Spanish J. Marketing –
ESIC 21(1), 25–38 (2017)
12. Oliveira, T., Thomas, M., Baptista, G., Campos, F.: Mobile payment: understanding the deter-
minants of customer adoption and intention to recommend the technology. Comput. Hum.
Behav. 61, 404–414 (2016)
13. Ramos-de-Luna, I., Montoro-Ríos, F., Liébana-Cabanillas, F.: Determinants of the intention
to use NFC technology as a payment system: an acceptance model approach. Inf. Syst. E-Bus.
Manage. 14(2), 293–314 (2015). https://doi.org/10.1007/s10257-015-0284-5
Determinants of USER’S Acceptance of Mobile Payment 577
14. Kim, C., Mirusmonov, M., Lee, I.: An empirical examination of factors influencing the
intention to use mobile payment. Comput. Hum. Behav. 26(3), 310–322 (2010)
15. Schierz, P.G., Schilke, O., Wirtz, B.W.: Understanding consumer acceptance of mobile
payment services: an empirical analysis. Electron. Commer. Res. Appl. 9(3), 209–216 (2010)
16. Hansen, J.M., Saridakis, G., Benson, V.: Risk, trust, and the interaction of perceived ease
of use and behavioral control in predicting consumers’ use of social media for transactions.
Comput. Hum. Behav. 80, 197–206 (2018)
17. Lu, J., Yao, J.E., Yu, C.-S.: Personal innovativeness, social influences and adoption of wireless
internet services via mobile technology. J. Strateg. Inf. Syst. 14(3), 245–268 (2005)
18. Saprikis, V., Markos, A., Zarmpou, T., Vlachopoulou, M.: Mobile shopping consumers’
behavior: an exploratory study and review. J. Theor. Appl. Electron. Commer. Res. 13(1),
71–90 (2018)
19. Zarmpou, T., Saprikis, V., Markos, A., Vlachopoulou, M.: Modeling users’ acceptance of
mobile services. Electron. Commer Res. 12(2), 225–248 (2012)
20. Dahlberg, T., Mallat, N., Ondrus, J., Zmijewska, A.: Past, present and future of mobile
payments research: a literature review. Electron. Commer. Res. Appl. 7(2), 165–181 (2008)
21. Pavlou, P.A.: Consumer acceptance of electronic commerce: integrating trust and risk with
the technology acceptance model. Int. J. Electron. Commer. 7(3), 101–134 (2003)
22. Midgley, D.F., Dowling, G.R.: Innovativeness: the concept and its measurement. J. Consumer
Res. 4(4), 229–242 (1978)
23. Agarwal, R., Prasad, J.: A conceptual and operational definition of personal innovativeness
in the domain of information technology. Inf. Syst. Res. 9(2), 204–215 (1998)
24. How Cambodia can capitalise on strides in financial inclusion. https://southeastasiaglobe.com/how-cambodia-can-capitalise-on-strides-in-financial-inclusion/. Accessed 02 Oct 2021
25. National Institute of Statistics: General population census of the Kingdom of Cambodia 2019.
Ministry of Planning, Phnom Penh (2020)
26. Bowerman, B.L., O'Connell, R.T., Murphree, E.S.: Business Statistics in Practice, 6th edn. McGraw-Hill/Irwin, New York (2011)
27. Hair, J.F., Black, W.C., Babin, B.J., Anderson, R.E.: Multivariate Data Analysis, 8th edn. Cengage Learning EMEA, Hampshire (2019)
28. Boslaugh, S., Watters, P.A.: Statistics in a Nutshell, 1st edn. O'Reilly Media (2008)
29. Bollen, K.A.: Indicator: methodology. Int. Encyclopedia Soc. Behav. Sci. 7282–7287 (2001)
30. Anderson, J.C., Gerbing, D.W.: Structural equation modeling in practice: a review and
recommended two-step approach. Psychol. Bull. 103(3), 411–423 (1988)
31. Fornell, C., Larcker, D.F.: Evaluating structural equation models with unobservable variables
and measurement error. J. Mark. Res. 18(1), 39–50 (1981)
A Proposed Framework for Enhancing
the Transportation Systems Based on Physical
Internet and Data Science Techniques
Abstract. Logistics and supply chain processes today are not sustainable and cause many problems. A traditional freight transportation system that moves commodities between the nodes of a supply chain consumes considerable time and cost, and it accounts for large quantities of carbon dioxide emissions from fuel consumption. The Physical Internet aims to facilitate the flow of goods through modular units and shared resources in order to reduce time, effort, and cost. It also changes the way goods move across the participants of the supply chain, making the way physical goods are moved, handled, stored, and supplied across the world more economical, environmentally friendly, socially efficient, and sustainable. As the participants of the supply chain gain shared access to central hubs and means of transportation, they can move commodities from one place to another more efficiently. This research takes Egypt as a case study to investigate the main transportation problems and how the Physical Internet can be applied to solve them. The paper proposes a framework that applies the Physical Internet, with its tools, together with data science techniques to overcome transportation problems: it helps reduce transportation costs, lessens the harmful impact on the environment, and makes the transportation system more efficient by building a pool of sharable resources and standardizing goods to enhance collaboration between the participants of the supply chain. In addition, neural networks benefit the supply chain in many areas, supporting decision making, forecasting, and the choice of the optimal transportation path.
1 Introduction
Logistics has become an integral part of our way of life, allowing people to consume
products from all over the world all year long at reasonable costs. It has evolved into
the backbone of global trade, owing to the efficiency of container shipping and handling
across continents [1]. Within the current structure of supply chains, logistics performance
is limited in achieving two opposing aims. The first goal is to achieve small, frequent
shipments in a just-in-time manner, while the second goal is to achieve superior environ-
mental performance by making the best use of transportation modes, particularly heavier
but cleaner modes. Increasing supply chain collaboration in logistics networks is one
approach to take advantage of synergies and, as a result, collaboratively improve logis-
tics performance, particularly in transportation while dealing with independent logistics
organizations, enabling a performance that is equivalent to or greater than that produced
by pooling [2]. From an economic, environmental, and social standpoint, transportation,
storage, and product handling in today’s world do not correspond with the policy of
sustainable development. From empty transportation to underutilized distribution sites,
inefficiency can be seen at every level [3].
Upcoming freight transportation will be very different from what it is now. It is
no longer a question of choice, but rather a necessity. Freight transportation is a major
contributor to global warming, accounting for 7–8% of carbon emissions. While other
industries have reduced global greenhouse gas emissions over time, transportation is
the only one where emissions are increasing. It is considered one of the most difficult
economic sectors to decarbonize, partially because freight transportation demand is
anticipated to increase dramatically over the next several decades, but also because it
is largely reliant on fossil fuels [4]. In response to this problem, this research creates a framework for use in a network of open and interconnected networks, called the Physical Internet (PI), which will change the way commodities are moved, handled, stored, and supplied across the world through the application of its tools, together with data science techniques in the form of neural networks that can assist in choosing the optimal route during transportation. This framework will aid in reducing fuel consumption and transportation expenses, as well as reducing harmful environmental effects, and will make the transportation system more flexible and efficient, enhancing the supply chain in its economic, social, and environmental aspects.
The structure of the paper is as follows: the next section discusses the main problems of the freight transportation system, focusing on road transport; the Physical Internet concept is then discussed, with its main principles and tools, alongside previous studies that applied the concept. After that, the framework model is presented with its main stages, the benefits of applying Physical Internet tools to the supply chain, how it can enhance the transportation process, and the challenges that may face its application. Finally, an analysis of the impact of the Physical Internet in solving the transportation problems is discussed, and the conclusion and future work are provided.
road transport accounts for more than 80% of overland commerce activity. The proper
supply of road transport services is critical for the unrestricted flow of freight and people
along corridors.
In Egypt, as well as the Middle East in general, road safety is a major concern.
Egypt has a high incidence of traffic accidents and road deaths. The majority of traffic deaths are among young and middle-aged adults, which has a significant impact on Egypt's expanding economy. As a result, one of the major concerns of Egypt's scholars, society, and government is to reduce or avoid road accidents [7]. The trucking industry is important, as it is responsible for 66.7% of freight transport on roads, ranging from light to heavy trucks [8]. It is therefore important to investigate the main problems that lead to truck accidents.
Fig. 1. Number of vehicle accidents on roads, 2013–2019.

Figure 1 shows the number of vehicle accidents on roads from 2013 to 2019.
In addition, heavy vehicles may damage the roads, which requires huge costs to repair. Moreover, toll gate costs for trucks that move among governorates are another problem for the transportation system. Fuel costs, toll gates, maintenance costs, and other expenses related to the truck's journey burden transportation companies, especially over long distances and on empty journeys. These costs in turn affect their profit; the social life of the drivers can also be harmed, as they are constantly traveling to transport goods, and extreme exhaustion from driving long distances can also cause accidents. Truck capacity is not always utilized effectively due to the limited standard storage spaces that exist now.
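The capacity-utilization point above can be made concrete with a small sketch: an order that must round up to a single fixed container size wastes more booked space than one packed from a modular set of sizes. All volumes and sizes below are hypothetical, chosen only for illustration.

```python
# Illustrative sketch (hypothetical volumes): fill rate achievable when an
# order must round up to a fixed container size versus a modular set of sizes.

def fill_rate(order_volume, container_sizes):
    """Greedily book containers (largest first) to cover the order and
    report utilization of the booked capacity."""
    booked = 0.0
    remaining = order_volume
    for size in sorted(container_sizes, reverse=True):
        while remaining >= size:
            booked += size
            remaining -= size
    if remaining > 0:                       # top up with the smallest unit
        booked += min(container_sizes)
    return order_volume / booked

order = 48.0  # m^3 of cargo (hypothetical)
print(round(fill_rate(order, [67.0]), 2))                   # one fixed size only
print(round(fill_rate(order, [67.0, 33.0, 16.0, 8.0]), 2))  # modular sizes
```

With the single fixed size the order uses about 72% of the booked space, while the modular set reaches about 98%, which is the effect standardized modular PI-containers aim for.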
Fig. 2. CO2 emissions from liquid fuel consumption (kt) - Egypt, Arab Rep [11].
Another important aspect is the fuel consumed while delivering the products and the resulting CO2 emissions that can harm the environment. Figure 2 shows the emissions of CO2 from liquid fuel consumption in Egypt up to 2016. Most trucks in Egypt are diesel-fueled vehicles. Trucks handle more than 97% of freight transport in Egypt, with 204,377,200 ton-kilometers per day [8], which means a large number of long-distance trips. Moreover, in some cases the return journey of the truck is empty, which wastes time, effort, and cost, and may also harm the environment with more CO2 emissions. So, is there any way to avoid empty journeys, while still delivering customer orders when needed, so that emissions can be reduced?
To summarize, there are many problems with the freight transportation system in Egypt that cause inefficiency in three aspects: economic, social, and environmental. From the economic perspective, problems include the high cost of long journeys, with extra costs paid for the movement of goods such as toll gate fees, fuel, and vehicle maintenance. From the social perspective, problems relate to the drivers: on long journeys they suffer from exhaustion
582 A. Osama et al.
and a lack of social life from working long hours and days, as well as health issues related to driving for many hours that can harm the muscles; some drivers also resort to taking medications and drugs to stay awake for long hours. From the environmental aspect, vehicles depend mainly on diesel, which makes transportation one of the most significant sources of carbon dioxide emissions. Finally, there are many transportation problems within the supply chain, such as inefficient capacity utilization: container sizes are limited, mostly 20- and 40-foot containers, which requires some vendors to consolidate goods until reaching full capacity to save costs; most importantly, the return journey is sometimes empty, which incurs many costs. These problems lead to inefficiencies in the transportation system such as increased costs, harm to the environment, increased delivery time, and problems for drivers. This research therefore aims to propose a framework for implementing the Physical Internet to solve transportation problems.
3 Physical Internet
The Physical Internet (PI, π) was proposed to address the existing global logistics’ lack
of economic, environmental, and social sustainability, and is based on the rapid evolution
of the digital world, owing to different standardizations that have helped reshape digital
communications in networks [12]. Physical internet has many benefits that can solve
problems that face the transportation system, and it will enhance the whole supply chain
performance. Physical Internet will improve the efficiency and sustainability of logistics
in its broadest meaning by an order of magnitude. The concept of the universal inter-
connectedness of logistics networks and services is exploited by the Physical Internet. It
proposes encapsulating goods and products in globally standardized, green, modular, net-
worked, and smart containers that can be moved and distributed over rapid, dependable,
and environmentally friendly multimodal transportation and logistics systems [12].
The Physical Internet concept, its standardized tools, and its basic guidelines were first presented by [12, 13]. These works proposed the basis and core principles of the Physical Internet to address the lack of economic, environmental, and social sustainability in existing global logistics, building on the rapid evolution of the digital world and the standardizations that have reshaped digital communications in networks, and discussed how the Internet metaphor applies.
The Logistic Web is a global network of physical, digital, human, organizational, and
social actors and networks that serve the dynamic and evolving logistics demands of the
world. The Physical Internet intends to enable the Logistics Web to be more open and
global while being dependable, resilient, and adaptable in the pursuit of efficiency and
sustainability.
A Proposed Framework for Enhancing the Transportation Systems 583
As shown in Fig. 3, the Mobility Web, the Distribution Web, the Realization Web, and the Supply Web are four interconnected webs that make up the Logistics Web. The Mobility Web is concerned with the movement of physical things across a global network of open unimodal and multimodal hubs, transits, ports, highways, and paths. The Distribution Web is concerned with the distribution of things throughout a global network of open warehouses, distribution hubs, and storage places. The Realization Web is about making, assembling, personalizing, and retrofitting products as best fits within the worldwide interconnected set of open factories of all types. The Supply Web is a globally interconnected network of open suppliers and contractors for delivering, receiving, and supplying objects. Each Web makes use of the other Webs to improve its performance [12, 14].
From a conceptual standpoint, the essential components of the system, as shown in Fig. 4, are PI-containers, PI-nodes, and PI-movers.
The first component is the PI-container: the Physical Internet encapsulates physical items in physical packets or containers, called PI-containers to distinguish them from current containers. From an informational standpoint, each PI-container has a unique global identification, which provides container identity, integrity, routing, conditioning, monitoring, traceability, and security via the Physical Internet. Radio Frequency Identification (RFID) and/or Global Positioning System (GPS) technologies are currently thought suitable for equipping PI-container tags [15].
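A minimal sketch of the per-container record this implies — a unique global identifier plus a trace of RFID/GPS scan events — with all field and location names invented for the example:

```python
# Toy PI-container record: a globally unique id and a traceability log of
# scan events; field names and locations are assumptions for illustration.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class PIContainer:
    uid: str                                     # unique global identifier
    events: List[Tuple[str, str]] = field(default_factory=list)

    def scan(self, location: str, status: str) -> None:
        """Record an RFID/GPS read along the container's route."""
        self.events.append((location, status))

box = PIContainer(uid="PI-0001")
box.scan("hub-A", "loaded")
box.scan("hub-B", "in-transit")
print(box.uid, len(box.events))  # PI-0001 2
```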
Fig. 4. Components of the Physical Internet: PI-containers; PI-nodes (PI-site, PI-facilities, PI-transit, PI-switch, PI-bridge, PI-sorter, PI-composer, PI-gateway, PI-hub, PI-distributor); PI-movers (PI-vehicles such as PI-boats, PI-locomotives, PI-planes, PI-robots, and PI-trucks; PI-carriers such as PI-trailers, PI-tugs, and PI-wagons; PI-conveyors; PI-handlers); and the PI-system.
As stated by [16] all the different packages, cases, totes, and pallets now used in the
encapsulation levels one to four are proposed to be replaced by standard and modular
PI-containers. However, they must be available in a variety of structural grades to meet
the wide range of planned applications.
Fig. 5. Proposed physical internet encapsulation characterization – Source (Montreuil et al., 2014)
internet over traditional network architecture for flow assignment difficulties. The study
in [23] proposed a freight transportation model based on the Physical Internet, in which
freight is transported from hub to hub utilizing various tractors assigned to each hub.
The concept of combining two trailers into a road train is being considered. The data
shows that both consolidation and waiting to enhance the likelihood of a return hauling
opportunity have a beneficial influence on overall cost, fill rate, the average amount of
the night spent at home, and GHG emissions. A mathematical program and a break-
down algorithm for solving the problem of optimal space utilization by determining the
size and number of modular containers to pack a collection of items as the paper [24]
demonstrate how the deployment of standardized containers results in greater vehicle
space utilization through a case study. Protocols for Physical Internet transportation were
introduced [2, 25]. As an aggregate of optimum point-to-point dispatch models between
pairs of cities, a systems model of conventional and Physical Internet networks is built.
This is then utilized to define the behavior of conventional and Physical Internet logistics
systems for a variety of key performance metrics in logistics systems. The Physical Inter-
net’s advantages include lower inventory costs and lower total logistics system costs.
They simulate asynchronous container shipment and creation inside a linked network of
services, as well as the optimum path routing for each container to save transportation
costs. The study in [26] suggested a multi-agent simulation model for freight transporta-
tion resilience on the Physical Internet, taking disturbances at hubs into account. To deal
with various forms of disturbances, they presented two dynamic transit protocols. The
study [17] presented a novel routing technique based on the physical Internet-Border
Gateway Protocol (PI-BGP), which is the Internet’s version of BGP. It created a new
protocol to provide a fresh approach to the problem of PI-container routing on the Phys-
ical Internet. They can ensure a rapid flow inside and between PI-hubs by focusing on
the exclusive routing of PI-Containers following the PI-BGP, eliminating delays and
solving stocking difficulties. The study [27] uses a basic model to capture the core of
the problem to investigate the resilience of a network delivery system focusing more on
the time and the total cost of travel to measure performance. The findings imply that
networks with redundancy may adapt well to fluctuations in demand, but hub-and-spoke networks without redundancy cannot make use of the Physical Internet's benefits [12].
As noted above, the Physical Internet offers many benefits that can solve the problems facing the transportation system and enhance overall supply chain performance. A framework is therefore provided in the following sections to link tools of the Physical Internet to the supply chain and thereby enhance the transportation process, with data science integrated into the framework to facilitate it.
4 Data Science
Data science is an interdisciplinary approach that aims to extract value from data. A massive amount of data can be collected from websites, sensors, customers, smartphones, and other sources, and data science processes these data to obtain useful insights. Data science helps make the supply chain more efficient: it supports forecasting throughout the supply chain, including demand and supply forecasting, networking, transportation, and inventory optimization. In transportation, data science with the help of machine learning can predict the optimal routes, identify congestion areas, and support the tracking and scheduling of shipments.
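As a minimal illustration of the forecasting role described above, the sketch below fits a least-squares trend line to twelve months of synthetic shipment-demand figures and extrapolates one month ahead; all numbers are invented for the example.

```python
# Simplest data-science forecast: a least-squares trend line over
# synthetic monthly shipment demand (invented numbers).
import numpy as np

months = np.arange(12)
# linear growth plus small fluctuations
demand = 100 + 5 * months + np.array([3, -2, 4, 0, -3, 2, 1, -1, 2, 0, -2, 1])

slope, intercept = np.polyfit(months, demand, 1)
forecast = slope * 12 + intercept  # extrapolate to month 12
print(round(float(forecast), 1))
```

Real supply-chain forecasting would of course use richer models and features; the point here is only that a fitted model turns historical records into a forward-looking estimate.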
5 Artificial Intelligence
Artificial intelligence (AI) does not imply creating a super-intelligent machine capable
of solving any problem in a flash, but rather creating a machine capable of human-
like behavior. It refers to machines that can perform one or more of the following
tasks: comprehending human language, performing mechanical tasks requiring complex
maneuvering, solving computer-based complex problems involving large amounts of
data in a short amount of time, and providing human-like responses, and so on. The term
machine learning refers to computer software that can learn to perform actions that were not expressly designed by the program's inventor; instead, it can discover and carry out behavior that the author is entirely unaware of. Despite their origins in fundamentally distinct fields, machine
learning has brought together a large array of algorithms that are essentially created from
the concepts of pure mathematics and statistics. They share one additional element in
addition to the roots: the use of computers to automate difficult computations. These
computations eventually lead to the solution of problems that appear to be so difficult
that they appear to be solved by an intelligent creature, or Artificial Intelligence [28].
Artificial intelligence with the application of machine learning can benefit the supply
chain in many ways from demand prediction to warehouse management, delivery with
drones, and route optimization as it can analyze the data of the routing to find the optimal
route that delivers without any delays and reduces costs.
6 Neural Network
Neural networks are an emerging artificial intelligence system that is built on recent
advances in human brain tissue biology research. Its principle is to simulate the structure
and functioning of the human brain. Neural network technology has made a breakthrough
in the understanding of some of artificial intelligence’s (AI) limitations and has been
effectively implemented in a variety of disciplines, demonstrating that the efficiency and
accuracy of other AI systems cannot be matched. With real-time processing capabilities,
the neural network has a great adaptability capacity and can quickly assess and handle
emergent restrictions [29]. As shown in Fig. 7, a neural network has three kinds of layers: input, hidden, and output layers.
The data is transmitted from one layer to the next, beginning with the input layer and
progressing via the hidden levels to the output layer, where the result is specified. Each
layer gets an input value and produces an output value. A layer’s input value is the
preceding layer’s output value. Neural networks have the potential to learn from their
surroundings. This is accomplished by adjusting the weights until the artificial neural
network can generalize the outputs’ findings. After the learning process is complete,
the neural network may be utilized with fresh inputs to produce new predictions or
characterize current data patterns [30]. There are two types of network: the shallow, single-hidden-layer network and the deep multilayer perceptron neural network.
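The layer-to-layer data flow just described can be sketched in a few lines: each layer's input is the previous layer's output, and stacking more than one hidden layer gives the deep multilayer form. The weights here are random, purely to show the shapes, not a trained model.

```python
# Forward pass of a toy multilayer perceptron: each layer's input is the
# previous layer's output (random weights, illustration only).
import numpy as np

rng = np.random.default_rng(0)

def layer(x, n_out):
    """One dense layer: multiply by a weight matrix, apply an activation."""
    w = rng.normal(size=(x.shape[0], n_out))
    return np.tanh(x @ w)              # this layer's output = next layer's input

x = rng.normal(size=4)                 # input layer: 4 features
h1 = layer(x, 8)                       # first hidden layer
h2 = layer(h1, 8)                      # second hidden layer (deep MLP)
y = h2 @ rng.normal(size=(8, 1))       # output layer
print(y.shape)                         # (1,)
```

Training would then adjust the weight matrices to minimize the loss, as the text describes; only the untrained forward pass is shown here.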
Neural network technology has been used in many facets of supply chain manage-
ment. The neural network in the supply chain can be used in three different areas including
optimization, forecasting, and decision support: The most widely used computing tech-
nique for solving optimization issues is the neural network. It has a significant impact on
supply chain management. The application of neural networks to supply chain management optimization problems such as shop scheduling, warehouse management, and transportation route selection is currently being researched [29]. In most
cases, there is a single input layer, one output layer, and a variety of hidden layers.
The number of input and output variables is equal to the number of nodes in the input
and output layers, respectively. The nodes in each layer are connected to nodes in the
following layer by different weights that the neural network programming puts up each
training period, starting with the input layer. The method is repeated until the output
has the smallest loss in comparison to the observation [31]. Generally, both single-layer and multilayer neural networks can be used for supply chain issues, but the multilayer network is more accurate, as the data pass through many hidden layers, supporting better decisions. The multilayer system can be used in forecasting
transportation to find the optimal route to be taken within the supply chain to save fuel
consumption and time needed to deliver cargo from one node to another throughout the
supply chain.
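To make the route-selection idea concrete: once each road leg has been given a combined score — in practice this is where a trained network's estimates of congestion, fuel use, and travel time would enter; below the scores are simply hard-coded, and the hub names are hypothetical — choosing the optimal path reduces to a shortest-path search.

```python
# Hedged sketch: route choice over hub-to-hub legs as a shortest-path search.
# Edge costs stand in for learned time/fuel/congestion scores (hypothetical).
import heapq

def best_route(graph, src, dst):
    """Dijkstra's algorithm over combined per-leg costs."""
    queue, seen = [(0.0, src, [src])], set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, c in graph.get(node, {}).items():
            if nxt not in seen:
                heapq.heappush(queue, (cost + c, nxt, path + [nxt]))
    return float("inf"), []

hubs = {  # combined cost per leg (invented numbers)
    "Cairo": {"Giza": 1.0, "Suez": 2.5},
    "Giza": {"Alexandria": 3.0},
    "Suez": {"Alexandria": 1.0},
    "Alexandria": {},
}
print(best_route(hubs, "Cairo", "Alexandria"))  # (3.5, ['Cairo', 'Suez', 'Alexandria'])
```

The division of labor is the point: the network learns the per-leg costs from data, and a classical search then picks the route.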
7 Proposed Framework
Based on the literature discussed above, a framework is provided to enhance the transportation system and overcome the main problems facing it.
This section illustrates the framework with some tools of the physical internet applied
in stages through the supply chain to enhance it. A framework has been proposed as
shown in Fig. 8 that illustrates how physical internet and neural networks can benefit
the supply chain.
The first stage is mainly a gathering process: goods in the perception layer that need to be shipped are received and collected into PI-containers of different sizes. As mentioned before, encapsulation is done through only four tiers: packaging containers, PPI-containers, HPI-containers, and TPI-containers. Goods are produced and packaged in PPI-containers, which are then inserted into HPI-containers without any pallets; because the containers have standardized measures, capacity can be fully utilized. The HPI-containers can then be consolidated into TPI-containers for shipment, or handled on their own. These containers send their data using Internet of Things (IoT) technology, as each container carries RFID and GPS technologies that allow it to be tracked. The goods are then sent on to the transmission stage through the network stage. In the network stage, the infrastructure layer consists of the routes over which PI-vehicles move PI-containers. These roads should support heavy loads and should have the infrastructure needed to track containers through their RFIDs. Artificial intelligence and neural networks can be applied in this layer to route the means of transport to their specified hub: the neural network analyzes factors such as congestion areas, fuel consumption, and the existing roads to identify the optimal route for the trucks. In the transmission stage, the PI-movers come into play: PI-handlers such as PI-forklifts move each container, or group of small consolidated containers, from one place to another, and PI-vehicles are structured to fit the new container sizes. PI-vehicles move a PI-container from one hub to the next until it reaches its destination, and each PI-vehicle returns to its hub loaded with other PI-containers. Then, in the processing stage, cross-docking takes place: goods are unloaded from the trucks and pass through PI-sorters and PI-conveyors, which move the containers into the racks and then from the racks to the other side of the hub for reshipment toward their destination. Finally, in the application stage, after the containers have moved through the different hubs, the goods are delivered to the customers at their destination and the PI-containers are reused to contain other products.
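The encapsulation chain in the first stage can be modeled as a toy example: goods grouped into PPI-containers, PPI into HPI, HPI into a TPI-container, each tier tagged with a unique identifier as the framework requires. The per-tier capacities below are assumptions for illustration only.

```python
# Toy model of the encapsulation tiers (assumed capacities):
# items -> PPI-containers -> HPI-containers -> TPI-container.
import uuid

def encapsulate(units, capacity, tier):
    """Group `units` into containers of `capacity`, tagging each with an id."""
    containers = []
    for i in range(0, len(units), capacity):
        containers.append({
            "id": f"{tier}-{uuid.uuid4().hex[:8]}",
            "tier": tier,
            "contents": units[i:i + capacity],
        })
    return containers

goods = [f"item-{i}" for i in range(16)]
ppi = encapsulate(goods, 4, "PPI")   # 4 items per PPI-container
hpi = encapsulate(ppi, 2, "HPI")     # 2 PPI per HPI-container
tpi = encapsulate(hpi, 2, "TPI")     # 2 HPI per TPI-container
print(len(ppi), len(hpi), len(tpi))  # 4 2 1
```

Because each tier carries its own identifier, a scan at any hub can resolve a TPI-container down to the individual goods inside it, which is the traceability property the framework relies on.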
to handle new containers, material handling equipment needs to be adapted, and vehicles and other movers will need to be modified to fit the new container sizes. Infrastructure includes building hubs at suitable points and building containers in the new sizes; the hubs should have sorters that automatically sort and ship containers to their desired destinations, and warehouses in ports and rail stations should have sorters and conveyors. Finally, the most important challenge is that some countries may not be able to apply the Physical Internet and its tools, which may interrupt the global supply chain, as they may not be able to change to the new tools of the Physical Internet.
As discussed in previous sections, the application of the Physical Internet with its protocols and tools can enhance the freight transportation process in the supply chain. As the paper noted at the outset, transportation problems can be divided into three aspects: economic, social, and environmental. The implementation of the concept helps to solve problems in all three. From the economic aspect, applying the Physical Internet tools avoids the problems of the long journey and the empty return journey, since PI-containers are moved from one hub to another until they reach their destination. This shortens the transportation journey, as each vehicle moves between hubs within a specific region and returns loaded, which enhances transportation efficiency: the costs of long journeys are cut because less fuel is needed, and vehicle repair costs are reduced thanks to the shorter journeys. From the social perspective, the driver's social life is enhanced: drivers no longer need to drive long distances, which shortens working hours, lets them return home the same day, and removes the need to take medicines to stay awake, which may in turn reduce the rate of accidents caused by trucks. The environmental aspect is enhanced as well: transportation journeys are shortened, reducing fuel consumption; there are no empty return journeys, so fuel is used efficiently to move products without wastage, as the truck is always loaded; and the number of vehicles may be reduced due to the consolidation of Physical Internet containers. It can thus be concluded that the Physical Internet tools will help to enhance the efficiency of the transportation process.
11 Conclusion
To conclude, this research focused on applying the physical internet tools to solve transportation problems, given transportation's importance in the supply chain. The physical internet is a new concept that aims to change the way goods move across supply chain participants. Its mechanism is similar to that of the digital internet: goods are packaged into PI-containers through different encapsulation layers, including the collection of data, and the containers then move through connected routes between hubs until they reach their final destination. In the beginning, the main problems facing the freight
A Proposed Framework for Enhancing the Transportation Systems 593
transportation system in Egypt, especially in the trucking industry, were discussed along with their environmental, social, and economic impacts. Then the literature on the physical internet concept, its components, and its protocols was discussed, and previous studies that implemented physical internet components in different areas of the supply chain were reviewed. After that, a framework was proposed that considers each stage of the supply chain, applies physical internet tools at every stage, and incorporates artificial intelligence. The benefits of the proposed five-layer framework with the physical internet tools were then discussed, and a neural network can also lead trucks to the optimal route. Applying the new paradigm will make the transportation process more efficient and will also benefit the social, economic, and environmental sectors. The analysis showed that applying the framework in the transportation system would enhance the efficiency of the system and solve many problems from financial, social, and environmental perspectives. Further research is recommended on the infrastructure needed to implement the physical internet and on the concept's effects in other areas of the supply chain.
A Systematic Review of Machine Learning
and Explainable Artificial Intelligence (XAI)
in Credit Risk Modelling
Abstract. The emergence of machine learning and artificial intelligence has created new opportunities for data-intensive science within the financial industry. The implementation of machine learning algorithms still faces doubt and distrust, mainly in the credit risk domain, due to the lack of transparency in decision making. This paper presents a comprehensive review of research dedicated to the application of machine learning in credit risk modelling and how Explainable Artificial Intelligence (XAI) could increase the robustness of a predictive model. In addition, some fully developed credit risk software available in the market is also reviewed. It is evident that adopting complex machine learning models produces high performance but limited interpretability. Thus, the review also studies some XAI techniques that help to overcome this problem by breaking out of the ‘black-box’ paradigm. XAI models mitigate bias and establish trust and compliance with regulators to ensure fairness in loan lending in the financial industry.
1 Introduction
According to The Malaysian Reserve, statistics published by the Malaysian Department of Insolvency show that more than 95,000 people defaulted on their loans between 2014 and 2018; the defaults comprised personal loans (27.76%), hire purchase loans (24.73%), housing loans (14.09%) and credit cards (9.91%) [1]. Loan defaults not only disrupt the individual's credit score but also impose monetary losses on banks. This is also evident from a publication released by Bank Negara Malaysia, which states that the cumulative amount of impaired loans had reached RM31 billion as of July 2021 [2]. This is a huge loss for the banking sector, and it could pose significant risk to Malaysia's economy. Thus, financial institutions are encouraged to employ a reliable credit risk model to minimize default risk.
Credit risk is the risk that the lender will not receive the principal and interest owed by the borrower [3]. Moreover, credit risk assessment plays
2 Domain Research
2.1 Credit Risk in Financial Industry
There are many types of risk faced by the banking industry, as seen in Fig. 1, including credit risk, market risk, liquidity risk, exchange rate risk, interest rate risk and operational risk. Among the different types of risk mentioned, credit risk is one of the main risks that most banks face nowadays [4].
Credit risk refers to the risk of loss imposed on creditors by borrowers who are unable to meet their obligations [4–7]. This type of risk creates uncertainty about net income and the market value of shares. Kolapo et al. [8] indicated that a bank is likely to experience a financial crisis if it is highly vulnerable to credit risk, so the performance of a bank can be judged by the approach it uses to handle credit risk. This is further supported by Chen and Pan [9], who state that credit risk is the most significant risk faced by banks, and that different banks take different approaches to credit risk management, which allows them to adapt to changing environments. In
598 Y. S. Heng and P. Subramanian
the opinion of Rehman et al. [7], ignorance of credit risk among bank personnel will negatively affect the bank's development and customers' interests. Thus, credit risk is an essential field of study: if some borrowers default on the loans issued, the effects can ripple from individual banks through the entire banking system and trigger a banking crisis [10]. Banks with high credit risk therefore face substantial losses, mainly because borrowers defaulting on their repayments can push the bank towards bankruptcy and insolvency.
Credit risk can arise from several factors: poor management, poor loan underwriting, poor lending procedures, interference by government bodies, inappropriate credit policies, unstable interest rates, direct lending, low reserves, liquidity levels, over-licensing of banks, limited institutional capacity, insufficient supervision by the central bank, lax credit assessment and inappropriate laws [11]. Banks are therefore advised to minimize the risk by improving lending procedures, maintaining well-documented information about borrowers and stabilizing interest rates, which can reduce the number of loan defaults and non-performing loans.
Effective credit risk management can enhance the reputation of the bank and the confidence of its depositors. Moreover, the financial health of a bank depends heavily on good credit risk management, so a sound credit risk policy plays an essential role in boosting the bank's performance and protecting its capital adequacy [11]. Pradhan and Shah [12] examined the relationship between credit management practices, credit risk mitigation measures and obstacles to loan repayment in Nepal, using survey-based primary data and a correlation analysis. The results revealed that credit risk management practices and credit risk mitigation measures have a positive relationship with loan repayment, whilst the obstacles faced by borrowers have no significant impact on it. This indicates that the credit risk management practices and mitigation actions taken by a bank help to reduce credit risk by encouraging borrowers to repay their loans on time.
The Basel Accords were developed with the aim of establishing an international governing framework for controlling market risk and credit risk, ensuring that banks hold enough capital to protect themselves in a financial crisis. The new Basel Capital Accord (Basel II) states that banks should implement an internal credit risk model to assess default risk [13]. Effective credit risk management not only helps maintain the profitability of a bank's businesses but also helps sustain the stability of the economy [14]. Moreover, Basel II relies on three pillars for its functioning: minimum capital requirements, the supervisory review process, and market discipline.
According to Basel Committee on Banking Supervision [13], the risk parameters
of Basel II are probability of default (PD), exposure at default (EAD) and loss given
default (LGD). With these three risk parameters, the expected loss (EL) of the bank can
be computed with the formula below:
EL = PD ∗ EAD ∗ LGD (1)
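The expected-loss formula (1) can be computed directly. The sketch below is only an illustration of the arithmetic; the borrower figures are invented for demonstration and are not taken from any Basel or bank data.

```python
# Illustration of the Basel II expected-loss formula EL = PD * EAD * LGD.
# All figures below are hypothetical.
def expected_loss(pd_: float, ead: float, lgd: float) -> float:
    """Expected loss from probability of default (PD), exposure at
    default (EAD) and loss given default (LGD)."""
    return pd_ * ead * lgd

# A borrower with a 2% default probability, RM100,000 exposure,
# and 40% loss given default:
el = expected_loss(0.02, 100_000, 0.40)
print(el)  # 800.0
```
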
In general, the banking industry plays an essential role in supporting the financial
stability within a country. Thus, it is crucial for financial institutions to fully understand
Review of Machine Learning and XAI in Credit Risk Modelling 599
and ensure that data-driven decisions are reached by computing the Expected Loss as outlined by Basel II, in order to avoid the unfortunate impact of credit risk. With data analytics, several machine learning techniques are used to predict credit risk; these are reviewed in the following section.
scoring model to predict housing loan defaults in Malaysia. The authors employed several variations of logistic regression using data acquired from the Malaysian Central Credit Reference Information System (CCRIS). The variations involved balanced and unbalanced classes, with and without variable selection. The authors reported that all four models yield favorable results, but logistic regression on a balanced dataset with variable selection obtained the highest percentage of correctly classified data and the best sensitivity, assuming a 0.5 cut-off value.
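The balanced-class setup above can be sketched in scikit-learn. Since the CCRIS data is not public, this sketch uses a synthetic imbalanced dataset and approximates training on a balanced class via `class_weight="balanced"`; the 10% default rate is an assumption for illustration only.

```python
# Sketch of balanced-class logistic regression with a 0.5 cut-off,
# on synthetic data standing in for the (non-public) CCRIS records.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: roughly 10% defaults (class 1).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" plays the role of a balanced training set.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Classify with the 0.5 cut-off used in the study above.
proba = clf.predict_proba(X_te)[:, 1]
pred = (proba >= 0.5).astype(int)

# Sensitivity (recall on the default class) was the headline metric.
sensitivity = pred[y_te == 1].mean()
print(f"sensitivity: {sensitivity:.2f}")
```
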
However, some machine learning techniques are reported to generate better results than statistical techniques. Tsai and Wu [17] state that machine learning is far superior to traditional statistical models. This is supported by Bellotti and Crook [22], who compared support vector machines (SVM) against traditional methods such as logistic regression and linear discriminant analysis for predicting the risk of default. The results indicated that SVMs with linear and Gaussian radial basis function (RBF) kernels produce the best results, with an AUC of 0.783 for both. Although the performance difference between SVM and the traditional methods is not significant, SVM proved useful for feature selection, identifying the variables most important for predicting the probability of default. Lee [26] also applied an SVM with an RBF kernel to a corporate credit rating problem, using 5-fold cross-validation with a grid-search technique to find the best parameters. The author compared the SVM's results against multiple discriminant analysis (MDA), case-based reasoning (CBR) and a three-layer fully connected back-propagation neural network (BPN); the SVM outperformed the other methods without overfitting.
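The 5-fold grid-searched RBF-kernel SVM described above can be sketched as follows. The corporate-rating data is not public, so a synthetic dataset stands in, and the parameter grid is an assumption chosen for illustration rather than the grid used in the study.

```python
# Minimal sketch of an RBF-kernel SVM tuned by 5-fold grid search.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Search over C and gamma with 5-fold cross-validation.
grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```
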
Byanjankar et al. [27] used an artificial neural network to predict the default probability of peer-to-peer (P2P) loan applicants and compared it with logistic regression. The results show that the neural network is effective at identifying defaulting borrowers, whereas logistic regression is better at identifying non-defaulting borrowers. Even so, the neural network's result is promising, as forecasting defaults in advance prevents creditors from investing in bad applicants. In another P2P credit risk study, Bae et al. [28] developed online P2P lending default prediction models using stepwise logistic regression, classification tree algorithms (CART and C5.0) and a multilayer perceptron (MLP). Evaluated with 5-fold cross-validation, MLP achieved the highest average validation accuracy (81.78%), whereas logistic regression had the lowest (61.63%).
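An MLP-versus-logistic-regression comparison under 5-fold cross-validation, as in the study above, can be sketched on synthetic stand-in data; the network size and other hyperparameters here are illustrative assumptions, not those of the paper.

```python
# Compare an MLP with logistic regression under 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=12, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "MLP": make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(32,),
                                       max_iter=500, random_state=0)),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```
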
Moreover, Chandra Blessie and Rekha [29] proposed loan default prediction based on Logistic Regression, Decision Tree, Support Vector Machine and Naïve Bayes. The results indicated that the Naïve Bayes classifier is highly efficient and gave superior results to the other classifiers. Data cleaning, feature engineering and exploratory data analysis (EDA) were conducted before training the models. The features studied during EDA were applicant income, co-applicant income, loan amount, credit history, loan status, gender, relation status, education status and property
area. In another study, Mafas developed a predictive model for loan default prediction in peer-to-peer lending communities using Logistic Regression, Random Forest, and Linear SVM on a selected feature set; Random Forest outperformed the others, achieving an accuracy of 92%. The fittest feature subset was obtained using a Genetic Algorithm and evaluated with a Logistic Regression model [30].
After careful review, it is clear that machine learning models can easily work with large datasets and generate highly accurate predictions, while statistical techniques are simpler and more user friendly, which keeps them popular in the financial industry. Care must also be taken to avoid overfitting when fitting machine learning models, as overfitting would defeat the purpose of the study. This section discussed the performance of individual statistical and machine learning models; newer research also experiments with ensemble models, such as the stacking approach.
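The stacking approach mentioned above can be sketched with scikit-learn's `StackingClassifier`. The choice of base learners and the synthetic data here are illustrative assumptions, not a reconstruction of any reviewed study.

```python
# Stacking: base learners feed predictions to a logistic-regression
# meta-learner, trained with internal cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),
    cv=5,
)
acc = cross_val_score(stack, X, y, cv=5).mean()
print(f"stacked accuracy: {acc:.3f}")
```
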
Ensemble Model vs Individual Model. Aside from individual models, some researchers have reported that ensemble models can yield better accuracy than individual models. Yao [31] experimented with a single Decision Tree and two ensemble learning algorithms, AdaBoost and Bagging (Bootstrap Aggregation), with Decision Tree as the baseline algorithm, to predict the creditworthiness of applicants on the Australian credit dataset. The results indicate that the ensemble learner, AdaBoost CART with 14 features, produced better results than a single Decision Tree without much added complexity. Likewise, Xu et al. [32] adopted an ensemble model with a different approach: an ensemble technique of support vector machines (SVM) for credit risk assessment on the Australian and German datasets. The authors experimented with a voting ensemble based on single SVMs and four SVM-based ensemble models with four different kernel functions (polynomial, linear, RBF and sigmoid) against individual SVM models. Principal Component Analysis (PCA) was applied before training to select credit features, and five-fold cross-validation was used for model validation. The results show that the SVM ensemble performed better than the individual SVM classifiers, and the authors suggested that ensemble models are a promising way to improve credit risk prediction performance.
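The kernel-SVM voting ensemble with PCA described above can be sketched as follows, on synthetic stand-in data (the Australian and German datasets are not bundled here); the number of PCA components is an assumption for illustration.

```python
# PCA for feature extraction, then a soft-voting ensemble of SVMs
# with four different kernels, evaluated with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=12, random_state=0)

kernels = ["linear", "poly", "rbf", "sigmoid"]
ensemble = make_pipeline(
    StandardScaler(),
    PCA(n_components=8),
    VotingClassifier([(k, SVC(kernel=k, probability=True, random_state=0))
                      for k in kernels], voting="soft"),
)
acc = cross_val_score(ensemble, X, y, cv=5).mean()
print(f"ensemble accuracy: {acc:.3f}")
```
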
Madaan et al. [33] proposed using Random Forest and Decision Tree to assess individual loans based on their attributes. The authors conducted exploratory data analysis to get acquainted with the dataset and performed data pre-processing. The data were then split into training (70%) and testing (30%) sets for model training. The classification report shows that Random Forest outperforms Decision Tree, with accuracy scores of 80% and 73% respectively. Zhu et al. [34] also proposed Random Forest classification, but in a different scenario: predicting loan default on a P2P online lending platform, compared against Decision Tree, Support Vector Machine (SVM) and Logistic Regression. The results indicated that Random Forest performs significantly better at identifying loan defaults. The authors overcame the imbalanced classes in the dataset by applying the SMOTE (Synthetic Minority Oversampling Technique) method, which generates new samples for
the minority class. Furthermore, the authors suggested that larger datasets and fine-tuned models could improve accuracy in future research. Another P2P loan default prediction study was conducted by Li et al. [35] using XGBoost, Logistic Regression and Decision Tree. The predictive accuracy of XGBoost (97.705%) outperformed the other models under five-fold cross-validation. Other comparisons covered AUC value, classification error rate, model robustness and model run time: although XGBoost has the best robustness and lowest error rate, its run time is the slowest of the models compared. The authors nevertheless conclude that XGBoost is drastically better than traditional models in nearly all respects, and they visualized the ten features with the most significant influence on loan default rates based on the XGBoost classifier.
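The boosting-with-feature-importance workflow above can be sketched library-agnostically. The studies use XGBoost; as a stand-in, this sketch uses scikit-learn's gradient boosting on synthetic data and lists the ten most influential features, analogous to the top-ten plot described.

```python
# Gradient boosting with 5-fold cross-validation, then the ten most
# influential features (a stand-in for the XGBoost workflow above).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=800, n_features=20, n_informative=8,
                           random_state=0)

gbm = GradientBoostingClassifier(random_state=0)
acc = cross_val_score(gbm, X, y, cv=5).mean()
print(f"5-fold accuracy: {acc:.3f}")

gbm.fit(X, y)
top10 = np.argsort(gbm.feature_importances_)[::-1][:10]
print("top-10 feature indices:", top10.tolist())
```
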
Zhao et al. [36] suggested using an ensemble learning classification model, adaptive boosting (AdaBoost) with decision trees, for the credit scoring problem. Ten-fold cross-validation was performed to compare AdaBoost-DT, Decision Tree and Random Forest; the AdaBoost-DT model yielded the highest accuracy, and the authors recommended experimenting with parameter optimization methods in future research. Udaya Bhanu and Narayana [37] proposed random forest, logistic regression, decision tree, K-nearest neighbor and Support Vector Machine for customer loan prediction. The authors preprocessed the data and applied feature engineering techniques to enhance the performance of the machine learning algorithms. The comparative study shows that Random Forest achieves the best accuracy, 82%, in classifying loan candidates, with an excellent F1-score.
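A comparative study like the one above, reporting both accuracy and F1 per classifier, can be sketched as follows; the data is synthetic and the model settings are illustrative assumptions.

```python
# Several classifiers on the same data, reporting accuracy and F1.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

models = {
    "random forest": RandomForestClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-NN": KNeighborsClassifier(),
}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring=["accuracy", "f1"])
    print(f"{name}: acc {cv['test_accuracy'].mean():.3f}, "
          f"F1 {cv['test_f1'].mean():.3f}")
```
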
In addition to the above models, LightGBM is a recently popular machine learning algorithm that uses a histogram-based algorithm and a leaf-wise growth strategy with depth limitation. A LightGBM model has been used to predict the financing risk profile of 186 enterprises, with comparison experiments against the k-nearest-neighbor, decision tree and random forest algorithms on the same dataset. The experiments show that LightGBM produces better predictions than the other three algorithms on several metrics for corporate financing risk prediction [38].
The reviewed literature shows that ensemble models perform better than individual models. However, little attention has been given to the voting ensemble model, a technique that combines classifiers from different machine learning algorithms and is worth further investigation. A general consensus for machine learning models, whether individual or ensemble, is to address data quality issues, handle imbalanced classes and tune hyperparameters in order to improve model performance.
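The class-imbalance handling mentioned above, SMOTE, can be illustrated in miniature: synthesize minority samples by interpolating between a minority point and one of its minority-class neighbours. Real work should use a maintained implementation such as imbalanced-learn; the function name `smote_like` and its parameters here are invented for this sketch.

```python
# Toy illustration of the SMOTE mechanism (not a production implementation).
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic samples from minority-class rows X_min."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # k nearest minority neighbours of sample i (excluding itself)
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]
        j = rng.choice(nn)
        # interpolate a random fraction of the way towards the neighbour
        lam = rng.random()
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.random.default_rng(1).normal(size=(20, 4))
X_new = smote_like(X_min, n_new=30)
print(X_new.shape)  # (30, 4)
```
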
have high predictive accuracy in assessing customer credit risk. Still, these innovative and advanced machine learning algorithms lack the transparency essential to comprehend why an individual's loan application was rejected or approved. The author also added that it is hard to trace the steps an algorithm took to arrive at its decision, as these models are built directly from the data by an algorithm. The lack of credibility, trust and explainability is the major challenge faced by many researchers when introducing machine-learning-based models to companies in the credit scoring field [41]. Thus, ‘black-box’ models are deemed less suitable in financial services due to their lack of interpretability. Even though machine learning models improve over time and generate excellent predictive results, many financial institutions are still reluctant to fully trust them.
Provenzano et al. [44] implemented the SHAP and LIME techniques to explain the predictions of a high-performing LightGBM classifier that obtains 95% accuracy in default classification. The authors stated that adopting SHAP and LIME helped in understanding the features important to an individual result, thereby increasing confidence in the model. Another study, by Visani et al. [45], compared a statistical model (Logistic Regression) against a machine learning model (Gradient Boosting Trees) on credit risk data, with LIME tested on the machine learning model to check its stability. Gradient Boosting outperformed Logistic Regression, and LIME proved a stable and reliable technique when applied to the machine learning model.
Hadji Misheva et al. [40] also adopted both XAI techniques, LIME and SHAP, in machine-learning-based credit scoring models on the Lending Club dataset. The models trained included logistic regression, XGBoost, Random Forest, SVM and neural networks. The authors implemented LIME, as shown in Fig. 2, to explain local instances of the SVM and tree-based models (XGBoost and Random Forest), whereas SHAP, as shown in Fig. 3, was used to obtain global explanations. The results imply that both LIME and SHAP offer reliable explanations in line with financial reasoning. The authors also note that SHAP is a powerful and effective technique for highlighting feature importance but can take a very long time to generate results. This is supported by Phaure and Robin [46], whose study of model explainability in credit risk management indicates that the computational time of the SHAP method grows with the number of features and observations and with the complexity of the model.
Fig. 2. XGBoost model with LIME explanation on a customer that classified as a ‘default’ loan
type [40]
Fig. 3. Summary plot - XGBoost model with SHAP tree explainer [40]
In short, introducing XAI techniques can improve the explainability and transparency of black-box models rather than relying solely on machine learning output for decision making. XAI will not only mitigate bias but also help establish trust and compliance with regulators in financial institutions, ensuring fairness in loan lending. Therefore, XAI techniques, specifically LIME, should be adopted to explain the credit decisions of black-box models.
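The LIME idea can be shown in miniature: perturb one applicant's features, score the perturbations with the black-box model, and fit a proximity-weighted linear surrogate whose coefficients explain that single decision. A real system should use the lime package; this pure-scikit-learn sketch only illustrates the mechanism, and the perturbation scale and kernel are assumptions.

```python
# Minimal LIME-style local surrogate around one instance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

x0 = X[0]                                             # instance to explain
rng = np.random.default_rng(0)
Z = x0 + rng.normal(scale=0.5, size=(500, x0.size))   # local perturbations
pz = black_box.predict_proba(Z)[:, 1]                 # black-box scores

# Weight perturbations by proximity to x0, then fit the linear surrogate.
w = np.exp(-np.linalg.norm(Z - x0, axis=1) ** 2)
surrogate = Ridge(alpha=1.0).fit(Z, pz, sample_weight=w)
print("local feature weights:", np.round(surrogate.coef_, 3))
```

The surrogate's coefficients play the role of LIME's per-feature explanation for this one decision.
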
Credit Scorecards. The banking industry uses credit scorecards as a risk management tool. A credit scorecard consists of a group of features widely used to predict default probabilities, for example to classify good and bad credit risks. Various techniques are used in scorecard development, including support vector machines, genetic programming, artificial neural networks, multiple classifier systems, hybrid models, logistic regression, classification trees, linear regression and linear programming [39, 47]. Moreover, Dong et al. [39] stipulated that generating credit scorecards contributes to effective credit risk management, and added that scorecard quality can be measured, for example with the Percentage Correctly Classified (PCC), which captures the accuracy of the prediction.
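PCC is simply the proportion of good/bad predictions the scorecard gets right; a minimal computation on invented labels:

```python
# Percentage Correctly Classified (PCC) on illustrative labels.
import numpy as np

actual    = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # 1 = bad credit, 0 = good
predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0])

pcc = (actual == predicted).mean() * 100
print(f"PCC: {pcc:.1f}%")  # PCC: 75.0%
```
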
3 Related Works
This section compares and analyzes different credit risk models and software that are fully developed and currently available in the market. Most credit risk models are marketed towards medium and large companies such as banks and enterprise creditors; their goal is to assist purchasing companies in determining the creditworthiness of potential borrowers and minimizing loan defaults. With timelier and more accurate predictions, lenders can use the generated results to negotiate with borrowers. As part of this research, three commercial systems are compared to understand their structures and functionalities: GiniMachine, ABLE Scoring and ZAML.
3.1 GiniMachine
GiniMachine is AI-driven credit scoring software that helps lenders make reliable credit decisions in a short amount of time; the GiniMachine logo can be seen
in Fig. 5 [49]. The system employs machine learning for automated decision-making
and is effective even for thin-file borrowers. Thus, banks and fintech companies can
identify bad loans and avoid unwanted risk without relying on traditional credit scoring
or error-prone manual work. For instance, because GiniMachine is based on AI
technologies, it can analyze parameters that traditional methods tend to ignore.
Furthermore, GiniMachine can easily adapt to changing environments, fitting into
specific businesses and risk assessment rules. For example, if a company releases a new
loan product, the system can process the information about the new product and adjust
to the needs of the lenders accordingly. The system also generates detailed reports, as
shown in Fig. 6, that contain statistical calculations regarding the decisions made by
the model. Moreover, the system is designed for non-technical individuals, so no
specific training is required to operate it.
3.2 ABLE Scoring
ABLE Scoring is another powerful credit scoring software that assists in making credit
decisions to prevent bad loans; the logo of ABLE Scoring can be seen in Fig. 7.
Scorecards along with credit decisions can be easily generated via the scorecard builder,
as shown in Fig. 8. Moreover, ABLE Scoring allows lenders to score potential borrowers
in batches, which saves a lot of time. Different machine learning models can be built,
including the classical logistic regression model, and the models can be compared and
evaluated in terms of performance and stability. Furthermore, the credit decision is
explained in the generated scorecards, as shown in Fig. 9, which helps lenders better
understand the output of the machine learning model and eliminates any doubts. The
software also checks data formats, consistency and missing values to ensure the data is
of high quality. It is easy to use without any specific training; users only need to upload
a file in XLS format to generate a scorecard report. ABLE Scoring promotes fast and
smart credit decisions based on AI models and ensures a stable and high-quality lending
process. The software is trusted by banks and fintech companies such as Eurasian Bank,
OTP Bank and Alfa Bank [50].
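Batch scoring of the kind described can be sketched with pandas and a logistic-regression scorecard (hypothetical column names and toy values; in practice the borrower batch would be loaded with `pd.read_excel`):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Training data a lender might hold (toy values; a real batch would come
# from an uploaded spreadsheet, e.g. pd.read_excel("borrowers.xls")).
train = pd.DataFrame({
    "income":     [55, 20, 80, 30, 65, 25, 90, 15],
    "debt_ratio": [0.2, 0.8, 0.1, 0.6, 0.3, 0.9, 0.2, 0.7],
    "defaulted":  [0, 1, 0, 1, 0, 1, 0, 1],
})
model = LogisticRegression().fit(train[["income", "debt_ratio"]], train["defaulted"])

# Score a whole batch of applicants at once.
applicants = pd.DataFrame({"income": [70, 18], "debt_ratio": [0.25, 0.85]})
applicants["default_prob"] = model.predict_proba(applicants)[:, 1]
print(applicants)
```

The scored frame can then be written back out as the kind of scorecard report the commercial tools generate.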
3.3 Zest AI
Zest AI is yet another robust machine learning software that assists lenders and under-
writers in making better, timelier and more transparent credit decisions. The logo of Zest
AI is shown in Fig. 10. Zest AI also aims to address the problems of traditional credit
scoring tools, such as gaps, errors or structural inequities that lead to the rejection of
good applicants [52]. With Zest AI, lenders can easily identify good borrowers and safely
increase loan approvals while minimizing risk and losses. Besides, Zest AI provides
a bigger picture of every borrower with full interpretability to comply with the strictest
regulators and satisfy doubters [51]. For example, custom-built logistic regression
scorecards in Zest AI are used to assess the creditworthiness of borrowers and support
lenders in their decision making. Figure 11 shows a sample of the scorecards generated
with Zest AI.
Most importantly, it is a stable software that offers rapid analysis to help lenders
make quick business decisions and ensure fairness in lending operations, which can
potentially improve customer experience and make a positive impact on lending
businesses. Furthermore, software owners can rest assured, as Zest AI offers a smooth
transition from traditional credit scoring tools with professional support. In addition,
the software is user-friendly and can be operated by non-technical staff without prior
machine learning background. Zest AI is recognized by one of the largest banks in
Turkey, Akbank, which has found the software extremely effective in identifying good
borrowers with minimal risk. Akbank managed to reduce non-performing loans by 45%
and cut the time needed to retrain and build models, a process that initially took seven
months [53]. Besides, Zest AI can adapt to changing requirements, which further
increases the confidence of its clients. Thus, the adoption of Zest AI can promote
sustainable growth among banks and other financial institutions in their lending
businesses.
Comparisons between the related works are essential to understand the attributes
of fully developed credit scoring systems. Moreover, new ideas and opportunities
can be triggered by analyzing the existing systems, which will benefit future research.
Table 1 shows the comparisons between different credit risk systems that are currently
available in the market.
Based on the analysis conducted, all the systems are built to ensure faster, fairer, and
higher-quality loan lending. This is because their target users are mostly banks and other
financial institutions whose primary goal is to mitigate credit risk and avoid bad loans.
The systems are also user-friendly, allowing non-technical staff to operate them without
much training. Moreover, it is also important for the output to be transparent to comply
with regulators. However, it is noted that all three systems focus solely on predicting
the output and have no dashboard to visualize the trends of loan customers. This is an
opportunity for developers to include in the web application a dashboard that visualizes
the trends of loan customers.
References
1. Hani, A.: Credit cards, personal loans landing Malaysians in debt trap (2019). https://the
malaysianreserve.com/2019/08/08/credit-cards-personal-loans-landing-malaysians-in-debt-
trap/
2. Bank Negara Malaysia, Monthly Highlights and Statistics in July 2021 (2021). https://www.
bnm.gov.my/-/monthly-highlights-and-statistics-in-july-2021
3. Brock, Credit Risk (2021). https://www.investopedia.com/terms/c/creditrisk.asp
4. Goyal, K.A., Agrawal, S.: Risk management in Indian banks: some emerging issues. Int. J.
Econ. Res. 1(1), 102–109 (2010)
5. Chenghua, S., Kui, Z.: Study on commercial bank credit risk based on information asymmetry.
In: 2009 International Conference on Business Intelligence and Financial Engineering, BIFE
2009, pp. 758–761 (2009). https://doi.org/10.1109/BIFE.2009.175
6. Li, H., Pang, S.: The study of credit risk evaluation based on DEA method. In: Proceedings
of the 2010 International Conference on Computational Intelligence and Security, CIS 2010,
pp. 81–85 (2010). https://doi.org/10.1109/CIS.2010.25
7. Rehman, Z.U., Muhammad, N., Sarwar, B., Raz, M.A.: Impact of risk management strategies
on the credit risk faced by commercial banks of Balochistan. Financ. Innov. 5(1) (2019).
https://doi.org/10.1186/s40854-019-0159-8
8. Kolapo, T.F., Ayeni, R.K., Oke, M.O.: Credit risk and commercial banks’ performance in
Nigeria: a panel model approach. Aust. J. Bus. Manag. Res. 2(02), 31–38 (2012)
9. Chen, K.-C., Pan, C.-Y.: An empirical study of credit risk efficiency of banking industry in
Taiwan. Web J. Chin. Manag. Rev. 15(1), 1–17 (2012). http://cmr.ba.ouhk.edu.hk
10. Waemustafa, W., Sukri, S.: Bank specific and macroeconomics dynamic determinants of credit
risk in Islamic banks and conventional banks. Int. J. Econ. Financ. Issues 5(2), 476–481 (2015).
https://doi.org/10.6084/m9.figshare.4042992
11. Bhattarai, Y.R.: The effect of credit risk on Nepalese commercial banks. NRB Econ. Rev.
28(1), 41–64 (2016). https://nrb.org.np/ecorev/articles/vol28-1_art3.pdf
12. Pradhan, S., Shah, A.K.: Credit risk management of commercial banks in Nepal. J. Bus. Soc.
Sci. Res. 4(1), 27–37 (2019). https://doi.org/10.3126/jbssr.v4i1.28996
13. Basel Committee on Banking Supervision: An Explanatory Note on the Basel II IRB Risk
Weight Functions (2005). www.bis.org/bcbs/irbriskweight.pdf
14. Psillaki, M., Tsolas, I.E., Margaritis, D.: Evaluation of credit risk based on firm performance.
Eur. J. Oper. Res. 201(3), 873–881 (2010). https://doi.org/10.1016/j.ejor.2009.03.032
15. Lai, L.: Loan default prediction with machine learning techniques. In: Proceedings of the
2020 International Conference on Computer Communication and Network Security, CCNS
2020, pp. 5–9 (2020). https://doi.org/10.1109/CCNS50731.2020.00009
16. Addo, P.M., Guegan, D., Hassani, B.: Credit risk analysis using machine and deep learning
models. SSRN Electron. J. (2018). https://doi.org/10.2139/ssrn.3155047
17. Tsai, C.F., Wu, J.W.: Using neural network ensembles for bankruptcy prediction and credit
scoring. Expert Syst. Appl. 34(4), 2639–2649 (2008). https://doi.org/10.1016/j.eswa.2007.
05.019
18. Chen, M., Huang, S.: Credit scoring and rejected instances reassigning through evolutionary
computation techniques. Expert Syst. Appl. 24(4), 433–441 (2003). https://doi.org/10.1016/
S0957-4174(02)00191-4
19. Schreiner, M.: Scoring: the next breakthrough in microcredit. In: CGAP, no. 7, pp. 1–64
(2003)
20. Vidal, M.F., Barbon, F.: Credit Scoring in Financial Inclusion. CGAP, July 2019
21. Eddy, Y.L., Engku Abu Bakar, E.M.N.: Credit scoring models: techniques and issues. J. Adv.
Res. Bus. Manag. Stud. 7(2), 29–41 (2017). https://www.akademiabaru.com/submit/index.
php/arbms/article/view/1240
22. Bellotti, T., Crook, J.: Support vector machines for credit scoring and discovery of significant
features. Expert Syst. Appl. 36(2), 3302–3308 (2009). https://doi.org/10.1016/j.eswa.2008.
01.005
23. Memic, D.: Assessing credit default using logistic regression and multiple discriminant anal-
ysis: empirical evidence from Bosnia and Herzegovina. Interdiscip. Descr. Complex Syst.
13(1), 128–153 (2015). https://doi.org/10.7906/indecs.13.1.13
24. Obare, D.M., Njoroge, G.G., Muraya, M.M.: Analysis of individual loan defaults using logit
under supervised machine learning approach. Asian J. Probab. Stat. 3(4), 1–12 (2019). https://
doi.org/10.9734/ajpas/2019/v3i430100
25. Foo, L.K., Chua, S.L., Chin, D., Firdaus, M.K.: Logistic regression models for Malaysian
housing loan default prediction (2017). https://www.bnm.gov.my/documents/20124/826852/
WP11+-+Logistic+Regression.pdf/d22ef5a2-4bdb-4d39-28f3-c19253d2814e?t=158503059
9211
26. Lee, Y.C.: Application of support vector machines to corporate credit rating prediction. Expert
Syst. Appl. 33(1), 67–74 (2007). https://doi.org/10.1016/j.eswa.2006.04.018
27. Byanjankar, A., Heikkila, M., Mezei, J.: Predicting credit risk in peer-to-peer lending: a neural
network approach. In: Proceedings of the 2015 IEEE Symposium Series on Computational
Intelligence SSCI 2015, pp. 719–725 (2015). https://doi.org/10.1109/SSCI.2015.109
28. Bae, J.K., Lee, S.Y., Seo, H.J.: Predicting online peer-to-peer (P2P) lending default using data
mining techniques. J. Soc. E-bus. Stud. 23(3), 1–6 (2018)
29. Chandra Blessie, E., Rekha, R.: Exploring the machine learning algorithm for prediction the
loan sanctioning process. Int. J. Innov. Technol. Explor. Eng. 9(1), 2714–2719 (2019). https://
doi.org/10.35940/ijitee.A4881.119119
30. Victor, L., Raheem, M.: Loan default prediction using genetic algorithm: a study within
peer-to-peer lending communities. Int. J. Innov. Sci. Res. Technol. 6(3) (2021). ISSN No.
2456-2165
31. Yao, P.: Credit scoring using ensemble machine learning. In: Proceedings of the 2009 9th
International Conference on Hybrid Intelligent Systems, HIS 2009, vol. 3, pp. 244–246 (2009).
https://doi.org/10.1109/HIS.2009.264
32. Xu, W., Zhou, S., Duan, D., Chen, Y.: A support vector machine based method for credit risk
assessment. In: Proceedings of the IEEE International Conference on e-Business Engineering,
ICEBE 2010, pp. 50–55 (2010). https://doi.org/10.1109/ICEBE.2010.44
33. Madaan, M., Kumar, A., Keshri, C., Jain, R., Nagrath, P.: Loan default prediction using
decision trees and random forest: a comparative study. IOP Conf. Ser. Mater. Sci. Eng. 1022(1),
1–12 (2021). https://doi.org/10.1088/1757-899X/1022/1/012042
34. Zhu, L., Qiu, D., Ergu, D., Ying, C., Liu, K.: A study on predicting loan default based on the
random forest algorithm. Procedia Comput. Sci. 162, 503–513 (2019). https://doi.org/
10.1016/j.procs.2019.12.017
35. Li, Z., Li, S., Li, Z., Hu, Y., Gao, H.: Application of XGBoost in P2P default prediction. J.
Phys. Conf. Ser. 1871(1), 1 (2021). https://doi.org/10.1088/1742-6596/1871/1/012115
36. Zhao, J., Wu, Z., Wu, B.: An AdaBoost-DT model for credit scoring. In: WHICEB 2021
Proceedings, vol. 15 (2021)
37. Udaya Bhanu, L., Narayana, D.S.: Customer loan prediction using supervised learning tech-
nique. Int. J. Sci. Res. Publ. 11(6), 403–407 (2021). https://doi.org/10.29322/ijsrp.11.06.2021.
p11453
38. Wang, D.N., Li, L., Zhao, D.: Corporate finance risk prediction based on LightGBM. Inf.
Sci., 602, 259–268 (2022). ISSN 0020-0255. https://doi.org/10.1016/j.ins.2022.04.058
39. Dong, G., Lai, K.K., Yen, J.: Credit scorecard based on logistic regression with random
coefficients. Procedia Comput. Sci. 1(1), 2463–2468 (2010). https://doi.org/10.1016/j.procs.
2010.04.278
40. Hadji Misheva, B., Hirsa, A., Osterrieder, J., Kulkarni, O., Fung Lin, S.: Explainable AI in
credit risk management. SSRN Electron. J., 1–16 (2021). https://doi.org/10.2139/ssrn.379
5322
41. El Qadi, A., Diaz-Rodriguez, N., Trocan, M., Frossard, T.: Explaining credit risk scoring
through feature contribution alignment with expert risk analysts, pp. 1–12 (2021). http://
arxiv.org/abs/2103.08359
42. Wijnands, M.: Explaining black box decision-making. University of Twente (2021)
43. Confalonieri, R., Coba, L., Wagner, B., Besold, T.R.: A historical perspective of explainable
Artificial Intelligence. In: Wiley Interdisciplinary Reviews: Data Mining and Knowledge
Discovery, vol. 11, no. 1, pp. 1–21 (2021). https://doi.org/10.1002/widm.1391
44. Provenzano, A.R., et al.: Machine Learning approach for Credit Scoring (2020). http://arxiv.
org/abs/2008.01687
45. Visani, G., Bagli, E., Chesani, F., Poluzzi, A., Capuzzo, D.: Statistical stability indices for
LIME: obtaining reliable explanations for machine learning models. J. Oper. Res. Soc., 1–18
(2020). https://doi.org/10.1080/01605682.2020.1865846
46. Phaure, H., Robin, E.: Explain artificial intelligence for credit risk management. Deloitte,
April 2020
47. Bequé, A., Coussement, K., Gayler, R., Lessmann, S.: Approaches for credit scorecard cali-
bration: an empirical analysis. Knowl. Based Syst. 134, 213–227 (2017). https://doi.org/10.
1016/j.knosys.2017.07.034
48. Siddiqi, N.: Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring.
Wiley, Hoboken (2006)
49. GiniMachine: Credit Scoring Software (2021). https://ginimachine.com/risk-management/
credit-scoring/. Accessed 13 Oct 2021
50. RND Point: Credit Scoring Software (2021). https://rndpoint.com/solutions/able-scoring/.
Accessed 25 Oct 2021
51. Zest AI: Zest AI (2021). https://www.zest.ai/. Accessed 25 Oct 2021
52. Upbin, B.: ZAML Fair - Our New AI To Reduce Bias in Lending (2019). https://www.zest.
ai/insights/zaml-fair-our-new-ai-to-reduce-bias-in-lending. Accessed 25 Oct 2021
53. Zest AI: Model Management System (2021). https://www.zest.ai/product. Accessed 25 Oct
2021
On the Application of Multidimensional LSTM
Networks to Forecast Quarterly Reports
Financial Statements
1 Introduction
Financial prognostics are an emerging tool in managing corporate finances. Since fore-
casting allows reasonable and scientific prediction of future events, it makes it possible
to take decisions (e.g. on investments) with the potential for precise evaluation of their
effect on the financial situation of the enterprise. This is crucial in economic realities in
which practice increasingly relies on data from financial reporting and financial analysis
when making decisions. The functional importance of forecasting is pointed out, among
others, by Hedayati Moghaddam and Esfandyari et al. [1–4]. Preparing a financial
prediction involves tolerating the uncertainty it brings with it. However, although
drafting an accurate forecast is exceptionally complex (due to random factors that
cannot be taken into account at the preparation stage), a forecast enables a detailed
analysis of the company’s future financial situation and accounts for elements that are
not considered during typical financial analysis. Thus, it allows a more specific
determination of threats and opportunities in the financial domain of the company’s
operation. It seems, therefore, that financial forecasting will be used more and more in
business practice, especially by entities operating in international or global markets, as
a supplement to traditional methods of financial analysis. Indeed, the fundamental
methods of financial analysis need to be enhanced. In sum, there is a pressing need to
develop practical methods for relatively simple projection of a company’s financial
situation at least more than one financial statement ahead, which is not straightforward
due to the difficulty of forecasting methods and the presence of contingency factors that
are unpredictable when building a forecast.
In summary, the following market gap has been identified: the lack of a tool for auto-
matic analysis and prognosis of financial report data. Among the most widely used
deep learning methods in financial time series forecasting are methods based on the long
short-term memory (LSTM) neural network structure [5–9]. This research applies such
a multidimensional deep learning structure to the problem of forecasting the values of
three indicators included in quarterly financial reports: cash flow from operating
activities, cash flow from investing activities and cash flow from financing activities.
A simple (single-input, single-output) LSTM network structure for predicting other
financial statement items has been studied earlier in our work [19]. The idea itself is not
new; e.g., in [3] deep learning is analyzed in relation to a multi-agent stock trading system.
2 Contribution
The purpose of the study is to prepare and analyze forecasts of coupled financial
reporting, sales revenue with operating result, for a selected group of companies which
represent various types of companies with different characteristics, traded on the Warsaw
Stock Exchange. To reach the objectives, the following research hypothesis was stated:
coupled financial forecasting improves the classical forecasting of the financial state
of a company, and is therefore an effective support for the investment decisions of
individual investors on the Stock Exchange. The study used the authors’ choice
of companies, divided into industrial, financial and service-oriented ones. It has been
proposed to categorize companies into four types, which we name Cows, Stars,
Phoenixes and Zombies. The survey period covered the years 2008 to 2018, with
forecasts as well as factors in 2019.
The analysis used Matlab’s Deep Learning Toolbox [10] for time series forecasting
using deep learning. Specifically, time series forecasting with long short-term memory
(LSTM) networks was used under two structures: a single-input single-output network
and a multi-input multi-output network.
On the Application of Multidimensional LSTM Networks 617
Nine companies were selected for analysis, each representing a different group in the
WSE-CTM matrix: Amica, Sniezka, PZU, CD Projekt, TSGames, Quercus, LiveChat,
PBG, GetBack. In the analysis the companies were anonymized; the WSE-CTM matrix
is presented in Table 1.
Cash flow from operating activities, cash flow from investing activities and cash flow
from financing activities have been extracted from quarterly financial reports. These are
statements no. 10, 11 and 12 in the standardized reporting format on the WSE.
4 Methodology
An LSTM-type network is developed to forecast future index values from historical
data. LSTM networks are a type of recurrent neural network (RNN) designed to avoid
the problem of long-term dependencies: each neuron carries a memory cell that can
store prior knowledge used by the RNN, or forget it if needed [12]. They are currently
widely and successfully used in time series prediction problems [13–16]. The LSTM-
RNN is designed to have a memory cell that stores long-term interrelationships. In
addition to the memory cell, the LSTM cell contains an input gate, an output gate and a
forget gate. Each gate takes the current input, the hidden state of the previous time step,
and the state of the cell’s internal memory to perform certain operations and determine
whether to activate the activation function. To forecast the values of upcoming time
steps, the responses are the training sequences with values shifted by one time step; in
other words, at each step of the input sequence, the LSTM network is taught to predict
the value of the next time step. To forecast multiple future time steps, the toolbox
function predictAndUpdateState was used, which forecasts the time steps one at a time
and updates the state of the network at each forecast. In the M-LSTM structure (Fig. 1),
the dimension of the input and output signals (vectors) is in our case three, so the
additional information affects both the predictions and the predictions with update
results.
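The scheme described above, training on targets shifted by one time step and then forecasting step by step while carrying the network state forward (what MATLAB's predictAndUpdateState does), can be sketched in PyTorch (an illustrative toy series and hyperparameters, not the authors' MATLAB configuration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy quarterly series, already standardized.
series = torch.sin(torch.linspace(0, 12, 40)).unsqueeze(-1)  # (T, 1)
x, y = series[:-1], series[1:]          # targets are inputs shifted by one step

class Forecaster(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, inp, state=None):
        out, state = self.lstm(inp, state)
        return self.head(out), state

model = Forecaster()
opt = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(300):                     # teach one-step-ahead prediction
    opt.zero_grad()
    pred, _ = model(x.unsqueeze(1))      # shape (T-1, batch=1, 1)
    loss = nn.functional.mse_loss(pred, y.unsqueeze(1))
    loss.backward()
    opt.step()

# Closed-loop multi-step forecast: feed each prediction back in and
# carry the hidden state forward (the predict-and-update-state idea).
with torch.no_grad():
    _, state = model(series.unsqueeze(1))   # warm up the state on history
    step = series[-1].reshape(1, 1, 1)
    forecast = []
    for _ in range(8):                      # e.g. eight quarters ahead
        step, state = model(step, state)
        forecast.append(step.item())
print(forecast)
```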
To benchmark the effectiveness of the forecasts, the root mean square error (RMSE)
was calculated from the standardized data:

RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_{test,i} - X_{pred,i})^2},   (1)
where X_{test,i} are the test values and X_{pred,i} the predicted values at time step i. In
order to compare forecast quality, the RMSE was normalized by the difference between
the maximum and minimum of the test data:

NRMSE_1 = RMSE / (X_{test,max} - X_{test,min})   (2)
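Equations (1) and (2) translate directly into code (illustrative values):

```python
import numpy as np

def rmse(x_test, x_pred):
    """Root mean square error, Eq. (1)."""
    x_test, x_pred = np.asarray(x_test, float), np.asarray(x_pred, float)
    return np.sqrt(np.mean((x_test - x_pred) ** 2))

def nrmse(x_test, x_pred):
    """RMSE normalized by the range of the test data, Eq. (2)."""
    x_test = np.asarray(x_test, float)
    return rmse(x_test, x_pred) / (x_test.max() - x_test.min())

test = [1.0, 3.0, 2.0, 5.0]
pred = [1.5, 2.5, 2.0, 4.0]
print(rmse(test, pred), nrmse(test, pred))  # ~0.612 and ~0.153
```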
5 Results
5.1 Example of Forecasts for Industry Companies for M-LSTM (Three
Input-Three Output M-LSTM Structure)
Example 1. The illustrative data file contains a single time series with time steps rep-
resenting quarters and values corresponding to financial report items, here cash flows
from investing activities. The resulting data is an array of cells in which each element
is a single time step. In Fig. 2, cash flow data from investing activities and data with a
forecast are plotted.
Fig. 2. Cash flow from investing activities data (left) and data with forecast (right).
Fig. 3. Cash flow from investing activities forecast with NRMSE (left) and FORECAST WITH
UPDATES with NRMSE (right).
Figure 3 (left) illustrates the forecasts against the actually observed data values, accompa-
nied by the prediction error. If the actually recorded indicator values are already known,
the state of the network can be updated with the observed values instead of the predicted
values, thereby confronting the learning results with the actual observations (the
forecast-with-updates technique). The results are shown in Fig. 3
620 A. Gałuszka et al.
(right). In this case, updating with the observations significantly (though not in a
statistical sense) improves the forecasting result.
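Why updating the state with observations helps can be seen on a tiny autoregressive toy (a hypothetical AR(1) series, with a known-coefficient predictor standing in for the trained network):

```python
import numpy as np

rng = np.random.default_rng(1)

# AR(1) toy series: x[t] = 0.8 * x[t-1] + noise.
n = 200
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + rng.normal(scale=0.5)

history, future = x[:100], x[100:]

# Open loop: each prediction is fed back in as the next input.
pred_open = []
last = history[-1]
for _ in range(len(future)):
    last = 0.8 * last
    pred_open.append(last)

# With updates: after each step the true observation replaces the prediction.
pred_updated = [0.8 * v for v in np.concatenate([[history[-1]], future[:-1]])]

err_open = np.sqrt(np.mean((future - pred_open) ** 2))
err_updated = np.sqrt(np.mean((future - pred_updated) ** 2))
print(err_open, err_updated)   # updating with observations gives the lower error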
Example 2. The exemplary data file contains a single time series with time steps
corresponding to quarters and values representing a financial statement item, here cash
flows from financing activities. The output is an array of cells in which each element is
a single time step. In Fig. 4, cash flow data from financing activities and data with a
forecast are shown.
Fig. 4. Cash flow from financing activities data (left) and data with forecast (right).
Figure 5 (left) shows the forecasts for the observed data values along with the
forecast error. If the actual values of the forecast indicators are available, the state of the
network can be updated with the observed values instead of the forecasted ones, thus
confronting the learning results with the actual observations. The findings are shown in
Fig. 5 (right). In this case, the observation update significantly (though not in a statistical
sense) decreases the forecasting error.
Fig. 5. Cash flow from financing activities forecast with NRMSE (left) and forecast with updates
with NRMSE (right).
Example 3. The sample data file contains a single time series with time steps
corresponding to quarters and values corresponding to a financial report item, here cash
flow from operating activities. The output is an array of cells in which each element is
a single time step. Figure 6 shows the operating cash flow and forecast data.
Fig. 6. Cash flow from operating activities data (left) and data with forecast (right).
Figure 7 (left) shows the forecasts for the observed data values along with the
forecast error. If the actual values of the forecast indicators are available, the state of the
network can be updated with the observed values instead of the forecasted ones, thus
confronting the learning results with the actual observations. The findings are shown in
Fig. 7 (right). In this case, the observation update significantly (though not in a statistical
sense) decreases the forecasting error.
Fig. 7. Cash flow from operating activities forecast with NRMSE (left) and forecast with updates
with NRMSE (right).
the market situation). The forecast result for the analyzed items, i.e. cash flow from
investing activities and cash flow from financing activities, should be considered rather
as an assisting tool for automatically flagging possible unusual fluctuations in reported
financial data than as an efficient prognostic instrument [17, 18].
Acknowledgments. We would like to thank the stock exchange experts for their critical com-
ments. The work has been funded by GPW Data Grant No. POIR.01.01.01-00-0162/19 in 2021.
The work of Adam Gałuszka was supported in part by the Silesian University of Technology (SUT)
through the subsidy for maintaining and developing the research potential grant in 2022. The work
of Eryka Probierz was supported in part by the European Union through the European Social Fund
as a scholarship under Grant POWR.03.02.00-00-I029, and in part by the Silesian University of
Technology (SUT) through the subsidy for maintaining and developing the research potential grant
in 2022 for young researchers in analysis. This work was supported by Upper Silesian Centre for
Computational Science and Engineering (GeCONiI) through The National Centre for Research
and Development (NCBiR) under Grant POIG.02.03.01-24-099/13. The work of Karol J˛edrasiak
and Aleksander Nawrat has been supported by National Centre for Research and Development as
a project ID: DOB-BIO10/19/02/2020 “Development of a modern patient management model in
a life-threatening condition based on self-learning algorithmization of decision-making processes
and analysis of data from therapeutic processes”.
References
1. Hedayati Moghaddam, A., Hedayati Moghaddam, M., Esfandyari, M.: Stock market index
prediction using artificial neural network. J. Econ. Finance Adm. Sci. 21, 89–93 (2016)
2. Kyoung-jae, K.: Financial time series forecasting using support vector machines. Neurocom-
puting 55(1–2), 307–319 (2003)
3. Korczak, J., Hernes, M.: Deep learning for financial time series forecasting in a-trader system.
In: 2017 Federated Conference on Computer Science and Information Systems (FedCSIS),
Prague, pp. 905–912 (2017)
4. Franc-D˛abrowska, J., Zbrowska, M.: Prognozowanie finansowe dla spółki X – spółka
logistyczna. Zeszyty Naukowe SGGW w Warszawie. Ekonomika i Organizacja Gospodarki
Żywnościowej 64, 251–270 (2008). (in Polish)
5. Chen, K., Zhou, Y., Dai, F.: A LSTM-based method for stock returns prediction: a case study
of China stock market. In: 2015 IEEE International Conference on Big Data (Big Data),
pp. 2823–2824 (2015). https://doi.org/10.1109/BigData.2015.7364089
6. Zhao, Z., Rao, R., Tu, S., Shi, J.: Time-weighted LSTM model with redefined labeling for
stock trend prediction. In: 2017 IEEE 29th International Conference on Tools with Artificial
Intelligence (ICTAI), pp. 1210–1217 (2017). https://doi.org/10.1109/ICTAI.2017.0018
7. Roondiwala, M., Patel, H., Varma, S.: Predicting stock prices using LSTM. Int. J. Sci. Res.
(IJSR) 6 (2017). https://doi.org/10.21275/ART20172755
8. Qiu, J., Wang, B., Zhou, C.: Forecasting stock prices with long-short term memory neural
network based on attention mechanism. PLoS ONE 15(1) (2020). https://doi.org/10.1371/jou
rnal.pone.0227222
9. Fischer, T., Krauss, C.: Deep learning with long short-term memory networks for financial
market predictions. Eur. J. Oper. Res. 270(2), 654–669 (2018). https://EconPapers.repec.org/
RePEc:eee:ejores:v:270:y:2018:i:2:p:654-669
10. www.mathworks.com
11. BCG Matrix (2021). http://www.netmba.com/strategy/matrix/bcg/
12. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780
(1997)
13. Elsaraiti, M., Merabet, A.: Application of long-short-term-memory recurrent neural networks
to forecast wind speed. Appl. Sci. 11, 2387 (2021). https://doi.org/10.3390/app11052387
14. Shumway, R.H., Stoffer, D.S.: Time Series Analysis and its Applications: With R Examples,
4th edn. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-52452-8
15. Huang, J., Chai, J., Cho, S.L.: Deep learning in finance and banking: a literature review and
classification. Front. Bus. Res. China 14, 13 (2020). https://doi.org/10.1186/s11782-020-000
82-6
16. Gałuszka, A., Pacholczyk, M., Bereska, D., Skrzypczyk, K.: Planning as artificial intelligence
problem - short introduction and overview. In: Nawrat, A., Simek, K., Świerniak, A. (eds.)
Advanced Technologies for Intelligent Systems of National Border Security. SCI, vol. 440,
pp. 95–103. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-31665-4_8
17. Roondiwala, M., Patel, H., Varma, S.: Financial time series forecasting with deep learning: a
systematic literature review: 2005–2019 (2020)
18. Sezer, O.B., Gudelek, M.U., Ozbayoglu, A.M.: Financial time series forecasting with deep
learning: a systematic literature review: 2005–2019. Appl. Soft Comput. J. 90, Article no.
106181 (2020)
19. Gałuszka, A., Probierz, E., Olczyk, A., Kocerka, J., Klimczak, K., Wisniewski, T.: The appli-
cation of SISO LSTM networks to forecast selected items in financial quarterly reports - case
study. In: Gervasi, O., Murgante, B., Misra, S., Rocha, A.M.A.C., Garau, C. (eds.) Compu-
tational Science and Its Applications - ICCSA 2022 Workshops, Malaga, Spain, July 4–7,
Proceedings pt 5, pp. 605–616 (2022)
Utilizing Machine Learning to Predict Breast
Cancer: One Step Closer to Bridging the Gap
Between the Nature Versus Nurture Debate
Abstract. In the fight against breast cancer, scientists have been trying to find the most effective solutions and treatments, and studies on genes through machine learning have been conducted. By identifying the factor that influences breast cancer the most, this knowledge can be used to prevent or treat patients with breast cancer appropriately. Furthermore, the result of this experiment extends to the debate of nature versus nurture. If the result concludes that Only Gene or Only Mutation has a stronger effect on tumors, then it weighs nature more heavily in this debate; likewise, if Only Others is the dominant factor, then it emphasizes nurture. The gathered data was processed and run through eight different machine learning algorithms to predict the tumor size and stage. "Others" was concluded to be the most influential factor for the tumor. Among the "Others" factors, the type of breast surgery and the chemotherapy received were identified as having the highest correlation with tumor size and stage. In conclusion, this solidifies nurture's stance in the debate. Data on external effects and the use of a further developed machine learning model could improve the experiment by increasing the accuracy of the result.
1 Introduction
1.1 Background
Cancer is an abnormal growth of cells due to a mutation caused by DNA alteration and/or
exterior factors [1]. It is “the leading cause of death worldwide” in 2020, and there were
more than 10 million confirmed cases of cancer [2]. Breast cancer is one of the most
widespread and deadliest cancers [2]. There are different types of breast cancer: Ductal
Carcinoma In Situ (DCIS), Invasive Ductal Carcinoma (IDC), Lobular Carcinoma In
Situ (LCIS), Invasive Lobular Cancer (ILC), Triple Negative Breast Cancer, Inflamma-
tory Breast Cancer (IBC), Metastatic Breast Cancer, and other less frequent types [3].
Possible symptoms of breast cancer are "new lump in the breast or underarm (armpit)[, t]hickening or swelling of part of the breast[, and i]rritation or dimpling of the breast skin" [4]. Up to 110 genes are related to breast cancer, and mutations in BRCA1 and
BRCA2 are known to have significant effects on the risk of getting breast cancer [5].
Machine learning is a technique that lets machines train and learn from information in ways similar to humans, making predictions from the data [6]. Recently, machine learning has been used in genomic predictions, and it
can “adapt to complicated associations between data and the output [and] adapt to very
complex patterns” [7]. Also, it can “help us identify underlying genetic factors for certain
diseases by looking for genetic patterns amongst people with similar medical issues”
[8]. Machine learning has identified 165 new cancer genes in a recent study [9].
Fig. 1. Estimated number of new cases in 2020, worldwide, females, all ages
Figure 1 identifies Breast Cancer as the most detected cancer worldwide for women in
2020 [10]. Moreover, Fig. 2 indicates an increased probability for an ordinary US woman
to be diagnosed with invasive breast cancer. This shows the importance of analyzing the
key factors of breast cancer. If those factors were identified, they could help many women prevent cancer.
Fig. 2. Increased percentage of a woman’s lifetime risk of being diagnosed with invasive breast
cancer in the United States
1.2 Objective
While there have been significant advancements in developing treatments and cures, the majority of experts suggest that it is still important to detect the tumor at the earliest stage possible. This study aims to determine which factor among the number of genes, the presence of mutation, or the other external factors influences tumor size and stage the most. After concluding the most influential factor, this knowledge can be used for prevention. If genes or their mutations were concluded to be the most influential factor, then this data would be crucial for prevention. Some tumors might be too small at the moment to be noticed; for example, solid tumors are only detected through imaging when "approximately 10^9 cells [are] growing as a single mass" [11]. Thus, people need to wait until the tumor grows to that point. However, if doctors see abnormal activity in that certain factor, then they can stay alert and take precautions. Moreover, early detection of breast cancer is crucial because it can lead to an "increased number of available treatment options, increased survival, and improved quality of life" [12]. If "Others", the exterior factors, were concluded to be the most influential factor among the three, then the appropriate treatments could be given properly. For example, if radiotherapy, one factor of the "Others", had the highest inverse correlation with tumor size and stage, then radiotherapy could be actively utilized to treat patients with breast cancer.
628 J. Park and M. Kim
Existing research did not consider the exterior factors, and its accuracies were based only on one factor such as a particular gene. On the other hand, our experiment focused not only on the normal genes but also on the mutated genes, which increases the accuracy of the genetic datasets. Moreover, the proposed study included exterior factors such as whether the patients had received chemotherapy or not. Furthermore, accuracies were tested with multiple factors. For instance, the data for mutation and exterior factors were combined and then tested for accuracy. This is crucial because it lets us determine in more detail which factors influenced the result the most. In addition, this research predicted tumor size and stage as well, which is more detailed compared to other research that only focused on the presence of a relationship between a gene and breast cancer. Last but not least, nature versus nurture was debated in this article, and the gap between nature and nurture in this debate has been bridged via machine learning algorithms.
2 Literature Review
Urda et al. used three free-public RNA-Seq expression datasets from The Cancer Genome
Atlas website. The datasets are linked to BRCA, COAD, and KIPAN genes. Those
databases are analyzed using a standard Least Absolute Shrinkage and Selection Operator
(LASSO) as a baseline model. DeepNet(i) and DeepNet(ii) are used for the application
of the deep neural net model for analysis. The results suggest that the straightforward applications of deep nets described in that work are not enough to outperform simpler models like LASSO. They found that the deep learning processes took more time to create models than LASSO, and concluded that using a simple feature selection procedure to reduce the genes and later fitting a deep learning model takes much more processing time to achieve similar predictive performance [13].
Castillo et al. obtained the data from different cancer breast datasets in the National
Center for Biotechnology Information Gene Expression Omnibus (NCBI GEO) web plat-
form. The data were obtained from three datasets: RNA-seq, Microarray, and integrated.
First, they used Train-Test split to obtain the level of gene expression in RNA-seq and
Microarray. To test the accuracy of the data obtained through RNA-seq and Microar-
ray, they used Support Vector Machine (SVM), Random Forest (RF), and K-Nearest
Neighbors (k-NN). Moreover, they used minimum-Redundancy Maximum-Relevance
(mRMR) to apply feature selection. At last, they found that SFRP1, GSTM3, SULT1E1,
MB, TRIM29, and VSTM2L genes are the most relevant and frequent in the datasets.
The researchers concluded the study with high accuracy of the six genes found through
different techniques and confirmed once more by informing that five of the final six
genes were previously noted as genes related to breast cancer [14].
Liñares Blanco et al. used RNA-seq expression on the data from The Cancer Genome
Atlas. They used a standard statistical approach and two algorithms, including Random
Forest and Generalized Linear Models. The gene expressions differ between the conventional statistical approach and the created algorithms, as 99% of the genes generated by an algorithm are represented differently from those of the standard statistical approach.
There are some similarities in the method of identifying genes that have the potential
to structure tumors. For instance, filtering methods helped to identify tumors on the
unknown genes linked with the cell cycle [15].
Wang et al. obtained data from The Cancer Genome Atlas (TCGA). They used
Random Forest, eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting
Machine (LightGBM). LightGBM achieved the highest performance and accuracy, and
all three methods indicated that the hsa-mir-139 gene showed the highest correlation
with breast cancer. Moreover, there were other genes such as hsa-mir-21 and hsa-mir-
183 that yielded high correlations as well. From the results, the researchers drew two conclusions. First, LightGBM can serve as the primary method for detecting cancer, as it showed higher accuracy and efficiency than the other machine learning algorithms. Second, the
genes that exhibited correlation with breast cancer could work as biomarkers in cancer
diagnoses [16].
Johnson et al. used 3 RNA-seq data sets from Rat Body Map, National Center for
Biotechnology Information, and the Cancer Genome Atlas. Machine Learning meth-
ods such as support vector machines, random forest, decision table, J48 decision tree,
logistic regression, and naïve Bayes with three normalization techniques and two RNA-
seq analysis pipelines called the standard Tuxedo suite and RNA-Seq by Expectation-
Maximization (RSEM) were used. Generally, random forest proved to have the highest accuracy, between 71.3% and 99.8%. RNA-seq-based classifiers should utilize transcript-based expression data, feature-selection preprocessing, and the Random Forest classification method, but not normalization [17].
Figures 4 and 5 show the correlation matrix between tumor size or stage and five genes
with the highest correlation values. Generally, they show very low correlations between
genes and target features, which are tumor size and tumor stage.
Fig. 4. Correlation matrix of tumor stage and five different Genes’ Z-Scores
Fig. 5. Correlation matrix of tumor size and five different genes’ Z-Scores
Figures 6 and 7 show the correlation matrix between tumor size or stage and five
mutations with the highest correlation values. Similarly, they show very low correlations.
Fig. 6. Correlation matrix of tumor stage and data for five different mutations
Fig. 7. Correlation matrix of tumor size and data for five different mutations
Figures 8 and 9 represent the correlations between tumor specifics and the different factors in "Others"; the highest correlations among all the datasets, around 0.2–0.3, appear here. For tumor size, chemotherapy and type of breast surgery yielded correlations of 0.21 and 0.25 respectively. For tumor stage, chemotherapy and type of breast surgery showed correlations of 0.33 and 0.25. Conclusively, it was found that treatment processes such as surgery or chemotherapy had some correlated effect on the size or stage of the tumor.
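The correlation check described above can be sketched with pandas; the column names and toy values are hypothetical stand-ins for the METABRIC fields, not the study's data.

```python
# Sketch of correlating label-encoded "Others" factors with tumor size.
# Values are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "chemotherapy":    [0, 1, 1, 0, 1, 1, 0, 1],   # 1 = received
    "type_of_surgery": [0, 1, 1, 0, 0, 1, 1, 1],   # label-encoded category
    "tumor_size":      [12, 30, 28, 15, 25, 33, 18, 29],
})

# Pearson correlation of every factor against the target column.
corr = df.corr()["tumor_size"].drop("tumor_size")
print(corr.sort_values(ascending=False))
```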
Fig. 9. Correlation matrix of tumor size and factors in "Others" (chemotherapy: 0.206 / type of surgery: 0.25)
dimension dataset, EFB allows LGBM a faster computing speed. It is especially useful for datasets that include multiple string columns, as the algorithm must convert them into one-hot vectors, which increases the dimensionality of the dataset [20]. A diagram to help understand the mechanism of LGBM is shown in Fig. 11.
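The dimensionality growth caused by string columns, which EFB is designed to offset, can be seen directly with pandas one-hot encoding (the column names below are invented for illustration):

```python
# One-hot encoding turns each category into its own column,
# inflating the dataset's dimensionality.
import pandas as pd

df = pd.DataFrame({
    "surgery_type": ["mastectomy", "breast conserving", "mastectomy"],
    "cellularity":  ["high", "low", "moderate"],
})
print(df.shape)               # 2 string columns

encoded = pd.get_dummies(df)  # one column per (feature, category) pair
print(encoded.shape)          # 2 + 3 = 5 columns
```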
Nature versus nurture theory first arose in the field of psychology. It questions whether
our behaviors originated from our genes or our environments. Subsequently, other fields
started to engage with this theory in their debates. The theory applies to biology as well, and it has persisted for decades. The question goes like this: are diseases caused by
genetic predispositions or environmental factors? There are some diseases such as sickle
cell anemia and Huntington’s chorea that are solely affected by genetics. However, this
is not the case for most diseases as both factors are influential for the diseases to be
developed. The debate continues to determine which factor is more influential [21].
First of all, the data was obtained from the Breast Cancer Gene Expression dataset
(METABRIC) and divided into seven parts to investigate which sectors affect tumor
stage and size the most. The seven parts include Only Gene, Only Others, Only Muta-
tion, Mutation + Others, Gene + Others, Mutation + Gene, and All. These data were
then preprocessed through label encoding, null value checking, feature selection, and
train-test splitting. In the feature selection process, variables that are directly related to
tumor size or stage are removed from the “Others” dataset. After that, eight machine
learning algorithms obtained accuracy scores, root mean squared error (RMSE), and
mean absolute error (MAE) values, and the relationship between the data and the tumor
size or stage could be confirmed. The overview of the experiment workflow is shown
below in Fig. 12.
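The preprocessing pipeline just described (label encoding, null-value checking, leakage-aware feature selection, train-test split) can be sketched with pandas and scikit-learn; the column names are hypothetical stand-ins for the METABRIC fields.

```python
# Sketch of the preprocessing workflow; toy data stands in for METABRIC.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "type_of_surgery": ["mastectomy", "breast conserving", "mastectomy", None],
    "chemotherapy":    [1, 0, 1, 1],
    "tumor_size":      [30.0, 15.0, 28.0, 22.0],
    "tumor_stage":     [2, 1, 2, 2],
})

df = df.dropna().reset_index(drop=True)          # null-value checking
df["type_of_surgery"] = LabelEncoder().fit_transform(df["type_of_surgery"])

# Feature selection: drop columns directly tied to the targets.
X = df.drop(columns=["tumor_size", "tumor_stage"])
y = df["tumor_size"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_test.shape)
```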
4 Result
The graph below yields different root mean squared error (RMSE) values for different
types of data. Several different regressors have been used to get the RMSE values.
Through XGB Regressor, the value for Only Mutation resulted in 15.08, but when the
“Others” datasets were applied, the result showed 12.26. “Only Gene” received 15.04,
and the addition of the “Others” yielded a decrease to 12.09 as well. When the “Others”
factor was added to each, “Only Mutation” and “Only Gene” in Random Forest, the
RMSE values decreased by 2.71 and 3.10 respectively. Last, the addition of "Others" in the Extra Trees Regressor produced decreases in RMSE values of 1.62 for "Only Mutation" and 2.42 for "Only Gene".
Figure 13 shows that XGB Regressor, Random Forest Regressor, and Extra Trees
Regressor resulted in the lowest RMSE values.
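For reference, the two error metrics reported throughout this section can be written out directly with NumPy on toy values:

```python
# Root mean squared error (RMSE) and mean absolute error (MAE), spelled out.
import numpy as np

y_true = np.array([20.0, 35.0, 15.0, 28.0])
y_pred = np.array([22.0, 30.0, 18.0, 25.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # penalizes large errors more
mae = np.mean(np.abs(y_true - y_pred))           # average absolute deviation
print(rmse, mae)
```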
Fig. 13. Root Mean Square Error (RMSE) values for seven different groups
Table 1. Root Mean Square Error (RMSE) values for seven different groups

Data type         | Algorithm with lowest RMSE | Lowest RMSE | Algorithm with highest RMSE | Highest RMSE
Only Mutation     | XGB Regressor              | 15.08       | LGBM Regressor              | 16.21
Only Gene         | Extra Trees Regressor      | 14.77       | Decision Tree Regressor     | 18.19
Only Others       | Linear Regression          | 11.76       | Decision Tree Regressor     | 15.67
Mutation + Others | XGB Regressor              | 12.26       | Decision Tree Regressor     | 16.42
Gene + Others     | Random Forest Regressor    | 12.01       | Linear Regression           | 17.42
Mutation + Gene   | Extra Trees Regressor      | 15.00       | Linear Regression           | 21.28
All               | Random Forest Regressor    | 11.98       | Linear Regression           | 20.19
Regressors were divided into two separate charts due to the large difference between the mean absolute error (MAE) values. Figure 14 reveals the MAE values for the different types of data using Decision Tree Regressor, Random Forest Regressor, and Linear Regression. With Random Forest Regressor, the addition of "Others" decreased the MAE values of the Only Mutation and Only Gene datasets: Only Mutation decreased by 1.55, and Only Gene by 1.41.
Fig. 14. Mean Absolute Error (MAE) values for seven different groups
Figure 14 displays Random Forest Regressor as the algorithm with the lowest MAE
values from this group of algorithms. The lowest and the highest MAE_1 values and the
algorithms used for them are represented in Table 2.
Table 2. Mean Absolute Error (MAE) values for seven different groups

Data type         | Algorithm with lowest MAE_1 | Lowest MAE_1 | Algorithm with highest MAE_1 | Highest MAE_1
Only Mutation     | Decision Tree Regressor     | 10.29        | Linear Regression            | 10.82
Only Gene         | Extra Trees Regressor       | 10.21        | Linear Regression            | 13.06
Only Others       | Linear Regression           | 11.76        | Decision Tree Regressor      | 15.67
Mutation + Others | Random Forest Regressor     | 9.22         | Linear Regression            | 10.01
Gene + Others     | Random Forest Regressor     | 9.80         | Linear Regression            | 13.37
Mutation + Gene   | Random Forest Regressor     | 10.17        | Linear Regression            | 15.55
All               | Random Forest Regressor     | 8.78         | Linear Regression            | 15.55
Figure 15 indicates different MAE values for different types of data using XGB Regressor, LGBM Regressor, and Linear Regression. The second set of MAE values was evaluated through XGB Regressor. XGB Regressor resulted in 227.32 for the "Only Mutation" dataset, but when the "Others" datasets were applied, the value decreased to 150.22. For the "Only Gene" dataset, the XGB Regressor yielded 226.10; after the "Others" was added, it resulted in 146.13. Furthermore, Table 3 presents the MAE values for the seven groups of the experiment.
Fig. 15. The Second Set of Mean Absolute Error (MAE) values for seven different groups
Table 3. Second Set of Mean Absolute Error (MAE) values for seven different groups

Data type         | Algorithm with lowest MAE_2 | Lowest MAE_2 | Algorithm with highest MAE_2 | Highest MAE_2
Only Mutation     | XGB Regressor               | 227.32       | LGBM Regressor               | 262.84
Only Gene         | Extra Trees Regressor       | 218.26       | LGBM Regressor               | 228.05
Only Others       | XGB Regressor               | 154.76       | Extra Trees Regressor        | 205.72
Mutation + Others | XGB Regressor               | 150.22       | Extra Trees Regressor        | 212.50
Gene + Others     | XGB Regressor               | 146.13       | Extra Trees Regressor        | 152.64
(continued)
Figure 16 and Table 4 show the accuracy of the different machine learning algorithms with the different sections of the data. With the aid of "Others", "Only Mutation" and "Only Gene" gained 5.3 and 8.03 percentage points respectively in the LGBM Classifier.
Table 3. (continued)

Data type         | Algorithm with lowest MAE_2 | Lowest MAE_2 | Algorithm with highest MAE_2 | Highest MAE_2
Mutation + Gene   | Extra Trees Regressor       | 224.85       | LGBM Regressor               | 229.76
All               | XGB Regressor               | 145.30       | Extra Trees Regressor        | 158.58
Table 4. Accuracy scores for seven different groups

Data type         | Algorithm with lowest accuracy | Lowest accuracy | Algorithm with highest accuracy | Highest accuracy
Only Mutation     | KNeighbors Classifier          | 45.13           | XGB Classifier                  | 50.12
Only Gene         | Logistic Regression            | 49.88           | Random Forest Classifier        | 55.58
Only Others       | KNeighbors Classifier          | 54.99           | XGB Classifier                  | 66.42
Mutation + Others | KNeighbors Classifier          | 52.55           | LGBM Classifier                 | 67.4
(continued)
Table 4. (continued)

Data type         | Algorithm with lowest accuracy | Lowest accuracy | Algorithm with highest accuracy | Highest accuracy
Gene + Others     | Logistic Regression            | 57.91           | XGB Classifier                  | 66.67
Mutation + Gene   | KNeighbors Classifier          | 45.61           | Random Forest Classifier        | 55.34
All               | KNeighbors Classifier          | 52.55           | XGB Classifier                  | 66.67
5 Discussion
To sum up, the decrease in both RMSE and MAE values across the different machine learning algorithms when "Others" was added shows that "Others" lowered the RMSE and MAE scores, indicating that factor combinations including "Others" are better at predicting tumor size. Moreover, the accuracy of the data using the LGBM classifier showed a similar effect: the addition of "Others" increased the accuracy score, resulting in better predictions of the tumor stage. The debate of nature versus nurture has continued with a consensus that both factors are influential to some extent in diseases. However, according to the analysis above, "Others", the exterior factors, was concluded to be the most influential factor among the three in predicting tumor size and stage. In particular, Figs. 17 and 18 exhibit the type of breast surgery and chemotherapy as the most important factors among those in "Others" for determining tumor size and stage respectively. This shows that the exterior factors, such as the age at diagnosis, are more important than the number of genes or mutated genes themselves, thus concluding that nurture takes the bigger part in deciding tumor size and stage.
Fig. 17. Feature importance scores of different factors in “Others” for tumor size
Fig. 18. Feature importance scores of different factors in “Others” for tumor stage
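The feature-importance readings in Figs. 17 and 18 come from tree ensembles; the mechanism can be sketched on synthetic data where one stand-in column dominates by construction.

```python
# Impurity-based feature importances from a random forest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
# Only column 0 (imagine "type_of_surgery") drives the target.
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(random_state=0).fit(X, y)
importances = model.feature_importances_  # sums to 1 across features
print(importances.argmax())
```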
5.1 Limitations
The most striking limitation of this research is its low accuracy. Although data fused with the "Others" datasets achieves relatively higher accuracies of 50–60% than the other datasets, the highest value is 67.4%, obtained when the LGBM Classifier is used on the "Mutation + Others" dataset. In future research, when designing a model related to this issue, the focus should be on improving the accuracy of the model. Another potential weakness of our study is that the researchers do not have records of the patients' exposure to carcinogens, which are often consumed through drinking or smoking. Since carcinogens damage the DNA in our cells, their consumption can affect the size and the stage of the tumor [22]. Due to this weakness, the researchers were unable to evaluate the full extent of the exterior factors.
6 Conclusion
The given data, METABRIC, was divided into seven datasets, which were preprocessed through four methods. The preprocessed datasets were then run through eight different machine learning algorithms. The addition of "Others" resulted in lower RMSE and MAE values and higher accuracy, meaning that "Others" helped the models predict tumor size and stage better. According to the correlation matrix and the feature importance graphs, the type of breast surgery and chemotherapy were identified as the most influential factors in determining tumor size and stage. The nature versus nurture debate concerns whether the observable effects of biological diseases are due to genetics or the environment. While it is important to recognize that both factors are influential, this experiment strengthens nurture's side of the debate by showing that "Others", the exterior factors, improved the accuracy and predictability of the model. Further research should investigate external effects related to carcinogens, such as smoking and drinking alcohol, and quantify such data into the datasets. Deeper research is needed to improve the accuracy of the model based on this richer external data so that higher correlations can be found.
References
1. National Cancer Institute: What is cancer? 5 May 2021. https://www.cancer.gov/about-cancer/understanding/what-is-cancer
2. World Health Organization: Cancer, 3 March 2021. https://www.who.int/news-room/fact-sheets/detail/cancer
3. National Breast Cancer Foundation: Other types, 19 September 2019. https://www.nationalbreastcancer.org/other-types-of-breast-cancer. Accessed 13 Dec 2022
4. Centers for Disease Control and Prevention: What are the symptoms of breast cancer? 14 September 2020. https://www.cdc.gov/cancer/breast/basic_info/symptoms.htm
5. Breastcancer.org: Researchers identify 110 genes associated with breast cancer, 20 December 2018. https://www.breastcancer.org/research-news/110-genes-associated-with-breast-cancer
6. IBM Cloud Education: What is machine learning? IBM - United States, 15 July 2020. https://www.ibm.com/cloud/learn/machine-learning
7. Montesinos-López, O.A., et al.: A review of deep learning applications for genomic selection. BMC Genomics 22(1) (2021). https://doi.org/10.1186/s12864-020-07319-x
8. Jethanandani, M.: Machine learning and genetics. 23andMe Education Program, 8 August 2018. https://education.23andme.com/machine-learning-and-genetics/
9. Max-Planck-Gesellschaft: 165 new cancer genes identified with the help of machine learning. ScienceDaily, 12 April 2021. https://www.sciencedaily.com/releases/2021/04/210412142730.htm
10. Sung, H., Ferlay, J., Siegel, R.L., Laversanne, M., Soerjomataram, I., Jemal, A., Bray, F.: Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 71(3), 209–249 (2021). https://doi.org/10.3322/caac.21660
11. Frangioni, J.V.: New technologies for human cancer imaging. PubMed Central (PMC), 20 August 2008. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2654310/
12. MyVMC: Early detection of breast cancer information, 31 October 2018. https://www.myvmc.com/investigations/early-detection-of-breast-cancer/
13. Urda, D., Montes-Torres, J., Moreno, F., Franco, L., Jerez, J.M.: Deep learning to analyze RNA-seq gene expression data. In: Rojas, I., Joya, G., Catala, A. (eds.) IWANN 2017. LNCS, vol. 10306, pp. 50–59. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59147-6_5
14. Castillo, D., Gálvez, J.M., Herrera, L.J., Román, B.S., Rojas, F., Rojas, I.: Integration of RNA-Seq data with heterogeneous microarray data for breast cancer profiling. BMC Bioinform. 18(1) (2017). https://doi.org/10.1186/s12859-017-1925-0
15. Liñares Blanco, J., Gestal, M., Dorado, J., Fernandez-Lozano, C.: Differential gene expression analysis of RNA-seq data using machine learning for cancer research. In: Tsihrintzis, G.A., Virvou, M., Sakkopoulos, E., Jain, L.C. (eds.) Machine Learning Paradigms. LAIS, vol. 1, pp. 27–65. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-15628-2_3
16. Wang, D., Zhang, Y., Zhao, Y.: LightGBM. In: Proceedings of the 2017 International Conference on Computational Biology and Bioinformatics - ICCBB 2017 (2017). https://doi.org/10.1145/3155077.3155079
17. Johnson, N.T., Dhroso, A., Hughes, K.J., Korkin, D.: Biological classification with RNA-SEQ data: can alternatively spliced transcript expression enhance machine learning classifiers? RNA 24(9), 1119–1132 (2018). https://doi.org/10.1261/rna.062802.117
18. Alharbi, R.: Breast cancer gene expression profiles (METABRIC). Kaggle, 27 May 2020. https://www.kaggle.com/raghadalharbi/breast-cancer-gene-expression-profiles-metabric
19. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/10.1023/a:1010933404324
20. Ke, G., et al.: LightGBM: a highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 30, 3146–3154 (2017)
21. MedicineNet: Nature vs. nurture theory: genes or environment? (2022). https://www.medicinenet.com/nature_vs_nurture_theory_genes_or_environment/article.htm. Accessed 3 Jan 2022
22. Verywell Health: How these common carcinogens may be increasing your risk for cancers (2022). https://www.verywellhealth.com/carcinogens-in-cigarettes-how-they-cause-cancer-514412. Accessed 3 Jan 2022
Recognizing Mental States when Diagnosing
Psychiatric Patients via BCI and Machine
Learning
Ayeon Jung(B)
1 Introduction
1.1 Background
A psychiatric disorder is any disorder that interferes with a person's thoughts, emotions, or behavior, including anxiety disorders and depressive disorders. Physical exams and
lab tests are conducted [1] to diagnose these disorders. When people come to the psychi-
atrist for the first time, they fill out a few simple questionnaires, such as how long they’ve
had their symptoms, whether they have a personal or family history of mental health
concerns, and whether they’ve received any psychiatric therapy. These questionnaires
help psychiatrists understand the basic information about the patient’s condition [2].
These psychiatric evaluations usually take about 30 to 90 min; therefore, concentrating
during the entire evaluation is vital for accurate results [3].
A Brain-Computer Interface (BCI) is a computer-based system that allows direct
communication between a brain and an external device. It “acquires brain signals, ana-
lyzes them, and translates them into commands that are relayed to an output device to
carry out a desired action” [4] and is “often aimed at assisting, augmenting or repairing
human cognitive or sensory-motor functions” [5]. Research on BCIs has helped physi-
cally disabled patients by allowing for advancements in complex control over cursors,
prosthetics, wheelchairs, and other devices. It also opened up new possible methods of
neurorehabilitation for patients who suffer from strokes or other nervous system disorders [6]. The field of BCI is rapidly growing: BCI market revenue in the United States, by application, grew from 127.9 million to 354.3 million U.S. dollars over the 10 years from 2012 to 2022, as shown in Fig. 1 [7].
Specifically, interest in electroencephalography (EEG)-based BCI approaches has increased as "recent technological advances such as wireless recording, machine learning analysis, and real-time temporal resolution" have developed [8]. Electroencephalography is "a medical imaging technique that reads scalp electrical activity generated by brain structures" [5]. The EEG displays signals that are divided into several bands; "The
most commonly studied waveforms include delta (0.5 to 4 Hz); theta (4 to 7 Hz); alpha
(8 to 12 Hz); sigma (12 to 16 Hz) and beta (13 to 30 Hz)” [9].
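A common way to quantify these bands is to integrate a Welch power spectrum over each frequency range; the sketch below assumes a 250 Hz sampling rate and a synthetic alpha-dominated signal, using the band edges quoted above.

```python
# Per-band power from an EEG-like signal via a Welch periodogram.
import numpy as np
from scipy.signal import welch

fs = 250                                   # sampling rate in Hz (assumed)
t = np.arange(0, 10, 1 / fs)
# Synthetic signal: a 10 Hz (alpha-band) oscillation plus noise.
sig = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.default_rng(0).normal(size=t.size)

freqs, psd = welch(sig, fs=fs, nperseg=2 * fs)

bands = {"delta": (0.5, 4), "theta": (4, 7), "alpha": (8, 12),
         "sigma": (12, 16), "beta": (13, 30)}
power = {name: psd[(freqs >= lo) & (freqs < hi)].sum()
         for name, (lo, hi) in bands.items()}
print(max(power, key=power.get))
```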
Machine learning and deep learning are branches of artificial intelligence (AI) that mainly focus on learning automatically through experience. They are state-of-the-art algorithms and have been applied to various fields including science, healthcare, manufacturing, education, financial modeling, policing, and marketing [10].
Fig. 1. Brain computer interface (BCI) market revenue in the United States from 2012 to 2022,
by application
1.2 Purpose
This study aims to find ways to lower the prevalence of misdiagnoses that occur due to
patients’ experiencing poor concentration during their psychiatric evaluation. By analyz-
ing different mental states presented in the form of EEG brainwave signals, it becomes
possible to determine who is verifiably concentrated as EEG signals are an objective
indicator. Recognizing the mental state of patients using EEG brain wave signals will
allow psychiatrists and psychologists to receive accurate questionnaire answers and make
proper diagnoses instead of relying on self-reported claims of patients’ concentration
levels.
An algorithm-based model was developed in order to diagnose mental states using
EEG data. A total of four experiments were conducted with one control experiment
and three test experiments. The control experiment ran the raw data through machine
learning models to classify the mental states of participants. The first and second experi-
ments applied either the Principal Component Analysis (PCA) technique or Independent
Component Analysis (ICA) technique to the data. Both modified data sets were then run
through the machine learning classification models. The third experiment used a corre-
lation matrix to determine six columns of data with the highest corollary relationship
and applied them to the classification models.
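The third experiment's selection rule, keeping the six columns with the highest correlation to the label, can be sketched with pandas on synthetic stand-in data (the channel names are invented):

```python
# Correlation-based selection of the six most related columns.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 12)),
                  columns=[f"ch_{i}" for i in range(12)])  # EEG-like features
df["state"] = (2 * df["ch_0"] + df["ch_1"] - df["ch_2"]
               + rng.normal(scale=0.5, size=200))

# Rank columns by absolute correlation with the label, keep the top six.
top6 = (df.corr()["state"].drop("state")
          .abs().sort_values(ascending=False).head(6).index.tolist())
print(top6)
```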
The findings of this study will provide a method for improving the rate of proper diagnoses by allowing psychiatrists and psychologists to recommend a break or a concentration-enhancing exercise to patients who are unable to concentrate during the diagnostic tests.
The rest of this paper proceeds as follows: the literature review section covers related research on this topic; the methods and materials section introduces the various algorithms and the workflow. The result section covers the results of each experiment, while
the discussion part states the principal finding of our paper. Finally, in the conclusion
section, we summarize our research.
2 Literature Review
Kaczorowska et al. investigated the algorithms for removing EEG signal artifacts by
utilizing ICA and PCA in their experiment. Artifacts included were eye blinks, speaking,
and hyperventilation. The experiment’s subjects consisted of twenty people of similar
ages, and each of them entered a silent testing room with artificial lighting. After entering the testing room, subjects were asked to carry out standard actions with the aim
of collecting resting-state data, cognitive activity action data, and noise data. Mitsar EEG
201 was utilized to record the EEG data from the subjects with a frequency of 500 Hz.
PCA and ICA are both factor-analysis algorithms, based respectively on rotation of the coordinate system and on a linear representation of non-Gaussian data. Both were applied to the gathered dataset, and the experiment found that PCA outperformed ICA: it computed faster and was less demanding to use, which could lead to wider use of PCA for artifact removal in future research [11].
Wang et al. examined the relationship between emotional state and EEG data via
machine learning algorithms. They utilized some movie clips obtained from Oscar films
Recognizing Mental States when Diagnosing Psychiatric Patients via BCI 647
was recorded for 60 s per state - relaxed, concentrating and neutral - after being exposed to
three different stimuli. The dataset was then processed with statistical feature extraction, creating 2479 rows and 989 columns of data; Fig. 2 shows visualizations of the EEG data [15].
Fig. 2. Visualization of the EEG dataset based on the labels: concentrating, relaxed, neutral
for each divided data to yield the final results. Furthermore, the bagging method samples with replacement during bootstrapping. The random forest can be used readily through the Scikit-learn library, and several hyperparameters can be optimized. The representative ones are n_estimators, max_features, max_depth, max_leaf_nodes, min_samples_split, and min_samples_leaf, which control, respectively, the number of decision trees in the model, the number of features considered at each split, the maximum depth of each tree, the maximum number of leaf nodes per tree, the minimum number of samples required to split a node, and the minimum number of samples required at a leaf [17]. The overall structure of the random forest can be found in Fig. 3.
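The hyperparameters listed above map directly onto scikit-learn's RandomForestClassifier. A minimal sketch follows; the values are arbitrary illustrations, not the paper's tuned settings, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the EEG dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,      # number of decision trees in the ensemble
    max_features="sqrt",   # features considered at each split
    max_depth=10,          # maximum depth of each tree
    max_leaf_nodes=50,     # maximum number of leaf nodes per tree
    min_samples_split=2,   # minimum samples required to split a node
    min_samples_leaf=1,    # minimum samples required at a leaf
    random_state=0,
)
clf.fit(X, y)
print(clf.score(X, y))
```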
For the control, all 989 columns of EEG signal data were passed through the train/test split function and analyzed with eight classification models (Decision Tree, Logistic Regression, Random Forest, Gradient Boosting, Adaptive Boosting, K-neighbors, LGBM, XGB) without further processing to produce an accuracy score. The first test experiment condensed the data into five columns using the PCA feature extraction algorithm; the extracted data were then run through the same classification models as in the control experiment, yielding an accuracy score. The second test experiment condensed the data into five columns using another feature extraction algorithm, ICA. The third experiment extracted six feature columns via correlation matrix analysis. These data were likewise analyzed with the classification models, yielding an accuracy score. Figure 4 shows the overall process of the experiment.
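The reduction-then-classification workflow described above can be sketched with scikit-learn's PCA and FastICA. The data here is a synthetic stand-in (not the 989-column EEG dataset), so the accuracies will of course differ from those reported.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, FastICA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the EEG feature matrix.
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Reduce to five components with PCA or ICA, then classify.
for name, reducer in [("PCA", PCA(n_components=5)),
                      ("ICA", FastICA(n_components=5, random_state=0))]:
    Xr_tr = reducer.fit_transform(X_tr)
    Xr_te = reducer.transform(X_te)
    clf = RandomForestClassifier(random_state=0).fit(Xr_tr, y_tr)
    print(name, round(clf.score(Xr_te, y_te), 3))
```

Any of the eight classifiers named above can be substituted into the same loop.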
4 Result
Fig. 5. Bar graph showing the comparison of the accuracy score from machine learning models
A visualization of the data before and after running it through PCA for the first test experiment is shown in Fig. 6. The 989 columns of data were reduced to five columns, and the graph shows that the reconstructed data was an accurate representation of the original data.
In the first test experiment, the Decision Tree Classifier, Logistic Regression, Ran-
dom Forest, Gradient Boosting Classifier, Adaptive Boosting Classifier, KNeighbors
Classifier, LGBM Classifier, and XGB Classifier yielded an accuracy score of 79.84%,
82.06%, 87.7%, 86.09%, 68.35%, 86.09%, 87.1%, 85.08% respectively, as shown in
Fig. 7.
Fig. 7. Bar graph showing the comparison of the accuracy score from PCA + machine learning models
A visualization of the data before and after running it through ICA for the second test experiment is shown in Fig. 8. The 989 columns of data were reduced to five columns, and the graph shows the reconstructed data. The ICA-extracted data is a less accurate representation than the PCA-extracted data.
In the second test experiment, the Decision Tree Classifier, Logistic Regression,
Random Forest, Gradient Boosting Classifier, Adaptive Boosting Classifier, K-neighbors
Classifier, LGBM Classifier, and XGB Classifier yielded an accuracy score of 81.85%,
74.4%, 88.1%, 87.3%, 56.05%, 86.29%, 87.9%, and 86.09% respectively as depicted in
Fig. 9.
Fig. 9. Bar graph showing the comparison of the accuracy score from ICA + machine learning
models
In the third test experiment, the Decision Tree Classifier, Logistic Regression, Ran-
dom Forest, Gradient Boosting Classifier, Adaptive Boosting Classifier, K-neighbors
Classifier, LGBM Classifier, and XGB Classifier yielded an accuracy score of 71.77%,
66.53%, 78.02%, 76.21%, 57.86%, 71.17%, 80.04%, and 75.81% respectively as shown
in Fig. 11.
Fig. 11. Bar graph showing the comparison of the accuracy score from correlation + machine
learning models
5 Conclusion
5.1 Discussion
This study uses BCI and AI approaches to recognize insufficient concentration levels
when patients are being diagnosed. The algorithm produced a very high overall accuracy
score, with the highest being 98.19% using the LGBM Classifier on the raw dataset in the
control experiment. Of the three feature extraction methods, PCA was the most efficient, though ICA reached similar accuracy. The PCA-processed data peaked at an accuracy score of 87.7% with the random forest classifier, while the correlation matrix analysis peaked at 80.04% with the LGBM classifier. These feature selection methods will be able to produce an accurate
and efficient diagnosis of mental states in future studies with larger data sets. Unlike
previous research [12], the findings of this study can be used to analyze patients' brain waves while they are solving diagnostic tests administered by psychiatrists and psychologists.
Doctors will be able to recognize lowered concentration levels and advise that their
patients take a break or do an activity that enhances concentration. Patients will answer
all questions at a sustained level of sufficient concentration, ultimately allowing more
proper diagnoses to be made.
5.2 Summary/Restatement
This research paper’s purpose was to lower misdiagnosis rates that result from patients
experiencing deficient concentration levels during their psychiatric evaluation. Being
able to recognize faltered concentration using brain wave data will enable psychiatrists
and psychologists to provide methods for patients to regain concentration. The study
input 989 columns of EEG signal data into a train/test split function and analyzed them
using the classification models: Decision Tree, Logistic Regression, Random Forest,
Gradient Boosting, Adaptive Boosting, K-neighbors, LGBM and XGB. The first and
second test experiments condensed the data into five columns each using the feature
extraction algorithm PCA analysis or ICA analysis. The third test experiment extracted
six features using the correlation matrix analysis. The highest accuracy score of 98.19%
was produced by the LGBM Classifier in the control experiment, which could satisfy the
objective of this paper. The most efficient feature selection method was PCA but ICA
also yielded similar accuracy scores, and this finding could inform future research on
reducing features of the EEG datasets. The proposed model could help patients answer
questions with a sustained level of high concentration, which enables a more precise
diagnosis of psychiatric disorders.
References
1. MAYO CLINIC. https://www.mayoclinic.org/diseases-conditions/mental-illness/diagnosis-
treatment/drc-20374974. Accessed 30 Dec 2021
2. WebMD. https://www.webmd.com/mental-health/mental-health-making-diagnosis.
Accessed 30 Dec 2021
Diagnosis of Hepatitis C Patients via Machine Learning Approach
Ting Sun
1 Introduction
1.1 Background
Hepatitis C is a liver infection caused by the hepatitis C virus (HCV). It causes liver
inflammation and if not treated, may become a chronic liver disease. In fact, 80% of
infected individuals become chronic carriers and may develop fibrosis or cirrhosis–more
severe cases of liver damage [1]. Fibrosis is caused by liver scarring and high levels of
it can lead to cirrhosis which requires aggressive treatment [2]. HCV is spread through
contact with the blood from an infected person, including the sharing of equipment used
for drug injection as well as blood transfusions and organ transplants. However, with
the introduction of anti-HCV tests for blood donors, the rate of transmission of HCV by
blood transfusion has significantly decreased overall [3].
Although this decrease is prevalent in developed countries, there are still cases in
developing countries where a limited number of screening programs are available for
proper diagnosis and blood transfusion still remains a significant way of HCV trans-
mission [4]. Mostly in Asia and Africa, it was found that 31 of the 142 “developing”
countries do not undertake any anti-HCV screening, and another 37 screen less than
100% of blood [5]. Furthermore, while HCV only requires a blood sample for the screen-
ing test, later stages of liver diseases such as fibrosis and cirrhosis traditionally require
liver biopsy for proper diagnosis. A liver biopsy involves the removal of a small piece
of the liver tissue to be observed under a microscope for damage. Although it has been
the gold standard for diagnosing liver fibrosis, it is being recognized that liver biopsy
has several drawbacks. It has been found that 33.1% of the samples taken for biopsy are
misclassified by at least one grade of the fibrosis stage. Furthermore, beyond its diagnostic inaccuracy, cost is also a major issue for the implementation of liver biopsy [1].
Due to the invasiveness of liver biopsies, noninvasive methods of liver disease diagnosis
like radiology are also available.
However, like liver biopsy, diagnostic imaging such as magnetic resonance imaging
(MRI) or computerized tomography (CT) scans is costly for both the patient and the
hospital, often being limited in availability in developing countries [6]. As a result, a
continuing increase in the rate of acute hepatitis C infections, and the evident healthcare
disparity in access to HCV and liver disease diagnosis, create the need for a new, noninvasive, and affordable diagnostic tool. One way this demand can
be fulfilled is the use of machine learning (ML) to detect patterns of liver diseases within
existing HCV, fibrosis, and cirrhosis patient data. Then, the ML model can compare
blood donor data to the detected patterns to make predictions about the condition of the
blood donor as well as to further classify them with Hepatitis C, Fibrosis, or Cirrhosis.
1.2 Objective
The objective of this research is to develop a machine learning model that can accurately
classify Hepatitis C, liver fibrosis, and cirrhosis patients with preexisting medical records
of patients. There are currently several approaches to the diagnosis of Hepatitis C and
chronic liver diseases. For Hepatitis C, the most common system of diagnosis is utilizing
the anti-HCV test and the PCR test [7]. However, Hepatitis C often goes undetected due to its asymptomatic nature. Therefore, it is only when the condition has worsened to
become a chronic liver disease that the patients seek medical treatment. When it comes
to that point, further screening is required to diagnose them with the proper condition
(e.g. liver fibrosis, cirrhosis). This further screening includes radiology scans and liver
biopsies, which are more costly and, in the case of liver biopsies, more invasive. This makes it even more difficult for people in developing countries to be tested for liver fibrosis and cirrhosis.
However, machine learning can be an alternative to liver disease diagnosis since
it only requires patient data from simple blood tests. In this study, multiple machine
learning models are applied to preexisting data of patient records from past blood tests
658 T. Sun
or medical checkups. Having a machine learning model to predict the state of the patient
reduces the need for costly or invasive methods of diagnosis. Furthermore, the suspect
blood donor information of the dataset can be analyzed to create a more accurate model
for disease prediction, and this approach could be useful when the given dataset is not
accurate enough. The rest of this paper is organized as follows: The prior work on
diagnosing hepatitis C is discussed in the Literature review section. The Materials and
Methodologies, including the proposed model, and datasets for diagnosing hepatitis C,
are stated in Sect. 3. In the Result section, the result of the proposed model is explained.
Lastly, in the Conclusion section, the principal findings and a summary of this paper are presented.
2 Literature Review
Akella et al. aimed to use machine learning algorithms to predict the extent of fibrosis in
patients with Hepatitis C as an alternative to the current invasive methods of diagnosis.
The study used a dataset from the machine learning repository of the University of
California Irvine which contained patient data from Egyptian hospitals. After selecting
six relevant features from the dataset, the authors trained different machine learning algorithms with
the dataset to apply them in disease detection. The evaluation of the algorithms was done
with four parameters: accuracy, sensitivity, specificity, and the area under the receiver-
operating curve (AUROC). Six of the nine ML algorithms had evaluation parameters all
in the range of 0.60 to 0.96 for Experiments A and B and 0.34 to 0.64 for Experiment
C. Among the nine algorithms they used, XGBoosting had the best performance overall
with an accuracy of 0.81, AUROC of 0.84, a sensitivity of 0.95, and a specificity of 0.73
[8].
The purpose of Abd El-Salam et al.’s research was to develop a more efficient
technique for disease diagnosis through classification analysis, using machine learn-
ing techniques. The study focused on the diagnosis of Esophageal Varices, a common
side-effect of liver cirrhosis. The dataset used in the study was obtained from fifteen
different centers in Egypt between 2006 and 2017 and included twenty-four individual
clinical laboratory variables. After the dataset was cleaned for missing values, different
machine learning algorithms were applied to the data in order to predict esophageal
varices. To evaluate the performance of the classification algorithms, the sensitivity,
specificity, precision, Area under ROC (AUC) analysis, and the accuracy of the models
were calculated. Through the evaluation, the Bayesian Net algorithm was found to per-
form more efficiently and effectively than the other algorithms: 74.8% for the area under
the ROC curve and 68.9% for the accuracy. The paper concluded that the studied ML
models could be used as alternatives to gastrointestinal screening, the current method of
esophageal varices testing, for cirrhotic patients [9].
Chicco and Jurman analyzed and processed electronic health records (EHRs) of patients using machine learning classifiers. The EHRs of 540 healthy controls and 75
patients diagnosed with hepatitis C (total of 615 subjects) collected at Hannover Medical
School were considered to be the discovery cohort while another independent dataset
containing EHRs of 123 hepatitis C patients from Kanazawa University in Japan was
considered to be the validation cohort. Both the discovery cohort and validation cohort
Diagnosis of Hepatitis C Patients via Machine Learning Approach 659
had missing data that were replaced through Predictive Mean Matching (PMM). The
study performed binary classification analysis (for the discovery cohort) and regression
analysis (for the validation cohort) using Linear Regression, Decision Trees, and Random
Forests. The feature importance was also analyzed to investigate which clinical features
of the discovery cohort dataset were the most predictive of the status of the subject.
As a result, Random Forests achieved the top results for both binary classification and
regression: R2 of +0.765 and MCC of +0.858 [10].
Ahammed et al. aimed to classify the state of a patient’s liver condition by using
machine learning algorithms. The dataset analyzed was collected from the University of
California, Irvine machine learning repository. It contained data on almost 1385 HCV-infected patients, each classified by a stage of liver fibrosis. Using Synthetic
Minority Oversampling Techniques (SMOTE), the data was preprocessed and balanced.
Feature selection was also applied to the data to identify the most relevant features of the
dataset to improve the quality of the model’s performance. Then, the preprocessed data
was applied to various classifiers to determine which model is the best for liver condition
classification. As a result, KNN showed better outcomes than other classifiers with an
accuracy of 94.40%. Although there were limitations to their study in terms of the data
analysis, it was concluded that KNN could be a potential machine learning model for
HCV patient classification [11].
3 Materials and Methodologies
3.1 Dataset
The dataset was collected from the University of California Irvine Machine Learning Repository, as shown in Fig. 1. It contained 14 attributes and 615 instances. All
attributes except Category and Sex were numerical, with ten attributes being the labo-
ratory data: Albumin (ALB), Alkaline Phosphatase (ALP), Alanine Amino-Transferase
(ALT), Aspartate Amino-Transferase (AST), Bilirubin (BIL), Choline Esterase (CHE),
Cholesterol (CHOL), Creatinine (CREA), Gamma Glutamyl-Transferase (GGT), and
Protein (PROT). The Category attribute contained categorical data: Blood Donor, sus-
pect Blood Donor, and the progress of Hepatitis C (Hepatitis C, Fibrosis, Cirrhosis), and
this could be found in Fig. 2. Likewise, the Sex attribute also contained categorical data:
m (male) and f (female) [12].
Fig. 2. Pie chart showing the values from the “Category” column
3.2 XG Boosting
XG Boosting is an abbreviation for eXtreme Gradient Boosting, an ensemble machine learning algorithm used for both classification and regression. Because the gradient boosting algorithm is vulnerable to overfitting and slow, XG Boosting was introduced to overcome those shortcomings. As XG Boosting is developed from gradient boosting, the main structure of the algorithm is quite
values, Label Encoder was used to encode the categorical values into values between 0
and n − 1 (n being the number of classes in the category). Additionally, the data was
checked for any null values, and those found were removed from the data. With the
appropriate format of data, the data was split into train and test subsets. The training
subset is used to fit the machine learning model while the test subset is used to evaluate
the machine learning model used with the training set. Furthermore, it was noticed that
the values of the dataset differ greatly in range; for example, the CHOL values are in
ones while some CREA values go above a hundred. The problem with this is that some features might affect the model more than others. Therefore, a standard
scaler was applied to the dataset to standardize the values of the features so that the
relative size of the values does not interfere with the model [16].
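The preprocessing steps above can be sketched as follows. The toy frame uses the dataset's column names, but the values are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy stand-in for the HCV dataset; values are invented.
df = pd.DataFrame({
    "Category": ["0=Blood Donor", "1=Hepatitis", "2=Fibrosis", "0=Blood Donor"],
    "Sex": ["m", "f", "m", "f"],
    "ALB": [38.5, 42.0, None, 40.1],
    "CREA": [106.0, 74.0, 80.0, 115.0],
})

# Encode categorical columns to integers 0..n-1, then drop rows with nulls.
for col in ["Category", "Sex"]:
    df[col] = LabelEncoder().fit_transform(df[col])
df = df.dropna()

X, y = df.drop(columns="Category"), df["Category"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

# Standardize features so large-ranged columns (e.g. CREA) do not dominate.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)
print(X_tr_s.shape)
```

Fitting the scaler on the training split only, then applying it to both splits, keeps test information out of the preprocessing step.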
After the data had been preprocessed, different classification algorithms were fitted
onto the train set. The algorithms used were Logistic Regression, Decision Tree, LGBM,
XG Boosting, Gradient Boosting, and Random Forest. To evaluate the performance of
these ML models, the test set was used to find the accuracy scores of the models.
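The model-comparison loop can be sketched as below on synthetic stand-in data. LGBM and XG Boosting are omitted here to keep the sketch to scikit-learn alone; they plug into the same loop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed HCV data.
X, y = make_classification(n_samples=400, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}
# Fit each model on the training split, score it on the held-out test split.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, s in scores.items():
    print(f"{name}: {s:.2%}")
```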
Aside from classification, anomalies in the data were also detected to improve the
performance of the models. This anomaly detection was done through the Isolation
Forest algorithm where the blood donors and the suspect blood donors were detected
through data isolation. To analyze the performance of the anomaly detection algorithm,
its probability of detecting the blood donors and its probability of detecting the suspect
blood donors were found for both the train and test set. The overall process of the
proposed experiment is shown in Fig. 5.
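A minimal sketch of the Isolation Forest step follows, on synthetic clusters standing in for the real records: a large inlier cluster plays the blood donors, a small shifted cluster plays the suspect blood donors.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(300, 4))   # stand-in "blood donors"
outliers = rng.normal(6.0, 1.0, size=(10, 4))   # stand-in "suspect blood donors"
X = np.vstack([inliers, outliers])

# contamination = expected fraction of anomalies in the data.
iso = IsolationForest(contamination=10 / 310, random_state=0).fit(X)
pred = iso.predict(X)  # +1 for inliers, -1 for outliers

inlier_rate = np.mean(pred[:300] == 1)    # fraction of donors kept as inliers
outlier_rate = np.mean(pred[300:] == -1)  # fraction of suspects flagged
print(inlier_rate, outlier_rate)
```

On real data the detection probabilities would be computed the same way, by comparing the predicted labels against the known donor/suspect categories.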
4 Result
4.1 Experiment for Classification of Patients
Several ML algorithms were tried to determine which one best processes the HCV dataset, and the accuracy score of each algorithm was found. The lowest among the
six algorithms used was Decision Tree with an accuracy score of 92.66. Random Forest
also had a relatively low accuracy score of 93.79. Both LGBM and Gradient Boosting
algorithms had accuracy scores of 94.35. The algorithm with the highest accuracy score
among the six was XG Boosting with an accuracy score of 95.48. Figure 6 depicts the
accuracy scores of multiple algorithms.
Fig. 6. Bar graph showing the accuracy score from various machine learning classifiers
The feature importance of each attribute in the dataset was determined to visualize
which attributes were affecting the ML algorithm’s classification the most. In general,
the laboratory data attributes seemed to have high feature importance scores with the
exception of Age having a slightly higher score than Cholesterol (CHOL). The feature
with the highest importance score was Aspartate Amino-Transferase (AST), indicating
the level of this enzyme in a person greatly affected the algorithm’s choice of classi-
fication. Alanine Amino-Transferase (ALT) had the second highest feature importance
score while Albumin (ALB) had the third highest feature score. On the other hand, the
gender of the person had very little effect on the ML algorithm as indicated by the low
importance score of the attribute Sex. Figure 7 shows the feature importance score from
the variables.
Fig. 7. Bar graph showing the feature importance score from the variables
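Reading the per-attribute importance scores from a fitted tree ensemble can be sketched as below. The paper does not state which fitted model produced Fig. 7, so a random forest stands in here; the data is synthetic, so the resulting ranking will not match the paper's.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Column names mirror the dataset's attributes; the data itself is synthetic.
cols = ["Age", "Sex", "ALB", "ALP", "ALT", "AST", "BIL",
        "CHE", "CHOL", "CREA", "GGT", "PROT"]
X, y = make_classification(n_samples=300, n_features=len(cols), random_state=0)
X = pd.DataFrame(X, columns=cols)

clf = RandomForestClassifier(random_state=0).fit(X, y)
# Impurity-based importances, one score per attribute, summing to 1.
importances = pd.Series(clf.feature_importances_, index=cols)
print(importances.sort_values(ascending=False).head(3))
```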
To visualize the data consisting of inlier (blood donor) and outlier (suspect blood donor) samples, the dimensionality of the data had to be reduced due to the large number of
features. This dimensionality reduction was done through Principal Component Analysis
(PCA) technique, a statistical technique primarily used in machine learning for this exact
purpose. Once the data has been reduced to three dimensions, the inliers and outliers of
the data could be clearly visualized (Fig. 8 below). The purpose of this data visualization
was to detect the outliers present in the data.
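The reduction to three dimensions can be sketched as follows on stand-in data; the 3D scatter plot itself (Fig. 8) is omitted.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))  # stand-in for the numeric attributes

# Project onto the first three principal components for 3D plotting.
pca = PCA(n_components=3).fit(X)
X_3d = pca.transform(X)
print(X_3d.shape, pca.explained_variance_ratio_.round(3))
```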
Fig. 8. 3D plot showing the inliers (blood donor) and outliers (suspect blood donor)
Fig. 9. Confusion matrix plot showing the result from the isolation forest (training set)
Using the Isolation Forest algorithm for the training set, the probability of detecting
the blood donors in the dataset was 1.0, while the probability of detecting the suspect blood donors was 0.9459. With this, the accuracy of this unsupervised anomaly detection model on the training set was 94.66%. As for the test set, the probability of detecting blood donors was also 1.0, while the probability of detecting suspect blood donors was 0.9314. Furthermore, the accuracy of the unsupervised anomaly detection model on the test set was 93.22%. The confusion matrices of the results can be found in Fig. 9 and Fig. 10.
Fig. 10. Confusion matrix plot showing the result from the isolation forest (test set)
5 Conclusion
5.1 Discussion
In the first experiment of this study, it was found that XGBoosting allowed for the
most accurate model for the HCV data. Specifically, the accuracy score of the model
was 95.48, the highest among all classifiers. As the model’s accuracy was very high,
it can be a practical diagnostic tool for classifying actual patients with the three stages
of liver disease: Hepatitis C, liver fibrosis, and cirrhosis. Furthermore, this machine
learning model can be developed into a noninvasive and cost-effective diagnosis kit in
developing countries. Unlike MRI and CT scans that are often only available to highly
populated regions of a country (due to high cost and low availability), machine learning
models can be more easily distributed across the country at a lower cost. Additionally,
while patients have to pay high prices for radiological scans, the ML model only requires
the electronic records of patients, which can simply be obtained from the liver function
test, or a blood test that shows enzyme or protein levels.
In the second experiment, outliers like suspect blood donors were detected in the
dataset, which makes the model even more applicable to real world settings. This is
because real datasets, especially in developing countries, may have unreliable data that
interfere with the classification. Through anomaly detection, the model can isolate such unreliable data and improve its performance.
For further research, aside from the features in the dataset used in the study, the race
of the patient can also be added as a feature. Although race or ethnicity is not proven
to be directly related to the liver disease itself, there are significant differences in the
prevalence of Hepatitis C among different races and ethnicities [17]. Therefore, more
research can be done with a similar method but with a new dataset that contains the
race/ethnicity information of the patients. Through this further research, the ML model
for diagnosis can be improved for better accuracy, and the system could also be deployed on a Raspberry Pi.
5.2 Summary/Restatement
This study aimed to develop a machine learning model that can diagnose a patient
with Hepatitis C, liver fibrosis, or cirrhosis solely by analyzing the medical records of
patients. In order to develop the model, several ML classification algorithms were applied
to the HCV dataset. The accuracy score of each algorithm was found to determine the
performance of the models; as a result, XGBoosting had the best performance with an
accuracy score of 95.48. Because the dataset contained data on suspect blood donors,
the Isolation Forest algorithm was used to detect outliers. The probability of detecting
suspect blood donors was 93.14% with an accuracy of 93.22%. With both XGBoosting
and Isolation Forest having high accuracy scores, it shows that patients can be accurately
classified with the proper stages of liver disease through machine learning. To improve
the model for even more accuracy, data on the race and ethnicity of HCV patients can
be analyzed to account for those attributes.
References
1. Sebastiani, G.: Chronic hepatitis C and liver fibrosis. World J. Gastroenterol. 20(32), 11033–
11053 (2014)
2. Healthline. https://www.healthline.com/health/hepatitis-c-fibrosis-score#fibrosis-score.
Accessed 15 Jan 2022
3. NHS website. https://www.nhs.uk/conditions/hepatitis-c/diagnosis/. Accessed 15 Jan 2022
4. Selvarajah, S., Busch, M.P.: Transfusion transmission of HCV, a long but successful road map
to safety. Antivir. Ther. 17(7 Pt B), 1423–1429 (2012)
5. Prati, D.: Transmission of hepatitis C virus by blood transfusions and other medical
procedures: a global review. J. Hepatol. 45(4), 607–616 (2006)
6. Frija, G., et al.: How to improve access to medical imaging in low- and middle-income
countries? EClinicalMedicine 38, 101034 (2021)
7. Bajpai, M., Gupta, E., Choudhary, A.: Hepatitis C virus: screening, diagnosis, and interpre-
tation of laboratory assays. Asian J. Transf. Sci. 8(1), 19 (2014)
8. Akella, A., Akella, S.: Applying machine learning to evaluate for fibrosis in chronic hepatitis
C. MedRxiv (2020)
9. Abd El-Salam, S.M., et al.: Performance of machine learning approaches on prediction of
esophageal varices for Egyptian chronic hepatitis C patients. Inform. Med. Unlocked 17,
100267 (2019)
10. Chicco, D., Jurman, G.: An ensemble learning approach for enhanced classification of patients
with hepatitis and cirrhosis. IEEE Access 9, 24485–24498 (2021)
11. Ahammed, K., Satu, M.S., Khan, M.I., Whaiduzzaman, M.: Predicting Infectious state of
hepatitis C virus affected patient’s applying machine learning methods. In: 2020 IEEE Region
10 Symposium (TENSYMP) (2020)
12. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/HCV+data.
Accessed 17 Jan 2022
13. Chen, T., Guestrin, C.: XGBoost. In: Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (2016)
14. Cheon, M.J., Lee, D.H., Joo, H.S., Lee, O.: Deep learning based hybrid approach of detecting
fraudulent transactions. J. Theor. Appl. Inf. Technol. 99(16), 4044–4054 (2021)
15. Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation forest. In: 2008 Eighth IEEE International
Conference on Data Mining (2008)
16. Ferreira, P., Le, D. C., Zincir-Heywood, N.: Exploring feature normalization and temporal
information for machine learning based insider threat detection. In: 2019 15th International
Conference on Network and Service Management (CNSM) (2019)
17. CDC. https://www.cdc.gov/hepatitis/statistics/2019surveillance/Figure3.6.htm. Accessed 17
Jan 2022
Data Analytics, Viability Modeling
and Investment Plan Optimization of EDA
Companies in Case of Disruptive Technological
Event
1 Introduction
The question of the survival and viability of companies has long attracted researchers.
There are two main approaches to predicting bankruptcy or optimizing the company’s
investment policy in order to improve the viability of the company: 1) the first is based on
the techniques of discriminant analysis (see for example [16–18]). An important work
in this connection is that by Altman [1]. In this approach statistical data can be used as a
representative sample of two groups of companies: survivors and failures. Then
a separating surface (in the simplest case a plane) is built to separate the two groups.
The positioning of the separating surface is optimized so that the number of incorrectly
classified companies is minimal. Using the optimal parameters of the separating surface,
predictions can be made, with a certain accuracy, as to which group a new company not included in the sample belongs, and advice can be given on which parameters the company needs to improve in order to fall into the group of survivors. 2) The second
approach is the direct one. It is based on the optimization of one or more criteria (objective
functions) and improves the values of factors, which are crucial for the better viability
of the companies.
Important related works applying the direct approach are connected with the developed theory of viable system modeling. The viable system model [2] was developed to evaluate whether a company or organization is able to survive in a rapidly changing environment/market. The initial idea was to develop a cybernetic theoretical model of the brain [2], corresponding to the management of a company; this model was applied to a real example in the production of steel bars [3]. Its author, S. Beer, sought to answer the question: how are systems "capable of self-existence"? The viable system model (VSM) considers the individual organization as a whole system that exists in constant balance with its rapidly changing (market) environment. Based on the theory of cybernetics, Beer came to the conclusion that the model determining the viability of any system has five necessary and sufficient subsystems, interactively included in each organism or organization that is able to maintain its identity. Because the theoretical model proved to be too complex [5], a simplified version was subsequently created and published in the book "Brain of the Firm: A Development in Management Cybernetics" (see [4]). The simplified model uses neuro-physiological terminology instead of mathematics. Later, a new version of the VSM was developed, called "The Heart of the Enterprise" [6]. Finally, looking at applications of the VSM, Beer published a third book, "Diagnosing the System for Organizations" [7]. The VSM comprises five subsystems, each with its own role, but working closely with one another. The first three refer to the current activities of the organization, where the first subsystem consists of the elements producing the organization's output. The fourth subsystem focuses on the future effects of external changes and requirements affecting the organization. The fifth maintains a balance between current activities and future external changes and thereby ensures the viability of the organization. All subsystems interact with each other and are connected with the dynamically changing environment/market.
With the help of the VSM, it is possible to study the internal and external balance of the organization/company and to make the improvements necessary for its survival. Other recently published works on this topic are [10, 15].
Dynamic Financial Analysis [12] is used to reveal economic factors and to create different scenarios in the business modeling of insurance companies [8, 9, 19]. This approach can be useful for understanding the dependencies between business models and the external impacts on a company's profitability. Unfortunately, the real economic risks involved can make the created dynamic models unreliable.
In the specific area of electronic design automation (EDA), innovation events have a huge impact on the survival of companies and the emergence of new start-ups. There is a gap in the modeling of such a specific environment, and the purpose of this paper is to fill it by offering new models in this area. Comparisons with other similar models are not possible, as the authors have no data on such developments in the EDA area. This study is focused on the second-mentioned (direct) approach in the field of EDA companies. As noted in [14], the first available data for this company type is from the year 1961, i.e. this business sector is 60 years old. Moreover, the existence of an EDA company is particularly dynamic and depends heavily on many factors related to technological development. The factors influencing the market are also very important; a major factor influencing the viability of EDA companies is the emergence of disruptive technological events (innovations). The factors influencing the electronic design automation software market are analyzed in [11].
viability of the EDA companies. For this reason, the period of one year after this event is also considered (see Fig. 1), because it is equal to the tolerable delay Dmax.
Let N denote the number of EDA companies under consideration, and let Mi denote the sum of investments of company i for the considered period, i = 1, 2, …, N.
Based on the work [14], five fields of possible investments for the EDA companies are considered here (see Table 1). For each investment type, a return coefficient is introduced: for example, if the invested amount of type j is 1 and the return coefficient rj = 1.2, the expected return of this investment after one year is evaluated to be 1.2. The variables in our model are the proportions k1i, …, k5i for each company i (see Table 1). In the table below, SW&HW denotes Software and Hardware.
Let the Boolean variables y1, …, y5 indicate whether investments of a given type are made, where yj = 0 means "no investments of type j" and yj = 1 means "investments of type j are made". For example, y1 = 1 means "investments are made in the Development of Interfaces", and y3 = 0 means "there are no investments in SW&HW description languages". The following thresholds of investment necessary to achieve a technological breakthrough event (innovation) are then introduced (see Table 2):
By making an investment equal to or greater than the set threshold, the corresponding company is able to implement the necessary innovation in its production. Based on the above notation, we can express the reaction delay di of company i to a given innovation event. Let, for example, the investment threshold be P3 = 40 units, and at the time moment te − 1 let the i-th company invest 30 units in this field (Processing power). Assuming that the investments remain constant after the innovation event, this company will be able to follow the innovation in the corresponding field after P3/(k4i Mi) = 40/30 years, i.e. 1 year and 4 months. Hence the delay of the i-th company in this field will be 4 months after the innovation event. The maximal reaction delay of the i-th company to an innovation event can be expressed in the form:
di = max{P1/((k1i + k2i + k5i)Mi) − 1; P2/(k3i Mi) − 1; P3/(k4i Mi) − 1} (1)
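For concreteness, Eq. (1) can be checked numerically. The following Python sketch is our own illustration, not part of the model's MATLAB implementation; all numbers except P3 = 40 and the 30 invested units in the processing-power field are made up so that the P3 term dominates, reproducing the worked example above.

```python
def reaction_delay(P1, P2, P3, k, M):
    """Maximal reaction delay d_i of Eq. (1).

    k = (k1, k2, k3, k4, k5) are the investment proportions of company i,
    M is its total investment; P1..P3 are the innovation thresholds.
    """
    return max(
        P1 / ((k[0] + k[1] + k[4]) * M) - 1,  # talent-related fields (r1, r2, r5)
        P2 / (k[2] * M) - 1,                  # SW&HW description languages
        P3 / (k[3] * M) - 1,                  # processing power
    )

# Worked example from the text: P3 = 40 and k4*M = 0.3*100 = 30 units
# invested in processing power give 40/30 - 1 = 1/3 year, i.e. 4 months,
# provided the other two ratios are smaller.
d = reaction_delay(P1=20, P2=10, P3=40, k=(0.2, 0.2, 0.2, 0.3, 0.1), M=100)
print(round(d * 12))  # delay in months -> 4
```

The delay is negative when the running investment already exceeds the threshold, i.e. the company innovates before the event.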
A possible optimization criterion based on the reaction delays is:
min{max{d1, d2, …, di, …, dN}} (2)
A better criterion is to consider only the reaction delays greater than one year.
The profit Ri of the i-th company for a period of one year can be expressed as:
Ri = Σ(j=1..5) rj kji Mi − Mi (3)
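Equation (3) is a simple weighted sum; a minimal Python check (our illustration — the uniform return coefficient below is made up, not a value from Table 1):

```python
def profit(r, k, M):
    """Profit R_i of Eq. (3): sum_j r_j * k_ji * M_i - M_i."""
    return sum(rj * kj for rj, kj in zip(r, k)) * M - M

# With a uniform return coefficient of 1.2, the yearly profit is 20% of M.
print(round(profit(r=[1.2] * 5, k=[0.2] * 5, M=100), 6))  # -> 20.0
```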
Regarding the obligatory constraints, it should be mentioned that the sum of the proportions of the total investment is equal to 1. Other constraints are connected with the investment types corresponding to P1, because they involve attracting and using talented people in the company. It can be expected that the available talented people are not enough to cover the needs of all companies, and that not every company has connections with academic circles in the mentioned areas. From Table 1 it follows that the investments with r1, r2 and r5 (Development of Interfaces; Mathematical Methods and Solvers; and Development of Models and search for Application areas) bring 60% of the total profit. Hence it can be concluded that companies that invest a smaller share in the mentioned fields have wrong investment plans and do not attract and use enough "talented" people. This circumstance can be expressed by a constraint on the sum (k1i + k2i + k5i) for the concrete company i. Competitive interactions may impact the resources ("talented" people); for simplicity it is assumed that, once hired, talented people are loyal to their employer and do not change the company in which they work.
Based on these considerations the following two mathematical models are proposed
to improve the viability of EDA companies:
674 G. Marinova et al.
MODEL I:
max Σ(i=1..N) [Σ(j=1..5) rj kji Mi − Mi] = −min Σ(i=1..N) [Mi − Σ(j=1..5) rj kji Mi] (4)
subject to:
Σ(j=1..5) kji = 1; i = 1, 2, …, N; (5)
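Without the company-specific constraints (6)–(9), Model I decomposes per company: the objective is linear and the only remaining constraint is the simplex condition (5), so each company's profit is maximized by putting the whole proportion into the field with the largest return coefficient. The sketch below (Python, with made-up coefficients) illustrates only this unconstrained special case; the real model additionally needs constraints of the kind (6)–(9) and (12)–(15) and is solved with general-purpose solvers such as fmincon.

```python
def solve_model_I_unconstrained(r, M):
    """Maximize sum_i (sum_j r_j*k_ji*M_i - M_i) subject to sum_j k_ji = 1.

    With only the simplex constraint (5) and k_ji >= 0, the linear
    objective attains its maximum at a vertex: each company invests
    everything in the field with the largest r_j.
    Returns the optimal proportions and the total profit of Eq. (4).
    """
    best = max(range(len(r)), key=lambda j: r[j])
    k_opt = [[1.0 if j == best else 0.0 for j in range(len(r))] for _ in M]
    total = sum((r[best] - 1.0) * Mi for Mi in M)
    return k_opt, total

# Illustrative data: 3 companies, 5 investment fields.
r = [1.10, 1.25, 1.05, 1.15, 1.20]
k, F = solve_model_I_unconstrained(r, M=[100, 50, 80])
print(F)  # (1.25 - 1) * (100 + 50 + 80) -> 57.5
```

The constraints (6)–(9) break this decomposition by capping (k1i + k2i + k5i) for the companies with wrong investment plans, which is exactly what makes a numerical solver necessary.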
where a1, a2, a3 and a4 are specific constants, reflecting the wrong investment policy
of companies with numbers il*.
MODEL II:
4 An Illustrative Example
This illustrative example includes 20 companies, i.e. N = 20. The total investments of the considered companies and the investment thresholds for generating an innovation are given in advance. The variables in the optimization process are the proportions of total investment for each company. Five types of investments are considered here; hence, for 20 companies the example includes 100 variables.
The example is a simulated one, because the investment units are not real but virtual units proportional to the data of real companies from the database published at:
https://sec.report/
This is an American database that stores financial data for publicly listed companies (stock markets).
The total investments of all 20 EDA companies for a period of 1 year are presented
in Table 3.
It is assumed that the values of investment thresholds necessary to achieve an
innovation event are:
P1 = 120; P2 = 30; P3 = 50;
Regarding the constraints (6)–(9) it is assumed that il* ≡ {6, 11, 16, 20}.
The corresponding constraints look as follows:
k1,6 + k2,6 + k5,6 ≤ 0.30; (12)
5 Optimization Results
The above illustrative example is solved by means of MATLAB solvers fmincon and
patternsearch.
The following tasks are formulated and solved:
Starting the fmincon solver, after eight iterations the result shown in Fig. 2 was obtained.
The initial profit value is F0 = 145.0301826571;
the optimal profit for all companies is F* = 155.7600020019.
The obtained improvement is 6.897%.
For comparison, starting from the same initial point, the patternsearch solver found the best solution after four iterations, with profit value F2* = 154.290515625. The obtained improvement is a little smaller: 6.002%. The corresponding result is shown in Fig. 3.
The optimal solution with profit value F* corresponds to different reaction delays of the companies. Only company 3 and company 18 have a negative delay, with d3 ≈ −0.4 and d18 ≈ −0.3; this means that both companies are able to (and will) realize innovations in the fields corresponding to investment threshold P1. The constraints (12)–(15) impose the wrong investment plan on companies № 6, 11, 16, and 20. In this case only company № 6, with the greatest total investment, survives; companies № 11, 16, and 20 obtain reaction delays di > 1 and will be merged with other companies or will disappear.
6 Conclusions
Based on data analytics, key factors determining the viability of EDA companies are selected. These factors are used to formulate two single-objective optimization models aimed at improving the viability of EDA companies, by maximizing the total profit of all considered EDA companies or by minimizing the sum of the squared delays of all considered EDA companies after an innovation event. The criterion in the first model is linear and maximizes the profit of the companies, which contributes to their longer viability. The criterion in the second model is nonlinear and minimizes the reaction delays of the EDA companies when they are greater than the tolerable delay Dmax of one year. The results show that the second model leads to better solutions and is more effective than the first; nevertheless, both models have shown good performance during the simulation tests. The obtained results are encouraging, and it can be concluded that these models can be successfully used to solve tasks with real data in the mentioned area.
Further investigations of the proposed models should be performed on real data in the same area. The generated solutions could contribute to improving the real investment plans of EDA companies. A conclusion could also be drawn as to which production parameters a given company has neglected and has to improve in order to fall into the survivors' group.
A direct approach to improving the viability of companies is rarely used, and very few optimization methods have been developed in this area. This is an open field for future research.
Acknowledgment. This work is supported by the Bulgarian National Science Fund by the project
“Mathematical models, methods and algorithms for solving hard optimization problems to achieve
high security in communications and better economic sustainability”, Grant No: KP-06-N52/7.
References
1. Altman, E.I.: Financial ratios, discriminant analysis and the prediction of corporate
bankruptcy. J. Financ. 25(4), 589–609 (1968)
2. Beer, S.: Cybernetics and Management. English Universities Press, London (1959)
3. Beer, S.: Towards the cybernetic factory. In: Principles of Self Organization (symposium).
Pergamon Press, Oxford (1960)
4. Beer, S.: Brain of the Firm: A Development in Management Cybernetics. Herder and Herder,
New York (1972)
5. Beer, S.: The viable system model: its provenance, development, methodology and pathology.
J. Oper. Res. Soc. 35(1), 7–25 (1984)
6. Beer, S.: The Heart of Enterprise. Wiley, Chichester (1979)
7. Beer, S.: Diagnosing the system for Organizations. Wiley, Chichester (1985)
8. Blum, P., Dacorogna, M.: Dynamic Financial Analysis - Understanding Risk and Value Cre-
ation in Insurance (2003). https://www.researchgate.net/publication/23749485_Dynamic_F
inancial_Analysis_-_Understanding_Risk_and_Value_Creation_in_Insurance. Accessed 10
Mar 2022
9. D’arcy, S.P., Gorvett, R.W., Hettinger, T.E., Walling III, R.J.: Using the Public Access DFA
Model (1998). http://www.casact.org/pubs/forum/98sforum/
10. Espejo, R., Reyes, A.: Organizational Systems: Managing Complexity with the Viable System
Model. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-19109-1
11. Gaul, V.: Electronic Design Automation Software Market, 267 pages, December 2020. https://www.alliedmarketresearch.com/electronic-design-automation-software-market
12. Kaufmann, R., Gadmer, A., Klett, R.: Introduction to Dynamic Financial Analysis (2004).
http://www.casact.org/library/astin/vol31no1
13. Marinova, G.I., Bitri, A.: Assessment and forecast of EDA company viability in case of disruptive technological events. In: MATHMOD 2022 Discussion Contribution Volume, 10th Vienna International Conference on Mathematical Modelling, Vienna, Austria, 27–29 July 2022, ARGESIM Report 17 (ISBN 978-3-901608-95-7), pp. 33–34 (2022). https://doi.org/10.11128/arep.17.a17084
14. Marinova, G.I., Bitri, A.: Review on formalization of business model evaluation for technolog-
ical companies with focus on the electronic design automation industry. In: IFAC Conference
TECIS 2021, Moscow, Russia, pp. 630–634, September 2021
15. Mulder, P., Viable System Model (VSM) (2018). ToolsHero: https://www.toolshero.com/man
agement/viable-system-model/. Accessed 05 Feb 2022
16. Rubin, P.A.: Solving mixed integer classification problems by decomposition. Ann. Oper.
Res. 74, 51–64 (1997)
17. Soltysik, R.C.F., Yarnold, P.R.: The Warmack-Gonzalez algorithm for linear two-group multi-
variable optimal discriminant analysis. Comput. Oper. Res. 21, 735–745 (1994)
18. Warmack, R.E., Gonzalez, R.C.: An algorithm for the optimal solution of linear inequalities
and its application to pattern recognition. IEEE Trans. Comput. C 22, 1065–1075 (1973)
19. Wiesner, E.R., Emma, C.C.: A Dynamic Financial Analysis Application Linked to Corporate Strategy (2000). http://www.casact.org/pubs/forum/00sforum/
Using Genetic Algorithm to Create
an Ensemble Machine Learning Models
to Predict Tennis
1 Introduction
The prevalence of data and statistics about past games and players has helped
researchers create predictive models for head-to-head games. From 2009
onward, there has been a growing level of interest in applying machine learning to
sport [5]. Bunker and Susnjak showed that the application of machine learning
to tennis is far less popular than its application to soccer and basketball.
They also noticed that ensemble techniques are not as frequently used to predict
sport as stand-alone models such as artificial neural networks and decision trees.
In this paper, we report a study on using data about tennis players to create
an ensemble technique for predicting the outcome of tennis games.
Our study consisted of two major parts. First, by utilizing a genetic algorithm, we
focused on finding good data representations to improve the accuracy of machine
learning algorithms. Second, we utilized the good data representations from the
first part of the study to create an ensemble technique, which is then used to predict
the outcome of future tennis games. We tested our ensemble technique on the
women's singles at the 2020 Australian Open, the 2021 French Open and the
2021 US Open. The predictions of the ensemble models were compared to
predictions based solely on the player rankings.
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 681–695, 2023.
https://doi.org/10.1007/978-3-031-18461-1_45
682 A. S. Randrianasolo and L. D. Pyeatt
1.1 Motivation
2 Related Work
match, and a tournament were calculated. This approach identified the winners
of the 2002 US Open and Wimbledon tournaments [14].
Gu and Saaty combined data and judgments to predict the winners of the
tennis matches of the 2015 US Open [7]; the prediction accuracy was 85.1%.
They utilized data on tennis matches from 1990 to 2015 for the men and from
2003 to 2015 for the women. The predictors used by the approach can be
categorized as basic tournament information, player description, and performance
metrics. In addition to these data, the authors added other human judgments
such as tactics, state and performance, psychology, brainpower, and experience.
In total, they ended up with 7 clusters containing 24 predictors. The predictors
associated with the two players set to play a game were fed to an Analytic
Network Process to predict the winner.
Knottenbelt, Spanias, and Madurska developed a model to predict tennis
games using a transitive property [11]. The idea behind this transitive approach
was that if player a is better than player c and c is better than b, then it can
be inferred that a is better than b. To predict the outcome of a game between
two players, a and b, the approach looked at historical data to find previous
common opponents of a and b. Given a common opponent, the proportions of
serve points won and of return points won by a and b were calculated. These
values were used to compute the measure of advantage or disadvantage of a
over b given the same common opponent. This value, in its turn, was used in an
O'Malley equation to produce the probability that a beats b via the same
opponent. The final step consisted of averaging the probability that a beats b
over all the possible common opponents, usually limited to 50, from the
historical data. The approach's best performance was a 77.53% accuracy in
predicting the 2011 US Open, with a 9.01% return on investment.
Barnett and Clarke developed a model to show that it was possible to predict
the outcome of a tennis match both before the game and while the game was
progressing. This approach was tested on predicting the longest men's match,
between Roddick and El Aynaoui at the 2003 Australian Open [15]. The model
used the players' winning percentages on both serving and receiving. For the
two players in the game to be predicted, the following statistics were recorded:
percentage of first serves in, percentage of points won on first serve, percentage
of points won on second serve, percentage of points won on return of first serve,
and percentage of points won on return of second serve. For the longest game
on which this approach was tested, the statistics came from the average of 70
games. These statistics were used to calculate the players' winning percentages
on serving and receiving, which in turn were used in formulas for the combined
percentage of points won on serve by player i against j and the combined
percentage of points won on return by player j against i. Finally, these combined
percentages were used in a Markov chain model to perform the prediction. The
prediction produced the players' chances of winning and the likely length of
the game.
McHale and Morton utilized a Bradley-Terry model to predict the outcomes of
tennis games. The prediction for a player, i, to win a game is calculated by
dividing i's ability by the combined ability of i and j (the opponent) [12]. This
basic Bradley-Terry probability was updated by the authors to include decayed
weights of previous games. The approach also gave more weight to previous
games played on the same surface as that of the game to be predicted. The
players' abilities were rankings derived from historical data from 2000 to 2008
and consisted of the number of wins, the number of wins combined with date,
match score, match score combined with date, or match score combined with
date and playing surface. Each prediction made by the model required the
previous three years of data. This approach did better than prediction based on
the players' official rankings for games from 2001 to 2008. When utilizing match
score and date, this approach was 66.0% accurate; when utilizing match score,
date, and surface type, it was 66.90% accurate.
Klaassen and Magnus created a model capable of performing predictions
both before and during the match. The model produced the probability for a
player to win the match against the given opponent. This probability was
calculated using the probabilities of winning a point on service for the two
opposing players. The before-match probability was extracted from Wimbledon
singles matches from 1992 to 1995 using a logit regression model. This starting
probability was then updated as the match progressed. The approach was applied
to the Sampras-Becker (1995) and Graf-Novotna (1993) Wimbledon finals [16].
Candila and Palazzo used 26,880 male tennis matches from 2005 to 2018 to
train and validate an artificial neural network (ANN) [3]. They used 32
predictors, consisting of players' statistics, players' biometrics, and synthetically
generated data such as fatigue and betting odds from bookmakers. The goal of
the ANN was to produce the probability for a player i to win a match j. This
approach was validated by predicting matches from 2013 to 2018, and it
outperformed the logit regression [16], the probit regression [13], the
Bradley-Terry type [12], and the point-based approach [15] in terms of return
on investment.
Wilkens utilized men's and women's tennis matches from 2010 to 2019 to show
that an ensemble technique is the best approach to use when betting on tennis
games [8]. This approach used the differences of players' statistics, locations,
tournament information and four different odds from bookmakers as predictors.
The ensemble consisted of logistic regression, a neural network, a random forest,
gradient boosting and a support vector machine. The accuracy of each model,
when not in an ensemble, was about 70%. When the ensemble was used on the
betting market, returns of 10% and more were detected.
The machine learning techniques in the surveyed literature either used very
extensive historical data or used complex data engineering that required the
involvement of human experts. In our proposed approach, we limit the historical
data used to at most one year back. We also reduce human involvement by using
a genetic algorithm to perform the data engineering for us.
3 Early Observation
In our previous work [1], we used statistics from 2019, available from
wtatennis.com, to predict the women's 2020 Australian Open. In all, 15
predictors were initially collected: the ranking, aces per game, number of matches,
double faults per game, first serve percentage, second serve points percentage,
first serve points percentage, serve points won percentage, service games won
percentage, break points percentage, first return points percentage, second return
points percentage, break points converted percentage, return games won
percentage, and return points won percentage.
presented to the algorithm 100 times using an 80%–20% random split. The best
parameters, measured by average accuracy, were used to create the final models
that predict the games after the first rounds. The accuracy of the predictions
is captured in Fig. 1. We used predictions based on the players' rank, not
involving machine learning, as the benchmark against which to compare each model.
The final predictors used for each model were the ranking, number of
matches, aces per game, double faults per game, and the percentages of first
serve, first serve points, serve points won, break points, service games won,
first return points, second return points, return games won and return points
won.
Since the genetic algorithm is not guaranteed to output exactly the same
vector in every run, due to the randomness involved in the search, we repeated
the search process 101 times. We ended up with 101 tolerance vectors that were
used to create 101 models for each machine learning algorithm we considered.
These 101 models formed our ensemble. Each model in the ensemble was used
to predict a game, and a majority rule between the models was used to
consolidate the prediction of the ensemble. Figure 2 summarizes this approach.
A genetic algorithm, as its name implies, is a search algorithm that mimics the
process used for genetic information transmission in humans or animals. It is
based on the survival of the fittest. In a genetic algorithm, a potential solution,
expressed as a string of characters, is called an individual. The set of individuals
forms the population. Each individual is associated with a fitness value. This
value indicates the quality of the solution that the individual represents. Like
in biology, individuals in the population are permitted to reproduce to generate
new solutions.
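To make this concrete, here is a compact sketch of a genetic algorithm of the kind used in this work, evolving integer tolerance vectors with roulette wheel selection, midpoint crossover and per-gene mutation. This is our own illustration under stated assumptions, not our actual implementation: the fitness function below is a stand-in (in the real system, fitness is the accuracy of a machine learning model on first-round games), the hidden target vector, population size, mutation rate and generation count are all made up, and we add elitism so the sketch provably never loses its best individual.

```python
import random

random.seed(0)

VEC_LEN, LOW, HIGH = 11, 1, 20  # tolerance vectors: 11 integers in 1..20

def fitness(vec):
    """Stand-in fitness: closeness to a hidden target vector. In the real
    system this would be the accuracy of an ML model trained on the game
    representation induced by `vec`."""
    target = [10] * VEC_LEN
    return 1.0 / (1.0 + sum(abs(a - b) for a, b in zip(vec, target)))

def roulette(pop, fits):
    """Roulette wheel selection: pick an individual with probability
    proportional to its fitness."""
    pick = random.uniform(0, sum(fits))
    acc = 0.0
    for ind, f in zip(pop, fits):
        acc += f
        if acc >= pick:
            return ind
    return pop[-1]

def crossover(a, b):
    cut = VEC_LEN // 2  # fixed midpoint (random cut points are an alternative)
    return a[:cut] + b[cut:]

def mutate(vec, rate=0.1):
    return [random.randint(LOW, HIGH) if random.random() < rate else g
            for g in vec]

pop = [[random.randint(LOW, HIGH) for _ in range(VEC_LEN)] for _ in range(30)]
best0 = max(map(fitness, pop))
for _ in range(40):  # generations
    fits = [fitness(v) for v in pop]
    elite = max(pop, key=fitness)  # elitism: carry the best over unchanged
    pop = [elite] + [mutate(crossover(roulette(pop, fits), roulette(pop, fits)))
                     for _ in range(len(pop) - 1)]
best = max(map(fitness, pop))
print(best >= best0)  # -> True (elitism makes the best fitness monotone)
```

Repeating such a run 101 times, each run yields one (possibly different) tolerance vector, which is how the ensemble described above is populated.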
[Fig. 2. The ensemble-creation approach: candidate tolerance vectors are evaluated against the game representations (known outcomes) by a machine learning algorithm (e.g. SVM); crossover and mutation in the genetic algorithm produce new candidates, and the resulting models generate the predictions.]
We picked the machine learning algorithms, listed below, that performed well
in Fig. 1. In each algorithm, x represents a multi-dimensional vector that
describes a game, y is the desired output (win or loss), and w represents the
weights that the algorithm is searching for. The chosen algorithms are:
– Random Forest: Trees are created by utilizing a random subset of the top k
predictors at each split in the tree. A tree is a multistage decision system
in which classes are sequentially rejected until a finally accepted class is
reached. Each of the m trees in the ensemble is then used to generate a
prediction for a new sample, and the majority among these m predictions is
the forest's prediction.
– Support Vector Machine: Provided two classes that are linearly separable,
design the function
g(x) = wT x + w0 = 0,
that leaves the maximum margin from both classes. In the case of classes
that are not linearly separable, a kernel function K can be utilized, and the
decision function changes to
g(x) = Σ(i=1..l) yi αi K(x, xi).
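The kernel form of the decision function can be evaluated directly from the support vectors xi, labels yi and coefficients αi. A minimal numeric sketch (Python, with an RBF kernel and made-up support vectors; in practice these quantities come from the trained SVM, and the sign of g(x) gives the predicted class):

```python
import math

def rbf(x, z, gamma=1.0):
    """Gaussian (RBF) kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svm_decision(x, support, y, alpha, kernel=rbf):
    """g(x) = sum_i y_i * alpha_i * K(x, x_i)."""
    return sum(yi * ai * kernel(x, xi) for xi, yi, ai in zip(support, y, alpha))

# Two made-up support vectors of opposite class: a query point close to
# the positive one gets a positive score, and vice versa.
support = [(0.0, 0.0), (4.0, 4.0)]
y, alpha = [+1, -1], [1.0, 1.0]
print(svm_decision((0.1, 0.1), support, y, alpha) > 0)  # -> True
```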
We revisited the 2020 women's Australian Open by using the players' statistics
from 2019 to create the game representations. We also predicted the 2021
women's French Open (Roland Garros) and the 2021 women's US Open. For the
French Open, players' statistics from January 2021 to May 2021 were used to
create the game representation; for the US Open, players' statistics from January
2021 to August 2021 were used. In all of these predictions, the genetic algorithm
used the games from the first round of the tournament to calculate the fitness
of each candidate tolerance vector. The models in the ensemble were also created
using the games from the first round of the tournament and were then used to
predict the games from the second round up to the final. We kept the parameter
settings for the machine learning algorithms used in the ensemble the same as
shown in Fig. 1.
The results of our testing are summarized in Figs. 3, 4 and 5. Similar to what
we did in the early observation, we compared the accuracy of the ensemble's
predictions to the accuracy of predictions based only on the players' rank. In
these figures, each mentioned machine learning algorithm name refers to the 101
models created utilizing that same algorithm with the listed parameter(s). The
models were generated using the same machine learning algorithms with the
same parameters; however, the game representations from which the machine
learning algorithm generated the models were not necessarily the same. Each
game representation for each model depended on the tolerance vector from the
genetic algorithm. Note also that the fitness evaluation in the genetic algorithm
used the same machine learning algorithm, with exactly the same parameters,
that later generated the models.
The predictors used in testing were the player's rank, number of matches,
and the percentages of first serve, first serve points, second serve points, serve
points won, break points, first return points, second return points, return games
won and break points converted. Each candidate tolerance vector consisted of
11 values, each ranging from 1 to 20.
used are considered stable at that point and better reflect the players' ability.
However, more studies need to be conducted to confirm whether the Australian
Open is easier to predict using the players' statistics than the other tennis grand
slam tournaments. We had hoped to see better results for the predictions of the
US Open; we assumed that since it is the last grand slam tournament of the
year, the players' statistics should be close to accurate by that time. However,
injuries, fatigue and other non-performance factors that accumulate through the
season may make the prediction very hard.
Adding predictors such as fatigue, physical strength and mental strength,
similar to the approach of Gu and Saaty [7], could possibly increase the accuracy
of the predictions. However, it is rather complex and hard to extract these
predictors without considerable help from human expertise, and such processes
would hinder automation.
There are still other genetic algorithm setups that this paper has not yet
explored, among them using random points for crossover, a different selection
strategy instead of the roulette wheel selection that we used, and different
stopping criteria. In the ensemble setup, using models from different machine
learning algorithms, similar to [8], instead of models that come from the same
algorithm, can also be explored.
In the future, we plan to run a comparative analysis on the tolerance vectors
produced for each tennis tournament to see if some generalization can be
obtained. At the moment, we suspect that these tolerance vectors may be
tournament-specific and may also vary with time. More testing on various tennis
tournaments over multiple years will be needed to conduct such a study. Applying
this ensemble approach to other sports will also be part of our future work.
References
1. Randrianasolo, A.S., Pyeatt, L.D.: Comparing different data representations and machine learning models to predict tennis. In: Arai, K. (ed.) Advances in Information and Communication. FICC 2022. Lecture Notes in Networks and Systems, vol. 439, pp. 488–500. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-98015-3_34
2. Huang, M.-L., Li, Y.-Z.: Use of machine learning and deep learning to predict the outcomes of major league baseball matches. Appl. Sci. 11(10), 4499 (2021)
3. Candila, V., Palazzo, L.: Neural networks and betting strategies for tennis. Risks 8(3) (2020)
4. Randrianasolo, A.S., Pyeatt, L.D.: Predicting head-to-head games with a similarity metric and genetic algorithm. In: Arai, K., Bhatia, R., Kapoor, S. (eds.) FTC 2018. AISC, vol. 880, pp. 705–720. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-02686-8_53
5. Bunker, R.P., Susnjak, T.: The application of machine learning techniques for predicting results in team sport: a review. CoRR abs/1912.11762 (2019)
6. Khan, S., Kirubanand, V.B.: Comparing machine learning and ensemble learning in the field of football. Int. J. Electr. Comput. Eng. (IJECE) 9(5), 4321 (2019)
Using Genetic Algorithm to Create an Ensemble 695
7. Gu, W., Saaty, T.: Predicting the outcome of a tennis tournament: based on both data and judgments. J. Syst. Sci. Syst. Eng. 28, 317–343 (2019)
8. Wilkens, S.: Sports prediction and betting models in the machine learning age: the case of tennis. SSRN Electron. J. (2019)
9. Pretorius, A., Parry, D.A.: Human decision making and artificial intelligence: a comparison in the domain of sports prediction. In: Proceedings of the Annual Conference of the South African Institute of Computer Scientists and Information Technologists, SAICSIT '16, pp. 32:1–32:10. ACM, New York (2016)
10. Brooks, J., Kerr, M., Guttag, J.: Developing a data-driven player ranking in soccer using predictive model weights. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pp. 49–55. ACM, New York (2016)
11. Knottenbelt, W.J., Spanias, D., Madurska, A.M.: A common-opponent stochastic model for predicting the outcome of professional tennis matches. Comput. Math. Appl. 64(12), 3820–3827 (2012)
12. McHale, I., Morton, A.: A Bradley-Terry type model for forecasting tennis match results. Int. J. Forecast. 27(2), 619–630 (2011)
13. del Corral, J., Prieto-Rodríguez, J.: Are differences in ranks good predictors for grand slam tennis matches? Int. J. Forecast. 26(3), 551–563 (2010)
14. Newton, P.K., Keller, J.B.: Probability of winning at tennis I. Theory and data. Stud. Appl. Math. 114(3), 241–269 (2005)
15. Barnett, T., Clarke, S.R.: Combining player statistics to predict outcomes of tennis matches. IMA J. Manag. Math. 16(2), 113–120 (2005)
16. Klaassen, F.J.G.M., Magnus, J.R.: Forecasting the winner of a tennis match. Eur. J. Oper. Res. 148(2), 257–267 (2003)
17. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning, 1st edn. Addison-Wesley Longman Publishing Co., Inc., Boston (1989)
Towards Profitability: A Profit-Sensitive
Multinomial Logistic Regression
for Credit Scoring in Peer-to-Peer
Lending
1 Introduction
1.1 Background
1.2 Motivations
Although both credit scoring and profit scoring can help investors make decisions, they evaluate loans from totally different perspectives: credit scoring pursues lower risk, while profit scoring values higher profit. Therefore, the loans recommended by one approach may not be considered high-quality by the other. However, in many cases, the higher-risk loans are those with the greater profit potential due to their higher interest rates. Even defaulted loans could have generated profit for the lender before they turned delinquent. Hence, we propose a recommendation system that better evaluates loans for lenders by integrating the profit information into credit scoring. We hope the top loans ranked by the integrated model will be, in general, safe while more profitable.
The second motivation of our research is to design a new strategy that can handle the imbalanced P2P data. It is not hard to imagine that, typically, most loans are fully repaid in time. The default loans make up a smaller category, and among those, the defaulted but profitable loans account for an even smaller proportion. Traditionally, cost-sensitive learning is an approach to deal with the imbalance issue in classification tasks [7]. However, such a method does not consider the profit information of the loans. Inspired by the logic of cost-sensitive learning, which deals with the imbalance issue by adjusting according to the distribution of the target variable, we propose an innovative loss function for credit scoring that involves not only adjustments according to frequency, but also adjustments based on profitability. It is expected that the proposed profit-sensitive credit scoring model can deal with the imbalanced P2P data well and can bring better investment suggestions for P2P lenders.
1.3 Contributions
To handle the issue of heterogeneity in the default loans that appear as one
category in the conventional credit scoring approach, in our study, we define the
target variable with three different classes: “default and no profit”, “default but
with profit”, and “not default and with profit”. As a result, the conventional binary classification problem for credit scoring is transformed into a multi-level classification task.
As discussed earlier, these three target classes are not evenly distributed.
We then bring in the idea of cost-sensitive learning to deal with the imbalanced
data. In addition to adjusting based on the target distribution which has already
included some general profitability information, we also weight each observation
according to its own profitability. So the second approach to incorporating profit information into credit scoring is to design a new loss function that weighs loans differently according to their varying profits as well as their occurrence frequencies in real-world practice.
We call the proposed method profit-sensitive. It is expected that the proposed methodology brings the model closer to the real cases in the P2P market and thus better guides lenders in making investment decisions.
Theoretically, the proposed profit-sensitive loss could be applied to many
machine learning methodologies, including classification trees, logistic regression,
neural networks, etc. Considering that logistic regression is the benchmark model
for credit scoring, we will test the efficiency of the profit-sensitive learning in
this study by using logistic regression. To be specific, we use the binary logistic
regression as the conventional credit scoring approach to solve the traditional
binary classification problem for the P2P market and to produce the baseline
result. Then, we design a novel loss function based on the multinomial logistic
regression model to solve the multi-level classification problem.
In summary, our study makes contributions from three perspectives:
The rest of the article is organized as follows. Section 2 summarizes the exist-
ing research on P2P lending in the context of credit scoring and profit scoring.
Section 3 briefly discusses the theory of the designed profit-sensitive multino-
mial logistic regression model. Section 4 empirically examines the effectiveness
of the proposed method using the Lending Club data. Section 5 concludes with
a summary.
700 Y. Wang et al.
2 Literature Review
In P2P lending, credit scoring is conventionally formulated as a binary classifica-
tion problem, which classifies the loans into either (1) the default category if the
predicted probability of default (PD) exceeds a certain pre-defined threshold, or
(2) the non-default category otherwise. Different classifiers have been used in the
credit scoring area, including binary logistic regression [14], random forest-based
classification approach [9], and LightGBM and XGBoost methods [8]. Despite the variety of machine learning models proposed in the credit scoring area, all of them focus on reducing default risk while entirely ignoring profitability. Therefore, from the credit scoring perspective, the models suggest that lenders invest in loans with a low probability of default.
Over the past few years, many studies have changed their focus from minimiz-
ing the default risk (i.e. the credit scoring approach) to maximizing the potential
profit (i.e. the profit scoring approach), since gaining profit is the final goal of
the P2P investors. As a result, the profit scoring approach was first proposed
as an alternative to credit scoring for P2P lending in [13], wherein the authors
used IRR as the measure of the profitability of loans. They built multiple lin-
ear regression and decision tree models, indicating that the lenders can obtain a
higher IRR using profit scoring models rather than a credit scoring model. In [18],
Xia et al. pointed out that ARR is a more appropriate measure of profitability
considering the varying repayment duration of the P2P loans. They proposed a
cost-sensitive boosted tree for loan evaluation, which incorporated cost-sensitive
learning in extreme gradient boosting to enhance the capability of identifying
the potential default borrowers. Regardless of the different profit measures used,
profit scoring focuses on maximizing the profit while totally ignoring the default
risk. From the profit scoring perspective, lenders should invest in the loans with
a high predicted profit because of the high return they may bring.
Both credit scoring and profit scoring can be used as the decision tools for
evaluating loans and making investment suggestions to the lenders. However, the
two approaches work from different perspectives. The high-quality loans selected by the credit scoring approach may not be those that achieve a high profit, due to the associated low interest rate. Conversely, the high-profit loans predicted by the profit scoring approach are not always loans that go into default; there are loans paid in full that were assigned a high interest rate at the beginning. Thus, we assume that evaluating loans from the credit scoring and profit scoring perspectives together could achieve a better and more comprehensive evaluation. Our assumption is supported by a recently published article [2], which integrated credit scoring and profit scoring. To be specific, a two-stage scoring approach was proposed to recommend loans to lenders: in stage 1, the credit scoring approach was used to identify the non-default loans, and these loans were further examined in terms of IRR in stage 2. Their numerical studies indicated that the two-stage approach outperformed the existing profit scoring approaches with respect to IRR. To the best of our knowledge, this was the only study that combined credit scoring and profit scoring to evaluate P2P loans.
Profit-Sensitive Credit Scoring in P2P Lending 701
$$
p(y_i = k \mid x_i, \beta_1) =
\begin{cases}
\pi(\beta_1^T x_i) = \dfrac{\exp(\beta_1^T x_i)}{1 + \exp(\beta_1^T x_i)} & \text{if } k = 1\\[6pt]
1 - \pi(\beta_1^T x_i) = \dfrac{1}{1 + \exp(\beta_1^T x_i)} & \text{if } k = 0
\end{cases}
\tag{2}
$$
Assuming that all the observations are independent, the loss function (or cost function) L denotes the negative of LL, the log transformation of the likelihood function $\mathcal{L}$. The goal of model training is to seek the model coefficients β that minimize L, given in Eq. 3.
$$
L = -LL = -\log(\mathcal{L}) = -\log\Big(\prod_{i=1}^{N} p(y_i \mid x_i, \beta_1)\Big) = -\sum_{i=1}^{N} \log\{p(y_i \mid x_i, \beta_1)\}
\tag{3}
$$
The single loss for the ith observation can be further defined using Eq. 4.
$$
loss_i = -\log\{p(y_i = k \mid x_i, \beta_1)\} =
\begin{cases}
-\log\{\pi(\beta_1^T x_i)\} = -\log\dfrac{\exp(\beta_1^T x_i)}{1 + \exp(\beta_1^T x_i)} & \text{if } k = 1\\[6pt]
-\log\{1 - \pi(\beta_1^T x_i)\} = -\log\dfrac{1}{1 + \exp(\beta_1^T x_i)} & \text{if } k = 0
\end{cases}
\tag{4}
$$
For the multinomial case with K classes, the single loss for the ith observation generalizes to Eq. 5, and the loss function over all observations is given in Eq. 6.

$$
loss_i = -\sum_{k=1}^{K} I(y_i = k)\log\{p(y_i \mid x_i, \beta_k)\}
= -\sum_{k=1}^{K} I(y_i = k)\log\{\pi(\beta_k^T x_i)\}
= -\sum_{k=1}^{K} I(y_i = k)\log\frac{\exp(\beta_k^T x_i)}{\sum_{k'=1}^{K}\exp(\beta_{k'}^T x_i)}
\tag{5}
$$

$$
L = -LL = -\sum_{i=1}^{N}\sum_{k=1}^{K} I(y_i = k)\log\{p(y_i = k \mid x_i, \beta_k)\}
= -\sum_{i=1}^{N}\sum_{k=1}^{K} I(y_i = k)\log\frac{\exp(\beta_k^T x_i)}{\sum_{k'=1}^{K}\exp(\beta_{k'}^T x_i)}
\tag{6}
$$
$$
loss_i = -w_{i1} w_{i2} \sum_{k=1}^{K} I(y_i = k)\log\{p(y_i = k \mid x_i, \beta_k)\}
\tag{7}
$$
According to the definition of the loss for each loan, the loss function of the
entire training set is given in Eq. 8. We hope that by incorporating different
weights based on both the profit information and the frequency information into
the loss function, the proposed method can identify more “profitable” loans while
ensuring the “safeness” of the investment compared to the conventional credit
scoring method.
$$
L = \sum_{i=1}^{N} loss_i = -\sum_{i=1}^{N} w_{i1} w_{i2} \sum_{k=1}^{K} I(y_i = k)\log\{p(y_i = k \mid x_i, \beta_k)\}
\tag{8}
$$
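A minimal numerical sketch of the weighted loss in Eq. 8, assuming the predicted class probabilities are already available as a NumPy array (the function name and toy values are illustrative only, not the paper's implementation):

```python
import numpy as np

def profit_sensitive_loss(P, y, w1, w2):
    """Weighted negative log-likelihood in the spirit of Eq. 8.
    P  : (N, K) predicted class probabilities
    y  : (N,) integer class labels
    w1 : (N,) profit-based weights
    w2 : (N,) frequency-based weights
    """
    picked = P[np.arange(len(y)), y]      # p(y_i = k | x_i) for the true class
    return float(-np.sum(w1 * w2 * np.log(picked)))

# Toy example with three classes and two loans
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
y = np.array([0, 1])
w1 = np.array([1.05, 1.00])   # e.g., ARR-based profit weights
w2 = np.array([1.0, 10.0])    # e.g., frequency adjustment
loss = profit_sensitive_loss(P, y, w1, w2)
```

Setting both weight vectors to all-ones recovers the unweighted multinomial loss of Eq. 6.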
4 Empirical Study
In this section, the proposed method is applied to the Lending Club data to test
its effectiveness. The task is to use the proposed method to classify the Lending
Club loans and therefore recommend the high-quality loans to the investors.
Compared to the conventional credit scoring method, which considers the high-quality loans to be those with a low PD without regard to their profitability, it is our hope that the proposed method will target the loans that are safe while profitable.
The dataset utilized in the empirical study originates from Lending Club, one of the largest P2P platforms in the US, which provides a publicly available dataset on its official website. We analyzed 1,123,895 loans originated before August 2016, of which 219,809 (19.56%) are default loans and 904,086 (80.44%) are non-default ones. The entire feature set for each loan is
collected from three perspectives: (1) the loan related information such as loan
purpose, term of the loan, etc. (2) the credit information of the borrower, such
as the FICO score, the debt-to-income (dti) ratio, etc. and (3) other information about the borrower, such as whether or not the borrower owns their residence.
One feature worth mentioning is grade. Lending Club has rated all the loans into
seven different grades, labeled from A to G, and assigned an increasing nominal
interest rate from A to G.
The grade could act as a direct decision tool to assist lenders in making rational investment decisions; however, it is not a secure tool, since default loans still exist even in the safest grade. The variable loan status denotes the status of a loan after it expires, with 1 denoting that the loan defaulted and 0 denoting that it was fully paid. Loan status is the target variable of the traditional credit scoring approach.
Based on the raw Lending Club data, we define a target variable ARR using Eq. 1 to measure the profitability of the loans in our study. It is worth noting that the ARR calculated here using Eq. 1 is the actual ARR that occurred in real-world practice. It may differ from the theoretical ARR expected when the loan was originated, because of possible early repayments or delinquencies. Loans with ARR larger than 1 earn a profit and vice versa. The mean, median, and SD of ARR are 0.99, 1.07, and 0.25, respectively. It is surprising that, on average, it is not profitable to invest in the P2P market, as indicated by the mean ARR. Therefore, a data-driven recommendation that performs better than randomly choosing loans to invest in is essential for the lenders.
Our study starts by redefining the target variable. Instead of simply classifying the loans into two categories while ignoring the profit information, we incorporate the profitability of the default loans into the target variable and thus transform the binary classification problem into a multi-level classification problem.
To be specific, a new target variable named “Group” is created as in Eq. 9,
where Group = NoDefProf means the loan was fully paid and led to a profit,
Group = DefNoProf means the loan defaulted and made a loss to the investor,
and Group = DefProf means the loan defaulted but still generated some profit.
There is no scenario of Group = NoDefNoProf (i.e., a non-default loan without any profit), since all the principal plus some interest has been paid back if the loan is not defaulted.
$$
\text{Group} =
\begin{cases}
\text{NoDefProf} & \text{if loan status} = 0 \ \&\ ARR > 1\\
\text{DefNoProf} & \text{if loan status} = 1 \ \&\ ARR \le 1\\
\text{DefProf} & \text{if loan status} = 1 \ \&\ ARR > 1
\end{cases}
\tag{9}
$$
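The mapping in Eq. 9 is straightforward to express in code; a small sketch (the function name is ours, not from the paper):

```python
def assign_group(loan_status, arr):
    """Map (loan_status, ARR) to the three-class target of Eq. 9."""
    if loan_status == 0 and arr > 1:
        return "NoDefProf"     # fully paid and profitable
    if loan_status == 1 and arr <= 1:
        return "DefNoProf"     # defaulted with a loss
    if loan_status == 1 and arr > 1:
        return "DefProf"       # defaulted but still profitable
    # loan_status = 0 with ARR <= 1 cannot occur: a fully paid loan
    # returns all principal plus some interest
    raise ValueError("non-default loan with ARR <= 1 should not occur")

print(assign_group(0, 1.09))  # NoDefProf
print(assign_group(1, 0.60))  # DefNoProf
print(assign_group(1, 1.03))  # DefProf
```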
The distribution of the newly created outcome “Group” in the training set
(70% of the entire dataset) is given in Table 1. 80.42% of the loans are non-default
with a mean ARR of 1.09. As expected, the default category is heterogeneous:
most of the default loans have no profit, while 13,267 of them (1.69% of the training set) are profitable.
By creating the new outcome “Group”, the traditional credit scoring problem
has been transformed into a multi-level classification problem. Next, we will use
the proposed profit sensitive multinomial logistic regression method to solve
the multi-level classification problem, where “Group” is the target variable. We will further check whether the proposed method can identify more profitable loans than the traditional binary classification approach.
By applying Eq. 5, we can define the loss for the ith loan in the transformed 3-category classification problem in Eq. 10. Note that $p_{dp}$, $p_{ndp}$, and $p_{dnp}$ denote the probabilities that the ith loan belongs to the categories DefProf, NoDefProf, and DefNoProf, respectively. Similarly, $\beta_{dp}$, $\beta_{ndp}$, and $\beta_{dnp}$ are the corresponding coefficient vectors. As a result, the loss function L for the multinomial logit model applied in our scenario, which is the summation of the loss over all observations in the data, is given in Eq. 11.
As shown in Table 1, “Group” has an extremely imbalanced distribution, because the proportion of the DefProf category is much lower than that of the other two. Our initial experiments showed that, when using the loss function of multinomial logistic regression defined in Eq. 11, no loan was classified into the DefProf category, which confirms that the traditional multinomial method is not appropriate for the extremely unbalanced P2P data. One may suggest that the minority category DefProf can simply be discarded from being recommended to the investors because of its low frequency of occurrence. However, as shown in Table 1, the median ARR of the DefProf category is 1.03, which indicates that the minority category DefProf is also a class of interest when making investment suggestions.
$$
loss_i = -I(y_i = k)\log\{p(y_i \mid x_i, \beta_k)\} =
\begin{cases}
-\log(p_{dp}) = -\log\dfrac{\exp(\beta_{dp}^T x_i)}{\exp(\beta_{dp}^T x_i) + \exp(\beta_{ndp}^T x_i) + \exp(\beta_{dnp}^T x_i)} & \text{if } y_i = \text{DefProf}\\[8pt]
-\log(p_{ndp}) = -\log\dfrac{\exp(\beta_{ndp}^T x_i)}{\exp(\beta_{dp}^T x_i) + \exp(\beta_{ndp}^T x_i) + \exp(\beta_{dnp}^T x_i)} & \text{if } y_i = \text{NoDefProf}\\[8pt]
-\log(p_{dnp}) = -\log\dfrac{\exp(\beta_{dnp}^T x_i)}{\exp(\beta_{dp}^T x_i) + \exp(\beta_{ndp}^T x_i) + \exp(\beta_{dnp}^T x_i)} & \text{if } y_i = \text{DefNoProf}
\end{cases}
\tag{10}
$$

$$
L = \sum_{i=1}^{N} loss_i = -\sum_{i=1}^{N} I(y_i = \text{DefProf})\log(p_{dp}) - \sum_{i=1}^{N} I(y_i = \text{NoDefProf})\log(p_{ndp}) - \sum_{i=1}^{N} I(y_i = \text{DefNoProf})\log(p_{dnp})
\tag{11}
$$
As discussed in Sect. 3, we propose a profit-sensitive multinomial logistic
method by defining a new loss function shown in Eq. 8, in which two weight
terms are included. To make the proposed profit-sensitive model more accurate,
we further adjust the weights in Eq. 8 from two aspects as follows. First, the
value wi1 in Eq. 8 is defined as the ARR of the ith loan. As a result, each
loan is weighted differently based on its own profitability. Moreover, $w_{i1}$ is used only for the profitable loans. In other words, we add $w_{i1}$ to the loans from the categories DefProf and NoDefProf, while the loans from the category DefNoProf do not have the $w_{i1}$ term. By doing so, we take into account the real profit information of the loans, so that more profitable loans are emphasized more during the modeling process while the non-profitable loans are all treated the same. Secondly, instead of adjusting $w_{i2}$ for each of the three categories of “Group”, we only re-weight the loans belonging to the DefProf category. This is because DefProf is the category with the lowest frequency, yet it is one of the target categories of interest to the investors. By adding $w_{i2}$, we can adjust for the bias caused by the extremely low frequency of the DefProf category. Therefore, the loss for each loan i is further modified using Eq. 12. Please note that $w_{i1}$ and $w_{i2}$ are only used in the training stage of the model, not the prediction stage, so it is fine to use “posterior” information such as ARR here.
To summarize, for the categories that can generate profit, DefProf and NoDefProf, we use the real profit as the weight term (shown as $w_i$) in order to put different emphasis on the loans based on their varying profitability.
For the minority category DefProf, we add an additional weight term $w_{freq}$ to adjust for its extremely low frequency. $w_{freq}$ is a hyper-parameter that needs to be tuned during model training.
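The two adjustments just described combine into a single per-loan training weight; a minimal sketch of that rule (the function name is ours), where $w_{freq}$ would be tuned on validation data:

```python
def loan_weight(group, arr, w_freq):
    """Per-loan training weight used in the modified loss (Eq. 12)."""
    if group == "DefProf":
        return arr * w_freq   # profitable AND rare: both adjustments apply
    if group == "NoDefProf":
        return arr            # profitable: weight by the realized ARR
    return 1.0                # DefNoProf: left unweighted

# Hypothetical w_freq = 10 (the value tuned for Model 3 later in the paper)
print(loan_weight("DefProf", 1.03, 10))
```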
$$
loss_i =
\begin{cases}
-w_i w_{freq}\log(p_{dp}) = -w_i w_{freq}\log\dfrac{\exp(\beta_{dp}^T x_i)}{\exp(\beta_{dp}^T x_i) + \exp(\beta_{ndp}^T x_i) + \exp(\beta_{dnp}^T x_i)} & \text{if } y_i = \text{DefProf}\\[8pt]
-w_i\log(p_{ndp}) = -w_i\log\dfrac{\exp(\beta_{ndp}^T x_i)}{\exp(\beta_{dp}^T x_i) + \exp(\beta_{ndp}^T x_i) + \exp(\beta_{dnp}^T x_i)} & \text{if } y_i = \text{NoDefProf}\\[8pt]
-\log(p_{dnp}) = -\log\dfrac{\exp(\beta_{dnp}^T x_i)}{\exp(\beta_{dp}^T x_i) + \exp(\beta_{ndp}^T x_i) + \exp(\beta_{dnp}^T x_i)} & \text{if } y_i = \text{DefNoProf}
\end{cases}
\tag{12}
$$
Based on Eq. 12, the loss function L of the proposed profit-sensitive multinomial logit regression is finally defined in Eq. 13.
$$
\begin{aligned}
L = \sum_{i=1}^{N} loss_i
&= -\sum_{i \mid y_i = \text{DefProf}} w_i w_{freq}\log\frac{\exp(\beta_{dp}^T x_i)}{\exp(\beta_{dp}^T x_i) + \exp(\beta_{ndp}^T x_i) + \exp(\beta_{dnp}^T x_i)}\\
&\quad -\sum_{i \mid y_i = \text{NoDefProf}} w_i\log\frac{\exp(\beta_{ndp}^T x_i)}{\exp(\beta_{dp}^T x_i) + \exp(\beta_{ndp}^T x_i) + \exp(\beta_{dnp}^T x_i)}\\
&\quad -\sum_{i \mid y_i = \text{DefNoProf}} \log\frac{\exp(\beta_{dnp}^T x_i)}{\exp(\beta_{dp}^T x_i) + \exp(\beta_{ndp}^T x_i) + \exp(\beta_{dnp}^T x_i)}
\end{aligned}
\tag{13}
$$
The model is trained with the purpose of minimizing the above L. To confirm that a globally optimal solution exists, we first mathematically prove the convexity of the proposed loss function; the details are as follows.
After designing the loss function, it is critical to mathematically prove that the algorithm minimizing the loss function L in Eq. 13 converges during the training process if an appropriate learning rate is used. Otherwise, we cannot guarantee a reliable and optimal solution that minimizes the loss function. We will apply gradient descent, a widely used optimization algorithm, to minimize our loss function and find the solution for β. According to Theorem 1, stated in [15], the problem of proving convergence can be transformed into the problem of proving the convexity of the loss function.
for k iterations with a fixed step size t ≤ 1/L, it will yield a solution x^{(k)} that satisfies Eq. 14, where f(x*) is the optimal value.

$$
f(x^{(k)}) - f(x^*) \le \frac{\|x^{(0)} - x^*\|_2^2}{2tk}
\tag{14}
$$
In other words, it means that gradient descent is guaranteed to converge and
that it converges with rate O(1/k) for a convex and differentiable function.
$$
z^T \Big[\frac{\partial^2 f(x)}{(\partial x)^2}\Big] z \ge 0, \quad \forall z
\tag{15}
$$

In other words, f is convex if and only if the Hessian matrix $\frac{\partial^2 f(x)}{(\partial x)^2}$ is positive semi-definite for all x ∈ R^n.
Lemma 1 states one important property of convex functions that we will use later in our proof [5].

Lemma 1. Let f(x), g(x) be two convex functions. Then, for λ1, λ2 ≥ 0, λ1 f(x) + λ2 g(x) is also convex. In other words, a non-negative linear combination of convex functions is also convex.
We now give the proof of the convexity of the loss function L in Eq. 13.
Proof. The loss function L given in Eq. 13 can be expressed as a linear combination of the functions in Eqs. 16, 17, and 18. According to Lemma 1, to prove the convexity of L, we need to prove the convexity of these three functions. According to Definition 1, to prove the convexity of Eqs. 16, 17, and 18, we need to prove that their Hessian matrices are all positive semi-definite. Without loss of generality, we prove the convexity of Eq. 16 only; the convexity of Eqs. 17 and 18 can be obtained similarly.
$$
-w_i w_{freq}\log\frac{\exp(\beta_{dp}^T x_i)}{\exp(\beta_{dp}^T x_i) + \exp(\beta_{ndp}^T x_i) + \exp(\beta_{dnp}^T x_i)}
\tag{16}
$$

$$
-w_i\log\frac{\exp(\beta_{ndp}^T x_i)}{\exp(\beta_{dp}^T x_i) + \exp(\beta_{ndp}^T x_i) + \exp(\beta_{dnp}^T x_i)}
\tag{17}
$$

$$
-\log\frac{\exp(\beta_{dnp}^T x_i)}{\exp(\beta_{dp}^T x_i) + \exp(\beta_{ndp}^T x_i) + \exp(\beta_{dnp}^T x_i)}
\tag{18}
$$
The first derivative of the function in Eq. 16 with respect to $\beta_{dp}$ is derived in Eq. 19, and its Hessian matrix is given in Eq. 20. Eq. 21 then checks whether the Hessian matrix is positive semi-definite.
$$
\begin{aligned}
&\frac{\partial}{\partial \beta_{dp}}\Big[-w_i w_{freq}\log\frac{\exp(\beta_{dp}^T x_i)}{\exp(\beta_{dp}^T x_i) + \exp(\beta_{ndp}^T x_i) + \exp(\beta_{dnp}^T x_i)}\Big]\\
&= -w_i w_{freq}\Big[1 - \frac{\exp(\beta_{dp}^T x_i)}{\exp(\beta_{dp}^T x_i) + \exp(\beta_{ndp}^T x_i) + \exp(\beta_{dnp}^T x_i)}\Big] x_i\\
&:= -w_i w_{freq}[1 - \pi(\beta_{dp}^T x_i)]\, x_i = w_i w_{freq}[\pi(\beta_{dp}^T x_i) - 1]\, x_i
\end{aligned}
\tag{19}
$$

$$
\begin{aligned}
\mathrm{Hessian}(f(\beta_{dp})) &= \Big[\frac{\partial^2 f(\beta_{dp})}{\partial \beta_{dp}\,\partial \beta_{dp}^T}\Big]
= \frac{\partial}{\partial \beta_{dp}^T}\Big\{w_i w_{freq}[\pi(\beta_{dp}^T x_i) - 1]\, x_i\Big\}\\
&= w_i w_{freq}\,\pi(\beta_{dp}^T x_i)[1 - \pi(\beta_{dp}^T x_i)]\, x_i x_i^T
\end{aligned}
\tag{20}
$$

Then, ∀z ∈ R^p,

$$
z^T\Big[\frac{\partial^2 f(\beta_{dp})}{\partial \beta_{dp}\,\partial \beta_{dp}^T}\Big] z
= z^T\big\{w_i w_{freq}\,\pi(\beta_{dp}^T x_i)[1 - \pi(\beta_{dp}^T x_i)]\, x_i x_i^T\big\} z
= w_i w_{freq}\,\pi(\beta_{dp}^T x_i)[1 - \pi(\beta_{dp}^T x_i)](x_i^T z)^2
\tag{21}
$$
In Eq. 21, we have $w_i > 0$ and $w_{freq} > 0$, since they both denote weights on the loans in our definition of L. We also have $\pi(\beta_{dp}^T x_i) \ge 0$ and $1 - \pi(\beta_{dp}^T x_i) \ge 0$ because of the range of the softmax function. Finally, it is always true that $(x_i^T z)^2 \ge 0$ because it is the square of a scalar. Therefore, $z^T \big[\frac{\partial^2 f(\beta_{dp})}{\partial \beta_{dp}\,\partial \beta_{dp}^T}\big] z \ge 0$ holds ∀z ∈ R^p, and the Hessian matrix is positive semi-definite.
After proving the convexity of the proposed loss function L , we further articulate
the algorithm for learning the coefficients during the model training process.
Considering the large size of the training set, the mini-batch stochastic gradient
descent algorithm is used to learn the proposed multinomial logit model [4].
Algorithm 1 gives the details of the training procedure and Eqs. 22, 23 and 24
show the calculation of the partial derivatives, respectively.
$$
\frac{\partial L}{\partial \beta_{dp}} = -w\Big\{\Big[I(y_i = \text{DefProf}) - \frac{\exp(\beta_{dp}^T x_i)}{\exp(\beta_{dp}^T x_i) + \exp(\beta_{ndp}^T x_i) + \exp(\beta_{dnp}^T x_i)}\Big] x_i\Big\}
\tag{22}
$$

$$
\frac{\partial L}{\partial \beta_{ndp}} = -w\Big\{\Big[I(y_i = \text{NoDefProf}) - \frac{\exp(\beta_{ndp}^T x_i)}{\exp(\beta_{dp}^T x_i) + \exp(\beta_{ndp}^T x_i) + \exp(\beta_{dnp}^T x_i)}\Big] x_i\Big\}
\tag{23}
$$

$$
\frac{\partial L}{\partial \beta_{dnp}} = -w\Big\{\Big[I(y_i = \text{DefNoProf}) - \frac{\exp(\beta_{dnp}^T x_i)}{\exp(\beta_{dp}^T x_i) + \exp(\beta_{ndp}^T x_i) + \exp(\beta_{dnp}^T x_i)}\Big] x_i\Big\}
\tag{24}
$$

where

$$
w =
\begin{cases}
w_i w_{freq} & \text{if } y_i = \text{DefProf}\\
w_i & \text{if } y_i = \text{NoDefProf}\\
1 & \text{if } y_i = \text{DefNoProf}
\end{cases}
\tag{25}
$$
11:   $\beta_{dnp}^{k+1} \leftarrow \beta_{dnp}^{k} - \eta\,\frac{1}{|B_i|}\sum_{i \in B_i} \frac{\partial}{\partial \beta_{dnp}} L(d_i = (x_i, y_i))$
12:   $k \leftarrow k + 1$
13:  end for
14:  If converged, break
15: end for
16: Output: cost-sensitive multinomial model $\beta_{dp}^{k}$, $\beta_{ndp}^{k}$, and $\beta_{dnp}^{k}$.
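The training loop of Algorithm 1 can be sketched compactly with NumPy. This is a simplified illustration of weighted mini-batch SGD on the multinomial logit loss, not the paper's implementation (the function name, hyper-parameters, and zero initialization are our assumptions):

```python
import numpy as np

def train_profit_sensitive_mlr(X, y, w, n_classes, lr=0.1, epochs=200,
                               batch=32, seed=0):
    """Mini-batch SGD for the weighted multinomial logit loss.
    X: (N, p) features; y: (N,) labels in {0..K-1}; w: (N,) per-loan weights."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    B = np.zeros((n_classes, p))        # one coefficient vector per class
    for _ in range(epochs):
        order = rng.permutation(N)
        for start in range(0, N, batch):
            idx = order[start:start + batch]
            logits = X[idx] @ B.T
            logits -= logits.max(axis=1, keepdims=True)  # numerical stability
            P = np.exp(logits)
            P /= P.sum(axis=1, keepdims=True)            # softmax probabilities
            Y = np.eye(n_classes)[y[idx]]                # one-hot indicators
            # weighted gradient, cf. Eqs. 22-24: -w * (I - p) * x per sample
            G = -(w[idx, None] * (Y - P)).T @ X[idx] / len(idx)
            B -= lr * G
    return B
```

Class probabilities for new loans would then come from the softmax of `X @ B.T`.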
Following existing similar approaches, we use our model to identify good loans in the test dataset and then recommend them to the investors. The first step is to set the rule used to recommend loans to the investors based on the model results. In this study, we use $p(y_i = \text{DefNoProf})$, the probability that the ith loan belongs to the defaulted and no profit (DefNoProf) category, as the ranking metric for recommending loans. The reasons are as follows.
First, since both NoDefProf and DefProf relate to a “good” characteristic of the loans from the profitability perspective, it would be unfair to use one of them while discarding the other in the loan evaluation. On the other hand, DefNoProf relates to the “bad” characteristics of the loans from both the risk and the profitability perspectives. Thus, we recommend the loans with the lowest $p(y_i = \text{DefNoProf})$ to the investors.
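The ranking rule itself is a one-line sort; a tiny sketch with hypothetical probabilities (names are ours):

```python
import numpy as np

def recommend_top_loans(p_defnoprof, n=10):
    """Rank loans ascending by p(y = DefNoProf): the n loans least likely
    to end up defaulted-without-profit are recommended first."""
    return np.argsort(p_defnoprof)[:n]

probs = np.array([0.40, 0.05, 0.70, 0.12])  # hypothetical model outputs
print(recommend_top_loans(probs, n=2))      # → [1 3]
```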
To confirm that incorporating the profit information into credit modeling is beneficial for detecting more “profitable” loans, it is crucial to compare its performance with the conventional credit scoring approach (i.e., without using the profit information). Considering that the profit-sensitive multinomial logistic model is a modified variant of logistic regression, it is reasonable to use logistic regression as the benchmark model.
In addition, as discussed in Sect. 1.3, the main contributions of this study are twofold. The first is transforming the binary classification problem in traditional credit scoring into a three-level classification task, which partially achieves the goal of incorporating the profit information of the loans into credit scoring. The second is going beyond the existing cost-insensitive and cost-sensitive multi-level classification methods by proposing a novel loss function that weighs the loans based on their profitability as well as their frequency. In order to have a
comprehensive analysis and highlight the contributions, we compare the pro-
posed model (labeled as Model 6) with several cost-insensitive and cost-sensitive
logistic regression models (labeled from Model 1 to Model 5). Models 1, 2, and 3
are all binary classification problems so they all use “loan status” as the target
variable. Models 4, 5, and 6 are multi-level classification problems thus they use
“Group” as the target variable. The details of these six models are given below:
weights for class 0 (i.e., loan status = 0) and class 1 (i.e., loan status = 1) in Model 2 are 786726/(2 × 632728) = 0.62 and 786726/(2 × 153998) = 2.55, respectively.

$$
weight = \frac{n\_samples}{n\_classes \times n\_samples\_with\_class}
\tag{26}
$$
– Model 3: A profit-sensitive binary logistic regression. The implementation of
Model 3 is similar to that of the proposed model described in Sect. 4.6, except
that the weights used in Model 3 are determined based on the binary outcome
“loan status”. Similarly to Eq. 12, we use the individual ARR to weight each loan from the profitable class and an additional term $w_{freq}$ to adjust for the minority class (i.e., loan status = 1). $w_{freq}$ is the hyper-parameter tuned to maximize the cross-validated ARR; its final setting in Model 3 is 10.
– Model 4: A conventional multinomial logistic regression.
– Model 5: A cost-sensitive multinomial logistic regression, where the weight for each class is again defined according to the Heuristic method given in Eq. 26. Specifically, the weights for the classes DefProf, NoDefProf, and DefNoProf are 786726/(3 × 13268) = 19.76, 786726/(3 × 632728) = 0.41, and 786726/(3 × 140730) = 1.86, respectively.
– Model 6: The proposed profit-sensitive multinomial logistic regression model,
with the details in Sect. 4.6.
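The Heuristic class weight of Eq. 26, used by Models 2 and 5 above, can be reproduced directly; the counts below are those reported for the paper's training set:

```python
def heuristic_class_weight(n_samples, n_classes, n_samples_with_class):
    """Balanced class weight of Eq. 26."""
    return n_samples / (n_classes * n_samples_with_class)

# Binary case (Model 2): classes 0 (non-default) and 1 (default)
w0 = heuristic_class_weight(786726, 2, 632728)
w1 = heuristic_class_weight(786726, 2, 153998)
print(round(w0, 2), round(w1, 2))  # 0.62 2.55
```

The same function with n_classes = 3 reproduces the three weights quoted for Model 5.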
For the model evaluation and comparison purposes, Models 1, 2, and 3 are
compared by the predicted PD for each loan. Since a higher PD corresponds
to a “bad” characteristic, loans with a lower PD will be recommended to the
lenders. Models 4, 5, and 6 output three probabilities as discussed in Sect. 4.6:
p(yi = N oDef P rof ), p(yi = Def N oP rof ), and p(yi = Def P rof ). We would
recommend the loans with a lower p(yi = Def N oP rof ) to the investors, which
is the rule set in Sect. 4.7.
Different from previous research, which commonly uses accuracy to compare classification models, we define our own model comparison rules in this article because of the special design and purpose of the study. Considering that the main goal of this study is to detect and recommend “higher profit” loans, we use the average profitability of the top loans recommended by the six models for the model comparison.
below 1.03 while Model 6 will recommend the loan portfolio with average ARR
around 1.04.
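This comparison rule can be sketched in a few lines: rank the loans by the model's score (predicted PD for Models 1–3, p(yi = DefNoProf) for Models 4–6; lower is better) and average the realized ARR of the top-k recommendations. The loan records below are illustrative:

```python
def top_k_average_arr(loans, score, k):
    """Recommend the k loans with the lowest model score and return the
    mean realized ARR of that recommended portfolio."""
    ranked = sorted(loans, key=score)
    top = ranked[:k]
    return sum(loan["arr"] for loan in top) / len(top)

# Illustrative loans: each has a predicted p(DefNoProf) and a realized ARR.
portfolio = [
    {"id": 1, "p_defnoprof": 0.05, "arr": 1.06},
    {"id": 2, "p_defnoprof": 0.40, "arr": 0.92},
    {"id": 3, "p_defnoprof": 0.10, "arr": 1.04},
    {"id": 4, "p_defnoprof": 0.30, "arr": 0.97},
]
print(top_k_average_arr(portfolio, lambda loan: loan["p_defnoprof"], k=2))  # 1.05
```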
Fig. 1. Average ARR from the selected loans identified by the six models.
– Model 1 vs Model 2: Cost-sensitive learning (Model 2), which addresses the
imbalanced data in P2P lending for binary classification, does not show
superiority over cost-insensitive learning (Model 1). In other words,
conventional cost-sensitive learning is not the optimal option in the P2P
study. This may be because the imbalance issue is not severe in the P2P market.
– Model 2 vs Model 3: Profit-sensitive learning (Model 3) performs better than
cost-sensitive learning (Model 2) in a few cases, but not always. In other
words, the proposed profit-sensitive learning approach shows only weak
superiority in the binary classification case. It nevertheless suggests that
incorporating the profit information into the target is a useful and important
step in our proposed model.
– Model 1 vs Model 4: The similar performance of these two models
indicates that there is no benefit in identifying “more profitable” loans by
Profit-Sensitive Credit Scoring in P2P Lending 715
solely transforming the binary classification problem (Model 1) into the
multi-level classification problem (Model 4).
– Model 4 vs Model 5: Cost-insensitive learning (Model 4) even outperforms
cost-sensitive learning (Model 5) in many cases, such as when selecting the
top 9 or top 10 loans. Although the Heuristic method is the best practice for
addressing the imbalance issue, it is not the optimal solution for finding the
more profitable loans in the P2P market when the problem is structured as
a multi-level classification task. This may be because we evaluate the models
differently here, focusing only on the top loans identified. If accuracy were
used instead, different results might be observed.
– Model 5 vs Model 6: The profit-sensitive learning (Model 6) has much better
performance than the cost-sensitive learning (Model 5) in most cases.
– Model 3 vs Model 6: Both are profit-sensitive learning approaches in this
study; the only difference is that Model 3 solves the binary classification
problem while Model 6 solves the multi-level classification task. However,
Model 6 performs much better than Model 3. This is consistent with our
expectation: by transforming the binary classification problem into the
multi-level classification problem, the heterogeneity of the default loans is
further reduced. Accordingly, the model predictions are closer to the real
interest of the investors, and thus better investment suggestions can be
provided.
Table 2. Constituents of the top 18 loans selected by Models 1–6. The “Sum” column
contains the total number of loans and the number of defaulted loans in
parentheses. ARR denotes the average ARR in each grade segment.
“safety”, incorporating the profit information into credit scoring could identify
loans with a higher profitability than the credit scoring alone approach.
Table 3. Average ARR and average default rate of the top 18 loans selected by the
six models. ARR denotes the overall average ARR.
We first formulate a multi-level classification task and then define a novel loss
function for multinomial logistic regression to solve the pre-defined multi-level
classification problem. The proposed loss function aims to put different weights
on loans according to their varying profits as well as their occurrence frequen-
cies. As a result, the loans with higher profits (regardless of whether they are
the usual cases or the rare cases in the real-world practice) are given higher
weights during the model training process and they have a higher chance to be
recommended to the investors.
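The weighting idea can be sketched in a few lines of pure Python. This is an illustrative combination (the per-loan ARR as the profit term and a w_freq-style multiplier for rare classes), not the paper's exact loss function:

```python
def profit_sensitive_weights(loans, class_counts, w_freq=10.0):
    """Per-loan training weights in the spirit of the proposed loss: weight
    each loan by its individual ARR, and scale loans from rare classes by a
    frequency adjustment so profitable-but-rare cases are not drowned out.
    The exact combination here is an assumption for illustration."""
    majority = max(class_counts.values())
    weights = []
    for loan in loans:
        w = loan["arr"]                       # profit term
        if class_counts[loan["label"]] < majority:
            w *= w_freq                       # minority-class adjustment
        weights.append(w)
    return weights

loans = [
    {"label": "NoDefProf", "arr": 1.05},
    {"label": "DefProf",   "arr": 1.20},      # rare but highly profitable
    {"label": "DefNoProf", "arr": 0.80},
]
counts = {"NoDefProf": 632728, "DefProf": 13268, "DefNoProf": 140730}
print(profit_sensitive_weights(loans, counts))  # [1.05, 12.0, 8.0]
```

Weights of this kind can then be supplied to a weighted model fit, e.g., as the `sample_weight` argument of a multinomial logistic regression implementation.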
The effectiveness of the proposed methodology is validated using the real-
world Lending Club data. Results show that the proposed profit-sensitive learn-
ing approach can not only identify the “higher profit” loans but also maintain
risk control. To the best of our knowledge, our study is the first that integrates
the profit information into the traditional credit scoring approach by formulat-
ing a multi-level classification problem along with a profit-sensitive loss function.
This approach can also be applied to model any scenario that has two outcomes
– one nominal and one numerical – while there exists some trade-off between the
two outcomes.
Our work also has some limitations, one of which is that the effectiveness of
the proposed methodology is validated only on offline data. In our future
work, we plan to implement the proposed method on real-time data to provide
instant loan evaluations. We also want to compare the performance of the
proposed multinomial logistic regression on financial data from before, during,
and after the COVID-19 pandemic period, to gain insights into the impacts of
global or national crises. In addition, we plan to extend the logic of
profit-sensitive learning to other machine learning algorithms for binary or
multinomial classification problems, including but not limited to neural
networks and random forests.
718 Y. Wang et al.
Distending Function-based Data-Driven
Type2 Fuzzy Inference System
Abstract. Some challenges arise when applying the existing fuzzy type2
modeling techniques. A large number of rules are required to completely
cover the whole input space. A large number of parameters associated with
type2 membership functions have to be determined. The identified fuzzy
model is usually difficult to interpret due to the large number of rules.
Designing a fuzzy type2 controller using these models is a computationally
expensive task. To overcome these limitations, a procedure is proposed here
to identify the fuzzy type2 model directly from the data. This model is
called the Distending Function-based Fuzzy Inference System (DFIS).
The proposed procedure is used to model the altitude controller of a
quadcopter. The DFIS model performance is compared with various
fuzzy models. The performance of this controller is compared with type1
and type2 fuzzy controllers.
1 Introduction
Fuzzy theory has found numerous practical applications in the fields of engi-
neering, operational research and statistics [16,23,24]. In most cases, expert
knowledge is not available or is poorly described, so the exact description
of fuzzy rules is not an easy task. However, if the working data of the process
is available, then a data-driven design is an attractive option [1,14,21].
Fuzzy modeling involves the identification of fuzzy rules and parameter values
from the data. The data-based identification of a fuzzy model can be divided
into two parts, namely qualitative and quantitative identification. Qualitative
identification focuses on the number and description of fuzzy rules, while quan-
titative identification is concerned with the identification of parameter values.
These parameters belong to membership functions and fuzzy operators. In one
of the latest papers, Duţu et al. [7] investigated qualitative identification in
detail for Mamdani-like fuzzy systems. A parametrized rule learning technique
called Selection-Reduction was introduced. The number of rules was optimized
by dropping some rules based on a rule redundancy index. This technique is
called the Precise and Fast Fuzzy Modeling (PFFM) approach. The unique
features of this approach include the high accuracy of the trained model,
minimal time for rule generation, and better interpretability provided by the
compact rule set.
Because of the uncertainties, the membership functions are no longer certain,
i.e., the grade of the membership functions cannot be a crisp value. To overcome
this problem, type-2 membership functions (T2MF) were introduced. A T2MF
contains the footprint of uncertainty (FOU) between the upper membership
function (UMF) and the lower membership function (LMF). Interval T2FS have
been developed to reduce the computational complexity [15]. T2FS have superior
properties such as: 1) Better handling of uncertainties [17]; 2) Smooth controller
response [22]; 3) Adaptivity [22]; 4) Reduction in the number of fuzzy rules [9].
T2FS have been successfully used in control system design [3], data mining [19]
and time series prediction [8]. The design of an interval T2FS involves a type
reduction step. In this step, type2 fuzzy sets are converted to type1 fuzzy
sets. The type reduction step is performed using the so-called Karnik-Mendel
(KM) iterative algorithm [18].
This approach has some drawbacks, such as: 1) The choice of T2MFs; 2) The
computational complexity of the type reduction step; 3) Difficulties in the opti-
mization process; 4) Controller design complexity. Quite recently, several
techniques have been proposed to tackle these problems [2,11,26]. Tien-Loc Le
presented a self-evolving functional-link type2 fuzzy neural network (SEFIT2FNN)
[12]. It uses the particle swarm optimization method to adjust the learning rate
of the adaptive law. The adaptive law tunes the parameters of the type2 fuzzy
neural network. SEFIT2FNN has been shown to successfully control the antilock
braking system under various road conditions.
However, these existing approaches have some limitations. Qualitative
identification suffers from the so-called flat structure (curse of dimensionality)
problem of the rule base [19], i.e., if the number of input variables increases,
then an exponentially large number of rules is required to accurately model
the system. The
computational complexity of the quantitative part of the identified fuzzy model
also increases with the number of rules. As the number of rules increases, the
number of parameters of the T2MF and operators also grows exponentially. The
choice of T2MF and its systematic connection with the type of uncertainty are
not clear. Different type-1 membership functions can be combined to generate
T2MFs. In most cases, the interpretability of the identified fuzzy rule base is
not clear. If the number of rules grows exponentially, then for a given set of
input values, it is not possible to predict the response of the model and analyze
its performance. Although type-2 fuzzy logic systems require fewer rules com-
pared to type-1 fuzzy systems, the number of parameters is comparatively large.
So optimizing a large number of parameter values is not an easy task. Most of
the fuzzy type2 control design techniques use the type reduction step [20]. The
type reduction step is based on the KM algorithm, which is computationally
expensive.
where ν ∈ (0, 1), ε > 0, λ ∈ (1, +∞) and c ∈ R. The DF δ^(λ)_ε,ν(x − c) is
denoted by δs(x).
The values of the DF parameters (ν, ε, λ, c) may be uncertain. As a result,
these parameters can take various values around their nominal values, within
the uncertainty bound (Δ). By varying the parameter values within Δ, various
722 J. Dombi and A. Hussain
Fig. 2. T2DF with an uncertain peak value and its footprint of uncertainty (FOU)
DFs are obtained. The DF with the highest grade values is called the upper
membership function (UMF) and that with the lowest values is called the lower
membership function (LMF). The UMF, LMF and various DFs in between can
be combined to form an interval T2DF [5]. If the peak value of the DF becomes
uncertain, then it can be represented using the interval T2DF with an uncertain
’c’ value, as shown in Fig. 2.
Various T2DFs belonging to the same fuzzy variable can be combined to
form a single T2DF. The support of the resultant T2DF will be approximately
the same as the combined support of the individual T2DFs. The UMF of the
T2DF consists of the LHS and RHS (the same is true for the LMF). The LHS
and RHS are given by [5]:
δ̄²L(x − c) = 1 / (1 + ((1 − ν)/ν) · |(x − c)/ε|^λ · 1/(1 + e^(λ*(x − c)))),    (2)

δ̄²R(x − c) = 1 / (1 + ((1 − ν)/ν) · |(x − c)/ε|^λ · 1/(1 + e^(−λ*(x − c)))).    (3)
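The two sides of the UMF can be evaluated numerically as Eqs. (2)–(3) read here. This is a sketch under that reconstruction: the parameter values are illustrative, and λ* (written `lam_star`) is treated as a separate sigmoid-steepness parameter:

```python
import math

def umf_left(x, c, nu=0.8, eps=1.0, lam=2.0, lam_star=5.0):
    """LHS of the UMF, Eq. (2) as reconstructed here: a distending-function
    term damped by a sigmoid gate that is active to the left of the peak."""
    core = ((1.0 - nu) / nu) * abs((x - c) / eps) ** lam
    gate = 1.0 / (1.0 + math.exp(lam_star * (x - c)))
    return 1.0 / (1.0 + core * gate)

def umf_right(x, c, nu=0.8, eps=1.0, lam=2.0, lam_star=5.0):
    """RHS of the UMF, Eq. (3): the same form with the sigmoid mirrored."""
    core = ((1.0 - nu) / nu) * abs((x - c) / eps) ** lam
    gate = 1.0 / (1.0 + math.exp(-lam_star * (x - c)))
    return 1.0 / (1.0 + core * gate)

# Both sides reach grade 1 at the peak x = c and stay within (0, 1].
print(umf_left(0.0, 0.0))   # 1.0
print(umf_right(0.0, 0.0))  # 1.0
```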
The LHS and RHS of the UMF and LMF can be combined using the Dombi
conjunctive operator to get a single T2DF. Consider two T2DFs δ₁² and δ₂². The
LHS of δ₁² and the RHS of δ₂² can be combined using the Dombi conjunctive
operator [4]. This produces a resultant T2DF δ²result, as shown in Fig. 3.
Combining various T2DFs helps to reduce the number of fuzzy rules. This leads
to a decrease in the
computational complexity of the identified fuzzy model.
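The Dombi conjunctive operator used for this merging has the standard form of the Dombi t-norm; the formula below is stated from the operator's general definition [4], not from this paper's text:

```python
def dombi_conjunction(x, y, gamma=1.0):
    """Dombi t-norm: c(x, y) = 1 / (1 + (((1-x)/x)^g + ((1-y)/y)^g)^(1/g)).
    A zero grade dominates (result 0); a grade of 1 is the neutral element."""
    if x == 0.0 or y == 0.0:
        return 0.0
    s = ((1.0 - x) / x) ** gamma + ((1.0 - y) / y) ** gamma
    return 1.0 / (1.0 + s ** (1.0 / gamma))

print(round(dombi_conjunction(1.0, 0.7), 6))  # 0.7 (1 is neutral)
print(round(dombi_conjunction(0.5, 0.5), 6))  # 0.333333
```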
The proposed design approach is explained in the next section.
Fig. 3. Combining two T2DFs (δ₁² and δ₂²) to get a single T2DF (δ²result)
where U and V contain the l data points of each input and output variable.
Here, a1 , a2 , . . . , an are the data points belonging to the input fuzzy subsets
U1 , U2 , . . . , Un , respectively, and b1 is included in the output fuzzy subset V .
Each column of the U matrix corresponds to a unique feature (input variables)
of the process. Therefore the U matrix forms an n dimensional input feature
space. Each column of the training matrix U is normalized by transforming it
to the [0, 1] interval.
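The column-wise normalization of the training matrix U can be sketched as a min-max transformation:

```python
def normalize_columns(U):
    """Min-max normalize each column of the training matrix U to [0, 1]."""
    columns = list(zip(*U))
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0      # guard: a constant column maps to 0.0
        scaled.append([(v - lo) / span for v in col])
    return [list(row) for row in zip(*scaled)]

U = [[2.0, 10.0],
     [4.0, 30.0],
     [6.0, 20.0]]
print(normalize_columns(U))  # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```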
The fuzzy rule consists of an antecedent and a consequent part. Here, the
antecedent part contains a row of U and the consequent part is an element of
V . Therefore, a few rows from the data base matrix U are selected. These rows
and the corresponding elements in the V matrix are used to construct the rule
base. It is called the boundary-value rule base (Rb ) because it mostly contains
those values of the inputs that lie on the boundary of the input space.
In our procedure, two different surfaces are constructed. These are called the
estimated and the fuzzy surfaces. The estimated surface is constructed directly
from the database (Eq. (4)). Each selected row from the database matrix U
corresponds to a single rule. It is a row vector and it consists of unique values of
all the input variables (features). We will construct T2DFs for all input variables.
The input variables are usually measured using the feedback sensors. The Δ value
of each sensor depends on the tolerance intervals of the corresponding sensor. All
the Δ values are transformed into the [0, 1] interval to make these compatible
with the values of the input variables. T2DFs have a long tail; consequently, each
T2DF influences the other existing T2DFs. The ν value of each T2DF will be
calculated based on the principle of minimum influence on all the other T2DFs.
This influence can never be zero, but it can be decreased by a factor k. For less
influence, a large value of k should be chosen; however, from a practical point of
view, a value of 10 is sufficient. The required value of ν can be calculated using
ν = 1 / (1 + (k − 1)/d),    (5)

where d = |x_i1 − x_j1|^λ + · · · + |x_in − x_jn|^λ. Each rule is evaluated using the
Dombi operator. By applying the Dombi conjunctive/disjunctive operator over
the n input T2DFs, we get a single T2DF. This is called the output T2DF. All
these output T2DFs are superimposed in the input space to generate a fuzzy
surface G∗ .
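Reading Eq. (5) as reconstructed above, the ν of a new T2DF follows from its distance d to another peak and the damping factor k. Which neighbouring T2DF supplies the reference peak (e.g., the nearest one) is an assumption of this sketch:

```python
def nu_value(xi, xj, lam=2.0, k=10.0):
    """nu = 1 / (1 + (k - 1)/d), with d = sum_n |x_in - x_jn|^lam per Eq. (5)
    as read here. A larger k (less mutual influence between T2DFs) yields a
    smaller nu, matching the 'decreased by a factor k' principle."""
    d = sum(abs(a - b) ** lam for a, b in zip(xi, xj))
    return 1.0 / (1.0 + (k - 1.0) / d)

print(round(nu_value([0.0, 0.0], [1.0, 1.0]), 3))         # 0.182 (d = 2)
print(nu_value([0.0, 0.0], [1.0, 1.0], k=100.0) < 0.182)  # True: larger k, smaller nu
```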
An error surface E is defined as the difference between the estimated surface
G and the fuzzy surface G∗ . That is,
E(x1 , . . . , xn ) = G(x1 , . . . , xn ) − G∗ (x1 , . . . , xn ). (6)
We shall decrease the magnitude of E below a chosen threshold τE ( |E| <
τE ). This is achieved by an iterative procedure of adding new rules to Rb . To
add a new fuzzy rule, the coordinates of the maximum value on E are located.
The corresponding row in the database containing these coordinates is selected.
This row is then added to Rb as a new rule. This rule is evaluated to generate
an output T2DF. The ν value of this output T2DF is then calculated using Eq.
(5). This T2DF is superimposed in G∗ . It should be added that extracting the
type2 fuzzy model from the data is based on the DF. Therefore, we call this type2
model the DF-based fuzzy inference system (DFIS). Here, we describe a heuristic
approach used to decrease the number of rules in Rb . Rules reduction will lead
to a lower computational cost and better interpretability. Various output T2DFs
which are close to each other in the input space can be combined to get a single
T2DF (as shown in Fig. 3). The output T2DFs are segregated into different
groups. If the Euclidean distance between the peak value coordinates of various
output T2DFs is less than a predefined distance D, then these T2DFs are placed
in the same group:

D = (sum of Euclidean distances between peak-value T2DFs) / (total number of T2DFs in the same half).    (7)
Each output T2DF is obtained by applying a unique rule in Rb . The output
T2DFs in the same group are combined together to produce a single T2DF.
Consequently the rules associated with all these output T2DFs are eliminated
and replaced by a single new rule. Therefore, the number of rules in Rb
decreases; the result is called the reduced rule base Rr. Using Rr, a new fuzzy
surface is constructed, denoted by G∗r. Then a reduced error surface (Er) is
obtained using
Er (x1 , . . . , xn ) = G(x1 , . . . , xn ) − G∗r (x1 , . . . , xn ). (8)
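The iterative identification loop described above (add the database row under the maximum of the error surface until |E| < τE everywhere) can be sketched schematically; the surface callables below are toy stand-ins for the actual T2DF construction:

```python
def identify_rules(database, estimate, build_surface, tau_e=0.15):
    """Grow the rule base Rb until the error surface |E| = |G - G*| falls
    below tau_e at every data point: locate the point with the maximum
    error, add the corresponding (input row, output) pair as a new rule,
    rebuild the fuzzy surface, and repeat. `estimate` stands in for the
    estimated surface G and `build_surface` for the fuzzy surface G*."""
    rules = []
    while True:
        g_star = build_surface(rules)
        errors = [(abs(estimate(x) - g_star(x)), x, y) for x, y in database]
        worst = max(errors, key=lambda e: e[0])
        if worst[0] < tau_e:
            return rules
        rules.append((worst[1], worst[2]))  # row with max |E| becomes a rule

# Toy 1-D check: G(x) = x, and G* answers with the nearest rule's output.
data = [((i / 10,), i / 10) for i in range(11)]
def g_star_of(rules):
    if not rules:
        return lambda x: 0.0
    return lambda x: min(rules, key=lambda r: abs(r[0][0] - x[0]))[1]

rb = identify_rules(data, lambda x: x[0], g_star_of, tau_e=0.15)
print(len(rb))  # 5 rules suffice for |E| < 0.15 on this toy surface
```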
Ẋ = F(x, u) + N,    (9)

Ωr = −Ω1 + Ω2 − Ω3 + Ω4.

Here, u2, u3, u4 control the roll, pitch and yaw angles, and u1 is the total thrust
input that controls the altitude z of the quadcopter. b is the thrust coefficient,
d is the drag coefficient and Ωr is the residual angular speed.
It should be noted that this quadcopter model is used only to generate the
data.
In these simulations, we seek to model the altitude controller of the
quadcopter. A training dataset containing samples of the input and output
of the altitude controller is a requirement for applying the proposed procedure.
This requirement was satisfied by controlling the altitude of the quadcopter
using a PD controller in Matlab, which produced a dataset containing the
inputs and output of the PD controller. The proposed procedure was then used
to generate the DFIS model of the altitude controller from this input and
output dataset. With the threshold τE set to 0.15, the procedure extracted 26
rules from the dataset, and these formed the rule base Rb.
Fig. 4. Surface plot of the DFIS model. The upper surface is in blue and the lower
surface is in red. (Color figure online)
Table 1. Performance comparison of the proposed DFIS model with previously pro-
posed models
Fig. 7. Simulated altitude response of the quadcopter with various controllers (in Mat-
lab Simulink). The measurements obtained from the altitude sensor (sonar) were
corrupted with white noise.
5 Conclusion
In this study, we presented solutions to some of the limitations associated
with the existing fuzzy type2 modeling and control techniques. A procedure was
proposed to identify the type2 model directly from the data, which we called
the DFIS model. This model consists of rules and Type2 Distending Functions
(T2DFs). The whole input space is covered using a few rules, and T2DFs can
model various types of uncertainty through their parameters. A rule reduction
procedure is also proposed; it combines T2DFs in close vicinity and significantly
reduces the number of rules. Because of its low computational complexity and
design simplicity, the controller is suitable for real-time control applications.
Future work includes refinement of the procedure, further comparisons, and a
real-time implementation.
References
1. Angelov, P.P., Filev, D.P.: An approach to online identification of takagi-sugeno
fuzzy models. IEEE Trans. Syst. Man Cybern. Part B (Cybernetics) 34(1), 484–498
(2004)
2. Bhattacharyya, S., Basu, D., Konar, A., Tibarewala, D.N.: Interval type-2 fuzzy
logic based multiclass ANFIS algorithm for real-time EEG based movement control
of a robot arm. Robot. Autonom. Syst. 68, 104–115 (2015)
Distending Function-based Data-Driven Type2 Fuzzy Inference System 729
3. Castillo, O., Melin, P.: A review on interval type-2 fuzzy logic applications in
intelligent control. Inf. Sci. 279, 615–631 (2014)
4. Dombi, J.: A general class of fuzzy operators, the demorgan class of fuzzy operators
and fuzziness measures induced by fuzzy operators. Fuzzy Sets Syst. 8(2), 149–163
(1982)
5. Dombi, J., Hussain, A.: Interval type-2 fuzzy control using distending function. In:
Fuzzy Systems and Data Mining V: Proceedings of FSDM 2019, pp. 705–714. IOS
Press (2019)
6. Dombi, J., Hussain, A.: A new approach to fuzzy control using the distending
function. J. Process Control 86, 16–29 (2020)
7. Duţu, L.-C., Mauris, G., Bolon, P.: A fast and accurate rule-base generation
method for mamdani fuzzy systems. IEEE Trans. Fuzzy Syst. 26(2), 715–733
(2017)
8. Gaxiola, F., Melin, P., Valdez, F., Castillo, O.: Interval type-2 fuzzy weight adjust-
ment for backpropagation neural networks with application in time series predic-
tion. Inf. Sci. 260, 1–14 (2014)
9. Hagras, H.: Type-2 flcs: a new generation of fuzzy controllers. IEEE Comput. Intell.
Mag. 2(1), 30–43 (2007)
10. Mathworks Matlab hardware team. Parrot Drone Support from MATLAB. https://
www.mathworks.com/hardware-support/parrot-drone-matlab.html. Accessed 11
Mar 2020
11. Hassani, H., Zarei, J.: Interval type-2 fuzzy logic controller design for the speed
control of dc motors. Syst. Sci. Control Eng. 3(1), 266–273 (2015)
12. Le, T.-L.: Intelligent fuzzy controller design for antilock braking systems. J. Intell.
Fuzzy Syst. 36(4), 3303–3315 (2019)
13. Le, T.L., Quynh, N.V., Long, N.K., Hong, S.K.: Multilayer interval type-2 fuzzy
controller design for quadcopter unmanned aerial vehicles using jaya algorithm.
IEEE Access 8, 181246–181257 (2020)
14. Li, C., Zhou, J., Chang, L., Huang, Z., Zhang, Y.: T-s fuzzy model identification
based on a novel hyperplane-shaped membership function. IEEE Trans. Fuzzy Syst.
25(5), 1364–1370 (2017)
15. Liang, Q., Mendel, J.M.: Interval type-2 fuzzy logic systems: theory and design.
IEEE Trans. Fuzzy Syst. 8(5), 535–550 (2000)
16. Mahfouf, M., Abbod, M.F., Linkens, D.A.: A survey of fuzzy logic monitoring and
control utilisation in medicine. Artif. Intell. Med. 21(1–3), 27–42 (2001)
17. Mendel, J.M.: Computing with words: zadeh, turing, popper and occam. IEEE
Comput. Intell. Mag. 2(4), 10–17 (2007)
18. Mendel, J.M.: Uncertain Rule-Based Fuzzy Logic Systems: Introduction and New
Directions. Prentice Hall, USA, pp. 25–200 (2000)
19. Niewiadomski, A.: A type-2 fuzzy approach to linguistic summarization of data.
IEEE Trans. Fuzzy Syst. 16(1), 198–212 (2008)
20. Tai, K., El-Sayed, A.R., Biglarbegian, M., Gonzalez, C.I., Castillo, O., Mahmud, S.:
Review of recent type-2 fuzzy controller applications. Algorithms 9(2), 39 (2016)
21. Tsai, S.-H., Chen, Y.-W.: A novel identification method for takagi-sugeno fuzzy
model. Fuzzy Sets Syst. 338, 117–135 (2018)
22. Wu, D., Tan, W.W.: Genetic learning and performance evaluation of interval type-2
fuzzy logic controllers. Eng. Appl. Artif. Intell. 19(8), 829–841 (2006)
23. Yager, R.R., Zadeh, L.A.: An Introduction to Fuzzy Logic Applications in Intelli-
gent Systems, vol. 165. Springer, New York (2012)
24. Yu, L., Zhang, Y.-Q.: Evolutionary fuzzy neural networks for hybrid financial
prediction. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 35(2), 244–249 (2005)
25. Zadeh, L.A.: Outline of a new approach to the analysis of complex systems and
decision processes. IEEE Trans. Syst. Man Cybern. SMC-3(1), 28–44 (1973)
26. Zheng, J., Wenli, D., Nascu, I., Zhu, Y., Zhong, W.: An interval type-2 fuzzy
controller based on data-driven parameters extraction for cement calciner process.
IEEE Access 8, 61775–61789 (2020)
Vsimgen: A Proposal for an Interactive
Visualization Tool for Simulation
of Production Planning and Control
Strategies
1 Introduction
Market scenarios are changeable due to product complexity, demand variation,
and competitiveness, and manufacturing companies require improved logistic
performance that optimizes the balance between cost and customer service.
High-quality decisions relating to production planning and control (PPC) strate-
gies, parameterization of selected PPC attributes, and capacity investment are
essential to improve the company’s logistic performance. PPC’s key function-
alities include planning material requirements, demand management, capacity
planning, and sequencing and scheduling jobs to meet manufacturing compa-
nies’ production-related challenges. Appropriate PPC strategies are dependent
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 731–752, 2023.
https://doi.org/10.1007/978-3-031-18461-1_48
732 S. Tripathi et al.
The advantage of the simgen model is the practical applicability of the mod-
els due to the required input parameters, which are selected from the Enterprise
Resource Planning (ERP) system’s data and are processed and stored in its
database. These sets of parameters are known as master data parameters. The
master data define the simulation model by creating a production system struc-
ture.
The input parameters defined from master data for discrete event simula-
tions are the bill of materials (BOM), routing sequence of materials, qualifica-
tion matrix, production planning parameters for each item, shift calendars, skill
groups, total available employees, production program, the expected forecasts
Network Visualization and Analysis 733
of the final items and the customer’s demand in terms of order size, and the
customer’s expected lead time. The combined BOM and material routing table,
named WS Master, contains three attributes:
– Parent. Can have one or more child items. End products are always parents,
and product sub-assemblies can be a parent or child.
– Child. Material variants and product sub-assemblies that are required to build
a parent item.
– Machine/workstation group. The machines (assigned machine IDs) or group
of employees working at the workstations (workstation IDs) assembling parent
items or producing the items.
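The WS Master structure can be sketched as a table of (parent, child, workstation group) triples with a helper for multi-level BOM explosion; the item and workstation names below are assumptions for illustration:

```python
# Illustrative WS Master rows as (parent, child, workstation_group) triples.
ws_master = [
    ("EndProductA",  "SubAssembly1", "WS-10"),
    ("EndProductA",  "SubAssembly2", "WS-11"),
    ("SubAssembly1", "MaterialX",    "WS-20"),
    ("SubAssembly1", "MaterialY",    "WS-20"),
]

def children_of(parent, rows):
    """All (child, workstation) pairs required to build `parent`."""
    return [(c, ws) for p, c, ws in rows if p == parent]

def explode_bom(parent, rows):
    """Recursively flatten the multi-level BOM below `parent`."""
    flat = []
    for child, ws in children_of(parent, rows):
        flat.append((child, ws))
        flat.extend(explode_bom(child, rows))
    return flat

print(explode_bom("EndProductA", ws_master))
# [('SubAssembly1', 'WS-10'), ('MaterialX', 'WS-20'),
#  ('MaterialY', 'WS-20'), ('SubAssembly2', 'WS-11')]
```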
The transaction data define the second set of parameters. They are used
to characterize probability distributions and estimate their respective
parameters. The estimated distribution parameters are used to randomly initialize
processing, setup, sales data variables, repair time, delivery time, and produc-
tion planning variables. The parameter selections and experimental design are
then applied to discrete simulations for various PPC scenarios. The discrete
simulation results are validated and compared with previous years’ real-world
business outcome data, for example, previous years’ real-world inventory, work
in progress, and service level data for a manufacturing company. Further, the
results are analyzed by business experts for managerial insights related to various
production scenarios.
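The distribution-fitting step can be sketched with the standard library alone: estimate the parameters of a transaction-data series and draw random variates to initialize a simulation run. The normal distribution and the data values are illustrative choices; simgen may fit other families:

```python
import random
import statistics

def fit_and_sample(history, n, seed=42):
    """Estimate mean/stdev of a transaction-data series (e.g., processing
    times) and draw n variates for simulation initialization. Negative
    draws are clipped at zero, since times cannot be negative."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    rng = random.Random(seed)
    return [max(0.0, rng.gauss(mu, sigma)) for _ in range(n)]

processing_times = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]  # minutes, illustrative
samples = fit_and_sample(processing_times, n=5)
print(len(samples), all(s >= 0.0 for s in samples))  # 5 True
```

Seeding the generator keeps simulation runs reproducible across experimental designs.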
The PPC simulation parameterization follows three steps:
The preparatory steps before discrete simulation are not trivial tasks; expert
insight is required for data preprocessing and for selecting BOM data,
representative materials, other relevant parameters, and experimental designs.
The initial
preparation requires the collaborative effort of business experts from different
industrial-production domains within a company. Therefore, a platform with
visual and interactive functionality can help experts from other domains to share,
exchange, and collaborate in an understandable and straightforward manner.
We propose the development of an interactive platform with visual assistance
that allows business experts to prepare, select, and validate a company's
real-world production data, parameters, and production activities, producing a
simplified version of the data and parameters that is suitable for
the requirements of a discrete simulation intended to provide optimal solutions
to various PPC challenges. The interactive platform would allow network-based
approaches to visualize and analyze data systematically. Further, the platform
should facilitate multiple users for interactive analysis of the results of discrete
simulation by the collaborating team members.
The structure of this paper is as follows. The second section briefly outlines
objectives, tasks, and network-based approaches for data exploration and tech-
nical implementation. The third section discusses a case study of ERP data used
for the sheet metal processing industry, including preprocessing, network con-
struction methods, and step-by-step network exploration methods for preparing,
selecting, and analyzing production data. Finally, Sect. 4 provides concluding
remarks about our analysis and future work approach.
2 Methods
In this section, we highlight the main objectives for the interactive visualization
platform and point out some of the tasks necessary to achieve these objectives.
We also provide a brief description of the key aspects of its network visualization
and exploration functionality, and basic details of technical implementation.
2.1 Objectives
Our proposal is for an interactive platform for visual modeling of discrete simula-
tion for various PPC objectives related to optimizing logistic performance in indus-
trial manufacturing. Visual modeling of discrete simulation for PPC will facilitate
a higher level of abstraction so that domain experts can develop various discrete
simulation scenarios without requiring detailed technical knowledge.
This interactive platform would have data visualization features based on immer-
sive visualization technology that would allow different users, acting collabora-
tively, to execute various discrete simulation steps to optimize PPC issues.
The platform would allow users to perform various parameter selection tasks,
experimental design, validation of simulation outcomes with state-of-the-art data
visualization techniques, data analytics, and immersive and collaborative visual-
ization techniques. A schematic diagram of the platform is shown in Fig. 1. Our
proposed platform has the following objectives:
Network Visualization and Analysis 735
[Figure: simulation inputs and comparison options. Inputs: 1. BOM; 2. routing data including setup and processing time; 3. production planning parameters; 4. available capacity; 5. skill groups and number of employees; 6. production program and forecast for end items; 7. customer demand and order amount size; 8. customer-required lead times. Comparison: simulation results versus previous business years, including personnel and qualification scenarios, set-up optimizations, and lot-sizing policies, for deeper business understanding.]
Fig. 1. Schematic diagram of the visual platform for discrete data simulation and
analysis
– Preparing the production data and parameters in a format that is easily
readable and can be loaded into various analysis tools and platforms for further
analysis and discrete simulation.
– Allowing evaluation metrics and useful means of visualization to help business
experts evaluate, compare, and understand the simulation results in terms of
the outcomes in practical scenarios.
– A user-friendly interface with various interactive options both for initial data
preparation and parameters that connects with the discrete simulation mod-
ule for execution, and for systematic presentation of the results of the simu-
lation.
2.2 Tasks
To meet the challenges of developing the interactive visual platform, we divide
the necessary tasks into the following steps:
– Obtaining relevant data, including BOM, routing data, and other required
information, to construct a practical example from a company’s ERP data.
– Defining common standards for preprocessing, cleaning, and integrating data
for the approach to representative material selection.
– Developing construction and visualization algorithms for various types of
networks, such as workstation networks, BOM networks, bipartite networks
(between materials and workstations), and multilevel networks (which are a
combination of workstation and BOM networks).
– Implementing a representative material selection algorithm by using cluster-
ing and community detection algorithms on the data selected by users for
identifying different groups of materials based on material routing data.
– Implementing an immersive visualization platform with interactive features
such as options for preprocessing data, prior selection of materials, and work-
stations for clustering materials using a representative material selection algo-
rithm. Related tasks are implementing necessary visual options for parame-
terizing the discrete simulation model and results evaluation functions for
output results.
– Describing case studies for the interactive visualization of various user activ-
ities related to preprocessing, selecting, editing data for discrete simulation,
and validation outlines.
– User evaluation and testing of the visual platform.
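The representative-material-selection task above could, for instance, cluster materials by the overlap of their routing data. A minimal sketch, under assumptions: material IDs and routing sets are hypothetical, and Jaccard distance with average-linkage hierarchical clustering stands in for whichever clustering or community-detection method the platform finally adopts.

```python
from itertools import combinations

from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical routing data: the set of workstations each material visits.
routes = {
    "M1": {"W1", "W2", "W3"}, "M2": {"W1", "W2", "W3"},
    "M3": {"W4", "W5"},       "M4": {"W4", "W5", "W6"},
}
materials = sorted(routes)

def jaccard_distance(a: set, b: set) -> float:
    """1 - |a & b| / |a | b|: 0 for identical routings, 1 for disjoint ones."""
    return 1.0 - len(a & b) / len(a | b)

# Condensed pairwise distance vector over all material pairs.
condensed = [jaccard_distance(routes[u], routes[v])
             for u, v in combinations(materials, 2)]

# Average-linkage clustering, cut into two material groups.
tree = linkage(condensed, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
groups = dict(zip(materials, labels))
```

A representative material would then be picked from each group, e.g. the member with the richest routing sequence.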
respond based on the user’s query, performed interactively, and include various
functionalities such as collapsing of nodes and edges, adding new nodes and
edges, selection of subgraphs, expansion of abstract views of graphs, separate
visualization of modules, and multilayer visualization by combining BOM net-
works and workstation routing networks [43,46]. The materials and BOM details
of individual products and machines, with historical data, would be added to
the interactive visualization platform. We would also provide various structural
properties relevant to the BOM networks and workstation networks, such as
clustering coefficients, PageRank, degree distribution, and node and edge entropies,
that correlate with various emerging characteristics of the different types of net-
works that influence PPC objectives [10,13,47]. Various graph-related distance
measures would allow users to compare the constructed networks of BOM, work-
stations, and multilayer networks for different simulation scenarios.
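As an illustration, one such distance, the graph edit distance, can be computed for two small, hypothetical routing scenarios with networkx; this is a sketch only, and the measures actually offered by the platform may differ.

```python
import networkx as nx

# Two hypothetical routing scenarios over the same workstations.
scenario_a = nx.DiGraph([("W1", "W2"), ("W2", "W3"), ("W3", "W4")])
scenario_b = nx.DiGraph([("W1", "W2"), ("W2", "W4")])  # one station bypassed

# Minimum number of node/edge insertions, deletions, and substitutions
# needed to transform one network into the other.
ged = nx.graph_edit_distance(scenario_a, scenario_b)
```

Exact graph edit distance is exponential in graph size, so for the full workstation networks an approximate variant would likely be needed.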
The initial step of network exploration is to address redundant and outdated
information in the ERP data that is not useful for PPC optimization of current
production orders. For example, the routing sequences of a particular product or
group of products may need to be reassigned, or some products may not be
required for the new simulation scenarios; in these cases, the routing sequences
and product information (BOM structure) are no longer relevant for the new
scenarios. In other cases, newly customized products may require the allocation
of new routing sequences for production, or the routing sequences at various
steps of material assembly may need to change according to the products'
complexity. In all these cases, experts should utilize their domain understanding
to remove unwanted materials and their allocated routing sequences. Experts
should add new routing sequences and allocate the expected time at each work-
station for materials used for new customized products. These activities are per-
formed by visualizing and exploring the bipartite graph, workstation network, or
the multilevel BOM network and workstation networks. When the initial step is
completed, the next step is selecting representative materials using a clustering
algorithm based on different product features. The clustering solution provides
multiple groups of materials and workstation networks (routings). The groups
obtained by clustering correspond to various product types and workstation
categories of production processes, depending on the complexity, delivery time,
available resources, and other product and workstation network features. The
users and domain experts need to work collaboratively to select representative
materials from the different groups and to reorganize the materials' routing
sequences by interacting with the networks' active components (nodes, edges).
The interactive operations include selecting and merging workstations (node
merging) and deleting nodes (if workstations are not functional or required).
The users and domain experts can perform various tasks collaboratively, such as:
other devices used for XVA [33]. However, new hardware advancements also
add opportunities and challenges regarding visualization and interaction tech-
niques, especially considering analyses and interactions using different devices
along the RV continuum. Particularly for XVA, collaborative features between
multiple users, potentially from different domains with different levels of knowl-
edge or backgrounds, and their use cases and scenarios, must be researched.
More recently, collaborative systems have been introduced that integrate cross-
device functionality, such as the Dataspace system by Cavallo et al. [5], where
large screens are combined with augmented reality for a shared analysis experi-
ence of multivariate data.
For a practical implementation of Vsimgen using XVA, we will explore graph
and network analysis methods which are extensively used in the production and
supply chain domain [6,28]. Graph and network analysis is concerned with
the visualization of and interaction with complex networks for analysis purposes,
where nodes represent entities and edges the relationships between them. The
first step for the analysis of complex networks is typically visualization to dis-
cover patterns, generate new knowledge, and provide an interpretation of various
higher-level emergent properties of the system [38]. Research shows that immer-
sive visualization technologies can foster an interactive and efficient visualization
of networks [23,27,34]. Immersive technology, such as AR and VR HMDs, can
help to interactively visualize and modify graphs, which, until recently, have
been under-utilized due to technical and complexity reasons. Further, the extension
of graph visualization and interaction to collaboration capabilities has yet to
be researched more thoroughly. For example, Cordeil et al. [8] utilize collaborative
visualization of graphs, aiming to find patterns, structures, and complex
characteristics in a collaborative way. We believe XVA can be applied for the
following techniques in graph and network analysis:
loaded into the AR/VR application deployed on an HMD, which would parse
and visualize the file’s content in a three-dimensional immersive manner. Further
interactions would then be accomplished using hand/finger gestures or coupled
controllers containing buttons and sliders to control the graph environment.
3 Case Study
3.1 Data
We use real-world manufacturing data for sheet metal processing. In the initial
phase, the data are exported from the ERP system relating to BOM, routing
data with processing time at each workstation, and other production planning
parameters required for the discrete simulation. The BOM data contain material
IDs (unique), sub-assembly IDs, and the end products and lot size policy for each
material. Lot size policies can be fixed order period (FOP), fixed order quantity
(FOQ), or consumption based (CB).
The routing data contain material IDs, workstation ID (unique), expected
time spent at the corresponding workstation, and operation sequence numbers
defined by integer values. There are multiple rows in routing data for individual
material IDs with different sequence numbers, representing the complete routing
sequence of the material.
The BOM data and routing sequence data are integrated by joining both
tables using material ID as the primary key. The joined table is called a master
table. An example master table is shown in Table 1.
Table 1. Example master data table joining BOM data and routing sequence data
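The join described here can be sketched with pandas; column names and values below are hypothetical stand-ins for the actual ERP export.

```python
import pandas as pd

# Hypothetical BOM export: one row per material.
bom = pd.DataFrame({
    "material_id": ["M01", "M02", "SA1"],
    "sub_assembly_id": ["SA1", "SA1", None],
    "end_item": ["A", "A", "A"],
    "lot_policy": ["FOQ", "FOP", "CB"],  # fixed order quantity/period, consumption based
})

# Hypothetical routing export: one row per (material, operation) pair.
routing = pd.DataFrame({
    "material_id": ["M01", "M01", "M02", "M02"],
    "workstation": ["W1", "W2", "W2", "W1"],
    "sequence": [10, 20, 10, 20],
    "expected_time_min": [4.5, 2.0, 3.0, 1.5],
})

# Join on the material ID (the primary key) to obtain the master table.
master = routing.merge(bom, on="material_id", how="left")
```

A left join keeps every routing operation and attaches the matching BOM attributes to each row.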
From the master data set, the bipartite network, G_bipartite, is constructed using
data from the Material ID and Workstation columns. The BOM network,
G_BOM, is constructed by combining the End item, Sub assembly ID, and Material
ID columns. The workstation network is constructed using the Workstation
column. Examples of the networks constructed from the master data are shown
in Fig. 2.
[Fig. 2: Example networks constructed from the master data: the routing sequences of materials M01 and M02, the BOM structure (end item A, sub-assembly SA1, materials M01 and M02), the bipartite material-workstation network, and the workstation network over W1-W4.]
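The three constructions can be sketched with networkx over a hypothetical master table (the column names are assumptions, not the actual ERP schema):

```python
import networkx as nx
import pandas as pd

# Hypothetical master-table rows (material, workstation, routing sequence, BOM parent).
master = pd.DataFrame({
    "material_id": ["M01", "M01", "M01", "M02", "M02", "M02"],
    "workstation": ["W1", "W2", "W3", "W2", "W1", "W4"],
    "sequence": [10, 20, 30, 10, 20, 30],
    "parent_id": ["SA1", "SA1", "SA1", "SA1", "SA1", "SA1"],
})

# Bipartite network: material -- workstation if the material is processed there.
G_bipartite = nx.Graph(list(zip(master["material_id"], master["workstation"])))

# Workstation network: consecutive workstations within each routing sequence.
G_ws = nx.DiGraph()
for _, grp in master.sort_values("sequence").groupby("material_id"):
    stations = list(grp["workstation"])
    G_ws.add_edges_from(zip(stations, stations[1:]))

# BOM network: material -> parent (sub-assembly or end item).
G_bom = nx.DiGraph(master[["material_id", "parent_id"]]
                   .drop_duplicates().itertuples(index=False, name=None))
```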
In the first step, we load the required ERP data, perform preprocessing, and
create a master table; the complete data comprise 400,000 rows. The initial data
visualization is performed by creating a bipartite network containing 124
workstation details and 28,500 materials. However, the bipartite network ignores the
routing details of the materials in the first step. Here, we only consider the rela-
tionship between a workstation and material if the material is processed in that
workstation. Users can select materials and workstation-related network con-
struction options. For example, the user can choose to construct only those rela-
tionships for materials processed through at least three workstations. An exam-
ple of a bipartite graph is shown in Fig. 3. We applied a multilevel-community
detection algorithm [3] and replaced each module with a weighted node and their
respective connections with a single weighted edge.
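This abstraction step can be sketched with the Louvain implementation available in networkx, applied to a toy bipartite graph with hypothetical node names:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Toy bipartite graph between materials (m*) and workstations (w*).
B = nx.Graph([("m1", "w1"), ("m2", "w1"), ("m3", "w2"), ("m4", "w2"),
              ("m5", "w3"), ("m6", "w3"), ("m6", "w4"), ("m5", "w4")])

# Multilevel (Louvain) community detection [3].
modules = louvain_communities(B, seed=42)
module_of = {n: i for i, part in enumerate(modules) for n in part}

# Abstract graph: one weighted node per module, one weighted edge per module pair.
G_abs = nx.Graph()
for i, part in enumerate(modules):
    G_abs.add_node(i, weight=len(part))
for u, v in B.edges():
    i, j = module_of[u], module_of[v]
    if i != j:
        w = G_abs.edges[i, j]["weight"] if G_abs.has_edge(i, j) else 0
        G_abs.add_edge(i, j, weight=w + 1)
```

Each abstract node carries its module size as a weight, and each abstract edge counts the material-workstation links crossing the two modules.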
In Fig. 3, the module detection algorithm estimates 8 modules of the bipartite
graph, G = (U, V, E), with materials V = V_1 ∪ … ∪ V_8 and workstations
U = U_1 ∪ … ∪ U_8. An abstract bipartite graph, G_abs = (U_abs, V_abs, E_abs),
is drawn by replacing each V_i and U_i with a single node; an edge is drawn
between abstract nodes i and j if there is at least one edge uv ∈ E with u ∈ V_i
and v ∈ U_j. The edge widths show the total number of materials connected with
workstations in the respective modules. The vertex weights correlate with the
total numbers of materials and workstations.
[Fig. 3: Abstract bipartite graph with material modules M1-M8 connected to workstation modules W1-W8.]
The vertices and edges serve as active components in the interactive visualization
and provide various options to explore details through event-driven functions
attached to nodes and the connections between them. The abstract visualization
fulfills two initial tasks: first, it reduces complexity; second, it is used
to select specific modules for further analysis and master-data preparation for
discrete simulation. The next step is to explore individual modules and their
connections, which can be shown by selecting a workstation or material module
for further exploration of the data. We provide examples of selecting a workstation
module in Fig. 4a and a material module in Fig. 4b. Each connected pair of material
and workstation modules, represented by an edge, can be further explored for the
details of individual materials and workstations. An example is shown in Fig. 5.
In this example we visualize the workstation (Fig. 5a) and material (Fig. 5b) at
the center and the connected materials and workstations are arranged in circular
layouts.
[Fig. 4: (a) Selecting a workstation module; (b) selecting a material module.]
[Fig. 5: (a) Workstation W5 at the center with its connected materials arranged in a circular layout; (b) material M5 at the center with its connected workstations arranged in a circular layout.]
Fig. 6. (a) Workstation network aggregating all routing sequences of different materi-
als; (b) Showing the routing of a single material in the network.
The BOM network, G_BOM, is composed of various BOM structures. The com-
bined network allows users to design products, analyze inventory details, and
perform PPC analysis. A user can create and explore new products by inter-
actively visualizing the BOM network, selecting representative materials for
discrete simulation. They can examine BOM structure similarity by compari-
son of graphs to find clusters of BOM that have common sub-assemblies and
materials. BOM trees can be compared based on various graph similarity and
distance measures, such as graph edit distance [4], DeltaCon [25],
and vertex-edge overlap [31], that would be provided in the visualization plat-
form. An example visualization of a BOM hierarchy and the aggregation is pre-
sented in Fig. 8. Figure 8a, 8b, and 8c are the individual BOM structures for
products P 1, P 2, and P 3, respectively, and these are aggregated in Fig. 8d.
The interactive visualization would allow users to visualize individual BOMs
or aggregated BOMs and explore various information utilizing networks’ topo-
logical properties. The visualization would also provide the two-level network
G = (V^ws ∪ V^BOM, E^ws ∪ E^BOM ∪ E^ws,BOM) for selecting and using BOM and
routing data for discrete simulation.
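Of the BOM-comparison measures mentioned above, vertex-edge overlap has a particularly compact definition [31]; a sketch over two hypothetical BOM trees sharing a sub-assembly:

```python
import networkx as nx

def vertex_edge_overlap(g1: nx.DiGraph, g2: nx.DiGraph) -> float:
    """VEO similarity [31]: 1.0 for identical graphs, 0.0 for disjoint ones."""
    shared = len(set(g1.nodes) & set(g2.nodes)) + len(set(g1.edges) & set(g2.edges))
    total = (g1.number_of_nodes() + g2.number_of_nodes()
             + g1.number_of_edges() + g2.number_of_edges())
    return 2.0 * shared / total if total else 1.0

# Hypothetical BOM trees for products P1 and P2 sharing sub-assembly SA1.
bom_p1 = nx.DiGraph([("P1", "SA1"), ("SA1", "M1"), ("SA1", "M2")])
bom_p2 = nx.DiGraph([("P2", "SA1"), ("SA1", "M1"), ("SA1", "M3")])

similarity = vertex_edge_overlap(bom_p1, bom_p2)  # shared: SA1, M1, edge SA1->M1
```

Unlike graph edit distance, VEO is linear in graph size, which makes it attractive for comparing many BOM trees interactively.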
Fig. 8. (a), (b), and (c) are individual BOM structures; (d) is an aggregated BOM
network visualized in a layered layout.
and then randomly pick a material from each group and construct G_msub(M) for
M = {5, 25, 45, …, 1305}, repeating the selection 20 times for each M_i. The GED
is shown in Fig. 9b. As we increase the number of clusters from which materials
are randomly selected, the GED approaches 0, i.e., GED ∼ 0. This result validates
the clustering solution, which groups materials based on the overlap of their
routing sequences. The representative materials in each group can be selected based on domain
knowledge and users’ prior understanding. A user can also select several scenar-
ios for discrete simulation by selecting different sets of representative materials
considering the complexity of BOM structures and workstation routing. The net-
work visualization and analysis functionalities of graphs and interactive features
in the visualization tool would provide users with an efficient way to deal with
large numbers of materials and high complexity of workstation routing.
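The validation loop described above can be sketched as follows; this is a drastically reduced toy version, with six hypothetical materials instead of thousands, five repetitions instead of twenty, and random material picks standing in for one pick per cluster.

```python
import random

import networkx as nx

# Hypothetical routing sequences per material.
routing = {"M1": ["W1", "W2", "W3"], "M2": ["W1", "W2"], "M3": ["W2", "W3"],
           "M4": ["W3", "W4"], "M5": ["W1", "W4"], "M6": ["W2", "W4"]}

def ws_network(materials):
    """Workstation network aggregating the routing sequences of the given materials."""
    g = nx.DiGraph()
    for m in materials:
        seq = routing[m]
        g.add_edges_from(zip(seq, seq[1:]))
    return g

G_ws = ws_network(routing)  # full network over all materials

random.seed(1)
for n_materials in (2, 4, 6):
    geds = [nx.graph_edit_distance(
                ws_network(random.sample(list(routing), n_materials)), G_ws)
            for _ in range(5)]
    # GED shrinks toward 0 as the sample covers more of the routing structure.
```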
Fig. 9. (a) Missing edges in the workstation network by comparing G_msub(M) with
G_ws; (b) GED, comparing G_ws with G_msub(M) by randomly selecting materials, M,
from different groups.
4 Summary
In this paper, we discussed the importance of optimized PPC strategies for
improved logistic performance in customized production. We also discussed some
of the challenges in the discrete-simulation approach to analyzing PPC deci-
sions, which mainly relate to data preparation, representative material selection,
and model parameterization. Further, we presented example visualizations for
interactive visualization technology for BOM networks, workstation networks,
and bipartite networks proposed for the systematic exploration of the master
data. The initial setup for a discrete event simulation requires a collaborative
effort between domain experts managing various stages of PPC strategies. To
enable efficient collaboration between experts for optimizing PPC, we have
proposed an interactive platform with state-of-the-art visualization technology
and network-based analysis functionality.
References
1. Altendorfer, K., Felberbauer, T., Jodlbauer, H.: Effects of forecast errors on optimal
utilisation in aggregate production planning with stochastic customer demand. Int.
J. Prod. Res. 54(12), 3718–3735 (2016)
2. Bach, B., Dachselt, R., Carpendale, S., Dwyer, T., Collins, C., Lee, B.: Immersive
analytics: exploring future interaction and visualization technologies for data ana-
lytics. In: Proceedings of the 2016 ACM International Conference on Interactive
Surfaces and Spaces, pp. 529–533 (2016)
3. Blondel, V.D., Guillaume, J.-L., Lambiotte, R., Lefebvre, E.: Fast unfolding of
communities in large networks. J. Stat. Mech. Theory Exp. 2008(10), P10008
(2008)
4. Bunke, H., Dickinson, P.J., Kraetzl, M., Wallis, W.D.: A graph-theoretic approach
to enterprise network dynamics, vol. 24. Springer Science & Business Media (2007).
https://doi.org/10.1007/978-0-8176-4519-9
5. Cavallo, M., Dolakia, M., Havlena, M., Ocheltree, K., Podlaseck, M.: Immersive
insights: a hybrid analytics system for collaborative exploratory data analysis. In:
Symposium on Virtual Reality Software and Technology (VRST), pp. 1–12. ACM
(2019)
6. Cheng, Y., Tao, F., Xu, L., Zhao, D.: Advanced manufacturing systems: supply–
demand matching of manufacturing resource based on complex networks and inter-
net of things. Enterprise Inf. Syst. 12(7), 780–797 (2018)
7. Cinelli, M., Ferraro, G., Iovanella, A., Lucci, G., Schiraldi, M.M.: A network per-
spective on the visualization and analysis of bill of materials. Int. J. Eng. Bus.
Manage. 9, 1847979017732638 (2017)
750 S. Tripathi et al.
8. Cordeil, M., Dwyer, T., Klein, K., Laha, B., Marriott, K., Thomas, B.H.: Immersive
collaborative analysis of network connectivity: cave-style or head-mounted display?
IEEE Trans. Visual Comput. Graph. 23(1), 441–450 (2017)
9. de Groote, X., Yücesan, E.: The impact of product variety on logistics performance.
In: Proceedings of the 2011 Winter Simulation Conference (WSC), pp. 2245–2254.
IEEE (2011)
10. Dehmer, M., Emmert-Streib, F., Jodlbauer, H.: Entrepreneurial Complexity:
Methods and Applications. CRC Press (2019)
11. Dimitrova, T., Petrovski, K., Kocarev, L.: Graphlets in multiplex networks. Sci.
Rep. 10(1), 1–13 (2020)
12. Elmqvist, N., Moere, A.V., Jetter, H.C., Cernea, D., Reiterer, H., Jankun-Kelly,
T.J.: Fluid interaction for information visualization. Inf. Visual. 10(4), 327–340
(2011)
13. Emmert-Streib, F., et al.: Computational analysis of the structural properties of
economic and financial networks. arXiv:1710.04455 (2017)
14. Fröhler, B., et al.: A survey on cross-virtuality analytics. In: Computer Graphics
Forum, vol. 41, pp. 465–494. Wiley Online Library (2022)
15. Garg, S., Vrat, P., Kanda, A.: Equipment flexibility vs. inventory: a simulation
study of manufacturing systems. Int. J. Prod. Econ. 70(2), 125–143 (2001)
16. Holme, P., Saramäki, J.: Temporal networks. Phys. Rep. 519(3), 97–125 (2012)
17. Hübl, A., Altendorfer, K., Jodlbauer, H., Gansterer, M., Hartl, R.F.: Flexible model
for analyzing production systems with discrete event simulation. In: Proceedings
of the 2011 Winter Simulation Conference (WSC), pp. 1554–1565. IEEE (2011)
18. Interdonato, R., Magnani, M., Perna, D., Tagarelli, A., Vega, D.: Multilayer net-
work simplification: approaches, models and methods. Comput. Sci. Rev. 36,
100246 (2020)
19. Jetter, H.C., Gerken, J., Zöllner, M., Reiterer, H., Milic-Frayling, N.: Materializing
the query with facet-streams: a hybrid surface for collaborative search on table-
tops. In: Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems, pp. 3013–3022 (2011)
20. Jodlbauer, H., Altendorfer, K.: Trade-off between capacity invested and inventory
needed. Eur. J. Oper. Res. 203(1), 118–133 (2010)
21. Kiyokawa, K., Takemura, H., Yokoya, N.: A collaboration support technique by
integrating a shared virtual reality and a shared augmented reality. In: Interna-
tional Conference on Systems, Man, and Cybernetics (SMC), vol. 6, pp. 48–53.
IEEE (1999)
22. Koh, S.-G., Bulfin, R.L.: Comparison of DBR with CONWIP in an unbalanced
production line with three stations. Int. J. Prod. Res. 42(2), 391–404 (2004)
23. Kotlarek, J., et al.: A study of mental maps in immersive network visualization.
In: IEEE Pacific Visualization Symposium (PacificVis), pp. 1–10 (2020)
24. Kotlarek, J., et al.: A study of mental maps in immersive network visualization
(2020)
25. Koutra, D., Vogelstein, J.T., Faloutsos, C.: DELTACON: a principled massive-
graph similarity function. In: Proceedings of the 2013 SIAM International Confer-
ence on Data Mining, pp. 162–170. SIAM (2013)
26. Kronberger, G., Weidenhiller, A., Kerschbaumer, B., Jodlbauer, H.: Automated
simulation model generation for scheduler-benchmarking in manufacturing. In:
Proceedings of the International Mediterranean Modelling Multiconference (I3M
2006), pp. 45–50 (2006)
27. Kwon, O.H., Muelder, C., Lee, K., Ma, K.L.: A study of layout, rendering, and
interaction methods for immersive graph visualization. IEEE Trans. Visual Com-
put. Graph. 22(7), 1802–1815 (2016)
28. Li, Y., Tao, F., Cheng, Y., Zhang, X., Nee, A.Y.C.: Complex networks in advanced
manufacturing systems. J. Manuf. Syst. 43, 409–421 (2017)
29. Milgram, P., Takemura, H., Utsumi, A., Kishino, F.: Augmented reality: a class
of displays on the reality-virtuality continuum. In: Das, H. (eds.) Photonics for
Industrial Applications, pp. 282–292 (1995)
30. Mula, J., Poler, R., Garcı́a-Sabater, J.P., Lario, F.C.: Models for production plan-
ning under uncertainty: a review. Int. J. Prod. Econ. 103(1), 271–285 (2006)
31. Papadimitriou, P., Dasdan, A., Garcia-Molina, H.: Web graph similarity for
anomaly detection. J. Internet Serv. Appl. 1(1), 19–30 (2010). https://doi.org/
10.1007/s13174-010-0003-x
32. Riegler, A., et al.: Cross-virtuality visualization, interaction and collaboration. In:
XR@ ISS (2020)
33. Sereno, M., Besançon, L., Isenberg, T.: Supporting volumetric data visualization
and analysis by combining augmented reality visuals with multi-touch input. In:
EG/VGTC Conference on Visualization (EuroVis) - Posters (2019)
34. Sorger, J., Waldner, M., Knecht, W., Arleo, A.: Immersive analytics of large
dynamic networks via overview and detail navigation. In: International Confer-
ence on Artificial Intelligence and Virtual Reality (AIVR), pp. 144–1447. IEEE
(2019)
35. Sorger, J., Waldner, M., Knecht, W., Arleo, A: Immersive analytics of large
dynamic networks via overview and detail navigation (2019)
36. Stevenson, M., Hendry, L.C., Kingsman, B.G.: A review of production planning
and control: the applicability of key concepts to the make-to-order industry. Int.
J. Prod. Res. 43(5), 869–898 (2005)
37. Strasser, S., Peirleitner, A.: Reducing variant diversity by clustering. In: Proceed-
ings of the 6th International Conference on Data Science, Technology and Appli-
cations, pp. 141–148. SCITEPRESS-Science and Technology Publications, LDA
(2017)
38. Strogatz, S.H.: Exploring complex networks. Nature 410(6825), 268–276 (2001)
39. Szalavári, Z., Schmalstieg, D., Fuhrmann, A., Gervautz, M.: “Studierstube”: an
environment for collaboration in augmented reality. Virt. Real. 3(1), 37–48 (1998)
40. Thompson, M.B.: Expanding simulation beyond planning and design: in addition
to the increase in traditional uses, simulation is expanding into new and even more
valuable areas. Ind. Eng.-Norcross 26(10), 64–67 (1994)
41. Tiger, A.A., Simpson, P.: Using discrete-event simulation to create flexibility in
APAC supply chain management. Global J. Flexible Syst. Manage. 4(4), 15–22
(2003)
42. Trattner, A., Hvam, L., Forza, C., Herbert-Hansen, Z.N.L.: Product complexity
and operational performance: a systematic literature review. CIRP J. Manuf. Sci.
Technol. 25, 69–83 (2019)
43. Tripathi, S., Dehmer, M., Emmert-Streib, F.: NetBioV: an R package for visualizing
large network data in biology and medicine. Bioinformatics 30(19), 2834–2836
(2014)
44. Tripathi, S., Strasser, S., Jodlbauer, H.: A network based approach for reducing
variant diversity in production planning and control (2021)
45. Tseng, M.M., Radke, A.M.: Production planning and control for mass
customization–a review of enabling technologies. In: Mass Customization, pp. 195–
218. Springer (2011)
46. Wang, C., Tao, J.: Graphs in scientific visualization: a survey. In: Computer Graph-
ics Forum, vol. 36, pp. 263–287. Wiley Online Library (2017)
47. Yu, G., Dehmer, M., Emmert-Streib, F., Jodlbauer, H.: Hermitian normalized
Laplacian matrix for directed networks. Inf. Sci. 495, 175–184 (2019)
An Annotated Caribbean Hot Pepper
Image Dataset
Abstract. The Caribbean region is home to, and widely known for,
its many “hot” peppers. These peppers are now heavily researched to
bolster the development of the regional pepper industry. However, accu-
rately identifying the different landraces of peppers in the Caribbean
has remained an arduous, manual task that involves the physical
inspection and classification of individual peppers. An automated app-
roach that uses machine-learning techniques can help with this task; how-
ever, machine learning approaches require vast amounts of data to work
well. This paper presents a new multi-label annotated image dataset
of Capsicum chinense peppers from Trinidad and Tobago. The paper
also presents a benchmark for pepper image classification and identification.
It serves as a starting ground for future work that can include the
compilation of larger datasets of regional peppers that can include more
morphological features. It additionally serves as the starting ground for
a Caribbean-based hot-pepper ontology.
1 Introduction
The Caribbean is well known for its hot peppers. Most of the region’s commer-
cially produced peppers are of the Capsicum chinense Jacq. species. The species
is recognized for its high capsaicin content which gives Caribbean hot peppers
their characteristic heat, pungent aromatic smell and flavor. These characteris-
tics have made them an important export for countries in the region [29]. Another
trend that favors their export is global interest in the so-called “super-hot” cate-
gory of peppers. Trinidad and Tobago is known in particular for its “super-hot”
landraces and pure line varieties such as the “Seven-Pot” and “Scorpion” pep-
pers. The Scorpion pepper is one of the hottest peppers in the world and is
sought after for its high capsaicin content [5,6].
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 753–769, 2023.
https://doi.org/10.1007/978-3-031-18461-1_49
754 J. Mungal et al.
2 Related Work
2.1 Related Image Datasets
Since our image dataset is intended to be used for deep learning applications,
it is important to examine other datasets designed for similar use. Deep neural
Caribbean Hot Pepper Image Dataset 755
notable mention which focuses on object description. The Visual Genome dataset
[17] features extensive annotations of objects, object attributes and object rela-
tions. This dataset was built with the aim of developing systems that can perform
reasoning tasks rather than recognition/classification tasks.
The initial task of determining what attributes to annotate was approached
differently for each study mentioned. Attributes need to be general enough so
that they can be applied to different types of objects in order to create rela-
tionships. The authors of [27] mined ImageNet synset definitions for descriptive
attributes. In [25] and [26] a human intelligence task was involved in identify-
ing common attributes. In both [27] and [25] this task was done using Amazon
Mechanical Turk (AMT). The top five most commonly selected attributes were
chosen in [25]. The authors of [26] reduced the number of attributes to those
that best describe visual properties.
Object specific datasets with attribute annotations reduce the difficulty of
choosing which attributes to include. Caltech-UCSD Birds-200-2011 [30] is an
image dataset of bird images containing part location and attribute annotations.
It is an extension of the dataset in [31]. Attributes were chosen from a bird field
guide as there is an already defined or commonly understood taxonomy for
describing the object instances. CelebFaces Attributes Dataset [21] is a large
scale dataset of celebrity faces. The attributes describe physical features of the
faces as well as clothing and hair. However, Liu et al. [21] do not describe how
the attributes were chosen. Animals with Attributes Dataset [18] used a group of
high-level semantic attributes to describe images of animals in their dataset. The
authors describe these attributes as ones that generalize across class boundaries
e.g. color. DeepFashion [20] is a fashion dataset put forward for the purpose of
clothing classification and attribute prediction. Attributes for DeepFashion [20]
were chosen by mining meta-data of clothing images from Google Images and
online clothing retailers. With our Capsicum chinense image dataset, we used
the International Plant Genetic Resources Institute (IPGRI) Descriptors for
Capsicum spp. [16], a well-known set of descriptors for morphological traits of the
Capsicum species of plants. This descriptor set is well-defined and widely used.
Our attributes are therefore a subset of the descriptors defined in this document.
This is a more domain-specific approach to attribute annotation. The
vocabulary used for attributes in our dataset is consistent with that of studies
involving fruit from accessions of Capsicum Chinense, which we believe is
more beneficial for applications in this area.
for infield fruit identification by taking images of fruit trees in the field and
using similar images found online. Zhang et al. [35] created a set of “clean” fruit
images by applying pre-processing to remove background from their image data.
Datasets like ImageNet [10] were constructed with general image classification
in mind and therefore aim to be as diverse as possible. Deng et al. [10], the cre-
ators of ImageNet, included average image calculations against classes of other
datasets to show that ImageNet is more diverse. This type of problem is more
difficult as the context of real-world images cannot be controlled. The image set
must be as diverse as possible to not only improve accuracy but avoid problems
such as domain shift. Domain shift refers to training and evaluating a model
on sets of data drawn from different distributions. An example of domain shift
is a classifier trained on images of objects on a white background performing
worse on similar images with a different background. The model trained
on data from one domain does not generalize well enough to be used with data
from another domain.
It may not be necessary to have the most diverse set of images regarding vari-
ables such as backgrounds and obscured images. A context of classifying images
in lab conditions where the lighting, sensor, background, and fruit orientation
can be controlled may be sufficient. Zhang et al. [34] constructed a dataset for
fine-grained classification of banana ripeness. This dataset included images of
bananas in predetermined orientations on white background. The same camera,
camera settings, distance and lighting were used when capturing images. While
their experiment achieved good results, it should not be expected to perform
well on images outside of this domain e.g. images from a different camera on a
different background. Zhang et al. [35] also conclude that their convolutional
neural network was not able to perform well on imperfect images and images with
complex backgrounds due to the network being trained on their clean dataset.
The same expectations would apply to our Capsicum Chinense image dataset
which uses a similar method of data collection to Zhang et al. [34].
Scotch Bonnet pepper is known for its yellow color when it matures, campanu-
late (bell-like) shape, and its folded pericarp. These morphological traits form
the basis for our attribute annotations.
As varieties share similar traits we can say there is some inter-class similarity.
For example, the Scorpion pepper and Seven-Pot pepper have a very similar
appearance with small differences such as the Scorpion pepper’s pronounced
tail at the distal end of the fruit as seen in Fig. 1. Intra-class variation is also
a problem as the images of a class may have different orientations, scale or
occluded features. There is natural variation in morphological traits observed
for a specific variety.
Fig. 1. Image showing inter-class similarities between Scorpion and Seven-Pot Pepper
3 Methodology
3.1 Image Collection and Dataset Annotation
In order to start the data collection process, genuine Scorpion pepper fruit had
to be sourced directly from farmers. Images of other Capsicum Chinense such
as Trinidad Pimentos, Seven-Pot-brown, red and yellow Habanero varieties were
collected from local markets. The images were captured using two cameras in
fixed positions. The peppers were rotated to capture different sides of the fruit in
different orientations. In all, 10 images of each pepper were taken. A diagram of
this setup is shown in Fig. 2. The images were captured on a white background
with two fluorescent lamps for lighting. Both cameras had their white balances
set to fluorescent, to account for the artificial lighting used. They also had their
ISO set to 400, with image quality set to "Fine". However, due to the differences in
cameras and their lenses, a few settings had to be configured differently. The first
Caribbean Hot Pepper Image Dataset 759
camera produced images with a 4:3 aspect ratio and a 3264 × 2448 resolution,
while the second produced images with a 3:2 ratio and a 5184 × 3456 resolution.
Both cameras used fixed lenses; however, due to differences in brands and camera
technology, the first camera used an exposure compensation of +1 2/3, while the
second used a 1/125 shutter speed and an f/3.2 aperture. Configuring these
settings differently allowed both cameras to produce consistent images. For the
purpose of having uniformly sized images, the
larger images were center cropped to match the aspect ratio and resolution of the
first set of images in the final dataset. The result of this process was a collection
of just over 4,000 images.
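The center-cropping step used to unify the image sizes can be sketched as follows. This is a minimal illustration; the function name is our own, and the box follows Pillow-style (left, top, right, bottom) coordinates.

```python
def center_crop_box(width, height, target_w, target_h):
    """Return a (left, top, right, bottom) box that crops a width x height
    image down to target_w x target_h about its center."""
    if target_w > width or target_h > height:
        raise ValueError("target size exceeds image size")
    left = (width - target_w) // 2
    top = (height - target_h) // 2
    return (left, top, left + target_w, top + target_h)

# Cropping a 5184 x 3456 (3:2) image from the second camera down to the
# first camera's 4:3 frame:
box = center_crop_box(5184, 3456, 3264, 2448)
# With Pillow this box can be applied directly, e.g. image.crop(box).
```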
Fig. 2. Figure showing the camera setup used in the image collection process
Fig. 3. Figure showing the AMT user interface that was shown to the annotators.
The dataset was used to perform a classification experiment. The goal was to
determine whether an image in the dataset was of a scorpion pepper or not.
The experiment made use of a convolutional neural network for classification
and employed a technique known as transfer learning. A two-phase fine-tuning
process was used where only a new classification head of the network is trained
using our dataset in the first phase, and the second phase where the top and
some of the higher layers in the convolutional base are retrained. VGG16 was
chosen as the classification network. For the first phase, K-Fold Cross Validation
was used to determine which split of the data provided the best representation
of the model performance. From this, the second phase of the fine-tuning process
was carried out. Other common evaluation metrics were also used, including
Precision, Recall, F-1 score and Accuracy.
CNN Setup and Configuration. The experiment was set up using an Ana-
conda environment with Python 3.6. The model was trained on a machine with
an Nvidia 2060 Super GPU with 8 GB of VRAM. The VGG16 network, pre-
trained on ImageNet, was loaded from the Keras API. A two step fine-tuning
approach was used for adapting the network for the experiment with our dataset.
The VGG16 classification layers are replaced with a new classifier, the structure
of which is given in Fig. 4. The new classifier uses smaller dense layers and
replaces the soft-max layer for an output neuron with a sigmoid activation func-
tion. Training is done in two phases:
1. Freeze the convolutional base of the network. Train the new fully connected
head.
2. Unfreeze block 5 of the VGG16 convolutional base. Retrain block 5 and the
fully connected head.
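The two phases above can be sketched in Keras as follows (after Chollet [8]). The dense-layer sizes are assumptions, since the exact head structure is given in Fig. 4; `weights=None` is used here only so the sketch builds without downloading the ImageNet weights that the actual experiment loads.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Convolutional base: VGG16 without its classification layers.
# The experiment uses weights="imagenet"; None avoids the download here.
base = tf.keras.applications.VGG16(weights=None, include_top=False,
                                   input_shape=(224, 224, 3))

# New head: smaller dense layers and a single sigmoid output neuron
# replacing VGG16's soft-max classifier (sizes are illustrative).
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

def compile_model(m):
    m.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])

# Phase 1: freeze the convolutional base, train only the new head.
base.trainable = False
compile_model(model)
# model.fit(train_gen, validation_data=val_gen, epochs=15)

# Phase 2: unfreeze only block 5, retrain it together with the head.
base.trainable = True
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")
compile_model(model)
# model.fit(train_gen, validation_data=val_gen, epochs=10)
```

Recompiling after each change to the `trainable` flags is required for the change to take effect.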
Fig. 4. Image showing new classification layers added to the standard VGG16 base
network
The images were center cropped using a 900 × 900 box and then resized to
224 × 224 to match the default input size for VGG16 pre-trained on ImageNet.
Since there were ten images of each pepper, the images were grouped by pepper
to avoid the same fruit appearing in training, validation and test sets. Exactly 4
peppers, 40 images, were removed from the set of peppers that are not Scorpion
pepper. This was done to make the split even. The entire set was then randomly
split using a 80:20 ratio for training and evaluation sets. A breakdown of the
training and test set splits is shown in Table 1. The training set was then split
into k subsets for K-Fold cross validation where k = 10 with the same grouping
enforced as to prevent images of the same fruit showing up in both training and
validation sets. A breakdown of images per fold is shown in Table 2.
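The group-aware splitting described above can be sketched in plain Python. The file-naming scheme, seed, and function name are our own illustrative assumptions; the point is that images of one fruit never cross the train/test boundary or fold boundaries.

```python
import random

def grouped_split(image_paths, group_of, test_frac=0.2, k=10, seed=0):
    """Split images 80:20 by pepper (group), then carve the remaining
    training groups into k folds for cross validation."""
    groups = sorted({group_of(p) for p in image_paths})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = int(len(groups) * test_frac)
    test_groups = set(groups[:n_test])
    train_groups = groups[n_test:]
    # Each fold is a set of pepper ids; an image's fold is its pepper's fold.
    folds = [set(train_groups[i::k]) for i in range(k)]
    train = [p for p in image_paths if group_of(p) not in test_groups]
    test = [p for p in image_paths if group_of(p) in test_groups]
    return train, test, folds

# Ten images per pepper, named "<pepper_id>_<shot>.jpg" (hypothetical scheme).
paths = [f"{pepper}_{shot}.jpg" for pepper in range(100) for shot in range(10)]
train, test, folds = grouped_split(paths, group_of=lambda p: p.split("_")[0])
```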
Training. Binary cross-entropy was used as the loss function since it is a binary
classification problem. Adam was selected as the optimizer for back propagation
with a learning rate of 0.0001. The training and validation data generator were
set to use a batch size of 20 images.
762 J. Mungal et al.
Table 1. Table showing breakdown of training and test splits of the dataset.
The same settings were used for phase 1 and phase 2 of training. The model
is trained for 15 epochs in Phase 1 and 10 epochs in Phase 2.
A further pre-processing step is done using the ImageDataGenerator. The
pixel values of the images are re-scaled from the range of 0–255 to the range of
0–1. This is a common data normalization step for images. While this method
does not match the pre-processing function used by the pre-trained model, it
seemed to help reduce over-fitting in our testing. The pre-processing and fine-
tuning procedure is based on the example by Chollet [8], the creator of the Keras
library, from his Deep learning with Python book.
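The rescaling step corresponds to `ImageDataGenerator(rescale=1/255)` in Keras; as plain arithmetic it amounts to the following sketch (function name ours):

```python
def rescale_pixels(image):
    """Map 8-bit pixel values from [0, 255] to [0.0, 1.0], mirroring
    Keras' ImageDataGenerator(rescale=1/255)."""
    return [[channel / 255.0 for channel in pixel] for pixel in image]

# One pixel row with R, G, B channel values:
scaled = rescale_pixels([[0, 128, 255]])
```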
4 Results
4.1 Classification Experiment Results
Table 3 shows the evaluation accuracy results after performing k-fold cross val-
idation. The average evaluation accuracy across all folds was roughly 98%.
Exactly 7 of the 10 folds were within 1 standard deviation of the mean. The
performance on the evaluation set suggested that the model was generalizing
Table 3. Table showing accuracy of the model for K-fold validation for classification
experiment. * means values within 1 standard deviation of the mean
well to unseen samples. The data in Table 4 shows the evaluation metrics of the
model. The model showed good results for classification of both classes. The
precision score for the “not scorpion” class was higher than that of the scorpion
pepper class. The opposite is true of the recall scores and the F1 scores were
roughly the same. The model made more mistakes classifying the “not scorpion”
images as opposed to the images of scorpion peppers.
$627.60 USD. The most popular values reported by the five annotators for each
pepper’s attributes were selected as the final attribute values for that pepper.
Additional information on the MTurkers themselves was also collected. They
were asked about their birth country, residence country, native language and
other languages they speak. If they lived in a country that was not their birth
country, they were asked how long they lived in their new country. This was done
to capture any cultural differences that may have existed between MTurkers and
the way that they may annotate the peppers. A summary of the annotations is
shown in Table 5.
Table 5. Table showing results from annotation process broken down by attribute type
and pepper type. The numbers represent the count of peppers belonging to a landrace
for a specific attribute.
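The "most popular value" selection over the five annotations can be sketched as follows; the attribute values shown are illustrative only.

```python
from collections import Counter

def majority_value(votes):
    """Return the most popular value reported by the annotators.
    Counter.most_common breaks ties by first-seen order."""
    return Counter(votes).most_common(1)[0][0]

# Five AMT annotations for one pepper's color attribute (illustrative):
final_color = majority_value(["red", "red", "orange", "red", "yellow"])
```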
Attribute Type Fleiss’ Kappa Standard error 95% Confidence interval p-value
Color All 0.357 0.005 (0.348, 0.366) 0.00
Color Local red 0.140 0.024 (0.093, 0.188) 0.00
Color Moruga red 0.411 0.011 (0.39, 0.432) 0.00
Color Pimento 0.283 0.010 (0.262, 0.303) 0.00
Color Red habanero 0.290 0.020 (0.251, 0.33) 0.00
Color Scorpion 0.271 0.010 (0.251, 0.29) 0.00
Color Seven pot 0.274 0.009 (0.257, 0.291) 0.00
Color Yellow habanero 0.264 0.021 (0.223, 0.305) 0.00
Attribute Type Fleiss’ Kappa Standard error 95% Confidence interval p-value
Pebbling All 0.337 0.007 (0.322,0.351) 0.00
Pebbling Local red 0.058 0.025 (0.009, 0.107) 0.02
Pebbling Moruga red 0.055 0.013 (0.029, 0.081) 0.00
Pebbling Pimento 0.030 0.015 (−0.001, 0.06) 0.05
Pebbling Red habanero 0.026 0.022 (−0.018, 0.069) 0.25
Pebbling Scorpion 0.050 0.010 (0.03, 0.071) 0.00
Pebbling Seven pot 0.087 0.016 (0.055,0.119) 0.00
Pebbling Yellow Habanero 0.049 0.023 (0.003, 0.095) 0.04
Attribute Type Fleiss’ Kappa Standard error 95% Confidence interval p-value
Pericarp Fold All 0.096 0.006 (0.084, 0.108) 0.00
Pericarp Fold Local red 0.060 0.023 (0.016, 0.105) 0.01
Pericarp Fold Moruga red 0.029 0.013 (0.005, 0.054) 0.02
Pericarp Fold Pimento 0.048 0.017 (0.016,0.081) 0.00
Pericarp Fold Red habanero 0.035 0.023 (−0.01, 0.08) 0.12
Pericarp Fold Scorpion 0.047 0.010 (0.029, 0.066) 0.00
Pericarp Fold Seven pot 0.098 0.015 (0.069, 0.126) 0.00
Pericarp Fold Yellow habanero 0.000 0.021 (−0.042, 0.041) 1.00
Attribute Type Fleiss’ Kappa Standard error 95% Confidence interval p-value
Shape All 0.322 0.005 (0.312, 0.333) 0.00
Shape Local red 0.261 0.022 (0.218, 0.304) 0.00
Shape Moruga red 0.069 0.013 (0.044,0.094) 0.00
Shape Pimento 0.184 0.016 (0.153,0.214) 0.00
Shape Red Habanero 0.024 0.017 (−0.009,0.057) 0.15
Shape Scorpion 0.177 0.009 (0.16,0.195) 0.00
Shape Seven pot 0.167 0.013 (0.141,0.192) 0.00
Shape Yellow habanero 0.258 0.026 (0.207,0.309) 0.00
Attribute Type Fleiss’ Kappa Standard error 95% Confidence interval p-value
Fruit Surface All 0.141 0.005 (0.132,0.151) 0.00
Fruit Surface Local red 0.080 0.021 (0.039,0.122) 0.00
Fruit Surface Moruga red 0.056 0.009 (0.038,0.074) 0.00
Fruit Surface Pimento 0.084 0.014 (0.057,0.111) 0.00
Fruit Surface Red habanero 0.013 0.015 (−0.017,0.043) 0.40
Fruit Surface Scorpion 0.048 0.008 (0.033,0.063) 0.00
Fruit Surface Seven pot 0.065 0.011 (0.043,0.087) 0.00
Fruit Surface Yellow habanero 0.109 0.020 (0.07,0.148) 0.00
Fleiss’ Kappa sometimes yields low values when the ratings suggest high
levels of agreement. This is known as the kappa paradox, identified by [12],
as well as [9]. Gwet [13] proposed an AC1 statistic as a more paradox-resistant
alternative to kappa measures. [32] concluded that Gwet’s AC1 statistic provides
a more stable inter-rater reliability coefficient than other kappa measures.
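The contrast between the two coefficients can be reproduced from their standard definitions. This is a sketch; the ratings matrix below is synthetic, chosen only to exhibit the paradox on a heavily skewed category distribution.

```python
def _percent_agreement(counts, r):
    # Mean per-subject agreement over all rater pairs.
    return sum((sum(c * c for c in row) - r) / (r * (r - 1))
               for row in counts) / len(counts)

def fleiss_kappa(counts):
    """Fleiss' kappa for an n_subjects x n_categories matrix of rating
    counts, each row summing to the number of raters r."""
    n, r = len(counts), sum(counts[0])
    p_j = [sum(row[j] for row in counts) / (n * r)
           for j in range(len(counts[0]))]
    p_e = sum(p * p for p in p_j)          # chance agreement
    return (_percent_agreement(counts, r) - p_e) / (1 - p_e)

def gwet_ac1(counts):
    """Gwet's AC1 for the same matrix; its chance-agreement term stays
    small when the category distribution is skewed."""
    n, r, q = len(counts), sum(counts[0]), len(counts[0])
    p_j = [sum(row[j] for row in counts) / (n * r) for j in range(q)]
    p_e = sum(p * (1 - p) for p in p_j) / (q - 1)
    return (_percent_agreement(counts, r) - p_e) / (1 - p_e)

# Five raters, ten subjects, nearly all rated in the first category.
# Agreement is high, yet kappa collapses while AC1 does not.
ratings = [[5, 0]] * 9 + [[4, 1]]
```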
Attribute Type Gwet’s AC1 Standard error 95% Confidence interval p-value
Color All 0.448 0.004 (0.44,0.456) 0.00
Color Local red 0.549 0.016 (0.517,0.58) 0.00
Color Moruga red 0.520 0.009 (0.501,0.538) 0.00
Color Pimento 0.374 0.012 (0.351,0.396) 0.00
Color Red habanero 0.481 0.016 (0.449,0.512) 0.00
Color Scorpion 0.489 0.007 (0.475,0.503) 0.00
Color Seven pot 0.328 0.009 (0.31,0.345) 0.00
Color Yellow habanero 0.364 0.018 (0.328,0.401) 0.00
Attribute Type Gwet’s AC1 Standard error 95% Confidence interval p-value
Pebbling All 0.339 0.008 (0.324,0.354) 0.00
Pebbling Local red 0.236 0.036 (0.166,0.306) 0.00
Pebbling Moruga red 0.409 0.019 (0.372,0.446) 0.00
Pebbling Pimento 0.527 0.023 (0.482,0.571) 0.00
Pebbling Red habanero 0.478 0.032 (0.415,0.54) 0.00
Pebbling Scorpion 0.558 0.013 (0.531,0.584) 0.00
Pebbling Seven pot 0.503 0.020 (0.464,0.542) 0.00
Pebbling Yellow habanero 0.462 0.036 (0.391,0.533) 0.00
Attribute Type Gwet’s AC1 Standard error 95% Confidence interval p-value
Pericarp Fold All 0.408 0.004 (0.4,0.416) 0.00
Pericarp Fold Local red 0.398 0.038 (0.324,0.472) 0.00
Pericarp Fold Moruga red 0.266 0.019 (0.23,0.303) 0.00
Pericarp Fold Pimento 0.152 0.021 (0.11,0.194) 0.00
Pericarp Fold Red habanero 0.201 0.031 (0.14,0.262) 0.00
Pericarp Fold Scorpion 0.366 0.006 (0.353,0.379) 0.00
Pericarp Fold Seven pot 0.151 0.017 (0.118,0.185) 0.00
Pericarp Fold Yellow habanero 0.061 0.026 (0.009,0.113) 0.02
Attribute Type Gwet’s AC1 Standard error 95% Confidence interval p-value
Shape All 0.401 0.005 (0.391,0.41) 0.00
Shape Local red 0.395 0.024 (0.349,0.442) 0.00
Shape Moruga red 0.498 0.012 (0.476,0.521) 0.00
Shape Pimento 0.405 0.018 (0.371,0.44) 0.00
Shape Red habanero 0.473 0.019 (0.436,0.511) 0.00
Shape Scorpion 0.307 0.008 (0.29,0.323) 0.00
Shape Seven pot 0.383 0.010 (0.363,0.403) 0.00
Shape Yellow habanero 0.463 0.024 (0.416,0.51) 0.00
Attribute Type Gwet’s AC1 Standard error 95% Confidence interval p-value
Fruit Surface All 0.171 0.005 (0.162,0.18) 0.00
Fruit Surface Local red 0.264 0.021 (0.222,0.306) 0.00
Fruit Surface Moruga red 0.156 0.010 (0.136,0.177) 0.00
Fruit Surface Pimento 0.225 0.014 (0.196,0.253) 0.00
Fruit Surface Red habanero 0.152 0.019 (0.114,0.19) 0.00
Fruit Surface Scorpion 0.216 0.008 (0.2,0.232) 0.00
Fruit Surface Seven pot 0.230 0.013 (0.204,0.256) 0.00
Fruit Surface Yellow habanero 0.183 0.022 (0.14,0.227) 0.00
In general, the agreement was shown to range between slight and moderate.
The results can be seen in Tables 6 and 7. Some attributes are easier to identify
than others; we therefore expect lower agreement for more subjective attributes such as fruit
surface and the presence of a folded pericarp. Illustrations were provided as
examples during annotation; however, these still showed the lowest agreement.
This is a known problem in data collection tasks that involve human subjects as
human generated annotations can vary significantly [7,14].
Attributes such as surface pebbling are easier to identify, especially for vari-
eties such as the Pimento which rarely exhibits this trait. The results for agree-
ment for pebbling on the Pimento images show the kappa paradox. Gwet’s AC1
gives a better measure of agreement. This is also observed for other attributes.
5 Conclusion
References
1. Adams, H., Umaharan, P., Brathwaite, R., Mohammed, K.: Hot pepper production
manual for Trinidad and Tobago (2011)
2. Barth, R., IJsselmuiden, J., Hemming, J., Van Henten, E.J.: Data synthesis meth-
ods for semantic segmentation in agriculture: a capsicum annuum dataset. Comput.
Electron. Agric. 144, 284–296 (2018)
3. Bharath, S.M., Cilas, C., Umaharan, P.: Fruit trait variation in a Caribbean
germplasm collection of aromatic hot peppers (capsicum chinense jacq.).
HortScience 48(5), 531–538 (2013)
4. Bharath, S.M.: Morphological characterisation of a Caribbean germplasm collec-
tion of capsicum chinense jacq. Master’s thesis, The University of the West Indies
(2012)
5. Bosland, P.W., Coon, D., Reeves, G.: Trinidad moruga scorpion pepper is the
world’s hottest measured Chile pepper at more than two million Scoville heat
units. HortTechnology 22(4), 534–538 (2012)
6. CARDI: Genuine Caribbean hot pepper seed produced and sold by CARDI (2014)
7. Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L.:
Microsoft COCO captions: data collection and evaluation server (2015)
8. Chollet, F.: Deep Learning with Python. Manning Publications Company (2017)
9. Cicchetti, D.V., Feinstein, A.R.: High agreement but low kappa: II. Resolving the
paradoxes. J. Clin. Epidemiol. 43(6), 551–558 (1990)
10. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-
scale hierarchical image database. In 2009 IEEE Conference on Computer Vision
and Pattern Recognition, pp. 248–255 (2009)
11. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their
attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition,
pp. 1778–1785 (2009)
12. Feinstein, A.R., Cicchetti, D.V.: High agreement but low kappa: I. The problems
of two paradoxes. J. Clin. Epidemiol. 43(6), 543–549 (1990)
13. Gwet, K.L.: Computing inter-rater reliability and its variance in the presence of
high agreement. British J. Math. Stat. Psychol. 61(1), 29–48 (2008)
14. Hodosh, M., Young, P., Hockenmaier, J.: Framing image description as a ranking
task: Data, models and evaluation metrics. J. Artif. Int. Res. 47(1), 853–899 (2013)
15. Hou, S., Feng, Y., Wang, Z.: VegFru: A domain-specific dataset for fine-grained
visual categorization. In 2017 IEEE International Conference on Computer Vision
(ICCV), pp. 541–549 (2017)
16. International Plant Genetic Resources Institute IPGRI. Descriptors for Capsicum
(Capsicum Spp.) =: Descriptores Para Capsicum (Capsicum Spp.). IPGRI, Rome
(1995)
17. Krishna, R., et al.: Visual Genome: connecting language and vision using crowd-
sourced dense image annotations (2016)
18. Lampert, C.H., Nickisch, H., Harmeling, S.: Learning to detect unseen object
classes by between-class attribute transfer. In 2009 IEEE Conference on Computer
Vision and Pattern Recognition, pp. 951–958 (2009)
19. Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D.,
Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp.
740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1 48
20. Liu, Z., Luo, P., Qiu, S., Wang, X., Tang, X.: Deepfashion: Powering robust clothes
recognition and retrieval with rich annotations. In Proceedings of IEEE Conference
on Computer Vision and Pattern Recognition (CVPR) (2016)
21. Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In
Proceedings of International Conference on Computer Vision (ICCV) (2015)
22. Minervini, M., Fischbach, A., Scharr, H., Tsaftaris, S.A.: Finely-grained anno-
tated datasets for image-based plant phenotyping. Pattern Recogn. Lett. 81, 80–89
(2016)
23. Minervini, M., Scharr, H., Tsaftaris, S.A.: Image analysis: the new bottleneck in
plant phenotyping [applications corner]. IEEE Signal Process. Mag. 32(4), 126–131
(2015)
24. Mureşan, H., Oltean, M.: Fruit recognition from images using deep learning. Acta
Universitatis Sapientiae, Informatica 10(1), 26–42 (2018)
25. Patterson, G., Hays, J.: Sun attribute database: Discovering, annotating, and rec-
ognizing scene attributes. In 2012 IEEE Conference on Computer Vision and Pat-
tern Recognition, pp. 2751–2758 (2012)
26. Patterson, G., Hays, J.: COCO Attributes: attributes for people, animals, and
objects. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS,
vol. 9910, pp. 85–100. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-
46466-4 6
27. Russakovsky, O., Fei-Fei, L.: Attribute learning in large-scale datasets. In: Kutu-
lakos, K.N. (ed.) ECCV 2010. LNCS, vol. 6553, pp. 1–14. Springer, Heidelberg
(2012). https://doi.org/10.1007/978-3-642-35749-7 1
28. Sa, I., Ge, Z., Dayoub, F., Upcroft, B., Perez, T., McCool, C.: Deepfruits: A fruit
detection system using deep neural networks. Sensors (Basel, Switzerland), vol.
16(8) (2016)
29. Sinha, A., Petersen, J.: Caribbean hot pepper production and post harvest manual
(2011)
30. Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The caltech-ucsd
birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute
of Technology (2011)
31. Welinder, P., et al.: Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001,
California Institute of Technology (2010)
32. Wongpakaran, N., Wongpakaran, T., Wedding, D., Gwet, K.L.: A comparison of
Cohen's kappa and Gwet's AC1 when calculating inter-rater reliability coefficients:
a study conducted with personality disorder samples. BMC Med. Res. Methodol.
13(1), 61 (2013)
33. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual
denotations: New similarity metrics for semantic inference over event descriptions.
Trans. Assoc. Comput. Linguist. 2, 67–78 (2014)
34. Zhang, Y., Lian, J., Fan, M., Zheng, Y.: Deep indicator for fine-grained classifica-
tion of banana’s ripening stages. EURASIP J. Image Video Process. 2018(1), 46
(2018)
35. Zhang, Y.-D., Dong, Z., Chen, X., Jia, W., Du, S., Muhammad, K., Wang, S.-
H.: Image based fruit category classification by 13-layer deep convolutional neural
network and data augmentation. Multimedia Tools Appl. 78(3), 3613–3632 (2017).
https://doi.org/10.1007/s11042-017-5243-3
36. Zheng, Y.-Y., Kong, J.-L., Jin, X.-B., Wang, X.-Y., Ting-Li, S., Zuo, M.: Cropdeep:
the crop vision dataset for deep-learning-based classification and detection in pre-
cision agriculture. Sensors 19(5), 1058 (2019)
37. Zitnick, C.L., Parikh, D.: Bringing semantics into focus using visual abstraction.
In 2013 IEEE Conference on Computer Vision and Pattern Recognition. IEEE
(2013)
A Prediction Model for Student Academic
Performance Using Machine Learning-Based
Analytics
Abstract. The adoption of digitization in the education sector has led to transfor-
mational changes. The academic sector has become more digital, more extensive,
and more comprehensive but more complex as well. The topical advancements
include the rise of technology-driven learning, the use of digital learning plat-
forms, management systems, and technologies by students; the implementation
of artificial intelligence and machine learning approaches for improving student
learning. In recent times, the application of machine learning in academics has
spurred growth in the education sector, fostering novel areas such as Academic
Data Mining (ADM) or Educational Data Mining (EDM).
ADM, based on machine learning techniques, helps in the prediction of students’
academic performance and is of interest to many academic institutions for
the classification of their students according to their learning capabilities.
Moreover, the enormous amount of data about student academics can be handled,
pre-processed, analyzed, and transformed into meaningful results and interest-
ing patterns. The resulting patterns help in analyzing the academic performance
of students and further lead to the identification of students who require special
counseling. This paper proposes a model that predicts the performance of students
based on academic details, which helps in the classification of different learners.
1 Introduction
In recent times, Machine Learning (ML) techniques have been used for decision making
in many prominent areas, among which education is of utmost importance. ML supports
academic institutions by predicting the academic performance of their students.
Furthermore, it enables instructors to differentiate between good and poor performers
based on their predicted academic performance [1–3].
ML has several components that vary with the type of output, the input features,
the representation of the data, and the type of feedback used during learning. The
variation lies in the type of data available, the features extracted from the data, the
output needed, and the algorithm used to obtain the learning model [4].
Figure 1 represents the components of machine learning.
Fig. 1. Components of machine learning: data, training experience, and feedback
Fig. 2. Strategic flow diagram for assessing student performance analytics for getting the result
algorithms iteratively run on large datasets to evaluate the different patterns in the data
and let the machine respond to the situations for which they have not been explicitly
prepared. To deliver consistent results the machines learn from the historic data [11].
The results generated using the learning analytics process have been used by both
instructors and students, as represented in Fig. 3. The instructors have used intelligent learning
analytics in their teaching for the improvement of the learning experience of students.
The fundamental idea of intelligent learning analytics is to identify the students at risk
and provide them with timely intervention based on the results of student academics.
Early detection of at-risk students will help higher education institutions reduce
dropout rates and increase retention rates [8, 9].
Fig. 4. Student academic analytics: data collection from different institutions, followed
by learning analytics on the collected data
The very first step in performance prediction analytics is the availability of students’
academic data. Figure 4 shows that the academic data can be collected from different
sources. Afterward, learning analytics is applied to the gathered data, which further
helps in the analysis of students and their academic data. Predictive analytics is
then applied to predict the academic performance of the students. The prediction
results have been used by both learners and tutors so that appropriate actions can be
774 H. Kaur and T. Kaur
taken to improve the performance of the weak students. The last step is the resulting
analytics, which helps in the validation of generated results.
                          Predicted positive      Predicted negative
Actual positive           True positive (TP)      False negative (FN)
Actual negative           False positive (FP)     True negative (TN)
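From these four counts, the standard evaluation metrics follow directly; the counts below are illustrative only.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Illustrative counts for a binary student-performance classifier:
p, r, f1, acc = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```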
communicated to their teachers. The teachers can then provide their valuable suggestions
and inputs so as to improve such students' academic performances. Moreover, for the
institutions, the identification of slow learners at the initial stage benefits them in increas-
ing their retention rate by taking corrective actions on time. The institutions can leverage
the capabilities of the proposed analytical model to bridge the gap between the student
learning capabilities, their behavior and teaching potential of the instructors. Overall,
the proposed model is beneficial to the institutions, instructors as well as the students.
The proposed model can be augmented with additional course recommendation abilities
that will help the students to select appropriate courses as per their performance.
References
1. Enughwure, A.A., Ogbise, M.E.: Application of machine learning methods to predict student
performance: a systematic literature review. Int. Res. J. Eng. Technol. 7(05), 3405–3415
(2020)
2. Albreiki, B., Zaki, N., Alashwal, H.: A systematic literature review of students’ performance
prediction using machine learning techniques. Educ. Sci. 11(9), 552 (2021)
3. Bhutto, E.S., Siddiqui, I.F., Arain, Q.A., Anwar, M.: Predicting students’ academic per-
formance through supervised machine learning. In: 2020 International Conference on
Information Science and Communication Technology (ICISCT), pp. 1–6. IEEE, February
2020
4. King, D.E.: Dlib-ml: a machine learning toolkit. J. Mach. Learn. Res. 10, 1755–1758 (2009)
5. Lemay, D.J., Baek, C., Doleck, T.: Comparison of learning analytics and educational data
mining: a topic modeling approach. Comput. Educ. Artif. Intell. 2, 100016 (2021)
6. Namoun, A., Alshanqiti, A.: Predicting student performance using data mining and learning
analytics techniques: a systematic literature review. Appl. Sci. 11(1), 237 (2020)
7. Guo, B., Zhang, R., Xu, G., Shi, C., Yang, L.: Predicting students performance in educational
data mining. In: 2015 International Symposium on Educational Technology (ISET), pp. 125–
128. IEEE, July 2015
8. Akçapınar, G., Altun, A., Aşkar, P.: Using learning analytics to develop early-warning system
for at-risk students. Int. J. Educ. Technol. High. Educ. 16(1), 1–20 (2019). https://doi.org/10.
1186/s41239-019-0172-z
9. Miguéis, V.L., Freitas, A., Garcia, P.J., Silva, A.: Early segmentation of students according
to their academic performance: a predictive modelling approach. Decis. Support Syst. 115,
36–51 (2018)
10. Aldowah, H., Al-Samarraie, H., Fauzy, W.M.: Educational data mining and learning analytics
for 21st century higher education: a review and synthesis. Telematics Inform. 37, 13–49 (2019)
11. Chuan, Y.Y., Husain, W., Shahiri, A.M.: An exploratory study on students’ performance
classification using hybrid of decision tree and Naïve Bayes approaches. In: Akagi, M.,
Nguyen, T.T., Vu, D.T., Phung, T.N., Huynh, V.N. (eds.) ICTA 2016. AISC, vol. 538, pp. 142–
152. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-49073-1_17
12. Al Breiki, B., Zaki, N., Mohamed, E.A.: Using educational data mining techniques to pre-
dict student performance. In 2019 International Conference on Electrical and Computing
Technologies and Applications (ICECTA), pp. 1–5. IEEE, November 2019
13. Baldi, P., Brunak, S., Chauvin, Y., Andersen, C.A., Nielsen, H.: Assessing the accuracy of
prediction algorithms for classification: an overview. Bioinformatics 16(5), 412–424 (2000)
14. Papamitsiou, Z., Economides, A.A.: Learning analytics and educational data mining in prac-
tice: a systematic literature review of empirical evidence. J. Educ. Technol. Soc. 17(4), 49–64
(2014)
15. Mueen, A., Zafar, B., Manzoor, U.: Modeling and predicting students’ academic performance
using data mining techniques. Int. J Mod. Educ. Comput. Sci 8(11), 36–42 (2016)
Parameterized-NL Completeness
of Combinatorial Problems by Short
Logarithmic-Space Reductions
and Immediate Consequences
of the Linear Space Hypothesis
Tomoyuki Yamakami(B)
The study of minimal memory space has attracted significant attention in
real-life circumstances in which vast data sets must be managed for network
users operating memory-limited computing devices. It is therefore useful
in general to concentrate on the study of space-bounded computability within
reasonable execution time. In the past literature, special attention has been paid
to polynomial-time algorithms using logarithmic memory space and two corre-
sponding space-bounded complexity classes: L (deterministic logarithmic space)
and NL (nondeterministic logarithmic space).
In association with L and NL, various combinatorial problems have been
discussed by, for instance, Jones, Lien, and Laaser [8], Cook and McKenzie [4],
Àlvarez and Greenlaw [1], and Jenner [7]. Many graph properties, in particular,
can be algorithmically checked using small memory space. Using only logarithmic
space¹ (cf. [1,10]), for instance, we can easily solve the problems of determining
whether or not a given graph is a bipartite graph, a comparability graph, a
chordal graph, an interval graph, or a split graph. On the contrary, the directed
s-t connectivity problem (DSTCON) and the 2CNF Boolean formula satisfiability
problem (2SAT) are known to be NL-complete [8] (together with the result of
[6,12]) and seem to be unsolvable using logarithmic space. To understand the
nature of NL better, it is greatly beneficial to study more interesting problems
that fall into the category of NL-complete problems.
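As a concrete point of reference, DSTCON itself is easy to solve deterministically in polynomial time; the subtlety is space, not time. The sketch below (a Python illustration of our own, with a hypothetical function name and edge-list encoding) decides directed s-t connectivity by breadth-first search, which uses linear space, whereas the NL upper bound comes from guessing a path vertex by vertex within logarithmic space.

```python
from collections import deque

def dstcon(n, edges, s, t):
    """Decide directed s-t connectivity (DSTCON) by breadth-first search.

    Polynomial time, but Theta(n) space in the worst case; the point of
    the NL upper bound is that a nondeterministic machine can instead
    guess a path vertex by vertex within O(log n) space.
    """
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
    seen = {s}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False
```

For instance, dstcon(3, [(0, 1), (1, 2)], 0, 2) returns True, while swapping s and t on the same graph yields False.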
problems (L, m) form the complexity class PsubLIN [14] (see Sect. 2.2 for more
details). It is natural to ask if all NL problems parameterized by log-space size
parameters (or briefly, parameterized-NL problems) are solvable in polynomial
time using sub-linear space. To tackle this important question, we zero in on the
most difficult (or “complete”) parameterized-NL problems. As a typical exam-
ple, let us consider the 3-bounded 2SAT, denoted 2SAT3 , for which every variable
in a 2CNF Boolean formula φ appears at most 3 times in the form of literals,
parameterized by mvbl (φ) (succinctly, (2SAT3 , mvbl )). It was proven in [13] that
(2SAT3 , mvbl ) is complete for the class of all parameterized-NL problems.
Lately, a practical working hypothesis, known as the linear space hypoth-
esis [14], was proposed in connection to the computational hardness of
parameterized-NL problems. This linear space hypothesis (LSH) asserts that
(2SAT3 , mvbl ) cannot be solved by any polynomial-time algorithm using sub-
linear space. From the NL-completeness of 2SAT3 , LSH immediately derives
long-awaited complexity-class separations, including L ≠ NL and LOGDCFL ≠
LOGCFL, where LOGDCFL and LOGCFL are respectively the log-space many-
one closure of DCFL (deterministic context-free language class) and CFL
(context-free language class) [14]. Moreover, under the assumption of LSH, it
follows that 2-way non-deterministic finite automata are simulated by “narrow”
alternating finite automata [15].
Notice that the completeness notion requires “reductions” between two prob-
lems. The standard NL-completeness notion uses logarithmic-space (or log-
space) reductions. Those standard reductions, however, seem to be too pow-
erful to use in many real-life circumstances. Furthermore, PsubLIN is not even
known to be closed under the standard log-space reductions. Therefore, much
weaker reductions may be more suitable to discuss the computational hardness
of various real-life problems. A weaker notion, called “short” log-space reduc-
tions, was in fact invented and studied intensively in [13,14]. The importance of
such reductions is exemplified by the fact that PsubLIN is indeed closed under
short log-space reductions.
The key question of whether LSH is true may hinge on the intensive study of
parameterized-NL "complete" problems. It is unfortunate, however, that only a
few parameterized decision problems have been proven to be equivalent in com-
putational complexity to (2SAT3, mvbl) by short log-space reductions, and this
fact drives us to seek out new parameterized decision problems in this exposi-
tion in the hope that we will eventually come to the point of proving the validity
of LSH. This exposition is therefore devoted to proposing a new set of problems
and proving their equivalence to (2SAT3 , mvbl ) by appropriate short log-space
reductions.
To replenish the existing short list of parameterized-NL complete problems,
after reviewing fundamental notions and notation in Sect. 2, we will propose
three new decision problems in NL, which are obtained by placing “natural”
We briefly describe basic notions and notation necessary to read through the
rest of this exposition.
We denote by N the set of all natural numbers including 0, and denote by Z the
set of all integers. Let N+ = N − {0}. Given two numbers m, n ∈ Z with m ≤ n,
the notation [m, n]Z expresses the integer interval {m, m + 1, . . . , n}. We further
abbreviate [1, n]Z as [n] whenever n ≥ 1. All polynomials are assumed to take
non-negative coefficients and all logarithms are taken to the base 2. The informal
notion polylog(n) refers to an arbitrary polynomial in log n. Given a (column)
vector v = (a1 , a2 , . . . , ak )T (where “T ” is a transpose operator) and a number
i ∈ [k], the notation v(i) indicates the ith entry ai of v. For two (column) vectors
u and v of dimension n, the notation u ≥ v means that the inequality u(i) ≥ v(i)
holds for every index i ∈ [n]. A k-set refers to a set that consists of exactly k
distinct elements.
An alphabet is a finite nonempty set of “symbols” or “letters”. Given an
alphabet Σ, a string over Σ is a finite sequence of symbols in Σ. The length of
a string x, denoted |x|, is the total number of symbols in x. The notation Σ ∗
denotes the set of all strings over Σ. A language over Σ is a subset of Σ ∗ .
In this exposition, we will consider directed and undirected graphs and each
graph is expressed as (V, E) with a set V of vertices and a set E of edges. An edge
between two vertices u and v in a directed graph is denoted by (u, v), whereas
the same edge in an undirected graph is denoted by {u, v}. Two vertices are
called adjacent if there is an edge between them. When there is a path from
vertex u to vertex v, we succinctly write u ⇝ v. An edge of G is said to be a
grip if both of its endpoints have degree at most 2. Given a graph G = (V, E), we
set mver (G) = |V | and medg (G) = |E|. The following property is immediate.
Lemma 1. For any connected graph G whose degree is at most k, it follows that
mver (G) ≤ 2medg (G) and medg (G) ≤ kmver (G)/2.
log-space size parameters are equally difficult to solve in polynomial time using
sub-linear space. This is because PsubLIN is not yet known to be closed under
standard L-m-reductions. Fortunately, PsubLIN is proven to be closed under
slightly weaker reductions, called “short” reductions [13,14].
The short L-m-reducibility between two parameterized decision problems
(P1, m1) and (P2, m2) is given as follows: (P1, m1) is short L-m-reducible to
(P2, m2), denoted by (P1, m1) ≤sL_m (P2, m2), if there is a polynomial-time,
logarithmic-space computable function f (which is called a reduction function)
and two constants k1, k2 > 0 such that, for any input string x, (i) x ∈ P1 iff
f(x) ∈ P2 and (ii) m2(f(x)) ≤ k1 m1(x) + k2. Instead of using f, if we use a
polynomial-time logarithmic-space oracle Turing machine M to reduce (P1, m1)
to (P2, m2) with the extra requirement of m2(z) ≤ k1 m1(x) + k2 for any query
word z made by M on input x for oracle P2, then (P1, m1) is said to be short
L-T-reducible to (P2, m2), denoted by (P1, m1) ≤sL_T (P2, m2).
For any reduction ≤ in {≤sL_m, ≤sL_T}, we say that two parameterized deci-
sion problems (P1, m1) and (P2, m2) are inter-reducible (to each other) by ≤-
reductions if both (P1, m1) ≤ (P2, m2) and (P2, m2) ≤ (P1, m1) hold; in this
case, we briefly write (P1, m1) ≡ (P2, m2).
Lemma 2. [14] Let (L1, m1) and (L2, m2) be two arbitrary parameterized deci-
sion problems. (1) If (L1, m1) ≤sL_m (L2, m2), then (L1, m1) ≤sL_T (L2, m2). (2) If
(L1, m1) ≤sL_T (L2, m2) and (L2, m2) ∈ PsubLIN, then (L1, m1) ∈ PsubLIN.
One of the first problems that were proven to be NP-complete in the past lit-
erature is the 3CNF Boolean formula satisfiability problem (3SAT), which asks
whether or not a given 3CNF Boolean formula φ is satisfiable [3]. In sharp
contrast, its natural variant, called the 2CNF Boolean formula satisfiability
problem (2SAT), was proven to be NL-complete [8] (together with the results
of [6,12]). Let us further consider its natural restriction introduced in [14]. Let
k ≥ 2.
k-Bounded 2CNF Boolean Formula Satisfiability Problem
(2SATk ):
◦ Instance: a 2CNF Boolean formula φ whose variables occur at most k times
each in the form of literals.
◦ Question: is φ satisfiable?
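Although the exposition is concerned with the space complexity of 2SATk, it may help to recall how 2SAT is decided in practice. A clause x ∨ y contributes the implications ¬x → y and ¬y → x, and the formula is unsatisfiable exactly when some variable shares a strongly connected component of the implication graph with its negation. The following sketch (our own illustration; Kosaraju's SCC computation uses linear space, so this is not the log-space-flavored NL algorithm) decides any 2CNF formula given as clauses of signed integers.

```python
def two_sat(n, clauses):
    """Decide satisfiability of a 2CNF formula.

    Variables are 1..n; a literal is +i or -i.  Each clause (a, b)
    contributes implications (-a -> b) and (-b -> a).  The formula is
    satisfiable iff no variable lies in the same strongly connected
    component of the implication graph as its negation.
    """
    def idx(lit):  # map a literal to a vertex index in 0..2n-1
        return 2 * (abs(lit) - 1) + (1 if lit < 0 else 0)

    adj = [[] for _ in range(2 * n)]
    radj = [[] for _ in range(2 * n)]
    for a, b in clauses:
        for u, v in ((-a, b), (-b, a)):
            adj[idx(u)].append(idx(v))
            radj[idx(v)].append(idx(u))

    # Kosaraju's algorithm (iterative) to label SCCs.
    order, visited = [], [False] * (2 * n)
    for start in range(2 * n):
        if visited[start]:
            continue
        visited[start] = True
        stack = [(start, iter(adj[start]))]
        while stack:
            v, it = stack[-1]
            for w in it:
                if not visited[w]:
                    visited[w] = True
                    stack.append((w, iter(adj[w])))
                    break
            else:
                order.append(v)
                stack.pop()
    comp, label = [-1] * (2 * n), 0
    for start in reversed(order):
        if comp[start] != -1:
            continue
        comp[start] = label
        stack = [start]
        while stack:
            v = stack.pop()
            for w in radj[v]:
                if comp[w] == -1:
                    comp[w] = label
                    stack.append(w)
        label += 1
    return all(comp[2 * i] != comp[2 * i + 1] for i in range(n))
```

For example, two_sat(1, [(1, 1), (-1, -1)]) encodes x1 ∧ ¬x1 and returns False.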
As natural log-space size parameters for the decision problem 2SATk , we use
the aforementioned size parameters mvbl (φ) and mcls (φ).
Unfortunately, not all NL-complete problems are proven to be inter-reducible
to one another by short log-space reductions. An example of NL-complete prob-
lems that are known to be inter-reducible to (2SAT3 , mvbl ) is a variant of the
directed s-t-connectivity problem whose instance graphs have only vertices of
degree at most k (kDSTCON) for any number k ≥ 3.
Definition 1. The linear space hypothesis (LSH) asserts, as noted in Sect. 1.2,
the insolvability of the specific parameterized decision problem (2SAT3 , mvbl )
within polynomial time using only sub-linear space.
In other words, LSH asserts that (2SAT3, mvbl) ∉ PsubLIN. Note that, since
PsubLIN is closed under short L-T-reductions by Lemma 2(2), if a parameterized
decision problem (A, m) satisfies (A, m) ≡sL_T (2SAT3, mvbl), we can freely replace
(2SAT3, mvbl) in the definition of LSH by (A, m). The use of short L-T-reductions
here can be relaxed to the much weaker notion of short SLRF-T-reductions [13,14].
Lemma 4. [14] The following parameterized decision problems are all inter-
reducible to one another by short L-m-reductions: (LP2,k , mrow ), (LP2,k , mcol ),
and (2SAT3 , mvbl ) for every index k ≥ 3.
Proof. The reduction (LP2,k, mcol) ≤sL_m (2LP2,k, mcol) is easy to verify by set-
ting b2 = b and b1 = (b′i)i with b′i = max{|aij1| + |aij2| : j1, j2 ∈ [n], j1 < j2} for
any instance pair A = (aij)ij and b = (bi)i given to LP2,k. Since the description
size of (A, b1, b2) is proportional to that of (A, b), the reduction is indeed "short".
To verify the opposite reducibility (2LP2,k, mcol) ≤sL_m (LP2,3, mcol), it
suffices to prove that (2LP2,k, mcol) ≤sL_m (LP2,k+2, mcol) since (LP2,l, mcol) ≡sL_m
(LP2,3, mcol) for any l ≥ 3 by Lemma 4. Take an arbitrary instance (A, b, b′)
given to 2LP2,k and assume that A = (aij)ij is an m × n matrix and b = (bi)i
and b′ = (b′i)i are two (column) vectors of dimension m. We wish to reduce
(A, b, b′) to an appropriate instance (D, c) for LP2,k+2, where D = (dij)ij is a
(2m + 2n) × 2n matrix and c = (ci)i is a (2m + 2n)-dimensional vector. For all
index pairs i ∈ [m] and j ∈ [n], let dij = aij, dm+i,n+j = −aij, and di,n+j =
dm+i,j = 0. For all indices i ∈ [m], let ci = bi and cm+i = −b′i. Moreover, for any
index j ∈ [n], we set d2m+j,j = 1, d2m+j,n+j = −1, and c2m+j = 0. In addi-
tion, we set d2m+n+j,j = −1, d2m+n+j,n+j = 1, and c2m+n+j = 0. Notice that each
column of D has at most k + 2 nonzero entries and each row of D has at most
2 nonzero entries.
Let x = (xj)j denote a {0, 1}-vector of dimension n for A and let y = (yj)j
denote a {0, 1}-vector of dimension 2n for D satisfying yj = xj and yn+j = xj
for any j ∈ [n]. It then follows that the inequality Σ_{j=1}^{n} dij yj ≥ ci is equiva-
lent to Σ_{j=1}^{n} aij xj ≥ bi. Furthermore, Σ_{j=1}^{n} dm+i,n+j yn+j ≥ cm+i is equivalent
to −Σ_{j=1}^{n} aij xj ≥ −b′i, which is the same as Σ_{j=1}^{n} aij xj ≤ b′i. Therefore,
we conclude that b′ ≥ Ax ≥ b iff Dy ≥ c. In other words, it follows that
(A, b, b′) ∈ 2LP2,k iff (D, c) ∈ LP2,k+2.
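One way to make the construction above concrete is the following sketch, which builds (D, c) from (A, b, b′) and checks the claimed equivalence by exhaustive search over {0,1}-vectors. The row layout is one consistent choice of indexing, not necessarily the paper's exact one, and the function names are our own.

```python
from itertools import product

def reduce_2lp_to_lp(A, b, b_prime):
    """Build (D, c) so that D y >= c has a {0,1}-solution iff
    b <= A x <= b_prime has one: copy A for the lower bounds,
    negate it on fresh columns for the upper bounds, and add paired
    rows forcing each variable to equal its copy."""
    m, n = len(A), len(A[0])
    D, c = [], []
    for i in range(m):                       # rows encoding  A x >= b
        D.append(list(A[i]) + [0] * n)
        c.append(b[i])
    for i in range(m):                       # rows encoding -A x >= -b'
        D.append([0] * n + [-a for a in A[i]])
        c.append(-b_prime[i])
    for j in range(n):                       # rows forcing y_j = y_{n+j}
        for sgn in (1, -1):
            row = [0] * (2 * n)
            row[j], row[n + j] = sgn, -sgn
            D.append(row)
            c.append(0)
    return D, c

def feasible(M, v):
    """Is there a {0,1}-vector y with M y >= v componentwise?"""
    n = len(M[0])
    return any(all(sum(M[i][j] * y[j] for j in range(n)) >= v[i]
                   for i in range(len(M)))
               for y in product((0, 1), repeat=n))
```

With A = [[1, 1], [0, 1]], b = [1, 0], b′ = [1, 1], the original system is feasible (e.g., x = (1, 0)) and so is the translated one; tightening b to [2] on a single row with b′ = [1] makes both infeasible.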
in Lemma 5, LE2,k falls into L, and thus this fact signifies how narrow the gap
between NL and L is. For the proof of the lemma, we define the exclusive-or
clause (or the ⊕-clause) of two literals x and y to be the formula x ⊕ y. The
problem ⊕2SAT asks whether, for a given collection C of ⊕-clauses, there exists
a truth assignment σ that forces all ⊕-clauses in C to be true. It is known that
⊕2SAT is in L [8].
Proof. Consider any instance (A, b) given to LE2,k . Since ⊕2SAT ∈ L, it suffices
to reduce LE2,k to ⊕2SAT by standard L-m-reductions. Note that the equation
Ax = b is equivalent to aij1 xj1 + aij2 xj2 = bi for all i ∈ [m], where aij1 and aij2
are nonzero entries of A with j1, j2 ∈ [n]. Fix an index i ∈ [m] and consider the
first case where j1 = j2. In this case, we translate aij1 xj1 = bi into the ⊕-clause
vj1 ⊕ 0 if xj1 = 1, and vj1 ⊕ 1 otherwise. In the other case of j1 ≠ j2, on the
contrary, we translate aij1 xj1 + aij2 xj2 = bi into the two ⊕-clauses {vj1 ⊕ 0, vj2 ⊕ 1} if
(xj1, xj2) = (1, 0), and the other values of (xj1, xj2) are similarly treated. Finally,
we define C to be the collection of all ⊕-clauses obtained by the aforementioned
translations. It then follows that Ax = b iff C is satisfiable. This implies that
(A, b) ∈ LE2,k iff C ∈ ⊕2SAT.
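The parity constraints arising from this translation can be decided with a union-find structure that records each variable's parity relative to its representative. The sketch below is our own illustration (it uses linear space, so it only mirrors the logic of the reduction, not the log-space bound of ⊕2SAT ∈ L); a constraint (u, v, t) demands x_u ⊕ x_v = t, with index 0 standing for the constant false, so an ⊕-clause such as v1 ⊕ 1 (forcing v1 = 0) becomes (1, 0, 0).

```python
def xor_2sat(n, constraints):
    """Decide satisfiability of a system of parity constraints.

    Variables are 1..n; node 0 stands for the constant False.  Each
    constraint (u, v, t) demands x_u XOR x_v = t.  Union-find with a
    parity tag detects any contradictory cycle of constraints.
    """
    parent = list(range(n + 1))
    parity = [0] * (n + 1)        # parity of each node w.r.t. its parent

    def find(v):
        if parent[v] == v:
            return v, 0
        root, p = find(parent[v])     # p: parity of parent[v] w.r.t. root
        parity[v] ^= p                # make parity[v] relative to root
        parent[v] = root
        return root, parity[v]

    for u, v, t in constraints:
        ru, pu = find(u)
        rv, pv = find(v)
        if ru == rv:
            if pu ^ pv != t:          # contradicting parity requirement
                return False
        else:                         # merge roots, fixing relative parity
            parent[ru] = rv
            parity[ru] = pu ^ pv ^ t
    return True
```

For instance, the clauses x1 ⊕ x2 together with x1 ⊕ 1 are satisfiable (x1 = 0, x2 = 1), whereas demanding both x1 ⊕ x2 = 1 and x1 ⊕ x2 = 0 is not.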
The vertex cover problem (VC) is a typical NP-complete problem, which has
been used as a basis of the completeness proofs of many other NP problems,
including the clique problem and the independent set problem (see, e.g., [5,9]).
For a given undirected graph G = (V, E), a vertex cover for G is a subset V′
of V such that, for each edge {u, v} ∈ E, at least one of the endpoints u and v
belongs to V′.
The problem VC remains NP-complete even if its instances are limited to
planar graphs. Similarly, the vertex cover problem restricted to graphs of degree
at most 3 is also NP-complete; however, the same problem falls into L if we
require graphs to have degree at most 2. We wish to seek out a reasonable
setting situated between those two special cases. For this purpose, we intend
to partition all edges into two categories: grips and non-grips (where "grips"
are defined in Sect. 2.1). Since grips have a simpler structure than non-grips,
the grips need to be treated slightly differently from the others. In particular, we
request an additional condition, called 2-checkeredness, which is described as
follows. A subset V′ of V is called 2-checkered exactly when, for any edge e ∈ E,
if both endpoints of e are in V′, then e must be a grip. The 2-checkered vertex
cover problem is introduced in the following way.
2-Checkered Vertex Cover Problem (2CVC):
◦ Instance: an undirected graph G = (V, E).
◦ Question: is there a 2-checkered vertex cover for G?
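Operationally, a 2-checkered vertex cover must touch every edge and may contain both endpoints of an edge only when that edge is a grip. The brute-force sketch below (our own illustration with hypothetical function names; 2CVC is of course not meant to be solved by enumeration) makes the definition executable for small graphs.

```python
from itertools import combinations

def is_2_checkered_cover(vertices, edges, cover):
    """Check that `cover` is a vertex cover and that any edge with both
    endpoints covered is a grip (both endpoints of degree at most 2)."""
    deg = {v: 0 for v in vertices}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    for u, v in edges:
        if u not in cover and v not in cover:
            return False          # some edge is left uncovered
        if u in cover and v in cover and not (deg[u] <= 2 and deg[v] <= 2):
            return False          # both endpoints covered, yet not a grip
    return True

def has_2_checkered_cover(vertices, edges):
    """Brute-force 2CVC decider for tiny graphs."""
    vs = sorted(vertices)
    return any(is_2_checkered_cover(vertices, edges, set(sub))
               for r in range(len(vs) + 1)
               for sub in combinations(vs, r))
```

On the path a–b–c, the singleton {b} is a 2-checkered vertex cover, so the path is a yes-instance.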
Associated with the decision problem 2CVC, we set up the log-space size
parameters: mver (G) and medg (G), which respectively express the total number
of the vertices of G and that of the edges of G.
Given an instance of graph G = (V, E) to 2CVC, if we further demand that
every vertex in V should have degree at most k for any fixed constant k ≥ 3,
then we denote by 2CVCk (Degree-k 2CVC) the problem obtained from 2CVC.
There exists a close connection between the parameterizations of 2CVC3 and
2SAT3 .
Theorem 1. (2CVC3, mver) ≡sL_m (2CVC3, medg) ≡sL_m (2SAT3, mvbl).
Proof. Firstly, it is not difficult to show that (2CVC3, mver) ≡sL_m (2CVC3, medg)
by Lemma 1.
Next, we intend to prove that (2SAT3, mvbl) ≤sL_m (2CVC3, mver). Let φ be
any instance to 2SAT3 made up of a set U = {u1 , u2 , . . . , un } of variables and
a set C = {c1, c2, . . . , cm} of 2CNF Boolean clauses. For convenience, we write
Ū for the set {ū1, ū2, . . . , ūn} of negated variables and define Û = U ∪ Ū. In the
case where a clause contains any removable literal x, it is possible to delete all
clauses that contain x, because we can freely assign T (true) to x. Without loss of
generality, we assume that there is no removable literal in φ. We further assume
that φ is an exact 2CNF formula in a clean shape (explained in Sect. 2.1). Since
every clause has exactly two literals, each clause cj is assumed to have the form
cj [1] ∨ cj [2] for any index j ∈ [m], where cj [1] and cj [2] are treated as “labels”
that represent two literals in the clause cj .
Let us construct an undirected graph G = (V, E) as follows. We define
V = {ui(1), ui(2), cj[1], cj[2] | i ∈ [n], j ∈ [m]} and we set Ũ to be {ui(1), ui(2) |
i ∈ [n]} by writing ui for ui(1) and ūi for ui(2). We further set E as the union
of {{ui(1), ui(2)}, {cj[1], cj[2]} | i ∈ [n], j ∈ [m]} and {{z, cj[l]} | z ∈ Ũ, l ∈
[2], and cj[l] represents z}. Since each clause contains exactly two literals, it fol-
lows that deg(cj[1]) = deg(cj[2]) = 2. Thus, the edge {cj[1], cj[2]} for each index
j is a grip. Moreover, since each variable ui appears at most 3 times in the
form of literals (because of the condition of 2SAT3), deg(ui(1)) + deg(ui(2)) ≤ 5.
Since no removable literal exists in φ, we obtain max{deg(ui(1)), deg(ui(2))} ≤ 3.
It follows by the definition that mver(G) = 2(|U| + |C|) ≤ 8|U| = 8mvbl(φ) since
|C| ≤ 3|U|.
Here, we want to verify that φ ∈ 2SAT3 iff G ∈ 2CVC3 . Assume that φ ∈
2SAT3 . Let σ : U → {T, F } be any truth assignment that makes φ satisfiable.
We naturally extend σ to a map from Û to {T, F } by setting σ(ū) to be the
opposite of σ(u). Its corresponding vertex cover Cσ is defined in two steps.
Initially, Cσ contains all elements z ∈ Û satisfying σ(z) = F . For each index
j ∈ [m], let Aj = {i ∈ [2] | ∃z ∈ Û [cj [i] represents z and σ(z) = T ]}. Notice that
Aj ⊆ {1, 2}. If Aj = {i} for a certain index i ∈ [2], then we append to Cσ the
vertex cj [i]; however, if Aj = {1, 2}, then we append to Cσ the two vertices cj [1]
and cj [2] instead.
To illustrate our construction, let us consider a simple example: φ ≡ c1 ∧
c2 ∧ c3 ∧ c4 with c1 ≡ u1 ∨ ū2, c2 ≡ u2 ∨ u1, c3 ≡ ū1 ∨ u3, and c4 ≡ u2 ∨ ū3.
The corresponding graph G is drawn in Fig. 1. Take the truth assignment σ that
satisfies σ(u1) = σ(u2) = σ(u3) = T. We then obtain A1 = {1}, A2 = {1, 2},
A3 = {2}, and A4 = {1}. Therefore, the resulting 2-checkered vertex cover Cσ is
the set {u1(2), u2(2), u3(2), c1[1], c2[1], c2[2], c3[2], c4[1]}.
By the definition of Cσ, we conclude that G belongs to 2CVC3. Conversely,
we assume that φ ∉ 2SAT3. Consider any truth assignment σ for φ and con-
struct Cσ as before. By the construction of Cσ, if Cσ is a 2-checkered vertex
cover, then σ should force φ to be true, a contradiction. Hence, G ∉ 2CVC3
follows. Overall, it follows that φ ∈ 2SAT3 iff G ∈ 2CVC3. Therefore, we obtain
(2SAT3, mvbl) ≤sL_m (2CVC3, mver).
Conversely, we need to prove that (2CVC3 , mver ) ≤sL m (2SAT3 , mvbl ). Given
an undirected graph G = (V, E), we want to define a 2CNF Boolean formula
φ to which G reduces. Let V = {v1 , v2 , . . . , vn } and E = {e1 , e2 , . . . , em } for
certain numbers m, n ∈ N+ .
Hereafter, we use the following abbreviations: u → v for ū ∨ v, u ↔ v for
(u → v) ∧ (v → u), and u ↮ v for (u ∨ v) ∧ (ū ∨ v̄). Notice that, as the notation
↮ itself suggests, u ↮ v is logically equivalent to the negation of u ↔ v.
We first define a set U of variables to be V. For each edge e = {u, v} ∈ E, we
define Ce as follows. If one of u and v has degree more than 2, then we set Ce
to be u ↮ v; otherwise, we set Ce to be u ∨ v. Finally, we define C to be the set
{Ce | e ∈ E}. Let φ denote the 2CNF Boolean formula made up of all clauses in
C.
Next, we intend to verify that G has a 2-checkered vertex cover iff φ is
satisfiable. Assume that G has a 2-checkered vertex cover, say, V′. Consider C
obtained from G. We define a truth assignment σ by setting σ(v) = T iff v ∈ V′.
Take any edge e = {u, v}. If one of u and v has degree more than 2, then either
(u ∈ V′ and v ∉ V′) or (u ∉ V′ and v ∈ V′) holds, and thus σ forces u ↮ v to
be true. Otherwise, since either u ∈ V′ or v ∈ V′, σ forces u ∨ v to be true. This
concludes that φ is satisfiable. On the contrary, we assume that φ is satisfiable
by a certain truth assignment, say, σ; that is, for any edge e ∈ E, σ forces Ce to
be true. We define a subset V′ of V as V′ = {v ∈ V | σ(v) = T}. Let e = {u, v}
be any edge. If Ce has the form u ∨ v for u, v ∈ V, then either u or v should
Proof. Assume that LSH is true. If (2CVC, mver) is solvable in polynomial time
using O(mver(x)^{1−ε}) space for a certain constant ε ∈ (0, 1), since 2CVC3 is a
"natural" subproblem of 2CVC, Theorem 1 implies the existence of a polynomial-
time algorithm that solves (2SAT3, mvbl) using O(mvbl(x)^{1−ε}) space as well.
This implies that LSH is false, a contradiction.
The exact cover by 3-sets problem (3XC) was shown to be NP-complete [9].
Fixing a universe X, let us choose a collection C of subsets of X. We say that
C is a set cover for X if every element in X is contained in a certain set in C.
Furthermore, given a subset R ⊆ X, C is said to be an exact cover for X exempt
from R if (i) every element in X − R is contained in a unique member of C and
(ii) every element in R appears in at most one member of C. When R = ∅, we
say that C is an exact cover for X. Notice that any exact cover with exemption
is a special case of a set cover.
To obtain a decision problem in NL, we need one more restriction. Given a
collection C ⊆ P(X), we introduce a measure, called “overlapping cost,” of an
element of any set in C as follows. For any element u ∈ X, the overlapping cost
of u with respect to (w.r.t.) C is the cardinality |{A ∈ C | u ∈ A}|. With the
use of this special measure, we define the notion of k-overlappingness for any
k ≥ 2 as follows. We say that C is k-overlapping if the overlapping cost of every
element u in X w.r.t. C is at most k.
2-Overlapping Exact Cover by k-Sets with Exemption Problem
(kXCE2 ):
◦ Instance: a finite set X, a subset R of X, and a 2-overlapping collection C
of subsets of X such that each set in C has at most k elements.
◦ Question: does C contain an exact cover for X exempt from R?
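The conditions in this definition are straightforward to state executably. The following sketch (our own illustration, with hypothetical function names and a brute-force decider suitable only for tiny instances) checks 2-overlappingness and the exact-cover-with-exemption property.

```python
from itertools import combinations

def is_exact_cover_exempt(X, R, chosen):
    """Exact cover for X exempt from R: every element of X - R lies in
    exactly one chosen set, every element of R in at most one."""
    count = {u: 0 for u in X}
    for S in chosen:
        for u in S:
            count[u] += 1
    return all(count[u] <= 1 if u in R else count[u] == 1 for u in X)

def is_k_overlapping(X, C, k):
    """Every element of X lies in at most k sets of the collection C."""
    return all(sum(1 for S in C if u in S) <= k for u in X)

def xce(X, R, C):
    """Brute-force decider: does C contain an exact cover for X exempt
    from R?"""
    sets = [frozenset(S) for S in C]
    return any(is_exact_cover_exempt(X, R, chosen)
               for r in range(len(sets) + 1)
               for chosen in combinations(sets, r))
```

For X = {1, 2, 3} with R = {3} and C = [{1, 2}, {2, 3}, {3}], the single set {1, 2} already works (3 is exempt), whereas without the exemption set no subcollection of [{1, 2}, {2, 3}] covers each element exactly once.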
The use of an exemption set R in the above definition is crucial. If we are
given a 2-overlapping family C of subsets of X as an input and then ask for the
existence of an exact cover for X, then the corresponding problem is rather easy
to solve in log space [1].
The size parameter mset for kXCE2 satisfies mset(⟨X, R, C⟩) = |C|, provided
that all elements of X are expressed in O(log |X|) binary symbols. Obviously,
mset is a log-space size parameter. In what follows, we consider 3XCE2 parame-
terized by mset, (3XCE2, mset), and prove its inter-reducibility to (2SAT3, mvbl).
terized by mset , (3XCE2 , mset ), and prove its inter-reducibility to (2SAT3 , mvbl ).
Theorem 2. (3XCE2, mset) ≡sL_m (2SAT3, mvbl).
σ(x3) = F. The set R consists of the elements of the form xi[j] and x̄i[j]
for all indices i ∈ [3] and j ∈ [4]. The exact cover Xσ for X exempt from R
consists of {x1[1], s1}, {x1[2], s2}, {x2[3], s3}, {x3[4], s4}, {x̄1[2], t1[1], t1[2]}, and
{x̄3[2], t3[1], t3[2]}.
Returning to the proof, let us define two groups of sets. For each index
j ∈ [m], Aj is composed of the following 2-sets: Aj = {{zi1[j], sj}, {zi2[j], sj} |
i1, i2 ∈ [n], Cj = {zi1, zi2} ⊆ V̂}. Associated with V, we set V(+) to be composed
of all variables xi such that xi appears in two clauses and x̄i appears in one clause.
Similarly, let V(−) be composed of all variables xi such that xi appears in one
clause and x̄i appears in two clauses. In addition, let V(∗) consist of all other
variables. Note that, since there is no removable literal in φ, any variable xi in
V(∗) appears in one clause and its negation x̄i appears also in one clause. Our
assumption guarantees that V = V(+) ∪ V(−) ∪ V(∗). For each variable xi ∈ V(+),
we set Bi(+) = {{xi[j1], ti[1]}, {xi[j2], ti[2]}, {x̄i[j3], ti[1], ti[2]}}, provided that
Cj1 and Cj2 both contain xi and Cj3 contains x̄i for certain indices j1, j2,
and j3 with j1 < j2. Similarly, for each variable xi ∈ V(−), we set Bi(−) =
{{x̄i[j1], ti[1]}, {x̄i[j2], ti[2]}, {xi[j3], ti[1], ti[2]}}, provided that Cj1 and Cj2 both
contain x̄i and Cj3 contains xi. In contrast, given any variable xi ∈ V(∗), we
define Bi(∗) = {{xi[j1], ti[1]}, {x̄i[j2], ti[1]}}. Finally, we set D = (∪_{j∈[m]} Aj) ∪
(∪_{i∈[n]} (Bi(+) ∪ Bi(−) ∪ Bi(∗))). Notice that every element in X is covered by exactly
two sets in D. To complete our construction, an exemption set R is defined to
be X1.
Hereafter, we intend to verify that φ is satisfiable iff there exists an exact
cover for X exempt from R. Given a truth assignment σ : V → {T, F}, we
define a set Xσ as follows. We first define X′1 to be the set {{z[j], sj} | j ∈
[m], σ(z) = T, z ∈ V̂}. For each element xi ∈ V(+), if σ(xi) = F, then we
set X2,i = {{xi[j1], ti[1]}, {xi[j2], ti[2]}} ⊆ Bi(+), and if σ(xi) = T, then we
set X2,i = {{x̄i[j3], ti[1], ti[2]}} ⊆ Bi(+). Similarly, for each element xi ∈ V(−),
we define X2,i. For any element xi ∈ V(∗), however, if σ(z) = F for a literal
z ∈ {xi, x̄i}, then we define X2,i = {{z[j1], ti[1]}} ⊆ Bi(∗). In the end, Xσ is set
to be the union X′1 ∪ (∪_{i∈[n]} X2,i). Assume that φ is true by σ. Since all clauses
Cj are true by σ, each sj in X2 has overlapping cost of 1 in Xσ. Moreover,
each ti[j] in X3 has overlapping cost of 1 in Xσ. Either xi[j] or x̄i[j] in X1 has
overlapping cost of at most 1. Thus, Xσ is an exact cover for X exempt from R
(= X1).
On the contrary, we assume that X′ is an exact cover for X exempt
from R. We define a truth assignment σ as follows. For each Aj, if
{zid[j], sj} ∈ X′ for a certain index d ∈ [2], then we set σ(zid) = T. For
each Bi(+), if {xi[j1], ti[1]}, {xi[j2], ti[2]} ∈ X′, then we set σ(xi) = F. If
{x̄i[j3], ti[1], ti[2]} ∈ X′, then we set σ(xi) = T. The case of Bi(−) is similarly
handled; for any variable z left unassigned, we set σ(z) = F. Since X′ is an exact
cover for X − R, for any clause Cj, there exists exactly one z in Cj satisfying
σ(z) = T.
We call (v, v) a trivial pair and we first include all trivial pairs in M. We
then eliminate the trivial perfect matching M = {(u, u) | u ∈ X} from our
consideration by introducing the following restriction. Given any subset M′ ⊆
M and two elements x, y ∈ X, we say that x is linked to y in M′ if there
exists a series z1, z2, . . . , zt ∈ X with a certain odd number t ≥ 1 such that
(x, z1), (zt, y) ∈ M′ and (zi, zi+1) ∈ M′ for any index i ∈ [t − 1]. For any
subset R of X, we say that R is uniquely connected to X − R in M if, for
any element v ∈ R, there exist two unique elements u1, u2 ∈ X − R such that
(v, u1), (u2, v) ∈ M.
As the desired variant of 2DM, we introduce the following decision problem
and study its computational complexity.
Almost All Pairs 2-Dimensional Matching Problem with Trivial
Pairs (AP2DM):
◦ Instance: a finite set X, a subset R of X, and a subset M ⊆ X ×X including
all trivial pairs such that R is uniquely connected to X − R in M .
◦ Question: is it true that, for any distinct pair v, w ∈ X, if either v ∉ R or
w ∉ R, then there exists a perfect matching Mvw in M for which v is linked
to w in Mvw?
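The two auxiliary notions in this definition translate directly into code. The sketch below (our own illustration, with hypothetical function names) tests whether x is linked to y in a subset M′, that is, whether some chain x, z1, . . . , zt, y with t odd lies in M′, equivalently a directed walk of even length at least 2, and whether R is uniquely connected to X − R.

```python
from collections import deque

def linked(M_sub, x, y):
    """x is linked to y in M_sub: a chain x, z1, ..., zt, y with t odd
    whose consecutive pairs all lie in M_sub, i.e., a directed walk of
    even length >= 2 from x to y."""
    succ = {}
    for u, v in M_sub:
        succ.setdefault(u, []).append(v)
    seen = {(x, 0)}
    queue = deque([(x, 0)])
    while queue:                      # BFS over (vertex, length mod 2)
        u, p = queue.popleft()
        for v in succ.get(u, []):
            q = 1 - p                 # parity of the walk length so far
            if v == y and q == 0:
                return True           # even length, at least two steps
            if (v, q) not in seen:
                seen.add((v, q))
                queue.append((v, q))
    return False

def uniquely_connected(X, R, M):
    """Each v in R has exactly one successor and exactly one predecessor
    taken from X - R."""
    for v in R:
        outs = {u for (a, u) in M if a == v and u not in R}
        ins = {u for (u, a) in M if a == v and u not in R}
        if len(outs) != 1 or len(ins) != 1:
            return False
    return True
```

Note how trivial pairs matter here: with M′ = {(a, a), (a, b)}, the walk a → a → b has even length, so a is linked to b even though the direct step a → b alone would be too short.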
Proof. For ease of describing the proof, we use (3DSTCON, mver) instead of
(2SAT3, mvbl) because (2SAT3, mvbl) ≡sL_T (3DSTCON, mver) by Lemma 3(3).
As the first step, we wish to verify that (3DSTCON, mver) ≤sL_m
(AP2DM4, mset), although this is a stronger statement than what is actually
needed for our claim (since ≤sL_m implies ≤sL_T). Let (G, s, t) be any instance given
to 3DSTCON with G = (V, E). Notice that G has degree at most 3. To simplify
our argument, we slightly modify G so that G has no vertex whose indegree is 3
or outdegree is 3. For convenience, we further assume that s and t are of degree
1. Notationally, we write V(−) for V − {s, t} and assume that V(−) is of the form
{v1, v2, . . . , vn} with |V(−)| = n.
Let us construct a target instance (X, R, M ) to which we can reduce (G, s, t)
by an appropriately chosen short L-m-reduction. For any index i ∈ {0, 1, 2}, we
prepare a new element of the form [v, i] for each v ∈ V(−) and define Xi to be
{[v, i] | v ∈ V(−)}. The desired universe X is set to be {s, t} ∪ X0 ∪ X1 ∪ X2.
Fig. 3. The subset M of X × X with X = {s, t} ∪ {[vi, j] | i ∈ [4], j ∈ [0, 2]Z} (seen here
as a bipartite graph) constructed from G = (V, E) with V = {v1, v2, v3, v4, s, t} and
E = {(s, v2), (v3, v2), (v2, v4), (v4, v3), (v3, t)}. Every pair of two adjacent vertices forms
a single element in X. The edges expressing trivial pairs are all omitted for simplicity.
Every vertex has degree at most 4 (including one omitted edge).
As subsets of X × X, we define the following seven sets: M0 = {([v, 0], [w, 0]) |
v, w ∈ V(−), (v, w) ∈ E}, M1 = {([vi, 1], [vi+1, 1]), ([vi+1, 1], [vi, 1]) | i ∈
[n − 1]}, M2 = {([vi+1, 2], [vi, 2]), ([vi, 2], [vi+1, 2]) | i ∈ [n − 1]}, M3 =
{([v, 2], [v, 0]), ([v, 0], [v, 1]) | v ∈ V(−)}, M4 = {([v1, 1], s), ([vn, 1], s), (t, [v1, 2]),
(t, [vn, 2])}, M5 = {(s, [u, 0]), ([v, 0], t) | (s, u), (v, t) ∈ E}, and M6 = {(ũ, ũ) | ũ ∈
X}. Finally, M is defined to be the union ∪_{i=0}^{6} Mi and R is set to be {[v, 0] | v ∈
V(−)}. Note that R is uniquely connected to X − R because of M3.
To illustrate the aforementioned construction, let us consider a sim-
ple example of G = (V, E) with V = {v1 , v2 , v3 , v4 , s, t} and E =
{(s, v2), (v3, v2), (v2, v4), (v4, v3), (v3, t)}. The universe X is the set {s, t} ∪
{[vi, j] | i ∈ [4], j ∈ [0, 2]Z}. The constructed M from G is illustrated in Fig. 3.
In what follows, we claim that there is a simple path from s to t in
G iff, for any two distinct elements ũ, ṽ ∈ X, there is a perfect matching,
say, Mũṽ for which ũ is linked to ṽ. To verify this claim, we first assume
that there is a simple path γst = (w1 , w2 , . . . , wk ) in G with w1 = s and
wk = t. Let T = {([v, 0], [w, 0]) | v, w ∈ γst − {s, t}, (v, w) ∈ E} and
S = {(s, [w2 , 0]), ([wk−1 , 0], t)}. We remark that s and t are linked to each other
in M because there exists a path s ⇝ t in G. Hereafter, ũ and ṽ denote two
arbitrary distinct elements in X with either ũ ∉ R or ṽ ∉ R.
(1) Let us consider the case where ũ, ṽ ∉ {s, t}. In this case, let ũ = [vi0, l]
and ṽ = [vj0, l′] for l, l′ ∈ [0, 2]Z and i0, j0 ∈ [n]. It follows that (l, i0) ≠ (l′, j0).
We then define the desired perfect matching Mũṽ as follows, depending on the
choice of ũ and ṽ.
(Case 1) Consider the case of l, l′ ∈ {1, 2}. Let M′0 = T, M′1 =
{([vi, 1], [vi+1, 1]) | i ∈ [n − 1]}, M′2 = {([vi+1, 2], [vi, 2]) | i ∈ [n − 1]},
M′3 = {([v1, 2], [v1, 0]), ([v1, 0], [v1, 1])}, M′4 = {([vn, 1], s), (t, [vn, 2])}, M′5 = S,
and let M′6 contain (z, z) for all other elements z. Finally, we set Mũṽ = ∪_{i=0}^{6} M′i.
It then follows by the definition that Mũṽ is a perfect matching. Since s is linked
to t in Mũṽ, ũ and ṽ are also linked to each other.
(Case 2) In the case where l = 0, l′ ∈ {1, 2}, and vi0 ∉ γst, there are three
separate cases (a)–(c) to examine. The symmetric case of Case 2 can be handled
similarly and is omitted here.
Since its first proposal in [14], the linear space hypothesis (LSH) has been
expected to play a key role in showing the computational hardness of numerous
combinatorial parameterized-NL problems. However, there are few problems that
have been proven to be equivalent in computational complexity to (2SAT3 , mvbl ).
This situation has motivated us to look for natural, practical problems equiva-
lent to (2SAT3 , mvbl ). Along this line of study, the current exposition has intro-
duced three parameterized decision problems (2CVC3 , mver ), (3XCE2 , mset ),
and (AP2DM4 , mset ), and demonstrated that those problems are all equivalent
in power to (2SAT3 , mvbl ) by “short” log-space reductions.3 The use of such short
reductions is crucial in the equivalence proofs of these parameterized decision
problems presented in Sects. 3–5 because PsubLIN is unlikely to be closed under
“standard” log-space reductions, and short reductions may be more suitable for
the discussion on various real-life problems. Under the assumption of LSH, there-
fore, all parameterized decision problems that are equivalent to (2SAT3 , mvbl )
by short log-space reductions turn out to be unsolvable in polynomial time using
sub-linear space.
In the end, we remind the reader that the question of whether LSH is true
still remains open. Nevertheless, we hope to resolve this key question in the near
future.
References
1. Àlvarez, C., Greenlaw, R.: A compendium of problems complete for symmetric
logarithmic space. Comput. Complex. 9, 123–142 (2000)
3 We remark that it is unknown whether (AP2DM4, mset) ≡sLm (2SAT3, mvbl) holds.
2. Chandra, A., Stockmeyer, L., Vishkin, U.: Constant depth reducibility. SIAM J.
Comput. 13, 423–439 (1984)
3. Cook, S.A.: The complexity of theorem-proving procedures. In: Proceedings of the
3rd Annual ACM Symposium on Theory of Computing, pp. 151–158. ACM (1971)
4. Cook, S.A., McKenzie, P.: Problems complete for deterministic logarithmic space.
J. Algorithms 8, 385–394 (1987)
5. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory
of NP-Completeness. W.H. Freeman and Company (1979)
6. Immerman, N.: Nondeterministic space is closed under complementation. SIAM J.
Comput. 17, 935–938 (1988)
7. Jenner, B.: Knapsack problems for NL. Inf. Process. Lett. 54, 169–174 (1995)
8. Jones, N.D., Lien, Y.E., Laaser, W.T.: New problems complete for nondeterministic
log space. Math. Syst. Theory 10, 1–17 (1976)
9. Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E.,
Thatcher, J.W. (eds.) Complexity of Computer Computations, pp. 85–103. Plenum
Press, New York (1972)
10. Reif, J.H.: Symmetric complementation. J. ACM 31, 401–421 (1984)
11. Reingold, O.: Undirected connectivity in log-space. J. ACM 55(4), 17 (2008)
12. Szelepcsényi, R.: The method of forced enumeration for nondeterministic
automata. Acta Informatica 26, 279–284 (1988)
13. Yamakami, T.: Parameterized graph connectivity and polynomial-time sub-linear-
space short reductions. In: Hague, M., Potapov, I. (eds.) RP 2017. LNCS, vol.
10506, pp. 176–191. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67089-8_13
14. Yamakami, T.: The 2CNF Boolean formula satisfiability problem and the linear
space hypothesis. In: Proceedings of the 42nd International Symposium on Math-
ematical Foundations of Computer Science, vol. 83 of Leibniz International Pro-
ceedings in Informatics (LIPIcs), Leibniz-Zentrum für Informatik 2017, pp. 1–14
(2017). A corrected and complete version is available as an arXiv preprint
15. Yamakami, T.: State complexity characterizations of parameterized degree-
bounded graph connectivity, sub-linear space computation, and the linear space
hypothesis. Theor. Comput. Sci. 798, 2–22 (2019)
Rashomon Effect and Consistency
in Explainable Artificial Intelligence
(XAI)
1 Introduction
Rashomon is the name of an old Japanese film by Akira Kurosawa in which
four different witnesses, called to report about a murder, describe their different
and partly contradictory views of the facts of the crime. In his publication
titled “Statistical Modeling: The Two Cultures” [1,12], Leo Breiman
established the notion of the Rashomon effect for ML to describe the fact that different
statistical models or different data predictors can work equally well in fitting
the same data. To explain a model’s behavior, one usually tries to identify a
subset of the model’s parameters, and especially those that seem to have the
strongest influence on the model’s prediction. For a linear regression model with
thirty parameters, for example, searching for the best five-parameter approximation
would mean choosing from a set of about 140,000 candidate functions.
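The figure of roughly 140,000 corresponds to the binomial coefficient C(30, 5); a quick check:

```python
from math import comb

# Number of ways to choose 5 of 30 parameters.
print(comb(30, 5))  # 142506
```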
Naturally, each approximating function attributes different importance to each
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 796–808, 2023.
https://doi.org/10.1007/978-3-031-18461-1_52
Rashomon Effect and Consistency in XAI 797
In this section, machine learning models trained with the California Housing
Prices dataset are described and the variability of their interpretations is
displayed. The task is to practically demonstrate the discrepancies created by
established model explanation techniques for models of the same accuracy trained
with the same data. Table 1 lists published information about the dataset used
here, derived from the 1990 U.S. Census for estimating the median house value
of California districts.
Table 1. California housing dataset [2]. The number of instances is 20640, the number
of attributes is 8 (numeric). Target value is the median house value for California
districts.
3.1 XGBoost-Model
XGBoost or eXtreme Gradient Boosting is a highly efficient and portable open-
source implementation of the stochastic gradient boosting ensemble algorithm
for machine learning. It provides interfaces for use with Python, R, and
other programming languages. Gradient boosting refers to a class of ensemble
machine learning algorithms that can be used for classification or regression
predictive modeling problems [3]. Ensembles are here constructed from decision
tree models. During training, trees are gradually added to the ensemble and
fitted in the direction of reducing the error of the prior models. This process is
referred to as boosting. Listings 1.1 and 1.2 respectively describe two model
instantiations; their fitting and evaluation code is given in Listing 1.3.
The two models both display the same degree of accuracy (R² = 83%) and have
been used to produce the plots of Figs. 1 and 2.
Listing 1.1. Model-1
xgbr = xgboost.XGBRegressor(learning_rate=0.14,
                            n_estimators=500,
                            random_state=1001)
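The boosting process described above, where each new tree is fitted in the direction of reducing the current ensemble's error, can be illustrated with a toy sketch in which each "tree" is replaced by a constant predictor fitted to the residuals (an illustration of the principle only, not the XGBoost algorithm):

```python
def toy_boost(y, n_rounds=3, learning_rate=0.5):
    """Squared-error boosting where each weak learner is a constant fit."""
    pred = [0.0] * len(y)
    for _ in range(n_rounds):
        # Each round fits the residual error left by the current ensemble.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        step = sum(residuals) / len(residuals)  # best constant for the residuals
        pred = [pi + learning_rate * step for pi in pred]
    return pred

print(toy_boost([2.0, 4.0, 6.0]))  # [3.5, 3.5, 3.5]
```

With more rounds the predictions converge toward the least-squares optimum; real gradient boosting replaces the constant step with a regression tree fitted to the residuals.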
plotting interface and derived with three different metrics, weight, cover and
gain. In Fig. 3 the feature importance is displayed for both models, extracted
through the native XGBoost plotting interface and derived with two further
metrics: total_cover and total_gain. The three basic metrics to measure feature
importance are described as follows:
Fig. 1. Scatter plot of prediction over real price as well as Feature Importance and Per-
mutation Importance extracted from the XGBoost-Models with Scikit-Learn, whereby
the default metric is gain, above for Model-1 and below for Model-2.
Fig. 2. Feature Importance for three different metrics: weight, cover and gain,
extracted from Model-1 (above) and Model-2 (below) using the Native XGBoost API.
Weight: number of times a feature is used to split the data across all trees.
Cover: number of times a feature is used to split the data across all trees
weighted by the number of training data points that go through those splits.
Gain: average training loss reduction gained when using a feature for splitting.
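Assuming each split in the trained trees records its feature, its cover and its gain, these metrics (and their total_ variants) can be computed from the tree dump as in the following sketch; the split records are made up for illustration:

```python
from collections import defaultdict

# Hypothetical split records: (feature, cover, gain).
splits = [("MedInc", 100.0, 5.0), ("MedInc", 40.0, 1.0), ("AveOccup", 60.0, 3.0)]

weight = defaultdict(int)        # number of splits per feature
total_cover = defaultdict(float)
total_gain = defaultdict(float)
for feat, c, g in splits:
    weight[feat] += 1
    total_cover[feat] += c
    total_gain[feat] += g

# cover and gain are the per-split averages of the totals.
cover = {f: total_cover[f] / weight[f] for f in weight}
gain = {f: total_gain[f] / weight[f] for f in weight}

print(dict(weight))  # {'MedInc': 2, 'AveOccup': 1}
print(gain)          # {'MedInc': 3.0, 'AveOccup': 3.0}
```

Even this toy example shows why rankings disagree: weight ranks MedInc strictly higher, while gain ranks the two features equally.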
Fig. 3. Feature Importance with the metrics: total_cover and total_gain extracted
from Model-1 (above) and Model-2 (below) using the Native XGBoost API.
Fig. 4. Feature Importance as calculated and expressed by the mean absolute SHAP
values: Model-1 (above) and Model-2 (below).
802 A.-M. Leventi-Peetz and K. Weber
A comparison of the bar charts within the upper (lower) part of Fig. 2 shows
that the employed metric, weight, cover or gain, greatly affects the
resulting order of the calculated feature importance for one and the same model,
in this case Model-1 (Model-2). Comparing the upper and lower parts of Fig. 2
for the same metric shows that the feature importance calculated with the
metric weight and with the metric cover, respectively, differs between the two
models, despite the fact that the two models were trained on the same data
set and have the same accuracy (i.e., are competing models). In contrast, the
metric gain delivers a constant order of feature importance for both models of
Fig. 2, as a comparison between the upper and lower plots in the
rightmost column of Fig. 2 shows. This constant order, delivered with the metric gain,
is also independent of the API employed for the feature extraction, as a
comparison between the rightmost column of plots in Fig. 2 and the middle column
of plots in Fig. 1 shows. Similar observations apply when comparing the importance
order calculated for the metric total_cover and the metric total_gain,
respectively, in Fig. 3, where it is obvious that for one and the same model the results
are again metric dependent. The metric total_cover yields different results
when applied to the two competing models, as seen by comparing the upper and
lower elements of the left column of plots in Fig. 3. The interpretation results
with the metric total_cover also differ from the results with the metric
cover, as a comparison between the corresponding columns of plots in Figs. 2 and
3 shows. The differences in the feature order for one and the same
model are not at all negligible. For instance, the feature MedInc is ranked first
in the list of importance when the metric total_cover is chosen for Model-1, as
displayed in the upper left part of Fig. 3, while the same feature is ranked fourth
by the metric cover for the same model, as shown in the upper middle part
of Fig. 2. In conclusion, a model can be evaluated in different ways, which
deliver different interpretations of the model results. Because the gain metric
appears to deliver a firm ranking of feature importance even for competing
models (models of the same accuracy), this metric is usually preferred. However,
the stability of the explanations discussed here relates to the native XGBoost
API and the scikit-learn feature extraction methods. Other explanation methods
deliver yet different orders of feature importance, as the example of Fig. 4 shows,
where the feature importance is calculated with the help of the mean absolute
SHAP values, to be discussed in the next section. It is not trivial to measure
the consistency and accuracy of model interpretation results, especially when
using global attribution methods, as is the case here. Understandably, the
wide variety of possible parameter combinations for configuring the model training
makes the interpretation results strongly dependent on the learning strategy.
created with the training data set. For the results of Figs. 4 and 5,
respectively, the Tree SHAP algorithm (shap.TreeExplainer) has been employed here,
which has been especially developed for tree ensembles such as XGBoost [15]. A short and
precise summary of the advantages and disadvantages of SHAP is provided in
Chap. 9.6 of the online book by Christoph Molnar [19]. In Fig. 4 the mean absolute
SHAP values for the features of the trained XGBoost models are displayed,
while in Fig. 5 the SHAP interpretation is presented: the importance of the features
as well as the individual influence of each feature on the model result.
Figure 6, taken from Fan et al. (2021) [8], shows the Shapley values computed
for a fully connected layer neural network (NN) model, which was trained on the
same California Housing Dataset, with the same eight features. No detailed
information about the NN or the SHAP implementation that delivered the results of
Fig. 6 is given in [8]. A comparison of Figs. 5 and 6 shows that Fan et al.
obtained a different feature order and a different impact of the feature values
than those calculated for the XGBoost models here. In SHAP plots the features
are ranked in descending order along the vertical axis. Higher feature values are
marked in red color, lower values are marked in blue. The horizontal deviation,
(distance from the zero axis), is associated with the scale of the impact of the
variable on the result. The partial mixing of colors in the horizontal plot shapes
indicates that the feature’s influence on the target value, (here the house price),
is ambiguous. The vertically changing shape of the bars indicates that there
exist interactions of features with other features. As an example, Fig. 5 shows
that higher values of the feature MedInc push house prices toward higher
values, in which case this feature is said to correlate with the price, whereas higher
values of the feature AveOccup push prices down, so this feature anti-correlates.
In Fig. 7, the SHAP feature interaction values for the two XGBoost models are
depicted. For the SHAP values of Fig. 8, the general shap.KernelExplainer [14]
Fig. 5. SHAP values for the XGBoost Model-1, trained with the California Housing
dataset [2]. The respective plot for Model-2 is omitted because it is very similar.
Fig. 6. SHAP values calculated with a fully connected layer NN Model, trained by Fan
et al. with the California Housing dataset [2, 8].
has been applied to an NN model especially developed and trained here with
the California Housing Dataset. The corresponding results in Fig. 8 display distinct
differences compared to the SHAP values for the XGBoost models in Fig. 5,
but also to the borrowed graphic of Fan et al. in Fig. 6. The exact calculation of
a model’s feature explainability with Shapley values would demand solving an
NP-complete problem, an exercise which is exponential in the number of features
and cannot be solved in polynomial time in most cases. Therefore, various
model-specific but also model-agnostic approximate solutions have been developed
under the name KernelSHAP. They perform Shapley value estimation
by solving a linear regression based exercise. KernelSHAP utilizes data set
sampling approaches that lead to solving a constrained least squares problem with
a manageable number of data points [4]. The properties of KernelSHAP are not
yet thoroughly understood. It is not clear if Shapley value estimators are indeed
statistical estimators, whether the uncertainty of their results can be quantified,
and how unbiased their sampling methods are. The issue is still under investigation.
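To make the exponential cost concrete, exact Shapley values can be computed by enumerating all coalitions; the value function below is a made-up toy with an interaction term, not any of the models discussed above:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values; the coalition enumeration is exponential in n."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                # Weight of a coalition of size k in the Shapley formula.
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value(set(S) | {i}) - value(set(S)))
        phi[i] = total
    return phi

# Toy additive value function with an interaction bonus between 'a' and 'b'.
contrib = {"a": 1.0, "b": 2.0, "c": 0.5}
def v(S):
    return sum(contrib[f] for f in S) + (1.0 if {"a", "b"} <= S else 0.0)

print(shapley_values(["a", "b", "c"], v))  # approx: a=1.5, b=2.5, c=0.5
```

The interaction bonus is split evenly between 'a' and 'b', illustrating how Shapley values distribute joint effects; KernelSHAP approximates exactly this quantity by sampling coalitions instead of enumerating them.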
SHAP values estimated with KernelExplainers are also not deterministic due
to the sampling methods and the background data set selection. SHAP values
also do not provide explanation causality and have to be handled with care, if
used with predictive models [4,9,13,18]. In addition to the technical discrepancies
between explainability plots, take for example Figs. 5 and 6, which show
SHAP values of two different models built on the same data, there often also exist
subjective human interpretation factors and opposing views that complicate
a unique understanding of model results even further. Fan et
al. [8] deduce from the Shapley value analysis of their NN model that the model
is biased, because the house age positively correlates with the house price, which
goes against experience, as they say. However, the house age also has a positive
Shapley value as calculated from the XGBoost models trained here and depicted
in Fig. 5, or the NN model depicted in Fig. 8. This is not necessarily
a sign of bias in the model or the training data, as many old houses are not
mass products, can be of better quality or particular architectural design, or go
with more land. Among the disadvantages of SHAP is the production of
unintuitive feature attributions.
Fig. 7. Feature SHAP interaction values for the XGBoost Models: Model-1 (above)
and Model-2 (below).
Fig. 8. SHAP values calculated with a fully connected layer NN Model developed here,
trained with the California Housing dataset [2].
It should also be possible to create intentionally misleading interpretations
with SHAP, which can hide biases [19]. This is
certainly a point that needs special care and further investigation. For instance,
it is intuitive that the price of a house strongly depends from the location but it
is unintuitive that it depends from the income of the buyer. In this case, there is
certainly a relation between feature and target value, because a higher income
makes it possible for a buyer to afford a more expensive house, but this relation
is not causal in the sense of a house price prediction model. In other words, the
price of the house can not causally depend on the income of its buyer.
4 Conclusions
References
1. Breiman, L.: Statistical modeling: the two cultures. Stat. Sci. 16(3), 199–215
(2001). https://www.jstor.org/stable/2676681
2. Scikit-Learn California Housing dataset. http://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset. Accessed Apr 2022
3. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings
of the 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, pp.
785–794 (2016). https://doi.org/10.1145/2939672.2939785
4. Covert, I.: Understanding and improving KernelSHAP. Blog by Ian Covert (2020).
https://iancovert.com/blog/kernelshap/. Accessed Apr 2022
5. D’Amour, A.: Revisiting Rashomon: a comment on “the two cultures”. Observa-
tional Stud. 7(1) (2021). https://doi.org/10.1353/obs.2021.0022
6. Dressel, J., Farid, H.: The accuracy, fairness, and limits of predicting recidivism.
Sci. Adv. 4(1), eaao5580 (2018). https://doi.org/10.1126/sciadv.aao5580
7. Fisher, A., Rudin, C., Dominici, F.: All models are wrong, but many are useful:
learning a variable’s importance by studying an entire class of prediction mod-
els simultaneously. J. Mach. Learn. Res. 20(177), 1–81 (2019). http://jmlr.org/
papers/v20/18-760.html
8. Fan, F.L., et al.: On interpretability of artificial neural networks: a survey. IEEE
Trans. Radiat. Plasma Med. Sci. 5(6), 741–760 (2021). https://doi.org/10.1109/
TRPMS.2021.3066428
9. Gerber, E.: A new perspective on Shapley values, part II: the Naïve Shapley
method. Blog by Edden Gerber (2020). https://edden-gerber.github.io/shapley-
part-2/. Accessed Apr 2022
10. Gibney, E.: This AI researcher is trying to ward off a reproducibility crisis. Interview
with Joelle Pineau. Nature 577, 14 (2020). https://doi.org/10.1038/d41586-019-03895-5
11. Jia, E.: Explaining explanations and perturbing perturbations, Bachelor’s the-
sis, Harvard College (2020). https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:
37364690
12. Koehrsen, W.: Thoughts on the two cultures of statistical modeling. Towards
Data Sci. (2019). https://towardsdatascience.com/thoughts-on-the-two-cultures-
of-statistical-modeling-72d75a9e06c2. Accessed Apr 2022
13. Kuo, C.: Explain any models with the SHAP values - use the Kernelexplainer.
Towards Data Sci. (2019). https://towardsdatascience.com/explain-any-models-
with-the-shap-values-use-the-kernelexplainer-79de9464897a. Accessed Apr 2022
14. Lundberg, S.M., Lee, S.-I.: A unified approach to interpreting model predic-
tions. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Sys-
tems 30, pp. 4765–4774 (2017). https://proceedings.neurips.cc/paper/2017/hash/
8a20a8621978632d76c43dfd28b67767-Abstract.html
15. Lundberg, S.M., et al.: From local explanations to global understanding with
explainable AI for trees. Nat. Mach. Intell. 2, 56–67 (2020). https://doi.org/10.
1038/s42256-019-0138-9
16. Marx, C.T., Calmon, F., Ustun, B.: Predictive multiplicity in classification. In:
ICML (International Conference on Machine Learning), Proceedings of Machine
Learning Research, vol. 119, pp. 6765–6774 (2020). https://proceedings.mlr.press/
v119/marx20a.html
17. Merrick, L., Taly, A.: The explanation game: explaining machine learning models
using shapley values. In: Holzinger, A., et al. (eds.) Machine Learning and Knowl-
edge Extraction, vol. 12279, pp. 17–38 (2020). https://doi.org/10.1007/978-3-030-
57321-8_2
18. Mohan, A.: Kernel SHAP. Blog by Mohan, A. (2020). https://www.telesens.co/
2020/09/17/kernel-shap/. Accessed Apr 2022
19. Molnar, C.: Interpretable machine learning. Free HTML version (2022). https://
christophm.github.io/interpretable-ml-book/
20. Villa, J., Zimmerman, Y.: Reproducibility in ML: why it matters and
how to achieve it. Determined AI (2018). https://www.determined.ai/blog/
reproducibility-in-ml. Accessed Apr 2022
21. Warden, P.: The machine learning reproducibility crisis. Domino Data Lab
(2018). https://blog.dominodatalab.com/machine-learning-reproducibility-crisis.
Accessed Apr 2022
22. Zafar, M.R., Khan, N.: Deterministic local interpretable model-agnostic explana-
tions for stable explainability. Mach. Learn. Knowl. Extr. 3(3), 525–541 (2021).
https://doi.org/10.3390/make3030027
Recent Advances in Algorithmic Biases
and Fairness in Financial Services:
A Survey
1 Introduction
AI and ML are disruptive technologies that have become valuable tools for
government organizations and large businesses, including financial institutions.
In the present era of increased computational capacity, abundant digital data
and its low-cost storage, AI has seen significant growth in the financial sector.
ML algorithms can identify a wide range of non-intuitive correlations in structured
and unstructured “big data”. They can therefore conceptually be more efficient
than humans at utilizing such data for forecasting, in a generation
where highly concentrated computing power allows organizations to generate, collect, and store
massive datasets [45].
The possibilities of AI in the different sectors of finance are unlimited and
span across the entire value chain. Combining AI with other existing technologies
such as blockchain, cloud computing, etc. increases the possibilities even further.
AI has streamlined the processes in the different stages of providing financial
services, enhanced cybersecurity, automated routine tasks such as credit scor-
ing, pricing and most importantly improved the customer service experience.
Algorithms are also used in risk assessment and management, real-time fraud
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 809–822, 2023.
https://doi.org/10.1007/978-3-031-18461-1_53
810 A. Bajracharya et al.
detection, stock trading, providing financial advisories and so on. AI makes
decisions that have a crucial impact on the business, and it is important to know
the reasons behind such critical decisions.
Aside from the multitude of opportunities brought forward by AI, every
machine learning system is prone to retaining multiple forms of bias present
in the tainted data [4]. While computer programming languages are explicitly
instructed with codes, ML algorithms are provided with an underlying frame-
work and are trained to learn through data observation. In the course of learn-
ing, ML algorithms will develop biases towards certain types of input. There are
multiple forms of biases that reflect human prejudices towards race, color, sex,
religion, and many other common forms of discrimination which are amplified
by the ML model. For instance, unjust decisions taken by ML models that are
based on historical police records, bias in under-sampled data from minority
groups and so on. The racial wealth gap that exists between black and white
Americans is preserved in part through such biases in credit and lending [45].
The data fed to the computer is simplified to allow the algorithms to be
programmed to learn by example, but the example is often a faulty one leading
all data mining applications to be capable of replicating human biases [29,44].
Regardless of whether the intentions were good or bad, financial institutions
risk using either biased or selectively chosen data or a biased algorithmic design
that induces discriminatory outcomes towards legally protected traits such as
race, gender, religion or sexual orientation [39]. Therefore, it is important to
design algorithms in ways that mitigate the potential for bias and ensure
fairness. Software is called fair if it is not affected by any prejudice that favors
the inherent or acquired characteristics of an individual or a group.
In the context of financial institutions, the scope of bias in the ML model
expands further due to the fact that the data is collected from customers [52]. It
is a consumer facing industry with a reducing level of human involvement, and
therefore it is important for the organizations to be aware of the potential for
algorithmic bias and have strategies for mitigation.
Bias mitigation tools and strategies continue to advance in improving
algorithms’ accuracy and fairness. Most of the strategies are developed with the
objective of mitigating the effects of bias arising from sampling issues, feature selection,
labeling, etc., and of preventing discrimination in a given model’s outputs [47]. In
this paper we discuss the common strategies of achieving fairness such as pre-
processing, in-processing and post-processing tools, fairness metrics and resam-
pling as well as algorithm audit and the use of “alternative” data. It can be
a challenge to satisfy every fairness condition simultaneously, and so fairness
always entails some degree of trade-off with respect to accuracy.
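As one concrete example of such a fairness metric, the demographic parity difference compares positive-outcome rates across groups; the loan decisions below are hypothetical, not taken from any cited study:

```python
def demographic_parity_difference(y_pred, group):
    """Gap between the highest and lowest positive-outcome rate (0 = parity)."""
    rates = {}
    for g in set(group):
        preds = [p for p, gr in zip(y_pred, group) if gr == g]
        rates[g] = sum(preds) / len(preds)
    return max(rates.values()) - min(rates.values())

# Hypothetical approval decisions (1 = approved) for two groups.
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]
print(demographic_parity_difference(y_pred, group))  # 0.5
```

Here group 0 is approved at a 75% rate and group 1 at 25%, a parity gap of 0.5; enforcing a smaller gap typically costs some accuracy, which is the trade-off mentioned above.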
This paper presents a survey on the recent advances of algorithmic biases and
fairness in financial services. The remainder of the paper is organized as follows.
Section 2 discusses the different sources of algorithmic bias. Section 3 presents
an overview of different instances of biases in the major areas of financial ser-
vices. Section 4 focuses on the bias detection and mitigation techniques. Finally,
conclusions are presented in Sect. 5.
Historical Bias. Data may incorporate the gender, racial, economic and other
biases that have existed for a long time. Geographical bias is one common out-
come of historical bias where residents of poor Zip code areas or from minority
communities have historically had more cases of default, which is reflected in the
data and thus results in a higher proportion of declined loan applications. The
cycle goes on and reinforces the historical biases over time, a phenomenon
called a feedback loop.
Measurement Bias. Bias can arise when a study variable is inaccurately
measured, or when systematic data recording errors are stored. Such errors generally affect
the entire model, but they may sometimes impact a particular group, when the
data collection method for just that group was faulty.
Representation Bias. The sample size could be small or skewed towards cer-
tain groups that are not representative of the entire population.
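One standard pre-processing remedy for such unrepresentative samples is reweighing, which assigns each (group, label) combination the weight it would have if group and label were statistically independent. The sketch below follows the Kamiran-Calders style of reweighing (our choice of illustration, with toy data):

```python
from collections import Counter

def reweigh(groups, labels):
    """Instance weights w(g, y) = P(g) * P(y) / P(g, y)."""
    n = len(groups)
    p_g = Counter(groups)               # marginal counts per group
    p_y = Counter(labels)               # marginal counts per label
    p_gy = Counter(zip(groups, labels)) # joint counts
    return [(p_g[g] / n) * (p_y[y] / n) / (p_gy[(g, y)] / n)
            for g, y in zip(groups, labels)]

# Toy sample: group 0 is over-represented among positive labels.
groups = [0, 0, 0, 1]
labels = [1, 1, 0, 0]
print(reweigh(groups, labels))  # [0.75, 0.75, 1.5, 0.5]
```

Over-represented combinations receive weights below 1 and under-represented ones weights above 1, so a learner trained with these weights sees a sample in which group and label are decoupled.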
the basic fact that algorithms often use biased programmed reasoning that is
invented by humans themselves.
In this section, we aim to present the recent advances of algorithmic bias
in the most prominent areas of the financial services industry.
lack of access to fairly priced credit and leads to racially disparate outcomes.
Lenders have shifted from excluding redlined neighborhoods from mainstream
credit to exploiting them to maximize profits [26]. For instance, Cathy
O’Neil cites an example of a borrower in the majority black neighborhood of
East Oakland, California receiving a low e-score due to historical correlations
between her ZIP code and high default rates [38]. Lenders are basically abusing
borrowers with such low credit scores and little-to-no credit histories by extend-
ing credit at arguably unreasonably high interest rates. The borrowers, despite
being aware of the exploitation, have no other option except to submit to the
unfair credit terms [29].
Predatory tactics such as aggressive advertising, consent solicitation and bait-
and-switch schemes have been historically employed by creditors to target vul-
nerable borrowers [17]. For instance, these creditors visit university campuses
annually to conduct exciting advertisement campaigns full of music and freebies
to attract students towards their offers of teaser low-interest-rate credit. As AI
keeps advancing, the ability of such predatory creditors and lenders to target
vulnerable consumers will also advance.
Black and Hispanic borrowers suffer from racial gaps in mortgage costs and
are more likely to be rejected when they apply for a loan. In a Northwestern
University meta-analysis, the authors find no evidence that racial disparities in the
mortgage market have declined over the past four decades. Subtle but persistent
forms of discrimination between whites and minorities have been compounding
While the majority of the bias detection and mitigation approaches
concentrate on credit assessment, there seems to be a limited focus on the biases
prevalent in the operational models in financial services [33]. The accelerating
use of AI in the operation of financial firms has digitalized the customer service
area of business. AI is pervasively used in credit card fraud detection, customer
authentication, chatbots, etc. Facial recognition algorithms are known to exhibit
bias: researchers have found them to falsely identify Black and Asian faces 10 to 100
times more often than white faces. The algorithms also falsely identified female faces
more often than they did male faces. This would especially increase the vulnera-
bility of Black women towards algorithmic bias [24]. On the other hand, Natural
Language Processing (NLP) models, widely used in the customer service appli-
cations, financial assistants, recruiting, and personnel management, are basically
a product of linguistic data full of discriminatory patterns that reflect human
biases, such as racism, sexism, and ableism [11].
Fintech firms are experiencing remarkable growth and have reached a 64% global
adoption rate [20]. Fintech firms are increasingly attracting business from many
financial institutions claiming to provide sophisticated, analytical and model-
based interpretation of big data. Such services are primarily oriented towards
predicting the creditworthiness of borrowers and their results are treated almost
like universal truths by the financial institutions [29].
While fintech is experiencing remarkable growth, the reports of bias have
also increased, especially the concerns regarding privacy and fairness. Fintech
firms around the world utilize thousands of features and attributes for assessing
creditworthiness [50]. Fintech makes use of individuals’ digital footprints, such
as marital and dating status, social media profiles, SMS message contents,
cookie data, facial analysis and micro-expressions, and even typing speed and
accuracy, all of which are scrutinized for assessing creditworthiness [33].
Fintech firms promise that their developers expressly program the algorithms
to prevent them from replicating statistical discrimination. Several studies of
the discrepancies in FinTech lending also claim that FinTech mortgage lenders
show little to no gap in the lending terms provided to Black and Hispanic
borrowers after adjusting for GSE credit-pricing determinants and loan size
(Shoag, 2021) [48]. The findings of the research conducted by Bartlett
et al. [7] suggest that, "In addition to the efficiency gains of these innovations,
they may also serve to make the mortgage lending markets more accessible to
African-American and Latinx borrowers." They found that FinTech algorithms
also discriminate, but 40% less than face-to-face lenders.
During the pandemic, researchers at New York University [27]
found that Black business owners were 12.1 percentage points more likely to get
PPP funds from a fintech firm than from a conventional bank, while small banks
were much less likely to lend to Black-owned businesses.
Fintech firms now write the most home mortgages, yet they face less regulatory
scrutiny, so their AI models pose growing ethical concerns that threaten the most
marginalized individuals and families [10]. Recent federal banking regulations
adopt a deregulatory approach that enables fintech firms to dominate the financial
markets. While the approach was intended to encourage innovation, it may amplify
the exploitation of the most vulnerable communities [29]. One study [22] finds no
evidence that fintech firms are working to increase financial services to
low-income borrowers. A possibility that cannot be ignored is that fintech lenders
may ingrain predatory inclusion, existing inequities, and unconscious biases into
the financial system for decades to come, accelerating the wealth gap and
constraining the development of minority communities [29].
Algorithms can be used for mitigating bias at three stages of processing:
pre-processing, in-processing (algorithm modification), and post-processing.
Pre-processing bias mitigation involves preparing and optimizing the data to
accurately represent the population and to reduce the predictability of the
protected attribute. Neglecting bias in the source data can amplify bias in the
model's conclusions. Resampling, reweighting, massaging, and data transformation
tactics, such as flipping the class labels across groups and omitting sensitive
variables or proxies, are some methods of bias mitigation in this early stage [31].
In-processing tends to focus on creating a classifier and training it to optimize
for both accuracy and fairness. Mitigation can range from using adversarial
techniques and ensuring underlying representations are fair, such as Kamishima's
prejudice remover [32], to framing fairness constraints and regularization [12].
Finally, there is an abundance of methods that focus only on adjusting the outcome
of a model, i.e., post-processing. Early works in this area focus on modifying
thresholds in a group-specific manner, whereas recent work has sought to extend
these ideas to regression models [42].
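The pre-processing stage lends itself to a compact sketch. The snippet below illustrates reweighting in the spirit of Kamiran and Calders [31]: each (group, label) pair is weighted so that the protected attribute becomes statistically independent of the label. The toy data set is an assumption for illustration, not drawn from any system discussed here.

```python
# Hedged sketch of pre-processing bias mitigation by reweighting, in the
# spirit of Kamiran and Calders [31]; the toy data below is illustrative.
from collections import Counter

def reweigh(samples):
    """samples: list of (protected_group, label) pairs.

    Returns a weight per (group, label) pair:
        w = P(group) * P(label) / P(group, label),
    which makes group and label independent under the weighted distribution.
    """
    n = len(samples)
    group_c = Counter(g for g, _ in samples)
    label_c = Counter(y for _, y in samples)
    pair_c = Counter(samples)
    return {(g, y): (group_c[g] * label_c[y]) / (n * pair_c[(g, y)])
            for (g, y) in pair_c}

# Toy data: group "a" receives the favourable label far more often than "b".
data = [("a", 1)] * 6 + [("a", 0)] * 2 + [("b", 1)] * 2 + [("b", 0)] * 6
weights = reweigh(data)
print(weights[("b", 1)])  # 2.0 -> the underrepresented pair is up-weighted
```

Downstream, any classifier that accepts per-sample weights can be trained on these values, which is one common way to reduce the predictability of the protected attribute without dropping data.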
4.3 Resampling
Unequal representation of data is one of the major reasons for bias, because the
model lacks sufficient data for a certain class and is therefore unable to learn
about that particular class. The algorithms are much more likely to classify new
5 Conclusion
The advent of AI technology has the potential to generate ample AI solutions
that can upgrade traditional banking, develop new market infrastructure, and
foster the inclusion of unbanked or underbanked consumers. This potential has to
be backed by effective state and federal regulations that monitor the consumer
lending market and promote the accountability, transparency, and explainability
of its algorithms. In this paper, we have presented a survey of recent advances
in algorithmic bias and fairness in financial services. Algorithmic bias continues
to prevail in all sectors of the financial industry, catalyzing the deeply rooted
economic segregation and racism in the U.S. With the advent of more and more
bias detection and mitigation tools, AI/ML solutions will gradually gain more
trust from the consumers of the financial industry and, as a result, help
facilitate a more equitable future. There is also a crucial need for a detailed
study of modern datasets to identify the current status of lending disparities.
References
1. Credit score, August 2021
2. How America banks: household use of banking and financial services, 2019 FDIC
survey, December 2021
3. Minority depository institutions program, December 2021
4. Akula, R., Garibay, I.: Audit and assurance of AI algorithms: a framework
to ensure ethical algorithmic practices in artificial intelligence. arXiv preprint
arXiv:2107.14046 (2021)
5. Bakelmun, A., Shoenfeld, S.J.: Open data and racial segregation: mapping the
historic imprint of racial covenants and redlining on American cities. In: Hawken,
S., Han, H., Pettit, C. (eds.) Open Cities — Open Data, pp. 57–83. Springer,
Singapore (2020). https://doi.org/10.1007/978-981-13-6605-5_3
6. Federal Reserve Banks. Small business credit survey: 2021 report on employer firms
(2021)
7. Bartlett, R., Morse, A., Stanton, R., Wallace, N.: Consumer-lending discrimination
in the era of fintech. Unpublished working paper. University of California, Berkeley
(2018)
8. Bhutta, N., Chang, A.C., Dettling, L.J., et al.: Disparities in wealth by race and
ethnicity in the 2019 survey of consumer finances (2020)
9. Broady, K.E., McComas, M., Ouazad, A.: An analysis of financial institutions
in black-majority communities: black borrowers and depositors face considerable
challenges in accessing banking services, March 2022
10. Buckley, R.P., Arner, D.W., Zetzsche, D.A., Selga, E.: The dark side of digital
financial transformation: the new risks of fintech and the rise of techrisk. In: UNSW
Law Research Paper (19-89) (2019)
11. Caliskan, A., Bryson, J.J., Narayanan, A.: Semantics derived automatically from
language corpora contain human-like biases. Science 356(6334), 183–186 (2017)
12. Celis, L.E., Huang, L., Keswani, V., Vishnoi, N.K.: Classification with fairness
constraints: a meta-algorithm with provable guarantees. In: Proceedings of the
Conference on Fairness, Accountability, and Transparency, pp. 319–328 (2019)
13. Chakraborty, J., Majumder, S., Yu, Z., Menzies, T.: Fairway: a way to build fair ml
software. In: Proceedings of the 28th ACM Joint Meeting on European Software
Engineering Conference and Symposium on the Foundations of Software Engineer-
ing, pp. 654–665 (2020)
14. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic
minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
15. Federal Trade Commission, et al.: Big data: a tool for inclusion or exclusion?
understanding the issues. FTC report (2016)
16. Demyanyk, Y., Kolliner, D.: Peer-to-peer lending is poised to grow. Economic
Trends (2014)
17. Engel, K.C., McCoy, P.A.: A tale of three markets: the law and economics of
predatory lending. Tex. L. Rev. 80, 1255 (2001)
18. Fairlie, R., Robb, A., Robinson, D.T.: Black and white: access to capital among
minority-owned start-ups. Manage. Sci. 68, 2377–2400 (2021)
19. Friedman, B., Nissenbaum, H.: Bias in computer systems. ACM Trans. Inf. Syst.
(TOIS) 14(3), 330–347 (1996)
20. Frost, J.: The economic forces driving fintech adoption across countries. The tech-
nological Revolution in Financial Services: How Banks, Fintechs, and Customers
win Together, pp. 70–89 (2020)
21. Fuchs, D.J.: The dangers of human-like bias in machine-learning algorithms.
Missouri S&T’s Peer to Peer 2(1), 1 (2018)
22. Fuster, A., Plosser, M., Schnabl, P., Vickery, J.: The role of technology in mortgage
lending. Rev. Financ. Stud. 32(5), 1854–1899 (2019)
23. Garg, P., Villasenor, J., Foggo, V.: Fairness metrics: a comparative analysis. In:
2020 IEEE International Conference on Big Data (Big Data), pp. 3662–3666. IEEE
(2020)
24. Grother, P.J., Ngan, M.L., Hanaoka, K.K.: Face recognition vendor test part
3: demographic effects (2019)
25. Hassani, B.K.: Societal bias reinforcement through machine learning: a credit
scoring perspective. AI Ethics 1(3), 239–247 (2020). https://doi.org/10.1007/s43681-020-00026-z
26. Howell, B.: Exploiting race and space: Concentrated subprime lending as housing
discrimination. Calif. L. Rev. 94, 101 (2006)
27. Howell, S.T., Kuchler, T., Snitkof, D., Stroebel, J., Wong, J.: Racial disparities in
access to small business credit: Evidence from the paycheck protection program.
Technical report, National Bureau of Economic Research (2021)
28. Jagtiani, J., Lemieux, C.: The roles of alternative data and machine learning in
fintech lending: evidence from the lendingclub consumer platform. Financ. Manage.
48(4), 1009–1029 (2019)
29. Johnson, K., Pasquale, F., Chapman, J.: Artificial intelligence, machine learning,
and bias in finance: toward responsible innovation. Fordham L. Rev. 88, 499 (2019)
30. Kallus, N., Mao, X., Zhou, A.: Assessing algorithmic fairness with unobserved
protected class using data combination. Manage. Sci. 68(3), 1959–1981 (2022)
31. Kamiran, F., Calders, T.: Data preprocessing techniques for classification without
discrimination. Knowl. Inf. Syst. 33(1), 1–33 (2012)
822 A. Bajracharya et al.
32. Kamishima, T., Akaho, S., Asoh, H., Sakuma, J.: Fairness-aware classifier with
prejudice remover regularizer. In: Flach, P.A., De Bie, T., Cristianini, N. (eds.)
ECML PKDD 2012. LNCS (LNAI), vol. 7524, pp. 35–50. Springer, Heidelberg
(2012). https://doi.org/10.1007/978-3-642-33486-3_3
33. Kurshan, E., Chen, J., Storchan, V., Shen, H.: On the current and emerging chal-
lenges of developing fair and ethical AI solutions in financial services. arXiv preprint
arXiv:2111.01306 (2021)
34. Liu, X.M., Murphy, D.: A multi-faceted approach for trustworthy AI in cyberse-
curity. Journal of Strategic Innovation & Sustainability 15(6), 68–78 (2020)
35. KPMG LLP. Algorithmic bias and financial services. Technical report (2021)
36. Mitchell, T.M.: The need for biases in learning generalizations. Department of
Computer Science, Laboratory for Computer Science Research . . . (1980)
37. Neal, M., Walsh, J.: The Potential and Limits of Black-Owned Banks. Urban Insti-
tute, Washington, DC (2020)
38. O’Neil, C.: Weapons of math destruction: how big data increases inequality
and threatens democracy. Broadway Books (2016)
39. Packin, N.G.: Consumer finance and AI: the death of second opinions? NYUJ
Legis. Pub. Pol’y 22, 319 (2019)
40. Perry, A., Rothwell, J., Harshbarger, D.: Five-star reviews, one-star profits:
the devaluation of businesses in black communities. Brookings Institution (2020)
41. Petrasic, K., Saul, B., Greig, J., Bornfreund, M., Lamberth, K.: Algorithms and
bias: what lenders need to know. White & Case (2017)
42. Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., Weinberger, K.Q.: On fairness
and calibration. In: Advances in Neural Information Processing Systems, vol. 30
(2017)
43. Quillian, L., Lee, J.J., Honoré, B.: Racial discrimination in the us housing and
mortgage lending markets: a quantitative review of trends, 1976–2016. Race Soc.
Probl. 12(1), 13–28 (2020)
44. Rawal, A., McCoy, J., Rawat, D., Sadler, B., Amant, R.: Recent advances in trust-
worthy explainable artificial intelligence: status, challenges and perspectives (2021)
45. Rea, S.: A survey of fair and responsible machine learning and artificial intelligence:
implications of consumer financial services. Available at SSRN 3527034 (2020)
46. Seamster, L.: Black debt, white debt. Contexts 18(1), 30–35 (2019)
47. Selbst, A.D., Barocas, S.: The intuitive appeal of explainable machines. SSRN
Electron. J. (2018). https://doi.org/10.2139/ssrn.3126971
48. Shoag, D.: The impact of fintech on discrimination in mortgage lending. Available
at SSRN 3840529 (2021)
49. Simonite, T.: When bots teach themselves to cheat. Wired Magazine (Aug. 2018)
(2018)
50. Singh, R.: GK Digest (2015)
51. Srivastava, B., Rossi, F.: Towards composable bias rating of AI services. In: Pro-
ceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 284–
289 (2018)
52. Zhang, Y., Zhou, L.: Fairness assessment for artificial intelligence in financial indus-
try. arXiv preprint arXiv:1912.07211 (2019)
Predict Individuals’ Behaviors from Their
Social Media Accounts, Different
Approaches: A Survey
1 Introduction
Predicting individual behavior is a key objective in the social sciences, ranging
from business to sociology, psychology, and economics. In marketing, prediction
helps select groups of individuals to target with promotional material. Specif-
ically, organizations can determine people most likely to take action, such as
adopting a new product. For many years, demographic, geographic, and behav-
ioral targeting were the main prediction methods. However, with the increasing
availability of social media data, it is now possible to predict the behaviors of
individuals from their accounts. For example, [3] state that by analyzing the
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 823–836, 2023.
https://doi.org/10.1007/978-3-031-18461-1_54
824 A. Almutairi and D. B. Rawat
2 Problem Statement
3 Background
According to [7], the social ties that exist among users of online social systems
play a vital role in determining those users' behaviors. In particular, social
influence is one of the critical factors shaping the behavior of users of online
social networks. The actions of an individual user trigger his or her friends to
follow the same trend of behavior while using the same social network. Behavioral
modes, ideas, and new technological advancements can easily spread via social
networks through the power of social influence among users. For instance, the
interaction of people on online social networks like Facebook and Flickr attracts
huge traffic, which indicates the possible influence of users on the behavior of
other users of the same networks [7]. The extent of influence is immense due to
the huge number of users on online social platforms. Notably, online social
networks provide huge volumes of data about the actions of their users. As a
result, it is possible to extract and analyze such data to study how the actions
of individuals influence the behaviors of fellow users on shared networks.
4 Methodology
The current paper does not generate new knowledge but compiles research on the
different approaches to predicting individuals' behaviors from their social media
accounts. A literature review methodology was used to fulfill this overarching
aim. IEEE Xplore was the main search tool, targeting peer-reviewed articles
published in the last five years. The string “predicting behaviors from social
media accounts” was initially used to select the first potential sources.
Twenty-one sources were identified and scrutinized to identify the main prediction
approaches available in the literature. The main varieties identified included the
Lexicon approach, the Louvain algorithm, Naïve Bayes classification, and
Multi-Criteria Decision Making (MCDM). IEEE Xplore was then used to identify
specific publications on each method: the string “[prediction method] for
predicting behaviors from social media accounts” was used for each approach. For
example, the string “Lexicon approach for predicting behaviors from social media
accounts” was used to identify sources for that method. Cumulatively, 10 sources
were selected using this technique. The identified references were analyzed based
on several criteria. Firstly, they had to align with the literature review's
overarching aim. Secondly, the sources had to be peer-reviewed and published by
reputable journals or presented at major conferences. Notably, journals and
conferences focusing on computing applications, data science, analytics, and
computational linguistics were given more weight. The evidence extracted from the
references was collated, summarized, aggregated, organized, and compared.
Ultimately, the four prediction methods were elaborated based on the derived
insights.
negative and positive tweets: disgust, anger, fear, sadness, and happiness.
Consequently, the Lexicon approach was combined with MCDM to categorize the
collected tweets into the identified emotion clusters using the Co-Plot method.
This process adequately classifies fine-grained and mixed emotions without the
need for factor analysis [1]. Thus, Co-Plot analysis (Fig. 1) can depict the
emotional positioning of social network posts on a two-dimensional analysis
surface. This possibility further demonstrates the feasibility of the Lexicon
approach for predicting emotions and their intensity.
A hybrid of the Lexicon approach and Multi-Criteria Decision Making can be
applied: a predetermined plot is used to define and evaluate the text by
constructing a two-dimensional graphical analysis space in which one dimension
reflects the findings (tweets) and the other reflects the ratings.
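As an illustrative sketch of the Lexicon side of this hybrid (the five-emotion word lists below are toy assumptions, not the Arabic lexicon used in [1]), a tweet can be scored by counting its hits in each per-emotion word list:

```python
# Illustrative sketch only (toy lexicon, not the AMIRA pipeline of [1]):
# score a tweet against per-emotion word lists and pick the dominant emotion.
LEXICON = {
    "happiness": {"happy", "joy", "delighted"},
    "sadness": {"sad", "cry", "lonely"},
    "anger": {"angry", "furious", "hate"},
    "fear": {"afraid", "scared", "panic"},
    "disgust": {"gross", "disgusting", "nasty"},
}

def emotion_scores(tweet):
    """Count lexicon hits per emotion for a whitespace-tokenized tweet."""
    tokens = tweet.lower().split()
    return {emotion: sum(t in words for t in tokens)
            for emotion, words in LEXICON.items()}

def dominant_emotion(tweet):
    scores = emotion_scores(tweet)
    return max(scores, key=scores.get)

print(dominant_emotion("so happy and delighted today"))  # happiness
```

In the surveyed pipeline, these per-emotion scores would form the tweet-by-emotion matrix that the Co-Plot analysis then positions on its two-dimensional surface.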
[Fig. 2. Lexicon: pipeline from incoming tweet through segmentation, the scoring
algorithm, generation of the tweet-by-emotion-words matrix from the lexicon,
matrix normalization, and dissimilarity measurement.]
[Figure: Louvain pipeline with pre-processing, a co-occurrence database, and an
ADVF base.]
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j), \qquad (1)

where k_i = \sum_j A_{ij} is the sum of the weights of the edges attached to
node i, A_{ij} is the weight of the edge between nodes i and j, and c_i is the
community containing node i. \delta(c_i, c_j) equals 1 when the two nodes belong
to the same community and 0 otherwise, and m = \frac{1}{2} \sum_{ij} A_{ij} is
the total edge weight.
The Louvain method comprises two stages that recur iteratively. For a graph
containing N nodes, each node is initially taken as its own community. In the
first phase, each node i is moved to the neighboring community that yields the
maximum positive modularity gain. This phase ends when a local maximum is
attained and no additional modularity improvement is possible. The second phase
generates a new graph in which the communities found become the new nodes. The
two stages are iterated until no further modularity gain is attainable. These
processes produce highly accurate topic modeling results.
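As a self-contained illustration of Eq. (1), using a toy graph rather than any dataset from the surveyed work, modularity can be computed directly from an adjacency matrix. Partitioning two triangles joined by a single edge into their natural communities yields Q = 5/14 ≈ 0.357:

```python
# Minimal sketch (not the authors' implementation): modularity Q from Eq. (1)
# for a weighted, undirected graph given as an adjacency matrix.

def modularity(A, communities):
    """A: symmetric adjacency matrix (list of lists); communities: node labels."""
    n = len(A)
    k = [sum(row) for row in A]   # k_i = sum_j A_ij (weighted degree)
    two_m = sum(k)                # 2m = sum_ij A_ij
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:   # delta(c_i, c_j)
                q += A[i][j] - k[i] * k[j] / two_m
    return q / two_m

# Two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
labels = [0, 0, 0, 1, 1, 1]
print(round(modularity(A, labels), 4))  # 0.3571
```

The Louvain phases greedily move nodes between communities precisely to increase this quantity.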
The authors employed the Louvain algorithm to identify communities because of its
high accuracy and its ability to analyze huge datasets [11]. It also separates
communities based on modularity scores, leading to improved efficiency compared
with the tools used by [14]. After sorting users into communities, the communities
are labeled based on likes, retweets, and hashtags. The authors applied a support
vector machine (SVM) algorithm to label the identified communities due to its high
accuracy. In particular, the efficiency of SVM stems from its ability to demarcate
the communities with a separating boundary [11].
P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)} \qquad (2)
[Figure: the Big Five personality traits (openness, conscientiousness,
extraversion, agreeableness, neuroticism).]
P(y \mid x_1, \ldots, x_n) = \frac{P(y)\, P(x_1, \ldots, x_n \mid y)}{P(x_1, \ldots, x_n)} \qquad (3)
The authors of [14] used naive Bayes to extract data about the emotional tweets
posted by Twitter users. Naive Bayes enabled the classification of tweets into
negative and positive tweets [14], allowing the study of the emotional messages
posted by Twitter users and the impact of such posts on the relationships of the
individuals using the network. The authors employed the Brunner-Munzel test to
analyze user relationships and the influence on followers on the Twitter platform.
Naive Bayes classification showed that accuracy increases with the use of data
from many Twitter users [14]. This machine-learning tool showed a higher level of
accuracy in classifying messages as negative, neutral, or positive than other
tools such as decision tree and random forest classifiers. Consequently, this
demonstrated that naive Bayes yields accurate results in the analysis of emotional
Twitter messages, especially with data from many users [9].
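A minimal sketch of Eq. (3) in practice, on a toy corpus rather than the Twitter data of [14]: a multinomial naive Bayes classifier with Laplace smoothing that labels short texts as positive or negative.

```python
# Illustrative sketch only (toy data, not the authors' dataset): a multinomial
# naive Bayes classifier built directly from Eq. (3) with Laplace smoothing.
from collections import Counter, defaultdict
import math

def train(docs):
    """docs: list of (tokens, label). Returns the counts the classifier needs."""
    priors = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return priors, word_counts, vocab, len(docs)

def predict(tokens, model):
    priors, word_counts, vocab, n = model
    best, best_lp = None, -math.inf
    for label, prior in priors.items():
        total = sum(word_counts[label].values())
        lp = math.log(prior / n)
        for w in tokens:
            # Laplace smoothing keeps unseen words from zeroing the product
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [("love great happy".split(), "positive"),
        ("awful hate sad".split(), "negative"),
        ("great wonderful love".split(), "positive"),
        ("terrible sad awful".split(), "negative")]
model = train(docs)
print(predict("love wonderful".split(), model))  # positive
```

Because the class-conditional word probabilities factorize, the classifier scales easily to the large user populations for which [14] report increasing accuracy.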
As a result, it helps in identifying both pessimistic and optimistic aspects of
the posts [14]. The use of graph theory enabled the researchers to analyze the
data based on the density of followership on Twitter. Classifying the follower
networks into low co-link and high co-link categories helps in understanding the
patterns of interaction and the behaviors of the followers in both categories. In
particular, a few famous users dominated the conversations and interactions in
low co-link groups. The detection of communities is the key aspect of user
relationships on the Twitter platform [14]. According to [14],
5 Discussion
Based on the findings of [1], the Lexicon approach based on the Co-Plot method
can be decomposed into different sections, as illustrated in Fig. 1. The system
starts by accepting a tweet comprising several words. A tokenization process
based on the AMIRA toolkit is then deployed. The resulting outcome is a sequence
of segmented words from which emotion words can be extracted. Consequently, the
6 Conclusion
References
1. Abd Al-Aziz, A.M., Gheith, M., Eldin, A.S.: Lexicon based and multi-criteria deci-
sion making (MCDM) approach for detecting emotions from Arabic microblog
text. In: 2015 First International Conference on Arabic Computational Linguistics
(ACLing), pp. 100–105 (2015). https://doi.org/10.1109/ACLing.2015.21
2. Tiwari, D., Singh, N.: Sentiment analysis of digital India using lexicon approach.
In: 2019 6th International Conference on Computing for Sustainable Global Devel-
opment (INDIACom), pp. 1189–1193 (2019)
3. Chatzakou, D., et al.: Detecting cyberbullying and cyberaggression in social
media. ACM Trans. Web 13(3), Article 17, 51 pages (2019). https://doi.org/10.
1145/3343484
4. Kido, G.S., Igawa, R.A., Junior, S.B.: Topic Modeling based on Louvain method
in Online Social Networks. In: Proceedings of the XII Brazilian Symposium on
Information Systems on Brazilian Symposium on Information Systems: Informa-
tion Systems in the Cloud Computing Era - Volume 1 (SBSI 2016). Brazilian
Computer Society, Porto Alegre, BRA, pp. 353–360 (2016)
5. Sarwani, M., Salafudin, M., Sani, D.: Knowing personality traits on Facebook
status using the Naïve Bayes classifier. Int. J. Artif. Intell. Robot. (IJAIR)
2, 22 (2020). https://doi.org/10.25139/ijair.v2i1.2636
6. Samuel, H., Noori, B., Farazi, S., Zaiane, O.: Context prediction in the
social web using applied machine learning: a study of Canadian Tweeters. In:
IEEE/WIC/ACM International Conference on Web Intelligence (WI), vol. 2018,
pp. 230–237 (2018). https://doi.org/10.1109/WI.2018.00-85
7. Anagnostopoulos, A., Kumar, R., Mahdian, M.: Influence and correlation in social
networks. In: Proceedings of the 14th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD 2008). Association for Computing
Machinery, New York, NY, USA, 7–15 (2008). https://doi.org/10.1145/1401890.
1401897
8. Farooq, A., Joyia, G.J., Uzair, M., Akram, U.: Detection of influential nodes using
social networks analysis based on network metrics. In: 2018 International Confer-
ence on Computing, Mathematics and Engineering Technologies (iCoMET), pp.
1–6 (2018). https://doi.org/10.1109/ICOMET.2018.8346372
9. Meeragandhi, G., Muruganantham, A.: Potential influencers identification using
multi-criteria decision making (MCDM) methods. Procedia Comput. Sci. 57, 1179–
1188 (2015). https://doi.org/10.1016/j.procs.2015.07.411
10. King, I., Li, J., Chan, K.T.: A brief survey of computational approaches in social
computing. In: International Joint Conference on Neural Networks, pp.1625–1632
(2009). https://doi.org/10.1109/IJCNN.2009.5178967
11. Nizar, L., Yahya, B., Mohammed, E.: Community detection system in online social
network. In: Fifth International Symposium on Innovation in Information and
Communication Technology (ISIICT), pp. 1–6 (2018). https://doi.org/10.1109/
ISIICT.2018.8613285
12. Ozer, M., Kim, N., Davulcu, H.: Community detection in political twitter networks
using nonnegative matrix factorization methods (2016)
13. Panda, M., Jagadev, A.K.: TOPSIS in multi-criteria decision making: a survey.
In: 2018 2nd International Conference on Data Science and Business Analytics
(ICDSBA), pp. 51–54 (2018). https://doi.org/10.1109/ICDSBA.2018.00017
14. Tago, K., Jin, Q.: Analyzing influence of emotional tweets on user relationships
by Naive Bayes classification and statistical tests. In: 2017 IEEE 10th Conference
on Service-Oriented Computing and Applications (SOCA), pp. 217–222 (2017).
https://doi.org/10.1109/SOCA.2017.37
15. Çakır, E., Ulukan, Z.: An intuitionistic fuzzy MCDM approach adapted to mini-
mum spanning tree algorithm for spreading content on social media. In: 2021 IEEE
11th Annual Computing and Communication Workshop and Conference (CCWC),
pp. 0174–0179 (2021). https://doi.org/10.1109/CCWC51732.2021.9375942
16. Umamaheswari, S., Harikumar, K.: Analyzing product usage based on twitter
users based on datamining process. In: 2020 International Conference on Compu-
tation, Automation and Knowledge Management (ICCAKM), pp. 426–430 (2020).
https://doi.org/10.1109/ICCAKM46823.2020.9051488
Environmental Information System Using
Embedded Systems Aimed at Improving
the Productivity of Agricultural Crops
in the Department of Meta
1 Introduction
The problem of world overpopulation and the massive exploitation of the planet's
resources has caused innumerable social difficulties in different regions of the
world, affecting people's quality of life and putting their food security at risk.
It is also worth mentioning that this work aligns with two of the United Nations
Sustainable Development Goals: Goal 2 (Zero Hunger) and Goal 11 (Sustainable
Cities and Communities) [1].
The concept of food security has been used at the national level of government
since the 1990s. However, it has not been possible to consolidate a government
policy that guarantees that people do not suffer from this scourge [2]. On the
international scene, the Food and Agriculture Organization of the United Nations
has proposed sustainability policies and urges the governments of the world to
implement such food and environmental sustainability policies [3]. The main
obstacle of the traditional way of doing agriculture in Colombia is that it is a
manual, artisanal, and traditional process in which the inclusion of technology
is very limited, so productivity levels are very low compared with the technified
agriculture of the major world powers, where research and development have a
direct impact on the entire production chain, including food production [2]. In
addition to this low crop productivity, there is the market liberalization
generated by free trade agreements, under which a large amount of foreign food
and products can be imported into the national territory. These products come
from processes in which technology has notably increased productivity and
efficiency, so the local market struggles to keep product prices competitive
against such disproportionate competition.
Information systems, automation, and advances in the internet and web
technologies provide platforms where information becomes easier to access.
Information systems are therefore appropriate tools for designing an agricultural
control and monitoring system in which the interaction between electronic devices
and programming languages connects all the physical elements in their
environment, allowing users to access and control devices from anywhere at any
time. In addition, the internet is a dynamic information network that can be
adapted to any place or situation and that allows communication between sensors
and intelligent devices, enabling decision making.
An embedded system alone is not enough to automate a process; although optimizing
resources and tasks is important, the data and information that the system can
produce must also be taken into account, since such systems often lose data for
lack of storage and treatment. For the development of this project, the
environmental conditions of the passion fruit crop were considered within its
production process. The aim was to control the temperature, humidity, and air
variables of the fruit; storing the data and presenting it through the web is
therefore another function of this project, providing traceability of the system
for future analyses as well as enough information to generate future predictions
and improve the conditions of agricultural production.
2.2 Methods
The research project uses a mixed method, with quantitative and qualitative data
that allow the analysis of the crop's behavior. The Environmental Information
System using embedded systems aimed at improving the productivity of agricultural
crops in the department of Meta takes place in three fundamental stages.
In the first stage, the environmental conditions of temperature, humidity, and
air of an agricultural crop are determined. The meteorological data of the
Vanguardia station belonging to IDEAM, the Institute of Hydrology and
Environmental Studies in Colombia, are consulted, as is the NASA weather
database.
In the second stage, a control system is implemented to maintain the environmental
conditions of the crop.
The different sensors and devices for the system are identified by means of a
comparison matrix discussed below. The electronic circuit is designed with the
AutoCAD software for the corresponding implementation.
In the final stage, a web application is developed to provide information on the
conditions of the crop in real time.
Hardware connections are made and programmed with different languages: the
electronic devices are programmed with the Arduino sketch; free software such as
XAMPP, PHP, MySQL, and HTML is used to store the sensor data; calls are added in
the Arduino sketch code to send the data to the database; and finally the web
application is developed with HTML and JavaScript tools.
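The data path just described (Arduino sketch to PHP/MySQL back end to web front end) can be sketched in miniature. The snippet below only composes the request URL an embedded node could send; the endpoint name insert.php, the parameter names, and the server address are illustrative assumptions, not the project's actual API.

```python
# Hedged sketch of the data path: compose the HTTP request an embedded node
# could use to push one sensor reading into a PHP/MySQL backend. The endpoint
# name (insert.php), parameter names, and address are assumptions.
from urllib.parse import urlencode

def reading_url(base, temp_c, humidity_pct):
    # GET-style query string, as a minimal Arduino sketch would assemble it
    query = urlencode({"temp": f"{temp_c:.1f}", "hum": f"{humidity_pct:.1f}"})
    return f"{base}/insert.php?{query}"

print(reading_url("http://192.168.0.10", 28.4, 79.2))
# http://192.168.0.10/insert.php?temp=28.4&hum=79.2
```

On the server side, a small PHP script would parse these parameters and insert a row into the MySQL table from which the web application reads.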
3 Results
Temperature and humidity data are obtained from the Vanguardia weather station
(code 35035020 of the Institute of Hydrology and Environmental Studies, IDEAM),
acquired as flat files and processed to analyze the last 11 years. The data are
classified in Tables 1, 2, 3 and 4 by minimum and maximum monthly values per
year of temperature and humidity, measured in degrees Celsius and percentage of
humidity.
Parameter            Year  Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
Maximum temperature  2010  37.6  38.2  35.3  33.7  32.4  32.4  32.0  32.6  33.4  33.6  32.6  32.7
Maximum temperature  2011  34.0  34.4  34.4  32.7  32.2  32.4  32.2  33.2  32.6  33.0  32.4  33.0
Maximum temperature  2012  34.0  34.4  34.4  32.7  32.2  32.4  32.2  33.2  32.6  33.0  32.4  33.0
Maximum temperature  2013  34.9  35.8  34.4  34.6  32.2  32.0  31.4  31.4  33.2  32.8  31.8  33.2
Maximum temperature  2014  34.9  35.8  34.8  34.2  33.4  32.8  32.6  33.4  33.8  32.8  32.8  33.4
Maximum temperature  2015  33.6  34.8  36.4  34.3  33.4  32.0  31.8  33.4  34.6  33.6  32.2  34.0
(continued)
840 O. H. Romero Ocampo
Table 1. (continued)
Parameter Año Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Temperatura maxima 2016 36,2 36,4 37,2 33,4 32,0 31,8 31,6 32,6 33,0 34,0 32,3 –
Temperatura maxima 2017 36,2 36,4 37,2 33,4 32,0 31,8 31,6 32,6 33,0 34,0 32,3 –
Temperatura maxima 2018 33,8 35,4 35,8 32,1 32,0 30,8 30,4 32,6 33,2 33,5 32,2 33,6
Temperatura maxima 2019 34,8 35,8 35,6 32,8 33,0 31,9 31,2 32,6 32,8 34,4 32,2 33,2
Temperatura maxima 2020 34,2 35,8 36,0 33,1 32,4 32,6 31,4 33,6 33,2 33,0 32,7 32,8
Table 2. Minimum monthly temperature (°C) per year, Vanguardia station

Parameter Año Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Temperatura minima 2010 20,6 21,2 20,8 20,7 20,7 18,8 19,8 19,2 19,5 20,2 19,8 18,8
Temperatura minima 2011 19,4 21,0 19,5 19,7 19,2 18,2 19,0 18,7 18,4 18,6 19,1 19,0
Temperatura minima 2012 19,4 21,0 19,5 19,7 19,2 18,2 19,0 18,7 18,4 18,6 19,1 19,0
Temperatura minima 2013 19,9 20,0 18,5 20,2 18,1 18,5 17,3 18,3 19,1 18,8 18,3 18,5
Temperatura minima 2014 18,6 19,6 17,7 18,1 19,6 17,0 18,7 17,7 17,8 18,0 18,8 18,8
Temperatura minima 2015 18,0 20,6 20,4 20,1 20,2 19,7 19,2 19,4 19,6 19,4 20,3 19,0
Temperatura minima 2016 20,4 21,4 21,8 20,0 20,2 19,8 19,2 19,7 20,0 19,3 19,7 –
Temperatura minima 2017 20,4 21,4 21,8 20,0 20,2 19,8 19,2 19,7 20,0 19,3 19,7 –
Temperatura minima 2018 18,6 19,6 19,0 18,8 18,4 18,8 18,6 18,8 18,8 19,6 19,7 18,6
Temperatura minima 2019 19,0 21,6 20,6 19,8 19,8 18,2 18,6 19,0 19,0 17,9 19,8 19,2
Temperatura minima 2020 18,4 21,4 19,2 20,2 20,2 19,0 19,3 19,4 18,8 20,1 20,4 19,3
Environmental Information System Using Embedded Systems 841
From the consulted data on the maximum and minimum temperature of the last 11 years, an average is projected so that the temperature behavior of the area in that time interval is known; it ranges between 17 °C and 38.2 °C, information used as a starting point for the agricultural crop conditions (Fig. 1).
Fig. 1. Average temperature of the last 11 years per month in the city of Villavicencio, (own
elaboration)
The analyzed data allow comparing the environmental conditions against those of the passion fruit crop: the best conditions for cultivation occur at temperatures of 17 to 30 °C, and accordingly the environmental conditions of the department of Meta converge with the requirements for the cultivation of passion fruit [4].
Table 3. Maximum monthly relative humidity (%) per year, Vanguardia station

Parameter Año Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Humedad relativa maxima 2010 88 98 99 99 99 98 100 99 98 98 97 100
Humedad relativa maxima 2011 99 97 96 97 96 97 97 100 98 97 97 95
Humedad relativa maxima 2012 99 97 96 97 96 97 97 100 98 97 97 95
Humedad relativa maxima 2013 88 93 97 96 99 99 98 99 97 98 97 97
(continued)
Table 3. (continued)
Parameter Año Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Humedad relativa maxima 2014 98 97 96 97 97 97 97 99 96 100 99 97
Humedad relativa maxima 2015 94 93 96 97 96 96 96 96 97 95 95 96
Humedad relativa maxima 2016 89 95 97 97 98 97 97 97 96 97 96 –
Humedad relativa maxima 2017 89 95 97 97 98 97 97 97 96 97 96 –
Humedad relativa maxima 2018 97 91 95 99 99 99 98 98 98 98 99 –
Humedad relativa maxima 2019 97 97 100 98 98 97 99 98 97 97 98 97
Humedad relativa maxima 2020 95 92 96 97 98 97 97 97 97 97 98 97
Table 4. Minimum monthly relative humidity (%) per year, Vanguardia station

Parameter Año Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Humedad relativa minima 2010 24 21 32 51 46 50 49 41 41 46 48 44
Humedad relativa minima 2011 41 35 42 43 45 47 43 39 43 43 43 34
Humedad relativa minima 2012 41 35 42 43 45 47 43 39 43 43 43 34
Humedad relativa minima 2013 33 28 32 41 48 48 48 47 43 41 44 40
Humedad relativa minima 2014 36 34 34 40 34 40 43 35 40 41 43 39
Humedad relativa minima 2015 32 34 34 38 38 38 40 40 32 37 41 34
Humedad relativa minima 2016 34 29 32 41 44 44 42 43 38 39 39 –
Humedad relativa minima 2017 34 29 32 41 44 44 42 43 38 39 39 –
Humedad relativa minima 2018 34 32 33 42 44 43 44 38 40 38 39 –
(continued)
Table 4. (continued)
Parameter Año Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Humedad relativa minima 2019 36 32 36 42 41 36 37 33 37 40 45 39
Humedad relativa minima 2020 37 29 29 37 34 36 38 42 36 38 47 36
From the consulted data on the maximum and minimum relative humidity of the last 11 years, an average is projected so that the relative humidity behavior of the area in that time interval is known; it ranges between 21% and 100%, information used as a starting point for the agricultural crop conditions (Fig. 2).
Fig. 2. Average relative humidity of the last 11 years per month in the city of Villavicencio, (own
elaboration)
5 System Architecture
The system is based on a group of devices connected through an Arduino Mega board (microcontroller) and a Mega Ethernet shield for communication between the different hardware components. A high-level system architecture composed of hardware and software is presented. The proposed system was conceived using hardware capable of supporting different communication protocols and flexible enough to accommodate adaptable software (Fig. 5).
5.1 Hardware
As can be seen in Figs. 3 and 4, the hardware adopted for the implementation of the system is made up of: an Arduino MEGA 2560 board; a laptop with processing power and memory, which supports programming compatible with various programs, input/output ports and the use of standard peripherals as a gateway; and an Arduino Ethernet Shield for communication between the devices (sensors), which collect information on ambient temperature, relative humidity and air quality. The sending of the data to the server is controlled through the Arduino Mega board, and in this way the data is stored in the database.
5.2 Software
There are several important pieces of software for the proposed architecture, divided
into two sections mentioned below:
Web Platform
It consists of visualizing the information obtained from the sensors for monitoring the cultivation process. The platform stores the sensor information in a MySQL database and is designed with HTML, PHP and JavaScript for agricultural control and monitoring (Fig. 6).
Figure 7 shows the programming code where the sensors are parameterized and the
connection to the database is configured to store the information. The code is designed
so that the system responds to the conditions of temperature, humidity and air and allows
control of the agricultural cultivation process (Fig. 8).
As illustrated in Figs. 3 and 4, the hardware adapted to the agricultural farming system supports the design of a web application to monitor the environmental conditions (temperature, humidity and air quality) of the agricultural system. It is made up of electronic devices, sensors, an Arduino MEGA 2560 board and a laptop with processing capacity and memory, which also allows the use of programmable software and of peripheral ports for the traffic of the data reported by the sensors to the database.
Figure 9 shows the web page where the temperature data obtained from the DHT22 sensor is presented and subsequently displayed.
Figure 10 shows the web page where the relative humidity data obtained from the sensor is presented and displayed versus the number of times the data was sampled.
Figure 11 shows the web page where the air-quality data obtained from the MQ-135 sensor is presented and subsequently displayed.
As evidenced in Figs. 9, 10 and 11, the Y axis represents the temperature in degrees Celsius, the relative humidity in percent and the air quality in PPM, respectively, while the X axis in all three graphs corresponds to the number of sensor data storage intervals.
The system allows real-time visualization of the meteorological data of temperature, humidity and air quality from a network of sensors, with Ethernet technology facilitating the communication and transmission of information; in this way, the measured values are compared with the environmental range of the passion fruit crop, allowing monitoring and control of the crop during its agricultural process.
6 Conclusion
The web development can be adjusted to the system conditions and to a mobile application for practicality in the field. Likewise, the system is scalable over time, since it can be adjusted to the number of sensors or electronic devices added and desired for its continuous improvement.
The system could be self-sufficient if the devices were powered by renewable energy. In this case, and due to its location, photovoltaic energy is recommended as an alternative to grid electricity. In this way, the system can change its power consumption and adapt to new hardware conditions, thus being more economical.
References
1. PNUD: Objetivos de desarrollo sostenible (2019). https://www.undp.org/content/undp/es/
home/sustainable-development-goals.html
2. Mejía, M.A.: Seguridad alimentaria en Colombia después de la apertura económica (2016)
3. FAO: Sistemas alimentarios (2020). Villavicencio, “Presentación”. http://www.fao.org/food-systems/es/.de
4. CDT CEPASS: El maracuyá en Colombia. Corporación Centro de Desarrollo Tecnológico de
las Pasfloras de Colombia (2015)
An Approach of Node Model TCnNet: Trellis
Coded Nanonetworks on Graphene Composite
Substrate
1 Introduction
Considering that Wireless Sensor Networks (WSNs) are an important infrastructure for
the Internet of Things (IoT) and the interest in using sensor networks in the same universe
as IP networks, this work innovates by solving the limited hardware resources of the
network nodes using the new concept of “Trellis Coded Network”- (TCNet) introduced
in previous works: (i) “A new algorithm and routing protocol based on convolutional
codes using TCNet: Trellis Coded Network” [1], where the network nodes are associated
to the states of low complexity Finite State Machine (FSM) and the routing discovery
corresponds to the best path in a trellis; (ii) “Robustness situations in cases of node failure and packet collision enabled by TCNet: Trellis Coded Network - a new algorithm and routing protocol” [2], which shows the robustness of the TCNet algorithm in making decisions in cases of node failures and packet collisions, taking advantage of the regeneration capacity of the trellis. This proposal innovates in
making decisions on the node itself, without the need for signaling messages such as
“Route Request”, “Route Reply” or the “Request to Send (RTS)” and “Clear to Send
(CTS)” to solve the hidden node problem that is known to degrade the throughput of
ad hoc networks due to collisions and the exposed node problem that results in poor
performance by wasting transmission opportunities.
An extension of this proposal is to apply the same concepts of TCNet to networks of
nanodevices where the nodes are cooperatively interconnected with a low-complexity
Mealy Machine (MM) topology composed of XOR gates and shift registers. This new
configuration can be called TCnNet: Trellis Coded nanonetwork, a “firmware protocol”,
where the nanonetwork node can integrate on the same substrate: rectifiers, Finite State
Machine and on-chip Transmission/Reception.
This approach considers the use of a Graphene Composite Substrate (GCS) for
the integrated electronic circuits, enabling TCnNet: Trellis Coded nanonetwork to inte-
grate the necessary electrical and mechanical characteristics of sensor nodes to meet ad
hoc network scenarios with the following characteristics: limited energy sources, changing topologies, poor link quality, and bandwidth limitations.
In addition, graphene combines advantageously with semiconductor oxides in passive and active applications of electronic circuits, with an excellent cost-benefit ratio [3, 4].
Integrating electronic systems into the most diverse IoT objects present in wireless sensor networks (WSNs), and extending their applications to nanonetworks, must meet the following requirements: autonomy in terms of energy sources, mechanical flexibility, miniaturization, and optical transparency, in addition to being ecological.
The use of Graphene Composite Substrate (GCS) has attracted attention because it
is a two-dimensional structure and responds very efficiently when used as a channel in
Field Effect Transistors (FETs) used in electronic models of sensors [5]. Furthermore, a carbon-based system could benefit from the integration of antennas, rectifiers, sensors and transmit/receive circuitry on the same substrate, with the functions of sensor node and energy storage [6].
The research in [7] compares techniques for obtaining maximum conductivity with graphene composite films, as shown in Fig. 1. The process in Fig. 1(a) deposits a binder-free graphene ink containing nanoflakes, dispersants and solvent. After drying, the result is an excellent film with a 2D structure and porous characteristics, as shown in Fig. 1(b). The next step of the process is to apply compression, obtaining a highly dense nanoflake laminate, Fig. 1(c).
852 D. F. Lima Filho and J. R. Amazonas
Fig. 1. Schematic illustration of the formation of binder-free graphene laminate. No binder was used in the graphene ink because of the strong van der Waals attraction between graphene nanoflakes. The adhesion and conductivity of the graphene laminate were improved by rolling compression [7]
Other techniques exist for obtaining graphene composite films, such as the growth of graphene films by Chemical Vapor Deposition (CVD) [8], an alternative for producing monolayer graphene films with good efficiency on a large scale, and even more empirical mechanical processes such as Micromechanical Cleavage [9].
Fig. 2. Block diagram of the approach of WSN TCnNet node model on graphene composite
substrate – GCS
One of the most critical and ubiquitous problems for nodes in a network is battery life. Although batteries have followed the evolution of smart devices with improving efficiency, they fall short of supplying the thousands of wireless sensors used in IoT networks, owing to the logistics of replacement and disposal, with consequences for the environment. In contrast to conventional copper surfaces, the conductivity of graphene yields better performance due to the increased carrier concentration and minimized film resistance, optimizing the energy scavenged using the concept of rectifying electromagnetic waves into DC power.
An Approach of Node Model TCnNet: Trellis Coded Nanonetworks 853
On the other hand, the increased demand for self-sustaining systems converges with one of the main objectives of this research project: the use of energy harvesting, that is, the collection of residual energy from sources present in the environment. In addition to the already known energy sources (solar energy, heat gradient, thermoelectric, electromagnetic, wind and others), the radio-frequency (RF) signals present in urban environments represent an interesting energy source with recycling potential, considering that they are already integrated into smart devices [10, 11].
Research on Energy Harvesting technologies is promising for the near future due
to the speed with which electronic devices with low energy consumption and use in
Wireless Sensor Networks (WSNs) are emerging.
An examination of the electromagnetic spectrum shows the feasibility of applying the energy harvesting technique, considering the region of the spectrum where the best energy efficiency can be obtained for use by WSN nodes.
In this analysis, it is possible to identify the viable energy intensity distributed over the frequency bands from very low frequency (VLF) to super high frequency (SHF), corresponding to frequencies from 10 kHz to 30 GHz [12].
Previous works show power efficiencies for energy harvesting corresponding to 7.0 µW at 900 MHz and 1.0 µW at 2.4 GHz, with coverage areas of 40 m radius, and experiments at 1.584 MHz, in the Amplitude Modulated (AM) band, with average currents of 8 µA [13].
Fig. 3. Signal patterns obtained in the research using a spectrum analyzer [14]
Outdoor surveys in urban regions of cities like Tokyo, using a spectrum analyzer as receiver and a dipole antenna, obtained spectrum reading patterns as shown in Fig. 3, where signal levels of −15 dBm can be observed close to the 800 MHz band, and levels close to 0 dBm in the 920 MHz band, corresponding to mobile telephony [14].
The application of energy harvesting in WSN scenarios is justified by the large number of low-cost sensors working collaboratively, collecting data and transmitting it to a sink node (base station) in IoT applications, which makes it difficult to replace their energy sources. The combination of energy harvesting with the small charges required by the batteries guarantees the WSN's extended autonomy.
The scenario shown in Fig. 4 illustrates the power consumption of a WSN node
where current peaks occur during transmission, reception and shows the battery recharge
periods when the node is not being requested [15].
2.2 Code Generators - FSM and Trellis Decoders TCNet Node Configuration
The TCNet node configuration is associated with the states of a Finite State Machine (FSM) acting as code generator, with trellis-based decoders implemented as firmware, contributing to building links in the network using the state transitions of the convolutional codes proposed in [1] and [2].
Integrating the sensor node on the same Graphene Composite Substrate (GCS) allows the FSM registers to be implemented with Graphene Field Effect Transistors (GFETs) in a low-complexity topology (XOR gates and shift registers) following the concept of a Mealy Machine (MM), as shown in the configuration in Fig. 5.
Fig. 5. Example of: (a) The code generator, configured by an MM with the input sequence Kn
(t) generating an output sequence out n (t) = (c1, c2); (b) Trellis decoding for a 4-node network
implemented as firmware
Wireless communication of the sensor node integrated on the same Graphene Composite Substrate (GCS) is possible with the use of plasmonic antennas, or graphennas, through electromagnetic waves in the THz range (0.1 to 10 THz), below optical communications; this allows surface radiation in the transmission range of antennas made of graphene compounds, thus reducing the size of the radiating structures. The low complexity of the TCNet concepts justifies its application in nanodevice networks such as TCnNet using the Wireless Network-on-Chip (WNoC) paradigm [16].
The propagation and detection of plasmonic waves was proposed in 2000 by the group of Professor Harry Atwater [17]. This approach consists of the coupling between the electromagnetic (EM) field and the free charges in the metal, which propagates at the metal-dielectric interface as in a waveguide, and is called a Surface Plasmon Polariton (SPP).
Figure 6 shows the description of SPP waves using classical electromagnetism where
the relationship between the charge distribution on the metal surface and the electric field
attenuates exponentially from the interface. The oscillation of the wave coupled to the
electromagnetic field propagates as a “package” in the x-direction.
Fig. 6. Visualization of the SPP wave, resulting from the coupling between the EM field and the free charges on the metal surface [17]. In TM polarization, the field components are Hy, Ex and Ez
The conductivity of graphene has been considered both for DC and for frequencies ranging into the terahertz band (0.1–10 THz). The experiments show the RF results obtained from the graphene laminate processes illustrated in Fig. 1, taking advantage of the graphene laminate's flexibility when printed on paper or plastic, which is very important for flexible electronics such as wearable and RFID dipole antenna applications [18]. As displayed in Fig. 7(a), the gain obtained reaches peaks of −1 dBi between 930 MHz and 990 MHz, and the radiation patterns in Fig. 7(b)–(c) show a typical dipole shape, demonstrating that a printed graphene laminate dipole antenna can radiate effectively.
Fig. 7. (a) Gain of graphene laminate dipole antenna. Measured gain radiation patterns: (b)
Elevation plane and (c) Azimuth plane
Establishing links between nanonetwork nodes in the THz band takes advantage
of the huge bandwidth, allowing high-speed transmission rates with very low energy
consumption, using low-complexity hardware. Figure 8(a) shows a simple conceptual
implementation of a Wireless Network-on-Chip (WNoC). On the other hand, the com-
plexity of the channels in the THz band must be considered limited to distances of
a few meters due to attenuation and noise. Figure 8(b), from the research in [19], illustrates the behavior of a section (L × W) of graphene subjected to THz radiation, and Fig. 8(c) shows the resonance frequency of a graphene nano-antenna with dimensions L = 5 µm and W = 10 µm.
The results obtained in a previous work [20] show the energy consumed in a WSN,
considering: Transmission (tx), Reception (rx), Processing (proc) and Guard Band (bg)
situations using the IEEE 802.11b Standards:
The contribution of the energy consumption of the node to obtain the necessary
consumption of the network was considered by Eq. 1:
ΣE(n) = Etx + Erx + Eproc + Ebg (1)
Fig. 8. (a) Schematic diagram of the TCnNet wireless network-on-chip (WNoC), in which wireless links between nodes are omitted for simplicity. (b) Graphene section considered. (c) Resonance frequency obtained from the graphene nano-antenna (L = 5 µm and W = 10 µm) in the THz band
Tests were done with an eight-node network using the OMNeT++ simulation environment, based on C++ [21], where the sink node sends a query with CBR traffic to verify the reachability of the nodes. Figure 9 shows the node model used, configured by an MM with k/n = 1/2, together with the respective trellis decoder; the results for the energy consumed by the nodes in the network using OMNeT++ are shown in Table 1.
Fig. 9. (a) MM with k/n = ½, resulting output words (n1, n2); (b) Trellis diagram for an 8-node
network corresponding to the MM and (c) Energy distributed among TCNet Nodes
Table 1 shows the required energy consumption ΣE(n) for a network with 8 nodes, obtained in the OMNeT++ simulation environment [21] using Eq. 1.
Table 1. Individual contribution of energy consumed by network nodes in joule (J) [1] and [2]
3 Conclusion
This research studies the feasibility of obtaining an integrated model of a nanonetwork node on a Graphene Composite Substrate (GCS), exploring the mechanical, electrical and self-sustainability characteristics necessary for Internet of Things (IoT) infrastructures. The techniques presented correspond to the state of the art of research that can be integrated into the nodes of a nanonetwork, taking advantage of the efficiency of graphene, a layer of carbon atoms with a honeycomb crystal lattice configuration that has attracted the attention of the scientific community due to its unique electrical characteristics. Graphene can contribute all the characteristics that nanodevices need: low energy consumption, scalability and broadband communication in the network, in addition to innovative mechanical aspects such as flexibility, reduced thickness and optical transparency.
References
1. Lima Filho, D.F., Amazonas, J.R.: Robustness situations in cases of node failure and packet
collision enabled by TCNet: Trellis Coded Network - a new algorithm and routing protocol.
In: Pathan, A.S., Fadlullah, Z., Guerroumi, M. (eds.) SGIoT 2018. LNICST, vol. 256, pp. 100–
110. Springer, Cham (2019). https://doi.org/10.4108/eai.7-8-2017.152992
2. Lima, D.F., Amazonas, J.R.: Robustness situations in cases of node failure and packet collision
enabled by TCNet: Trellis Coded Network – a new algorithm and routing protocol. In: The
2nd EAI International Conference on Smart Grid Assisted Internet of Things, Niagara Falls,
Canada, 11 July 2018. http://sgiot.org/2018
3. Neves, A.I.S., et al.: Transparent conductive graphene textile fibers. Sci. Rep. 5, 9866-1–
9866-7 (2015)
4. Kumar, S., Kaushik, S., Pratap, R., Raghavan, S.: Graphene on paper: a simple, low-cost
chemical sensing platform. ACS Appl. Mater. Interfaces 7(4), 2189–2194 (2015)
5. Novoselov, K.S., et al.: Electric field effect in atomically thin carbon films. Science 306(5696),
666–669 (2004)
6. Zhu, J., Yang, D., Yin, Z., Yan, Q., Zhang, H.: Graphene and graphene based materials for
energy storage applications. Small 10(17), 3480–3498 (2014)
7. Huang, X., et al.: Binder-free highly conductive graphene laminate for low cost printed
radiofrequency applications. Appl. Phys. Lett. 106(20), 203105-1–203105-4 (2015)
8. Mattevi, C., et al.: A review of chemical vapour deposition of graphene on copper. J. Mater.
Chem. 21, 3324–3334 (2011)
9. Torres, L., Armas, L., Seabra, A.: Optimization of micromechanical cleavage technique of
natural graphite by chemical treatment, January 2014. https://doi.org/10.4236/graphene.2014.
31001. http://www.scirp.org/journal/graphene
10. Kim, S., Vyas, R., Niotaki, K., Collado, A., Georgiadis, A., Tentzeris, M.M.: Ambient
RF energy-harvesting technologies for self-sustainable standalone wireless sensor platforms.
Proc. IEEE 102(11) (2014)
11. Paradiso, A.J., Starner, T.: Energy scavenging for mobile and wireless electronics. IEEE
Pervasive Comput. 4(1), 18–27 (2005)
12. Mantiply, E.D., Pohl, K.R., Poppell, S.W., Murphy, J.A.: Summary of measured radio fre-
quency electric and magnetic fields (10 kHz to 30 GHz) in the general and work environment.
Bioelectromagnetics 18(8), 563–577 (1997)
13. Le, T.T.: Efficient power conversion interface circuits for energy harvesting applications.
Doctor of philosophy thesis, Oregon State University, USA (2008)
14. Tentzeris, M.M., Kawahara, Y.: Novel energy harvesting technologies for ICT applications.
In: IEEE International Symposium on Applications and the Internet, pp. 373–376 (2008)
15. Vullers, R.J.M., et al.: Micropower energy harvesting (2009)
16. Abadal, S., Alarcón, E., Lemme, M.C., Nemirovsky, M., Cabellos-Aparicio, A.: Graphene-
enabled wireless communication for massive multicore architectures. IEEE Commun. Mag.
51(11), 137–143 (2013)
17. Atwater, H.A.: The promise of plasmonics. Sci. Am. 296, 38–45 (2007)
18. Huang, X., et al.: Binder-free highly conductive graphene laminate for low cost printed radio
frequency applications. Appl. Phys. Lett. 105, 203105 (2015). https://doi.org/10.1063/1.491
9935
19. Llatser, I., Kremers, C., Cabellos-Aparicio, A., Jornet, J.M., Alarcón, E., Chigrin, D.N.:
Graphene-based nano-patch antenna for terahertz radiation. Photonics Nanostruct. Fundam.
Appl. 10, 353–358 (2012)
20. Lima, D.F., Amazonas, J.R.: Novel IoT applications enabled by TCNet: Trellis Coded
Network. In: Proceedings of ICEIS 2018, 20th International Conference on Enterprise
Information Systems (2018). http://www.iceis.org
21. Varga, A.: OMNeT++ Discrete Event Simulation System (2011). http://www.omnetpp.org/
doc/manual/usman.html
CAD Modeling and Simulation of a Large
Quadcopter with a Flexible Frame
1 Introduction
There has been a growing interest in the field of robotics. In fact, several indus-
tries require robots to replace the human presence in dangerous and onerous sit-
uations. Among them, a wide area of the research is dedicated to aerial robots.
Vertical Take-Off and Landing (VTOL) systems represent a valuable class of flying robots. Quadrotors have been successfully used for monitoring and inspection of rural and remote areas, as well as in many other industrial applications.
A quadrotor, with four propellers around a main body, presents the advan-
tage of having quite simple dynamic features. Quadrotors have gained a large
amount of interest due to their high manoeuvrability and multi-purpose usage.
They are a suitable alternative for monitoring, ground mapping, agricultural
and environmental preservation [1–3]. Self-sustainable quadrotors require auto
charging to allow a long operating time and a wider coverage area. However, the
difficulties of power consumption and mission planning lead to the challenge of
optimal sizing of the power supply such as the case of solar powered quadrotors
[1]. In [1], the main objective was to allow a large enough structure and mini-
mize the total weight while maintaining the system rigidity. Results show that
the optimal design of the quadrotor platform system is dependent on PV panel
size, and total weight, which affect the output power of the PV system, as well
as the power consumption profile.
In most of the literature, the discussion on quadcopters is focused on rigid structural frames. In such scenarios, Newton-Euler or Lagrangian equations can be used to derive the dynamic equations. However, in the case of a large quadcopter a different approach needs to be followed. The frame's large size leads to a flexible structure with possible bending modes, which is challenging for the control system design. To deal with these challenges this
paper presents a CAD based modeling approach using SolidWorks and MAT-
LAB/Simulink to generate a realistic mathematical model based on the actual
parameters and material properties of the quadcopter. To deal with the challenge
of the platform flexibility, this paper presents an optimal approach for sizing the
frame and platform structure for solar quadrotors to allow a systematic model-
ing and design of quadrotors with large size and flexible frames. This is achieved
by using a modern approach to computer modeling of quadcopters through the
integration process of SolidWorks CAD modeling and MATLAB/Simulink envi-
ronments. This is followed by identifying the resonant frequencies of the model
using ANSYS Workbench.
There are multiple parameters, including the principal axes of inertia, the moments of inertia and the location of the centre of mass, that need to be obtained for evaluating the kinematics and dynamics of the quadcopter. This is not an easy task, and usually experiments are employed to identify them. In this work, as the material properties are incorporated while working in SolidWorks, these parameters are computed within the software. A procedure to export the
quadcopter along with the properties from SolidWorks to MATLAB/Simulink
is discussed. Next, the techniques for creating an improved Simulink layout are
also specified. The academic contribution of this paper is the detailed procedure
862 A. Roshan and R. Dhaouadi
and adjustments that need to be followed while working with these software tools for a large quadcopter.
The paper is organized as follows: Sect. 1 gives an introduction of quadro-
tors and their immense potential. Section 2 describes modeling the quadrotor
in SolidWorks software as well as in MATLAB software. Section 3 identifies the
resonant frequencies of the model. Section 4 investigates the transfer functions obtained from the mathematical modeling techniques as well as from MATLAB. Section 5 presents the discussion. Section 6 summarises the work performed.
In the case of quadcopters, studies usually focus on a first-principles approach, where the body's equations of motion are defined to determine the forces and moments applied to a dynamic model. With advanced modeling techniques, it is now possible to define the dynamics of the system and apply propeller forces to understand how the quadcopter will behave, and then develop the control strategies. This approach simplifies the design process, as one need not derive the equations to analyze the behaviour of a body. In addition, it is possible to import a CAD model into the simulation software, which further facilitates the overall work [2,4–6]. Before testing in the real world, simulation software helps us understand the dynamics of the system.
In this section, a quadcopter was modeled using SolidWorks. It was then imported into MATLAB Simscape Multibody. Certain techniques were also implemented to create an improved block diagram layout in Simscape Multibody, and these are also described. Finally, a lift force and a torque were applied to the propellers and the simulation was analyzed using Mechanics Explorer in MATLAB.
As can be seen from Table 1, the moments of inertia about the x and the z axes (i.e., Ixx and Izz) have the same value. This means the quadcopter has a high degree of symmetry with respect to these axes. On the other hand, the moment of inertia about the y axis (Iyy) is almost twice that of the other two. This implies that it is easier to change the angular speed about the x or the z axes than about the y axis.
A few components (namely the motor mount) were assembled beforehand, and the assembled components were then imported as sub-assemblies in the final work. This was done to create a better block diagram layout in Simscape Multibody after importing. MATLAB Simscape Multibody creates a block diagram for each component in the SolidWorks model. Grouping the components as a sub-assembly helps to create improved block diagrams in the XML document while working in MATLAB Simscape Multibody.
An axis transformation was also carried out, in such a manner that the X-axis and the Y-axis lie along the arms of the quadcopter and the Z-axis points upwards, following the right-hand rule.
A platform was also incorporated such that the quadcopter rests on the platform. The Spatial Contact Force block from the Simulink library is used for this purpose, connecting the World frame and the quadcopter.
The XML import creates the geometry of the body. At the same time, some changes need to be made in the block diagram environment. In the imported file, 6-DoF joints were observed between the rigid transform blocks, connectors, and frames. These are unnecessary blocks that need to be eliminated. Hence, all 6-DoF joints between rigid transform blocks, connectors, and frames were deleted. As we are interested in the motion of the quadcopter with respect to the World frame, a 6-DoF joint was inserted after the World frame. In MATLAB Simscape Multibody, revolute joints are used for rotational motions and prismatic joints for translational motions. It was also observed that revolute joints and prismatic joints were positioned between multiple parts of the quadcopter where no such motions exist in the real world. Such joints were also eliminated.
CAD Modeling and Simulation of a Large Quadcopter with a Flexible Frame 867
\[
\text{LiftForce} = C_t\,\omega^2, \qquad \text{Torque} = C_q\,\omega^2 \tag{1}
\]
where
\[
C_t = \frac{\rho\, K_t\, D_r^4}{(2\pi)^2}, \qquad C_q = \frac{\rho\, K_q\, D_r^5}{(2\pi)^2}
\]
ρ - air density (taken as 1.225 kg/m³)
K_q - torque coefficient
K_t - lift force coefficient
D_r - propeller diameter
ω - motor speed (rad/s)
C_t - lift force constant (taken as 44 × 10⁻⁶ N·s²)
C_q - torque constant (taken as 5.96 × 10⁻⁶ N·m·s²)
The values for ρ, C_t and C_q were taken from [13] and [14].
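As a quick numeric check, Eq. (1) with these constants can be evaluated directly (a minimal sketch; the 400 rad/s test speed is an arbitrary illustrative value, not one from the paper):

```python
# Propeller lift force and torque from Eq. (1); C_T and C_Q are the
# constants the paper takes from [13] and [14].
C_T = 44e-6    # lift force constant, N*s^2
C_Q = 5.96e-6  # torque constant, N*m*s^2

def lift_force(omega):
    # omega: motor speed in rad/s
    return C_T * omega ** 2

def torque(omega):
    return C_Q * omega ** 2

lift_force(400.0)  # -> 7.04 N
torque(400.0)      # -> 0.9536 N*m
```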
Let ω1 , ω2 , ω3 and ω4 be the angular velocities of the front, right, rear and
left propellers respectively, and U1 , U2 , U3 and U4 be the thrust, rolling moment,
pitching moment and yawing moment respectively. ‘L’ is the distance between
the centre of the quadcopter and the centre of the propeller. Then,
In the MATLAB model, the inputs are the thrust, rolling moment, pitching moment and yawing moment. The angular velocities need to be calculated from them. For this, the above equations were inverted [15] as follows:
\[
\begin{aligned}
\omega_1^2 &= \frac{1}{4C_t}\,U_1 - \frac{1}{2LC_t}\,U_3 - \frac{1}{4C_q}\,U_4\\
\omega_2^2 &= \frac{1}{4C_t}\,U_1 - \frac{1}{2LC_t}\,U_2 + \frac{1}{4C_q}\,U_4\\
\omega_3^2 &= \frac{1}{4C_t}\,U_1 + \frac{1}{2LC_t}\,U_3 - \frac{1}{4C_q}\,U_4\\
\omega_4^2 &= \frac{1}{4C_t}\,U_1 + \frac{1}{2LC_t}\,U_2 + \frac{1}{4C_q}\,U_4
\end{aligned} \tag{3}
\]
These equations were used in the MATLAB Simulink model.
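A sketch of this inversion in code. The arm length L = 0.5 m is an assumed placeholder, not a value from the paper, and the round-trip check uses the forward mixing that is algebraically consistent with Eq. (3):

```python
import numpy as np

C_T, C_Q = 44e-6, 5.96e-6  # lift and torque constants from Eq. (1)
L = 0.5                    # arm length in metres (assumed placeholder)

def rotor_speeds(U1, U2, U3, U4):
    # Invert the thrust/moment mixing, Eq. (3); returns omega_1..omega_4 in rad/s
    w_sq = [
        U1 / (4 * C_T) - U3 / (2 * L * C_T) - U4 / (4 * C_Q),
        U1 / (4 * C_T) - U2 / (2 * L * C_T) + U4 / (4 * C_Q),
        U1 / (4 * C_T) + U3 / (2 * L * C_T) - U4 / (4 * C_Q),
        U1 / (4 * C_T) + U2 / (2 * L * C_T) + U4 / (4 * C_Q),
    ]
    return np.sqrt(w_sq)

w = rotor_speeds(30.0, 0.1, 0.1, 0.02)
# Round trip through the forward map implied by Eq. (3):
U1 = C_T * np.sum(w ** 2)                                   # total thrust
U2 = L * C_T * (w[3] ** 2 - w[1] ** 2)                      # rolling moment
U3 = L * C_T * (w[2] ** 2 - w[0] ** 2)                      # pitching moment
U4 = C_Q * (w[1] ** 2 + w[3] ** 2 - w[0] ** 2 - w[2] ** 2)  # yawing moment
```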
As an outrunner motor was used in the model, a revolute joint was attached to the motor. This causes the motor to spin. The propeller is mated in the SolidWorks assembly in such a way that when the motor spins, the propeller spins with it. A revolute joint is attached to all four motors. It should be noted that all revolute joints in Simscape Multibody cause the follower frame to rotate with respect to the base frame about the Y-axis only. Hence, if the desired axis of rotation of the component is not the Y-axis, a rigid transform block can be used to align the axis of rotation. To cause the rotation, an actuation value should be fed to the revolute joint [9,10]. The actuation torque setting in the revolute joint gives us three options:
– None
– Provided by Input
– Automatically Computed
In our model, the ‘Provided by Input’ option was selected, and the calculated torque was fed to the revolute joint.
These changes were made to the XML file imported from SolidWorks so that it simulates real-world mechanics. In the next step, the run button in the Simulink window is selected to view the simulated model in Mechanics Explorer in MATLAB. The Mechanics Explorer window is shown in Fig. 8.
There are multiple studies [16-19] in which the 3D CAD modeling was completed in SolidWorks and followed by simulations in ANSYS. The SolidWorks Connected Help webpage [20] describes a method to export a SolidWorks file into ANSYS. A list of file types that are compatible with both SolidWorks and ANSYS [21] is given in Table 3:
Table 3. Formats that are compatible with both SolidWorks and ANSYS
Mode 1 2 3 4 5 6
Frequency (Hz) 0 0 2.4528e−3 2.4562 3.3645 3.5357
Mode 7 8 9 10 11 12
Frequency (Hz) 6.6066 9.6004 12.922 13.433 16.317 22.648
Mode 13 14 15 16 17 18
Frequency (Hz) 25.636 25.736 31.217 36.268 64.734 67.407
A transfer function represents the relationship between the input and the output of a component or a system.
In this section, the transfer function of the quadcopter is determined by two methods:
\[
\begin{aligned}
\text{Total Thrust}\ (U_1) &= T_1 + T_2 + T_3 + T_4\\
\text{Rolling Moment}\ (U_2) &= L\,(T_3 - T_4)\\
\text{Pitching Moment}\ (U_3) &= L\,(T_1 - T_2)\\
\text{Yawing Moment}\ (U_4) &= Q_1 + Q_2 + Q_3 + Q_4
\end{aligned} \tag{4}
\]
Let:
p - rate of change of roll angle in the body axis system
q - rate of change of pitch angle in the body axis system
r - rate of change of yaw angle in the body axis system
To relate the Euler angle rates to the body angular rates, we use the rotation matrix from Eq. 5:
\[
\begin{bmatrix} \dot{\phi} \\ \dot{\theta} \\ \dot{\psi} \end{bmatrix} =
\begin{bmatrix}
c\psi & -s\psi & 0 \\
\dfrac{s\psi}{c\phi} & \dfrac{c\psi}{c\phi} & 0 \\
s\psi\, t\phi & c\psi\, t\phi & 1
\end{bmatrix}
\begin{bmatrix} p \\ q \\ r \end{bmatrix} \tag{6}
\]
From [24-27], differentiating Eq. 6 and substituting the inertia matrix gives the following equations:
\[
\ddot{\Phi} = \dot{\psi}\,\dot{\theta}\,c\Phi + \frac{c\psi\,U_2}{I_{xx}} - \frac{s\psi\,U_3}{I_{yy}} \tag{7}
\]
\[
\ddot{\theta} = \frac{\dot{\psi}\,\dot{\Phi}}{c\Phi} + \dot{\psi}\,\dot{\theta}\,\tan\Phi + \frac{s\psi\,U_2}{c\Phi\,I_{xx}} + \frac{c\psi\,U_3}{c\Phi\,I_{yy}} \tag{8}
\]
At equilibrium,
\[
\frac{d^2 z}{dt^2} = \frac{\Delta U_1}{m}, \qquad
s^2\,\Delta z(s) = \frac{\Delta U_1(s)}{m}, \qquad
\frac{\Delta z(s)}{\Delta U_1(s)} = \frac{1}{m\,s^2} \tag{13}
\]
Here altitude (the translational motion along the z-axis) is the output and thrust (U_1) is the input.
the transfer functions for the roll, pitch, yaw and altitude. Figure 11 shows the altitude-time graph for the initial 5 s. The PRBS signal is shown in Fig. 12. The internal structure of the Simulink block for altitude control is shown in Fig. 13. Figure 14 shows the pitch curve with the PRBS input.
The transfer functions for roll, pitch, yaw and altitude obtained by the System Identification Toolbox are shown in Table 5.
Transfer functions:
\[
\text{Altitude: } G(s) = \frac{\Delta z(s)}{\Delta U_1(s)} = \frac{6.135\,(s^2 - 1.506s + 431.8)}{(s + 38.81)(s + 0.5191)(s^2 + 6.611s + 67.66)}
\]
\[
\text{Roll: } G(s) = \frac{\Delta\phi(s)}{\Delta U_2(s)} = \frac{0.004297\,(s - 1.884)(s - 0.06632)}{(s^2 + 0.3575s + 0.1322)(s^2 + 1.817s + 1.597)}
\]
\[
\text{Pitch: } G(s) = \frac{\Delta\theta(s)}{\Delta U_3(s)} = \frac{0.004982\,(s - 0.9959)(s - 0.1242)}{(s + 0.9572)(s + 0.2654)(s^2 + 0.4334s + 0.1165)}
\]
\[
\text{Yaw: } G(s) = \frac{\Delta\psi(s)}{\Delta U_4(s)} = \frac{-0.358\,(s - 5.138)(s + 0.2242)}{(s + 1.096)(s + 0.1277)(s^2 + 7.8173s + 6.416)}
\]
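The factored forms in Table 5 can be expanded and evaluated numerically; a minimal sketch with NumPy, building the altitude transfer function and reading off its DC gain:

```python
import numpy as np

# Altitude transfer function from Table 5, assembled from its factored form
num = 6.135 * np.array([1.0, -1.506, 431.8])
den = np.polymul(np.polymul([1.0, 38.81], [1.0, 0.5191]),
                 [1.0, 6.611, 67.66])

def G(s):
    # Evaluate G(s) = num(s) / den(s) at a (possibly complex) frequency s
    return np.polyval(num, s) / np.polyval(den, s)

dc_gain = G(0.0)  # steady-state altitude change per unit of thrust change
```

The same pattern applies to the roll, pitch and yaw entries by swapping in their factors.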
5 Discussions
6 Conclusion
References
1. Dhaouadi, R., Takrouri, M., Shapsough, S., Al-Bashayreh, Q.: Modelling and design of a large quadcopter. In: Proceedings of the Future Technologies Conference (FTC), vol. 1, pp. 451–467 (2021)
2. Jatsun, S., Lushnikov, B., Leon, A.S.M.: Synthesis of SimMechanics model of quad-
copter using SolidWorks CAD translator function. In: Proceedings of 15th Interna-
tional Conference on Electromechanics and Robotics “Zavalishin’s Readings”, pp.
125–137 (2020)
3. Shreurs, R.J.A., Tao, H., Zhang, Q., Zhu, J., Xu, C.: Open loop system identifica-
tion for a quadrotor helicopter. In: 10th IEEE International Conference on Control
and Automation, Hangzhou, China, 12–14 June 2013 (2013)
4. Cekus, D., Posiadala, B., Warys, P.: Integration of Modeling in SolidWorks and
MATLAB/Simulink Environments. Archive of Mechanical Engineering, Vol. LXI
(2014)
5. Gordan, R., Kumar, P., Ruff, R.: Simulating Quadrotor Dynamics using Imported
CAD Data. Modeling and Simulating II: Aircraft, Mathworks (2013)
6. Shaqura, M., Shamma, J.S.: An automated quadcopter CAD based design and modeling platform using SolidWorks API and smart dynamic assembly. In: 14th International Conference of Informatics in Control, Automation and Robotics, vol. 2, pp. 122–131 (2017). https://doi.org/10.5220/0006438601220131
7. Performance Composites, Mechanical Properties of Carbon Fibre Composite Mate-
rials, Fibre / Epoxy Resin. http://www.performance-composites.com/carbonfibre/
mechanicalproperties 2.asp
8. MathWorks R2021b, Install the Simscape Multibody Link Plugin. https://ww2.
mathworks.cn/help/physmod/smlink/ug/installing-and-linking-simmechanics-
link-software.html
9. MATLAB Simulink, SimMechanics User’s Guide, The MathWorks, United States
of America
10. MathWorks R2021b, Revolute Joint. https://ww2.mathworks.cn/help/physmod/
sm/ref/revolutejoint.html
11. Tijonov, K.M., Tishkov, V.V.: SimMechanics Matlab as a dynamic modeling tool
complex aviation robotic systems. J. Trudy MAI 41, 1–19 (2010)
12. Blinov, O.V., Kuznecov, V.B.: The study of mechanical systems in the environment
of SimMechanics (MatLab) using the capabilities of three-dimensional modeling
programs. Ivanovo State Polytechnic University (2012)
13. International Civil Aviation Organisation (ICAO): ICAO Standard Atmosphere,
Doc 7488-CD (1993)
14. T-motor: Test Report - Load Testing Data, MN3508 KV380 Specifications. https://store.tmotor.com/goods.php?id=354
15. Bresciani, T.: Modelling, Identification and Control of a Quadrotor Helicopter.
M.Sc. Lund University (2008)
16. Hassan, M.A., Phang, S.K.: Optimized autonomous UAV design for duration
enhancement. In: 13th International Engineering Research Conference, Malaysia,
27 November 2019, pp. 030004–1:030004–10. AIP Publishing (2020). https://doi.
org/10.1063/5.0001373
17. Pretorius, A., Boje, E.: Design and modelling of a quadrotor helicopter with vari-
able pitch rotors for aggressive manoeuvres. In: The 19th International Federation
of Automatic Control World Congress, South Africa, pp. 12208–12213 (2014)
18. Ibrahim, S., Alkali, B., Oyewole, A., Alhaji, S.B., Abdullahi, A.A., Aku, I.: Prelimi-
nary structural integrity investigation for quadcopter frame to be deployed for pest
control. Proceedings of Mechanical Engineering Research Day 2020, pp. 176–177
(2020). http://repository.futminna.edu.ng:8080/jspui/handle/123456789/9863
19. Ersoy, S., Erdem, M.: Determining unmanned aerial vehicle design parameters for
air pollution detection system. Online J. Sci. Technol. 10(1), 6–18 (2020)
20. SolidWorks Connected Help. Export Options - ANSYS, PATRAN, IDEAS, or
Exodus. http://help.solidworks.com/2021/English/SWConnected/cworks/IDH
HELP PREFERENCE EXPORT OFFSET.htm?id=f937f775202444789000e092
f83f3c2b
21. LIGO Laboratory, (2004). SW-ProE-Ansys Compatible file types.pdf. https://
labcit.ligo.caltech.edu/ctorrie/QUADETM/MPL/SW-ProE-Ansys Compatible
file types.pdf
22. Bedri, R., Al-Nais, M.O.: Prestressed modal analysis using finite element package
ANSYS. In: International Conference on Numerical Analysis and Its Applications,
pp. 171–178 (2004)
23. ANSYS: Lecture 8: Modal Analysis. Introduction to ANSYS Mechanical,
pp 10. https://www.clear.rice.edu/mech517/WB16/lectures trainee/Mechanical
Intro 16.0 L08 Modal Analysis.pdf
24. Fernando, E., De Silva, A., et al.: Modelling simulation and implementation of
a quadrotor UAV. In: 2013 IEEE 8th International Conference on Industrial and
Information Systems, pp. 207–212 (2013). https://doi.org/10.1109/ICIInfS.2013.
6731982
25. Balas, C.: Modelling and Linear Control of a Quadrotor. Master thesis. University
of Cranfield (2007)
26. Wang, P., Man, Z., Cao, Z., Zheng, J., Zhao, Y.: Dynamics modelling and linear
control of a quadcopter. In: Proceedings of the 2016 International Conference on
Advanced Mechatronic Systems, Melbourne, Australia (2016)
27. Dong, W., Gu, G.Y., Zhu, X., Ding, H.: Modelling and control of a quadrotor UAV
with aerodynamic concepts. Int. J. Aerospace Mech. Eng. 7(5), 901–906 (2013)
Cooperative Decision Making
for Selection of Application Strategies
Western Norway University of Applied Sciences, P.O. Box 7030, 5020 Bergen, Norway
{sbe,Erik.Styhr.Petersen,Margareta.Holtensdotter.Lutzhoft}@hvl.no
1 Introduction
Our university department, like so many other organizations, can be characterized by having an area of excellence, a mission, a strategy, a set number of assignments and activities, and resources - human, financial and temporal - which are limited and fixed in size, at least in the short term. Besides teaching, research - and the associated publication of results - is a Key Performance Indicator (KPI) which is constantly in focus, aimed at providing the students, as well as any other interested stakeholder, with knowledge that is at the forefront of the subject field, and the resources available for research are often tied to successful research grants. Still comparable to most other organizations, sustaining or expanding research activities in a department such as ours means that externally funded research opportunities are continuously being considered and selected, in which case research applications are prepared and submitted as appropriate. Nothing is new about this, and neither are the potential challenges that result from this process. High-level, often national, regional, or even global agendas set the direction of the research subjects being offered by funding organizations and thus essentially dictate the direction of departmental research.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 880–887, 2023.
https://doi.org/10.1007/978-3-031-18461-1_58
Cooperative Decision Making 881
2 Background
2.1 Decision Making
“The AHP provides a comprehensive framework to cope with the intuitive, the
rational, and the irrational in us at the same time. It is a method we can use
to integrate our perceptions and purposes into an overall synthesis” [4]. Multi-
criteria decision-making methods addressing the measurement of the priorities
of conflicting tangible/intangible criteria are shown in [5]. For a free web-based AHP tool we refer to [14].
Tangible and intangible factors for supplier selection are discussed in [6]. University
R&D funding strategies are analyzed in [13]. Guidelines for competing for
research funding are provided in [1].
Let P be a non-empty ordered set. If sup{x, y} and inf{x, y} exist for all x, y ∈ P, then P is called a lattice [2].
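For a concrete illustration (our example, not from the paper): the divisors of 12 ordered by divisibility form a lattice, with sup given by the least common multiple and inf by the greatest common divisor:

```python
from math import gcd

# Divisors of 12 under the divisibility order
P = {1, 2, 3, 4, 6, 12}
lcm = lambda a, b: a * b // gcd(a, b)

# Lattice check: sup{x, y} = lcm and inf{x, y} = gcd must stay inside P
is_lattice = all(lcm(x, y) in P and gcd(x, y) in P for x in P for y in P)

# Dropping the top element breaks it: sup{4, 6} = 12 would be missing
Q = P - {12}
q_is_lattice = all(lcm(x, y) in Q and gcd(x, y) in Q for x in Q for y in Q)
```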
882 S. Encheva et al.
– Call C5 has a slightly higher aggregated value than C1 and C3, but satisfies a smaller number of criteria than the other two and should therefore be excluded from the priority list.
– Call C6 has the lowest aggregated value and should therefore be excluded from the priority list.
– Calls C7 and C1 have equal aggregated values, but C7 satisfies a smaller number of criteria than C1 and should therefore be excluded from the priority list.
Fig. 1. A concept lattice based on data shown in Table 2
Table 3. This is Table 2 populated with numeric values and calculated aggregated
values
Note that calculations carried out this way imply equal importance of all criteria. We suggest applying Yager's weights in case a team would prefer to emphasize the relevance of some of the listed criteria. If two or more calls have the same aggregated value and are still of interest to the team, we suggest additional discussions in order to select the most desirable option. The same applies in cases where the aggregated values are not significantly different.
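A minimal sketch of ordered weighted averaging with Yager-style weights; the scores and weight vectors below are illustrative, not taken from Table 3:

```python
def owa(values, weights):
    # Ordered Weighted Averaging (Yager): weights attach to rank positions,
    # not to particular criteria - the largest value receives weights[0]
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))

scores = [0.9, 0.4, 0.7]     # criterion scores for one hypothetical call
equal = [1 / 3] * 3          # equal importance of all criteria
top_heavy = [0.6, 0.3, 0.1]  # emphasizes the best-satisfied criteria

mean_agg = owa(scores, equal)        # equals the plain mean
optimistic = owa(scores, top_heavy)  # rewards strong criteria more
```

With equal weights the OWA operator reduces to the arithmetic mean used in Table 3; a non-uniform weight vector shifts the emphasis as the text suggests.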
4 Conclusion
References
1. Blume-Kohout, M.E., Kumar, K.B., Sood, N.: University R&D funding strategies
in a changing federal funding environment. Sci. Public Policy 42(3), 355–368 (2015)
2. Davey, B.A., Priestley, H.A.: Introduction to lattices and order. Cambridge Uni-
versity Press, Cambridge (2005)
3. Gaspars-Wieloch, H.: Modifications of the Hurwicz’s decision rule. CEJOR 22,
779–794 (2014)
4. Saaty, T.L.: The analytic hierarchy process: decision making in complex environ-
ments. In: Avenhaus, R., Huber, R.K. (eds.) Quantitative Assessment in Arms
Control. Springer, Boston (1994). https://doi.org/10.1007/978-1-4613-2805-6 12
5. Saaty, T.L., Ergu, D.: When is a decision-making method trustworthy? Criteria
for evaluating multi-criteria decision-making methods. Int. J. Inf. Technol. Decis.
Mak. (IJITDM) 14(06), 1171–1187 (2015)
6. Tahriri, F., Osman, M.R., Ali, A., Yusuff, R., Esfandiary, A.: AHP approach for
supplier evaluation and selection in a steel manufacturing company. J. Ind. Eng.
Manag. 1, 54–76 (2008)
7. Yager, R.R., Kacprzyk, J.: The Ordered Weighted Averaging Operators: Theory
and Applications. Kluwer, Norwell, MA (1997)
8. Yager, R.R.: OWA aggregation over a continuous interval argument with applica-
tions to decision making. IEEE Trans. Syst. Man Cybern. Part B 34(5), 1952–1963
(2004)
9. Wille, R.: Concept lattices and conceptual knowledge systems. Comput. Math.
Appl. 23(6–9), 493–515 (1992)
10. https://orangedatamining.com/
11. http://www.iro.umontreal.ca/∼galicia/
12. https://upriss.github.io/fca/fca.html
13. https://intranet.bloomu.edu/documents/research/ebook-funding.pdf
14. https://bpmsg.com/ahp/?lang=en
Dual-Statistics Analysis with Motion
Augmentation for Activity Recognition
with COTS WiFi
Ouyang Zhang
1 Introduction
Human activity recognition serves as a crucial part of numerous human-centered computing services, such as smart homes, elderly care, assisted living, etc. In the past decades, researchers have explored various techniques to achieve human activity recognition, such as camera-based [2], radar-based [1] and electronic wearable devices [3,24]. Camera-based approaches are restricted to line-of-sight (LoS) areas and require good light conditions. Also, the abundant image information can potentially threaten users' privacy. Low-cost radar systems also suffer from high directionality and a limited coverage area (tens of centimeters). By attaching devices to the user's body, researchers can infer the activity he/she engages in by analyzing data from sensors such as accelerometers or gyroscopes. However, attached sensors are neither desirable nor available in most applications. In contrast, WiFi devices provide the opportunity to achieve a low-cost system that avoids the above limitations with fewer privacy concerns.
Proposed Approach. In this work, we propose WiSen, a novel design paradigm
for device-free human activity recognition. WiSen is a passive detection system
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 888–905, 2023.
https://doi.org/10.1007/978-3-031-18461-1_59
Dual-Statistics Analysis with Motion Augmentation 889
Fig. 1. Experiment scenario for preliminary study.
built on commodity WiFi devices. The basic idea behind WiSen is to make full use of the channel information in the received signals. To achieve this, WiSen conducts a dual-statistics analysis to exploit the diversity across multiple subcarriers over the band, where a new processing methodology is applied to deal with the high-dimensional data. Moreover, WiSen augments the recognition performance with motion analysis.
The principle of activity recognition using WiFi signals is that different human activities introduce different channel conditions for the wireless propagation. By processing the channel state information (CSI) obtained from the NIC, the system is able to track changes in the surrounding environment and infer activities. CSI is spread over multiple subcarriers across the frequency band, where each subcarrier represents a small spectrum slice. Existing approaches [22] have analyzed the distribution of the CSI coefficient on a single subcarrier, which is stored as the profile to match the corresponding activity. However, information on a single subcarrier lacks the frequency diversity of the full CSI vector, which other works [21,26] have shown to be important for distinguishing different channel conditions. Figure 10 shows a simple but representative example with two-subcarrier CSI. As we can see, without utilizing the diversity between subcarriers, CSI 1 and CSI 2 are not distinguishable, since they have the same distance from the reference CSI ref, i.e., the sum of the distances on all subcarriers. Inspired by this, WiSen proposes to exploit statistics of the multiple-subcarrier CSI vector to enhance the system performance.
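The two-subcarrier example can be made concrete with toy numbers (ours, not the values behind Fig. 10): summing per-subcarrier distances erases the cross-subcarrier structure that the joint vector retains.

```python
# Reference CSI and two observed CSIs on a two-subcarrier channel
# (illustrative amplitudes only)
ref = (1.0, 1.0)
csi1 = (2.0, 1.0)  # deviates from ref on subcarrier 1 only
csi2 = (1.0, 2.0)  # deviates from ref on subcarrier 2 only

def summed_distance(a, b):
    # Per-subcarrier distances summed independently: the single-subcarrier view
    return sum(abs(x - y) for x, y in zip(a, b))

same_distance = summed_distance(ref, csi1) == summed_distance(ref, csi2)  # True
distinct_vectors = csi1 != csi2                                           # True
```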
Technical Challenges. The main challenge comes from the high dimension of the CSI vector, which is well known in the literature [5,6]. The essential difficulty of statistical analysis on high-dimensional data comes from the unknown explicit distribution function and from insufficient data, which leads to over-fitting of the excess parameters. In the 20 MHz Wi-Fi band, there are 56 subcarriers, so each CSI set is a vector of 56 complex values with a resolution of 10 bits from the NIC.
890 O. Zhang
Fig. 2. Drink.
2 System Design
CSI to Activity Profile. In this section, we introduce how to build profiles of activities from CSI. In a wireless environment, the transmitted and received signals are associated with a channel coefficient \(h_f = |h_f|\,e^{-j\angle h_f}\) at frequency f, which is a complex value. \(h_f\) reflects the channel conditions in that a longer distance results in more fading, which makes \(|h_f|\) smaller. Moreover, signals of different frequencies have different characteristics in scattering, fading, and power decay during propagation. Thus, the full CSI information across the frequency band of M subcarriers is an M×1 vector \(H = [h_1, h_2, \ldots, h_M]^T\).
With COTS WiFi, we should be aware that the phase information \(\angle h_f\) is unreliable due to the unsynchronized clocks of the sender and the receiver [20]. To construct the activity profile, we collect the amplitudes \(f_t = [|h_1|, |h_2|, \ldots, |h_{56}|]^T\) of the CSIs of the 56 subcarriers during the activity at timestamp t. Each activity generates a series of CSIs during a period of time, which we can sample at \(t_1, t_2, \ldots, t_T\). The profile for the activity is constructed as the following matrix:
\[
P = \left[\, f_{t_1}, f_{t_2}, \ldots, f_{t_T} \,\right]
\]
Fig. 4. Draw.
The next step is to apply the earth mover's distance (EMD) algorithm on the estimates across these instances. However, due to the challenge of high-dimensional data in our problem, we cannot easily estimate the statistics of time-series CSI vectors. As far as we know, there is currently no good solution for this dual-statistics analysis problem.
Fig. 5. Bend-over.
To solve this challenge, we borrow an idea from the computer vision community. Recently, the vision processing area has seen a giant advance by adopting neural network models on images¹. The underlying principle behind these models is to replace hand-crafted features with automatically extracted features through a multi-layer neural network. The back-propagation training can push the model to approximate any statistical function guided by the supervised data [13], thus eliminating the tedious and error-prone manual heuristic statistical modeling.
Different from CSI-fingerprint based localization [21,26], our problem is based on time-series data. Thus, a simple feedforward neural network (FNN) is not feasible, because each input datum is sampled at one timestamp. As such, FNN training results in a single-statistics model, which does not satisfy the requirement. By contrast, the recurrent neural network (RNN) has shown its strength in analyzing time-series data, as in speech recognition. In WiSen, we propose to utilize a sequence model - the recurrent neural network (RNN) - to learn the statistics from activity profiles.
Furthermore, the naive RNN model associates information only in the forward time direction. However, the loosely-defined activities in our problem do not impose a well-defined order of motions. Thus, WiSen adopts a bidirectional recurrent neural network (BiRNN [16]) to link information back and forth. In Sect. 4.1, we validate the effectiveness of BiRNN and its superiority over other models.
¹ Well-known models include AlexNet [10], VGG16 [17], Inception [18] and ResNet [7].
Fig. 6. Drink.
Fig. 8. Draw.
Fig. 9. Bend-over.
Figure 11 shows the BiRNN [16] model architecture with forward and backward layers. Specifically, our model uses the gated recurrent unit (GRU) cell [4], which has a capability similar to the LSTM cell [15] but fewer parameters. The input dimension is equal to the length of the CSI vectors, which is 56×4 with 2×2 MIMO. The dimension of the internal state is set to 120. The batch size is set to 10. To enhance generality, we use a dropout wrapper with a dropout rate of 0.5. The Adam optimizer [9] is used to adaptively change the learning rate to precisely reach the minimum cost. We implement this model in the TensorFlow framework, accepting a 56×4-dimensional CSI sequence as input and connecting the hidden states to a softmax output layer for activity classification.
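The forward pass of such a bidirectional GRU classifier can be sketched in plain NumPy. This is a structural illustration only: the weights are untrained random values, the 10-class output size is a placeholder, and the GRU update follows the standard Cho et al. convention rather than reproducing the paper's exact TensorFlow model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_params(d_in, d_h):
    # One (W, U, b) triple per gate; small random init (untrained)
    draw = lambda *shape: rng.normal(0.0, 0.1, shape)
    return {g: (draw(d_h, d_in), draw(d_h, d_h), np.zeros(d_h))
            for g in ("z", "r", "h")}

def gru_run(params, xs):
    # Standard GRU recurrence; returns the final hidden state
    Wz, Uz, bz = params["z"]
    Wr, Ur, br = params["r"]
    Wh, Uh, bh = params["h"]
    h = np.zeros(bz.shape[0])
    for x in xs:
        z = sigmoid(Wz @ x + Uz @ h + bz)              # update gate
        r = sigmoid(Wr @ x + Ur @ h + br)              # reset gate
        h_tilde = np.tanh(Wh @ x + Uh @ (r * h) + bh)  # candidate state
        h = (1.0 - z) * h + z * h_tilde
    return h

# Dimensions from the paper: 56 subcarriers x 4 antenna pairs, state size 120,
# sequence length 50; the 10 activity classes are a placeholder.
D_IN, D_H, T, N_CLASSES = 56 * 4, 120, 50, 10
fwd, bwd = gru_params(D_IN, D_H), gru_params(D_IN, D_H)
W_out = rng.normal(0.0, 0.1, (N_CLASSES, 2 * D_H))

def classify(seq):
    # Bidirectional: read the sequence both ways, concatenate final states
    state = np.concatenate([gru_run(fwd, seq), gru_run(bwd, seq[::-1])])
    logits = W_out @ state
    e = np.exp(logits - logits.max())
    return e / e.sum()  # softmax over activity classes

probs = classify(rng.normal(size=(T, D_IN)))
```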
Reduce Noise in CSI Values. To reduce the random noise in CSI measurements due to chipset imperfections, we average over five consecutive CSIs. With a packet rate of 1250 p/s, 5 CSIs span 4 ms.
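A sketch of this averaging step (the amplitudes are simulated; the 56-subcarrier width corresponds to a single antenna pair):

```python
import numpy as np

def average_csi(csi, group=5):
    # Average every `group` consecutive CSI vectors; at 1250 packets/s,
    # a group of 5 spans 4 ms
    n = (len(csi) // group) * group  # drop any incomplete tail
    return csi[:n].reshape(-1, group, csi.shape[1]).mean(axis=1)

rng = np.random.default_rng(0)
stream = rng.normal(10.0, 1.0, size=(1250, 56))  # one second of CSI amplitudes
smoothed = average_csi(stream)                   # shape (250, 56)
```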
Inconsistency in Activity Durations. Undoubtedly, the CSI segments of the activities will have various lengths. Without a fixed sequence length, the original data is unsuitable for the BiRNN model. To solve this issue, we deploy a fixed-length subsampling strategy, based on the observation that a sampled subset of the data also represents its statistics. We assume that each activity is longer than 2 s, so the total number of CSIs at 1250 p/s is larger than 500. In WiSen, we use 50 as the BiRNN sequence length, i.e., a time resolution of 0.04 s.²
Augmenting Training Data. Generally, in machine learning a larger training dataset can enhance generality and increase accuracy. Here, we propose a method to augment the training data. The idea is to fully utilize the data under the above sampling strategy. Specifically, since we evenly sample 50 CSIs as one instance, multiple training instances can be obtained by shifting the start of the sampling. A diagram demonstrating the idea is shown in Fig. 12. The rationale underneath is to account for the drift in detecting the start time of the activity.
² This resolution is good enough for human activities, whose speed is less than 8 m/s.
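The fixed-length subsampling and the shift-based augmentation can be sketched together; the 224-wide vectors correspond to 56 subcarriers × 4 antenna pairs, and the segment itself is simulated:

```python
import numpy as np

SEQ_LEN = 50  # BiRNN sequence length used by WiSen

def sample_instances(segment, n_shifts=5):
    # Evenly subsample SEQ_LEN CSIs from a variable-length activity segment;
    # shifting the sampling start yields additional training instances
    stride = len(segment) // SEQ_LEN         # >= 10 for a >= 2 s activity
    shifts = range(min(n_shifts, stride))
    return np.stack([segment[s : s + stride * SEQ_LEN : stride]
                     for s in shifts])

rng = np.random.default_rng(1)
segment = rng.normal(size=(2500, 224))       # a 2 s activity at 1250 p/s
instances = sample_instances(segment)        # shape (5, 50, 224)
```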
Preliminary Study. In this section, we propose that the above activity profile can be augmented by a motion profile to further boost the recognition performance. To demonstrate this, we conduct experiments to show the following observations: activities with a similar status (e.g., position and orientation) have similar distributions of CSIs but different motion (e.g., speed), so the motion profile can help to distinguish them. However, the motion profile alone is insufficient, because the other observation is that activities at different positions might have similar motion (e.g., speed).
In a typical office (Fig. 15), we deploy two laptops equipped with Atheros WiFi NICs. The user is guided to sit on a chair performing two activities, i.e., drinking water and spine-stretch. On another chair, the user performs two different activities, i.e., drawing and bend-over, as shown in Fig. 1. We measure and process the activity profile and motion profile from the CSI collection. Figures 6, 7, 8 and 9 show the histograms of CSI amplitude, where each dashed line represents one instance. For brevity, ‘L1A1’ (location 1 and activity 1) represents drinking water, ‘L1A2’ represents spine-stretch, ‘L2A1’ represents drawing and ‘L2A2’ represents bend-over. As we can see, ‘drink’ and ‘spine-stretch’ have similar histograms. This is the first observation above, which also applies to ‘draw’ and ‘bend-over’. On the other hand, Figs. 2, 3, 4 and 5 show the earth mover’s distance (EMD) [11,14] among the motion profiles. It is difficult to distinguish ‘L1A2’ from ‘L2A2’ and ‘L2A1’ from ‘L1A1’, because spine-stretch and bend-over involve the upper body while drinking and drawing involve just a hand, which is our second observation.
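For equal-size one-dimensional samples, the earth mover's (Wasserstein-1) distance reduces to the mean absolute difference of the sorted values, which is enough to sketch the profile comparison (the amplitude samples below are simulated, not the paper's measurements):

```python
import numpy as np

def emd_1d(a, b):
    # EMD between two equal-size 1-D samples = mean |sorted(a) - sorted(b)|
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    return float(np.mean(np.abs(a - b)))

rng = np.random.default_rng(2)
same_a = rng.normal(10.0, 1.0, 500)  # two instances of the "same" activity
same_b = rng.normal(10.0, 1.0, 500)
other = rng.normal(14.0, 1.0, 500)   # an activity with a shifted profile

close = emd_1d(same_a, same_b)
far = emd_1d(same_a, other)          # roughly the 4-unit mean shift
```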
CSI to Motion Profile. WiSen uses the CSI-speed model [20] to build the motion profile. With K dynamic paths from the transmitter to the receiver, the channel property h(f, t) can be represented as:
\[
h(f, t) = e^{-j2\pi \Delta f t}\left( h_s(f, t) + \sum_{k=1}^{K} a_k(f, t)\, e^{-j2\pi l_k(t)/\lambda} \right)
\]
where \(h_s(f)\) is the contribution of all static paths, \(l_k(t)\) is the path length, \(a_k(f, t)\) is the attenuation, and \(\Delta f\) is the frequency offset. With the target moving at speed \(v_k\), we have \(l_k(t) = l_k(0) + v_k t\) within a small period t. As such, we can derive the power \(|h(f, t)|^2\) as follows:
\[
\begin{aligned}
|h(f,t)|^2 = &\sum_{k=1}^{K} 2\,|h_s(f)\,a_k(f,t)|\cos\!\left(\frac{2\pi v_k t}{\lambda} + \frac{2\pi l_k(0)}{\lambda} + \phi_{sk}\right)\\
+ &\sum_{\substack{k,l=1\\ k\neq l}}^{K} 2\,|a_k(f,t)\,a_l(f,t)|\cos\!\left(\frac{2\pi (v_k - v_l) t}{\lambda} + \frac{2\pi (l_k(0) - l_l(0))}{\lambda} + \phi_{kl}\right)\\
+ &\sum_{k=1}^{K} |a_k(f,t)|^2 + |h_s(f)|^2
\end{aligned} \tag{2}
\]
The above reveals that the target's moving speed is directly related to the frequencies of the sinusoidal components in \(|h(f, t)|^2\). Therefore, after transforming the CSI series to the frequency domain, the speed profile of the target can be extracted.
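This frequency-speed relation can be sanity-checked by synthesizing a single-path power trace and reading the speed back off the spectrum; λ ≈ 6 cm is an assumed 5 GHz wavelength, not a value stated here:

```python
import numpy as np

fs = 1250          # packet (sampling) rate, as used by WiSen
wavelength = 0.06  # metres, assumed ~5 GHz WiFi
v_true = 2.0       # target speed in m/s

# One second of |h(f, t)|^2: a sinusoid at v / lambda, per Eq. (2)
t = np.arange(fs) / fs
power = 1.0 + 0.5 * np.cos(2 * np.pi * (v_true / wavelength) * t + 0.7)

spectrum = np.abs(np.fft.rfft(power - power.mean()))
freqs = np.fft.rfftfreq(len(power), d=1 / fs)
v_est = freqs[np.argmax(spectrum)] * wavelength  # invert f = v / lambda
```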
PCA-Based Denoising. Based on the Nyquist theorem, the sampling frequency should be at least twice the variation frequency. The variation frequency of the CSIs is the number of wavelengths the target moves over per second, as shown in Eq. 2. Thus, 150 Hz is the upper bound for human speeds below 8 m/s. To have a good de-noising effect, we choose 1250 p/s as the rate, as suggested in [20].
The noise in the CSIs mainly comes from internal WiFi state transitions, which induce similar effects across all subcarriers. PCA-based analysis extracts the correlation across subcarriers and is thus effective in removing the CSI noise, as shown in Fig. 13. Different from previous works, the noise typically exists in the fourth component and above, which may be due to the Atheros WiFi cards, thus motivating us to keep the first three principal components.³
Motion Profile Construction. To extract the motion profile, we use the discrete
wavelet transform (DWT) to convert |h(f, t)|^2 to the frequency domain. For
each CSI segment of 240 ms, WiSen obtains the mean power at each level
after decomposing into 10 levels⁴, which have exponentially decreasing frequency
ranges⁵. Compared with the short-time Fourier transform (STFT), the DWT
has the advantage of obtaining high-frequency components with high time
resolution and low-frequency components with high frequency resolution. We move
the segment window with a step of 80 ms to smooth the values. Thus, if the
activity lasts for 2 s, the motion profile is a collection of 25 10-dimensional
vectors. Figure 13 (right) shows the heatmap of the DWT power distribution
over time.
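A minimal numpy-only sketch of the segmentation and per-level power computation follows. It substitutes a hand-rolled Haar DWT for the paper's unspecified wavelet, and the default of 8 levels (rather than 10) is an assumption driven by the 300-sample window; the window, step, and sampling rate follow the text.

```python
import numpy as np

def haar_level_powers(segment, levels):
    """Mean power of each Haar-DWT detail level for one CSI segment
    (level 1 = the highest frequency band, halving at each level)."""
    approx = np.asarray(segment, dtype=float)
    powers = []
    for _ in range(levels):
        n = (len(approx) // 2) * 2
        even, odd = approx[0:n:2], approx[1:n:2]
        powers.append(np.mean(((even - odd) / np.sqrt(2.0)) ** 2))
        approx = (even + odd) / np.sqrt(2.0)    # low-pass for the next level
    return powers

def motion_profile(power_series, fs=1250, win_s=0.240, step_s=0.080, levels=8):
    """Slide a 240 ms window with an 80 ms step over |h(f, t)|^2 and stack
    the per-level powers into a (num_windows x levels) profile."""
    win, step = int(fs * win_s), int(fs * step_s)
    rows = [haar_level_powers(power_series[s:s + win], levels)
            for s in range(0, len(power_series) - win + 1, step)]
    return np.array(rows)
```

As in the text, a faster-varying signal concentrates its power in the lower-index (higher-frequency) levels, and a slowly varying one in the deeper levels.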
Motion-Profile Analysis. Matching the DWT distribution over time as in [19,20]
is infeasible for loosely-defined activities. WiSen instead uses a weighted sum as
the motion intensity, denoted as I, with the DWT level as the weight, because
higher frequencies are induced by faster speeds (Eq. 2). Figure 14 shows the
values of I for activities 'h'–'j' (the activity codes are defined in Table 1).

³ Since the information mainly exists in the first several components of PCA, WiSen
does not analyze the later components.
⁴ It is the maximum value due to the boundary effect.
⁵ For example, level 1 covers the range 150–300 Hz while level 2 covers 75–150 Hz.

898 O. Zhang
In this section, we discuss the augmentation strategy with the motion profile. WiSen
proposes priority-based decision (PBD) for augmenting the recognition from dual-
statistics analysis. PBD is based on the observation that dual-statistics analysis
demonstrates much higher accuracy than motion-profile analysis. Thus, PBD puts
a higher priority on the results from dual-statistics analysis. Specifically, we sum
up the probabilities of co-located activities to get the probability of the
corresponding location. If multiple candidates remain after the dual-statistics
analysis is applied, we then apply the results from motion-profile analysis.
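The decision rule above can be sketched as follows. The confidence threshold and the exact tie-breaking rule are assumptions, since the paper only states that motion-profile results are applied when multiple candidates remain after dual-statistics analysis.

```python
def priority_based_decision(dual_probs, motion_probs, co_located, threshold=0.5):
    """Toy sketch of priority-based decision (PBD).

    dual_probs / motion_probs: activity -> probability from each analysis.
    co_located: location -> list of activities performed at that location.
    The threshold for a "clear" dual-statistics winner is an assumption."""
    # Location probability = sum of the dual-statistics probabilities
    # of the activities co-located there.
    loc_probs = {loc: sum(dual_probs[a] for a in acts)
                 for loc, acts in co_located.items()}
    candidates = co_located[max(loc_probs, key=loc_probs.get)]
    # Prefer a clear dual-statistics winner within the chosen location ...
    top = max(candidates, key=lambda a: dual_probs[a])
    if dual_probs[top] >= threshold:
        return top
    # ... otherwise fall back to the motion-profile scores to disambiguate.
    return max(candidates, key=lambda a: motion_probs[a])
```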
3 Methodology
3.1 Testbed
Fig. 15. Testbed. The left is an office room and the right is a two-bedroom apartment.
4 Evaluation
We analyze the results on the Tx1-Rx1 link and six activities (i.e., 'h'–'m')
which have strong multipath links between Tx1 and Rx1. WiSen has an average
accuracy of 96%. In comparison, the 'E-eyes' approach has an average accuracy
of 60.3% (Fig. 16) while that of 'FNN' is 36.7% (Fig. 17). From the confusion
matrix (Fig. 16), we can see that the performance of the 'E-eyes' approach is
mostly degraded by co-located activities, which demonstrates that the dual-statistics
approach of WiSen is more effective. It also demonstrates that our BiRNN model is
effective in handling the dual-statistics analysis task. Moreover, the improvement
over the FNN model demonstrates that CSI statistics during the activity are
more effective than CSI at a single timestamp.
The dual-statistics analysis also performs well in distinguishing co-located
activities, with accuracy larger than 93% across 'h', 'i', and 'j'. This observation
guides us to design the augmentation strategy based on the priority between
multipath and motion profile, which is elaborated in Sect. 2.3.
Office Environment. The office area (Fig. 15) provides a line-of-sight (LoS)
scenario with desks and chairs. The experiment in this section is to evaluate
the effectiveness of the proposed approach in a typical office environment. Since
the results are limited to four types of activities, we do not use them for
in-depth analysis of each profile (Sect. 4.1 and Sect. 4.2).
Figure 19 shows the confusion matrix. We use cross-validation to obtain the
recognition accuracy. The results show that WiSen can reliably detect and rec-
ognize all four activities in an open area with LoS links.
Apartment Environment. To cover the whole area (Fig. 15), we set up two
WiFi links with one transmitter and two receivers.
Figure 20 shows the performance of the individual dual-statistics analysis and
motion-profile analysis, as well as their integration with the augmentation strategy.
The results show that the average accuracy of the dual-statistics analysis is 92.8%
while that of the motion-profile analysis is 23.08%. Meanwhile, the integration of
both profiles enhances the accuracy to 98%.⁶ This demonstrates that the
augmentation strategy boosts the performance by combining the strengths of both profiles.
5 Related Works
Non-WiFi Based Approaches. Non-WiFi based approaches have their limitations
regarding the application scenario. For example, vision approaches [2]
⁶ It is not obvious in the figure due to the scale.
Fig. 20. Recognition accuracy w.r.t multipath (dual-statistics), motion profile analysis
and WiSen.
need good lighting conditions and are limited to LoS areas. The concern over user
privacy is also more serious than with WiFi signals due to the abundant information
captured by cameras. RF-based approaches such as radar [1] and other customized
signals [25] suffer extra infrastructure costs. Approaches that attach sensors to
the body to collect data [8,24] increase the deployment cost and hurt the user
experience.
WiFi-Based Approaches. The ubiquity of WiFi devices attracts researchers
due to the convenient and low-cost deployment. In this literature, one group
of approaches utilizes the CSI statistics during the activity. E-eyes [22] collects
CSI traces from commodity devices as a database of activity profiles; it constructs
a distribution histogram for each subcarrier as the profile and compares the
detected profile with the reference profile using the EMD algorithm. In contrast,
another group of methods [12,20] utilizes the variation in CSI traces over time.
For instance, WiSee [12] infers gestures by looking into the Doppler shift, and
CARM [20] builds a CSI-activity model by constructing a mapping from CSI to
speed. Unlike subcarrier-level distribution analysis [22], WiSen exploits the
diversity across subcarriers. Further, we propose a priority-based decision
strategy (PBD) to boost the performance.
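For reference, the EMD comparison used by E-eyes reduces, for 1-D histograms over the same bins, to the summed absolute difference of cumulative distributions (cf. Levina and Bickel [11]). A minimal sketch, with an illustrative bin layout:

```python
def emd_1d(hist_p, hist_q):
    """Earth mover's distance between two 1-D histograms over shared bins.
    For normalized 1-D histograms, EMD equals the L1 distance between
    their cumulative distribution functions."""
    assert len(hist_p) == len(hist_q)
    total_p, total_q = sum(hist_p), sum(hist_q)
    cdf_diff, emd = 0.0, 0.0
    for p, q in zip(hist_p, hist_q):
        cdf_diff += p / total_p - q / total_q   # running CDF difference
        emd += abs(cdf_diff)                    # mass carried past this bin
    return emd
```

Moving all mass two bins to the right, for example, yields a distance of 2.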
References
1. Google Project Soli. https://www.youtube.com/watch?v=0qnizfsspc0
2. Microsoft Xbox Kinect. http://www.xbox.com
3. Philips lifeline. http://www.lifelinesys.com/content/
4. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for
statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
5. Donoho, D.L., et al.: High-dimensional data analysis: the curses and blessings of
dimensionality. AMS Math Challenges Lecture 1(2000), 32 (2000)
6. Giraud, C.: Introduction to High-Dimensional Statistics. Chapman and Hall/CRC
(2014)
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)
8. Jeong, S., Kim, T., Eskicioglu, R.: Human activity recognition using motion sen-
sors. In: Proceedings of the 16th ACM Conference on Embedded Networked Sensor
Systems, pp. 392–393. ACM (2018)
9. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
10. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep con-
volutional neural networks. In: Advances in Neural Information Processing Sys-
tems, pp. 1097–1105 (2012)
11. Levina, E., Bickel, P.: The earth mover’s distance is the mallows distance: some
insights from statistics. In: Proceedings Eighth IEEE International Conference on
Computer Vision. ICCV 2001, vol. 2, pp. 251–256. IEEE (2001)
12. Pu, Q., Gupta, S., Gollakota, S., Patel, S.: Whole-home gesture recognition using
wireless signals. In: Proceedings of the 19th Annual International Conference on
Mobile Computing & Networking, pp. 27–38. ACM (2013)
13. Rippel, O., Adams, R.P.: High-dimensional probability estimation with deep den-
sity models. arXiv preprint arXiv:1302.5125 (2013)
14. Rubner, Y., Tomasi, C., Guibas, L.J.: A metric for distributions with applications
to image databases. In: Sixth International Conference on Computer Vision (IEEE
Cat. No. 98CH36271), pp. 59–66. IEEE (1998)
15. Sak, H., Senior, A., Beaufays, F.: Long short-term memory recurrent neural net-
work architectures for large scale acoustic modeling. In: Fifteenth Annual Confer-
ence of the International Speech Communication Association (2014)
16. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans.
Sig. Process. 45(11), 2673–2681 (1997)
17. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)
18. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
19. Venkatnarayan, R.H., Page, G., Shahzad, M.: Multi-user gesture recognition using
WiFi. In: Proceedings of the 16th Annual International Conference on Mobile Sys-
tems, Applications, and Services, pp. 401–413. ACM (2018)
20. Wang, W., Liu, A.X., Shahzad, M., Ling, K., Lu, S.: Understanding and modeling
of WiFi signal based human activity recognition. In: Proceedings of the 21st Annual
International Conference on Mobile Computing and Networking, pp. 65–76. ACM
(2015)
Dual-Statistics Analysis with Motion Augmentation 905
21. Wang, X., Gao, L., Mao, S., Pandey, S.: CSI-based fingerprinting for indoor local-
ization: a deep learning approach. IEEE Trans. Veh. Technol. 66(1), 763–776 (2017)
22. Wang, Y., Liu, J., Chen, Y., Gruteser, M., Yang, J., Liu, H.: E-eyes: device-free
location-oriented activity identification using fine-grained WiFi signatures. In: Pro-
ceedings of the 20th Annual International Conference on Mobile Computing and
Networking, pp. 617–628. ACM (2014)
23. Xie, Y., Li, Z., Li, M.: Precise power delay profiling with commodity WiFi. In:
Proceedings of the 21st Annual International Conference on Mobile Computing
and Networking, MobiCom 2015, New York, NY, USA, pp. 53–64. ACM (2015)
24. Yatani, K., Truong, K.N.: BodyScope: a wearable acoustic sensor for activity recog-
nition. In: Proceedings of the 2012 ACM Conference on Ubiquitous Computing,
pp. 341–350. ACM (2012)
25. Zhang, Y., et al.: Vibration-based occupant activity level monitoring system. In:
Proceedings of the 16th ACM Conference on Embedded Networked Sensor Sys-
tems, pp. 349–350. ACM (2018)
26. Zhou, R., Chen, J., Lu, X., Wu, J.: CSI fingerprinting with SVM regression to
achieve device-free passive localization. In: 2017 IEEE 18th International Sympo-
sium on A World of Wireless, Mobile and Multimedia Networks (WoWMoM), pp.
1–9. IEEE (2017)
Cascading Failure Risk Analysis of Electrical
Power Grid
Abstract. This paper studies the severity of cascading failure processes of electri-
cal power grids by statistically analyzing the number of line outages per cascade,
the cascade duration, and the load shedding amount of a cascade, based on the
historical utility data of the BPA system and the simulation data of two synthetic
test cases. Both uniform and non-uniform probability distribution functions have
been considered for the initial line trips in the cascading failure simulation in
order to determine which function better approximates the cascading failure risks
of the real-world grid. The obtained simulation data and statistical analysis results
from the two 500-bus synthetic test cases are then compared with those from the
historical utility data.
1 Introduction
The modern electrical power system is always evolving to meet the growing electricity
demand. Any disruption in power delivery has a severe effect on our daily life. A complex
interconnected infrastructure like the power grid is prone to cascading processes where
failure in one part of the system can affect other parts of the grid. These cascading
processes may be initiated by a single line trip and result in a large number of line
outages and widespread blackouts. The best way to analyze these events is to study the past events
documented by various utility and power distribution companies. The existing simulation
models can also be validated by using the statistical parameters of the past events as a
benchmark. In [1], the authors took a complex system approach to analyze the blackout
risk of power transmission systems using the North American Electrical Reliability
Council (NERC) data. In [2], the authors evaluated the statistics of cascading line outages
spreading using utility data. In [3], the authors studied different algorithms for cascading
failure analysis in the power grid. In [4, 5], the authors incorporated a cascading failure
simulation model with ac power flow to show power systems’ vulnerability to cascading
failure with rising renewables integration. In [6], the authors estimated the distribution
of cascaded outages using both historical data and a simulation model.
The severity of a cascading failure process in the power grid can be measured by
the number of lines tripped during the event, the duration of the whole process, and
the amount of load shedding needed to restore the system balance. In our study, these
three parameters of a cascading failure process have been statistically analyzed using
the historical utility data of the BPA system and two synthetic 500-bus test cases. In a
cascading failure simulation model, it is critical to define an appropriate mechanism for
the initial line trip which causes the cascading failure (CF) process. In our previous work
[6], an uniform probability distribution function was adopted for the cascading failure
study. In this study, we will consider both uniform and non-uniform initial line tripping
probability and compare them with the results of the historical utility data, in order to
determine which function may give better approximates of the grid CF risks.
The CF simulation model developed in [8] is used for the simulations on two synthetic
power system test cases in order to analyze the risk of CF processes. At first, the optimal
power flow (OPF) algorithm is used on the forecasted load profiles to determine the
initial generation dispatches under normal conditions. Then a single branch of the
system is tripped manually to initiate the CF process. After every branch trip,
AC power flow is used to determine the overloaded branches of the system. The
Unscented Transformation (UT) method is used to determine the mean overload
time of the overloaded branches [9]. These values are then fed into the relay
mechanism to determine whether any other branches get tripped. After every new
line trip, a modified version of the OPF algorithm is used to restore the power
balance of the system; here, the least-squares adjusted OPF algorithm is used to
mimic the most viable path of the CF process. The CF process eventually stabilizes
once there are no more overloaded lines in the system. After every CF simulation,
we determine the total number of line trips during the cascade, the duration of
the cascade, and the load shedding amount (if any).
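The simulation loop above can be caricatured with a toy flow-redistribution model. This sketch replaces the AC power flow, UT overload timing, and OPF redispatch with a fixed redistribution fraction, so it illustrates only the trip-check-redispatch structure, not the authors' model; the redistribution fraction is an assumption.

```python
def simulate_cascade(flows, capacities, initial_trip, redistribution=0.5):
    """Toy cascading-failure loop: trip a line, shift a fraction of its flow
    onto the surviving lines, and trip any line pushed over its capacity.
    Returns the set of tripped line indices (the cascade size)."""
    flows = list(flows)
    alive = [True] * len(flows)
    queue = [initial_trip]
    tripped = set()
    while queue:
        line = queue.pop()
        if not alive[line]:
            continue                      # already tripped earlier in the cascade
        alive[line] = False
        tripped.add(line)
        survivors = [i for i, up in enumerate(alive) if up]
        if not survivors:
            break
        share = flows[line] * redistribution / len(survivors)
        for i in survivors:
            flows[i] += share
            if flows[i] > capacities[i]:  # relay trips the overloaded line
                queue.append(i)
    return tripped
```

As in the statistics reported later, a lightly loaded system yields a single-outage cascade, while a heavily loaded one lets the initial trip propagate.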
Cascading Failure Risk Analysis of Electrical Power Grid 909
Uniform Initial-Line-Trip Distribution: In this case, it is assumed that every line
has the same probability of being tripped to initiate a CF process. In that
scenario, a system with n lines has n different cascades, and these cascades are
equally likely.
Branch Flow: First, we consider the branch flow as the defining parameter to assign
the probability of initial line trips. The loading level of each branch of the system
is determined as the ratio of the branch power flow under normal operating
conditions to its maximum branch capacity. We then divided all the branches into ten
categories and assigned them different weights for the CF estimation, as shown in Table 1(a).
Table 1(a). Non-uniform probability definition of initial line trips according to the loading level
The higher the loading levels, the higher the probability of a branch initially getting
tripped to start a cascading process.
Shortest Path: In this case, we use the network topology of a power grid to assign the
non-uniform probability of initial line trips. We first define all the generation buses as
the boundary nodes of a system and calculate the shortest path from every bus to the
910 S. Das and Z. Wang
boundary. The distances are measured in terms of the hops on the shortest path. As every
branch is connected with two buses, we have taken the minimum of the shortest paths
associated with the two buses connected to the branch. Hence, for every branch, we
have a corresponding shortest path to the boundary. We use this distance to
assign a specific weight to every initial line trip. The longer the distance, the
higher the probability of a line initially getting tripped, as shown in Table 1(b).
Connectivity: In this case, we have considered the connectivity of every bus as a defining
parameter. In the power grid, every bus is connected to several other buses. For every
bus, we have determined this connectivity number. As every branch connects two
buses, for every branch we have considered the average connectivity number of the
two corresponding buses. This connectivity number is used to assign the non-uniform
probability of initial line trips. The higher the connectivity, the higher the probability of
a line getting tripped initially and starting a CF process, as shown in Table 1(c).
Table 1(b). Non-uniform probability definition of initial line trips according to the shortest path
Table 1(c). Non-uniform probability definition of initial line trips according to the connectivity
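The three non-uniform weighting schemes share one mechanism: assign each line a weight from some attribute, then normalize the weights into a trip distribution. A sketch for the loading-level case follows; the bin-to-weight mapping here is illustrative, since Table 1(a) defines the actual one.

```python
import numpy as np

def loading_level_weights(flows, capacities, bins=10):
    """Bin each branch by loading level (flow / capacity) into `bins`
    categories and use the category index as its weight, so heavily loaded
    branches are more likely to start a cascade (cf. Table 1(a); the exact
    bin-to-weight mapping here is an assumption)."""
    loading = np.asarray(flows, dtype=float) / np.asarray(capacities, dtype=float)
    category = np.minimum((loading * bins).astype(int), bins - 1)
    return category + 1.0                  # weights 1..bins

def initial_trip_probabilities(weights):
    """Normalize per-line weights into a probability distribution over
    initial line trips (equal weights reproduce the uniform case)."""
    w = np.asarray(weights, dtype=float)
    return w / w.sum()
```

The shortest-path and connectivity schemes plug into the same normalization, with distance-to-boundary or average bus degree in place of the loading level.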
4 Result
In this study, two different synthetic 500-bus test cases have been used for the CF
simulation and statistical risk analysis.
In Fig. 2, the probability distributions of the total number of outages in a cascade are
shown. Figure 2(a) considers the uniform probability of initial line trips and Fig. 2
(b), (c), and (d) consider the non-uniform probability of initial line trips. In Fig. 2(b),
the power flow in a branch is used as a defining parameter to assign different weights on
initial line trips. Similarly, in Fig. 2(c) and Fig. 2(d), the shortest path to the boundary and
connectivity of the systems are used as defining parameters to assign different weights
on the initial line trips respectively. It is obvious from the figure that most of the cascades
have only one outage and did not spread any more. The probability of a cascade having
a large number of outages is very low and this probability decreases with the increasing
number of outages. We have fitted our simulated outage number data with exponential
distribution. In Table 3, the mean number of outages per cascade is shown. Initial
line trip probability according to branch flow gave the best estimation compared
with the BPA data, while probability according to the shortest path underestimated
the mean outage number and probability according to connectivity overestimated it
the most.
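The exponential fit mentioned above has a closed-form maximum-likelihood solution: the rate is the reciprocal of the sample mean. A minimal sketch:

```python
import math

def fit_exponential(samples):
    """ML exponential fit to per-cascade outage counts: the MLE of the rate
    is 1 / sample mean."""
    mean = sum(samples) / len(samples)
    return 1.0 / mean, mean

def tail_probability(rate, x):
    """P(X > x) = exp(-rate * x): the (fast-decaying) probability of a
    cascade larger than x outages under the fitted distribution."""
    return math.exp(-rate * x)
```

This matches the observed pattern: most cascades have a single outage, and the probability of large cascades decays rapidly with the outage count.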
4.2 ASGWECC_500_1
We have used another 500-bus test case for our CF simulation model. This test case
has been developed using AutoSynGrid, a MATLAB-based toolkit for the automatic
generation of synthetic power grids [11]. For our generated test case, the WECC
(Western Electricity Coordinating Council) system has been utilized as a reference
for generation and load settings, as this system is closely related to the BPA control
area load. We have named this test case ASGWECC_500_1. This test case contains
103 committed generators, 875 transmission lines, 109 loads, a total online generation
capacity of 34,277.8 MW, and a load of 29,313.8 MW. Like the previous test case, as the
system has 875 transmission lines, we will have 875 different CF processes by initially
tripping these 875 lines. For each CF simulation process, we have calculated the number
of total lines tripped during the cascade, the duration of the cascade, and the amount
of load shedding in percentage. In this test case also, we have considered both uniform
and non-uniform probability of initial line trips. Just like the previous test case, we have
considered three different loading levels for our CF simulation, and every loading level
has a different weight according to the load duration curve of the BPA control area load.
These three different loading levels have been combined according to their weight to
mimic the real-world scenario where loading levels are different during different times
of the day.
In Fig. 5, the probability distributions of the total number of outages in a cascade
are shown. Figure 5(a) considers the uniform probability of initial line trips and
Fig. 5(b), (c), and (d) consider the non-uniform probability of initial line trips. In
Fig. 5(b), the power flow in a branch is used as a defining parameter to assign different
weights to initial line trips. Similarly, in Fig. 5(c) and Fig. 5(d), the shortest path to
the boundary and connectivity of the systems are used as defining parameters to assign
different weights to the initial line trips, respectively. It is evident from the figure that most
of the cascades have only one outage and did not spread any further. The probability of a
cascade having a large number of outages is very low and this probability decreases with
the increasing number of outages. We have fitted our simulated outage number data with
exponential distribution. In Table 6, the mean number of outages per cascade is
shown, which is higher than the numbers obtained from the BPA data and the previous
test case. Initial line trip probability according to the shortest path to the boundary
gave the best estimation against the BPA data, while probability according to the
branch flow overestimated the number the most.
(c) Initial Line Trip Probability According to the Shortest Path to the Boundary
Fig. 6. Probability distribution of the cascade duration for the ASGWECC_500_1 system
Table 9. Overall comparison of cascading failure risk analysis for the BPA system and the
synthetic test cases
For the load shedding, the BPA system lacks historical data and we do not have
a benchmark for performance comparison. We can only say that from Table 10, the
load shedding percentage is higher for the ACTIVSg500 test case for both uniform and
non-uniform probability distribution of initial line trips. It seems that the non-uniform
probability function of initial line trips gives the most consistent results for both test
cases. In the future, more simulations will be done using more test cases, so that we
will be able to conduct a self-comparison among these test cases and figure out which
probability distribution function is better.
Table 10. Comparison of mean load shedding percentage between two synthetic test cases
5 Conclusion
The severity of a cascading failure process in the power grid can be measured by the
number of lines that got tripped during the event, the duration of the whole CF process,
and the amount of load needed to be shed to restore the balance of the system. In this
study, we have analyzed these different parameters of a cascading failure process using
historical utility data and two synthetic test cases. For a cascading failure simulation
model, it is important to recognize the mechanism of initial line trips which initiate the
CF process. In this study, we have considered both uniform and non-uniform probability
distributions of initial line trips and compared our results with the results from historical
data. We observed that the non-uniform distribution of initial line trips gave better
estimations of the mean outage number and cascade duration. For the ACTIVSg500 test
case, the probability based on branch flow worked better for outage number estimation
and the probability based on connectivity worked better for estimating cascade duration.
On the other hand, for the ASGWECC_500 test case, the probability based on the shortest
path worked better for outage number estimation and the probability based on branch
flow worked better for estimating cascade duration. For the load shedding, the BPA system lacks information about
load shedding amount and we do not have a benchmark for performance comparison.
We can only say that the load shedding percentage is higher for the ACTIVSg500 test
case than the ASGWECC_500 test case for both uniform and non-uniform probability
distribution of initial line trips. In future work, more simulations will be done using more
test cases, so that we may be able to conduct a self-comparison among these test cases
and figure out which probability distribution function is better in general or if there are
specific cases where we need to choose a specific distribution.
References
1. Dobson, I., Newman, D.E., Carreras, B.A., Lynch, V.E.: An initial complex systems analysis
of the risks of blackouts in power transmission systems. Power Systems and Communications
Infrastructures for the Future, pp. 1–7, September 2002
2. Dobson, I., Carreras, B.A., Newman, D.E., Reynolds-Barredo, J.M.: Obtaining statistics of
cascading line outages spreading in an electric transmission network from standard utility
data. IEEE Trans. Power Syst. 31(6), 4831–4841 (2016). https://doi.org/10.1109/TPWRS.
2016.2523884
3. Soltan, S., Mazauric, D., Zussman, G.: Cascading failures in power grids-analysis and
algorithms, vol. 14. https://doi.org/10.1145/2602044.2602066
4. Athari, M.H., Wang, Z.: Stochastic cascading failure model with uncertain generation using
unscented transform. IEEE Trans. Sustain. Energy 11(2), 1067–1077 (2020). https://doi.org/
10.1109/TSTE.2019.2917842
5. Das, S., Wang, Z.: Power grid vulnerability analysis with rising renewables infiltration. In:
Proceedings of IMCIC 2021 - 12th International Multi-Conference Complexity, Informatics
Cybernetics, vol. 2, pp. 157–162, July 2021
6. Das, S., Wang, Z.: Estimating distribution of cascaded outages using observed utility data and
simulation modeling. In: 2021 North American Power Symposium, NAPS 2021, pp. 5–10
(2021). https://doi.org/10.1109/NAPS52732.2021.9654745
7. BPA.gov - Bonneville Power Administration - Bonneville Power Administration. https://
www.bpa.gov/. Accessed 24 Apr 2022
8. Das, S., Wang, Z.: Power grid vulnerability analysis with rising renewables infiltration. J.
Syst. Cybern. Inform. 19(3), 23–32 (2021)
9. Julier, S.J., Uhlmann, J.K.: Unscented filtering and nonlinear estimation (2004). https://doi.
org/10.1109/JPROC.2003.823141
10. Birchfield, A.B., Xu, T., Gegner, K.M., Shetye, K.S., Overbye, T.J.: Grid structural character-
istics as validation criteria for synthetic networks. IEEE Trans. Power Syst. 32(4), 3258–3265
(2017). https://doi.org/10.1109/TPWRS.2016.2616385
11. Sadeghian, H., Wang, Z.: AutoSynGrid: a MATLAB-based toolkit for automatic generation
of synthetic power grids. Int. J. Electr. Power Energy Syst. 118 (2020). https://doi.org/10.
1016/j.ijepes.2019.105757
A New Metaverse Mobile Application
for Boosting Decision Making of Buying
Furniture
1 Introduction
Technology continues to become more important in everyday life, and many
advancements provide people with greater convenience. People also want to be
creative and see what they imagine become reality. Companies are thus applying
augmented reality (AR), an emerging area that combines the virtual world and the
physical world together, as explained by Nunes et al. [1], to promote their products
in 3D and let customers see the product in every dimension. This desire has been
the idea behind the creation of the metaverse, with a focus on augmented reality.
In this paper, our contribution is to apply augmented reality to the decision-making
process of buying furniture. This technology is attracting growing interest, and
many companies are attempting to use it in new applications to influence people
to buy their products. This paper is divided into five main parts. Section 1 gives a
general introduction to augmented reality and the registration method. Section 2 then
discusses the proposed system and the research methodology. Next, Sect. 3 explains
the experiment and shows the results of the proposed system. Section 4 explores the
evaluation, including a discussion of the results. Finally, Sect. 5 concludes this
paper and outlines future directions.
can help customers choose the right product for their home, as explained by Saraswati
[6].
The target group is Generation Y, because people in this age group are familiar with
technology and have a greater chance of buying furniture online. Huntley [10] suggested
that technology devices are essential as
journalism tools. It also can be a good sign of generational identity. These devices include
smartphones, game consoles, tablet computers, and personal computers. Generation Y
has grown up in the digital world, as discussed by Wolburg and Pokrywczynski [11]. The
application was designed using Android Studio and is called "AR Smart Furniture".
Android Studio runs on both Windows and OS X for developing applications for the
Android platform. The first step was to develop a mind map to see the overall
application plan.
The mind map made it possible to see the links between each page and their action.
The application framework can be seen in Fig. 1. The next step was to design each
program page using Adobe Photoshop. After designing the application, each page was
created using Android Studio; applications developed in Android Studio can use many
languages, and the language used in AR Smart Furniture is Java. The application has two
features. The first allows users to interact with posters using PixLive Maker to create
the AR Mode. Moreover, consumers can make a purchase more easily when they see
furniture they like. The AR model was developed using PixLive Maker, a very useful
program for people who want to create augmented reality applications as it is easy to
learn and has useful functions. Moreover, developers do not have to code a program but
can instead create only visual art and add it automatically to the program. In addition,
this program is free of charge. A developer can just access the website and begin their
work, whereas many other programs require payment or are free only for a limited period
of time. Furthermore, developers can also carry out coding in Android Studio, but they
must first learn how to code from the basic level, which can take a long time. Therefore,
the PixLive Maker is the best tool for developers from the beginner to professional levels.
The application can also be used in manual mode: if the user knows the size of a space or room, they can enter it, and the program will then select the furniture that fits that size. This function helps customers select a product more easily. They can also purchase the product from stores where it is sold; the application automatically links to the page where the purchase can be made, which makes it even more convenient for customers.
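The size-filtering step of the manual mode could be sketched as follows. This is a hypothetical illustration in plain Java, not code from AR Smart Furniture; the class and method names (Furniture, fitRoom) and the use of width and depth in centimetres are our assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the manual-mode size filter: given the room size the user
// entered, keep only the catalog items whose footprint fits the room.
public class FurnitureFilter {
    static class Furniture {
        final String name;
        final double widthCm, depthCm;
        Furniture(String name, double widthCm, double depthCm) {
            this.name = name;
            this.widthCm = widthCm;
            this.depthCm = depthCm;
        }
    }

    // Return only the pieces that fit inside the room the user entered.
    static List<Furniture> fitRoom(List<Furniture> catalog,
                                   double roomWidthCm, double roomDepthCm) {
        List<Furniture> fits = new ArrayList<>();
        for (Furniture f : catalog) {
            if (f.widthCm <= roomWidthCm && f.depthCm <= roomDepthCm) {
                fits.add(f);
            }
        }
        return fits;
    }

    public static void main(String[] args) {
        List<Furniture> catalog = new ArrayList<>();
        catalog.add(new Furniture("Two-seat sofa", 180, 90));
        catalog.add(new Furniture("Corner sofa", 320, 250));
        // A 300 cm x 200 cm room: the corner sofa is filtered out.
        List<Furniture> fits = fitRoom(catalog, 300, 200);
        System.out.println(fits.size() + " item(s) fit: " + fits.get(0).name);
    }
}
```

In the real application this filter would run against the furniture database, and each result would link to the corresponding purchase page.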
The augmented reality in this application was created with PixLive Maker. The application has the structure indicated in Fig. 1, and Fig. 2 shows an example of the AR Furniture application: when the augmented reality furniture poster is scanned, the AR content that has been created pops up. The second mode is the manual mode, which allows the user to choose the furniture they want; the user is then taken to a page corresponding to the selected furniture. The developer tried to design this page to be as simple as possible to make the application user-friendly. Fig. 2 also shows the real design of the poster when the user picks out the furniture. Note that the furniture photo icons used in this research are from www.ikea.com/, www.konceptfurniture.com/, and www.google.com for research and academic use. The size of the chair should match the room, so the user can get a feel for how the furniture looks when it is placed in the room space.
928 C. Kerdvibulvech and T. P. N. Ayuttaya
Fig. 2. Our application interface. Note that the furniture photo icons are from www.ikea.com/,
www.konceptfurniture.com/, and www.google.com for academic use.
3 Experimental Results
After verifying that the application worked properly, it was tested with a sample of users. To be sure the application could solve the research problem and meet the objectives, users were asked to pick different pieces of furniture to see how they looked and whether the system worked smoothly. As shown in Fig. 3, the sample users tried different sofas, and the program worked properly. The users could pick what they wanted from the picture; after they picked a sofa, it appeared in the room. The users liked the poster because they could interact and play with it, and after selecting a sofa they could go directly to the store. Once the sample users had tested the application, they were given an online questionnaire to determine how much they liked it; this also helped us learn what could be improved in the future. The participants could choose different sofas from the poster via their smartphones, and the selected sofa would then appear in the room.
Fig. 3. The different types of sofa (left) and selecting the furniture (right)
A New Metaverse Mobile Application 929
As shown in Fig. 3, the sofa that the user selected would appear in the room, and the user could then see how it looked in the space. The price would also be displayed, along with a link to order the furniture directly from the store. The users showed more interest in the poster because they could interact with it and then go directly to the purchasing process on the website, which can be a strong motivating factor: they can act immediately to buy the furniture.
Fig. 4. Images appearing in the space room (left) and purchasing process (right)
To confirm whether this application is useful, we set up an online questionnaire to assess its effectiveness. As shown in Table 1, the questions were divided into three parts: the first covers demographic factors, the second the respondents' experience of buying furniture online, and the last their experience of using augmented reality in a mobile application. We also included an open-ended question so that, beyond rating the application, respondents could tell us what they would like it to include or what they dislike about it; these data can be collected and used for future development. The objectives of this study are 1) to study augmented reality and how this technology can support decision-making in purchasing furniture and 2) to study how to create a mobile application that serves and provides more channels for consumers who want to buy products online, especially furniture and home decorations. The questionnaire contained both closed-format and open-format questions. The first part asked about demographic factors, including whether the respondents had been using a smartphone. The second part asked whether the respondents had been buying furniture online. The last part aimed to learn the respondents' satisfaction levels after using the application. One aim of the study was to provide consumers with channels through an application that lets them compare products of different brands. From the research,
it was found that most respondents (52%) had experience buying furniture online, but for the most part only once or twice a year. Moreover, it was found that customers do not buy furniture online at prices of 10,000 baht or higher; the reason is that they want to be sure the product is good and satisfies their needs before paying such a sum. In this research, the respondents used the application, which was developed to incorporate AR technology to encourage people to buy furniture. After using it, they tended to like it, and most felt they would like it more if it offered a greater variety of products. They also said they would be willing to download the application if it were launched. Even though the feedback was good and people liked the application, the respondents still did not strongly agree with many questions in the questionnaire, for example the question about the application being easy to use: they agreed that it is easy to use, but the application should be further developed to make it more user-friendly. Moreover, they felt the application could save them time and money, because it focuses on e-commerce and they therefore do not have to go to a store to make their purchase.
Table 1. Questions and results from the users, with the demographic factor, the experience of
buying furniture online, and the experience of using augmented reality on the mobile application
of respondents
Questions | 5 | 4 | 3 | 2 | 1
Application AR Furniture is easy to use | 10 | 17 | 3 | 0 | 0
Application AR Furniture is not complicated | 8 | 18 | 4 | 0 | 0
Helps the customers to decide about buying furniture | 15 | 12 | 3 | 0 | 0
An attractive and interesting application | 11 | 17 | 2 | 0 | 0
Helps the customers to save money over going to the store | 21 | 9 | 0 | 0 | 0
The system of the AR Furniture application is convenient and easy | 12 | 16 | 2 | 0 | 0
The system of the AR Furniture application is accurate | 8 | 19 | 3 | 0 | 0
This technology is up-to-date and convenient to use | 18 | 12 | 0 | 0 | 0
A user is interested in buying more furniture after using this application | 15 | 14 | 1 | 0 | 0
A user is willing to download this application | 16 | 14 | 0 | 0 | 0
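The percentages reported in the discussion of the survey can be reproduced from the response counts in Table 1 (each row sums to 30 respondents). As a sketch, in plain Java and not part of the original application, the share of respondents giving the top score of 5 ("strongly agree") for two of the rows is:

```java
// Reproduce the reported percentages from the Table 1 response counts.
// Each array lists how many of the 30 respondents gave scores 5, 4, 3, 2, 1.
public class SurveyStats {
    // Percentage of respondents giving the top score (5), rounded to 0.1.
    static double topBoxPercent(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        return Math.round(1000.0 * counts[0] / total) / 10.0;
    }

    public static void main(String[] args) {
        int[] willingToDownload = {16, 14, 0, 0, 0}; // row from Table 1
        int[] savesMoney = {21, 9, 0, 0, 0};         // row from Table 1
        System.out.println(topBoxPercent(willingToDownload) + "%"); // 53.3%
        System.out.println(topBoxPercent(savesMoney) + "%");        // 70.0%
    }
}
```

These values match the 53.3% (willing to download) and 70% (saves money over going to the store) figures quoted from the survey.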
The results do show that respondents felt the application must be improved. Its accuracy must be further developed because the furniture database is insufficient and needs more products. Currently, a customer using the manual mode wants to find furniture that suits and matches their room; however, the furniture in the database is rather limited, so it is difficult to find what they are looking for. The augmented reality mode likewise does not have enough products. With more products and different posters, the application would be considered more accurate. After using the application, the respondents were willing to buy more furniture and to download the application.
Half of the respondents wanted to buy more furniture, and 53.3% were willing to download the application. Moreover, 70% said the application can help customers save money compared with going to the store, 60% thought it is convenient to use and up-to-date, and 50% answered that it can help them decide about buying furniture.
Research similar to our work concerns methods for e-commerce using the metaverse, such as Lu and Smith's work [12], which studies a type of e-commerce system in which customers can visually and interactively bring goods into the real scene. Their research applies augmented reality technology to a new way of e-commerce so that customers can experience the new technology, and it uses furniture e-commerce on a website to show that the approach can link with any e-commerce platform. Their system generates the object in a 3D layer, with the goal of projecting the virtual image into real space via a computer with a camera. However, our work has a different goal: to create a 3D model of a specific sofa that interests the user in buying that sofa, so that the shopper can proceed to the purchase page; this can bring greater shopping confidence, which may increase the possibility of selling the goods. The difference therefore lies in the type of technology and the way it is used and applied. As technology develops quickly, people mostly use smartphones as a communication tool and can do almost the same things as with a laptop or personal computer. This research therefore moves from the website to a mobile application to keep the approach up-to-date. Moreover, the application adds more information about the products and shows augmented reality objects, and the user can interact with the augmented reality in the application, which can increase users' confidence about buying furniture. This research brings the idea of making the process more up-to-date and more modern.
to promote their own products. The benefit of connecting with local shops is that the application will offer a greater variety of furniture for customers to choose from, and a local shop may be easier to negotiate with than a big company. When an application has access to a large amount of data, users can enjoy using it, which in turn motivates them to purchase a product and contributes to the growing success of the application. This application can also be further beneficial for related work [13, 14] on conversational commerce.
Conflict of Interest. The authors declare that they have no conflict of interest.
References
1. Nunes, F.B., et al.: A dynamic approach for teaching algorithms: integrating immersive
environments and virtual learning environments. Comput. Appl. Eng. Educ. 25(5), 732–751
(2017)
2. Siriborvornratanakul, T.: A study of virtual reality headsets and physiological extension pos-
sibilities. In: Gervasi, O., et al. (eds.) ICCSA 2016. LNCS, vol. 9787, pp. 497–508. Springer,
Cham (2016). https://doi.org/10.1007/978-3-319-42108-7_38
3. Azuma, R.T.: The most important challenge facing augmented reality. Presence 25(3), 234–238 (2016)
4. Siriborvornratanakul, T.: Through the realities of augmented reality. In: Stephanidis, C. (ed.)
HCII 2019. LNCS, vol. 11786, pp. 253–264. Springer, Cham (2019). https://doi.org/10.1007/
978-3-030-30033-3_20
5. Kammerer, K., Pryss, R., Sommer, K., Reichert, M.: Towards context-aware process guidance
in cyber-physical systems with augmented reality. In: RESACS@RE 2018, pp. 44–51 (2018)
6. Saraswati, T.G.: Driving factors of consumer to purchase furniture online on IKEA Indonesia
website. J. Secr. Bus. Adm. 2(1), 19–28 (2018)
7. Kerdvibulvech, C., Wang, C.-C.: A new 3D augmented reality application for educational
games to help children in communication interactively. In: Gervasi, O., et al. (eds.) ICCSA
2016. LNCS, vol. 9787, pp. 465–473. Springer, Cham (2016). https://doi.org/10.1007/978-3-
319-42108-7_35
8. Al-Azzam, A.F.M.: Evaluating effect of social factors affecting consumer behavior in purchasing home furnishing products in Jordan (2014)
9. Joshi, M.S.: A study of online buying behavior among adults in Pune city 13(1) (2017)
10. Huntley, R.: The World According to Y: Inside the New Adult Generation. Allen & Unwin, Australia (2006)
11. Wolburg, J.M., Pokrywczynski, J.: A psychographic analysis of Generation Y college students.
J. Advert. Res. 41, 33–52 (2001)
12. Lu, Y., Smith, S.: Augmented reality e-commerce assistant system: trying while shopping.
In: Jacko, J.A. (ed.) HCI 2007. LNCS, vol. 4551, pp. 643–652. Springer, Heidelberg (2007).
https://doi.org/10.1007/978-3-540-73107-8_72
13. Rungvithu, T., Kerdvibulvech, C.: Conversational commerce and cryptocurrency research in
urban office employees in Thailand. Int. J. Collab. 15(3), 34–48 (2019)
14. Liew, T.W., Tan, S.-M., Tee, J., Goh, G.G.G.: The effects of designing conversational
commerce chatbots with expertise cues. In: HSI 2021, pp. 1–6 (2021)
Author Index
A
Ab Rahman, Ab Al-Hadi, 100
Abdalla, Hassan I., 307
Abdellatef, Hamdan, 100
Ahmad, Imran Shafiq, 79
Ajila, Samuel, 275
Akinrinade, Olusoji, 275
Ali, Megat Syahirul Amin Megat, 260
Almutairi, Abdullah, 823
Amazonas, José R., 850
Amer, Ali A., 307
An, Peng, 328
Ankargren, Sebastian, 548
Anthes, Christoph, 731
Ashar, Nur Dalila Khirul, 260
Ayat, Sayed Omid, 100
Ayuttaya, Thitawee Palakawong Na, 924
B
Baek, Stephen, 230
Bajracharya, Aakriti, 809
Bakirov, Akhat, 368
Bin, Chen, 526
Bitri, Aida, 669
Boufama, Boubakeur, 79
Bui, Len T., 18
C
Chiddarwar, Shital, 185
Chov, Bunhov, 563
Churchill, Akubuwe Tochukwu, 439
D
Dalvi, Mohsin, 185
Daniel, Azel, 753
Das, Amit Kumar, 145
Das, Saikat, 906
Dhaouadi, Rached, 860
Diaz-Arias, Alec, 230
Dimitrakopoulos, George, 343
Dodich, Frank, 493
Dombi, József, 719
Domnauer, Colin, 378
Du, Chunglin, 275
E
Elgarhy, Aya, 578
Elseddawy, Ahmed, 578
Encheva, Sylvia, 880
F
Fawcett, Nathanel, 317
Ficocelli, Ryan, 493
Fu, Swee Tee, 32
G
Gałuszka, Adam, 615
Garyfallou, Antonios, 343
Guliashki, Vassil, 669
H
Hao, Jia, 328
Happonen, Ari, 287
Harvey, Barron, 809
Heiden, Bernhard, 414
© The Editor(s) (if applicable) and The Author(s), under exclusive license
to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 935–937, 2023.
https://doi.org/10.1007/978-3-031-18461-1
Suh, Jeanne, 65
Suleimenov, Ibragim, 368
Sun, Ting, 656
Syed, Sameer Akhtar, 79
T
Tahir, Nooritawati Md, 260
Tee, Mark Kit Tsun, 32
Thuan, Nguyen Dinh, 426
Tonino-Heiden, Bianca, 414
Tripathi, Shailesh, 731
Tsang, Herbert H., 493
U
Ugbari, Augustine Obolor, 248
Usmani, Usman Ahmad, 287
V
Varlamis, Iraklis, 343
Vitulyova, Yelizaveta, 368
Vo, Duy K., 18
W
Wang, Yan, 696
Wang, Zhifang, 906
Wang, Zizhao, 328
Watada, Junzo, 287
Weber, Kai, 796
Wilkerson, Bryce, 121
Willcox, Gregg, 378
Wiśniewski, Tomasz, 615
X
Xiao, Hua, 328
Y
Yamakami, Tomoyuki, 776
Ye, Wenbin, 328
Yonggang, Wang, 526
Z
Zamri, Nurul Farhana Mohamad, 260
Zhang, Alexandros Shikun, 198
Zhang, Ouyang, 888