
Lecture Notes in Networks and Systems 559

Kohei Arai   Editor

Proceedings
of the Future
Technologies
Conference
(FTC) 2022,
Volume 1
Lecture Notes in Networks and Systems

Volume 559

Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences,
Warsaw, Poland

Advisory Editors
Fernando Gomide, Department of Computer Engineering and Automation—DCA,
School of Electrical and Computer Engineering—FEEC, University of Campinas—
UNICAMP, São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering,
Bogazici University, Istanbul, Turkey
Derong Liu, Department of Electrical and Computer Engineering, University
of Illinois at Chicago, Chicago, USA
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering, University of
Alberta, Alberta, Canada
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering,
KIOS Research Center for Intelligent Systems and Networks, University of Cyprus,
Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong,
Kowloon, Hong Kong
The series “Lecture Notes in Networks and Systems” publishes the latest
developments in Networks and Systems—quickly, informally and with high quality.
Original research reported in proceedings and post-proceedings represents the core
of LNNS.
Volumes published in LNNS embrace all aspects and subfields of, as well as new
challenges in, Networks and Systems.
The series contains proceedings and edited volumes in systems and networks,
spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor
Networks, Control Systems, Energy Systems, Automotive Systems, Biological
Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems,
Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems,
Robotics, Social Systems, Economic Systems and other. Of particular value to both
the contributors and the readership are the short publication timeframe and
the world-wide distribution and exposure which enable both a wide and rapid
dissemination of research output.
The series covers the theory, applications, and perspectives on the state of the art
and future developments relevant to systems and networks, decision making, control,
complex processes and related areas, as embedded in the fields of interdisciplinary
and applied sciences, engineering, computer science, physics, economics, social, and
life sciences, as well as the paradigms and methodologies behind them.
Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago.
All books published in the series are submitted for consideration in Web of Science.
For proposals from Asia please contact Aninda Bose ([email protected]).

More information about this series at https://link.springer.com/bookseries/15179


Kohei Arai
Editor

Proceedings of the Future
Technologies Conference
(FTC) 2022, Volume 1

Editor
Kohei Arai
Faculty of Science and Engineering
Saga University
Saga, Japan

ISSN 2367-3370 ISSN 2367-3389 (electronic)


Lecture Notes in Networks and Systems
ISBN 978-3-031-18460-4 ISBN 978-3-031-18461-1 (eBook)
https://doi.org/10.1007/978-3-031-18461-1
© The Editor(s) (if applicable) and The Author(s), under exclusive license
to Springer Nature Switzerland AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Editor’s Preface

We are extremely delighted and excited to present before you the seventh Future
Technologies Conference 2022 (FTC 2022), which was successfully held during
20–21 October 2022. COVID-19 necessitated that this conference be held virtually
for two years. However, as the pandemic waned and restrictions eased, we managed
to recreate the scholarly aura by holding the esteemed conference in hybrid mode,
wherein learned researchers from across the globe took the stage either through their
in-person presence or via the online mode. Around 250 participants from over 60
countries participated to make this event a huge academic success.
The conference provided a wonderful academic exchange platform to share the
latest research, developments, advances and new technologies in the fields of
computing, electronics, AI, robotics, security and communications. The conference
was successful in disseminating novel ideas, emerging trends as well as discussing
research results and achievements. We were overwhelmed to receive 511 papers out
of which a total of 177 papers were selected to be published in the final proceed-
ings. The papers were thoroughly reviewed and then finally selected for publishing.
Many people have collaborated and worked hard to produce a successful FTC
2022 conference. Thus, we would like to thank all the authors and distinguished
Keynote Speakers for their interest in this conference; the Technical Committee
members, who carried out the most difficult work by carefully evaluating the
submitted papers with professional reviewing and prompt responses; and the Session
Chairs Committee for their efforts. Finally, we would also like to express our
gratitude to the Organizing Committee, who worked very hard to ensure the high
standards and quality of keynotes, panels, presentations and discussions.
We hope that readers are able to satisfy their appetite for knowledge
in the field of AI and its useful applications across diverse fields. We also look
forward to even more enthusiastic participation in this coveted event next year.
Kind Regards,

Kohei Arai
Conference Program Chair

Contents

Min-Max Cost and Information Control in Multi-layered Neural
Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Ryotaro Kamimura and Ryozo Kitajima
Face Generation from Skull Photo Using GAN and 3D Face Models . . . 18
Duy K. Vo, Len T. Bui, and Thai H. Le
Exploring Deep Learning in Road Traffic Accident Recognition for
Roadside Sensing Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Swee Tee Fu, Bee Theng Lau, Mark Kit Tsun Tee,
and Brian Chung Shiong Loh
Alternate Approach to GAN Model for Colorization of Grayscale
Images: Deeper U-Net + GAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Seunghyun Lee
Implementing Style Transfer with Korean Artworks via VGG16:
For Introducing Shin Saimdang and Hongdo Kim's Paintings . . . . . . . 65
Jeanne Suh
Feature Extraction and Nuclei Classification in Tissue Samples
of Colorectal Cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Boubakeur Boufama, Sameer Akhtar Syed, and Imran Shafiq Ahmad
A Compact Spectral Model for Convolutional Neural Network . . . . . . . 100
Sayed Omid Ayat, Shahriyar Masud Rizvi, Hamdan Abdellatef,
Ab Al-Hadi Ab Rahman, and Shahidatul Sadiah Abdul Manan
Hybrid Context-Content Based Music Recommendation System . . . . . . 121
Victor Omowonuola, Bryce Wilkerson, and Shubhalaxmi Kher
Development of Portable Crack Evaluation System for Welding Bend
Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Shigeru Kato, Takanori Hino, Tomomichi Kagawa, and Hajime Nobuhara


CVD: An Improved Approach of Software Vulnerability Detection
for Object Oriented Programming Languages Using Deep Learning . . . 145
Shaykh Siddique, Al-Amin Islam Hridoy, Sabrina Alam Khushbu,
and Amit Kumar Das
A Survey of Reinforcement Learning Toolkits for Gaming:
Applications, Challenges and Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Charitha Sree Jayaramireddy, Sree Veera Venkata Sai Saran Naraharisetti,
Mohamad Nassar, and Mehdi Mekni
Pre-trained CNN Based SVM Classifier for Weld Joint Type
Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Satish Sonwane, Shital Chiddarwar, M. R. Rahul, and Mohsin Dalvi
A Two-Stage Federated Transfer Learning Framework in Medical
Images Classification on Limited Data: A COVID-19 Case Study . . . . . 198
Alexandros Shikun Zhang and Naomi Fengqi Li
Graph Emotion Distribution Learning Using EmotionGCN . . . . . . . . . . 217
A. Revanth and C. P. Prathibamol
On the Role of Depth Predictions for 3D Human Pose Estimation . . . . . 230
Alec Diaz-Arias, Dmitriy Shin, Mitchell Messmore, and Stephen Baek
AI-Based QOS/QOE Framework for Multimedia Systems . . . . . . . . . . . 248
Laeticia Nneka Onyejegbu, Ugochi Adaku Okengwu,
Linda Uchenna Oghenekaro, Martha Ozohu Musa,
and Augustine Obolor Ugbari
Snatch Theft Detection Using Deep Learning Models . . . . . . . . . . . . . . 260
Nurul Farhana Mohamad Zamri, Nooritawati Md Tahir,
Megat Syahirul Amin Megat Ali, and Nur Dalila Khirul Ashar
Deep Learning and Few-Shot Learning in the Detection of Skin
Cancer: An Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
Olusoji Akinrinade, Chunglin Du, Samuel Ajila,
and Toluwase A. Olowookere
Enhancing Artificial Intelligence Control Mechanisms: Current
Practices, Real Life Applications and Future Views . . . . . . . . . . . . . . . . 287
Usman Ahmad Usmani, Ari Happonen, and Junzo Watada
A General Framework of Particle Swarm Optimization . . . . . . . . . . . . . 307
Loc Nguyen, Ali A. Amer, and Hassan I. Abdalla
How Artificial Intelligence and Videogames Drive Each Other
Forward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
Nathanel Fawcett and Lucien Ngalamoum

Bezier Curve-Based Shape Knowledge Acquisition and Fusion for
Surrogate Model Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
Peng An, Wenbin Ye, Zizhao Wang, Hua Xiao, Yongsong Long,
and Jia Hao
Path Planning and Landing for Unmanned Aerial Vehicles
Using AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Elena Politi, Antonios Garyfallou, Ilias Panagiotopoulos, Iraklis Varlamis,
and George Dimitrakopoulos
Digital Ticketing System for Public Transport in Mexico to Avoid
Cases of Contagion Using Artificial Intelligence . . . . . . . . . . . . . . . . . . . 358
Jose Sergio Magdaleno-Palencia, Bogart Yail Marquez, Ángeles Quezada,
and J. Jose R. Orozco-Garibay
To the Question of the Practical Implementation of “Digital
Immortality” Technologies: New Approaches to the Creation of AI . . . 368
Akhat Bakirov, Ibragim Suleimenov, and Yelizaveta Vitulyova
Collaborative Forecasting Using “Slider-Swarms” Improves
Probabilistic Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
Colin Domnauer, Gregg Willcox, and Louis Rosenberg
Learning to Solve Sequential Planning Problems Without Rewards . . . . 393
Chris Robinson
Systemic Analysis of Democracies and Concept of Their Further
Human-Technological Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
Bernhard Heiden and Bianca Tonino-Heiden
Using Regression and Algorithms in Artificial Intelligence to Predict
the Price of Bitcoin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
Nguyen Dinh Thuan and Nguyen Thi Viet Huong
Integration of Human-Driven and Autonomous Vehicle: A Cell
Reservation Intersection Control Strategy . . . . . . . . . . . . . . . . . . . . . . . 439
Ekene Frank Ozioko, Kennedy John Offor,
and Akubuwe Tochukwu Churchill
Equivalence Between Classical Epidemic Model and Quantum
Tight-Binding Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
Krzysztof Pomorski
Effects of Various Barricades on Human Crowd Movement Flow . . . . . 493
Andrew J. Park, Ryan Ficocelli, Lee Patterson, Frank Dodich,
Valerie Spicer, and Herbert H. Tsang
The Classical Logic and the Continuous Logic . . . . . . . . . . . . . . . . . . . . 511
Xiaolin Li

Research on Diverse Feature Fusion Network Based on Video Action
Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
Chen Bin and Wang Yonggang
Uncertainty-Aware Hierarchical Reinforcement Learning Robust
to Noisy Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
Felippe Schmoeller Roza
Resampling-Free Bootstrap Inference for Quantiles . . . . . . . . . . . . . . . . 548
Mårten Schultzberg and Sebastian Ankargren
Determinants of User's Acceptance of Mobile Payment: A Study of
Cambodia Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
Sreypich Soun, Bunhov Chov, and Phichhang Ou
A Proposed Framework for Enhancing the Transportation Systems
Based on Physical Internet and Data Science Techniques . . . . . . . . . . . 578
Ashrakat Osama, Aya Elgarhy, and Ahmed Elseddawy
A Systematic Review of Machine Learning and Explainable Artificial
Intelligence (XAI) in Credit Risk Modelling . . . . . . . . . . . . . . . . . . . . . . 596
Yi Sheng Heng and Preethi Subramanian
On the Application of Multidimensional LSTM Networks to Forecast
Quarterly Reports Financial Statements . . . . . . . . . . . . . . . . . . . . . . . . . 615
Adam Gałuszka, Aleksander Nawrat, Eryka Probierz, Karol Jędrasiak,
Tomasz Wiśniewski, and Katarzyna Klimczak
Utilizing Machine Learning to Predict Breast Cancer: One Step Closer
to Bridging the Gap Between the Nature Versus Nurture Debate . . . . . 625
Junhong Park and Miso Kim
Recognizing Mental States when Diagnosing Psychiatric Patients via
BCI and Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644
Ayeon Jung
Diagnosis of Hepatitis C Patients via Machine Learning Approach:
XGBoost and Isolation Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
Ting Sun
Data Analytics, Viability Modeling and Investment Plan Optimization
of EDA Companies in Case of Disruptive Technological Event . . . . . . . 669
Galia Marinova, Aida Bitri, and Vassil Guliashki
Using Genetic Algorithm to Create an Ensemble Machine Learning
Models to Predict Tennis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681
Arisoa S. Randrianasolo and Larry D. Pyeatt
Towards Profitability: A Profit-Sensitive Multinomial Logistic
Regression for Credit Scoring in Peer-to-Peer Lending . . . . . . . . . . . . . 696
Yan Wang, Xuelei Sherry Ni, and Xiao Huang

Distending Function-based Data-Driven Type2 Fuzzy Inference
System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 719
József Dombi and Abrar Hussain
Vsimgen: A Proposal for an Interactive Visualization Tool
for Simulation of Production Planning and Control Strategies . . . . . . . . 731
Shailesh Tripathi, Andreas Riegler, Christoph Anthes,
and Herbert Jodlbauer
An Annotated Caribbean Hot Pepper Image Dataset . . . . . . . . . . . . . . . 753
Jason Mungal, Azel Daniel, Asad Mohammed, and Phaedra Mohammed
A Prediction Model for Student Academic Performance Using
Machine Learning-Based Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 770
Harjinder Kaur and Tarandeep Kaur
Parameterized-NL Completeness of Combinatorial Problems
by Short Logarithmic-Space Reductions and Immediate
Consequences of the Linear Space Hypothesis . . . . . . . . . . . . . . . . . . . . 776
Tomoyuki Yamakami
Rashomon Effect and Consistency in Explainable Artificial
Intelligence (XAI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 796
Anastasia-M. Leventi-Peetz and Kai Weber
Recent Advances in Algorithmic Biases and Fairness in Financial
Services: A Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 809
Aakriti Bajracharya, Utsab Khakurel, Barron Harvey, and Danda B. Rawat
Predict Individuals’ Behaviors from Their Social Media Accounts,
Different Approaches: A Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 823
Abdullah Almutairi and Danda B. Rawat
Environmental Information System Using Embedded Systems Aimed
at Improving the Productivity of Agricultural Crops in the
Department of Meta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 837
Obeth Hernan Romero Ocampo
An Approach of Node Model TCnNet: Trellis Coded Nanonetworks
on Graphene Composite Substrate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 850
Diogo F. Lima Filho and José R. Amazonas
CAD Modeling and Simulation of a Large Quadcopter
with a Flexible Frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 860
Ajmal Roshan and Rached Dhaouadi
Cooperative Decision Making for Selection of Application Strategies . . . 880
Sylvia Encheva, Erik Styhr Petersen, and Margareta Holtensdotter Lützhöft

Dual-Statistics Analysis with Motion Augmentation for Activity
Recognition with COTS WiFi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 888
Ouyang Zhang
Cascading Failure Risk Analysis of Electrical Power Grid . . . . . . . . . . . 906
Saikat Das and Zhifang Wang
A New Metaverse Mobile Application for Boosting Decision Making of
Buying Furniture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 924
Chutisant Kerdvibulvech and Thitawee Palakawong Na Ayuttaya

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 935


Min-Max Cost and Information Control
in Multi-layered Neural Networks

Ryotaro Kamimura1(B) and Ryozo Kitajima2


1 Tokai University and Kumamoto Drone Technology and Development Foundation,
2880 Kamimatsuo Nishi-ku, Kumamoto 861-5289, Japan
[email protected]
2 Tokyo Polytechnic University, 1583 Iiyama, Atsugi, Kanagawa 243-0297, Japan
[email protected]

Abstract. The present paper aims to propose a new method to minimize and
maximize information and its cost, accompanied by the ordinary error minimiza-
tion. All these computational procedures are operated as independently as pos-
sible from each other. This method aims to solve the contradiction in conven-
tional computational methods in which many procedures are intertwined with
each other, making it hard to compromise among them. In particular, we try
to minimize information at the expense of cost, followed by information max-
imization, to reduce humanly biased information obtained through artificially
created input variables. The new method was applied to the detection of rela-
tions between mission statements and firms’ financial performance. Though the
relation between them has been considered one of the main factors for strategic
planning in management, the past studies could only confirm very small positive
relations between them. In addition, those results turned out to be very dependent
on the operationalization and variable selection. The studies suggest that there
may be some indirect and mediating variables or factors to internalize the mis-
sion statements in organizational members. If neural networks have an ability to
infer those mediating variables or factors, new insight into the relation can be
obtained. Keeping this in mind, the experiments were performed to infer some
positive relations. The new method, based on minimizing the humanly biased
effects from inputs, could produce linear, non-linear, and indirect relations, which
could not be extracted by the conventional methods. Thus, this study shows a pos-
sibility for neural networks to interpret complex phenomena in human and social
sciences, which, in principle, conventional models cannot deal with.

Keywords: Min-max property · Cost · Information · Generalization ·
Interpretation · Mission statements

1 Introduction
1.1 Necessity of Min-Max Property

One of the main characteristics of multi-layered neural networks is their ability to create
the sparse distributed representation in which all components should be used on average,

c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 1–17, 2023.
https://doi.org/10.1007/978-3-031-18461-1_1
while a very few should respond to a specific input [1]. This type of representation has
been the base of many learning strategies from the beginning of research on neural net-
works. For example, the conventional competitive learning and self-organizing maps [2–
6] need the winner-take-all type of responses to specific inputs, but actually, all neurons
are forced to be equally used, with many additional constraints to reduce dead neurons
[7–11]. This seemingly contradictory statement of specific and non-specific property
can be explained by the economical and efficient use of components in living systems,
in which all available components should respond to inputs equally, but at the same time,
all the components should have their own meaning or specific information. This means
that information specific to inputs should be minimized and at the same time maximized
in some ways. For clarifying this contradictory min-max property more concretely, we
explain here this property in terms of min-max selectivity, information, and cost.
First, the importance of the min-max property can be found in the recent discussion
on the necessity of selectivity in neural networks. As has been well known, the selec-
tivity has played important roles in the neurosciences from the beginning [12, 13], with
many experimental results accumulated [14]. The selectivity has also played impor-
tant roles in neural networks in improving generalization [15]. In particular, the selec-
tivity has been so far discussed in the field of convolutional neural networks (CNN).
For example, the majority of interpretation methods have tried to show which compo-
nents in a neural network are the most responsible for a specific input pattern or spe-
cific output [16, 17]. However, there have been recent discussions on the importance of
selectivity in neural networks, saying that the selectivity should be reduced as much as
possible, especially for improving generalization performance [18, 19]. This discussion
suggests that selectivity should be minimized and maximized, depending on different
situations in neural networks. The selectivity becomes useful in improving generalization
performance only under certain learning conditions. We need to adjust the strength of
selectivity according to learning conditions and objectives, suggesting the importance
of controlling the min-max property in selectivity.
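To make the notion concrete, the degree of class selectivity discussed in this literature can be quantified by a max-versus-rest index; the following is an illustrative sketch (not the measure used later in this paper), with hypothetical class-conditional mean activations:

```python
import numpy as np

def class_selectivity(mean_acts):
    """Selectivity index in [0, 1]: 0 = equal response to all classes,
    1 = response to a single class only (assumes non-negative activations)."""
    mean_acts = np.asarray(mean_acts, dtype=float)
    mu_max = mean_acts.max()
    # Mean activation over all remaining (non-preferred) classes
    mu_rest = np.delete(mean_acts, mean_acts.argmax()).mean()
    if mu_max + mu_rest == 0:
        return 0.0
    return (mu_max - mu_rest) / (mu_max + mu_rest)

# A neuron responding equally to every class (minimal selectivity) ...
print(class_selectivity([0.5, 0.5, 0.5]))   # 0.0
# ... versus one responding to a single class (maximal selectivity).
print(class_selectivity([1.0, 0.0, 0.0]))   # 1.0
```

Adjusting the strength of selectivity then amounts to steering such an index up or down during learning, depending on whether interpretation or generalization is the goal.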
The min-max property can be also found in the regularization approach [20–23].
Usually, the regularization has been realized by decreasing the strength of weights,
such as weight decay. Suppose that we consider the absolute strength of weights as a
representational cost; then, the majority of regularization methods have been realized
by reducing the cost for representation. Actually, this cost minimization has a natural
effect to reduce the average cost, while the cost, in terms of the weight strength, associ-
ated with a specific neuron may be increased. Thus, we can say that the regularization
approach has a property that the average cost should be decreased, but the cost with a
few specific neurons should be increased in the end. The regularization methods can be
used to realize the contradictory cost minimization and maximization at the same time.
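The effect of cost minimization on the average representational cost can be sketched with plain weight decay on a toy weight matrix; this is an illustrative sketch only (the decay coefficient and matrix are arbitrary, and the error-gradient term that can concentrate weight on a few neurons is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0.0, 1.0, size=(8, 8))   # toy weight matrix
decay = 0.1                             # arbitrary weight-decay coefficient

cost_before = np.abs(W).mean()          # average representational cost
# One pure weight-decay step (no error gradient): w <- w - decay * w
W_after = W - decay * W
cost_after = np.abs(W_after).mean()     # uniformly shrunk: 0.9 * cost_before
```

In full training, the omitted error term pulls in the opposite direction for the weights that matter, so a few neurons can end up with increased cost while the average cost still falls.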
In addition, the min-max property of the sparse distributed representation can be
closely related to min-max information control. This concept of information can be used
to improve interpretation and generalization. For example, when we try to interpret a
neural network, we need to focus on specific information on a small number of com-
ponents. On the contrary, when we try to improve generalization, we need to distribute
information on input patterns over as many components as possible, because we need
to deal with new and unseen input patterns. This means that information minimization,
responding to all inputs, and information maximization, responding to specific inputs,
should be used, depending on the purposes of study. In addition, the situation is much
better if we can use both types of information optimization procedures at the same time.
In the information-theoretic approach, the need to maximize and minimize infor-
mation has been well recognized. For example, the information-theoretic methods have
used the complicated measure of mutual information, whose use in neural networks
has been initiated by the pioneering works on the maximum information preservation
by Linsker [24–26]. In mutual information, the properties of information maximization
and minimization are implicitly supposed from our viewpoint. For example, in mutual
information maximization, unconditional entropy should be maximized, while condi-
tional entropy should be minimized to achieve its maximum value. Since entropy max-
imization means that all components are uniformly distributed on average, and condi-
tional entropy minimization means that a component tends to be closely connected with
a specific one, mutual information is also a special type of operation with the min-max
property in neural networks.
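This decomposition of mutual information into an unconditional entropy to be maximized and a conditional entropy to be minimized can be illustrated with a small hand-built joint distribution (an illustrative sketch only):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero entries are ignored."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I(X;Y) = H(Y) - H(Y|X), computed from a joint table p(x, y)."""
    px = joint.sum(axis=1)               # marginal p(x)
    py = joint.sum(axis=0)               # marginal p(y)
    h_y = entropy(py)                    # unconditional entropy (maximized)
    # H(Y|X) = sum_x p(x) * H(Y | X=x)   (minimized)
    h_y_given_x = 0.0
    for i, pxi in enumerate(px):
        if pxi > 0:
            h_y_given_x += pxi * entropy(joint[i] / pxi)
    return h_y - h_y_given_x, h_y, h_y_given_x

# A deterministic one-to-one mapping: p(y) is uniform, so H(Y) = 1 bit,
# while H(Y|X) = 0, so mutual information attains its maximum of 1 bit.
joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])
mi, hy, hyx = mutual_information(joint)
print(f"I(X;Y) = {mi:.3f} bits")  # H(Y) = 1.000, H(Y|X) = 0.000
```

The two terms make the min-max property explicit: maximal mutual information requires components that are uniformly used on average (entropy maximization) yet each tied to a specific input (conditional entropy minimization).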

1.2 Contradiction Resolution on Min-Max Operations


As mentioned above, there are min-max operations in information and cost control
in neural networks that are contradictory to each other. In spite of this contradiction,
many computational procedures have tried to realize those operations simultaneously.
However, as the complexity of neural networks becomes larger, it becomes harder and
harder to appropriately control the contradiction. The computational learning proce-
dures in neural networks tend to be performed in intertwined ways, meaning that the
procedures are actually mixed and in which all different computational procedures
should be simultaneously performed. As mentioned above, learning in terms of error
minimization between outputs and targets is usually accompanied by several regular-
ization terms, such as weight decay [21–23, 27, 28] to limit the network capacity so as
to correspond to the complexity of input patterns. These regularization terms have been
used for improved generalization as well as interpretation due to the simplified final
representations obtained by the regularization. As inferred from the above discussion,
these complicated and entangled computational procedures have made it impossible to
realize the inherent and necessary min-max operation in information, cost, and error.
The same problem applies to representation learning, which aims to identify and dis-
entangle the underlying important factors hidden in complicated and entangled repre-
sentations [29]. There have been many different types of attempts to obtain the underlying
factors [30–35]. However, one of the main problems with these types of representation
learning seems to lie in their computational methods, where many different types of pro-
cedures are introduced, which are to be performed simultaneously and in synchronized
ways in the middle of error minimization. We should say again that the disentangling
process contains intertwined computational procedures, making it hard to clarify the
min-max property by which we can reach the underlying and core factors.
In this context, we propose here a new method to unfold or disentangle the
information-theoretic computational procedures and to control network configurations
for producing easily interpretable compressed information, because they are inter-
twined, entangled, and folded, making learning processes considerably complicated
and hard. To make learning as efficient as possible, we need to disentangle or unfold
those entangled computational procedures as much as possible. Thus, we try to perform
information minimization and maximization, cost minimization and maximization, and
error minimization as separately as possible. In the conventional and intertwined meth-
ods, those computational procedures are simultaneously executed, which has made it
hard to compromise among those contradictory operations. It should be stressed that
the present method can perform each of those procedures as independently as possible
of the others. The contradictions among them cannot necessarily be
solved, but at least they can be weakened.

1.3 New Insights into the Analysis of Mission Statements

The method was applied to the analysis of the relation between mission statements and
firms’ financial performance [36]. The mission statement has been considered one of
the most widespread methods for strategic planning [37–39] in companies’ manage-
ment. There have been a number of attempts to show positive relations between them [40],
while there are also some reports against positive influences, in particular, against the
organizational members [41]. Roughly speaking, the past studies seem to show that the
relation between mission statements and firms’ financial performance may be a small
positive one [42]. The past studies on this relation suggest that the conclusions are
highly dependent on the operationalization decisions. In particular, the variable and tar-
get selection can have much influence on the final results on the relations. We think that
one of the main problems lies in the limitations of the analysis methods: conventional
linear models such as regression are usually used, and only direct, linear relations
between inputs and targets are examined. To deal with such complex problems, we need
a method that can handle non-linear relations and, in addition, reduce the influence of
operationalization and variable selection [43], because the extracted information is
otherwise biased toward very specific purposes. The new method proposed here reduces
the information in input variables as far as possible, even at the expense of cost. The
present method can thus offer new insight into this relation, insight that is independent
of, or at least as independent as possible of, the input variables. Since input variables
are artificial and humanly biased, not necessarily representing the core information in
the inputs, we minimize the information through the inputs at the expense of cost and
then maximize the information for interpretation. This application can therefore
demonstrate the ability of the present method to reveal relations whose existence has
not been definitely established by conventional methods.

2 Theory and Computational Methods

2.1 Serial Unfolding

The present paper tries to unfold the complicated, intertwined, and folded combination
of information minimization and maximization, cost minimization, and error minimization
(assimilation) into completely separated and independently operated procedures.
Figure 1(a) shows the intertwined and folded combination of the three procedures: they
are processed in complicated and interwoven ways, which makes it hard to compromise
among the three types of processing. Figure 1(b) shows the three procedures separated
and serially operated. In
Min-Max Cost and Information Control 5

Fig. 1. Conventional folded processing (a) and serially unfolded processing (b).

this processing, the error minimization is called "information assimilation," and it is
operated separately from the corresponding information minimization and maximization.
For example, information minimization is applied first, producing a minimized
information state independently of the information assimilation. The information
obtained by information minimization is then assimilated by the information
assimilation, which is actually error minimization, independently of the minimization
itself. In the second step, the cost is minimized, followed by the same error
minimization. This cost minimization is introduced to weaken the strong effect of
information minimization, because in this study information is forced to decrease at the
expense of cost. The same process is applied to the information maximization, which

is followed by the information assimilation. This cycle of information minimization,
cost minimization, and information maximization, each with information assimilation
(error minimization), can be repeated until the effects of information minimization,
maximization, and cost minimization are sufficiently assimilated.

2.2 Selective Information


In this paper, we use a new type of information measure, close to the entropy of
information theory [44], but whose meaning can be concretely interpreted in terms of
the number of connection weights and their strength, in contrast to the abstract concept
of entropy. We define the selective information in terms of the number of connection
weights. For simplicity, we focus on the connection weights between the second and
third layers (2,3), as shown in the initial state in Fig. 2. The absolute strength of a
weight is computed by
u_{jk}^{(2,3)} = | w_{jk}^{(2,3)} |    (1)

Then, we normalize this by its maximum value:

g_{jk}^{(2,3)} = \frac{u_{jk}^{(2,3)}}{\max_{j'k'} u_{j'k'}^{(2,3)}}    (2)

where the max operation is over all connection weights between the two layers. In
addition, we define the complementary potentiality by

\bar{g}_{jk}^{(2,3)} = 1 - \frac{u_{jk}^{(2,3)}}{\max_{j'k'} u_{j'k'}^{(2,3)}}    (3)

We call these normalized absolute strengths the "potentiality" and "complementary
potentiality," because they can be used to increase or decrease the selective
information. Using the potentiality, the selective information can be computed by

\bar{G}^{(2,3)} = \sum_{j=1}^{n_2} \sum_{k=1}^{n_3} \left( 1 - \frac{u_{jk}^{(2,3)}}{\max_{j'k'} u_{j'k'}^{(2,3)}} \right)    (4)

When all potentialities become equal, the selective information naturally becomes zero.
On the other hand, when only one potentiality becomes one while all the others are zero,
the selective information becomes maximal. For simplicity, we suppose that at least one
connection weight is larger than zero. Finally, the cost can be computed simply as the
sum of all absolute weights:

C^{(2,3)} = \sum_{j=1}^{n_2} \sum_{k=1}^{n_3} u_{jk}^{(2,3)}    (5)
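The quantities in Eqs. (1)-(5) can be sketched in a few lines of NumPy. This is our own minimal reading of the definitions, not the authors' implementation; the function names are ours.

```python
import numpy as np

def potentialities(w):
    """Eqs. (1)-(3): absolute weight strengths normalized by their maximum."""
    u = np.abs(w)                 # u_jk = |w_jk|
    g = u / u.max()               # potentiality g_jk
    return g, 1.0 - g             # (potentiality, complementary potentiality)

def selective_information(w):
    """Eq. (4): sum of complementary potentialities; zero when all weights
    are equally strong, maximal when a single weight dominates."""
    _, g_bar = potentialities(w)
    return g_bar.sum()

def cost(w):
    """Eq. (5): sum of absolute weights."""
    return np.abs(w).sum()
```

For a 3x4 matrix of identical weights, `selective_information` returns 0, matching the equal-potentiality case described above; when a single weight dominates, it approaches n2*n3 - 1.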
Min-Max Cost and Information Control 7

Fig. 2. Information minimization with cost augmentation (a), cost minimization (b), and informa-
tion maximization (c) only for the first learning cycle.

2.3 Learning Cycle


In the learning shown in Fig. 2, selective information is minimized (a1) with cost
augmentation, where the strength of connection weights is forced to become larger,
contrary to the assumption of other regularization methods, and the strengths of all
connection weights are also forced toward the same value. This large strength is
introduced to strengthen the force of information minimization. The effect of
information minimization is then assimilated into connection weights in Fig. 2(a2), in
which the ordinary error minimization procedure is applied for a fixed number of steps.
When the assimilation ends, cost minimization is applied in Fig. 2(b1), where the
strength of weights is reduced by a fixed rate. Again, the assimilation is applied in
Fig. 2(b2). Finally,

information maximization is applied in Fig. 2(c1), where the number of stronger
connection weights is forced to become smaller, as is the strength of the weights. The
effect of information maximization is then assimilated in Fig. 2(c2).

Connection weights are changed by multiplying them by the corresponding potentialities.
Since the actual implementation involves several parameters to be adjusted, it should
be discussed more concretely; due to page limitations, however, the detailed
implementation is omitted for easy interpretation of the information flow in this
study. In the first step, information is forced to be minimized, which is realized by
multiplying by the complementary potentiality. For the (n + 1)th step, weights are
computed by

w_{jk}^{(2,3)}(n+1) = \bar{g}_{jk}^{(2,3)}(n) \, w_{jk}^{(2,3)}(n)    (6)
In this process, the cost augmentation should be applied, which can be realized by a
learning parameter larger than one. Then, cost minimization is applied by

w_{jk}^{(2,3)}(n+1) = \theta_1 \, w_{jk}^{(2,3)}(n)    (7)

In cost minimization, the learning parameter \theta_1 should be less than one. Finally,
to increase the selective information, we have

w_{jk}^{(2,3)}(n+1) = g_{jk}^{(2,3)}(n) \, w_{jk}^{(2,3)}(n)    (8)

To increase information, the potentiality g should be used, under which strong weights
are forced to become stronger while weak ones become weaker. This has the effect of
reducing the number of strong weights.
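One serial cycle of the update rules in Eqs. (6)-(8) can be sketched as follows. This is a hedged illustration under our own assumptions: the rate `theta0 > 1` is one way to read the "learning parameter larger than one" used for cost augmentation, and `assimilate` stands in for the ordinary error-minimization training that follows each control step.

```python
import numpy as np

def info_min_step(w, theta0=1.2):
    """Eq. (6) with cost augmentation: multiply weights by their
    complementary potentialities, scaled by a rate larger than one."""
    u = np.abs(w)
    g_bar = 1.0 - u / u.max()
    return theta0 * g_bar * w

def cost_min_step(w, theta1=0.9):
    """Eq. (7): shrink all weights by a fixed rate theta1 < 1."""
    return theta1 * w

def info_max_step(w):
    """Eq. (8): multiply weights by their potentialities, so strong
    weights stay strong and weak ones are driven toward zero."""
    u = np.abs(w)
    return (u / u.max()) * w

def learning_cycle(w, assimilate):
    """One unfolded cycle: each control step is followed by ordinary
    error minimization ("assimilation") for a fixed number of epochs."""
    for step in (info_min_step, cost_min_step, info_max_step):
        w = assimilate(step(w))
    return w
```

The three control steps never interact directly; they only communicate through the assimilation runs between them, which is the "serial unfolding" described in Sect. 2.1.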

2.4 De-layered Compression

We should then deal with the multi-layered property of neural networks. For the
interpretation of neural networks, we need to reduce the number of layers, that is, to
de-layer the networks as much as possible. Naturally, as the number of hidden layers
increases, neurons tend to be connected in more complicated ways, which makes it
impossible to understand the inner mechanism; this black-box problem has been one of
the serious problems of neural networks. To make interpretation possible, complicated
multi-layered neural networks should be de-layered or compressed, decreasing the number
of hidden layers. As is well known, model compression has recently received much
attention [45–52]. However, typical compression methods try to replace multi-layered
networks with fewer-layered ones that differ from the originals. Thus, we need a method
that directly reduces the number of hidden layers while keeping the original
information as untouched as possible.
For interpreting multi-layered neural networks, we compress them into the simplest
ones, as shown in Fig. 3. We try here to trace all routes from inputs to the corresponding
outputs by multiplying and summing all corresponding connection weights.

Fig. 3. De-layered compression to the simplest network (b), and further unfolded one (c).

First, we compress the connection weights from the first to the second layer, denoted
by (1,2), with those from the second to the third layer (2,3), for an initial condition
and a subset of a data set. We then have the compressed weights between the first and
third layers, denoted by (1,3):

w_{ik}^{(1,3)} = \sum_{j=1}^{n_2} w_{ij}^{(1,2)} \, w_{jk}^{(2,3)}    (9)

Those compressed weights are further combined with the weights from the third to the
fourth layer (3,4), giving the compressed weights between the first and fourth layers
(1,4):

w_{il}^{(1,4)} = \sum_{k=1}^{n_3} w_{ik}^{(1,3)} \, w_{kl}^{(3,4)}    (10)
By repeating this process, we obtain the compressed weights between the first and sixth
layers, denoted by w_{iq}^{(1,6)}. Using those connection weights, we obtain the final,
fully compressed weights (1,7):

w_{ir}^{(1,7)} = \sum_{q=1}^{n_6} w_{iq}^{(1,6)} \, w_{qr}^{(6,7)}    (11)

Since they consider all routes from the inputs to the outputs, the final connection
weights should represent the overall characteristics of the connection weights of the
original multi-layered neural network. Finally, the connection weights are expected to
be unfolded and disentangled, meaning that each input can be treated separately, as
shown in Fig. 3(c).
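Because each compression step in Eqs. (9)-(11) is a sum over all routes through one intermediate layer, the whole de-layering reduces to a chain of matrix products. The sketch below is our own illustration of that observation, not code from the paper.

```python
import numpy as np

def delayer(weight_chain):
    """Collapse layer-to-layer weight matrices W(1,2), W(2,3), ... into a
    single input-to-output matrix by summing over every route through the
    intermediate layers (Eqs. (9)-(11)).  Each matrix has shape
    (units in lower layer, units in upper layer)."""
    compressed = weight_chain[0]
    for w in weight_chain[1:]:
        compressed = compressed @ w  # sum over one intermediate layer
    return compressed
```

The result has one row per input and one column per output; for the ten-hidden-layered networks used in the experiments below, `weight_chain` would hold eleven matrices.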

3 Results and Discussion


3.1 Experimental Outline
This experiment aimed to examine the relation between the mission statements of
companies and their profitability. For the experiment, we collected the mission
statements of

300 companies listed in the first section of the Tokyo Stock Exchange, summarized by
five input variables extracted by natural language processing systems. To demonstrate
the performance of the present method, we used very redundant ten-hidden-layered neural
networks, where each hidden layer had ten neurons. The data set seemed very easy, but
conventional methods such as linear regression and random forest could not improve
generalization. In addition, simple information maximization and minimization also
failed to improve generalization. We therefore repeated the process of information
minimization and maximization up to 20 cycles, because no improvement in performance
could be seen beyond that point. In the experiments, we tried to increase generalization
performance with the focus on information minimization, where information was first
minimized and then maximized. We used information assimilation steps containing a fixed
number of learning epochs (50 epochs) to assimilate the information produced by the
information minimizer and maximizer. Several learning parameters had to be controlled;
the most important was the parameter \theta_1, which was changed from 1 to 1.5 to
control the strength of weights, or cost. As this parameter increased, the strength of
weights or potentialities (cost) increased gradually, decreasing the information
content.

3.2 Selective Information and Cost

The experimental results show that, as the number of learning cycles increased to 20,
information decreased gradually and, at the same time, the fluctuation of selective
information became smaller, meaning that information could be assimilated into
connection weights smoothly.
Figure 4 shows selective information (left), cost (middle), and the ratio of information
to cost (right) as the number of cycles increased from two (a) to 20 (d), and for the
conventional method without information control (e). When the number of cycles was two,
in Fig. 4(a), selective information (left), cost (middle), and the ratio (right)
increased to a maximum point and then decreased drastically to a minimum. As the number
of cycles increased from five in Fig. 4(b) to 20 in Fig. 4(d), the strength of the
fluctuations of information, cost, and ratio decreased gradually. In particular, when
the number of cycles was 20, information and its cost decreased sufficiently through
many small fluctuations. On the other hand, with the conventional method in Fig. 4(e),
information, cost, and their ratio remained almost constant over the entire series of
learning steps.

3.3 Potentialities

As the learning cycles increased, the number of weights with stronger potentialities
became larger and more regularly distributed. Figure 5 shows the potentialities, in
terms of relative absolute weights, as the number of cycles increased from two (a) to
20 (d) and for the conventional method (e). One of the main characteristics is that, as
the number of cycles increased, the number of stronger weights with higher
potentialities increased gradually. When the number of cycles was two, in Fig. 5(a),
only one weight tended to have higher potentialities. On the contrary, when the number

Fig. 4. Selective information (left), cost (middle), and ratio of information to its cost (right) as a
function of the number of steps when the number of learning cycles increased from two (a) to 20
(d) and when the conventional method (e) was used for the mission statement data set.

of cycles increased to 20, in Fig. 5(d), the number of weights with higher
potentialities, shown in white, increased gradually from the second hidden layer through
the ninth hidden layer. In addition, the potentialities responded very regularly to
neurons in the preceding and subsequent layers. From the information-theoretic
viewpoint, information decreases as the number of stronger components increases. This
means that the potentialities, when the number of cycles was 20 in Fig. 5(d), tended to
be distributed evenly in the higher hidden layers. Thus, the present method can realize
a process of information minimization over hidden layers. Finally, when the conventional
method was used, in Fig. 5(e), the number of strong weights became smaller, and many
weights seemed to be randomly distributed.
12 R. Kamimura and R. Kitajima

Fig. 5. Potentialities for all hidden layers by the selective information when the number of cycles
increased from two (a) to 20 (d) and by the conventional method without information control (e)
for the mission statement data set.

3.4 Compressed Weights

The conventional method without information control produced compressed weights close
to the original correlation coefficients between inputs and targets. When information
was controlled to increase generalization, the compressed weights became different from
those correlation coefficients: an input considered not so important in terms of the
correlation coefficients was detected as the most important one.
Figure 6 shows the compressed weights (left), relative weights (middle), and correlation
coefficients (right) as the number of cycles increased from two (a) to 20 (d) and for
the conventional method (e). The relative weights were obtained by dividing the
compressed weights by the original correlation coefficients. When the number of cycles
was two, in Fig. 6(a), the compressed weights (left) differed from the original
correlation coefficients (right), and input No. 5 had the highest strength. When the number

Table 1. Summary of experimental results on average correlation coefficients and
generalization performance for the mission statement data set. The numbers in the
Method column represent the values of the parameter \theta_1 used to control the
potentialities.

Method          Correlation   Accuracy
2                0.095        0.627
5               -0.830        0.616
10               0.232        0.630
20               0.966        0.645
Conventional     0.974        0.597
Logistic         0.981        0.527
Random forest    0.238        0.529

of cycles increased from five in Fig. 6(b) to 10 in Fig. 6(c), the compressed weights
(left) gradually became similar to the correlation coefficients (right), but relative
input No. 5 still had the highest strength. When the number of cycles increased to 20,
in Fig. 6(d), the compressed weights were almost equal to the correlation coefficients,
but the strength of relative input No. 5 became smaller. Finally, with the conventional
method in Fig. 6(e), the compressed weights were close to the corresponding correlation
coefficients, and the strength of relative input No. 5 became smaller. This tendency was
seen more clearly in the relative weights, normalized by the original correlations, in
the middle panels of Fig. 6. As can be seen in the figure, input No. 5 showed the
highest strength in all cases. The results show that input No. 5 should play some
important role, though a non-linear one.

3.5 Summary of Results

The present method produced compressed weights close to the correlation coefficients
and, at the same time, could improve generalization performance by changing the
compressed weights. Thus, the method could explicitly extract both linear and
non-linear relations between inputs and targets.
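The "Correlation" column of Table 1 can be read as the agreement between the compressed input-to-output weights and the plain input-target correlation coefficients. A minimal sketch of that comparison, under our own assumptions about the computation (the paper does not spell out the averaging), is:

```python
import numpy as np

def input_target_correlations(X, y):
    """Pearson correlation of each input variable with the target."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))

def weight_vs_correlation(compressed_w, X, y):
    """Correlation between compressed weights (one per input) and the
    original input-target correlation coefficients."""
    corr = input_target_correlations(X, y)
    return np.corrcoef(compressed_w, corr)[0, 1]
```

A value near 1 (as for the conventional method, 0.974) means the compressed network behaves like a linear model; departures from 1 indicate the non-linear relations discussed above.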
Table 1 summarizes the correlation coefficients and generalization performance. The
conventional method produced the second highest correlation coefficient of 0.974, but
its accuracy was relatively low (0.597). When the number of cycles was 20, the
correlation coefficient of 0.966 was close to the 0.974 of the conventional method;
however, the method produced the best accuracy of 0.645. The logistic regression
produced the highest correlation coefficient of 0.981, but its accuracy was the lowest
(0.527). Finally, the random forest produced the lowest correlation coefficient of
0.238, and its accuracy was the second worst at 0.529. The results show that even
conventional and very redundant multi-layered neural networks could extract the linear
and independent relations between inputs and outputs. When the information was
controlled, the compressed weights tended to be less similar to the correlation
coefficients, and input No. 5, considered not so important, tended to show higher
importance. Input No. 5 represents the abstract property, meaning that the input-target
relation,

Fig. 6. The compressed weights (left), relative weights (middle), and the original
correlation coefficients (right) when the number of cycles increased from two (a) to 20
(d) and by the conventional method (e) for the mission statement data set. The
compressed weights are averages over ten different initial conditions and input
patterns.

or more concretely, the relation between inputs and profitability, can be based on
input No. 5, a relation that could not be extracted by the conventional methods.

4 Conclusion

The present paper aimed to present a method to resolve a contradiction among
computational procedures in neural networks. We focused particularly on the
contradiction among information minimization, information maximization, cost
minimization, and error minimization. These computational procedures are intertwined
with each other, which has made it hard to reconcile the contradictory procedures. To
cope with

this contradiction, we proposed a method that unfolds the four procedures so that they
are operated separately, serially, and independently. This serially unfolded computing
can be used to resolve, at least apparently, the contradiction among the four factors.
In addition, the final multi-layered networks were de-layered: the hidden layers were
eliminated by gradually compressing all connection weights for easy interpretation. The
method was applied to a real data set on the relation between companies' mission
statements and their profitability. The experimental results showed that the majority
of the relations could be explained by simple, linear relations. However, a non-linear,
mediating factor was also detected by the present method in the course of increasing
generalization performance. Thus, the results suggest that neural networks can be
applied to fields where interpretation is considered more important than improving
generalization, and where indirect relations are more important than direct ones due to
the complexity of the data sets. The present method also showed that highly redundant
neural networks can be used even to extract linear and independent relations between
inputs and targets. In addition, by forcing networks to learn toward a more specific
objective, such as generalization, the non-linear relations can be extracted while
keeping the linear relations unchanged as much as possible. Thus, this paper shows that
neural networks can be used to improve both generalization and interpretation for
complicated data sets.

References
1. Hinton, G.E., McClelland, J.L., Rumelhart, D.E.: Distributed representations. In: Parallel
Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations,
pp. 77–109 (1986)
2. Rumelhart, D.E., Zipser, D.: Feature discovery by competitive learning. Cogn. Sci. 9, 75–112
(1985)
3. Kohonen, T.: Self-Organization and Associative Memory. Springer, New York (1988).
https://doi.org/10.1007/978-3-642-88163-3
4. Kohonen, T.: Self-Organizing Maps. Springer, Heidelberg (1995). https://doi.org/10.1007/
978-3-642-97610-0
5. Xu, Y., Xu, L., Chow, T.W.S.: PPoSOM: a new variant of PolSOM by using probabilis-
tic assignment for multidimensional data visualization. Neurocomputing 74(11), 2018–2027
(2011)
6. Xu, L., Chow, T.W.S.: Multivariate data classification using PolSOM. In: Prognostics and
System Health Management Conference (PHM-Shenzhen), pp. 1–4. IEEE (2011)
7. DeSieno, D.: Adding a conscience to competitive learning. In: IEEE International Confer-
ence on Neural Networks, vol. 1, pp. 117–124. Institute of Electrical and Electronics Engi-
neers, New York (1988)
8. Xu, L.: Rival penalized competitive learning for clustering analysis, RBF net, and curve
detection. IEEE Trans. Neural Netw. 4(4), 636–649 (1993)
9. Choy, C.S., Siu, W.: A class of competitive learning models which avoids neuron underuti-
lization problem. IEEE Trans. Neural Netw. 9(6), 1258–1269 (1998)
10. Banerjee, A., Ghosh, J.: Frequency-sensitive competitive learning for scalable balanced clus-
tering on high-dimensional hyperspheres. IEEE Trans. Neural Netw. 15(3), 702–719 (2004)
11. Van Hulle, M.M.: Entropy-based kernel modeling for topographic map formation. IEEE
Trans. Neural Netw. 15(4), 850–858 (2004)

12. Hubel, D.H., Wiesel, T.N.: Receptive fields, binocular interaction and functional architecture
in cat's visual cortex. J. Physiol. 160, 106–154 (1962)
13. Bienenstock, E.L., Cooper, L.N., Munro, P.W.: Theory for the development of neuron selec-
tivity. J. Neurosci. 2, 32–48 (1982)
14. Schoups, A., Vogels, R., Qian, N., Orban, G.: Practising orientation identification improves
orientation coding in V1 neurons. Nature 412(6846), 549–553 (2001)
15. Ukita, J.: Causal importance of low-level feature selectivity for generalization in image
recognition. Neural Netw. 125, 185–193 (2020)
16. Nguyen, A., Yosinski, J., Clune, J.: Understanding neural networks via feature visualiza-
tion: a survey. In: Samek, W., Montavon, G., Vedaldi, A., Hansen, L.K., Müller, K.-R. (eds.)
Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. LNCS (LNAI), vol.
11700, pp. 55–76. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-28954-6_4
17. Montavon, G., Binder, A., Lapuschkin, S., Samek, W., Müller, K.-R.: Layer-wise relevance
propagation: an overview. In: Samek, W., Montavon, G., Vedaldi, A., Hansen, L.K., Müller,
K.-R. (eds.) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. LNCS
(LNAI), vol. 11700, pp. 193–209. Springer, Cham (2019). https://doi.org/10.1007/978-3-
030-28954-6_10
18. Morcos, A.S., Barrett, D.G.T., Rabinowitz, N.C., Botvinick, M.: On the importance of single
directions for generalization. Stat 1050, 15 (2018)
19. Leavitt, M.L., Morcos, A.: Selectivity considered harmful: evaluating the causal impact of
class selectivity in DNNs. arXiv preprint arXiv:2003.01262 (2020)
20. Arpit, D., Zhou, Y., Ngo, H., Govindaraju, V.: Why regularized auto-encoders learn sparse
representation? In: International Conference on Machine Learning, pp. 136–144. PMLR
(2016)
21. Goodfellow, I., Bengio, Y., Courville, A.: Regularization for deep learning. Deep Learn.
216–261 (2016)
22. Kukačka, J., Golkov, V., Cremers, D.: Regularization for deep learning: a taxonomy. arXiv
preprint arXiv:1710.10686 (2017)
23. Wu, C., Gales, M.J.F., Ragni, A., Karanasou, P., Sim, K.C.: Improving interpretability and
regularization in deep learning. IEEE/ACM Trans. Audio Speech Lang. Process. 26(2), 256–
265 (2017)
24. Linsker, R.: Self-organization in a perceptual network. Computer 21(3), 105–117 (1988)
25. Linsker, R.: Local synaptic rules suffice to maximize mutual information in a linear network.
Neural Comput. 4, 691–702 (1992)
26. Linsker, R.: Improved local learning rule for information maximization and related applica-
tions. Neural Netw. 18, 261–265 (2005)
27. Moody, J., Hanson, S., Krogh, A., Hertz, J.A.: A simple weight decay can improve general-
ization. Adv. Neural Inf. Process. Syst. 4, 950–957 (1995)
28. Fan, F.-L., Xiong, J., Li, M., Wang, G.: On interpretability of artificial neural networks: a
survey. IEEE Trans. Radiat. Plasma Med. Sci. 5(6), 741–760 (2021)
29. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspec-
tives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
30. Hu, J., et al.: Architecture disentanglement for deep neural networks. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision, pp. 672–681 (2021)
31. Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: Autoaugment: learning aug-
mentation strategies from data. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 113–123 (2019)
32. Gupta, A., Murali, A., Gandhi, D., Pinto, L.: Robot learning in homes: improving general-
ization and reducing dataset bias. arXiv preprint arXiv:1807.07049 (2018)

33. Kim, B., Kim, H., Kim, K., Kim, S., Kim, J.: Learning not to learn: training deep neural net-
works with biased data. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp. 9012–9020 (2019)
34. Wang, T., Zhao, J., Yatskar, M., Chang, K.W., Ordonez, V.: Balanced datasets are not enough:
estimating and mitigating gender bias in deep image representations. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision, pp. 5310–5319 (2019)
35. Hendricks, L.A., Burns, K., Saenko, K., Darrell, T., Rohrbach, A.: Women also snowboard:
overcoming bias in captioning models. In: Proceedings of the European Conference on Com-
puter Vision (ECCV), pp. 771–787 (2018)
36. Cortés-Sánchez, J.D., Rivera, L.: Mission statements and financial performance in Latin-
American firms. Verslas: Teorija ir praktika/Business Theory Pract. 20, 270–283 (2019)
37. Bart, C.K., Bontis, N., Taggar, S.: A model of the impact of mission statements on firm
performance. Manag. Decis. 39(1), 19–35 (2001)
38. Hirota, S., Kubo, K., Miyajima, H., Hong, P., Park, Y.W.: Corporate mission, corporate
policies and business outcomes: evidence from Japan. Manag. Decis. (2010)
39. Alegre, I., Berbegal-Mirabent, J., Guerrero, A., Mas-Machuca, M.: The real mission of the
mission statement: a systematic review of the literature. J. Manag. Organ. 24(4), 456–473
(2018)
40. Atrill, P., Omran, M., Pointon, J.: Company mission statements and financial performance.
Corp. Ownersh. Control. 2(3), 28–35 (2005)
41. Vandijck, D., Desmidt, S., Buelens, M.: Relevance of mission statements in flemish not-for-
profit healthcare organizations. J. Nurs. Manag. 15(2), 131–141 (2007)
42. Desmidt, S., Prinzie, A., Decramer, A.: Looking for the value of mission statements: a meta-
analysis of 20 years of research. Manag. Decis. (2011)
43. Macedo, I.M., Pinho, J.C., Silva, A.M.: Revisiting the link between mission statements and
organizational performance in the non-profit sector: the mediating effect of organizational
commitment. Eur. Manag. J. 34(1), 36–46 (2016)
44. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, Hoboken (1991)
45. Buciluǎ, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: Proceedings of the
12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
pp. 535–541. ACM (2006)
46. Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Advances in Neural Information
Processing Systems, pp. 2654–2662 (2014)
47. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint
arXiv:1503.02531 (2015)
48. Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Hints for thin deep
nets. In: Proceedings of ICLR, Fitnets (2015)
49. Luo, P., Zhu, Z., Liu, Z., Wang, X., Tang, X.: Face model compression by distilling knowl-
edge from neurons. In: Thirtieth AAAI Conference on Artificial Intelligence (2016)
50. Neill, J.O.: An overview of neural network compression. arXiv preprint arXiv:2006.03669
(2020)
51. Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: a survey. Int. J. Comput. Vis.
129(6), 1789–1819 (2021)
52. Cheng, Y., Wang, D., Zhou, P., Zhang, T.: A survey of model compression and acceleration
for deep neural networks (2020)
Face Generation from Skull Photo Using
GAN and 3D Face Models

Duy K. Vo1,2(B), Len T. Bui1,2, and Thai H. Le1,2

1 Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
[email protected], {btlen,lhthai}@fit.hcmus.edu.vn
2 Vietnam National University, Ho Chi Minh City, Vietnam

Abstract. Generating face images from skull images has many applications in fields
such as archaeology, anthropology, and especially forensics. However, face/skull image
generation remains a challenging problem because face images and skull images have
different characteristics and the data on skull images is limited. We therefore treat
this transformation as an unpaired image-to-image translation problem and study the
recently popular generative adversarial networks (GANs) for generating face images
from skull images. To this end, we use a novel synthesis framework called U-GAT-IT, a
framework for unsupervised image-to-image translation. This framework uses AdaLIN
(Adaptive Layer-Instance Normalization), a new normalization function, to focus on the
more important regions between the source and target domains. Furthermore, to visualize
the generated face from many other aspects, we use an additional 3D face generation
model called DECA (Detailed Expression Capture and Animation), a 3D facial
reconstruction model trained to robustly produce a UV displacement map from a
low-dimensional latent representation. Experimental results show that the proposed
method achieves positive results compared to current unpaired image-to-image
translation models.

Keywords: Skull image generation · Skull to face · Unpaired image-to-image translation

1 Introduction
In recent years, information technology has developed at a rapid rate; in particular,
applications of artificial intelligence have attracted increasing interest and
development. This has promoted information technology into one of the important fields
contributing to the country's economic development and improving people's lives.
One of the problems that computer scientists have long studied is reconstruction,
especially facial generation, which is the most commonly researched owing to the
convenience of face data collection. Among these, the problem of generating face images
from skull images is
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 18–31, 2023.
https://doi.org/10.1007/978-3-031-18461-1_2
Face Generation from Skull Photo Using GAN and 3D Face Models 19

attracting the research attention of many scientists in different fields in many
countries around the world. This was demonstrated by the participation of more
than 100 delegates with more than 50 papers on many aspects at the Inter-
national Conference on Reconstruction of Soft Facial Parts (RSFP), Remagen,
Germany, 2005. Some typical application areas of the problem can be mentioned
as archeology, anthropology and forensic examination and can be applied in the
search for the identity of unknown martyrs based on their skulls because when
excavating a corpse, the skull is usually the least damaged.
In the research on generating face images from skull images, the accuracy of
the generated face is always a top concern because even small changes in the
face are noticeable. Biederman and Kalocsai [3] suggest that facial recognition
is very sensitive to contrast, light, size and especially viewing angle. In addition,
face/skull image generation remains a challenging problem due to the fact that
face images and skull images have different characteristics and the data on skull
images is also limited. Therefore, face image generation from skull images is a
complex problem, and it is this complexity that attracts a lot of special attention
from researchers.
The human face has many shapes and forms; its shape depends on many
factors such as age, gender, aesthetics, trauma, etc., so a skull image can be
suitable for many different faces. From a mathematical point of view, this problem
is an inverse problem and can give many answers. The problem is even more
complicated in cases where only part of the skull is found, because of missing
elements (such as jaws or teeth) [1]. This means that there is no way
that a face can be accurately generated simply from its skull image.
The typical approaches commonly used to model a face are often still artistic
and rely heavily on the anatomic and physiognomic knowledge of the modeller
[1]. In other words, computers simply replace the old process of creating identi-
fiers with hand-drawn sketches or with clay models, adding realistic editing and
simulation capabilities, but the reliability of the result is limited.
In this paper, we aim to generate complete face images from complete skull
images. We propose an integrated system that generates face images from skull
images and 3D faces from the generated face images.
This paper is organized as follows. In Sect. 2 related works are presented.
In Sect. 3 the proposed method is presented in detail. In Sect. 4 the experiment
results are presented and discussed. The paper concludes showing directions for
future research in Sect. 5.

2 Related Work

Generating facial images from skull images has been studied since the late 19th
century, starting with Paul Broca's 1867 study of the relationship between bone
and the thickness of soft tissues in the face [5]; the work was later completed
and officially published by the Russian scientist Gerasimov [21]. In the field of
forensic science, typical works include the facial reconstruction of Karen
Price (a Welsh girl murdered in 1981, whose remains were found eight years later)
20 D. K. Vo et al.

[15]. In archeology, in 2013, a group of physical anthropologists and forensic
artists presented the facial reconstruction of King Richard III [15]. In 2014, the
face of Naia, a Pleistocene teenager from Yucatán, Mexico, was created by a
research team so-called Kennewick Man [15]. Most recently, the facial reconstruction
of the so-called Cheddar Man - a Neolithic Englishman with dark skin,
straight dark hair and blue eyes - has sparked controversy in Britain
(Devlin, 2018) [15].
The face image generation methodologies developed from skull images come
basically from two main approaches:

1. The manual methods:
   – 2D artistic drawing of a face image overlaid on a skull image.
   – Photo/video overlay of face images on a skull image.
   – Clay/plaster sculpting of the face.
2. The computer-aided methods:
   – Matching skulls with face images.
   – Generating a three-dimensional face model from skull images.
2.1 The Manual Methods

2D Artistic Drawing Face Image Overlay on a Skull Image. This method
was published by Pearson [17] in 1926. It requires a forensic artist who has
knowledge of the relationship between the skull and the face. Based on this
knowledge, the artist observes the image of the skull, draws a sketch of the
face, then overlays the sketched portrait on the skull image to check and correct
the outline contours accordingly. This is the simplest generation technique, suitable
for providing a preliminary identification from cadavers for forensic purposes
(see Fig. 1).

Fig. 1. 2D artistic drawing face image overlay on a skull image [23]


Photo/Video Overlay of Face Images on a Skull Image. This method was
proposed by Sen (1962), Gupta (1969) and Sekharan (1973). Its main purpose
is to compare a skull image to a face image to highlight matching
features [1]. The method is often applied when forensic experts already have a
small group of face images with characteristics that match the skull; the skull
is then overlaid on each image to select the face image that best fits the
skull (see Fig. 2).

Fig. 2. Photo/video overlay of face images on a skull image [1]

Clay/Plaster Sculpting of the Face. This method relies on tissue depths to create
the face. The artist applies the usual depth markers (typically referred to as
landmarks) and then begins to model a face fitting the landmarks [1]. It is common
to use 32 landmarks (10 points on the midline of the face and 22 points placed
symmetrically on either side of the face), identified by medical examiners.
Based on the identified landmarks, the basic facial muscles are covered with plaster
or clay. The result is a three-dimensional model of the face in clay/plaster. This
method requires a lot of time to collect information, because many factors affect
tissue thickness (race, sex, age, body status, etc.), and the result is
difficult to edit.
In 1867, Paul Broca, said to be the first anatomist to study the relationship
between bone and facial soft tissue thickness [5], laid the first foundation for
the problem. This work was later developed and completed by Gerasimov, the
Russian scientist, who unveiled a sculpted head of a man whose corpse had
decomposed; it was recognized by a mother as her missing son.

2.2 The Computer-Aided Methods


Matching Skulls with Face Images. Skull matching can be considered a
heterogeneous matching problem, where the first domain is the skull images
and the second domain is the face images. With the help of science and technol-
ogy, two-dimensional images of the face are superimposed on the skull [7,13]. A
semi-supervised formulation of transform learning and a skull dataset called
IdentifyMe were proposed in 2017 by Nagpal et al. [14] (see Fig. 3 for an example
of this approach).

Fig. 3. Matching skulls with face images [19]

Generating a Three-Dimensional Face Model from Skull Images. The
three-dimensional facial generation technique involves the production of facial
sculptures onto the skull or a skull replica [23].
Digitizing skull data is a very important step in generating a 3D facial model
from a skull. Normally, to obtain three-dimensional digitized data of the skull,
high-precision camcorders, tomography, three-dimensional scanners and CT
scanners are used to collect complete information from the skull. Moreover, a
scanner-based system requires a relatively complicated setup of specialized
equipment, including a computer, camcorder, tripod, digital video mixer, digital
display system, display adjustment device, etc.
Accuracy was evaluated by comparing the morphology of the generated face
and the target face using three-dimensional modeling software (Rapidform) [23].
However, three-dimensional scanners, CT scanners are often quite expensive and
inconvenient to carry, so this approach is often used in research laboratories.
Another approach is to generate a three-dimensional model of the skull from
the collected skull images, and then use the skull model to generate a face model.
The FACES [1] software was researched and perfected by authors from the
University of Salerno, Italy, in 2004. This software generates a face model
from a skull based on algorithms that adjust the face according to landmarks
marked on the skull.
A skull-based facial generation system using the 3DS MAX graphics software
was developed by Björn Andersson and Martin Valfridsson in 2005 [2]. The
experimental results show that the method used in the software is significantly
faster than the traditional methods.
FLAME [11] is a recent 3D face model proposed by Tianye Li et al. in 2017.
This model decomposes faces into shape, pose and expression parameters, and it
yields a more realistic and accurate face description. Figure 4 shows an example
of 3D face generation using DECA [6], which is built on the FLAME model.

Fig. 4. 3D face generation using DECA [6]

3 Proposed Method
In this section, the problem formulation is presented followed by a detailed
description of the proposed method. Also, details of generator and discriminator
network architecture are presented.

3.1 Formulation
Given a dataset D consisting of a source domain {x_i}^N_{i=1} ∈ Xs of skull
images and a target domain {y_j}^M_{j=1} ∈ Xt of face images, the goal of the
face-from-skull generation model is to learn two mapping functions:
(1) G : Xs → Xt, which represents skull image (Xs) to face image (Xt) generation,
and (2) F : Xt → Xs, which represents face image (Xt) to skull image (Xs) generation.
In addition, we introduce two discriminators Ds and Dt, where Ds aims to
distinguish between real skull images {x} and translated images Ft→s(y), while Dt
aims to distinguish between real face images {y} and generated images Gs→t(x).
The full objective contains four types of terms: adversarial
loss, cycle loss, identity loss and CAM loss.

Adversarial Loss. An adversarial loss is used for matching the distribution of
generated images to the target image distribution. We apply it to both mapping
functions.
For the mapping function G : Xs → Xt and its discriminator Dt:

L^{s→t}_gan = E_{y∼Xt}[(Dt(y))^2] + E_{x∼Xs}[(1 − Dt(Gs→t(x)))^2],   (1)

where Gs→t tries to generate images Gs→t(x) that look similar to images from the
target domain Xt, while Dt tries to distinguish between generated images
Gs→t(x) and real images y from the target domain Xt. Gs→t aims to minimize this
objective, while Dt aims to maximize it, i.e., min_{Gs→t} max_{Dt} L^{s→t}_gan. The analogous
adversarial loss for F : Xt → Xs and its discriminator Ds is min_{Ft→s} max_{Ds} L^{t→s}_gan.
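As a concrete illustration, the least-squares adversarial term of Eq. (1) can be sketched in NumPy (a simplified sketch, not the authors' implementation; `d_real` and `d_fake` are hypothetical arrays of discriminator outputs on real target images and on generated images):

```python
import numpy as np

def lsgan_adversarial_loss(d_real, d_fake):
    """Least-squares adversarial objective of Eq. (1).

    d_real: discriminator outputs D_t(y) on real target-domain images.
    d_fake: discriminator outputs D_t(G_{s->t}(x)) on generated images.
    D_t tries to maximize this value; G_{s->t} tries to minimize it.
    """
    return float(np.mean(d_real ** 2) + np.mean((1.0 - d_fake) ** 2))
```

When the generator fools the discriminator completely (all `d_fake` close to 1), the second term vanishes.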

Cycle Consistency Loss. Cycle consistency loss was introduced with the
CycleGAN [24] architecture to constrain the optimization problem: if we translate
a skull image to a face image and then back to a skull image, we should get
the same input image back.
In the paper, we want to learn two mapping functions: G : Xs → Xt and
F : Xt → Xs . Cycle consistency loss encourages F (G(x)) ≈ x and G(F (y)) ≈ y.
It reduces the space of possible mapping functions by enforcing forward and
backward consistency:

Lcycle = E_{x∼Xs}[|x − Ft→s(Gs→t(x))|_1] + E_{y∼Xt}[|y − Gs→t(Ft→s(y))|_1],   (2)
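The round-trip penalty of Eq. (2) can be sketched as follows (an illustrative sketch; `g_s2t` and `f_t2s` are hypothetical callables standing in for the two generators):

```python
import numpy as np

def cycle_consistency_loss(x, y, g_s2t, f_t2s):
    """Eq. (2): L1 reconstruction error after a round trip through both generators."""
    forward = np.mean(np.abs(x - f_t2s(g_s2t(x))))   # skull -> face -> skull
    backward = np.mean(np.abs(y - g_s2t(f_t2s(y))))  # face -> skull -> face
    return float(forward + backward)
```

Two generators that are exact inverses of each other drive this loss to zero.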

Identity Loss. To ensure that the color distributions of the input image and
the generated image are similar, an identity consistency constraint is applied
to the generator. For an image x ∈ Xt, the image should not be changed
after the translation of x using Gs→t:

L^{s→t}_identity = E_{x∼Xt}[|x − Gs→t(x)|_1],   (3)
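A sketch of the identity term of Eq. (3): the generator applied to an image should reproduce it (`generator` is an illustrative callable placeholder):

```python
import numpy as np

def identity_loss(img, generator):
    """Eq. (3): L1 penalty on any change the generator makes to the image."""
    return float(np.mean(np.abs(img - generator(img))))
```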

CAM Loss. By exploiting the information from the auxiliary classifiers ηs and
ηDt, given an image x ∈ {Xs, Xt}, Gs→t and Dt get to know where they need to
improve or what makes the most difference between the two domains in the current
state [10]:

L^{s→t}_cam = −(E_{x∼Xs}[log(ηs(x))] + E_{x∼Xt}[log(1 − ηs(x))]),   (4)

L^{Dt}_cam = E_{x∼Xt}[(ηDt(x))^2] + E_{x∼Xs}[(1 − ηDt(Gs→t(x)))^2],   (5)

The Full Objective. Finally, the full objective is the sum of the four objectives
with pre-set default coefficients:

L = λ1 Lgan + λ2 Lcycle + λ3 Lidentity + λ4 Lcam , (6)


where λ1, λ2, λ3, λ4 control the relative importance of the four objectives.
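Eq. (6) is a plain weighted sum; a sketch with the coefficient values the authors report in Sect. 4.2 (λ1 = 1, λ2 = 10, λ3 = 10, λ4 = 1000):

```python
def full_objective(l_gan, l_cycle, l_identity, l_cam,
                   lam1=1.0, lam2=10.0, lam3=10.0, lam4=1000.0):
    """Eq. (6): weighted sum of the four loss terms."""
    return lam1 * l_gan + lam2 * l_cycle + lam3 * l_identity + lam4 * l_cam
```

The large λ4 reflects how strongly the CAM term is weighted in the experiments.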

3.2 Generator
The main goal of the generator is to train a function Gs→t that maps an image
from the source domain Xs to the target domain Xt using unpaired images from
each domain [10] in the dataset.
The Gs→t model consists of an encoder Es , a decoder Gt , and an auxiliary
classifier ηs , where ηs (x) represents the probability that x comes from Xs [10].
Let x ∈ {Xs, Xt} represent a sample from the source or the target domain, let
E^k_s(x) be the k-th activation map of the encoder, E^k_{s,ij}(x) be its value at
position (i, j), and w^k_s be the weight of the k-th feature map for the source
domain, trained by ηs(x). The auxiliary classifier output ηs(x) is given by:

ηs(x) = σ(Σ_k w^k_s Σ_{ij} E^k_{s,ij}(x))   (7)
The attention feature maps as(x) are given by:

as(x) = ws ∗ Es(x) = {w^k_s ∗ E^k_s(x) | 1 ≤ k ≤ n},   (8)

where n is the number of encoded feature maps.
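Eqs. (7)-(8) can be sketched for a single sample as follows (a simplified NumPy sketch, not the authors' implementation; shapes are illustrative assumptions):

```python
import numpy as np

def cam_attention(feature_maps, weights):
    """CAM-style attention from encoder feature maps, Eqs. (7)-(8).

    feature_maps: array of shape (n, H, W), the n encoded maps E_s^k(x).
    weights: array of shape (n,), the learned per-map weights w_s^k.
    Returns (eta, attention_maps): domain probability and {w_s^k * E_s^k(x)}.
    """
    # Eq. (7): sigma( sum_k w^k * sum_ij E^k_ij )
    logit = float(np.sum(weights * feature_maps.sum(axis=(1, 2))))
    eta = 1.0 / (1.0 + np.exp(-logit))
    # Eq. (8): per-map reweighting of the encoder features
    attention_maps = weights[:, None, None] * feature_maps
    return eta, attention_maps
```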


Inspired by Batch-Instance Normalization (BIN) [16], the authors in [10]
proposed Adaptive Layer-Instance Normalization (AdaLIN), a new normalization
method whose parameters are learned from the dataset during training,
combining the advantages of the two normalization methods AdaIN [8] and LN
[20]. The AdaLIN module is formulated as:

AdaLIN(a, γ, β) = γ · (ρ · âI + (1 − ρ) · âL) + β,   (9)

âI = (a − μI) / √(σ²I + ε),   âL = (a − μL) / √(σ²L + ε),   (10)

ρ ← clip_[0,1](ρ − τΔρ)   (11)


where γ and β are dynamically computed by a fully connected layer from the attention
map, μI, σI and μL, σL are the channel-wise and layer-wise mean and standard
deviation, τ is the learning rate, and Δρ indicates the parameter update vector
determined by the optimizer [10].
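Eqs. (9)-(10) amount to blending instance normalization (per-channel statistics) with layer normalization (whole-tensor statistics). A simplified single-sample sketch, not the authors' implementation (the `(C, H, W)` layout is an assumption):

```python
import numpy as np

def adalin(a, gamma, beta, rho, eps=1e-5):
    """AdaLIN of Eqs. (9)-(10) for one feature map of shape (C, H, W)."""
    mu_i = a.mean(axis=(1, 2), keepdims=True)   # instance (per-channel) mean
    var_i = a.var(axis=(1, 2), keepdims=True)   # instance variance
    mu_l = a.mean()                             # layer mean over C, H, W
    var_l = a.var()                             # layer variance
    a_hat_i = (a - mu_i) / np.sqrt(var_i + eps)
    a_hat_l = (a - mu_l) / np.sqrt(var_l + eps)
    # rho in [0, 1] interpolates between the two normalizations
    return gamma * (rho * a_hat_i + (1.0 - rho) * a_hat_l) + beta
```

With ρ = 1 the output is purely instance-normalized (each channel has zero mean, unit variance); with ρ = 0 it is purely layer-normalized.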

3.3 Discriminator

The main goal of the discriminator is to train a discriminant function Dt to
distinguish whether an image generated by Gs→t is similar to the images in the
target domain Xt of the dataset.
The Dt model consists of an encoder EDt, a classifier CDt, and an auxiliary
classifier ηDt. The auxiliary classifier ηDt is simply an extension of the
class-conditional GAN that requires the discriminator not only to predict whether
the image is 'real' or 'fake' but also to provide the 'source' or the 'class label'
of the given image. Let x ∈ {Xt, Gs→t(Xs)} represent a sample from the target
domain and the generated source domain. Dt (x) exploits the attention feature
maps aDt (x) using weights wDt trained by ηDt on the encoder EDt [10]. Formula
for attention feature maps aDt (x):

aDt (x) = wDt ∗ EDt (x) (12)

3.4 3D Face Generation Model - DECA

DECA [6] is a three-dimensional face generation model that is trained to generate
a UV displacement map that includes person-specific detail parameters
and generic expression parameters. DECA [6] is built using an Encoder-Decoder
architecture; the input image I is regressed into parameters corresponding
to albedo, lighting and geometry [6]. The main idea of DECA [6] is based
on observing an individual’s face then displaying facial details based on facial
expression.

4 Experimental Results
In this section, experimental settings and evaluation of the proposed method
are discussed in detail. Three reference databases are used: the Flickr-Faces-HQ
(FFHQ) [9], the Chinese University of Hong Kong Face Sketch (CUFS) [22] and
IdentifyMe [14].

4.1 Datasets
The paper focuses on generating face images from skull images and using a
face-to-three-dimensional transformation model to visualize the obtained results.
However, datasets that contain both face images and corresponding skull images
are scarce. Therefore, to be able to train the model, we propose a composite
dataset, which is a combination of the IdentifyMe [14] dataset and the two facial
datasets FFHQ [9] and CUFS [22].
The IdentifyMe dataset introduced in [14] consists of 464 skull and face
images divided into two parts:
– Part 1: Skull and Face Image Pairs. A total of 35 skull images and their cor-
responding face images are collected from various sources, some of these pairs
correspond to real world cases where a skull was found and later identified to
belong to a missing person [14].
– Part 2: Unlabeled Supplementary Skull Images. A total of 429 skull images
are collected from various sources on the Internet and in real life.
The Flickr-Faces-HQ (FFHQ) dataset is a dataset of human faces consisting of
70,000 high-quality images at 1024 × 1024 resolution. The images were crawled
from Flickr (thus inheriting all the biases of that website) and automatically
aligned and cropped [9].
The CUHK Face Sketch database (CUFS) [22] is a viewed sketch database,
but we only use face images in this paper. We collect 188 face images from the
Chinese University of Hong Kong (CUHK) student database.
The proposed dataset is fed into the system to form two training sets used
to compare the results with each other. The first training set skull2ffhq is a
combination of two datasets IdentifyMe and FFHQ and the second training set
skull2CUFS is a combination of two datasets IdentifyMe and CUFS.

4.2 Training Details


During the model training procedure, each input image is resized to the size of
256 × 256. We train the U-GAT-IT [10] network from scratch, with a learning
rate of 0.0002 and a batch size of 1. In particular, we replace the negative log-
likelihood objective by a least-squares loss [12]. The loss is more stable during
training and generates higher quality results [24]. For all the experiments, we
set λ1 = 1, λ2 = 10, λ3 = 10, λ4 = 1000. We keep the same learning rate for the
first 50 epochs and linearly decay the rate to zero over the next 50 epochs (see
Figs. 5 and 6).
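The schedule just described (constant for the first 50 epochs, then linear decay to zero over the next 50) can be sketched as:

```python
def learning_rate(epoch, base_lr=2e-4, constant_epochs=50, decay_epochs=50):
    """Training schedule: flat at base_lr, then linear decay to zero."""
    if epoch < constant_epochs:
        return base_lr
    done = epoch - constant_epochs
    remaining = max(decay_epochs - done, 0)
    return base_lr * remaining / decay_epochs
```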

Fig. 5. Skull images from IdentifyMe [14] on data preprocessing

Fig. 6. Facial images from two faces datasets [9, 22] on data preprocessing

4.3 Evaluation
We first evaluate the performance of the proposed method on the two proposed
datasets. We then use the better-performing dataset to train a model for a recent
unpaired image-to-image translation method, CycleGAN. Finally, we compare the
performance of our method against CycleGAN (see Fig. 7).
For evaluation, we use two metrics: the Inception Score and the Kernel
Inception Distance.
The Inception Score, or IS for short, proposed in [18], is an objective metric
for evaluating the quality of generated images, specifically synthetic images
output by generative adversarial network models. The Inception Score has a
lowest value of 1.0 and a highest value equal to the number of classes supported
by the classification model.
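The IS is the exponentiated mean KL divergence between the per-image class distribution and the marginal class distribution. A sketch from classifier outputs (in practice these come from a pretrained Inception network; here any probability matrix works for illustration):

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """IS from predicted class probabilities.

    p_yx: array of shape (num_images, num_classes), each row a probability
    distribution p(y|x) for one generated image.
    """
    p_y = p_yx.mean(axis=0)  # marginal class distribution p(y)
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

Uniform predictions give the minimum score of 1.0; confident, evenly spread one-hot predictions over C classes give the maximum score of C.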
The Kernel Inception Distance, or KID for short, proposed in [4], is used
to evaluate the images generated by a GAN model: the lower the KID, the more
similar the generated images are to the images in the target domain.
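KID is an unbiased estimate of the squared Maximum Mean Discrepancy with a cubic polynomial kernel [4]. A sketch over feature matrices (in practice, Inception activations; here any feature vectors serve for illustration):

```python
import numpy as np

def kid(feat_real, feat_gen):
    """Unbiased MMD^2 with the polynomial kernel k(a,b) = (a.b/d + 1)^3.

    feat_real: (m, d) features of real images; feat_gen: (n, d) of generated.
    """
    d = feat_real.shape[1]
    k = lambda a, b: (a @ b.T / d + 1.0) ** 3
    k_rr, k_gg, k_rg = k(feat_real, feat_real), k(feat_gen, feat_gen), k(feat_real, feat_gen)
    m, n = len(feat_real), len(feat_gen)
    # exclude diagonal (self-similarity) terms for the unbiased estimate
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_gg = (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
    return float(term_rr + term_gg - 2.0 * k_rg.mean())
```

Matched distributions give values near zero; the further apart the two feature distributions, the larger the KID.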
Fig. 7. Visualization the skull images and their generated images: (a): Source images
from [14], (b): The generated images from the FFHQ facial images dataset [9], (c): The
generated images from the CUFS facial images dataset [22], (d): The generated images
from the CycleGAN model [24]

Evaluate the Performance on Two Proposed Datasets. The Inception
Score and the Kernel Inception Distance of the proposed method on the two
proposed datasets are shown in Table 1.

Table 1. Comparison results on two proposed datasets

Model IS KID
U-GAT-IT and skull2ffhq 2.019 ± 0.157 9.868 ± 0.441
U-GAT-IT and skull2CUFS 1.465 ± 0.101 3.445 ± 0.318

Based on the obtained results, we find that the face images generated from
the FFHQ dataset are more diverse than those from the CUFS dataset, but the
images generated from the CUFS dataset are more stable and more similar to the
target domain than those from the FFHQ dataset. This is because the facial images
of the FFHQ dataset are confounded by accessories such as hats and glasses, and
by age variation, which makes the generative model lose focus on the face.

Comparison with a State-of-the-Art Method. We use the CUFS face
images dataset for the comparison with the CycleGAN model. The Inception
Score and the Kernel Inception Distance for the proposed method and CycleGAN
are shown in Table 2.
Table 2. Comparison results on the proposed method and CycleGAN

Model IS KID
U-GAT-IT 1.465 ± 0.101 3.445 ± 0.318
CycleGAN 1.304 ± 0.084 20.011 ± 0.617

Based on the obtained results, we find that the images generated by the
proposed method are more similar to the target domain than those of the CycleGAN
model. It can be seen that the CycleGAN model has not converged, so it generates
the same image for different input images.

4.4 3D Face Generation - DECA Model


We use the face images generated by the proposed method with the CUFS
facial images dataset. The 3D face model generated by the DECA model is
shown in Fig. 8.

Fig. 8. 3D face generation using DECA model

5 Conclusion
The paper focuses on a model for generating face images from skull images in
order to support the process of identifying the skull's identity. Through
studying the problem, we have grasped the methods and techniques for
generating a face from a skull, along with some basic knowledge about the skull and
face. The approach presented in this paper can generate a face image and a
3D model of the face, with improved results at a KID of 3.445 ± 0.318. However,
the obtained results depend entirely on the training dataset: when the training
dataset changes, the results also change. Therefore, the obtained results are for
reference only and are not yet applicable in practice.
Future work will train and test on more data to generate more transformation
models, and at the same time improve the accuracy of the faces generated by the
model so that it can be applied in practice. In addition, we plan to develop a
system that converts skull images to face images and to build a data normalization
process so that the system gives the most accurate results and can serve other
studies with the same research object.

References
1. Abate, A.F., et al.: FACES: 3D FAcial reConstruction from anciEnt Skulls using
content based image retrieval. J. Vis. Lang. Comput. 15(5), 373–389 (2004)
2. Andersson, B., Valfridsson, M.: Digital 3D facial reconstruction based on computed
tomography (2005)
3. Biederman, I., Kalocsai, P.: Neural and psychophysical analysis of object and face
recognition. In: Wechsler, H., Phillips, P.J., Bruce, V., Soulié, F.F., Huang, T.S.
(eds.) Face Recognition, pp. 3–25. Springer, Heidelberg (1998). https://doi.org/10.
1007/978-3-642-72201-1 1
4. Bińkowski, M., et al.: Demystifying MMD GANs. In: International Conference on
Learning Representations (2018)
5. Buzug, T.M., et al.: Reconstruction of soft facial parts (2005)
6. Feng, Y., et al.: Learning an animatable detailed 3D face model from in-the-wild
images. ACM Trans. Graph. (TOG) 40(4), 1–13 (2021)
7. Grüner, O.: Identification of skulls: a historical review and practical applications.
In: Iscan, M.Y., Helmer, R.P. (eds.) Forensic Analysis of the Skull: Craniofacial
Analysis, Reconstruction, and Identification. Wiley-Liss, New York (1993)
8. Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance
normalization. In: Proceedings of the IEEE International Conference on Computer
Vision, pp. 1501–1510 (2017)
9. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative
adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 4401–4410 (2019)
10. Kim, J., et al.: U-GAT-IT: unsupervised generative attentional networks with
adaptive layer-instance normalization for image-to-image translation (2020)
11. Li, T., et al.: Learning a model of facial shape and expression from 4D scans. ACM
Trans. Graph. 36(6), 194–1 (2017)
12. Mao, X., et al.: Least squares generative adversarial networks. In: Proceedings of
the IEEE International Conference on Computer Vision, pp. 2794–2802 (2017)
13. Miyasaka, S., et al.: The computer-aided facial reconstruction system. Forensic Sci.
Int. 74(1–2), 155–165 (1995)
14. Nagpal, S., et al.: On matching skulls to digital face images: a preliminary approach.
In: 2017 IEEE International Joint Conference on Biometrics (IJCB), pp. 813–819.
IEEE (2017)
15. Delgado, A.N.: The problematic use of race in facial reconstruction. Sci. Cult.
29(5), 568–593 (2020)
16. Paoletti, M.E., et al.: Deep learning classifiers for hyperspectral imaging: a review.
ISPRS J. Photogramm. Remote. Sens. 158, 279–317 (2019)
17. Pearson, K.: On the skull and portraits of George Buchanan. Biometrika, 233–256
(1926)
18. Salimans, T., et al.: Improved techniques for training GANs. Adv. Neural. Inf.
Process. Syst. 29, 2234–2242 (2016)
19. Singh, M., et al.: Learning a shared transform model for skull to digital face
image matching. In: 2018 IEEE 9th International Conference on Biometrics The-
ory, Applications and Systems (BTAS), pp. 1–7. IEEE (2018)
20. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information
Processing Systems, pp. 5998–6008 (2017)
21. Verzé, L.: History of facial reconstruction. Acta Biomed. 80(1), 5–12 (2009)
22. Wang, L., Sindagi, V., Patel, V.: High-quality facial photo-sketch synthesis using
multi-adversarial networks. In: 2018 13th IEEE International Conference on Auto-
matic Face & Gesture Recognition (FG 2018), pp. 83–90. IEEE (2018)
23. Wilkinson, C.: Facial reconstruction-anatomical art or artistic anatomy? J. Anat.
216(2), 235–250 (2010)
24. Zhu, J.-Y., et al.: Unpaired image-to-image translation using cycle-consistent
adversarial networks. In: 2017 IEEE International Conference on Computer Vision
(ICCV) (2017)
Exploring Deep Learning in Road Traffic
Accident Recognition for Roadside Sensing
Technologies

Swee Tee Fu(B) , Bee Theng Lau , Mark Kit Tsun Tee ,
and Brian Chung Shiong Loh

Swinburne University of Technology Sarawak Campus, Kuching, Sarawak, Malaysia


{sfu,blau,mtktsun,bloh}@swinburne.edu.my

Abstract. Road traffic accident recognition is essential in providing timely
information to healthcare authorities for reducing fatalities. This area of research is
heavily dependent on traffic flow data captured through roadside sensing technolo-
gies that are installed on highways and intersections. To date, Deep Learning (DL)
has achieved remarkable progress in solving time-series problems with increasing
applications in road traffic accident recognition. This paper explores recent studies
of DL techniques using roadside sensor-based traffic flow data. Limited literature
has focused on road traffic accident recognition in mixed traffic. Various issues in
current DL recognition solutions that affect accuracy, including consideration of
user varieties, dynamic traffic flow conditions, and external environmental factors
are discussed. In this research, a fusion feature-based deep learning model for
traffic accident recognition has been proposed, consisting of three major streams
of models to cater for prominent features in traffic accident recognition in a mixed
traffic flow environment.

Keywords: Deep learning · Machine learning · Traffic accident recognition

1 Introduction

Road traffic accidents have become the leading cause of casualties and death across the
world. Currently, they are ranked as the ninth leading cause of human casualties
worldwide and are projected to be the fifth leading cause of human casualties in
2030 [1]. A road traffic
accident is a phenomenon that involves traffic collisions between motorised vehicles or
crashes between motorised and non-motorised vehicles as well as pedestrians or other
stationary obstructions that may lead to property damage and injury to road users. One
of the most common causes of fatalities in a road traffic accident is the latency in the
intervention of the paramedic teams between the detection of accidents and their arrival
time at the scene [1–4]. Hence, early recognition of and reaction to accidents are crucial
to reducing road traffic fatalities.
Different traffic flow environments have distinct characteristics and traffic phenom-
ena, which is essential for a thorough understanding of the occurrences of road traffic

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 32–52, 2023.
https://doi.org/10.1007/978-3-031-18461-1_3
Exploring Deep Learning in Road Traffic Accident 33

accidents. A non-mixed traffic flow environment is predominantly comprised of cars and
trucks, which is also termed homogeneous traffic [5], in which vehicles follow lane
discipline and move in a synchronized way fulfilling basic traffic theories. Conversely,
a mixed traffic flow environment is composed of both motorized and non-motorized
vehicles as well as pedestrians, known as heterogeneous traffic [5], in which there can be
an absence of lane discipline due to the variations in sizes and unsynchronized maneu-
verability of the road users. According to [6], accidents are more prone to occur in a
mixed traffic flow environment because of the inconsistent speed of the subjects within
the traffic scene.
Considering these issues, several works on monitoring abnormal traffic flow events
were deployed to improve road monitoring technologies such as sensors, accelerome-
ters, Global Positioning System (GPS), surveillance cameras, etc. CCTV-based visual
detection via surveillance cameras from roadside monitoring stations attains a higher
detection rate and lower false alarm rate compared to the traditional way of using
traffic flow sensor data, because cameras can provide richer graphical information on raw
traffic behaviour. Moreover, in recent years, computer vision techniques have become
prevalent approaches in traffic accident recognition without requiring manual human
interpretation [7].
In current literature, various research works employ traditional statistical analysis
and Machine Learning methods (e.g., logistic regression, log-linear model, decision tree,
support vector machines, etc.) in modelling road traffic accident detection [8–14]. How-
ever, the varying performances of these traffic models remain a challenge especially with
a rich volume of traffic data available today. On top of that, the relationship between
traffic conditions and accidents is often too complex and furthermore, traffic patterns
are mostly heterogeneous [15, 16]. Recently, Deep Learning has seen increasing appli-
cations in traffic accidents with markedly improved performance of vision-based object
detection. Most research has focused on developing accident detection models
based on complex Deep Learning frameworks [7, 17–20]. Deep Learning models gen-
erally achieve higher accuracy as compared to statistical and Machine Learning models
but require more computational power, data storage, and processing capacity [21]. To
date, there are various inherent gaps in Deep Learning solutions, especially in handling
erratic traffic patterns while maintaining a balance between detection speed and accu-
racy. To contribute to this end, future research projects must have a clear awareness of
the current landscape of Deep Learning models applied in traffic accident recognition.
Therefore, this paper is aimed at studying the latest Deep Learning approaches, a
growing research interest in the Machine Learning field, applied specifically to recognizing
road traffic accidents using roadside sensor-based traffic monitoring systems. In the fol-
lowing sections, a review of the state-of-the-art on road traffic accident recognition using
Deep Learning approaches is presented, followed by highlights of important issues and
challenges related to these approaches.

2 Deep Learning Techniques for Road Traffic Accident Recognition


on Roadside Sensor-Based Traffic Flow Data
Several works on monitoring traffic flow anomalies through automatic analysis of
non-video and video stream data have been carried out in highway, urban, and rural areas.
34 S. T. Fu et al.

These research works focused on developing their models based on complex Deep
Learning frameworks.
CNN is the most commonly used Deep Learning approach for studying patterns in spatiotemporal
traffic data and classifying traffic conditions. Many research works employ
CNNs to perform image or video classification on a frame-by-frame basis, in which the
model is mainly trained to distinguish accident scenes from normal traffic scenes. The authors
in [22] use the invariance property of the CNN architecture to extract spatial features from
images and video frames at the initial layer before combining them into meaningful patterns
at the subsequent layers to detect traffic accident scenes. The Traffic-Net image dataset
is used to train the proposed CNN model to classify each video frame into four predefined
categories: accident, dense traffic, fire, and sparse traffic. The proposed
CNN model is compared against a pre-trained ResNet50 model and achieves a higher
accuracy of 94.4% on the four target classes. It is also reported that the accuracy of
accident detection is enhanced with Deep Learning compared to traditional neural networks.
The paper [23] proposed a CNN model trained on 2500 accident images and
2500 non-accident images to classify each input video frame. The classification results
for the video frames are stored in a deque, and rolling prediction averaging is used
to predict the occurrence of an accident. The proposed model achieved an image
prediction accuracy of 85%. Also, Inception V3 is a type of CNN that is 48 layers deep,
and [4] customised Inception V3 with two Deep Learning network architectures,
DenseNet and SENet, to detect high-speed head-on and single-vehicle collisions. The
former increases the depth of the high-dimensional network through repeated feature
reuse, while the latter acts as a filtering mechanism that removes output
features insignificant for traffic accident detection. The proposed model is
trained on various traffic collision images collected from the Internet. The
results showed that the modified Inception V3 model can reach 96% accuracy in
traffic collision detection. However, the model suffers accuracy loss when the test set
is derived from a different dataset than the training set, mainly because of the
limited training data covering traffic collision scenarios. Hence, a larger
pool of training data is needed to increase the accuracy of the image classification
results.
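The rolling prediction averaging described in [23] can be sketched in a few lines; the function name, window size, and threshold below are illustrative assumptions rather than details from the paper:

```python
from collections import deque

def rolling_average_predictor(frame_probs, window=10, threshold=0.5):
    """Smooth per-frame accident probabilities with a rolling average.

    Flags an accident at a frame only when the mean probability over the
    last `window` frames exceeds `threshold`, so a single spurious
    high-probability frame does not trigger a detection on its own.
    """
    history = deque(maxlen=window)  # oldest predictions drop off automatically
    flags = []
    for p in frame_probs:
        history.append(p)
        flags.append(sum(history) / len(history) > threshold)
    return flags

# One noisy spike (frame 2) is suppressed; the sustained run from frame 5
# onwards raises the flag once the window mean crosses the threshold.
per_frame = [0.1, 0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.85, 0.9, 0.9]
flags = rolling_average_predictor(per_frame, window=5)
```

The deque with a fixed `maxlen` gives the constant-memory sliding window that makes this suitable for streaming video.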
Besides, [24] utilizes video dynamic detection, whereby feature extraction is performed
across a series of video frames using a CNN model, the differences between
the video frames are obtained through a GRU model, and the prediction result is output
through a fully connected neural network. The result is based on the occurrence of frame
changes. It has been reported that the training time is significantly reduced through the
fusion of the CNN and GRU models. This model achieved an accuracy of 96.64%, but its
training dataset is limited, which may cause overfitting issues. Recently, [25] addressed
the overfitting issue by proposing a CNN with a drop-out method that uses Global Average
Pooling (GAP) to reduce the number of learning parameters. The results concluded
that the proposed deep model outperforms LR, DT, SVC, RF, and KNN, improving
both the accuracy and the F1 score for crash detection. However, it
can be seen that using a CNN to interpret traffic accident occurrence on a frame-by-frame
Exploring Deep Learning in Road Traffic Accident 35

basis might not be accurate, as a traffic accident is a time-series event and the relationships
between frames should be considered within the model.
To address this issue, [17] combines both CNN (Inception V3) and LSTM for real-
time accident detection on highways. The LSTM layers were added to the existing CNN,
and both temporal and spatial features were taken into consideration. The extracted
features from each video frame were saved into sequences to be further trained in the
LSTM model for final classification as shown in Fig. 1. Recently, [26] proposed a
time-distributed model which combines a time-distributed LSTM, an LSTM network,
and a dense layer for traffic accident classification as shown in Fig. 2. The proposed
model is trained using the DETRAC dataset for normal traffic scenarios and the accident
video dataset from the YouTube platform. The time-distributed LSTM allows the
modelling of long-term sequential dependencies across video frames, which could further
improve the classification accuracy.

Fig. 1. LSTM on CNN [17]

Fig. 2. Time-distributed model architecture [26]

On the other hand, LSTM is also applicable to non-vision-based data, for example,
traffic flow data captured through loop detectors or radar sensors installed at the roadside.
The research [18] performed a comparative study between LSTM and GRU in terms
of traffic accident detection using spatiotemporal data collected from loop detectors.
The Synthetic Minority Over-sampling Technique (SMOTE) is employed to balance the
dataset in terms of accident and non-accident cases. It has been demonstrated that both
models share similar false alarm rates but GRU performed slightly better than LSTM in
detection rate. Both models are capable of dealing with dependencies at different time
scales and achieve accuracies as high as 96%. Moreover, it has been observed
that considering traffic data of only one temporal resolution might not be sufficient to
represent the traffic trends at different time intervals.
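The class-balancing step used in [18] can be illustrated with a minimal, pure-Python approximation of SMOTE's interpolation rule (sample a minority point, pick one of its k nearest minority neighbours, and interpolate between them); all names and parameter values here are illustrative, not taken from [18]:

```python
import random

def smote_like_oversample(minority, n_synthetic, k=3, seed=0):
    """Generate synthetic minority samples by interpolating between a
    minority sample and one of its k nearest minority-class neighbours,
    in the spirit of SMOTE (Chawla et al.)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x)
        neighbours = sorted(
            (v for v in minority if v is not x),
            key=lambda v: sum((a - b) ** 2 for a, b in zip(x, v)),
        )[:k]
        xn = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, xn)])
    return synthetic

# Balance a handful of accident cases against the many non-accident cases.
accidents = [[0.9, 1.2], [1.0, 1.0], [1.1, 0.8], [0.95, 1.1]]
new_samples = smote_like_oversample(accidents, n_synthetic=6)
```

Because each synthetic point lies on a segment between two real minority points, it stays within the region the minority class already occupies in feature space.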
Hence, [16] proposed a novel LSTM-based framework that comprises three LSTM
networks to comprehensively capture traffic states at different time intervals. The outputs
of the three LSTM networks are combined by a fully connected layer with a dropout
layer to avoid overfitting. The model is tested on other similar freeways
and achieves a crash detection accuracy of 65.15% under transferability, better
than models that use only one or two temporal resolutions. Hence, considering traffic data at
different temporal resolutions can improve the prediction performance, especially in a
real-world setting.
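The multi-resolution input idea of [16] amounts to aggregating the same flow series at several temporal resolutions before feeding each view to its own network branch; a minimal sketch, with illustrative function names and bin sizes:

```python
def aggregate(series, step):
    """Mean-aggregate a per-minute series into non-overlapping `step`-minute bins."""
    return [sum(series[i:i + step]) / step
            for i in range(0, len(series) - step + 1, step)]

def multi_resolution_views(flow, steps=(1, 5, 10)):
    """Build one input sequence per temporal resolution; in a model like
    [16], each view would feed its own LSTM branch before a shared fully
    connected layer merges them."""
    return {s: aggregate(flow, s) for s in steps}

# Vehicles/min with a short congestion spike; the 5-min view smooths the
# spike into a bin mean, exposing a coarser trend than the 1-min view.
flow = [30, 32, 35, 90, 88, 31, 30, 29, 33, 34]
views = multi_resolution_views(flow, steps=(1, 5))
```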
Apart from interpreting traffic accident occurrence based on the entire image
scene, it can be more accurate to detect each object within a given frame and track
it across frames for motion and appearance discrepancies. In [27], a CNN is
used for lane boundary line extraction and a selective search method for vehicle detection.
The prediction of accident occurrence is based solely on the lane boundary lines and
the positional relationship with the vehicle's trajectory. Hence, this approach is only
applicable in a non-mixed traffic flow environment where lane discipline is observed and
vehicles are of similar size.
As traffic accident detection requires high real-time performance, one-stage models
such as You Only Look Once (YOLO) are among the preferred Deep Learning techniques
for object detection. The author in [28] used the YOLO net for car detection and
the Fast Fourier Transform algorithm for building the object tracker. The Violent Flow (ViF)
descriptor is then used as input to a Support Vector Machine (SVM) classifier to detect
car crashes. Moreover, [29] utilises YOLO V3 based on Darknet for vehicle detection and
uses bounding box predictions to track movement by identifying the centroids of
the objects. Conservation of momentum is used to calculate the probability of accident
occurrence based on the features extracted through the object tracking process and
to further classify accidents as major or minor. Also, [30] combines YOLO V3 and the
Canny edge detection algorithm to detect cars and to perform a preliminary classification of
accident severity based on three pre-trained car classes: normal,
damaged, and overturned. To enhance performance in detecting small
objects, [3] proposes YOLO-CA, a combination of the one-stage YOLO model and
Multi-Scale Feature Fusion (MSFF). This model comprises 228 neural network
layers and is formally evaluated against several Deep Learning models such as Fast R-CNN,
Faster R-CNN, Faster R-CNN with FPN, the Single Shot MultiBox Detector (SSD),
YOLO V3 without MSFF, and YOLO V3. It is demonstrated that the proposed model
can detect car accident occurrences in 0.0461 s with 90.02% average precision.
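The average precision figures reported for such detectors are conventionally computed with an Intersection-over-Union (IoU) overlap criterion between predicted and ground-truth boxes; a minimal sketch of that underlying metric:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes. Detection
    benchmarks typically count a prediction as a true positive when its
    IoU with a ground-truth box exceeds a threshold such as 0.5."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

overlap = iou((0, 0, 10, 10), (5, 5, 15, 15))  # partially overlapping boxes
```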
On the other hand, [31] proposed an improved model based on their previous work
on traffic accident classification [32]. It supports the detection of a group of vehicles
travelling through a predefined cell by incorporating Faster R-CNN for vehicle separation,
and it tracks each individual vehicle using a data association scheme to obtain the vehicle
trajectory for classifying the traffic incident that occurred. The paper [1] uses Mask
R-CNN for its vehicle detection framework, which employs the RoI Align algorithm
to automatically segment and build pixel-wise masks for every object in the video to

provide more accurate results. Even though this model has higher detection accuracy
than Faster R-CNN, it is still ineffective when dealing with high-density traffic
and occlusion. A centroid-based object tracking algorithm is used to track each
detected vehicle, and the probability that an accident has occurred is calculated
from the acceleration and trajectory anomalies obtained.
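Centroid-based tracking of the kind used in [1, 29] can be approximated by a greedy nearest-centroid association between frames; this is a simplified sketch, not the authors' implementation, and the function names and distance threshold are illustrative assumptions:

```python
def centroid(box):
    """Centre point of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def associate(tracks, detections, max_dist=50.0):
    """Greedily match existing track centroids to new detections by
    nearest Euclidean distance; unmatched detections start new tracks.

    tracks: dict of track_id -> centroid. Returns the updated dict.
    """
    cents = [centroid(b) for b in detections]
    updated, used = {}, set()
    for tid, (tx, ty) in tracks.items():
        best, best_d = None, max_dist
        for j, (cx, cy) in enumerate(cents):
            d = ((tx - cx) ** 2 + (ty - cy) ** 2) ** 0.5
            if j not in used and d < best_d:
                best, best_d = j, d
        if best is not None:
            updated[tid] = cents[best]
            used.add(best)
    next_id = max(tracks, default=-1) + 1
    for j, c in enumerate(cents):
        if j not in used:
            updated[next_id] = c
            next_id += 1
    return updated

tracks = {0: (10.0, 10.0), 1: (100.0, 100.0)}
detections = [(95, 95, 115, 115), (8, 8, 20, 20)]  # boxes near both tracks
tracks = associate(tracks, detections)
```

Storing per-track centroid histories then yields the trajectories from which acceleration and trajectory anomalies are computed.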
However, the aforementioned approaches focus on detecting crashes in
motorised traffic environments, such as vehicle-vehicle or single-vehicle collisions.
A mixed traffic flow environment is even more significant, as large numbers
of pedestrians and cyclists share roadways with automobiles.
Hence, it is critical for the model to take the traffic characteristics
of a mixed traffic flow environment into consideration. To support this, [19] initiated an attempt to
investigate collision detection in a dense, lane-less mixed traffic flow environment using a newly
proposed framework known as the Siamese Interaction LSTM (SILSTM). This
framework comprises one or more bidirectional LSTM (BLSTM) layers that can learn
salient features from long vehicle interaction trajectories as well as pedestrian spatial
trajectories, allowing highly effective detection of safe and collision-prone interaction
trajectories in lane-less traffic. A bidirectional LSTM can more accurately capture the propensity of
a collision caused by a sudden change in vehicle speed by
incorporating both the future and past context of each time step, using a separate LSTM on
the reversed sequence [19]. By examining the relative distance and speed information of
a vehicle and its neighbouring vehicles in both directions, this approach can better represent
the complex interaction trajectories of every road user in a mixed traffic flow, especially in
the absence of lane discipline and with inconsistent vehicle speeds. The
proposed model is compared against different variants of LSTM and GRU and achieves
higher recall, precision, and F1 score, but it is
relatively expensive in terms of computational cost.
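The bidirectional idea, pairing each time step with both past and future context, can be illustrated with a toy one-unit tanh recurrence standing in for the LSTM cells; the weights and input values are illustrative assumptions:

```python
import math

def simple_rnn_states(seq, w_in=0.5, w_rec=0.9):
    """One-unit tanh recurrence: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h, states = 0.0, []
    for x in seq:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional_states(seq):
    """Pair every time step with a forward state (summarising the past)
    and a backward state computed on the reversed sequence (summarising
    the future), mirroring how a BLSTM concatenates both directions."""
    fwd = simple_rnn_states(seq)
    bwd = list(reversed(simple_rnn_states(list(reversed(seq)))))
    return list(zip(fwd, bwd))

# Speed profile with sudden braking at the end: the backward state at the
# first time step already carries information about the late speed drop.
speeds = [1.0, 1.0, 0.9, 0.1, 0.0]
states = bidirectional_states(speeds)
```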
Clearly, object detection and tracking in a mixed traffic flow environment are more
complex than in a non-mixed traffic flow environment due to the wide variations in the
sizes and types of road users. On top of that, a good object detector contributes significantly
to the accuracy of object tracking and, in turn, to good traffic
accident recognition. The study [7] incorporated the Retinex image enhancement technique
to enhance the input images before YOLO V3 with Darknet-53 is trained to
detect multiple objects in images, such as fallen pedestrians/cyclists, moving/stopped
pedestrians/cyclists/vehicles, single-vehicle collisions, and multiple-object collisions.
Features extracted from YOLO V3 are then fed into a decision tree model to further
classify the crashes. The result shows that the proposed model has a detection accuracy
of 92.5% on crashes in the testing dataset used, which comprised a total of 30,214 crash
frames and 42,148 normal frames over a total of 12,736 frames.
To improve both object detection speed and accuracy in a mixed traffic flow environment,
[33] proposed the Detection Transformer (DETR) approach, which comprises
a CNN ResNet-50 backbone, a transformer encoder-decoder block that helps
focus on the most influential features of the car, motorcycle, and truck for a particular
detection, and fully connected layers for the class and bounding box predictions.
It is reported that the DETR approach achieved a detection rate of 78.2% with low
latency as compared with previous work. The DETR output is then fed into a random
forest classifier to classify each frame as either an accident or a non-accident frame. Lastly,
the probability of an accident occurring is derived from the predictions over the past
60 frames using a sliding window technique. The detection rate of DETR is remarkable;
however, there is still a gap, especially in localising diminished objects. On the
other hand, [34] presents a condensed version of YOLO known as Mini-YOLO, which is
composed of a pre-trained YOLO V3 with ResNet-152 and MobileNet-v2. Mini-YOLO is
primarily trained to detect motorcycles, cars, trucks, and buses. Simple Online and Realtime
Tracking (SORT) is then used to track each of the objects detected for its damage
status, and a Support Vector Machine (SVM) then classifies accident occurrence
based on the damage status of the objects across the frames.
Detecting smaller road users such as motorcycles, bicycles, and pedestrians has
always been challenging for traffic accident recognition in a mixed traffic flow environment.
The paper [35] deploys Context Mining and Augmented Context Mining on
top of Faster R-CNN to improve the detection of pedestrians, which occupy smaller image
segments than the vehicle categories. The features extracted from Faster R-CNN are
input into a Dynamic-Spatial-Attention LSTM (DSA-LSTM) for accident forecasting.
However, Faster R-CNN is a two-stage model involving an additional stage of
selecting proposal regions before object detection; its processing is therefore much slower
and unsuitable for real-time applications.
From the literature, it can be seen that most of the Deep Learning-based recognition
algorithms are based upon a single feature-based approach in traffic accident detec-
tion: either using appearance feature-based approaches [7, 35] or motion feature-based
approaches [1, 19, 27, 28, 31]. These look into appearance or motion crash features
within video frames to determine whether an accident has occurred. Recently, a fusion
feature-based approach that combines both appearance and motion crash features has been
widely adopted in traffic accident detection, especially in mixed traffic flow environments.
This is mainly because the characteristics of mixed traffic flow are so complex that
considering only a single perception in traffic accident recognition may cause
accuracy to suffer.
The study [15] proposed a two-stream traffic accident detection framework, with
one stream focusing on collision detection and the other on abnormality
detection. Abnormality detection is performed by extracting deep representations
of three modalities (appearance, motion, and a joint representation) using
a stacked autoencoder to obtain anomaly scores and reconstruction errors, although this
approach is computationally intensive [36]. The collision score is then determined from
the trajectories of the moving vehicles and is used to increase the reliability of the
overall result.
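Reconstruction-error anomaly scoring of this kind can be illustrated with a linear (PCA-style) encoder standing in for the stacked autoencoder: fit a low-dimensional subspace to normal-traffic features and score new samples by how poorly the subspace reconstructs them. The data, dimensions, and names below are synthetic illustrations only:

```python
import numpy as np

def fit_normal_subspace(X, k=1):
    """Fit a k-dimensional linear subspace to normal-traffic feature
    vectors; a linear stand-in for the autoencoder's learned manifold."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def anomaly_score(x, mean, components):
    """Reconstruction error of x under the normal subspace; large errors
    flag observations unlike anything seen in normal traffic."""
    centred = x - mean
    recon = components.T @ (components @ centred)
    return float(np.linalg.norm(centred - recon))

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Normal features lie near the line y = 2x; the "crash" point sits far off it.
X = np.column_stack([t, 2 * t + rng.normal(scale=0.05, size=200)])
mean, comps = fit_normal_subspace(X, k=1)
normal_err = anomaly_score(np.array([1.0, 2.0]), mean, comps)
crash_err = anomaly_score(np.array([1.0, -2.0]), mean, comps)
```

A stacked autoencoder replaces the linear projection with a learned nonlinear encoder-decoder, but the scoring principle, large reconstruction error means anomalous, is the same.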
Besides, [20] proposed an integrated two-stream CNN architecture that performs
near-accident detection for mixed road users using traffic data sourced from fisheye
surveillance footage, multi-scale drone footage, and simulated videos. The proposed
two-stream CNNs comprise a spatial stream network for object detection, which captures
appearance features using the YOLO detector, and a temporal stream network
(a ConvNet model) that leverages the motion information of multiple objects to generate
individual trajectories for each tracked target, as shown in Fig. 3. In addition, a deep
cosine metric learning method, DeepSORT, is used to train the temporal stream for
vehicle re-identification in the presence of occlusion. The probabilities of near-accident
candidate regions are then computed from both the appearance and motion features
derived from the two-stream deep-learning-based model. With the introduction of the
multimodal concept and the two-stream models, the proposed model has demonstrated
overall competitive qualitative and quantitative performance at high
frame rates. Notably, benchmarking against other Deep Learning models within the
context of a mixed traffic flow environment was not found in the work presented.

Fig. 3. Two-stream CNN architecture for near-accident detection [20]

Recently, [21] proposed a feature-fusion Deep Learning framework that combines
a residual neural network (ResNet) with an attention module and a Convolutional Long
Short-Term Memory (Conv-LSTM) network to capture both appearance and motion crash
features, as shown in Fig. 4. Conv-LSTM is used because it does not require data flattening,
which allows it to outpace the traditional LSTM in capturing the motion features of crashes.
It is reported that the proposed model achieved its aim of higher crash detection accuracy
within an acceptable detection speed and limited computing resources in a mixed traffic
flow environment. However, the proposed model is still not robust, especially in congested
and ambiguous traffic scenes where crashes are often falsely detected. A comparison of
the discussed Deep Learning techniques for road traffic accident recognition is provided
in Table 1. In addition, the publicly available roadside sensor-based traffic flow data
mainly used for Deep Learning-based road traffic accident detection is further documented
in Table 2. The next section outlines the research needs and future research trends
for designing and developing Deep Learning models for real-time road traffic accident
recognition.
Table 1. Summarization of deep learning models applied for traffic accident recognition on roadside sensor-based traffic data sources.

Ref | Year | Deep Learning Algorithm Used | Data Source Type | Type of Accident Area | Environmental Condition | Motorized/Mixed Traffic Flow | Dataset Used | Comparative Models | Performance
[22] | 2019 | Deep CNN | Image dataset from CCTV surveillance footage | Urban | - | Motorized | Traffic-Net | ResNet50 | 94.4% accuracy
[23] | 2020 | CNN | Image dataset from CCTV surveillance footage | Urban | - | Motorized | Google Images of accident scenes | - | 85% accuracy
[4] | 2019 | Inception V3 | Simulation videos | - | - | Motorized | NA | - | 96% accuracy
[24] | 2018 | CNN + RNN (GRU) | Simulated videos | - | - | Motorized | NA | STM | 96.64% accuracy
[25] | 2020 | CNN + ANN | Radar sensor traffic data | Highway | - | Motorized | Des Moines, IA | LR, DT, RF, SVC, KNN | 76% accuracy
[4] | 2019 | Inception V3 + LSTM (CNN-LSTM) | CCTV surveillance footage | Highway | - | Motorized | NA | - | 92.38% accuracy
[26] | 2021 | LSTM | CCTV surveillance footage | - | - | Motorized (car only) | UA-DETRAC and YouTube videos | - | 94.33% accuracy
[18] | 2019 | LSTM | Loop detector | Expressway | Under various weather conditions and traffic congestion status | Motorized | Chicago Expressway | - | 96% accuracy
[18] | 2019 | GRU | Loop detector | Expressway | Under various weather conditions and traffic congestion status | Motorized | Chicago Expressway | - | 95.9% accuracy
[16] | 2020 | LSTMDTR (LSTM for Different Temporal Resolutions) | Loop detector | Freeway | - | Motorized | I880-N and I805-N | KNN, LR, NB, ANN, SVM, RF | 70.43% accuracy
[27] | 2019 | CNN | CCTV surveillance footage | Highway | Under various weather and illumination conditions | Motorized | UA-DETRAC | GMM | 95.2% detection rate
[28] | 2018 | YOLO net | CCTV surveillance footage | City | Under various weather and illumination conditions | Motorized (car only) | CCV | - | 75% accuracy
[30] | 2020 | YOLO V3 | CCTV surveillance footage | City | - | Motorized | NA | - | NA
[3] | 2019 | YOLO-CA | CCTV surveillance footage | Urban road | Under various weather and illumination conditions | Motorized (car only) | CAD-CVIS | ARRS, DSA-RNN | 90.02% average precision
[31, 32] | 2018 | CNN + Faster R-CNN | CCTV surveillance footage | Highway, urban and rural | Under various weather and illumination conditions | Motorized | 6 videos of various traffic incidents | SVM + PHOG + GMM, GoogleNet CNN | F1 score of 80%
[1] | 2019 | Mask R-CNN | CCTV surveillance footage | Intersection | Under various weather and illumination conditions | Motorized | CCTV videos from YouTube | Vision-based model (ARRS) | 71% detection rate
[19] | 2019 | SILSTM | CCTV surveillance footage + aerial footage | Intersections and highways | - | Mixed traffic flow | SkyEye dataset | Different variants of SILSTM | -
[7] | 2020 | YOLO V3 | CCTV surveillance footage | City | Under various weather and illumination conditions | Mixed traffic flow | CCTV videos from online sources | - | 92.5% detection rate
[33] | 2020 | ResNet + DETR | CCTV surveillance footage | Intersections and highways | Under various weather and illumination conditions | Mixed traffic flow (multi-vehicle and pedestrian-vehicle crashes) | CADP | Mask R-CNN, Stacked Autoencoder | 78.2% detection rate
[34] | 2021 | Mini-YOLO | CCTV surveillance footage | Urban | - | Mixed traffic flow (car, motorcycle, truck and bus) | Boxy vehicles dataset | - | -
[35] | 2019 | Faster R-CNN + DSA-LSTM | CCTV surveillance footage | Intersections and highways | Under various weather and illumination conditions | Mixed traffic flow (multi-vehicle and pedestrian-vehicle crashes) | CADP | - | 80% recall
[15] | 2019 | Stacked Autoencoder | CCTV surveillance footage | City | Under various illumination conditions | Motorized (bike and car) | Hyderabad City Dataset | ARRS, RTADS | -
[20] | 2020 | Two-stream CNN (spatial stream + temporal stream) | Fisheye surveillance footage + multi-scale drone footage + simulation videos | Intersections | Under various illumination conditions | Mixed traffic flow (cars, bikes and pedestrians) | TNAD datasets | - | F1 score of 93.09%
[21] | 2020 | ResNet + Conv-LSTM | CCTV surveillance footage | City | Under various weather and illumination conditions | Mixed traffic flow (cars, bikes and pedestrians) | Traffic image datasets in China | Faster R-CNN + SORT, ResNet-50 + CBAM | 87.78% accuracy
Table 2. Summarization of roadside sensor-based traffic flow data source.

Ref | Dataset Name | Sensor Type | Area Installed | Dataset Descriptions | Motorized/Mixed Traffic Flow | Video Type | Environmental Condition
[18] | Chicago Expressway Dataset | Loop detector | Highway | 82,182 non-accident cases and 32 accident cases | Motorized | NA | NA
[16] | I880-N and I805-N | Loop detector | Highway | I880-N: 1,386 crash cases and 2,781 non-crash cases; I805-N: 48 crash cases and 89 non-crash cases | Motorized | NA | NA
[25] | Interstate 235 Des Moines Dataset | Radar sensor | Highway | 447,043 crash and non-crash cases | Motorized | NA | NA
[37] | Autopista Central Expressway Dataset | MVDS detector | Expressway | 39 accident cases and 12,990 non-accident cases | Motorized | NA | NA
[3] | CAD-CVIS | Camera | Intersection, urban road, expressway and highway | 633 car accident videos | Mixed traffic flow | CCTV | Under various weather and illumination conditions
[28] | CCV | Camera | Intersection | 57 car crash videos and 57 normal traffic videos | Motorized | CCTV | -
[15] | Hyderabad City Dataset | Camera | Intersection and urban road | 863 accident frames and 127,138 normal traffic frames | Mixed traffic flow | CCTV | Under various illumination conditions
[35] | CADP | Camera | Intersection and highway | 1,416 accident videos | Mixed traffic flow | CCTV & dashcams | Under various weather and illumination conditions
[38] | UA-DETRAC | Camera | Intersection and highway | 100 normal traffic videos | Mixed traffic flow | CCTV | Under various weather and illumination conditions
[22] | Traffic-Net | Camera | Intersection, urban road and highway | 4,400 accident images | Motorized | CCTV | Under various illumination conditions

Fig. 4. Feature-fusion deep learning framework [21]

3 Research Needs for Deep Learning-Based Road Traffic Accident Recognition

After studying the state-of-the-art Deep Learning techniques for traffic accident recognition
from roadside sensing technologies, many issues remain worth exploring.
Traditionally, traffic data has mainly been collected through active sensors such as radar and
lightwave photosensors installed at the roadside, which may not provide consistent and
reliable counts to support traffic accident detection, especially in mixed traffic flow environments
[25]. Recently, video-based data captured by the vision sensors within
CCTV surveillance cameras has become viable with the introduction of Deep Learning
techniques, as it captures real-time traffic information and conditions [39], and
Deep Learning models have proven to be a promising tool for real-time traffic accident
detection [40]. Also, Deep Learning is the state-of-the-art method for time-series
problems; it can simulate dynamic changes in traffic conditions as well as detect
anomalous activity, which can improve the performance of traffic accident recognition
[16, 41].
However, various external factors, such as weather, illumination, and congestion
conditions, can severely affect the reliability of traffic accident recognition. Hence, it is crucial
to consider these external factors when designing the Deep Learning architecture so as to
improve recognition accuracy [39]. Multiple research works have addressed such external
factors, but congestion conditions are rarely highlighted. Also, due to
the dynamic nature of traffic data, the detection technique used must be able to model
traffic flow at different time intervals. However, most of the works considered traffic

data from only one temporal resolution, which is not sufficient to represent the traffic
trends at different time intervals.
As reported by [7], most of the previous research works focused on detecting traffic
accidents in non-mixed traffic flow environments, for example, single-vehicle collisions
and vehicle-vehicle collisions [1, 3, 4, 31] instead of in mixed traffic flow environments
[7, 19–21, 35]. A mixed traffic flow environment is usually found in urban traffic scenes
and intersections which often have the highest number of traffic accident fatalities.
Nonetheless, there is limited literature on traffic accident recognition in this context. On
top of that, mixed traffic flow environments present an exponential increase in the factors
that influence traffic accident modelling compared to their non-mixed counterparts,
and these cannot be easily addressed through conventional methods. The authors in [19] and
[7] reported that crash detections remain a challenge in mixed traffic flow environments
as non-motorized traffic such as pedestrians or cyclists tend to be blocked by other
objects. These findings show that there is a gap between the implementation of existing
conventional traffic accident detection systems for non-mixed versus mixed traffic flow
environments.
In a mixed traffic flow environment, one of the greatest concerns is the complexity
of detecting smaller objects such as pedestrians, motorcyclists, and cyclists: these road
users occupy smaller image footprints than other vehicles,
and it is often challenging to draw a tight bounding box around them, especially across
video frames. This causes significant degradation of object detection accuracy,
especially in accident scenes that may involve fallen motorcyclists and fallen pedestrians.
Besides, the absence of lane discipline in a mixed traffic flow environment makes it challenging
to annotate unsafe interaction trajectories, especially when the vehicle creeping phenomenon
occurs, in which smaller vehicles move through gaps to reach the front of
the queue during overtaking, causing vehicles to travel very close to each other. Also,
traditional approaches to object tracking such as background subtraction [42–44] and optical
flow [45–48] are found to be suboptimal under complex traffic scenarios in a mixed traffic
flow environment, such as dense traffic, sudden vehicle acceleration,
and occlusion. In light of the many challenges faced in object-level detection and
tracking, one possible concept that may help improve performance in mixed traffic
scenarios is to first interpret the whole traffic scene across a series of video frames.
In [21] it is stated that scene understanding could provide contextual information and
scene structure that might be helpful for traffic accident recognition. Hence, it is suspected
that instead of pinpointing a single object within a single video frame or across frames,
looking at the bigger picture, in this case the whole traffic scene, might provide better
accuracy in traffic accident detection.
In reviewing the covered Deep Learning-based traffic accident recognition
approaches for a mixed traffic flow environment, it can be seen that the majority of the
proposed models focus either on modelling the motion flow of road users to detect abnormality
[1, 19, 27, 28, 31] or on modelling the crash appearances of road users [7, 35].
Recently, a fusion feature-based approach that combines both appearance and motion
crash features has been widely adopted in traffic accident detection, especially in mixed traffic
flow [15, 20, 21, 29]. It is worthwhile to note that the fusion feature-based approach has
significantly improved the overall performance of traffic accident recognition models
by considering multiple perspectives at the object level across video frames. Other
prominent features could also be incorporated into this fusion approach,
serving as reinforcement for decisions made on the basis of the existing features. Perhaps, in
the context of a mixed traffic flow environment, a fusion feature-based Deep Learning
model that considers features at both the scene and object levels could improve overall
accuracy in traffic accident recognition. However, the performance of such a fusion
feature-based Deep Learning approach in terms of recognition accuracy, speed, and
computational cost warrants further investigation.
Lastly, comprehensive validation is necessary for any real-time detection/validation
model. From the literature, the most common performance evaluation metrics used for
Deep Learning models are detection rate, false alarm rate, and overall accuracy. How-
ever, only a few studies have employed all three measures to comprehensively validate
their model’s performance [1, 7, 15, 18, 27], and most did not report the specific mea-
surement of their model performance. It is observed that many proposed Deep Learning
models are validated against Machine Learning models, however, it is recommended to
have a comprehensive validation against other Deep Learning models as well, for better
benchmarking.
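To make the three measures above concrete, here is a minimal sketch of how they can be computed from confusion-matrix counts (the formulas follow their common definitions, since the reviewed papers do not share one fixed convention, and the counts in the example are hypothetical):

```python
def accident_detection_metrics(tp, fp, tn, fn):
    """Detection rate, false alarm rate, and overall accuracy.

    tp: accident events correctly flagged   fn: accident events missed
    fp: normal intervals wrongly flagged    tn: normal intervals correctly passed
    """
    detection_rate = tp / (tp + fn)            # recall over true accidents
    false_alarm_rate = fp / (fp + tn)          # fraction of normal intervals flagged
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return detection_rate, false_alarm_rate, accuracy

# Hypothetical counts: 45 accidents detected, 5 missed, 10 false alarms
dr, far, acc = accident_detection_metrics(tp=45, fp=10, tn=930, fn=5)
print(dr, round(far, 4), round(acc, 4))  # 0.9 0.0106 0.9848
```

Reporting all three numbers together is what makes models comparable: a high detection rate alone says nothing if the false alarm rate is also high.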
This remains a challenge, however, as most research works depend on their own private datasets, which cannot be accessed for benchmarking against the proposed models [7]. To date, benchmarking and balancing recognition accuracy against false positives/negatives among various Deep Learning models is an important issue that has yet to be addressed in the real-time traffic accident recognition literature.

4 Proposed Conceptual Framework for Traffic Accident Recognition in a Mixed Traffic Flow Environment

The previous section highlighted the research need for Deep Learning models in the effec-
tive handling of traffic accident recognition. Drawing inspiration from fusion feature-
based deep learning approach that combines both the appearance and motion crash
features, this section presents a proposed model that visually expresses a fusion feature-
based deep learning model for vision-based traffic accident recognition that takes into
consideration features at both scene and object level for traffic accident recognition in
the context of a mixed traffic flow environment. This proposed model is inspired by
research works carried out by [15, 20, 21] and is believed to improve the robustness of
the model in accommodating traffic accident recognition in a mixed traffic flow environment. A traffic accident, as defined for this proposed model, includes major accidents that cause serious injuries and damage due to high-impact crashes on the roadway: for instance, multi-vehicle crashes that cause severe vehicle appearance damage, vehicle rollovers, fallen motorists, or pedestrians detected continuously over a substantial amount of time.
The architecture of the proposed model comprises three main streams: Model 1 focuses on classifying accident scenes at the frame level; Model 2 focuses on classifying the damage status of the road users detected; and Model 3 focuses on tracking objects across frames, modelling the damage status of each tracked object, and modelling the motion pattern across the frames for
Exploring Deep Learning in Road Traffic Accident 47

abnormality. The post-fusion scores from these three streams are combined into a final score that determines the occurrence of a traffic accident. The complete process flow of the proposed fusion feature-based deep learning model for traffic accident recognition is illustrated in Fig. 5.
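As an illustration of the post-fusion step, a weighted late-fusion of the three stream scores can be sketched as follows (the weights and decision threshold here are our own illustrative assumptions, not values from the paper):

```python
def fuse_streams(scene_score, appearance_score, motion_score,
                 weights=(0.3, 0.3, 0.4), threshold=0.5):
    """Combine per-stream accident probabilities into a final decision.

    scene_score:      Model 1, frame-level accident probability
    appearance_score: Model 2 (via Model 3), damage/appearance-change probability
    motion_score:     Model 3, motion-abnormality probability
    """
    fused = sum(w * s for w, s in zip(
        weights, (scene_score, appearance_score, motion_score)))
    return fused, fused >= threshold

score, is_accident = fuse_streams(0.8, 0.7, 0.9)
print(round(score, 2), is_accident)  # 0.81 True
```

In practice the weights would themselves be tuned (or learned) on validation data rather than fixed by hand.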
As discussed in the previous section, scene understanding is regarded as an important component for improving the accuracy of road traffic accident recognition in a mixed traffic scenario. Hence, in the proposed model, Model 1 accepts a video
input that is split into video frames in sequence and these video frames will run through
a recurrent neural network for feature extraction and video classification which produce
two independent outputs (accident and non-accident). From the reviews, recurrent neural
networks such as LSTM, CNN-LSTM, and ConvLSTM are some of the most popular
potential Deep Learning frameworks used in this context as they are particularly effec-
tive for real-time data such as traffic time series, weather data, and congestion status
data.
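The recurrent classification idea behind Model 1 can be sketched as an Elman-style RNN forward pass over per-frame feature vectors (with random, untrained weights; a real system would use trained LSTM/ConvLSTM layers on CNN features, and the dimensions here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def rnn_classify(frames, hidden=8):
    """Run per-frame feature vectors through a simple RNN, then a 2-way softmax."""
    feat_dim = frames.shape[1]
    w_xh = rng.normal(scale=0.1, size=(feat_dim, hidden))  # input -> hidden
    w_hh = rng.normal(scale=0.1, size=(hidden, hidden))    # hidden -> hidden
    w_hy = rng.normal(scale=0.1, size=(hidden, 2))         # hidden -> 2 classes
    h = np.zeros(hidden)
    for x in frames:                       # one recurrent step per video frame
        h = np.tanh(x @ w_xh + h @ w_hh)
    logits = h @ w_hy
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs                           # [p(non-accident), p(accident)]

video = rng.normal(size=(30, 16))  # 30 frames, 16 features per frame
probs = rnn_classify(video)
print(probs.shape, abs(probs.sum() - 1.0) < 1e-9)  # (2,) True
```

The key property shown is that the hidden state carries information across frames, so the final two-class output depends on the whole sequence rather than any single frame.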
On the other hand, determining appearance change in the road users detected, such as crash damage, is also one of the prominent features in road traffic accident recognition. Model 2 is trained using a deep neural network to identify patterns in the objects detected
across space for the classification of damage features aimed at different classes of road
users such as cars, trucks, motorcycles, and pedestrians. This model is embedded as part
of the component for the development of Model 3.
Also, observing road user interaction is a primary component in traffic accident
recognition. This can be achieved through Multiple Object Tracking (MOT), in which
the trajectories of several moving objects across video frames are extracted and analysed
[49]. The main responsibilities of MOT are to locate multiple objects, maintain the identities of the objects, and yield the individual object trajectories across the video frames [50]. Deep Learning-based models have provided a powerful framework to formulate and model the target association problem, which can boost tracking performance significantly. Leveraging the remarkable studies by past
researchers on MOT in the transportation domain [20, 51], it is evident that utilizing a
Deep Learning-based approach can better track different types of road users in a mixed
traffic flow environment for traffic accident recognition. Hence, Model 3 utilizes a deep
learning-based approach in detecting objects within each of the video frames from the
video sequences and takes the deep learning features of the object into account as one of
its tracking metrics. The tracking information such as bounding boxes, stack trajectories,
centroid positions, speed, and interaction-related parameters of moving objects across
several consecutive frames are saved into sequences, which are fed into a recurrent neural
network to learn the motion sequences across frames. On top of that, each of the objects
tracked in each frame is passed to Model 2 as an input for classification on the object’s
damage status. The damage status of each object tracked across consecutive frames is saved into a sequence and further passed into a recurrent neural network to learn the appearance-change feature pattern. Both the motion and appearance abnormality outputs are used to determine the occurrence of a traffic accident.
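For illustration, the tracking step can be reduced to a minimal centroid tracker; this is a simplified stand-in (greedy nearest-centroid matching, no deep appearance features) for the Deep Learning trackers discussed above, and the class name and distance threshold are our own illustrative choices:

```python
import math

class CentroidTracker:
    """Greedy nearest-centroid multi-object tracker (illustrative sketch)."""

    def __init__(self, max_distance=50.0):
        self.max_distance = max_distance  # max movement between frames to match
        self.next_id = 0
        self.tracks = {}  # track id -> list of (x, y) centroids, one per frame

    def update(self, detections):
        """detections: list of (x, y) object centroids in the current frame."""
        unmatched = list(detections)
        for history in self.tracks.values():
            if not unmatched:
                break
            last = history[-1]
            # greedily match this track to its closest current detection
            best = min(unmatched, key=lambda d: math.dist(last, d))
            if math.dist(last, best) <= self.max_distance:
                history.append(best)
                unmatched.remove(best)
        for det in unmatched:  # unmatched detections start new tracks
            self.tracks[self.next_id] = [det]
            self.next_id += 1
        return self.tracks

tracker = CentroidTracker()
tracker.update([(10, 10), (100, 100)])          # frame 1: two objects appear
tracks = tracker.update([(14, 12), (103, 98)])  # frame 2: both move slightly
print(len(tracks), tracks[0])  # 2 [(10, 10), (14, 12)]
```

Real MOT systems replace the greedy nearest-centroid rule with appearance embeddings and globally optimal assignment (e.g. Hungarian matching), but the per-track history produced here is exactly the kind of sequence that would be fed to the recurrent motion model.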
In short, the proposed model comprises three main streams, each of which houses a technique catering for a prominent feature in traffic accident recognition. However, this proposed model can be dynamically scaled to accommodate more
simultaneous streams of models in an effort to increase traffic accident recognition accuracy. Likewise, there is a possibility that adding more streams may not result in a significant accuracy increase, as performance depends on the selected features and Deep Learning techniques used. On the other hand, running too many streams risks a negligible accuracy gain while exacerbating computation time, making real-time operation impossible to achieve. Therefore, much care needs to be invested to strike a balance between performance improvement and the number of streams used to avoid these extremes.

Fig. 5. The proposed fusion feature-based deep learning model for traffic accident recognition in
a mixed traffic flow environment

5 Conclusion and Future Direction


In this paper, a review of the recent Deep Learning techniques in traffic accident recogni-
tion for roadside sensing technologies is presented. Deep Learning models have proved
to be a prime choice in the effective handling of non-linear data such as real-time traffic
recordings and have demonstrated promising results for traffic accident recognition. Also,
recent trends of traffic accident recognition have transitioned from accident recognition
in a non-mixed traffic flow environment to a mixed traffic flow environment as it is
reported that traffic involving various types of road users is more prone to accidents, and research gaps exist, especially in the detection of smaller-sized road users and the annotation of unsafe interaction trajectories in a mixed traffic flow environment.
Nonetheless, there is limited literature on traffic accident recognition in mixed traf-
fic flow environments. On top of that, model performance can be improved by consid-
ering multiple traffic data sources as the model’s input source as well as adopting a
fusion feature-based model. Hence, developing a Deep Learning model to improve current efforts in vision-based accident recognition for mixed traffic flow environments under various conditions is feasible and worth the research investment. In response to the review, a fusion feature-based Deep Learning model is proposed, comprising three main streams that primarily examine both scene-level and object-level features, to improve performance in determining the occurrence of a traffic accident in a mixed traffic flow environment. The proposed model is flexible enough to
accommodate multiple streams of models; nonetheless, it is crucial to maintain a balance between its performance and computational costs.

References
1. Ijjina, E.P., Chand, D., Gupta, S., Goutham, K.: Computer vision-based accident detection
in traffic surveillance. In: 2019 10th International Conference on Computing, Communica-
tion and Networking Technologies ICCCNT 2019 (2019). https://doi.org/10.1109/ICCCNT
45670.2019.8944469
2. Almaadeed, N., Asim, M., Al-Maadeed, S., Bouridane, A., Beghdadi, A.: Automatic detection
and classification of audio events for road surveillance applications. Sensors 18(6), 1858
(2018). https://doi.org/10.3390/s18061858
3. Tian, D., Zhang, C., Duan, X., Wang, X.: An automatic car accident detection method based on
cooperative vehicle infrastructure systems. IEEE Access 7, 127453–127463 (2019). https://
doi.org/10.1109/ACCESS.2019.2939532
4. Chang, W.J., Chen, L.B., Su, K.Y.: DeepCrash: a deep learning-based internet of vehicles
system for head-on and single-vehicle accident detection with emergency notification. IEEE
Access 7, 148163–148175 (2019). https://doi.org/10.1109/ACCESS.2019.2946468
5. Sai Kiran, M., Verma, A.: Review of studies on mixed traffic flow: perspective of develop-
ing economies. Transp. Dev. Econ. 2(1), 1–16 (2016). https://doi.org/10.1007/s40890-016-
0010-0
6. Phogat, A., Gupta, R., Kumar, E. N.: Study on effect of mixed traffic in highways, pp. 1288–
1291 (2020)
7. Wang, C., Dai, Y., Zhou, W., Geng, Y.: A vision-based video crash detection framework for
mixed traffic flow environment considering low-visibility condition. J. Adv. Transp. 2020,
1–11 (2020). https://doi.org/10.1155/2020/9194028
8. Parsa, A.B., Taghipour, H., Derrible, S., (Kouros) Mohammadian, A.: Real-time accident
detection: coping with imbalanced data. Accid. Anal. Prev. 129, 202–210 (2019). https://doi.
org/10.1016/j.aap.2019.05.014
9. Zhang, K.: Towards transferable incident detection algorithms. 6, 2263–2274 (2005)
10. Ki, Y.K., Lee, D.Y.: A traffic accident recording and reporting model at intersections. IEEE
Trans. Intell. Transp. Syst. 8(2), 188–194 (2007). https://doi.org/10.1109/TITS.2006.890070
11. Hui, Z., Xie, Y., Lu, M., Fu, J.: Vision-based real-time traffic accident detection. In: Proceedings
of the 11th World Congress on Intelligent Control and Automation, 2014, pp. 1035–1038
(2015). https://doi.org/10.1109/WCICA.2014.7052859
12. Kwak, H.C., Kho, S.: Predicting crash risk and identifying crash precursors on Korean express-
ways using loop detector data. Accid. Anal. Prev. 88, 9–19 (2016). https://doi.org/10.1016/j.
aap.2015.12.004
13. Ravindran, V., Viswanathan, L., Rangaswamy, S.: A novel approach to automatic road-
accident detection using machine vision techniques. Int. J. Adv. Comput. Sci. Appl. 7(11),
235–242 (2016). https://doi.org/10.14569/ijacsa.2016.071130
14. Basso, F., Basso, L.J., Bravo, F., Pezoa, R.: Real-time crash prediction in an urban expressway
using disaggregated data. Transp. Res. Part C Emerg. Technol. 86, 202–219 (2018). https://
doi.org/10.1016/j.trc.2017.11.014
15. Singh, D., Mohan, C.K.: Deep spatio-temporal representation for detection of road accidents
using stacked autoencoder. IEEE Trans. Intell. Transp. Syst. 20(3), 879–887 (2019). https://
doi.org/10.1109/TITS.2018.2835308
16. Jiang, F., Yuen, K.K.R., Lee, E.W.M.: A long short-term memory-based framework for crash
detection on freeways with traffic data of different temporal resolutions. Accid. Anal. Prev.
141, 105520 (2020)
17. Ghosh, S., Sunny, S.J., Roney, R.: Accident detection using convolutional neural networks. In: 2019 International Conference on Data Science and Communication (IconDSC), pp. 1–6 (2019). https://doi.org/10.1109/IconDSC.2019.8816881
18. Parsa, A.B., Chauhan, R.S., Taghipour, H., Derrible, S., Mohammadian, A.: Applying deep
learning to detect traffic accidents in real time using spatiotemporal sequential data 1, 312
(2019). http://arxiv.org/abs/1912.06991
19. Roy, D., Ishizaka, T., Krishna Mohan, C., Fukuda, A.: Detection of collision-prone vehicle
behavior at intersections using siamese interaction LSTM. IEEE Trans. Intell. Transp. Syst.,
1–10 (2019). http://arxiv.org/abs/1912.04801
20. Huang, X., He, P., Rangarajan, A., Ranka, S.: Intelligent intersection: two-stream convo-
lutional networks for real-time near-accident detection in traffic video. ACM Trans. Spat.
Algorithms Syst. 6(2), 1–28 (2020). https://doi.org/10.1145/3373647
21. Lu, Z., Zhou, W., Zhang, S., Wang, C.: A new video-based crash detection method: balancing
speed and accuracy using a feature fusion deep learning framework. J. Adv. Transp. 2020,
1–12 (2020). https://doi.org/10.1155/2020/8848874
22. Kumeda, B., Fengli, Z., Oluwasanmi, A., Owusu, F., Assefa, M., Amenu, T.: Vehicle accident
and traffic classification using deep convolutional neural networks. In: 2019 16th International
Computer Conference on Wavelet Active Media Technology and Information Processing
ICCWAMTIP 2019, pp. 323–328 (2019). https://doi.org/10.1109/ICCWAMTIP47768.2019.
9067530
23. Rajesh, G., Benny, A. R., Harikrishnan, A., Jacobabraham, J., John, N. P.: A deep learning
based accident detection system. In: Proceeding of the 2020 IEEE International Conference
on Communication and Signal Processing ICCSP 2020, pp. 1322–1325 (2020). https://doi.
org/10.1109/ICCSP48568.2020.9182224
24. Zheng, K., Yan, W.Q., Nand, P.: Video dynamics detection using deep neural networks. IEEE
Trans. Emerg. Top. Comput. Intell. 2(3), 224–234 (2018). https://doi.org/10.1109/TETCI.
2017.2778716
25. Huang, T., Wang, S., Sharma, A.: Highway crash detection and risk estimation using deep
learning. Accid. Anal. Prev. 135, p. 105392 (2020). https://doi.org/10.1016/j.aap.2019.105392
26. Gupta, G., Singh, R., Patel, A.S., Ojha, M.: Time-distributed model in videos (2021)
27. Wang, P., Ni, C., Li, K.: Vision-based highway traffic accident detection. In: ACM Interna-
tional Conference on Proceeding Series, pp. 5–9 (2019). https://doi.org/10.1145/3371425.
3371449
28. Machaca Arceda, V. E., Laura Riveros, E.: Fast car crash detection in video. In: Proceeding
of the 2018 44th Latin American Computer Conference (CLEI) 2018, pp. 632–637 (2018).
https://doi.org/10.1109/CLEI.2018.00081
29. Paul, A. R.: Semantic video mining for accident detection, 5(6), 670–678 (2020)
30. Chung, Y.L., Lin, C.K.: Application of a model that combines the YOLOv3 object detection
algorithm and canny edge detection algorithm to detect highway accidents. Symmetry (Basel)
12(11), 1–26 (2020). https://doi.org/10.3390/sym12111875
31. Vu, H.N., Dang, N.H.: An improvement of traffic incident recognition by deep convolutional
neural network. Int. J. Innov. Technol. Explor. Eng. 8(1), 10–14 (2018)
32. Vu, N., Pham, C.: Traffic incident recognition using empirical deep convolutional neural
networks model. In: Cong Vinh, P., Ha Huy Cuong, N., Vassev, E. (eds.) ICCASA/ICTCC
-2017. LNICSSITE, vol. 217, pp. 90–99. Springer, Cham (2018). https://doi.org/10.1007/
978-3-319-77818-1_9
33. Srinivasan, A., Srikanth, A., Indrajit, H., Narasimhan, V.: A novel approach for road acci-
dent detection using DETR algorithm. In: 2020 International Conference on Intelligent Data
Science Technologies and Applications IDSTA 2020, pp. 75–80, (2020). https://doi.org/10.
1109/IDSTA50958.2020.9263703
34. Pillai, M.S., Chaudhary, G., Khari, M., Crespo, R.G.: Real-time image enhancement for an
automatic automobile accident detection through CCTV using deep learning. Soft. Comput.
25(18), 11929–11940 (2021). https://doi.org/10.1007/s00500-021-05576-w
35. Shah, A. P., Lamare, J. B., Nguyen-Anh, T., Hauptmann, A.: CADP: a novel dataset for
CCTV traffic camera based accident analysis. In: Proceeding of the AVSS 2018 15th IEEE
International Conference on Advanced Video and Signal Based Surveillance, no. i (2019).
https://doi.org/10.1109/AVSS.2018.8639160
36. Tsiktsiris, D., Dimitriou, N., Lalas, A., Dasygenis, M., Votis, K., Tzovaras, D.: Real-time
abnormal event detection for enhanced security in autonomous shuttles mobility infrastruc-
tures. Sensors (Switzerland) 20(17), 1–24 (2020). https://doi.org/10.3390/s20174943
37. Li, P., Abdel-Aty, M., Yuan, J.: Real-time crash risk prediction on arterials based on LSTM-
CNN. Accid. Anal. Prev. 135, 105371 (2020). https://doi.org/10.1016/j.aap.2019.105371
38. Wen, L., et al.: UA-DETRAC: A new benchmark and protocol for multi-object detection and
tracking. Comput. Vis. Image Underst. 193, 102907 (2020). https://doi.org/10.1016/j.cviu.
2020.102907
39. Nguyen, H., Kieu, L.M., Wen, T., Cai, C.: Deep learning methods in transportation domain: a
review. IET Intell. Transp. Syst. 12(9), 998–1004 (2018). https://doi.org/10.1049/iet-its.2018.
0064
40. Theofilatos, A., Chen, C., Antoniou, C.: Comparing machine learning and deep learning
methods for real-time crash prediction. Transp. Res. Rec. 2673(8), 169–178 (2019). https://
doi.org/10.1177/0361198119841571
41. Pawar, K., Attar, V.: Deep learning approaches for video-based anomalous activity detection.
World Wide Web 22(2), 571–601 (2018). https://doi.org/10.1007/s11280-018-0582-1
42. Jun, G., Aggarwal, J. K., Gökmen, M.: Tracking and segmentation of highway vehicles in
cluttered and crowded scenes. 2008 IEEE Workshop on Applications of Computer Vision,
WACV (2008). https://doi.org/10.1109/WACV.2008.4544017
43. Kim, Z. W.: Real time object tracking based on dynamic feature grouping with background
subtraction. 26th IEEE Conference on Computer Vision Pattern Recognition, CVPR (2008).
https://doi.org/10.1109/CVPR.2008.4587551
44. Mendes, J. C., Bianchi, A. G. C., Pereira, Á. R.: Vehicle tracking and origin-destination count-
ing system for urban environment. VISAPP 2015 - International Conference on Computer
Vision Theory and Applications VISIGRAPP, vol. 3, pp. 600–607 (2015). https://doi.org/10.
5220/0005317106000607
45. Wu, S., Moore, B. E., Shah, M.: Chaotic invariants of Lagrangian particle trajectories for
anomaly detection in crowded scenes. In: Proceeding of the IEEE Computer Society Confer-
ence on Computer Vision and Pattern Recognition, pp. 2054–2060 (2010). https://doi.org/10.
1109/CVPR.2010.5539882
46. Patoliya, P., Bombaywala, P. S. R.: Object detection and tracking for surveillance system, vol.
3, issue 6, pp. 18–24 (2015)
47. Pradhan, B., Ibrahim Sameen, M.: Laser Scanning Systems in Highway and Safety
Assessment, vol. 7 (2020). https://doi.org/10.1007/978-3-030-10374-3
48. Veni, S., Anand, R., Santosh, B.: Road accident detection and severity determination from
CCTV surveillance. In: Tripathy, A.K., Sarkar, M., Sahoo, J.P., Li, K.-C., Chinara, S. (eds.)
Advances in Distributed Computing and Machine Learning. LNNS, vol. 127, pp. 247–256.
Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-4218-3_25
49. Ooi, H.-L., Bilodeau, G.-A., Saunier, N., Beaupré, D.-A.: Multiple object tracking in urban
traffic scenes with a multiclass object detector. In: Bebis, G., et al. (eds.) ISVC 2018. LNCS,
vol. 11241, pp. 727–736. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-03801-
4_63
50. Luo, W., Xing, J., Milan, A., Zhang, X., Liu, W., Kim, T.-K.: Multiple object tracking:
a literature review. Artif. Intell. 293, 103448 (2020). https://doi.org/10.1016/j.artint.2020.
103448
51. Chan, Z. Y., Suandi, S. A.: City tracker: multiple object tracking in urban mixed traffic
scenes. In: Proceedings of the 2019 IEEE International Conference Signal Image Processing
Applications ICSIPA 2019, pp. 335–339 (2019). https://doi.org/10.1109/ICSIPA45851.2019.
8977783
Alternate Approach to GAN Model for Colorization of Grayscale Images: Deeper U-Net + GAN

Seunghyun Lee(B)

Korean Minjok Leadership Academy, 800 Bonghwa-ro, Anheung-myeon, Hoengseong-gun, Gangwon-do, South Korea
[email protected]

Abstract. Image colorization refers to applying appropriate colors to a given grayscale image, such that the viewer can accept the results as close to reality. By analyzing existing colorization algorithms based on AutoEncoder and VGG-16, this paper showed that they are not able to provide an accurate result in most cases and are inefficient in terms of computation time. In contrast to these models, we suggested a new model developed from an established GAN model. By reforming the generator part and adding 1x1 convolutional layers based on VGG-11, we were able to create a deeper model in which we could also apply nonlinear functions such as ReLU and Leaky ReLU. Comparing the results produced by our new model and the conventional model, we proved that our model produced better results in terms of the accuracy and clarity of colors and computation time. However, there is still room for further research to investigate the optimal number of convolution layers and the depth that maximizes accuracy and minimizes computation time. Still, this research holds value in that it successfully provides an alternate algorithm with better performance and opens a path toward further development of colorization algorithms.

Keywords: Colorization · Deep learning · GAN · CNN · U-Net · Autoencoder

1 Introduction

1.1 Background

In image colorization, the input is a grayscale photo and the objective is to assign a natural, appropriate color to the photo as an output. As grayscale information does not determine what color is appropriate for a certain section of the given photo, we use color information from typical colored photos to construct an algorithm that determines a plausible color for the grayscale photo.
Due to the ambiguity of the problem, it is yet to be perfectly solved, nor does it have
a definitive approach towards the solution. Apart from just colorizing the given photo,
the two crucial factors of this research are to 1) determine a color that seems natural for
any viewer, 2) use a wide variety of colors to capture significant details. These factors

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 53–64, 2023.
https://doi.org/10.1007/978-3-031-18461-1_4
54 S. Lee

serve to make this problem even more challenging, as it has to be achieved without clear
criteria to determine the accuracy of the output.
The evaluation of the algorithm’s accuracy is entirely dependent on the satisfaction of
the human viewers, who will use their experience to determine whether the colorization
has been done naturally. Thus, although we have a method to determine the accuracy,
the method requires extensive effort, and its objectivity is difficult to guarantee [1].

1.2 Purpose

This research aims to construct a new image colorization program. Existing colorization programs involve deep learning, shown in Fig. 1 (as a means of gathering immense data), which opens up various methods upon which a program can be established. Thus, this research investigates which model is able to provide an algorithm with lower computation time and higher resolution.
Accuracy is another aspect to be focused on in this research. Existing colorization
programs suffer from low accuracy in their output due to the existence of “outlier” photos. For example, a grayscale photo of an exotically colored flower, such as an orange one, would be colorized as relatively red or pink, since the algorithm mostly gathers data on the typical colors of flowers.
In this research, typical methods (CNN-based U-Net, Autoencoder, and GAN) were utilized and compared in terms of time efficiency and accuracy. This shows the limitations of the currently available colorization algorithms. A new algorithm based on the original GAN model is then suggested, and its results are compared with those of the original algorithm to see how much the algorithm has improved. The rest of the paper
proceeds as follows: firstly, in the related work section, we introduce several previous
kinds of research related to our topic. Then, in the material and methods section, some
deep learning models and datasets are introduced. In the results section, the experimental
results, especially the generated photos from the proposed model are exhibited. In the
discussion section, the principal finding of our research, real-life usage of this finding,

Fig. 1. The architecture of deep neural networks consists of three hidden layers
Alternate Approach to GAN Model for Colorization of Grayscale Images 55

and future research are covered. Lastly, we summarize those sections mentioned above
in the conclusion section.

2 Related Work
Kinani et al. used the DIV2K data set, from the NTIRE image coloring challenge. The
dataset includes 1000 color images of 2K resolution, 800 of them being training images
for the algorithm, another 100 being validation images, and the rest of the 100 being
test images. For training, they used the VGG19 model, a convolutional neural network
(CNN) that is pre-trained with a large dataset and consists of 19 layers. They also
produced the test results and compared them with the original photos to show the accuracy of the results visually. However, they failed to provide a detailed analysis of their results' accuracy. Although they define the PSNR index to represent accuracy numerically, they do not provide insight into the link between the index value and the actual accuracy, making it difficult to interpret the accuracy from the index itself. Although the team
presents their program as the second most accurate program out of the other DIV2K
dataset programs, they still fail to maintain high accuracy in the “outlier” photos which
hold completely different colors from the colorized image [2].
An et al. used the VGG-16 as a basic reference but mainly trained on ImageNet in
order to learn a variety of colors. They provide a more detailed function that expresses the
accuracy of the result. They also use it to provide an objective standard for evaluating the
accuracy of each colorization algorithm, which they apply to compare their algorithm
with other prominent algorithms. They conclude that Zhang et al. resulted in the greatest per-pixel RMSE [3].
Shankar et al. used Inception ResNet V2, a pre-trained model that Google publicly
released, as the basis of their model, and a fusion layer to train the model further and gain
a closer, required output. Then they used it as an example to measure the algorithm’s
performance in terms of the degree of error. They also defined an error function and
applied it to find the optimal number of epochs and steps per epoch [4].
Varga et al. referred mainly to images in the SUN database, along with other images. They used VGG-16 and a two-stage CNN as their model. They differ from other studies in that they established a two-stage CNN-based algorithm that predicts the U and V channels of the input. They also relied on Quaternion Structural Similarity (QSSIM) as a basis for evaluating the accuracy of colorization, providing reasons why they chose QSSIM along with a quantitative analysis using it. Three different experiments
were carried out to compare each model. The first one is utilizing autoencoder-based
convolution, while the second and third ones are VGG16 based U-net, and our proposed
model, respectively [5].

3 Methods and Materials


3.1 Data Description
This study used a dataset of approximately 14,300 image files consisting of photos por-
traying streets, mountains, buildings, glaciers, trees, etc. Each image exists in two ver-
sions: a colored version and a grayscale version. Using this data, we train our algorithm
to accurately estimate the colored version from a given grayscale image. The dataset is
gathered from the Kaggle website, which can be accessed via https://www.kaggle.com/
theblackmamba31/landscape-image-colorization. Sample images (color and grayscale) from the dataset are shown in Fig. 2 [6].

Fig. 2. Examples of the Image Datasets Gathered from the Kaggle Website
3.2 CNN

A Convolutional Neural Network (CNN) is a popular deep neural network that consists of multiple layers to perceive patterns within large amounts of data. Layers include the convolutional layer, non-linearity layer, pooling layer, and fully-connected layer. Convolutional and fully-connected layers have parameters, while pooling and non-linearity layers do not. The convolutional layer conducts convolution operations to extract features from the input images, and the pooling layer then resizes the output from the convolutional layer [7]. Lastly, the fully-connected layer is added for better performance and classification [8]; these procedures are described in Fig. 3.

Fig. 3. The General Architecture of Convolutional Neural Networks
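The convolution-then-pooling procedure described above can be sketched in a few lines of NumPy (single channel, stride 1, no padding; the 2x2 kernel is an arbitrary example):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (stride 1, no padding) on a single-channel image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling that shrinks each spatial dimension."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1.0, -1.0], [1.0, -1.0]])  # crude vertical-edge filter
features = conv2d(image, kernel)  # feature extraction -> (5, 5) map
pooled = max_pool(features)       # downsampling       -> (2, 2) map
print(features.shape, pooled.shape)  # (5, 5) (2, 2)
```

Note that, as in most deep learning frameworks, the "convolution" here is technically cross-correlation (the kernel is not flipped); in a real CNN the kernel values are learned rather than fixed.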

3.3 Autoencoder

An Autoencoder (AE) is an unsupervised machine learning technique and is mainly used for noise reduction on a given dataset [9]. Unsupervised learning utilizes a dataset that does not contain any labels, which is a significant difference from supervised learning [10]. An AE mainly consists of an encoder and a decoder: it compresses the input data into a lower-dimensional representation through the encoder and reconstructs it via the decoder [9]. The convolutional autoencoder is used for image datasets and adds convolution operations to the standard autoencoder, as described in Fig. 4 [11].
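The encode-compress-decode idea can be illustrated with a linear autoencoder, whose optimal weights are available in closed form via the SVD (equivalently, PCA); a trained nonlinear AE generalizes this, and the toy data below is our own construction:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 samples in 5 dimensions that actually lie on a 2-D subspace.
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 5))

# A linear autoencoder trained to minimise reconstruction MSE converges to the
# top principal subspace, which we can compute directly from the SVD of X.
_, _, vt = np.linalg.svd(X, full_matrices=False)
encoder = vt[:2].T   # 5 -> 2: compress input to a 2-D code
decoder = vt[:2]     # 2 -> 5: reconstruct input from the code

code = X @ encoder       # encoded (compressed) representation, shape (100, 2)
X_hat = code @ decoder   # reconstruction, shape (100, 5)
mse = np.mean((X - X_hat) ** 2)
print(code.shape, mse < 1e-10)  # (100, 2) True
```

Because the toy data is genuinely two-dimensional, the 2-D code reconstructs it almost exactly; with real images, the reconstruction error measures how much information the bottleneck discards.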

3.4 Proposed Model

Unlike previous research, we propose a novel approach based on the GAN (Generative Adversarial Network). A GAN consists of a generator and a discriminator. The generator produces synthetic data based on the probability distribution of the given data, and the discriminator decides whether the generated data is genuine or not. In previous research, the authors suggested a model with a similar
58 S. Lee

Fig. 4. The General Architecture of an Autoencoder Consists of an Encoder and Decoder

architecture to the GAN model. They utilized a U-net architecture as the generator and a PatchGAN as the discriminator [12].
However, we propose a new model based on an asymmetric U-net architecture. The original U-net has a symmetric architecture, similar to an autoencoder. We instead tried to construct a deeper model than the previous research in order to attain better performance. Therefore, for the encoder part of the U-net, we added a 1x1 convolution layer before every single encoding layer. The 1x1 convolution layer is mainly used to reduce computation and is a principal building block of GoogLeNet, presented by Google [13]. Through this approach, we could add non-linearity to our encoder layers with two activation functions: the rectified linear unit (ReLU) and leaky ReLU. The ReLU function is f(x) = 0 (for x < 0), x (for x >= 0), while leaky ReLU is f(x) = 0.01x (for x < 0), x (for x >= 0). The main difference between them is that leaky ReLU keeps a small gradient for negative inputs instead of mapping them to exactly zero. These activation functions are described in Fig. 5.
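The two activation functions can be written directly in NumPy (a sketch matching the definitions above, with the leaky slope fixed at 0.01):

```python
import numpy as np

def relu(x):
    # f(x) = 0 for x < 0, x for x >= 0
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # f(x) = alpha * x for x < 0, x for x >= 0
    return np.where(x < 0, alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # negative inputs become exactly 0
print(leaky_relu(x))  # negative inputs keep a small non-zero value
```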
Combined with the asymmetric architecture, we successfully constructed a deeper model while reducing the total number of parameters, which efficiently decreases computation time and memory. In other words, previous approaches used a lot of GPU and RAM memory, which made it difficult to apply the model in relatively modest environments such as Colab; with such heavy memory usage, applying colorization techniques in a mobile environment would also be tough.
Our proposed model mainly consists of two parts, a generator and a discriminator, which is quite similar to other GAN models. The generator utilizes the asymmetric U-net architecture, whose encoder has the same structure as VGG11, and the PatchGAN was utilized

Fig. 5. Visualization of Two Non-Linear Activation Functions: leaky ReLu, and ReLu

as the discriminator. As the PatchGAN uses a patch unit for determining the authenticity of the generated image, it is far more effective for colorization than the original GAN discriminator. The overall process is shown in Fig. 6.
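The patch-unit idea behind PatchGAN can be sketched as follows; `patch_scores` and its placeholder `score_fn` are hypothetical stand-ins for the learned convolutional discriminator, shown only to illustrate that the decision is a grid of per-patch scores rather than a single scalar:

```python
import numpy as np

def patch_scores(image, patch=16, score_fn=None):
    """Split an image into non-overlapping patches and score each one.

    A real PatchGAN produces this grid with convolutions; here a
    hypothetical score_fn stands in for the learned discriminator.
    """
    if score_fn is None:
        # placeholder: sigmoid of the mean patch intensity
        score_fn = lambda p: 1.0 / (1.0 + np.exp(-p.mean()))
    h, w = image.shape[:2]
    grid = np.array([
        [score_fn(image[i:i + patch, j:j + patch])
         for j in range(0, w - patch + 1, patch)]
        for i in range(0, h - patch + 1, patch)
    ])
    return grid

img = np.zeros((64, 64))
grid = patch_scores(img, patch=16)
print(grid.shape)        # a 4x4 grid: one real/fake score per patch
decision = grid.mean()   # the loss averages the per-patch decisions
```

Judging many local patches, rather than the whole image at once, is what lets the discriminator penalize locally implausible colors.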

Fig. 6. The overall architecture of the proposed model: A deeper VGG 11 for the generator and
patch GAN for the discriminator: The generator produces fake images, and the discriminator
decides whether the input images are fake or not.

4 Result

4.1 Autoencoder-Based Approach

Fig. 7 presents the predicted images produced by this algorithm. These images are based on the test cases used for our improved algorithm, and the results are an evident failure: the outputs remain nearly grayscale. It is questionable whether these images are really colorized. In the section below, we reexamine these test cases and see how the results improve with our newly constructed algorithm.

Fig. 7. Colorization results from the autoencoder based approach

4.2 U-Net (VGG16)-Based Approach

Fig. 8 shows the predicted images produced by this approach. Note that these images are identical to those used to test our improved algorithm below. Comparing the results, we see a clear distinction in the first and third test cases: results from this algorithm are overall monotone in color and distinct from the actual color image, while the improved algorithm provides a more vivid colorization of the grayscale images. The second test case turns out almost the same for both algorithms, but note that the improved algorithm still provides a clearer colorization than the results above.

Fig. 8. Colorization results from the U-Net (VGG16) based approach

4.3 Proposed Model

Fig. 9 presents three test cases, comparing each grayscale image with its actual color image and the predicted image produced by the algorithm. All three results are evidently similar to the actual color images in terms of overall color. Some detailed parts differ in color, but distinct objects such as the sky, the computer, and the desk have nearly accurate colors. Also, referring to the results of these test cases from the two previously examined algorithms, we clearly see that the results from this algorithm are more accurate. Although a huge amount of data was used for evaluating our model, we chose these three pictures because we believed they were quite hard to colorize. For instance, the photo of the pizza and the third image contain diverse colors, and the boundaries dividing the objects are complicated; for the second image, the gradation of the sky was difficult to depict. For those reasons, these three images were chosen as representative of our results. In addition to colorization performance, the proposed model was much better than the other models in terms of time required: training the other models in the same environment took almost an hour, but the proposed model took less than 20 min.

Fig. 9. Colorization Results from the Proposed Algorithm

5 Discussion
5.1 Principal Finding
In this research, we examined the typical methods used for colorization: the autoencoder and the CNN-based U-net. We tested these algorithms on sample data, comparing the actual color image with the predicted image produced by each algorithm. As a result, we found that the conventional algorithms lacked accuracy in the majority of the test cases.
This research also proposed a new model that improves on the traditional GAN model. This method produced better predicted images than the previously examined algorithms, and we concluded that the GAN model can be made more accurate with less memory usage (RAM, GPU) and computation time.

5.2 Real-Life Usage and Application


The usage of this research links to the usage of the colorization algorithm itself. At present, colorization is already used to reinforce research in fields such as history. For example, we can restore the color of gray photos taken in the 19th century, helping researchers find further resources within the photos. It can also let viewers capture the scene more realistically.

5.3 Future Research


The topic of colorization still has room for further research. Colorization algorithms can be improved in terms of time complexity and, especially, accuracy. To achieve an ideal colorization algorithm that maintains a sufficient level of accuracy, future research will have to focus on finding adaptive deep learning techniques or models that maximize the accuracy of the results.

6 Conclusion
This research proposed an alternate colorization algorithm developed upon the conventional GAN model. We reformed the generator part of the original GAN model, which was based on a VGG11 U-net, by adding 1x1 convolution layers to the symmetric U-net model. This let us create a deeper model and use non-linear functions such as ReLU. As a result, the number of parameters involved in the algorithm was reduced, and the computation time was shortened. By comparing the results of the original algorithm and the reformed algorithm, we also confirmed that our algorithm had greater accuracy in terms of the plausibility of the colorization.

References
1. Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: Leibe, B., Matas, J., Sebe,
N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 649–666. Springer, Cham (2016).
https://doi.org/10.1007/978-3-319-46487-9_40
2. Kiani, L., Saeed, M., Nezamabadi-Pour, H.: Image colorization using a deep transfer learning.
In: 2020 8th Iranian Joint Congress on Fuzzy and Intelligent Systems (CFIS) (2020)
3. An, J., Gagnon, K.K., Shi, Q., Xie, H., Cao, R.: Image colorization with convolutional neural
networks. In: 2019 12th International Congress on Image and Signal Processing, BioMedical
Engineering and Informatics (CISP-BMEI) (2019)
4. Shankar, R., Mahesh, G., Murthy, K.V.S.S., Ravibabu, D.: A novel approach for gray scale
image colorization using convolutional neural networks. In: 2020 International Conference
on System, Computation, Automation and Networking (ICSCAN) (2020)
5. Varga, D., Sziranyi, T.: Fully automatic image colorization based on convolutional neural
network. In: 2016 23rd International Conference on Pattern Recognition (ICPR) (2016)
6. Kaggle: https://www.kaggle.com/theblackmamba31/landscape-image-colorization. Accessed 23 Jan 2022
7. Joo, H., Choi, H., Yun, C., Cheon, M.: Efficient network traffic classification and visualizing abnormal part via hybrid deep learning approach: Xception + Bidirectional GRU. Global Journal of Computer Science and Technology, pp. 1–10 (2022)
8. Albawi, S., Mohammed, T.A., Al-Zawi, S.: Understanding of a convolutional neural network.
In: 2017 International Conference on Engineering and Technology (ICET) (2017)
9. Badino, L., Canevari, C., Fadiga, L., Metta, G.: An auto-encoder based approach to unsu-
pervised learning of subword units. In: 2014 IEEE International Conference on Acoustics
(2014)

10. Ye, M., Ma, A.J., Zheng, L., Li, J., Yuen, P.C.: Dynamic label graph matching for unsupervised
video re-identification. In: Proceedings of the IEEE International Conference on Computer
Vision, pp. 5142–5150 (2017)
11. Seyfioglu, M.S., Ozbayoglu, A.M., Gurbuz, S.Z.: Deep convolutional autoencoder for radar-
based classification of similar aided and unaided human activities. IEEE Trans. Aerosp.
Electron. Syst. 54(4), 1709–1723 (2018)
12. Ren, H., Li, J., Gao, N.: Two-stage sketch colorization with color parsing. IEEE Access 8,
44599–44610 (2020)
13. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Rabinovich, A.: Going
deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and
pattern recognition, pp. 1–9 (2015)
Implementing Style Transfer with Korean
Artworks via VGG16: For Introducing Shin
Saimdang and Hongdo KIM’S Paintings

Jeanne Suh(B)

Saint Paul Preparatory Seoul, 50-11 Banpo-dong, Seocho-gu, Seoul, Korea


[email protected]

Abstract. By introducing genre painting to the artists of the time, artist Kim
Hongdo is a man known for opening the Renaissance era of Joseon’s art history.
Not only did he introduce new genres of art, but he also combined his delicate tech-
niques with his own unique art styles and thus completed over 130 art pieces dur-
ing his lifetime. Shin Saimdang is another prominent artist of the Joseon Dynasty.
Despite the Confucian beliefs that limited women at that time, Shin Saimdang managed to introduce her meticulous art style to the public and receive acknowledgment from many officials of the time. This research developed a deep learning-based algorithm to recreate original photos in Kim Hongdo's and Shin Saimdang's art styles. Unlike previous research, which utilized Western paintings as the target of the style transfer, this paper uses traditional Korean artwork, a difference that makes this research distinctive. Furthermore, this paper suggests a novel method based on the VGG16 model, in order to reduce computation time compared to the VGG19 model. The model
implemented style transfer to five original photos which created successful results,
capable of introducing Kim Hongdo’s and Shin Saimdang’s art styles and tech-
niques to the general public. The brush strokes and color themes of each artist are
successfully recreated in the new images. Despite such drastic changes, the overall
structure of the original photo is well maintained and expressed. The five exam-
ples can become a helpful guideline for a better understanding of Kim Hongdo’s
and Shin Saimdang’s art styles which can further stretch to the understanding of
Joseon’s art history as a whole.

Keywords: Deep learning · Hongdo Kim · Style transfer · VGG16 · VGG19

1 Introduction

1.1 Background

Kim Hongdo, the exemplar of Joseon art history, is still considered a phenomenal artist who greatly elevated the style of art throughout Korean art history. In 1745, Hongdo was born as the only son of a poor family [1]. Unlike most artists of the time, Hongdo was not from an artist family. Therefore, his early influence on art came mostly from his

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 65–78, 2023.
https://doi.org/10.1007/978-3-031-18461-1_5
66 J. Suh

teacher Gang Se-hwang, a government official who was also known as an art critic as
well as a part-time artist. As an influence of Gang Se-hwang’s recommendation, Kim
Hongdo began to work under Jeongjo, the 22nd ruler of the Joseon dynasty, from his
mid-twenties. Hongdo succeeded in receiving high credibility from Jeongjo, and within
a few years, Hongdo was entitled to be in charge of several confidential errands of the
country, such as going to Tsushima island to draw the map of Japan on his own [2].
He also drew many portraits of members of the royal family. Although Hongdo was a versatile, talented artist capable of diverse painting genres such as landscapes and portraits, he is considered a crucial influence on the artists of the 18th and 19th centuries because of his paintings depicting the lives of ordinary people. This specific style of drawing is called genre painting. Twenty-five
pieces of Hongdo’s genre paintings were compiled into an art book called “Dan won
Pung sok do chub.” One famous painting from this book is “Seodang,” shown in Fig. 1. At a glance, “Seodang” can be read from the center of the painting outward: the focal point first rests on the little boy weeping in sadness, then opens to the surrounding view of the other boys and the teacher irresistibly laughing at the crying boy. The rough, simple strokes, as well as the low-saturated gouaches, are also noticeable. The minor details of
Hongdo’s painting arouse curiosity from the viewer. For instance, Hongdo intentionally
hid the face of the boy located in the lowest part of the painting while exaggerating the
wrinkles on his clothing, making the viewers doubt whether the boy is laughing together
with his fellow classmates or trembling in fear that he would be the one who would

Fig. 1. A Painting by Kim Hondo, “Seodang”


Implementing Style Transfer with Korean Artworks via VGG16 67

be scolded next. Such minor details function as comical elements, adding humor to the painting. The exact year of Hongdo's death remains a mystery, although some historians assume it was around 1806 [3]. What is certain is that Hongdo had a greater influence than any other artist throughout Korean art history.
One of the major branches of Confucianism is Neo-Confucianism, also known as “Xing li xue” in traditional Chinese characters, which was the cardinal national belief system of the Joseon period. Neo-Confucianism, and Confucianism in general, believed in “Nam jon yeo bi,” a discriminatory idea that men were superior to women [4].
Therefore, the Joseon period was not the best time period for women to show the world
their talent. Shin Saimdang is the one and only female Joseon artist left on record. Shin
Saimdang was born into a very wealthy and prestigious family. The Shin family was known for its great wealth, and Shin Saimdang therefore had more opportunities to devote herself to different studies than many other girls of her age. From a very young age she studied poetry, drawing, and calligraphy, at which she was very
talented. Shin Saimdang was most famous for her floricultural drawings: drawings of
nature, especially flowers. Figure 2 shows one of her famous paintings, “Gaji and Banga kkaebi.” Her drawings were aesthetically eye-catching: very feminine and poetic, with meticulous and detailed brushstrokes [5]. Therefore, her drawings were acknowledged by many famous scholars, most notably Sukjong, the king of the Joseon dynasty from 1674 to 1720. The records of her married life clearly tell us that Shin Saimdang was not the passive, obedient type of woman that Confucianism valued. However, Shin Saimdang is now better known as the “exemplar mother” after the birth of her son Yulgok Yi I, a famous philosopher. Despite her talent in drawing, it was hidden due to the idiosyncratic Confucianist discrimination against women during the Joseon period [6].

1.2 Purpose
Throughout art history, Western and Eastern artists built different artistic styles depending on their cultural views of aesthetic preferences. These disparities did not arise in a day; they are the product of centuries of cultural separation between the West and the East. Unlike European artists of the past centuries,
who are now very well known to the public, many Eastern artists are yet to be acknowledged by a wider range of people. Through this project, the art styles and techniques of the artist Kim Hongdo can be approached and understood easily by a wider public through the various examples this algorithm provides.
There are a variety of artists throughout Korean art history, but both Kim Hongdo
and Shin Saimdang are cardinal examples, necessary for people to know in order to
understand Korean art history. Hongdo was so influential to the artists of his time that
Joseon’s art can be marked as the era before Hongdo and after Hongdo. He opened
the era of genre painting in the late 18th century of the Joseon dynasty [7]. Unlike
other artists that lived a similar time period as Hongdo, Hongdo specifically excelled
at capturing a specific scene, then copying it into his canvas with high accuracy and
delicacy. Hongdo would also omit the background in order to give extra focus to the
scenes he depicted. Furthermore, his paintings did not follow the traditional Joseon’s
style of drawing: utilizing multiple points of view and adding them up into one picture.

Fig. 2. A Painting by Shin Saimdang, “Gaji and Banga kkaebi”

Instead, Hongdo followed the camera’s direct angle of view and utilized that one point
of view in his paintings [8].
Shin Saimdang was a competent figure who succeeded in leaving her name in history despite the limits she faced as a woman in the Joseon period. Although she is better known as a wife and a mother, it is an undeniable fact that she was a phenomenal painter. In appreciation of her virtues and talents, a portrait of Shin Saimdang appears on the front of the Korean 50,000 won bill.
In conclusion, Hongdo was capable of using simple methods and styles of drawings
to give depth to the painting and utilized new techniques that artists of the time failed
to come up with. Shin Saimdang was also a figure who drew beautiful drawings of
nature and was known for her meticulous art style. By implementing a deep learning
algorithm, original photos would be recreated through Hongdo’s and Shin’s art styles,
which could help introduce pictures of Hongdo and Shin efficiently. The rest of the paper
consists as follows: related works section, which introduces previous research, methods
and materials section about deep learning algorithms, proposed model, result section,
discussion section, and conclusion section.

2 Review of Related Works

Even though Chinese painting is quite popular in Asian countries, neural style transfer research on Chinese painting is scarce. Therefore, the authors of [9] proposed a modified extended difference-of-Gaussians (MXDoG) approach to develop a novel neural transfer model for Chinese-style paintings. They begin with an MXDoG filter and then combine the MXDoG operation with three new loss

values for the training process. The proposed model consists of two networks: a generative network and a loss network. For the loss network, VGG16, a pre-trained CNN model, was utilized, and the MXDoG loss was used to evaluate the loss function of the proposed network. Chinese paintings were collected from various search platforms, including Google, Baidu, and Bing. Their experimental results reveal that the proposed technique yields more attractive stylized outcomes when transferring the style of traditional Chinese painting, compared to state-of-the-art neural style transfer models [9].
Zhao et al. carried out research on neural style transfer, mainly focused on inventing a novel loss function that combines global and local losses in order to enhance the quality of the proposed model. They constructed a deep learning model based on the VGG19 network and designed each part to contribute to the overall loss. Layers ‘relu1_2’, ‘relu2_2’, ‘relu3_3’, and ‘relu4_2’ were chosen for the global loss, and ‘relu_3’ was selected for the local loss. The global loss part gathered more global information from the given image, and the local loss the local information. With these proposed methods, the authors successfully reduced artifacts and transferred the style to the content image while preserving its base structure and color [10].
Gatys et al. investigated how to overcome a drawback of conventional neural style transfer methods: the models sometimes duplicate the color distribution of the style image. The authors therefore proposed two novel style transfer methods that preserve colors. The first is color histogram matching, which modifies the colors of the style picture to suit the colors of the content image. The second is luminance-only transfer: luminance channels are extracted from both the style and content images and used to generate a luminance image. The experiments showed that the second method preserves colors properly, even though the relationship between the luminance and color channels no longer exists in the output image of the model [11].
Gupta et al. compared methods for conducting neural style transfer with various CNN-based deep learning algorithms. Pretrained CNN models were utilized via the Keras API: VGG16, VGG19, ResNet50, and InceptionV3. Gram matrices were utilized to compute the loss of the models. The results showed that InceptionV3 and ResNet50 were not suitable for style transfer; in particular, the InceptionV3 model produced only a black screen. Furthermore, the performance of VGG19 was far greater than that of VGG16, since VGG19 consists of more layers. However, the training speed of VGG19 is slower than that of VGG16, so each model has its pros and cons [12].

3 Methods and Materials


3.1 VGG16
A convolutional neural network (CNN) is a deep learning algorithm mostly utilized for image analysis. A filter slides over the image and is used to extract features from it via the convolution operation. Then, a pooling layer is used to reduce the size of the convolutional output, and these two layers are repeated during the process. VGG16 is a CNN-based model trained on a large dataset called

“ImageNet”. As the model was pre-trained on this dataset, the weights learned during training are preserved, which allows better performance when the model is applied to the user's dataset. The input size of VGG16 is fixed at 224 x 224, the stride of the convolution layers is fixed to 1, and padding is applied so that spatial dimensions are preserved for precise analysis. For the pooling layers, max pooling is applied after groups of convolution layers, and a total of five max poolings are used. These components of VGG16 can be found in Fig. 3. The default layers of VGG16 can be easily loaded through the Keras API. Therefore, when applying the model to their own dataset, users can combine it with various deep learning models, including the deep neural network (DNN), long short-term memory (LSTM), recurrent neural network (RNN), and gated recurrent unit (GRU) [13].
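Because the stride-1 padded convolutions leave the spatial size unchanged and each of the five max poolings halves it, the VGG16 feature-map sizes can be traced with simple arithmetic (channel counts follow the standard VGG16 configuration):

```python
# Trace spatial sizes through VGG16: stride-1 'same' convolutions keep the
# size; each of the five 2x2 max poolings halves it.
blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]  # (conv layers, channels)

size = 224  # fixed VGG16 input resolution
for n_convs, channels in blocks:
    # n_convs stride-1 padded convolutions: spatial size unchanged
    size //= 2  # one max pooling per block
    print(f"after block: {size}x{size}x{channels}")
# 224 -> 112 -> 56 -> 28 -> 14 -> 7: the final 7x7x512 map feeds the dense layers
```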

Fig. 3. The overall architecture of VGG16

3.2 Style Transfer via VGG19

Style transfer refers to changing the style of a content image to match a style image, given the two images. To this end, the two images, VGG19, and a generator model that creates images are used. The goal is to correctly blend content and style by combining the features collected as the content and style images pass through the VGG19 model. From the style image, features are extracted from almost all layers; from the content image, features are extracted from the fourth layer and passed to the generator network to create an image. Feature extraction by a CNN proceeds as follows: shallow layers capture monotonous, repeated patterns, while deeper layers combine these shallow features into more complex and larger-scale ones. Therefore, the CNN can efficiently extract features from the style image and apply them to the content image. Previous neural style transfer algorithms had a critical shortcoming: whenever a new content image came in, the model had to be trained again, which made them quite time-consuming. The VGG19-based model overcomes this problem by proceeding with a feed-forward process [14].
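The blending described above is commonly formalized with a content loss (mean squared error between feature maps) and a style loss based on Gram matrices of feature maps. A minimal NumPy sketch, using random arrays as stand-ins for VGG19 activations and an arbitrary style weight:

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a (channels, height, width) feature map.

    Correlations between channel activations capture style while
    discarding the spatial arrangement of the content.
    """
    c, h, w = features.shape
    F = features.reshape(c, h * w)
    return F @ F.T / (c * h * w)

def content_loss(gen, content):
    return float(np.mean((gen - content) ** 2))

def style_loss(gen, style):
    return float(np.mean((gram_matrix(gen) - gram_matrix(style)) ** 2))

rng = np.random.default_rng(0)
content_feat = rng.normal(size=(8, 16, 16))   # stand-ins for real VGG19 activations
style_feat = rng.normal(size=(8, 16, 16))
generated = content_feat.copy()               # generation often starts from the content

# weighted sum; the 10.0 style weight is an arbitrary illustrative choice
total = content_loss(generated, content_feat) + 10.0 * style_loss(generated, style_feat)
```

Starting from a copy of the content, the content term is zero and optimization is driven entirely by the style term, which pushes the Gram matrices toward those of the style image.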

3.3 Proposed Model

As the VGG19-based style transfer model achieved high performance efficiently, this research tried a VGG16-based style transfer model. Because VGG16 consists of fewer layers than VGG19, it requires less computation time. As computation speed and memory usage are vital for deep learning technologies, the proposed model could make this research area more efficient. The overall process of the model is quite similar to the VGG19-based one; for the content_layers, only the ‘block5_conv2’ layer was utilized, whereas ‘block1_conv1’, ‘block2_conv1’, ‘block3_conv1’, ‘block4_conv1’, and ‘block5_conv1’ were utilized for the style_layers.
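With this layer choice, the total objective is typically a weighted sum of one content term and five style terms (one per style layer). The sketch below uses the layer names from the proposed model but random arrays as stand-in activations and hypothetical loss weights:

```python
import numpy as np

rng = np.random.default_rng(1)

content_layers = ["block5_conv2"]
style_layers = ["block1_conv1", "block2_conv1", "block3_conv1",
                "block4_conv1", "block5_conv1"]
style_weight, content_weight = 1e-2, 1e4  # hypothetical weights, not the paper's

def gram(F):
    # style is compared via Gram matrices of (channels, h, w) activations
    flat = F.reshape(F.shape[0], -1)
    return flat @ flat.T / flat.size

# Random arrays stand in for the activations each layer would produce
# for the content, style, and generated images.
acts = {name: {img: rng.normal(size=(16, 8, 8))
               for img in ("content", "style", "generated")}
        for name in content_layers + style_layers}

content_term = sum(np.mean((acts[n]["generated"] - acts[n]["content"]) ** 2)
                   for n in content_layers) / len(content_layers)
style_term = sum(np.mean((gram(acts[n]["generated"]) - gram(acts[n]["style"])) ** 2)
                 for n in style_layers) / len(style_layers)
total_loss = content_weight * content_term + style_weight * style_term
print(f"content={content_term:.4f}  style={style_term:.6f}  total={total_loss:.2f}")
```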

4 Result
4.1 Result of Hongdo KIM’S Painting
In order to implement the VGG16-based style transfer model, “Seodang” was selected as the style image and five different pictures as the content images. During the execution, the total loss, style loss, and content loss were calculated, yielding 1.9979e+05, 9.9740e+04, and 1.0005e+05, respectively.
Overall, the model succeeded in copying the general characteristics of Hongdo's paintings, including the rough, simple, diluted black brush strokes, as well as the yellowish, low-saturated opaque colors. Hongdo's paintings are mostly drawn in a yellowish drab, which caused the model to recreate colors opposite to yellow, such as blue, with less precision. Although the model struggled with details that Hongdo's paintings do not provide, it succeeded in recreating a good overview of Hongdo's art style and techniques; the results can be found in Figs. 4, 5, 6, 7 and 8.
The training epoch count was set to 500, and at epochs 25, 75, 395, and 495 the model yielded transferred outputs; we conclude that as the number of epochs increases, the performance of the model is enhanced. Furthermore, the figures below show that the minimum epoch count should be higher than 75, since the picture at epoch 75 does not exhibit the desired output. The experiment was conducted on Colab Pro with a Tesla P100 GPU, and it took only 125 s to complete the execution. This fast computation is mainly due to the proposed model being based on VGG16. The figures below depict the output of the model at these epochs (Figs. 9, 10, 11, 12).

Fig. 4. Style transfer results from the proposed model (Data #1)

Fig. 5. Style transfer results from the proposed model (data #2)

Fig. 6. Style transfer results from the proposed model (data #3)

Fig. 7. Style transfer results from the proposed model (data #4)

Fig. 8. Style transfer results from the proposed model (data #5)

Fig. 9. Style transfer results from the proposed model (epoch: 25)

Fig. 10. Style transfer results from the proposed model (epoch: 75)

Fig. 11. Style transfer results from the proposed model (epoch: 395)

Fig. 12. Style transfer results from the proposed model (epoch: 495)

4.2 Result of Shin SAIMdang’s Painting


The following pictures are the image transfer results of Shin Saimdang’s paintings.
Although the brush strokes are vaguer in the style transfer image compared to the style
image, the colors used in Shin Saimdang’s paintings are well represented in the style
transfer image. Compared to Hongdo’s paintings, Shin’s paintings are relatively festive
and vivid, using a wider range of colors, including violet, reddish-orange, and green. Such
colors are a noticeable point in Shin Saimdang’s paintings, which makes it important
that such characteristics should firmly be shown in the resulting style transfer image.

Furthermore, the colors are not diluted but solid and clear to the eye. As shown in Figs. 13, 14, 15, 16 and 17, the VGG16 model successfully captured these attributes of Shin Saimdang's paintings in terms of color. As a result, the style-transfer images use a more vivid and bright color palette than the original content images.

Fig. 13. Style transfer results from the proposed model (data #1)

Fig. 14. Style transfer results from the proposed model (data #2)

Fig. 15. Style transfer results from the proposed model (data #3)

Fig. 16. Style transfer results from the proposed model (data #4)

Fig. 17. Style transfer results from the proposed model (data #5)

5 Conclusion
5.1 Discussion

One factor that differentiates this model from analogous research is that the algorithm specifically recreates a piece by a Korean artist of the Joseon period. Similar research tends to recreate the art styles of prominent European artists such as Vincent van Gogh, a representative Post-Impressionist artist of the 19th century. This research can raise awareness of ethnic artists; we chose Kim Hongdo, the artist who opened the Renaissance period of the Joseon dynasty's art. The algorithm successfully copied Hongdo's art style in terms of color and composition. There were certain limits in recreating cerulean colors and photo backgrounds, such as the sky, because Hongdo did not draw backgrounds in any of his genre paintings and also did not use many cerulean colors. The piece used to recreate the photos, “Seodang,” likewise has neither a background nor cerulean colors. Despite these limitations, the algorithm is unique in that it promotes a traditional Korean artist by recreating his art style, suggesting an easier approach for the public.

In terms of Shin Saimdang’s artwork, Shin’s usage of vivid and bright colors was
successfully implemented in the resulting images. Similar to Hongdo’s paintings, Shin
also did not draw backgrounds in her paintings, nor did she use cerulean colors. This
resulted in a similar result as Hongdo-implemented pictures in the above: a lack of
representation of blue colors and background. However, Shin Saimdang did have a wider
range of color usage, such as green, so a glimpse of blueish colors is noticeable in the style
transfer images. Another limitation is that Shin’s meticulous, detailed style of drawing
is implemented in a rather disappointing manner. To overcome this shortcoming, further
research will focus on constructing a pre-trained model with a large number of Korean
paintings, which could enhance the overall quality of the style-transfer results, especially
implementing the meticulous brush stroke styles of Shin Saimdang. Furthermore, we
would try to construct the virtual museum through metaverse in future research for
introducing Korean traditional paintings efficiently.

5.2 Summary/Restatement
The deep-learning-based algorithm captured the focal characteristics of Hongdo's and
Shin's paintings and applied them to the original photos. Furthermore, this model suc-
cessfully provides an opportunity for the public to acknowledge and appreciate the art
styles of Kim Hongdo and Shin Saimdang through the different examples processed by
the algorithm. The original photos rendered in Hongdo's style were recreated with high
accuracy, replacing the rigid lines of the original photos with opaque and curvy brush
strokes. The overall color theme was also changed, using yellowish gouaches throughout
the painting. The model successfully recreated a new color palette for each picture based
on Shin Saimdang's use of vivid colors; the resulting images use a relatively larger palette
with colors such as violet, pink and cerulean. This model can thus serve as a new guideline
for those who want to understand Hongdo's and Shin's art styles, as well as the art history
of the Joseon period, since Kim Hongdo and Shin Saimdang were two of the most
influential artists of the Joseon dynasty.

Feature Extraction and Nuclei
Classification in Tissue Samples
of Colorectal Cancer

Boubakeur Boufama(B) , Sameer Akhtar Syed, and Imran Shafiq Ahmad

University of Windsor, Windsor, ON N9B 3P4, Canada


[email protected]

Abstract. Cancer is considered a major health risk and is ranked the
third most common cause of death in the USA. The American Cancer
Society (ACS) predicted that, by the end of 2020, there would be close to
2 million new cases and over half a million deaths in the USA. In partic-
ular, colorectal, breast, lung, and prostate cancers are the most danger-
ous cancers. This paper aims at providing new solutions for Computer-
Aided Diagnosis (CAD) of colorectal cancer, using feature extraction and
machine learning algorithms. In this paper, four well-known machine
learning techniques have been compared for classifying tissue categories:
Random Forest, Naive Bayes, Multilayer Perceptron and Support Vector
Machine. To measure the performance of these algorithms, we have used
precision, recall and F1-score. In particular, we have focused on the color
and morphological characteristics in the images and how they can be
used to improve the classification and diagnosis of colorectal cancer. We
believe that such an improvement represents a significant contribution
to the state of the art, in both quantitative and qualitative ways.

Keywords: Features detection · Machine learning · Colorectal
cancer · SVM · MLP · Random Forest · Naive Bayes

1 Introduction
Humans are able to perceive the three-dimensional (3D) nature of a two-
dimensional (2D) picture that depicts the 3D environment around us. Further-
more, we are able to recognize diverse objects and even to overcome some
optical illusions that may hinder our visual understanding. In other words,
our eyes are attracted by many visual cues, but the brain only focuses on a
small part of this visual information for interpretation purposes [4].
Computer/machine vision is a multi-disciplinary research field that aims
at making a machine capable of understanding the contents of images taken
by a variety of sensors. In particular, researchers in this field hope to
extract and interpret more information from images than what we,
humans, can perceive [2]. Furthermore, this has to be practical enough to be used
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 79–99, 2023.
https://doi.org/10.1007/978-3-031-18461-1_6

in real-life applications. Such systems should mix both the science (theory)
and the engineering aspects of vision. Examples of research topics in machine
vision include motion estimation, autonomous navigation, 3D reconstruction,
object/pattern recognition and augmented reality (AR). Given that an image is
a simplified representation of the real object, machine vision tasks are inherently
challenging and sensitive to a multitude of factors. Those challenges include
noise, lighting changes, resolution, occlusions, deformation of non-rigid objects
and intra-class variations [3].
This paper aims to contribute computer vision methods for assisting medical
diagnosis (CAD) and is inspired by [1]. As is common in this field, it follows a
scheme based on the extraction and combination of features obtained from
pre-processed images. These features are then used to create models for the
extraction of hidden information, making it possible to reach valuable outcomes
in performing CAD tasks.
First, we present the problem statement in Subsect. 1.1, connecting it to a
CADx contribution enabled by machine learning. Then, in Subsect. 1.2, we explain
supervised machine learning techniques for feature-based nuclei classification.

1.1 Problem Statement

The WHO (World Health Organization) states that cancer is the third cause of death
in the world and the second cause in developed countries, right after car-
diovascular diseases. According to GLOBOCAN (an online database for global
statistics on cancer), over 16 million Americans have or have had cancer. Factor-
ing in population growth and aging, this number is poised to exceed 22 million
by the year 2030. This report is published every three years by the American
Cancer Society and the National Cancer Institute [12]. These statistics
imply that 1 out of 5 males and 1 out of 6 females will develop cancer by the
age of 75. In particular, cancer will eventually kill 1 out of 8 males and 1 out of 12
females [15]. The most alarming fact is that the developed countries accounted
for 57% of the new cases and 65% of the total number of deaths.
Among men, three cancers stand out: prostate cancer (over 3.5 million),
colon/rectal cancer (over 750k) and skin melanoma (close to 700k). The three
dominant cancers in the female population are, in order: breast cancer
(close to 4 million), endometrial or uterine body cancer (over 800k), and colon/rectal
cancer (over 750k). According to [12], it is estimated that there will be over 22
million cancer survivors by the year 2030. According to [15], it is estimated that
57% (roughly 8 million) of all new cancer cases, 65% (roughly 5 million)
of cancer deaths and 48% (roughly 15 million) of living cancer patients will
occur in the developing countries.

1.2 Machine Learning (ML)

In machine learning, sample data from past experience is used to program com-
puters to produce optimal solutions for new instances of the same problem [7].

More precisely, statistical theory is used in machine learning to create models
that make sample-based inferences. There are two stages in machine learning
solutions. First, the training stage needs efficient algorithms for
optimization, storage and data processing. Second, in the learning stage, a model
is built based on some parameters. In other words, learning uses training
data to optimize the model parameters and obtain a model that can be used as a
prediction algorithm.
Overall, there are three categories of ML methods:

– Unsupervised: when labeled data is not available, this category of ML aims
at discovering hidden patterns and/or performing data clustering.
– Supervised: labeled data is available and is used for training/testing.
– Reinforcement: no labeled data is needed; a program interacts with an
environment in which it learns and decides how to act intelligently for
its next move by maximizing its reward [7]. Figure 1 summarizes
this categorisation.

Fig. 1. Types of machine learning algorithms (Image Acquired from https://www.geeksforgeeks.org/)

Furthermore, classification algorithms, trained on labeled data (supervised
learning), are used to categorize the output variable. That is, there are a number
of classes and, for each one, the model should decide yes or no. Once the training
is done, the model is tested on a subset of the data set (the test data) to predict
the output. Figure 2 depicts an example of supervised ML for the case of three
classes, where a model is trained with labeled data.

Fig. 2. Supervised and unsupervised ML (from https://www.javatpoint.com/supervised-machine-learning)

Classification in ML can be used in e-Health to create intelligent applications,
for example for early cancer diagnosis and prognosis. In particular, many cancer
researchers have been interested in the classification of cancer risk, i.e., high-
risk vs. low-risk, to prioritize medical treatment.

1.3 Feature Extraction


Digital pathology equipment is designed to handle pictures of tissue samples
coming from different microscopes. Several visual cues, such as shape and color,
are extracted from the regions of interest (ROI) of these images. Then, using
segmentation and feature extraction [5], these cues are used to make a better
diagnosis.
Furthermore, morphological descriptors carry size and shape information
on the target region/object. In particular, the glandular area can be used to
distinguish between cancerous and benign tumors, and the perimeter is another
measure that can be used to characterize the size of the segmented cells.
Learning from the experience of pathologists in the diagnosis of Oral Submucous
Fibrosis (OSF), eccentricity and the equivalent-area diameter are morphological
characteristics used to describe the cell nucleus. Classification can be carried
out using these kinds of characteristics. Furthermore, automated systems exist
for differential White Blood Cell (WBC) counting based on a number of
characteristics (19). The latter include perimeter, area, solidity, eccentricity,
convex area and orientation. These features have also been used to create
Content-Based Image Retrieval (CBIR) systems; such CBIR systems used
morphological similarity to find histopathology pictures of the prostate. Note,
however, that the accuracy of the obtained morphological measurements depends
on the quality of the segmentation.
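As an illustration of such measurements, the following minimal NumPy sketch derives area, perimeter, circularity and a moment-based eccentricity from a single segmented nucleus mask. The function name and the moment-based formulation are our own illustration, not code from the cited systems:

```python
import numpy as np

def morphological_features(mask):
    """Area, perimeter, circularity and eccentricity of a binary mask.

    `mask` is a 2-D boolean array containing a single segmented nucleus.
    The perimeter is approximated by counting foreground pixels that touch
    the background through a 4-neighbourhood.
    """
    mask = mask.astype(bool)
    area = int(mask.sum())

    # Boundary pixels: foreground pixels with at least one 4-neighbour outside.
    padded = np.pad(mask, 1)
    inner = padded[1:-1, 1:-1]
    all_neighbours_inside = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                             & padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = int((inner & ~all_neighbours_inside).sum())

    circularity = 4.0 * np.pi * area / perimeter ** 2 if perimeter else 0.0

    # Eccentricity from second-order central moments (ellipse fit).
    ys, xs = np.nonzero(mask)
    yc, xc = ys.mean(), xs.mean()
    mu20 = ((xs - xc) ** 2).mean()
    mu02 = ((ys - yc) ** 2).mean()
    mu11 = ((xs - xc) * (ys - yc)).mean()
    common = np.sqrt((mu20 - mu02) ** 2 + 4 * mu11 ** 2)
    lam1 = (mu20 + mu02 + common) / 2  # variance along the major axis
    lam2 = (mu20 + mu02 - common) / 2  # variance along the minor axis
    eccentricity = float(np.sqrt(1 - lam2 / lam1)) if lam1 > 0 else 0.0
    return {"area": area, "perimeter": perimeter,
            "circularity": circularity, "eccentricity": eccentricity}
```

For instance, a filled axis-aligned square yields an eccentricity of 0 and a circularity close to 1.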
Pixel intensity carries information on color (or gray level). This feature
extraction approach uses different color spaces. For example, the HSV (Hue,
Saturation, Value) color space can be obtained from the original color image
and the H-channel can be used as a feature. Another example is the use of
the (white, pink, purple) space. Using different color spaces allows us to compare
classification performances. Figure 3 illustrates an example of the values of a
perception-based feature for an epithelium and stroma tumour, obtained from
a histopathological slide from a colorectal cancer data set [6]. The latter contains
1332 pictures of tissue samples of epithelium and stroma. The figure also shows
the feature values.
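The H-channel extraction described above can be sketched as follows. This is an illustration using the standard-library `colorsys` conversion together with the first-order statistics (mean, median, skewness, kurtosis, energy, entropy, 8-bin histogram) that reappear in Sect. 3.2; the function name and layout are our own:

```python
import colorsys
import numpy as np

def h_channel_features(rgb_pixels, bins=8):
    """First-order statistics of the HSV hue channel of an RGB region.

    `rgb_pixels` is an (N, 3) array of RGB values in [0, 1] (a flattened
    region of interest).
    """
    h = np.array([colorsys.rgb_to_hsv(r, g, b)[0] for r, g, b in rgb_pixels])
    mu, sigma = h.mean(), h.std()
    centred = h - mu
    skewness = (centred ** 3).mean() / sigma ** 3 if sigma else 0.0
    kurtosis = (centred ** 4).mean() / sigma ** 4 if sigma else 0.0
    hist, _ = np.histogram(h, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    nz = p[p > 0]
    return {"mean": float(mu), "median": float(np.median(h)),
            "std": float(sigma), "skewness": float(skewness),
            "kurtosis": float(kurtosis),
            "energy": float((p ** 2).sum()),
            "entropy": float(-(nz * np.log2(nz)).sum()),
            "histogram": hist}
```

The same statistics can be computed on any other channel (RGB, YUV, grayscale) to build the full color-feature vector.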
The organization of this paper is as follows: Sect. 2 presents the state of the art in
image processing and the algorithms used in cancer detection. Next, Sect. 3
describes our methodology for addressing the decision question of whether a
tissue is cancerous (positive) or not. Then, Sect. 4 presents the obtained results
and illustrates the performance of the different algorithms we have employed to
classify the histopathological pictures. Section 5 summarizes
our contributions and concludes our research work.

Fig. 3. Sample image of epithelium and stroma tumours (Image Acquired from [6], http://fimm.webmicroscope.net/supplements/epistroma)

2 State of the Art


The acquisition of histopathology images is guided by a precise multi-step method-
ology. Figure 5 [14] explains the steps of this process: (1) a biological sample is
extracted from the human organ; (2) a process known as fixation is performed
on the biopsy for two reasons, i.e., ensuring the stability of the chemicals and
avoiding post-mortem changes in the tissues; (3) the sample is sliced into
sections to fit onto glass slides; (4) the sections are then stained in order to
expose the components of the cells through chemical reactions. Hematoxylin and
Eosin (H&E) are widely used dyes, staining cell nuclei purple or dark
blue and, cytoplasm and connective tissue bright pink, as illustrated in
Fig. 4; (5) the stained sections are finally observed and digitized using a
microscope.

Fig. 4. H&E and IHC stained images (Image Acquired from [5])

Fig. 5. The workflow diagram for the acquisition of histopathology pictures [14].

Because the above steps involve humans, there is a risk of variability resulting
in visual heterogeneity. At least three factors are worth mentioning here [14]:
(i) magnification: it depends on how the human operator adjusts the lenses
of the microscope; (ii) staining: it is meant to enhance the contrast of the sample;
(iii) slice orientation: even with the same staining, the slice orientation is
affected by how the cut is made (cross-section vs. longitudinal) and, consequently,
the appearance of the tissue varies.
Here is a typical workflow for a histopathology image [14]:

– Pre-Processing: this is needed to reduce noise and visual variability. In
particular, subsequent steps will benefit from it.
– Feature Extraction: it aims at creating an image representation that is
highly descriptive, where such information might not be directly visible in
the pixels.
– Pattern Recognition: it aims at detecting important and useful patterns
in the image. This can be achieved via supervised and/or unsupervised tech-
niques.

As previously mentioned, methods for solving cancer-related health issues
have seen a significant evolution over the past years. In particular, computer
vision research to help with medical diagnosis has been increasing, especially to
come up with noninvasive techniques for colorectal cancer diagnosis.
Consequently, this paper aims at analyzing histopathological images with
the objective of deciding whether a given tissue is cancerous or not. We have used
ML tools to automatically detect patterns of normal and abnormal tissues. This
work also falls within the field of histology, where the anatomy of biological
tissues is studied using microscopes. It supports our belief that histopathological
cancer images are central to understanding the biological structures of cells,
hence facilitating the diagnosis and analysis of the cancer disease [14].

3 Methodology
The classification model has been applied to two separate data sets. The perfor-
mance and robustness of the proposed model on these data sets are compared
and discussed in Sect. 4.3.
Here are the two data sets we have used:
– Data set I: This data set contains eight classes (described in Sect. 3.1). It
consists of 5000 rows and 161 visual features: 15 color features and around
150 morphological features.

– Data set II: It consists of 165 images obtained from 16 H&E stained his-
tological sections of colorectal adenocarcinoma at stage T3 or T4. They are
labeled as "Cancer" or "No Cancer", depending on their overall glandular
architecture.

3.1 Data Set Collection


Data Set I. We obtained this data set from the Medical Center of Mannheim
of the University of Heidelberg in Germany. It consists of eight different labels;
see Fig. 6. The selected classes are among the most representative images
in our data set. In particular, they exhibit the wide variations in stain intensity,
texture and illumination that are present in typical histopathological images [9].
The images come from 10 different samples of colorectal cancer (CRC) primary
tumours.

Fig. 6. The selected cancer/non-cancer tissue categories of this study

Classifying histopathological images as cancerous or not is key to diagnosis.
In addition, categorizing the cancer tissues is also important for the treatment.
This paper examines a data set of anonymized H&E stained Colorectal Cancer
(CRC) tissue slides (Fig. 6). This data set is already labeled (Table 1), making
it easy to use with supervised ML techniques [10]. In addition, contiguous
tissue regions were annotated manually and tessellated [9]. A total of 625 non-
overlapping tissue tiles were used here; their dimensions were 150 px × 150 px
(74 µm × 74 µm). To summarize, the required images correspond to previously
labeled histopathology images of hematoxylin-eosin (H&E) stained colorectal
cancer tissue. The data is covered by an MIT license [8].

Data Set II. This data set was obtained from the database of the "GlaS MICCAI'2015:
Gland Segmentation Challenge Contest". It contains a total of 165
images, obtained from 16 H&E stained histological sections at stages T3 or T4
of colorectal adenocarcinoma. Note that the most common colon cancer is
colorectal adenocarcinoma, which originates in intestinal glandular structures.
Pathologists use the morphology of intestinal glands, such as glandular formation
and architectural appearance, to establish a prognosis and to choose individual
treatments for patients [13].
Every section comes from a separate patient, as they were obtained on different
dates. The digitization of the histological sections to create a whole slide image

Table 1. Types of cancer we selected

Class  Name     Diagnostic  Description

1      Adipose  No cancer   Adipose tissue
2      Complex  Cancer      Containing single tumour cells and/or few immune cells
3      Debris   No cancer   Including necrosis, hemorrhage and mucus
4      Empty    No cancer   No tissue
5      Lympho   Cancer      Immune-cell conglomerates and sub-mucosal lymphoid follicles
6      Mucosa   No cancer   Normal mucosal glands
7      Stroma   Cancer      Homogeneous composition; includes tumour stroma, extra-tumoural stroma and smooth muscle
8      Tumor    Cancer      Tumour epithelium

(WSI) was done with the Zeiss MIRAX MIDI slide scanner, at a pixel resolution
of 0.465 µm. The WSIs were then rescaled to a pixel resolution of 0.620 µm
(equivalent to an objective magnification of 20X).
In order to cover the wide range of tissue variety, 52 visual fields of
benign/malignant regions across the whole set of WSIs were chosen. An expert
pathologist (DRJS) then classified every visual field as "malignant" or "benign"
based on the overall glandular architecture. In addition, the boundaries of each
individual glandular region were delineated by the pathologist (see Table 2).

Table 2. The selected types of cancer we have used

Class  Name            Diagnostic  Description

1      Adenocarcinoma  Cancer      Malignant tumours arising from glandular epithelium
2      Benign tissue   No cancer   Healthy or benign sample

3.2 Image Features

This phase concerns the extraction of morphological and color features. Table 3
describes these features, which are important to our model.

– Color Features
We use first-order statistics, i.e., median, mean, skewness, kurtosis, standard
deviation and a color histogram, to characterize the color features. These statistics
are calculated for different color spaces, such as RGB, HSV, YUV and grayscale.
Figure 7 describes the process of extracting this information for RGB (the
same kind of information is extracted from the other channels, like YUV
and HSV).

Table 3. The features of interest to our model

Color features              Morphological features*

Mean                        Area
Median                      Perimeter
Standard deviation          Circularity
Skewness                    Eccentricity
Kurtosis
Energy
Entropy
Color histogram (8 bins)

* For each of the morphological characteristics, a mean
and a standard deviation are obtained

Fig. 7. Color features extraction

– Morphological Features
The extraction of morphological features, on the other hand, is done via nuclei
segmentation, performed in the hematoxylin and eosin color space. The
hematoxylin color component is processed with the watershed algorithm to
detect the nuclei, as illustrated in Fig. 8. Furthermore, it is necessary to segment
the nuclei before estimating the above-mentioned statistics. Finally, the main
features, such as eccentricity, circularity, solidity, area and perimeter, are obtained.
After the above features are obtained, we mine the useful data. Using the
pandas library, we generated a clean data frame to prepare sparse features for
the next-stage classification algorithms. In the context of ML, the categorical
variables in our data sets are viewed as discrete entities that are coded as
feature vectors. "Dirty", non-organized data produces redundant categorical
variables, that is, several categories representing the same entity [11]. This can
be overcome by converting categorical data into numbers. In our case,
one hot encoding was used, where binary vectors represent the categorical
variables: the binary vector is all zeros except at the index of the corresponding
category, which is marked as 1. One hot encoding allows a more expressive
representation of categorical variables and eliminates redundancy. This enhances
the performance of our model.
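The one hot encoding just described can be sketched in a few lines of NumPy; the helper below is our own illustration (the paper's actual preprocessing uses pandas):

```python
import numpy as np

def one_hot(labels):
    """Encode categorical labels as binary vectors: all zeros except a
    single 1 at the index of the label's category."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(labels), len(categories)), dtype=int)
    for row, label in enumerate(labels):
        encoded[row, index[label]] = 1
    return encoded, categories
```

For example, `one_hot(["Tumor", "Stroma", "Tumor"])` returns the category list `["Stroma", "Tumor"]` and one row per sample with exactly one 1.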

Fig. 8. Nuclei segmentation - morphological extraction

Furthermore, we have decreased the size of our two data sets by ignoring
variables that do not carry relevant data for building our model (e.g., the
sample ID). We also evaluated whether the resulting data sets are balanced
and made sure that the classes have the same number of samples.
In the end, the obtained clean data sets are split into 80%/20% for training
and testing, respectively. Each set consists of random entries from the corre-
sponding clean data set. This way, we are able to evaluate the performance
and robustness of the classification methods.
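The 80%/20% random split can be sketched as follows; the helper name and the fixed seed are our own assumptions for reproducibility:

```python
import numpy as np

def train_test_split_80_20(X, y, seed=0):
    """Shuffle the cleaned data set and split it 80%/20% for training
    and testing, mirroring the protocol described above."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    cut = int(0.8 * len(X))
    train, test = order[:cut], order[cut:]
    return X[train], X[test], y[train], y[test]
```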

3.3 Cancer Classification Algorithms


We built a classification model for the diagnosis of colorectal cancer that is
based on ML. First, it is crucial to define the sets of inputs and labels for our
classification algorithms. The input set corresponds to the feature data set (prepared
as described in Sect. 3.2). The label set is determined by the number of classes of
each data set, i.e., eight classes for the first and two classes for the second data set
(see Sect. 3.1).
We have tested four supervised ML algorithms: Naive Bayes, Random
Forest, Support Vector Machine (SVM) and Multilayer Perceptron (MLP).
Naive Bayes Algorithm
It is based on Bayes' theorem with the assumption that the predictors are inde-
pendent. It is considered 'naive' because of this rigid independence assumption
among the input variables. In other words, it is assumed that the features are not
related to each other, and the presence of one does not affect the presence of
any other. The Naive Bayes classifier can be used as a benchmark model and
is easily trained. Because of its strong assumption, Naive Bayes performance
degrades when its predictors have dependencies.
We have used Naive Bayes through Python's sklearn library, by constructing a
classifier model with the "GaussianNB" command adjusted with default values
(no class priors specified and a variance-smoothing portion of the largest variance
of all features equal to 1e−9).
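As a concrete illustration, here is a minimal sketch of this GaussianNB setup; the toy feature values are invented, and var_smoothing=1e-9 is sklearn's default, matching the setting described above:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy stand-in for the feature table: one feature, two well-separated classes.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, 0, 0, 1, 1, 1])

# Default priors (estimated from the data) and the default var_smoothing
# of 1e-9: the portion of the largest variance of all features added for
# numerical stability.
model = GaussianNB(var_smoothing=1e-9).fit(X, y)
pred = model.predict([[0.05], [5.05]])
```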
Random Forest (RF)
It is a supervised classification algorithm. It consists of a large number of
decision trees operating as an ensemble: every individual tree predicts a class
and the one with the highest vote count is chosen as the model prediction.

We have also employed sklearn in the case of RF, where the "Random-
ForestClassifier" command was used with all default values except for two parameters:
the maximum depth of the trees and the randomness of the bootstrapping were
set to 50 and 0, respectively.
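A minimal sketch of this setup (the toy data is invented; the two non-default parameters follow the text, interpreting the bootstrapping randomness as the random seed):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the feature table (invented values).
X = np.array([[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]])
y = np.array([0, 0, 0, 1, 1, 1])

# Defaults everywhere except the two parameters mentioned in the text:
# maximum tree depth 50 and a fixed random seed of 0.
model = RandomForestClassifier(max_depth=50, random_state=0).fit(X, y)
pred = model.predict([[0.1], [5.1]])
```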
Support Vector Machine (SVM)
SVM is best suited for regression and classification analysis. The goal of SVM
is to determine a hyperplane, in an n-dimensional space, that separates the data
into two regions.
Sklearn is employed here as well, where the "make_pipeline" command was used
with default values (and gamma set to 'auto' for SVC).
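A minimal sketch of this pipeline; the text only specifies make_pipeline with defaults and gamma='auto' for SVC, so the StandardScaler step and the toy data are our assumptions of a typical usage:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy stand-in for the feature table (invented values).
X = np.array([[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]])
y = np.array([0, 0, 0, 1, 1, 1])

# make_pipeline with default values and gamma='auto' for SVC, as in the
# text; the StandardScaler step is our assumption, not stated in the paper.
model = make_pipeline(StandardScaler(), SVC(gamma="auto")).fit(X, y)
pred = model.predict([[0.1], [5.1]])
```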
Multilayer Perceptron (MLP)
This is an ANN consisting of several layers, meant to solve problems
with classes that are not linearly separable. An MLP is made of one input layer, one
or more intermediate (hidden) layers and one output layer.
The TensorFlow library was used to define the MLP architecture that we cre-
ated. In particular, our model has three sequential layers, customized with the
training parameters shown in Table 4, where "Dense" means "number of
neurons of that layer".

Table 4. Proposed parameters for our MLP (Training)

Parameter                          Value
Dense-1                            100
Dense-2                            50
Dense-3                            8
Layers 1, 2 - activation function  ReLU
Layer 3 - activation function      Softmax
Epochs                             30
Validation split                   10%
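For illustration, the hidden-layer sizes of Table 4 can be reproduced with scikit-learn's MLPClassifier as a lightweight stand-in for the TensorFlow model; the toy data, solver choice and seed are our assumptions, and the paper's actual model is a three-layer Keras network with a softmax output trained for 30 epochs with a 10% validation split:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy two-class stand-in for the feature table (invented values).
X = np.array([[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]])
y = np.array([0, 0, 0, 1, 1, 1])

# Two hidden ReLU layers of 100 and 50 units, mirroring Dense-1/Dense-2
# in Table 4; the output layer size follows the number of classes.
model = MLPClassifier(hidden_layer_sizes=(100, 50), activation="relu",
                      solver="lbfgs", random_state=0).fit(X, y)
pred = model.predict([[0.1], [5.1]])
```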

4 Results
4.1 Eight-Class Data Set: Results of the Classification
The above four ML algorithms were applied to the first data set (the eight-
class one). We aim here at comparing the performances of SVM, MLP, RF and
Naive Bayes.
The results of our experiments are shown in Table 5. As the table
shows, SVM yielded the best results among all four in terms of accuracy, precision
and recall. RF and SVM also perform best when the F1-score is used for
comparison. Overall, all four algorithms have performances exceeding 0.8, with
the lowest precision obtained by MLP at 0.89.

Table 5. Performance of the 4 ML algorithms with the 8-class data set

Algorithm                      Accuracy  Precision  Recall  F1-Score
Support Vector Machine (SVM)   0.98      0.98       0.98    0.98
Random Forest                  0.95      0.96       0.95    0.95
Naive-Bayes                    0.93      0.93       0.93    0.93
Multilayer Perceptron (MLP)    0.84      0.89       0.84    0.84

Fig. 9. Performance of the 4 ML algorithms with the 8-class data set

Figure 9 shows the 4-metric overall classification results for each ML
technique. This figure suggests that the best classifications are obtained by SVM
and RF, with scores exceeding 0.95. On the other hand, Naive Bayes scored
0.93 or below on each measurement, while MLP scored even lower, at around 0.85.
One can notice from Table 6 that, in addition to the F1-score, it is crucial to look
at precision and recall in order to get a good estimate of the performance of a
method. In addition, Table 6 provides some details about the classes giving our
model more trouble. In particular, some special patterns in histopathological
images make their analysis more challenging: for example, the poor precision of
MLP, at 0.64, for the 6th class. This tells us that MLP is not good at predicting
a true positive value for "Mucosa", a non-cancer class. Four values are under
0.70 in Table 6 (classes 2, 3, 5 and 6) for MLP.
SVM is a great computational diagnosis tool. For most of the metrics used,
SVM obtained scores that are close to 1. As a conclusion, we can say that SVM
can classify every class of the data set correctly.
We might also say that RF is fairly reliable. The model obtained only three
scores below 0.90. It is less computationally intensive and was able to reach a
score of 1 for three easy classes.
Naive Bayes achieved most of its scores above 0.90. However, it struggled with
classes 5 and 6. Comparing all four models, MLP was just acceptable; we believe
it is the least desirable model for CAD tools.
Figure 10 shows the per-class performances in terms of precision, to compare
results among the proposed models. Similarly, Fig. 11 illustrates the performances
in terms of recall. Last, Fig. 12 shows the performances using the F1-score.
Considering class 8, one can notice a singularity: all methods achieved very high
scores, all above 0.98. We can conclude that class 8 is an easy class to classify.

In addition to these scores, the confusion matrices of all proposed methods are
needed to further evaluate their performances.
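The per-class precision, recall and F1-score reported in Table 6 can be derived directly from such a confusion matrix. This NumPy sketch assumes every class occurs at least once among both the true labels and the predictions (otherwise a division by zero occurs):

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall and F1-score per class from a confusion matrix
    whose rows are true classes and whose columns are predicted classes."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                   # correctly classified samples
    precision = tp / cm.sum(axis=0)    # column sums: predicted per class
    recall = tp / cm.sum(axis=1)       # row sums: true samples per class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For a two-class matrix [[8, 2], [1, 9]], class 0 has precision 8/9, recall 0.8 and F1-score 16/19.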

Fig. 10. Performance of the 4 algorithms using precision - 8-class dataset.

Fig. 11. Performance of the 4 algorithms using recall - 8-class dataset.

First, Fig. 13 describes confusion matrix of MLP. One can notice that class-2
is confused with class-3. The reason could be that MLP model does not dis-
tinguish between shapes (morphological characteristics) for classes “Debris” (3)
and “Complex” (2). In addition, class-6 and class-5 are close. Because there are
no obvious common characteristics, one might conclude of a model deficiency at
validation error.
In contrast, the Naive Bayes confusion matrix (Fig. 14) suggests very good
classification results, with the classes clearly distinguished. Although class-5 and
class-6 are confused, this is not a major classification issue because both of them
are non-cancer classes.
From Fig. 15, we can see that RF confuses class-6, “Mucosa”, with class-1,
“Adipose”. This is due to the small variability in shapes between the “Adipose”
and “Mucosa” classes. Despite this, most samples were classified correctly, so we
can conclude that this is still an efficient model. Even though the model slightly
confuses these classes, neither is a “cancer” class; hence, this is not a major issue
for a CADx system. Figure 16 shows the results of SVM, which are robust in
classifying all classes.
92 B. Boufama et al.

Table 6. Class-wise performance of the proposed models

Method         Class  Precision  Recall  F1-Score

SVM              1      0.980     0.980    0.980
                 2      0.950     0.990    0.970
                 3      0.980     0.940    0.960
                 4      0.970     0.990    0.980
                 5      0.970     0.980    0.970
                 6      0.980     0.970    0.980
                 7      0.990     0.980    0.990
                 8      0.980     0.980    0.980
Random Forest    1      0.760     0.980    0.860
                 2      0.990     0.950    0.970
                 3      0.980     0.920    0.950
                 4      0.990     1.000    1.000
                 5      0.990     0.940    0.960
                 6      1.000     0.850    0.920
                 7      0.990     0.960    0.970
                 8      0.990     0.980    0.990
Naive-Bayes      1      0.980     0.950    0.970
                 2      0.910     0.930    0.920
                 3      0.900     0.920    0.910
                 4      0.980     0.990    0.980
                 5      0.890     0.880    0.880
                 6      0.810     0.870    0.840
                 7      0.980     0.920    0.950
                 8      0.980     0.980    0.980
MLP              1      1.000     0.770    0.870
                 2      1.000     0.452    0.690
                 3      0.660     0.980    0.790
                 4      0.850     0.980    0.910
                 5      0.990     0.650    0.780
                 6      0.640     0.990    0.780
                 7      0.970     0.820    0.890
                 8      1.000     0.980    0.990

Fig. 12. Performance of the 4 algorithms using f1-score - 8-class dataset

Fig. 13. MLP confusion matrix - 8-class dataset

Fig. 14. Naive Bayes confusion matrix - 8-class dataset

Fig. 15. Random Forest confusion matrix - 8-class dataset



Fig. 16. SVM confusion matrix - 8-class dataset

Overall, looking at all performance scores and confusion matrices, SVM seems
to be the best choice for this type of classification. Even if all four models are
classical techniques that have proven their performance, SVM is the best when
working with high dimensions.

4.2 Binary Data Set - Classification Results

The performance of SVM, MLP, RF and Naive Bayes is compared when trained
on the second data set (2-class).

Fig. 17. Performance of SVM, MLP, RF and Naive Bayes with the 2-class data set

The obtained results show that SVM and RF perform very well, with scores
exceeding 0.80. Figure 17 suggests that RF yields the best classification, with
scores over 0.90. On the other hand, MLP and Naive Bayes did poorly, with
scores below 0.70.
When selecting features, working with subsets of features, such as morphology
and color, can be useful to discover the relevant characteristics to be used for
classification. In the case of RF, all its scores on the performance metrics are
high, exceeding 0.85, whereas the scores of the other three algorithms fall in the
0.70–0.85 range. This suggests that RF is ranked first, followed by SVM, Naive
Bayes and MLP, respectively.
We have added Fig. 18 to investigate the detailed precision of the four models
tested on the 2-class (binary) data set. On the other hand, Fig. 19 depicts the
performance by recall, and Fig. 20 does the same for F1-Score. One can see that
the precision and recall scores, although variable among classes, still yield a high
F1-Score of 0.90. Every algorithm has difficulty with recall on class 0 (“no
cancer”), which prevents that recall from exceeding 0.750; the precision score
therefore has to compensate, and the resulting mean values differ markedly across
algorithms. Given the variability in the Precision, Recall and F1-Score values, we
can conclude that RF, Naive-Bayes, SVM and MLP do not have enough training
data. Therefore, we cannot say whether or not they are reliable and/or robust
when using only color and morphological features.
We also obtained the confusion matrices for these algorithms to evaluate
their performance in distinguishing between ‘cancer’ and ‘no-cancer’.

Fig. 18. Performance of the 4 models with precision - 2-class data set

Fig. 19. Performance of the 4 models with recall - 2-class data set

Unlike Subsect. 4.1, this data set is binary, consisting of two classes only:
“c0 = cancer” and “c1 = no-cancer”. The MLP confusion matrix, shown in
Fig. 21, indicates that the “cancer” class is easily confused, while “no cancer” is
well recognized by this model.
On the other hand, RF and SVM seem to perform better according to Fig. 24.
Both models are able to clearly distinguish between a cancer tissue sample and
a non-cancer one. However, RF has better precision than SVM.
Figure 22 suggests that Naive Bayes yields results comparable to RF. However,
Naive Bayes suffers from a higher level of confusion for the “no-cancer” class

Fig. 20. Performance of the 4 models with F1-score - 2-class data set

Fig. 21. MLP confusion matrix - binary dataset

Fig. 22. Naive Bayes confusion matrix - binary dataset

Fig. 23. Random Forest confusion matrix - binary dataset



Fig. 24. SVM confusion matrix - binary dataset.

(Fig. 23). RF emerges as the best-performing model when considering all obtained
classification scores (all metrics) as well as the confusion matrices. Its scores of
0.91 for accuracy and 0.89 for F1-score are high.

4.3 Comparison

First, looking at the results from Figs. 9 and 17, we can see that Naive Bayes
and MLP achieve performance scores exceeding 0.80 on the first data set
(8-class), but their scores fall to the 0.60–0.70 range on the second data set
(2-class). The results of these models could possibly be improved by further
customization. For instance, one could add a new layer to the MLP model or
change the activation function of its final layer; for Naive Bayes, one could
replace its “Gaussian” configuration with a “Multinomial” one. It is worth noting
that the second data set (2-class) is quite small, explaining in part why the
methods performed poorly (misclassifications).
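The customizations suggested above can be sketched with scikit-learn. Note that this is purely illustrative: the hyperparameters are assumptions, and scikit-learn's `MLPClassifier` only exposes the hidden-layer activation (its output activation is fixed), so the change shown for MLP is the extra hidden layer.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# MLP variant with an additional hidden layer (sizes are illustrative)
mlp = MLPClassifier(hidden_layer_sizes=(128, 64), activation="relu",
                    max_iter=500, random_state=0)

# Swapping the Gaussian configuration for a Multinomial one; MultinomialNB
# requires non-negative features (e.g. min-max scaled descriptors)
nb_gauss = GaussianNB()
nb_multi = MultinomialNB()

print(type(mlp).__name__, mlp.hidden_layer_sizes, type(nb_multi).__name__)
```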
Using the same Figs. 9 and 17, SVM easily distinguishes the classes from each
other on the 8-class data set (scoring 0.90). However, SVM does not perform as
well on the 2-class data set (scoring 0.83). RF, on the other hand, yields great
performance on both data sets: it achieves scores exceeding 0.90 for all metrics
and easily differentiates between the classes, as shown in the confusion matrix.
After careful analysis of the metrics, and to reduce misclassification when using
morphological and color features, especially on the binary data set, we believe
that the number of training samples should be increased.
Where a method's performance improves on the binary data set, even though
both the number of classes and the number of samples are reduced, this is very
likely due to the small size of the data set and of its validation set.
To summarize, considering the number of features detected properly, the current
results suggest that RF performs best on both large and small data sets. MLP
performs the worst on both the 8-class and 2-class data sets, mainly because the
number of samples is small in the context of neural networks.

5 Conclusion
Similar techniques have been used in the past to classify images in the context of
colorectal cancer classification. However, these techniques mainly used texture as
a feature. This paper has shown that other features, i.e., color and morphological
characteristics, are also very useful in the context of colorectal cancer
classification, paving the way for other possibilities to investigate the same kind
of images.
Among the four ML models, RF achieved the best performance when considering
Accuracy, Precision and Recall scores. These scores exceeded 0.95 for the 8-class
(first) data set and 0.90 for the 2-class (binary) data set. The proposed models
were tested on two very different data sets to investigate their computational
cost and classification performance. RF proved to yield high accuracy even when
the set of training data was smaller; in other words, RF has the capability to
increase the learner's generalization. To a certain extent, the performance of
SVM, a classical classification model, is comparable to that of RF.
In future work, the configuration could be improved, with the goal of increasing
precision and accuracy, through different kinds of experiments. In particular,
more experiments with more adenocarcinoma images, increasing the number of
samples, will likely improve the 2-class (binary) classification. Comparison with
other ML techniques and combining multiple models, like RF and SVM, could
lead to better results. In addition, one might explore other customizations, like
increasing the depth in RF, to enhance performance and robustness.

References
1. Syed, S.A.: Color and morphological features extraction and nuclei classification
in tissue samples of colorectal cancer (2021). Electronic Theses and Dissertations,
8539. https://scholar.uwindsor.ca/etd/8539
2. Bluteau, R.: Obstacle and change detection using monocular vision. Electronic
theses and Dissertations (2019)
3. Yang, M.H., Kriegman, D., Ahuja, N.: Detecting faces in images: a survey. Pattern
Anal. Mach. Intell. IEEE Trans. 24, 34–58 (2002)
4. Davies, E.: Computer Vision: Principles, Algorithms, Applications, Learning (2017)
5. Irshad, H., Veillard, A., Roux, L., Racoceanu, D.: Methods for nuclei detection,
segmentation and classification in digital histopathology: a review current status
and future potential. IEEE Rev. Biomed. Eng. 7, 97–114 (2014)
6. Jain, A.K., Lal, S.: Feature extraction of normalized colorectal cancer histopathol-
ogy images. In: Hu, Y.-C., Tiwari, S., Mishra, K.K., Trivedi, M.C. (eds.) Ambient
Communications and Computer Systems. AISC, vol. 904, pp. 473–486. Springer,
Singapore (2019). https://doi.org/10.1007/978-981-13-5934-7_42
7. Alpaydin, E.: Introduction to Machine Learning, 2nd edn. The MIT Press, Cam-
bridge (2010)
8. Xu, Y., Zhu, J.Y., Chang, E., et al.: Weakly supervised histopathology cancer
image segmentation and classification. Med. Image Anal. 18(3), 591–604 (2014)

9. Kather, J.N., Weis, C.A., Bianconi, F., et al.: Multi-class texture analysis in
colorectal cancer histology. Sci. Rep. 6, 27988 (2016)
10. Kather, J.N., Marx, A., Reyes-Aldasoro, C.C., et al.: Continuous representation
of tumor microvessel density and detection of angiogenic hotspots in histological
whole-slide images. Oncotarget 6, 19163–19176 (2015)
11. Cerda, P., Varoquaux, G., Kégl, B.: Similarity encoding for learning with dirty
categorical variables. Mach. Learn. 107(8–10), 1477–1494 (2018)
12. Miller, K.D., Nogueira, L., Mariotto, A.B., et al.: Cancer treatment and
survivorship statistics, 2019. CA Cancer J. Clin. 69, 363–385 (2019)
13. Sirinukunwattana, K., Snead, D.R.J., Rajpoot, N.M.: A stochastic polygons model
for glandular structures in colon histology images. IEEE Trans. Med. Imaging 34,
2366–2378 (2015)
14. Arévalo, J., Cruz-Roa, A., González, F.: Histopathology image representation for
automatic analysis: a state of the art review. Revista Med. 22(2) (2014)
15. Wild, C.P., Stewart, B.W. (eds.): World Cancer Report 2014, pp. 482–494. World
Health Organization, Geneva, Switzerland (2014)
A Compact Spectral Model for Convolutional Neural Network

Sayed Omid Ayat(1), Shahriyar Masud Rizvi(2)(B), Hamdan Abdellatef(3),
Ab Al-Hadi Ab Rahman(2), and Shahidatul Sadiah Abdul Manan(2)

(1) Advanced Technologies Incubator Centre, Sharif University of Technology,
    Tehran, Iran
(2) School of Electrical Engineering, Universiti Teknologi Malaysia, Johor Bahru,
    Johor, Malaysia
    [email protected], {hadi,shahidatulsadiah}@utm.my
(3) Department of Electrical and Computer Engineering, Lebanese American
    University, Byblos, Lebanon
    [email protected]

Abstract. The convolutional neural network (CNN) has gained widespread
adoption in computer vision (CV) applications in recent years.
However, the high computational complexity of spatial (conventional)
CNNs makes real-time deployment in CV applications difficult. Spectral
representation (frequency domain) is one of the most effective ways to
reduce the large computational workload in CNN models, and thus ben-
eficial for any processing platform. By reducing the size of feature maps,
a compact spectral CNN model is proposed and developed in this paper
by utilizing just the lower frequency components of the feature maps.
When compared to similar models in the spatial domain, the proposed
compact spectral CNN model achieves at least 24.11× and 4.96× faster
classification speed on AT&T face recognition and MNIST digit/fashion
classification datasets, respectively.

Keywords: Convolutional neural network (CNN) · Spectral domain CNN

1 Introduction

The convolutional neural network (CNN) is a machine learning model that has
been successful in handling complex problems in Computer Vision (CV). It is
one of the most accurate solutions for image recognition tasks, and it achieves
this by leveraging the input image’s intrinsic invariance to motion, rotation,
and deformation [1]. CNN is suitable to deal with non-linear problems since it
can learn the features and patterns of the given data. It has been successfully
used in a wide range of applications, including natural language processing [2],
document processing [3], financial forecasting [4], face detection and recognition
[5], speech recognition [6], monitoring and surveillance [7,8], image classification
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 100–120, 2023.
https://doi.org/10.1007/978-3-031-18461-1_7
[9,10], autonomous robot vision [11], and character recognition [12]. These appli-
cations mainly utilize deep learning algorithms that can autonomously extract
features from input data to achieve high accuracy.
One of the most significant challenges of CNN is the computational cost of
executing them [13,14]. This is especially true in view of the rapid increase in
the usage of big data in web servers, as well as the large number of samples to be
classified in the cloud. The growing number of datasets and the complexity of
CNN models place a significant computing burden on any processing platform. The
higher precision required in real-world applications has raised the computational
difficulty of CNNs, in addition to the issue of high data complexity. To achieve
the high accuracy requirements of today’s practical recognition applications,
CNNs must be larger and deeper, thus requiring more processing resources.
The most computationally intensive part of a CNN is its convolution
layers [13] (hereafter referred to as CONV layers). In the inference (classification)
phase, spectral representation can significantly speed up the computation of con-
volutions [15–18]. The convenient property of operator duality between convolu-
tion in the spatial domain and Element-Wise Matrix Multiplication (EWMM) in
the spectral (frequency) domain allows convolution to be computed in spectral
domain with higher computational efficiency. Computing CONV layers in the
spectral domain can significantly lower their computational cost from
O(N² × K²) to O(N²), where N is the width of the input feature map and K
is the width of the kernel, thanks to this operator duality.
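To make the reduction concrete, a quick back-of-envelope count of multiplications per output feature map (transform costs ignored; the sizes N = 28, K = 5 are illustrative, not figures from the paper):

```python
# Multiplication counts per output feature map, illustrating the
# O(N^2 K^2) -> O(N^2) reduction (FFT transform costs are ignored here).
def spatial_mults(N, K):
    # Sliding-window convolution: one K*K dot product per valid output pixel
    return (N - K + 1) ** 2 * K ** 2

def spectral_mults(N):
    # Element-wise matrix multiplication of two N x N spectra
    return N ** 2

N, K = 28, 5
print(spatial_mults(N, K))   # 24*24*25 = 14400
print(spectral_mults(N))     # 784
```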
Since high computational complexity (CC) of existing spatial or spectral
CNN is an obstacle for real-time deployment in embedded applications, this
research aims to provide efficient optimizations for reducing the computational
burden of CNNs. More specifically, the objective is to eliminate unnecessary com-
plex computations in different layers of the spectral CNN model while retaining
an acceptable level of accuracy. These optimizations make CNNs more suitable
for implementation in resource-constrained systems.
Existing spectral CNN models can be memory-intensive due to larger-sized
kernels and multiple spectral-spatial domain transformations, necessitating a lot
of memory usage and processing power [1,16]. The training of these networks is typ-
ically conducted offline to save energy and remove some computing strain from
embedded devices, and only the time-sensitive task of classification (testing) is
implemented on the target system [19,20]. Therefore, the goal of this research
is to increase the run-time speed of testing (classification) algorithm, as mea-
sured in milliseconds (ms), while maintaining acceptable recognition accuracy,
as evaluated by the misclassification error rate (MCR).
The following are the primary contributions of this work: we present a
compact spectral convolutional neural network model with a smaller feature
map (FM) size that has minimal CC and higher classification speed. The pro-
posed spectral CNN model outperforms conventional models by 24.11× and
4.96× in terms of classification speed on AT&T face recognition and MNIST
(Mixed National Institute of Standards and Technology) digit/fashion classifi-
cation datasets, respectively, when compared to the equivalent network in the
spatial domain. Furthermore, on the MNIST dataset, accuracy benchmarking
against state-of-the-art strategies for CC reduction in CNN models demonstrates
that the proposed model is more effective than other approaches.
The paper is organized as follows. Section 2 covers the literature review on
the different approaches to reduce CC of CNN models. Section 3 presents the
proposed model. Section 4 describes the experimental design to verify the per-
formance of the proposed model. Section 5 presents the results and also bench-
marking of the proposed model against previous CNN models. The final section
concludes the work and suggests possible future work.

2 Related Work
The size and depth of state-of-the-art CNNs have grown in response to the rapid
increase in today’s datasets and the higher accuracy demands of new applica-
tions. On the other hand, most hand-held devices that use artificial intelligence,
such as smartphones, are becoming smaller and more energy-efficient. These
trends necessitate algorithmic optimizations that are hardware friendly in order
to lower the computational workload required to run these compute-intensive
algorithms [21].
Singular Value Decomposition (SVD) is one of the techniques in linear alge-
bra that has been used extensively to reduce the CC of the network. Its use for
this purpose was first proposed by Denton et al. [22] for object recognition
systems, and it works by factorizing the input matrix. Another popular compression method
is the pruning technique employed by Han et al. [23] to decrease the complexity of
CNN. This technique is based on the idea that some neurons have less contribu-
tion to the network performance and, therefore, can be removed. These methods
are hard to implement in real-world applications and may result in accuracy loss
if overused [24]. Another work in [25] applied data quantization (or precision
scaling) method that dynamically changes the bit-width of the data from 4-bits
to 16-bits in different CNN layers.
The binarized neural network (BNN) is a commonly used approach for reducing
the number of computations [26–29]. The goal here is to cut down on the num-
ber of bits used to represent activation (output FMs of CONV layers) or ker-
nels (filter weights). As a result, despite binarized convolution having a CC of
O(N² × K²), the storage requirements as well as the computational cost of the
CNN model are significantly reduced [30]. For instance, the work in [27] pro-
posed the use of one/two-bit data format in their CNN model that makes a
considerable reduction in the computation cost and therefore the run-time of
the algorithm. The classification time of this model on MNIST dataset is almost
seven times faster than its baseline implementation. In addition to fewer com-
putations and faster operations, BNNs also improve the energy efficiency of the
embedded hardware as proven in [31].
Another approach is to use stochastic computing (SC), which is suitable for
area optimized hardware implementations [32–35]. In this method, a sequence
of randomly generated bit-streams is used to represent the actual number in the
deterministic domain. This method is inspired by the sequential behavior of the
human brain on its neural spike trains [36]. This form of data representation will
facilitate the use of low-cost and hardware-friendly logic elements to perform
the deterministic operations [37]. The CC of convolution in SC-CNNs is
O(N² × K² × L), where L is the length of the probabilistic input bit-stream.
A previous study [38] has shown that the CONV and Fully Connected (FC)
layers in a CNN can be transformed into a form of General Matrix
Multiplications (GEMMs). This representation allows the authors of [38] to
leverage the optimized OpenCL libraries targeted at efficient implementation on
Graphics Processing Units (GPUs). The matrix form of the CNN model is
derived by rearranging the fully unrolled loops in the CONV and FC layers into
the Toeplitz matrix [39]. However, as reported by Sze et al. [40], the GEMM
technique can produce unnecessary computations in the first layer of the CNN,
which can be considered a major drawback of this approach.
Another interesting approach to reduce the CC of CNN models is to apply
the Winograd transform, introduced by Shmuel Winograd and Don Coppersmith
[41]. This minimal filtering algorithm, also called Coppersmith–Winograd, was
intended to be applied to CONV layers with kernel sizes ≤ 3. Lavin and Gray
[42] have shown that this method can achieve a 7.28× speedup over the GEMM
technique in the classification time of the VGG16 network model. In comparison
to the conventional convolution procedure, which requires N² × K² operations,
the Winograd algorithm involves only (N + K − 1)² multiplications. In return,
the number of addition operations is higher than in the conventional method, as
shown by Lavin and Gray in [42].
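The multiplication savings can be checked with a small sketch derived from the two operation counts just stated; the tile size F(2×2, 3×3) is a common illustrative case, not a figure taken from [42]:

```python
# Multiplication counts for one output tile: conventional convolution needs
# N^2 * K^2 multiplications, while the Winograd minimal-filtering algorithm
# needs only (N + K - 1)^2, at the cost of extra additions.
def conventional(N, K):
    return N ** 2 * K ** 2

def winograd(N, K):
    return (N + K - 1) ** 2

# F(2x2, 3x3): a 2x2 output tile computed with a 3x3 kernel
print(conventional(2, 3))  # 36 multiplications
print(winograd(2, 3))      # 16 multiplications
```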
Computing CNN in the spectral domain is another effective way of reducing
the CC of CNN models. Unlike SC and binarization approaches, which optimizes
an application based on its data-level representation, spectral domain approaches
optimize CNNs at the algorithmic level. In this approach, both input images
and weights are converted to complex-valued spectral domain or Fourier space
through the use of the fast Fourier transform (FFT). In spectral domain, infor-
mation is represented based on its frequency components. These complex-valued
data contains a real and an imaginary part when rectangular format is employed.
It can also be represented in polar format with an amplitude and a phase com-
ponent. Data represented in one of these two methods can be easily converted
from one format to the other. polar format has a computationally simpler imple-
mentation for multiplication, which requires one adder and one multiplier. On
the other hand, a multiplication in rectangular format requires four multipliers
and two adders. However, computing the addition operation in the polar format
requires expensive trigonometric operators and hence, the rectangular format is
typically preferred over the polar format.
The most significant advantage of computing CNN in spectral domain is
that it converts the convolution operation into a much simpler form of EWMM,
exhibiting a CC of just O(N²). Computing EWMM is much faster on any com-
puting platform such as Central Processing Unit (CPU) [43], Graphics Pro-
cessing Unit (GPU) [16], or even Field Programmable Gate Arrays (FPGA)
[19,24,30,44]. Another benefit of using the spectral representation is that it is
independent of the input kernel (filter) size. In other words, the kernel matrix is
converted to the same size as the input matrix after applying the FFT, as
EWMM requires both the input and the kernel to have the same size.
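This duality, and the zero-padding it implies, can be verified with a few lines of NumPy; the sizes are illustrative, and `np.fft` stands in for whatever FFT implementation a real system would use:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # input FM (N = 8)
w = rng.standard_normal((3, 3))   # kernel  (K = 3)

# Direct linear convolution (full output, size N + K - 1 = 10)
full = np.zeros((10, 10))
for i in range(8):
    for j in range(8):
        full[i:i + 3, j:j + 3] += x[i, j] * w

# FFT route: pad both operands to N + K - 1, element-wise multiply the
# spectra (EWMM), then invert the transform
P = 10
spec = np.fft.fft2(x, (P, P)) * np.fft.fft2(w, (P, P))
via_fft = np.real(np.fft.ifft2(spec))

print(np.allclose(full, via_fft))  # True
```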
Many state-of-the-art approaches for accelerating inference computation
optimize data-level representations of weights or FMs through compression or
quantization, rather than optimizing the number of computations at the algo-
rithmic level. For instance, Niu et al. [45] prune weights of CONV layers resulting
in sparse kernels. The authors initially constrain the number of non-zero weights
based on the target sparsity and retrain their model to improve accuracy. Sun
et al. [46] perform fixed-point quantization of weights and analyze the quantiza-
tion errors of different CONV layers resulting from quantization. Based on the
quantization errors, the authors specify optimal bit-widths for different CONV
layers. Guan et al. [47] compress FMs so that they can be stored in a sparse
manner. They retain FM values that are above a configurable threshold while
zeroing out the remaining values. This results in sparse storage of FMs. It is
worth noting that all these above-mentioned approaches reduce computational
or memory costs at the expense of some loss in accuracy. In fact, the more com-
putational or memory costs are minimized under these approaches, the more
significant is the loss of accuracy. Furthermore, most of these data-level opti-
mizations require a dedicated hardware accelerator to make the most of these
approaches. This present work reduces the CC of spectral domain CNN mod-
els even further than state-of-the-art approaches using extremely compact FM
sizes and does so with minor or no loss in accuracy, making spectral CNNs more
appealing for resource-constrained environments.

3 Proposed Baseline Spectral CNN Model

The proposed spectral CNN model for handwritten digit classification using
the MNIST datasets in the spectral domain is shown in Fig. 1. This model is
referred to as CNN3 in this work and it goes through two steps of computational
reduction. The first step is based on the fusion technique proposed in [15], and
the resulting CNN is labeled as CNN2. The second step in computation reduction
(that creates CNN3 model from CNN2) is to compact the input FM sizes, and is
explained further in Sect. 4. Compared to the conventional spatial CNN model
(denoted as CNN1) depicted in Fig. 2, CNN3 has fused layers, smaller FM sizes,
and executes CNN computations in the spectral domain.
In our proposed spectral CNN model in Fig. 1, the entire feature-extraction
segment including the CONV layers C1, C2, and C3 (and associated pooling and
activation layers) are performed in the spectral domain. After layer C3, we apply
the IFFT so that the classification segment (involving the fully connected and
softmax layers, F4 and X5) is computed in the spatial domain. Therefore, the layers in the
classification segment are computed with real numbers, rather than complex-
valued numbers. The spectral rectified linear unit (SReLU) is employed here to
prevent multiple domain switching, similar to the spectral CNN model proposed
in [15]. These methods are summarized in this paper, and readers are advised
to refer to the original paper for more information.

[Figure 1 diagram: Input 1@3×3 → C1 20@3×3 → C2 50@3×3 → C3 150@1×1 → F4 10@1×1 (full connection) → X5 10@1×1 (softmax); all three CONV layers use 3×3 EWMM]

Fig. 1. Proposed spectral CNN model with reduced FM sizes (labeled as CNN3) for
handwritten digit classification

[Figure 2 diagram: Input 1@28×28 → 5×5 convolution → C1 20@24×24 → 2×2 max-pooling → S1 20@12×12 → 5×5 convolution → C2 50@8×8 → 2×2 max-pooling → S2 50@4×4 → 4×4 convolution → C3 150@1×1 → full connection → F4 10@1×1 → softmax → X5 10@1×1]

Fig. 2. CNN Model for handwritten digit classification using the conventional spatial
model (labeled as CNN1)

The CNN architecture consists of several layers assembled together to form a
unique network, and each layer in this structure has its specific job. In this
section, we take a look at these underlying layers and briefly discuss each one's
tasks and algorithms. The spectral representation of each layer is also provided.

3.1 Convolution Layer

Convolution layers play an essential role in the CNN architecture, as the CNN
name implies. These layers are the main memory of the network: they extract
information from the input data and save it in the kernel weight parameters.
Therefore, their main job is feature extraction from the raw data given to them.
Deep learning algorithms usually employ multiple CONV layers in their
structures, each designed to extract features at various levels. The early layers
are responsible for extracting superficial information such as curves and edges.
As we go deeper into the network, the CONV layers extract higher-level
information such as semi-circles and squares [48], and also cover more area of
the input pixel space.

Algorithm 1. Convolution Layer in CNN.

1: for ( rw = 0 ; rw < R ; rw ++ ) {
2:   for ( cl = 0 ; cl < C ; cl ++ ) {
3:     for ( to = 0 ; to < M ; to ++ ) {
4:       for ( ti = 0 ; ti < N ; ti ++ ) {
5:         for ( i = 0 ; i < K ; i ++ ) {
6:           for ( j = 0 ; j < K ; j ++ ) {
7:             Output FM [to] [rw] [cl] + =
8:               Weights [to] [ti] [i] [j] ×
9:               Input FM [ti] [rw + i] [cl + j];
10: } } } } } }

Algorithm 1 shows the computation steps involved in these layers, which can
be expressed as multiple nested loops covering the whole input FM, output FM,
and kernels. These nested loops are the main reason why CONV layers are the
most computation-intensive part of the network, as they can occupy 90% of the
computation of the whole network [19,44].
After the input FM (x) has gone through the EWMM process with the associated
trainable weight kernel (w), each output FM (y) is computed as the sum over the
input FMs. As shown in Eq. 1, this concept is used to implement the
compute-intensive CNN CONV layers as a simple EWMM in the Fourier domain:

Yj = Σ_{i=1}^{I} Xi · Wi,j = F( Σ_{i=1}^{I} xi ∗ wi,j )    (1)

where i and j are the indexes of the input FMs and output FMs, respectively.
Algorithm 2 demonstrates the operations involved in the EWMM, where C1 nFMs
and C2 nFMs are the FM counts in layers C1 and C2, R and C are the row and
column sizes of the FM, and K is the kernel size in the convolution operation.
In this paper, capital letters are used to denote the Fourier transform of the
original signal, while lower-case letters represent the original FMs in the spatial
domain.

Algorithm 2. EWMM Operations in Convolution Layer (C2).

1: for ( to = 0 ; to < C2 nFMs ; to ++ ) {
2:   for ( ti = 0 ; ti < C1 nFMs ; ti ++ ) {
3:     for ( rw = 0 ; rw < R ; rw ++ ) {
4:       for ( cl = 0 ; cl < C ; cl ++ ) {
5:         Output FM [to] [rw] [cl] + =
6:           Weights [to] [ti] [rw] [cl] ×
7:           Input FM [ti] [rw] [cl];
8: } } } }
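For intuition, Algorithm 2's loop nest is a channel-summed element-wise product, which collapses to a single einsum in NumPy; the shapes below are illustrative, not the paper's layer sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
C1_nFMs, C2_nFMs, R, C = 4, 6, 8, 8
In = (rng.standard_normal((C1_nFMs, R, C))
      + 1j * rng.standard_normal((C1_nFMs, R, C)))          # input spectra
Wt = (rng.standard_normal((C2_nFMs, C1_nFMs, R, C))
      + 1j * rng.standard_normal((C2_nFMs, C1_nFMs, R, C)))  # kernel spectra

# Loop form, as in Algorithm 2
Out = np.zeros((C2_nFMs, R, C), dtype=complex)
for to in range(C2_nFMs):
    for ti in range(C1_nFMs):
        Out[to] += Wt[to, ti] * In[ti]

# Vectorized equivalent: sum over the input-channel index
Out2 = np.einsum('oirc,irc->orc', Wt, In)
print(np.allclose(Out, Out2))  # True
```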

3.2 Sub-sampling Layer

The sub-sampling layers, also known as pooling layers, are also fundamental
operations in the CNN. Their primary function is to reduce the dimension of
the input FM and send a higher-level abstraction of the data to the next
successive CONV layer. The most well-known type of pooling layer applies the
Max function to input sub-matrices of size 2 × 2; the output FM is then 75%
smaller, and consequently the CC of the next layer is much lower than that of
the previous layer. Another popular type of pooling layer is based on averaging,
i.e., the resulting pixel is the average value of all pixels in the input window.
The spectral pooling method [43] is adapted here to produce a down-sampled
approximation of the input in the pooling layer, because Max pooling and
averaging cannot be applied directly in the spectral domain. The frequency
representation is cropped by keeping only the top-left H × W sub-matrix of
frequencies in the lower frequency spectrum:

Input:  X ∈ C^(M×N)
Output: Y ← Crop(X, H × W),  Y ∈ C^(H×W)    (2)
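The crop in Eq. 2 amounts to slicing out the low-frequency corner of the spectrum; a minimal sketch (the function name is assumed, and the DC component is assumed to sit at index (0, 0)):

```python
def spectral_crop(X, H, W):
    """Spectral pooling: keep only the top-left H x W sub-matrix of an
    M x N spectrum X (nested lists), i.e. the lowest frequencies when
    the DC component is at index (0, 0)."""
    return [row[:W] for row in X[:H]]
```

Unlike Max pooling, no comparison is performed; the down-sampling is a pure truncation of high-frequency content, which is what gives the method its low-pass-filter behavior.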

3.3 Activation Function


The activation function in a CNN simulates the behavior of the electrochemical
reactions inside the axon of a neuron. The electrical signal at the axon
hillock is produced when the result of the electrochemical reactions inside
the axon reaches a certain threshold level [1]. This behavior has been emulated
by a wide variety of activation functions in the literature. In this work, the
SReLU of Eq. 3 is used because it prevents multiple spatial-spectral domain
switchings, which are computationally very costly.

f(x) = c0 + c1 · x + c2 · x²    (3)
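Since Eq. 3 is a plain second-order polynomial, it can be evaluated without switching domains; a sketch with placeholder coefficients (the trained values of c0, c1 and c2 are not given here, and the defaults below are purely illustrative):

```python
def srelu(x, c0=0.0, c1=0.5, c2=0.25):
    """Quadratic SReLU activation f(x) = c0 + c1*x + c2*x**2.
    The default coefficients are illustrative placeholders; in practice
    they are chosen (or learned) so the polynomial approximates ReLU."""
    return c0 + c1 * x + c2 * x * x
```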

3.4 Batch Normalization Layer


Batch normalization is a process to boost CNN’s functionality, accelerate learn-
ing, and improve stability. It was first introduced in 2015 by Ioffe et al. [49] and
108 S. O. Ayat et al.

is used to adjust pre-activation values via a transformation matrix. Batch
normalization allows the network to train at much higher learning rates and
makes it less sensitive to weight initialization.
Normalization is simply a scaling and shifting operation applied to each input
value, since the mean (μ_B) and variance (σ_B²) do not change:


X̂ ← ( X_i(0, 0) − (μ_B × M × N) ) / √(σ_B² + ε),   for i = 1, …, m_b    (4)

Y_i ← γ X̂ + DC    (5)

DC ← X̂_new = X̂_old + (β × δ(u, v) × M × N) ≡ BN_{γ,β}(X_i)    (6)
As in the original work [49], the scale and shift parameters γ and β are
determined using the back-propagation step in the spatial domain. As proposed
in [15], we have added a BN layer after each CONV layer. All activation
functions thus benefit from a uniformly shaped, continuous, and stable
distribution over their optimal values.
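For reference, the spatial-domain batch normalization of [49] that Eqs. 4–6 transplant to the spectral domain can be sketched over a mini-batch of scalars (this illustrates the standard transform, not the paper's DC-component variant):

```python
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standard batch normalization (Ioffe & Szegedy, 2015): normalize a
    mini-batch to zero mean and unit variance, then scale by gamma and
    shift by beta."""
    m = len(batch)
    mu = sum(batch) / m                          # mini-batch mean
    var = sum((x - mu) ** 2 for x in batch) / m  # mini-batch variance
    return [gamma * (x - mu) / (var + eps) ** 0.5 + beta for x in batch]
```

In the spectral domain, the mean subtraction and the β shift only touch the DC term X(0, 0), scaled by M × N, which is why Eqs. 4 and 6 single out that component.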

4 Proposed FM Size Reduction for Compact CNN Model


The FM sizes of the baseline spectral CNN model (CNN2) were mainly based
on the conventional CNN model in the spatial domain (CNN1). However, the
spectral CNN model proposed in this work (CNN3) differs from the baseline
model (CNN2), with significantly different FM sizes compared to both CNN1
and CNN2. The structure of the proposed CNN3, after applying the size
reduction to its input FMs, is shown in Table 1. In this section, we describe
methods to find the optimum FM sizes that best suit the proposed spectral
CNN model, without the constraint of following the exact FM sizes specified in
the spatial model (CNN1) or the baseline spectral model (CNN2).
To apply the FM size reduction in CNN3, we only need to reduce the input
FM size in the first CONV layer, C1. This layer has the largest FM size
among all CONV layers of the network. We start our experiments from the
minimum possible change (11 × 11) and proceed to the maximum possible change
(1 × 1). Every time a change is made to the network, the network is retrained
and its classification accuracy is measured. Experimental results (provided in
the results section of this paper) show that there is little difference in the
classification accuracy of the network down to a size of 3 × 3. As we continue
the experiments, there is a significant accuracy drop when we shrink the
network further to 2 × 2 and 1 × 1 FM sizes. Therefore, for the final
implementation of our spectral model, we choose this compact version of the
network (CNN3 with an FM size of 3 × 3) to achieve faster classification on the
target device. It is worth noting that after the FM size reduction in the first
CONV layer (C1), further FM size reduction through pooling layers is not
necessary, as shown in Table 1.
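The procedure above is a simple sweep over candidate sizes; a sketch with the retrain-and-evaluate step stubbed out (the accuracies below are hypothetical and merely mimic the reported trend of a flat curve down to 3 × 3 followed by a sharp drop):

```python
def pick_fm_size(evaluate, sizes=range(11, 0, -1), max_drop=1.0):
    """Sweep input-FM sizes from 11x11 down to 1x1; retrain/evaluate at
    each size and return the smallest size whose accuracy stays within
    max_drop (percentage points) of the best accuracy seen so far."""
    best, chosen = None, None
    for s in sizes:
        acc = evaluate(s)  # retrain the network with an s x s input FM, then test
        best = acc if best is None else max(best, acc)
        if best - acc <= max_drop:
            chosen = s
    return chosen

# Hypothetical accuracies standing in for real training runs.
mock_accuracy = {11: 99.3, 10: 99.3, 9: 99.3, 8: 99.2, 7: 99.2, 6: 99.2,
                 5: 99.1, 4: 99.0, 3: 98.9, 2: 90.0, 1: 10.0}
smallest_good = pick_fm_size(lambda s: mock_accuracy[s])
```

With these stand-in numbers the sweep settles on 3 × 3, matching the choice made for CNN3.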
To visualize the effect of this FM size reduction on the CNN performance even
before running the experimental work, we applied a low-pass filter (LPF) with
different values of cutoff frequencies on some randomly picked samples of the
MNIST and AT&T datasets. Figure 3 shows the effect of applying these LPFs,
from the original size of 12 × 12 used in [15] down to the smallest possible
size of 1 × 1. Figure 3 gives an intuition of what information the compact
spectral CNN may receive from the input image when the FM size is shrunk.

Table 1. FM size reduction involved in the spectral CNN for the digit classification task

Layer  Type     Spatial CNN (CNN1) [15]   Spectral CNN (CNN2)   Proposed spectral model (CNN3)
I0     Input    1 × 28 × 28               1 × 12 × 12           1 × 3 × 3
C1     Conv     1 × 20 × 28 × 28          1 × 20 × 12 × 12      1 × 20 × 3 × 3
P1     Pooling  20 × 12 × 12              20 × 12 × 12          –
R1     SReLU    20 × 12 × 12              20 × 4 × 4            20 × 3 × 3
C2     Conv     20 × 50 × 12 × 12         20 × 50 × 4 × 4       20 × 50 × 3 × 3
P2     Pooling  50 × 4 × 4                50 × 4 × 4            –
R2     SReLU    50 × 4 × 4                50 × 4 × 4            50 × 3 × 3
C3     Conv     50 × 150 × 4 × 4          50 × 150 × 4 × 4      50 × 150 × 3 × 3
R3     ReLU     150 × 1 × 1               150 × 1 × 1           150 × 1 × 1
F4     Full     10 × 1 × 1                10 × 1 × 1            10 × 1 × 1
X5     Softmax  10 × 1 × 1                10 × 1 × 1            10 × 1 × 1

[Fig. 3 image grid: rows show samples from MNIST, mnist-back-image, mnist-back-random, Fashion-MNIST, mnist-rot and AT&T; columns show input FM sizes from 1 × 1 up to 12 × 12]

Fig. 3. Effect of reduction in the input FM size of CNN3 on the input images from
different datasets

From Fig. 3, it is clear that shrinking the input FM sizes in CNNs blurs the
visible features. Most of the samples remain recognizable down to the size of
3 × 3. The situation is problematic for smaller FMs: in the case of 2 × 2 and
1 × 1, the information is almost completely lost. For the FM size of 3 × 3 on
the noisy samples (mnist-back-random dataset), the objects are not only
identifiable but even clearer, since the background noise is filtered out by
the LPF effect.
Figure 1 shows the structure of this network (labeled as CNN3), where the
input layer and the CONV layers use the FM size of 3 × 3. This is because the
fusion layer applied in the first CONV layer (introduced in [15] for the CNN2
model) reduces the FM size through spectral pooling, which acts like a 3 × 3
LPF. Unlike spatial CNNs and existing spectral CNNs, the proposed model
performs pooling only once, before the first CONV layer. This means that no
matter how many pixels (spectrums) are provided as input to the CNN, this LPF
picks only the 3 × 3 spectrums in the low-frequency region and discards the
rest, allowing the proposed CNN3 model to avoid unnecessary computations. In
other words, only 3 × 3 pixels selected from the 28 × 28 input are provided to
the first CONV layer, and hence all the CONV layers process inputs of this
same size. The smaller FM sizes also ensure that all the kernels are smaller,
which is beneficial for deploying spectral CNNs on memory-constrained devices.
The detailed performance of this network on the other datasets (beyond MNIST)
is presented in the next sections.

5 Results and Discussion

The compact spectral CNN model proposed in this paper was evaluated on differ-
ent standard datasets. The effectiveness of the proposed computation reduction
method is compared to spatial models (CNN1) and conventional spectral models
(CNN2). Both accuracy and classification times are compared among the three
CNN models discussed in this paper. Furthermore, the accuracy of the proposed
model was compared to some earlier works in spatial CNNs employing the SC and
BNN methods for computation reduction. Due to differences in the platforms,
CNN models, and test cases employed, a complete comparison with prior
publications that use spectral representations in neural networks is not feasible.
We have applied the previous approaches to the same LeNet-5 architecture,
utilizing the same test cases (MNIST variations and AT&T face dataset), and
operating on the same system, in order to make a fair comparison with earlier
works. As a result, three alternative CNN models have been constructed for our
tests in order to benchmark the proposed spectral CNN model’s performance.
All network weights and inputs in the spatial model (named CNN1) are in
real-number format because it runs in the spatial domain. The baseline spectral
CNN model, dubbed CNN2, is the subject of the second experiment. The final
network proposed in this paper (named CNN3) is similar to CNN2, except that
the input FM sizes are reduced to 3 × 3 pixels.
The CNN models were trained on MNIST, Fashion-MNIST, mnist-back-
random, mnist-rot, mnist-back-image and AT&T datasets. Figure 4 and Figure 5
show some randomly selected samples of these datasets. The networks are trained

off-line in MATLAB using the open-source MatConvNet [50] package. The net-
work’s classification task is implemented in the C programming language, with
the Fourier transform for the network weights performed off-line in MATLAB.

Fig. 4. Ten different classes in the (a) MNIST, (b) Mnist-Back-Random, (c) Mnist-Rot,
(d) Mnist-Back-Image and (e) Fashion-MNIST datasets

Fig. 5. Forty different classes of faces in the AT&T dataset

5.1 Classification Speed Comparison


The classification time for each of the three CNN models under consideration is
measured as the time required to classify one test sample. This time is
computed by averaging the classification times over the total number of test
samples contained in each dataset. The classification times for all three
models are shown in Table 2. In the case of the AT&T experiment, the speed-up
is substantially higher than for the MNIST variants, mainly because of the
bigger kernel sizes required in the CNN model.

Table 2. Classification speed (Per sample image) of the proposed spectral model in
comparison with previous approaches

CNN model              Method                           Dataset         Test time   Speed up
CNN1 (Spatial domain)  Conventional [5, 50]             MNIST variants  7.50 ms     1×
                                                        AT&T            43.41 ms    1×
CNN2                   Fused layer [15]                 MNIST variants  2.2 ms      3.4×
                                                        AT&T            2.58 ms     16.82×
CNN3 (Proposed)        Fused layer + FM size reduction  MNIST variants  1.51 ms     4.96×
                                                        AT&T            1.80 ms     24.11×

As evident from the results provided in Table 2, the proposed spectral CNN
model outperforms previous approaches in terms of classification time. The pro-
posed model outperforms the spatial model by about 5 times in the case of the
MNIST variants and 24 times when evaluated on the AT&T dataset. Essentially, the high
CC in the CONV layers [51] impacts the classification speed of the spatial CNN
model [50]. This computational burden is removed in this work by computing the
convolutions as EWMMs in the spectral domain. Furthermore, because of the
reduced FM sizes in its design, the algorithm for the proposed model (CNN3)
requires fewer computations for classification operations than existing spectral
CNN models such as CNN2. For this reason, CNN3 can perform faster than
CNN2 on any computing platform.
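The speed-up column in Table 2 is simply the ratio of the spatial baseline's per-sample time to each model's time; a small sanity-check script over the reported numbers (not part of the paper):

```python
# Per-sample classification times from Table 2, in milliseconds.
times_ms = {
    ("CNN1", "MNIST variants"): 7.50, ("CNN1", "AT&T"): 43.41,
    ("CNN2", "MNIST variants"): 2.20, ("CNN2", "AT&T"): 2.58,
    ("CNN3", "MNIST variants"): 1.51, ("CNN3", "AT&T"): 1.80,
}

def speedup(model, dataset):
    """Speed-up of `model` over the spatial baseline CNN1 on `dataset`."""
    return times_ms[("CNN1", dataset)] / times_ms[(model, dataset)]
```

The computed ratios match the table's speed-up column to within rounding.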

5.2 Classification Accuracy of Proposed CNN Model

The CNN models CNN1 and CNN2 are trained for 50 epochs. Because of its slower
convergence rate, CNN3 had to go through 200 epochs of training. Each CNN
model is tested independently on the MNIST variations as well as the AT&T
face recognition dataset. The test accuracy of the three CNN models was
measured in terms of MCR. Figures 6 and 7 show the training and testing MCRs
of the proposed spectral model (CNN3) on the MNIST variants and AT&T
datasets, respectively. Table 3 documents the test accuracy attained by all
three CNN models on the six datasets employed in this work for evaluating the models.

Fig. 6. Training and testing MCRs of spectral CNN3, with and without the batch
normalization (BNorm) technique, on the (a) MNIST, (b) Mnist-Back-Random,
(c) Mnist-Rot and (d) Mnist-Back-Image datasets

Fig. 7. Training and testing MCRs of spectral CNN3, with and without the batch
normalization (BNorm) technique, on the (a) AT&T and (b) Fashion-MNIST datasets

Table 3. Classification accuracy of the proposed spectral model (CNN3) in comparison
with its spatial counterpart (CNN1 [5, 50]) and the spectral model without FM size
reduction (CNN2 [15]). If the accuracies of CNN1, CNN2 and CNN3 are denoted as
A, B and C, then Diff1 = A − C and Diff2 = B − C.

Dataset           Spatial CNN1   Spectral CNN2   Proposed spectral CNN3   Diff1    Diff2
MNIST             99.69%         99.35%          99.65%                   0.04%    −0.30%
MNIST-back-rand   94.15%         89.50%          92.06%                   2.09%    −2.56%
MNIST-rot         91.48%         88.90%          86.72%                   4.76%    2.18%
MNIST-back-image  94.65%         90.97%          88.51%                   6.14%    2.46%
AT&T              98.75%         97.50%          94.75%                   4.00%    2.75%
Fashion-MNIST     94.45%         92.80%          90.47%                   3.98%    2.33%

It is worth noting that, in general, there is always a trade-off between model
accuracy and network speed in neural networks [1]. The findings of our
experiments reveal that, in the worst-case scenario, the significantly faster
inference time is achieved at the expense of no more than 6.14% classification
accuracy. The compact spectral CNN model proposed in this work also shows a
higher level of robustness to noise than CNN2, as evidenced by the results
obtained for MNIST and MNIST-back-rand. The reason for this is that CNN3's
architecture benefits from the properties of an LPF.

Fig. 8. Classification accuracy of the network with batch normalization on the MNIST
dataset, applying different input FM sizes in the C1 layer

The difference between CNN2 and CNN3 is only in the FM sizes of the
CONV layers. The input FM size in CNN2 is 12 × 12, whereas this size in
CNN3 is reduced to 3 × 3. The reason why 3 × 3 is chosen for this network is
illustrated in Fig. 8, which shows the classification accuracy of the network
for different input FM sizes in the C1 layer. As can be observed from the
plot, the performance does not drop considerably as the number of frequency
spectrums in the FM is reduced. This holds until the FM size is reduced to
2 × 2, at which point the accuracy is no longer in an acceptable range. It is
worth reporting that the network with the single DC spectrum (1 × 1 pixel)
does not converge at all. The accuracy shows a minor drop for the 3 × 3 FM
size in the first 50 epochs, which is compensated for by the longer training
of 200 epochs.
In all of the experiments, the spectral domain network takes longer (or more
epochs) to converge in the training phase than the conventional method, espe-
cially for CNN3, where the training duration is increased from 50 to 200 epochs.
However, this is not a significant issue, particularly for the end user, who
merely requires a quick and accurate classifier, independent of how much time
the training stage requires. In other words, we can use dedicated hardware to
speed up the time-critical inference process while leaving the training phase to
servers or a powerful host PC.

5.3 Accuracy Benchmarking

We have benchmarked the accuracy of our proposed model against previous CNN
models (both spatial and spectral) that employ state-of-the-art strategies for
reducing CC in CNNs. Table 4 shows the accuracy attained by our model as well
as by these previous works. The majority of these works are aimed at various
embedded-system applications.
The proposed model’s accuracy was compared to previous works that use
spectral representation, as well as works that use various spatial domain tech-
niques (such as SC and BNN) to reduce CC in CNN. The CC of computing
convolution in all the three approaches are also listed in Table 4. On the MNIST
dataset, the results show that the proposed model is more accurate than alter-
native approaches. With BNN and SC approaches, the reduction in computation
frequently comes at the cost of a significant loss of accuracy in network perfor-
mance. In other words, in neural networks, there is always a trade-off between
model accuracy and data bit-width of the weights and activations [13].

Table 4. Classification accuracy benchmarks of LeNet-5 CNNs on the MNIST dataset,
applying different methods of computational complexity (CC) reduction

Work                     Year  Paradigm        CC of convolution   Best test accuracy
Courbariaux et al. [26]  2016  Spatial BNN     O(N² × K²)          99.06%
Hubara et al. [27]       2016                                      96.04%
Liang et al. [28]        2018                                      98.24%
Wu et al. [29]           2020                                      98.82%
Li et al. [33]           2017  Spatial SC-CNN  O(N² × K² × L)      98.37%
Ma et al. [32]           2018                                      99.04%
Li et al. [34]           2019                                      99.13%
Abdellatef et al. [35]   2022                                      99.19%
Ayat et al. [15]         2019  Spectral CNN    O(N²)               99.35%
Liu et al. [17]          2020                                      99.12%
Rizvi et al. [16]        2021                                      97.65%
Watanabe et al. [18]     2021                                      96.92%
This work                2022                                      99.65%

6 Conclusions

In this work, we have demonstrated that computing a CNN in the spectral domain
can be accurate and computationally inexpensive. This gain in computational
efficiency arises from computing convolutions as element-wise products in the
spectral domain, with a much lower CC of O(N²) (instead of O(N² × K²)), and
from employing extremely compact FM sizes. This work contributes to the
spectral CNN paradigm with a model that computes inference with fewer
computations than state-of-the-art spectral CNNs. It achieves this by
introducing an FM size reduction approach that yields a CNN model with faster
inference than state-of-the-art spectral CNN models. The proposed method
barely impacts the classification accuracy, just as originally intended, and
moreover produces reduced MCRs on datasets with noisy images. This is because
the proposed model discards high-frequency components like an LPF, which
results in a noise-tolerant solution.
As discussed in Sect. 2, BNN and SC are other successful approaches for
simplifying computations in CNN. Therefore, a possible path for future work is to
design a hybrid CNN model with BNN or SC data representation in the spectral
domain (hybrid spectral-BNN model or hybrid spectral-SC model). Unlike BNN
and SC methods, the spectral representation is independent of the number of
binary digits (bit length) representing data. In other words, the focus of spectral
representation is on the CNN algorithm rather than the bit length of the data.
Hence, it is possible to further compress the bit width of the complex-valued
data in the spectral domain toward binary or stochastic implementation. The
main concern would be regarding how to maintain the accuracy of the hybrid
network at an acceptable range.

Acknowledgment. The authors thank Universiti Teknologi Malaysia (UTM) for their
support under the Research University Grant (GUP), grant number 16J83.

References
1. Almasi, A.D., Wozniak, S., Cristea, V., Leblebici, Y., Engbersen, T.: Review of
advances in neural networks: neural design technology stack. Neurocomputing 174,
31–41 (2016)
2. Collobert, R., Weston, J.: A unified architecture for natural language processing:
deep neural networks with multitask learning. In: Proceedings of the 25th Inter-
national Conference on Machine Learning, pp. 160–167. ACM (2008)
3. Simard, P.Y., Steinkraus, D., Platt, J.C.: Best practices for convolutional neural
networks applied to visual document analysis. In: Proceedings of the 7th Interna-
tional Conference on Document Analysis and Recognition (ICDAR), vol. 3, pp.
958–963. IEEE (2003)
4. McNelis, P.D.: Neural Networks in Finance: Gaining Predictive Edge in the Market.
Academic Press, Cambridge (2005)
5. Ahmad Radzi, S., Mohamad, K.H., Liew, S.S., Bakhteri, R.: Convolutional neural
network for face recognition with pose and illumination variation. Int. J. Eng.
Technol. (IJET) 6(1), 44–57 (2014)
6. Huang, P.S., He, X., Gao, J., Deng, L., Acero, A., Heck, L.: Learning deep struc-
tured semantic models for web search using clickthrough data. In: Proceedings of
the 22nd ACM International Conference on Information & Knowledge Manage-
ment, pp. 2333–2338. ACM (2013)
7. Rasti, P., Uiboupin, T., Escalera, S., Anbarjafari, G.: Convolutional neural network
super resolution for face recognition in surveillance monitoring. In: Perales, F.J.J.,
Kittler, J. (eds.) AMDO 2016. LNCS, vol. 9756, pp. 175–184. Springer, Cham
(2016). https://doi.org/10.1007/978-3-319-41778-3 18
8. Yeap, Y.Y., Sheikh, U.U., Ab Rahman, A.A.: Image forensic for digital image
copy move forgery detection. In: 14th IEEE International Colloquium on Signal
Processing and Its Applications (CSPA), pp. 239–244. IEEE (2018)
9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep con-
volutional neural networks. Commun. ACM 60(6), 84–90 (2017)
10. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. In: Proceedings of the 3rd International Conference on Learning
Representations, ICLR (2015)
11. Sermanet, P., et al.: A multirange architecture for collision-free off-road robot
navigation. J. Field Robot. 26(1), 52–87 (2009)
12. LeCun, Y., et al.: Handwritten digit recognition with a back-propagation network.
In: Proceedings of the 2nd International Conference on Neural Information Pro-
cessing Systems (NIPS), pp. 396–404. NeurIPS (1989)
13. Zhang, Q., Zhang, M., Chen, T., Sun, Z., Ma, Y., Yu, B.: Recent advances in
convolutional neural network acceleration. Neurocomputing 323, 37–51 (2019)
14. Amer, H., Ab Rahman, A., Amer, I., Lucarz, C., Mattavelli, M.: Methodology
and technique to improve throughput of FPGA-based Cal dataflow programs: case
study of the RVC MPEG-4 SP intra decoder. In: Proceedings of the IEEE Work-
shop on Signal Processing Systems (SiPS), pp. 186–191. IEEE (2011)

15. Ayat, S., Khalil-Hani, M., Ab Rahman, A., Abdellatef, H.: Spectral-based con-
volutional neural network without multiple spatial-frequency domain switchings.
Neurocomputing 364, 152–167 (2019)
16. Rizvi, S., Ab Rahman, A., Khalil-Hani, M., Ayat, S.: A low-complexity complex-
valued activation function for fast and accurate spectral domain convolutional
neural network. Indones. J. Electr. Eng. Inform. (IJEEI) 9(1), 173–184 (2021)
17. Liu, S., Luk, W.: Optimizing fully spectral convolutional neural networks on
FPGA. In: Proceedings of the 19th IEEE International Conference on Field-
Programmable Technology (ICFPT), pp. 39–47. IEEE (2020)
18. Watanabe, T., Wolf, D.: Image classification in frequency domain with 2SReLU:
a second harmonics superposition activation function. Appl. Soft Comput. 112,
107851 (2021)
19. Zhang, C., Li, P., Sun, G., Guan, Y., Xiao, B., Cong, J.: Optimizing FPGA-based
accelerator design for deep convolutional neural networks. In: Proceedings of the
2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays,
pp. 161–170. ACM (2015)
20. Qiu, J., et al.: Going deeper with embedded FPGA platform for convolutional neu-
ral network. In: Proceedings of the 2016 ACM/SIGDA International Symposium
on Field-Programmable Gate Arrays, pp. 26–35. ACM (2016)
21. Gysel, P.: Ristretto: hardware-oriented approximation of convolutional neural net-
works. arXiv preprint (arXiv:1605.06402). arXiv (2016)
22. Denton, E.L., Zaremba, W., Bruna, J., LeCun, Y., Fergus, R.: Exploiting linear
structure within convolutional networks for efficient evaluation. In: Proceedings of
the Advances in Neural Information Processing Systems, pp. 1269–1277. NeurIPS
(2014)
23. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for
efficient neural network. In: Proceedings of the 28th International Conference on
Neural Information Processing Systems (NIPS), pp. 1135–1143. NeurIPS (2015)
24. Shawahna, A., Sait, S.M., El-Maleh, A.: FPGA-based accelerators of deep learning
networks for learning and classification: a review. IEEE Access 7, 7823–7859 (2018)
25. Zhang, X., Zou, J., Ming, X., He, K., Sun, J.: Efficient and accurate approximations
of nonlinear convolutional networks. In: Proceedings of the IEEE Conference on
Computer Vision and pattern Recognition (CVPR), pp. 1984–1992. IEEE (2015)
26. Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural
networks: Training deep neural networks with weights and activations constrained
to +1 or -1. arXiv preprint (arXiv:1602.02830). arXiv (2016)
27. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neu-
ral networks. In: Proceedings of the Advances in Neural Information Processing
Systems, pp. 4107–4115. NeurIPS (2016)
28. Liang, S., Yin, S., Liu, L., Luk, W., Wei, S.: FP-BNN: binarized neural network
on FPGA. Neurocomputing 275, 1072–1086 (2018)
29. Wu, Q., Lu, X., Xue, S., Wang, C., Wu, X., Fan, J.: SBNN: slimming binarized
neural network. Neurocomputing 401, 113–122 (2020)
30. Mittal, S.: A survey of FPGA-based accelerators for convolutional neural networks.
Neural Comput. Appl. 32(4), 1109–1139 (2018). https://doi.org/10.1007/s00521-
018-3761-1
31. Umuroglu, Y., et al.: A framework for fast, scalable binarized neural network infer-
ence. In: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, pp. 65–74. ACM (2017)

32. Ma, X., et al.: An area and energy efficient design of domain-wall memory-based
deep convolutional neural networks using stochastic computing. In: Proceedings
of the 19th International Symposium on Quality Electronic Design (ISQED), pp.
314–321. IEEE (2018)
33. Li, J., et al.: Hardware-driven nonlinear activation for stochastic computing based
deep convolutional neural networks. In: Proceedings of the International Joint Con-
ference on Neural Networks (IJCNN), pp. 1230–1236. IEEE (2017)
34. Li, Z., et al.: HEIF: highly efficient stochastic computing-based inference framework
for deep neural networks. IEEE Trans. Comput.-Aid. Des. Integr. Circ. Syst. 38(8),
pp. 1543–1556 (2019)
35. Abdellatef, H., Khalil-Hani, M., Shaikh-Husin, N., Ayat, S.: Accurate and com-
pact convolutional neural network based on stochastic computing. Neurocomputing
471, 31–47 (2022)
36. Qian, W., Li, X., Riedel, M.D., Bazargan, K., Lilja, D.J.: An architecture for fault-
tolerant computation with stochastic logic. IEEE Trans. Comput. 60(1), 93–105
(2010)
37. Hayes, J.P.: Introduction to stochastic computing and its challenges. In: Proceed-
ings of the 52nd ACM/IEEE Design Automation Conference (DAC), pp. 1–3. IEEE
(2015)
38. Bottleson, J., Kim, S., Andrews, J., Bindu, P., Murthy, D.N., Jin, J.: clCaffe:
OpenCL accelerated Caffe for convolutional neural networks. In: Proceedings of
the IEEE International Parallel and Distributed Processing Symposium Workshops
(IPDPSW), pp. 50–57. IEEE (2016)
39. Bareiss, E.H.: Numerical solution of linear equations with Toeplitz and vector
Toeplitz matrices. Numerische Mathematik 13(5), 404–424 (1969)
40. Sze, V., Chen, Y.H., Yang, T.J., Emer, J.S.: Efficient processing of deep neural
networks: a tutorial and survey. Proc. IEEE 105(12), 2295–2329 (2017)
41. Winograd, S.: Arithmetic Complexity of Computations. SIAM (1980)
42. Lavin, A., Gray, S.: Fast algorithms for convolutional neural networks. In: Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 4013–4021. IEEE (2016)
43. Rippel, O., Snoek, J., Adams, R.P.: Spectral representations for convolutional neu-
ral networks. In: Proceedings of the 28th International Conference on Neural Infor-
mation Processing Systems (NIPS), pp. 2449–2457. ACM (2015)
44. Ayat, S., Khalil-Hani, M., Ab Rahman, A.: Optimizing FPGA-based CNN accel-
erator for energy efficiency with an extended Roofline model. Turk. J. Electr. Eng.
Comput. Sci. 26(2), 919–935 (2018)
45. Niu, Y., et al.: SPEC2: SPECtral SParsE CNN accelerator on FPGAs. In: Proceed-
ings of the 26th IEEE International Conference on High Performance Computing,
Data, and Analytics (HiPC), pp. 195–204. IEEE (2019)
46. Sun, W., Zeng, H., Yang, Y.-h., Prasanna, V.: Throughput-optimized frequency
domain CNN with fixed-point quantization on FPGA. In: International Conference
on ReConFigurable Computing and FPGAs (ReConFig), pp. 1–8. IEEE (2018)
47. Guan, B., Zhang, J., Sethares, W., Kijowski, R., Liu, F.: Spectral domain convolu-
tional neural network. In: Proceedings of the 46th IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pp. 2795–2799. IEEE (2021)
48. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444
(2015)
49. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by
reducing internal covariate shift. arXiv preprint (arXiv:1502.03167). arXiv (2015)

50. Vedaldi, A., Lenc, K.: MatConvNet: convolutional neural networks for MATLAB.
In: Proceedings of the 23rd ACM International Conference on Multimedia (MM),
pp. 689–692. ACM (2015)
51. Cong, J., Xiao, B.: Minimizing computation in convolutional neural networks. In:
Proceedings of the 24th International Conference on Artificial Neural Networks
(ICANN), pp. 281–290. IEEE (2014)
Hybrid Context-Content Based Music
Recommendation System

Victor Omowonuola(B) , Bryce Wilkerson, and Shubhalaxmi Kher

Arkansas State University, State University, Arkansas 72467, USA


[email protected]

Abstract. Due to advances in technology and research over the past few
decades, music has become increasingly available to the public, but with such
a vast selection it becomes challenging to choose which songs to listen to.
From research done on music recommendation systems (MRS), there are three main
methods of recommending songs: context-based, content-based and collaborative
filtering. A hybrid combination of the three methods has the potential to improve
music recommendation; however, it has not been fully explored. In this paper, a
hybrid music recommendation system using emotion as the context and musical
data as the content is proposed. To achieve this, the outputs of a convolutional
neural network (CNN) and a weight-extraction method are combined. The CNN
extracts the user's emotion from a favorite playlist and extracts audio features
from the songs and their metadata. The user emotion and audio feature outputs
are combined, and a collaborative filtering method is used to select the best
song for recommendation. For performance evaluation, the proposed recommendation
system is compared with a content similarity music recommendation system (CSMRS)
as well as other personalized music recommendation systems.

Keywords: Music recommendation system · Convolutional neural network ·


Mel-spectrogram · Emotion recognition · Valence-arousal model

1 Introduction
Music has always been an important topic in discussions of the development of
the modern internet. In the early days, music was mostly bought from stores in
the form of vinyl or discs. As the internet evolved, music piracy increased and
became ubiquitous in the listening, sharing, and storage of music, because music
was more accessible in the mp3 format. On the modern internet, music streaming
services such as Spotify, Pandora and Deezer have increased the availability of
music, whether by subscription or for free. Due to the enormous amount of music
content available, these streaming services have developed recommender systems
that account for user preferences when recommending music. Recommender systems
are important because they can enhance the user experience on a particular music
streaming site. An enhanced experience helps provide quick and relevant
recommendations, so that existing users are retained and more potential users
are gained. Knowing user preferences is also important for

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 121–132, 2023.
https://doi.org/10.1007/978-3-031-18461-1_8
122 V. Omowonuola et al.

placing advertisements, which serve as a major income generator for streaming
sites. Music information retrieval, music generation, and music recommendation
are discussed in the following sections.

1.1 Music Information Retrieval


Music Information Retrieval (MIR) can be traced back to signal processing, musicol-
ogy, and psychology. It is the science of retrieving information from music. Most of
MIR focuses on retrieving information from the content of music such as music scores
(when attainable), MIDI and other digital audio formats. MIR research has created many
exciting applications such as automatic music transcription, music classification, music
generation, and music recommenders [1]. Due to these exciting applications, more
research has recently been done on expanding the MIR process, thereby serving as a
foundation for building better recommendation systems. An example MIR application
(music generation) is discussed in the next section.

1.2 Music Generation


Music generation has been a goal of MIR and Artificial Intelligence (AI) researchers
for decades. Artificial music generation involves the use of an algorithm or a learning
system to generate music pleasing to humans. With the advent of machine learning, more
attempts have been made, and progress, however small, has been achieved in generating
music automatically. The combination of research done on MIR and its applications
has made it possible to select music recommendations for users using the recommender
systems described in the next section.

1.3 Music Recommenders


Music recommendation involves recommending a song that matches a particular
user's taste in music. New or old songs can be recommended based on various situations,
and songs that do not usually correlate with a user's taste but might be liked by the user can
also be recommended. In the early days of music recommendation (MR), collaborative
filtering (CF) was the preferred method. CF exploits a user's historical likes and dislikes
of items; the interactions include clicks or ratings, which are represented in a user-item
matrix. Because it is based on historical ratings, it assumes that your previous
likes will remain the same as your future likes. User-based and item-based CF are the
main variants, differentiated by whether recommendations are made from similarities
between users or between items.
In user-based recommendation, user similarities are computed using either Pearson's
Correlation Coefficient (Eq. 1) or Cosine similarity (Eq. 2).

Pearson Coefficient:
$S_{IK} = \frac{\sum_{i \in L_{IK}} (r_{Ii} - \bar{X}_I)(r_{Ki} - \bar{X}_K)}{\sqrt{\sum_{i \in L_{IK}} (r_{Ii} - \bar{X}_I)^2}\,\sqrt{\sum_{i \in L_{IK}} (r_{Ki} - \bar{X}_K)^2}}$ (1)

Cosine Similarity:
$S_{IK} = \frac{X_I \cdot X_K}{\|X_I\|\,\|X_K\|}$ (2)
Hybrid Context-Content Based Music Recommendation System 123

where
$S_{IK}$ is the similarity between the ratings given by users I and K, $L_{IK}$ is the set of items
rated/liked by both users, and $\bar{X}_I$ and $\bar{X}_K$ are the average ratings of each user [2].
$r_{Ix}$, the predicted rating when user I has not listened to song x, is calculated using Eq. 3 below.

$r_{Ix} = \bar{X}_I + \frac{\sum_{n \in N_I} S_{nI}\,(r_{nx} - \bar{X}_n)}{\sum_{n \in N_I} |S_{nI}|}$ (3)

where $N_I$ is the set of user I's nearest neighbors who rated item x, and $S_{nI}$ is the
similarity score between users n and I. Finally, the items with the highest $r_{Ix}$ are
recommended to user I.
The process for CF is to calculate the similarities between user I and other users,
select the users with the highest similarities, and take the weighted average of their ratings,
using the similarities as weights. Because differences between people create bias,
each user's average rating is subtracted and the target user I's average is added back, as shown in
Eq. 3. Therefore, CF finds users similar to the target user and recommends songs to the
target user based on the weights attached to each song.
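To make these steps concrete, here is a minimal NumPy sketch of user-based CF (an illustration, not the authors' implementation); the ratings matrix is a made-up toy example, and the prediction denominator uses the standard absolute-value form of Eq. 3:

```python
import numpy as np

# Hypothetical user-item rating matrix (rows: users, columns: songs); 0 = not rated.
R = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 5.0, 1.0, 0.0],
              [1.0, 0.0, 5.0, 4.0]])

def pearson_sim(R, a, b):
    """Pearson correlation (Eq. 1) over the items rated by both users a and b."""
    both = (R[a] > 0) & (R[b] > 0)
    da = R[a, both] - R[a, both].mean()
    db = R[b, both] - R[b, both].mean()
    return (da @ db) / (np.sqrt((da ** 2).sum()) * np.sqrt((db ** 2).sum()))

def predict(R, a, x):
    """Predict user a's rating of item x (Eq. 3): mean-centred weighted average."""
    neighbours = [n for n in range(len(R)) if n != a and R[n, x] > 0]
    sims = np.array([pearson_sim(R, a, n) for n in neighbours])
    deviations = np.array([R[n, x] - R[n][R[n] > 0].mean() for n in neighbours])
    return R[a][R[a] > 0].mean() + (sims @ deviations) / np.abs(sims).sum()

print(predict(R, 0, 2))  # predicted rating of song 2 for user 0
```

Items with the highest predicted ratings would then be recommended to the user.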
Another form of recommendation is the content-based recommendation system
(CBF). This system recommends songs based on features extracted from the audio
signal, such as rhythm, tempo, melody, and metadata. The information can also contain
low-level audio features such as the Mel-Frequency Cepstral Coefficients (MFCCs), Zero
Crossing Rate (ZCR), and Teager Energy Operator (TEO). Since audio information is
used to recommend songs, it eliminates the need for a large user base and is also
more accurate than CF in predicting users' taste [3].
The last form of recommender is context based. These recommenders factor in contexts
like time, location, weather, news, and emotion. Since music is often laden with contexts
like emotion, it is important to factor them in when recommending songs to a user. Many
music streaming services take advantage of this by creating playlists such as morning
moods, sad/happy, or Sunday tunes.
In this article, a preliminary music recommendation system that incorporates
content-based, collaborative, and context-based methods to analyze a user's taste and
accurately predict songs to recommend is proposed and explained. Emotion is used as the
context, metadata is used for the content, and collaborative filtering is to be done when
a large number of users is acquired. The rest of the paper is structured as follows. In the
next section, related works done on music recommendation are presented. In Sect. 3, our
proposed model is discussed. Section 4 gives more details on the experimental results.
Analysis of the proposed system is done in Sect. 5. Finally, the paper concludes with a
summary and directions for future work.

2 Related Work

Researchers have used audio content to recommend music, using a two-stage
approach: extracting low-level audio features like MFCCs and using these features
to predict what a user likes [4, 5]. The results achieved, however, were unsatisfactory.
Therefore, researchers developed better recommendation systems, such as the use of a
deep belief network and a probabilistic graphical model to unify the two steps into an

automated process [6]. One of the earliest uses of deep learning (DL) in content-based
music recommendation (MR) is by Van den Oord, who used a CNN with Rectified
Linear Units (ReLU) to reduce processing time. The input to the CNN consisted of short
audio samples from 7digital, and track data from the Million Song Dataset containing
the song metadata was used to train the network. Each song was represented by 50
latent factors used to minimize the mean square error (MSE) of predicting a user's
likes. The experiment showed that a CNN performs best in user predictions.
Other researchers used collaborative filtering to collect song data and recommend
it to users. This method was feasible, as shown in [5], but it had a cold
start problem: due to the lack of user data at the beginning of the process, it was
impossible to recommend songs to a new user. To curtail this problem, content-based
music recommendation was incorporated into collaborative filtering methods. Papers
by Richard Stenzel and Thomas Kamps show the advantages of incorporating multiple
recommendation methods.
To increase the accuracy of recommender models, researchers have also extracted
emotional data from songs by creating a CNN trained on data from the MediaEval 1000 songs
database (DEAM). This database contains song information classified on an emotional
scale of valence and arousal. The CNN model is based on regression and gives values
for both valence and arousal. Mel-spectrograms of 15-s audio snippets are used as inputs
to the CNN, and the network is trained on the DEAM dataset. The experiments obtained a
mean absolute error below 0.1, meaning that the predicted valence and arousal are, on
average, less than 0.1 away from the ground truth.
Previous work on music recommendation has been done by our group of researchers;
the paper involved the use of the Spotify API to access song information and attributes
like danceability, energy, and valence. Attributes were normalized and then used in
K-means clustering, in which each song belongs to the cluster with the nearest mean. A function
then utilized the cosine distance to recommend songs to a user. When this model was
tested with a subset of 20 people, the recommended songs were close to the user's preference
about 75% of the time [7].
Overall, this paper combines three music recommendation methods (content-based,
collaborative filtering, and context-based) to create a system that achieves
higher accuracy in predicting musical likes and recommending music to a user. From a
review of research done by previous authors, we believe this hybrid method should lead
to greater accuracy in predictions.
The next section gives an overview of the recommender systems and the datasets used,
explains the processing of the audio signals, and describes the architecture. The content
used and the valence-arousal model chosen to represent emotion are also explained
in that section.

3 Model Construction
3.1 Emotional Representation
Music is a great conveyor of emotions, and the emotions perceived from listening to
music vary by person. However, research over the recent years has shown improvements
in classifying music based on moods/emotions. This can be seen in the enormous number
of emotional playlists on music streaming sites and the success of CNNs and
recurrent neural networks (RNNs) in classifying songs [8, 9]. Due to the difficulty of
granulating emotions and the variability in emotions perceived by humans, Posner et al.
proposed a model which splits emotions into two components: valence and arousal.
Valence describes the attractiveness (positive valence) or aversiveness (negative valence)
of stimuli along a continuum (negative – neutral – positive), and arousal refers to the
perceived intensity of an event from very calming to extremely exciting or agitating [9].
The valence-arousal scale and its association with emotions are depicted in Fig. 1. Since
this scale represents a wide range of emotions, it was chosen as the form to represent
emotions found in songs.

Fig. 1. Valence-arousal scale

As shown in Fig. 1, most emotions can be represented using this scale. For example,
a low arousal and low valence score represents tiredness, and the opposite represents
excitement.
In our emotional classification, the authors chose to use a CNN, since CNNs are an
established method of classifying emotion [8–11]. Since CNNs were created to
classify images, the audio data must be represented as an image. Spectrograms are
widely used in the machine learning field to represent audio data, so the authors
chose to use them as well. Audio spectrograms visualize sound on a 2D plane:
the horizontal axis represents time, and the vertical axis represents frequency. Color
represents the amplitude of each time-frequency pair. The spectrogram can be defined as
an intensity plot of the Short-Time Fourier Transform (STFT) magnitude. Let X be a
signal of length Y. Consecutive equal segments of length n are taken, where n < Y. The
segments may overlap. Each segment is windowed, and its spectrum is computed
using the STFT, represented by Eq. (4) below.

$STFT\{x(t)\}(\tau, \omega) \equiv \int_{-\infty}^{\infty} x(t)\, w(t - \tau)\, e^{-i\omega t}\, dt$ (4)

where $w(\tau)$ is the window function centered around zero, and $x(t)$ is the signal to be
transformed. Research has shown that humans do not hear frequencies on a linear scale;
a normal spectrogram is therefore limited in fully representing the way audio is heard
and translated by our brains. A more accurate image representation is the
Mel-spectrogram. This spectrogram converts the frequencies into a Mel scale as shown
in Eq. 5.

$M(f) = 1125 \ln\left(1 + \frac{f}{700}\right)$ (5)
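Equation 5 and its inverse are straightforward to implement; a minimal sketch (the test frequencies are arbitrary):

```python
import math

def hz_to_mel(f):
    """Convert a frequency in Hz to the Mel scale (Eq. 5)."""
    return 1125.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of Eq. 5: recover the frequency in Hz from Mels."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)

# Equal steps on the Mel scale correspond to ever-larger steps in Hz:
for f in (100, 1000, 8000):
    print(f, "Hz ->", round(hz_to_mel(f), 1), "Mel")
```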

A Mel scale is a logarithmic transformation of frequency; this transformation
creates a scale of pitches judged by listeners to be equidistant from one another. This is
important because human hearing is based on a kind of real-time spectrogram encoded
by the cochlea, and the Mel-spectrogram closely represents this process [11]. An example
Mel-spectrogram is shown in Fig. 2.

Fig. 2. Representation of an audio signal with a mel-spectrogram [12, 13]

Training and testing of the neural network were done with the DEAM dataset at first.
The dataset consists of 1000 songs with 744 unique songs selected from the Free Music
Archive. Each audio snippet has a length of 45 s, taken from a random starting point in a song.
The songs are annotated in a table consisting of valence and arousal values ranging from
1 to 9 [14]. The arousal and valence values are centered around 5, and there is a lack
of songs with low arousal and valence values. The distribution of the arousal values is
shown in Fig. 3. Because the distribution of values is not balanced, the results are
biased. To curtail this, data from the Deezer emotion classification
dataset was added to improve the balance of the new dataset.
At first, each song is divided into three 15-s samples. The audio files are converted
to mono and then resampled at 16 kHz. The
songs are converted into Mel-spectrograms using the ‘Librosa’ python library [15]. A
window size of 2048 and hop size of 512 is used.
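The framing step behind these spectrograms can be sketched without Librosa; the following NumPy toy example uses the same window size (2048) and hop size (512), with a Hann window and a synthetic 440 Hz tone as illustrative choices rather than the library's exact internals:

```python
import numpy as np

def stft_mag(x, n_fft=2048, hop=512):
    """Magnitude spectrogram: frame the signal, window each frame, FFT each frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (n_fft // 2 + 1, n_frames)

sr = 16000                               # songs are resampled to 16 kHz
t = np.arange(sr) / sr                   # one second of audio
x = np.sin(2 * np.pi * 440.0 * t)        # 440 Hz test tone
S = stft_mag(x)
peak_bin = int(S.mean(axis=1).argmax())
print(S.shape, "peak near", peak_bin * sr / 2048, "Hz")
```

Taking the logarithm of S and pooling its frequency bins into Mel bands (Eq. 5) would then yield the Mel-spectrogram fed to the CNN.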
The valence-arousal values are normalized to [0, 1] and the spectrogram
is normalized to [−1, +1]. Two network architectures, (a) the pre-trained
mobilenet_v2_130_224 model developed by Google and (b) a simple network from
previous research, are chosen for comparison of results [7].
Fig. 3. Distribution of valence-arousal values

The simple network consists of an input layer of 128 × 1024 (128 Mel scales and a
1024-step time window), followed by a convolution layer with 16 filters of 3 × 3 kernel
size. The ReLU activation function is applied next to reduce processing time, followed
by a 3 × 3 pooling layer, a dense layer of 64 neurons, and an output layer of 10 nodes.
As mentioned before, the hidden layers use the ReLU activation function, and the output
layer uses the SoftMax function. The model was used to get outputs for both valence
and arousal.
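To illustrate the convolution → ReLU → pooling sequence described above, here is a toy-sized NumPy forward pass; the weights are random placeholders, the 32 × 64 input stands in for the full 128 × 1024 spectrogram, and non-overlapping 3 × 3 max pooling is one plausible reading of the pooling layer:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(img, kernels):
    """'Valid' 2-D convolution of a single-channel image with a bank of kernels."""
    kh, kw = kernels.shape[1:]
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((kernels.shape[0], h, w))
    for f, k in enumerate(kernels):
        for i in range(h):
            for j in range(w):
                out[f, i, j] = (img[i:i + kh, j:j + kw] * k).sum()
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=3):
    """Non-overlapping size x size max pooling applied to each feature map."""
    c, h, w = x.shape
    h2, w2 = h // size, w // size
    x = x[:, :h2 * size, :w2 * size].reshape(c, h2, size, w2, size)
    return x.max(axis=(2, 4))

spec = rng.standard_normal((32, 64))       # toy spectrogram (real input: 128 x 1024)
kernels = rng.standard_normal((16, 3, 3))  # 16 filters with a 3 x 3 kernel
feat = max_pool(relu(conv2d(spec, kernels)))
print(feat.shape)  # 16 feature maps, spatially reduced by the pooling
```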
For the pre-trained mobilenet_v2_130_224 model, each spectrogram was converted
into a 224 × 224-pixel image. These pixels were converted into tensors used as the
input. The sequential Keras API was used, since our model is not complicated. The first
layer takes the images and finds patterns in them; the second layer takes the information
from the first layer and outputs it into ten unique labels (0–9). SoftMax is used as the
activation function and Adam as the optimizer. For both models, the dataset
was split 80:10:10 for training, testing, and validation.

3.2 Content Recommendation


Content recommendation involves extracting features from songs and recommending songs
whose features are similar to those of the user's liked songs. Research on MIR gives several
ways of extracting features: use of a spectrogram, low-level audio features like MFCCs,
and mid-level audio features like pitch and beat. In this research, a mix of low-level
and mid-level features is used to describe songs. These features are:
MFCC: The MFCC is derived from a Mel-scale spectrogram. It is obtained by first
segmenting the audio signal into frames of Lens samples and then applying a Hamming
window (HW), defined by Eq. 6, to each frame:

$HW(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{Lens}\right), \quad 0 \le n \le Lens$ (6)
where Lens is the number of samples in a frame. The input signal $Sig_i$ is thus
converted to an output $O_i$ using Eq. 7.

$O_i = Sig_i \times HW_i$ (7)

The output signal is then converted into the frequency domain by the fast Fourier
transform, and the resulting frequencies are mapped to the Mel scale using Eq. 8.

$Mel(i) = 2595 \times \log_{10}\left(1 + \frac{i}{700}\right)$ (8)

Pitch: Pitch extraction calculates the distances between the peaks of a given segment
of the music audio signal. Let $Sig_i$ denote the audio segment, k the pitch period
of a peak, and $Len_i$ the window length of the segment; the pitch feature can then be
obtained using Eq. (9).

$Pitch(k) = \sum_{i=0}^{Len_i - k - 1} Sig_i \, Sig_{i+k}$ (9)
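Equation 9 is an autocorrelation, so the pitch period is the lag that maximizes it. A small NumPy sketch with a synthetic signal of known period (80 samples, an arbitrary choice standing in for a pitched music segment) illustrates this:

```python
import numpy as np

def pitch_acf(sig, k):
    """Eq. 9: correlate the segment with itself shifted by lag k."""
    n = len(sig) - k
    return float(np.dot(sig[:n], sig[k:k + n]))

# Synthetic periodic signal with a known period of 80 samples.
rng = np.random.default_rng(0)
sig = np.tile(rng.standard_normal(80), 13)[:1024]

# The estimated pitch period is the lag with the largest autocorrelation.
best = max(range(20, 400), key=lambda k: pitch_acf(sig, k))
print("estimated pitch period:", best, "samples")
```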

Tonality: Tones are letter names that humans attach to different frequencies. Most music
contains various tones but settles around a home tone from which variation starts.
Musicians use this home tone as a feature of the song. Since it is in the form of
letters, the letters are encoded into numbers to serve as input to the neural network.
Other features like beat and rhythm are also used as inputs to the neural network. To
increase the accuracy of the predictions, high-level features used in previous research
are also used as inputs. These features, as defined by Spotify, are:
Danceability: Describes how suitable a track is for dancing based on a combina-
tion of musical elements including tempo, rhythm stability, beat strength, and overall
regularity.
Energy: Represents a perceptual measure of intensity and activity. Typically, ener-
getic tracks feel fast, loud, and noisy. For example, death metal has high energy, while
a Bach prelude scores low on the scale.
Loudness: The overall loudness of a track in decibels (dB). Loudness values are
averaged across the entire track and are useful for comparing the relative loudness of
tracks.
Speechiness: This detects the presence of spoken words in a track. The more exclu-
sively speech-like the recording (e.g., talk show, audiobook, poetry), the closer to 1.0
the attribute value.
Instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah”
sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly
“vocal.”
Liveness: Detects the presence of an audience in the recording. Higher liveness
values represent an increased probability that the track was performed live.
Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic.
Duration: The duration of the track in milliseconds [16].
These features were obtained from the Free Music Archive (FMA) small and features
datasets. The FMA is a large dataset created for music analysis, containing about 100,000
songs from 160 genres; the audio files and features of the songs are provided [17]. The
FMA small dataset contains 8000 songs divided equally among 8 genres. Due to
the balance of this dataset and its relation to the dataset used for emotional classification,
it was chosen by the authors [8]. Table 1 shows the division of genres in the FMA small
dataset. Due to the problem of achieving better accuracy using the FMA dataset alone, a new
dataset mixing data from the FMA and GTZAN datasets was used for classification.
The GTZAN dataset (refer to Table 2) contains about 1000 music samples (10 different
genres) of 30 s each [18]. Audio labels are also provided, making it similar to the FMA
dataset.

Table 1. FMA small dataset

Genres Number of Tracks


Electronic 1000
Experimental 1000
Folk 1000
Hip-Hop 1000
Instrumental 1000
International 1000
Pop 1000
Rock 1000

The new dataset contains 200 songs for each of the genres present in the GTZAN dataset
(blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae, rock).

Table 2. GTZAN dataset

Genres Number of tracks


Blues 100
Classical 100
Country 100
Disco 100
Hip-Hop 100
Jazz 100
Metal 100
Pop 100
Reggae 100
Rock 100

The features serve as input to our neural network to classify songs into genres. First,
each feature is normalized to [0, 1] to reduce processing time and increase accuracy.
Since the dataset is small, a simple deep neural network (DNN) was developed to
classify genres.
To train the DNN, the dataset was split 70:30 for training and testing.
Based on experimentation, ReLU is used instead of the sigmoid function, leading to
faster convergence, and the Adam optimizer gave the best results. The parameters of
the model are listed below.

• Three hidden layers of sizes 256, 128, and 64



• Batch size of 64
• 1000 Epoch
• Relu activation function
• Learning rate of 0.001 and momentum of 0.9
• Adam Solver

The output layer of the model gives 10 different genres; this layer is discarded after training,
and each song is then represented by a vector of 64 values.
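A minimal NumPy sketch of this forward pass shows how the 10-genre output head is dropped so the 64-unit hidden activation serves as the song vector; the weights are random placeholders, and the 30 input features are an assumed, illustrative count rather than the paper's exact feature dimensionality:

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(x):
    return np.maximum(x, 0.0)

def layer(n_in, n_out):
    """Random weights/biases standing in for trained parameters."""
    return rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out)

# Hidden layers of 256, 128 and 64 units; output layer of 10 genres.
sizes = [30, 256, 128, 64, 10]
params = [layer(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

def song_vector(features):
    """Run only the hidden layers; the 10-way genre head is discarded."""
    h = features
    for W, b in params[:-1]:          # skip the final (output) layer
        h = relu(h @ W + b)
    return h                          # 64-dimensional song representation

x = rng.random(30)                    # one song's normalized feature vector
v = song_vector(x)
print(v.shape)
```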

4 Experimental Results/Analysis
In the emotional classification model, two different neural networks were used: the pre-trained
mobilenet_v2_130_224 (referred to as v2_net) and the created neural network,
referred to as emoCNN. The results were obtained with the k-fold cross-validation
method, and the accuracy rate was computed as the average over three folds. Table 3 shows
the accuracy for the training, validation, and testing stages.

Table 3. Accuracy data for emotional classification

Training set Validation set Testing set


V2_NET 84.7% 80.3% 80.2%
EMOCNN 88.3% 84.2% 83.7%

With its higher accuracy, the EmoCNN model was chosen to produce the
valence-arousal values for the final recommendation model. Higher accuracy could likely be
achieved with a larger and more balanced dataset than the one used, but that is left
for further research. On the other hand, the neural network used to classify songs
by genre performed better. The root mean square error (RMSE) was used to evaluate
the model. It is defined by

$RMSE = \sqrt{\frac{\sum_{i=1}^{N} (r_i - r_i')^2}{N}}$ (10)
where $r_i$ and $r_i'$ are the true and predicted ratings, respectively. Overall, this genre
classification model attained an RMSE of 0.53. The accuracy was lower than expected;
however, considering that genre classifiers on similar datasets do not reach accuracies above 70%,
this is not surprising. Moreover, the accuracy observed in the model satisfied the requirement
for genre predictions.
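Equation 10 in code, with small illustrative rating vectors:

```python
import numpy as np

def rmse(true, pred):
    """Root mean square error between true and predicted ratings (Eq. 10)."""
    true, pred = np.asarray(true, dtype=float), np.asarray(pred, dtype=float)
    return float(np.sqrt(np.mean((true - pred) ** 2)))

print(rmse([3, 4, 5, 2], [2.5, 4.5, 4.0, 2.0]))  # ≈ 0.6124
```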
With both models built and tested, a vector combining the valence and arousal values
and the 64 latent values from genre classification is made. This vector is attached to each song
and will be used to recommend songs once a collaborative filtering model is employed.
This is to be done in future research. Overall, the accuracy of the recommendations
should be high based on the results of the emotional and feature classification.

Based on previous research by the authors, a k-means clustering model with
cosine similarity can also be used to recommend songs to a user. The results will be
compared using criteria such as precision, hit rate, and recall to determine the better
recommendation model.

5 Limitations and Challenges


Representing music visually always leads to a loss of detail. Since music is perceived
through a different medium, some details do not translate well visually, so using spectrograms
to visualize music has limitations. This limitation is reduced in our research
with the use of Mel-spectrograms combined with MFCCs and other details; however, it is
important to note that most of these methods were developed for speech recognition,
not for music. Due to this fact, there is a cap on music recommendation performance.
Also, since a CNN is used to classify images, the patterns recognized will not be as
accurate as patterns observed directly in audio. This can be seen in complex music: when a
particular frequency is observed in a spectrogram, it might not be a single sound but an
amalgamation of multiple sounds or interactions between sound waves. This makes it
difficult to identify simultaneous sounds in spectrograms.
A combination using a Mel-spectrogram and other methodologies was used to reduce
this limitation as shown in the above sections. Further research to address the limitations
of the spectrogram in order to create a machine learning system that uses audio instead
of visual methodologies is being done by the authors. Lastly, a challenge faced was the
limited availability of datasets for music recommendation research. Most datasets are
limited by size (FMA) or do not contain the audio signals of the songs used (Million
Song Dataset). A better choice involved creating a new database with valence-arousal
values, audio data, and audio features including genres. The creation of
this database increased the accuracy of the prediction system.

6 Conclusion
The influence music has on the population is immense and cannot be overlooked;
therefore, examining music and finding out why particular features influence our musical
taste helps provide new ways to understand music. This paper describes preliminary
research done on creating a hybrid music recommendation system. Neural networks
are leveraged to get emotional values and classify genres to extract features. This model
differs from other approaches by incorporating emotion as a context and adding content
and collaborative filtering methods to recommend songs. Emotions are extracted using
Mel-spectrograms and represented using the valence-arousal scale, songs are classified
into genres using audio data, and the genres are used to derive interesting song features. Due
to the limitations of the dataset for genre classification, more research is necessary to
create a better feature database and use it for analysis. By doing so, a greater variety of genres,
tracks, and features will be incorporated into the dataset. This will lead to better
recommendation model designs that classify music accurately and provide benefits to the
learning and teaching of music. In addition, the prototype model will be tested by people
to identify errors, as accuracy alone is not an exact predictor of human recommendations.

By creating a model like this and integrating it into music recommendation systems, the
digital profiling and therapy effects of the model can be actualized.

References
1. Chou, P.-W., Lin, F.-N., Chang, K.-N., Chen, H.-Y.: A simple score following system for
music ensembles using chroma and dynamic time warping. In: Proceedings of the 2018 ACM
on International Conference on Multimedia Retrieval, pp. 529–532 (2018)
2. Luo, S.: Intro to Recommender System: Collaborative Filtering. Towards Data Sci-
ence (2018). https://towardsdatascience.com/intro-to-recommender-system-collaborative-filtering-64a238194a26
3. Hassen, A.K., Janßen, H., Assenmacher, D., Preuss, M., Vatolkin, I.: Classifying music genres
using image classification neural networks. In: Archives of Data Science, Series A (Online
First), 5(1), 20. KIT Scientific Publishing (2018)
4. Cano, P., Koppenberger, M., Wack, N.: Content-based music audio recommendation. In:
Proceedings of the 13th Annual ACM International Conference on Multimedia, pp. 211–212
(2005)
5. Yoshii, K., Goto, M., Komatani, K., Ogata, T., Okuno, H.G.: Hybrid collaborative and
content-based music recommendation using probabilistic model with latent user preferences.
In: ISMIR, pp. 296–301 (2006)
6. Wang, X., Wang, Y.: Improving content-based and hybrid music recommendation using deep
learning. In: Proceedings of the 22nd ACM international conference on Multimedia, pp. 627–
636 (2014)
7. Mandapaka, J.S., Omowonuola, V., Kher, S.: Estimating musical appreciation using neural
network. In: Proceedings of the Future Technologies Conference FTC 2021, pp. 415–430.
Springer, Cham (2021). https://doi.org/10.1007/978-3-030-89880-9_32
8. Malik, M., Adavanne, S., Drossos, K., Virtanen, T., Ticha, D., Jarina, R.: Stacked convolutional
and recurrent neural networks for music emotion recognition. arXiv preprint arXiv:
1706.02292 (2017)
9. Aljanaki, A., Yang, Y.-H., Soleymani, M.: Developing a benchmark for emotional analysis
of music. PloS one 12(3), e0173392 (2017)
10. Akella, R.: Music Mood Classification Using Convolutional Neural Networks. San Jose State
University, Master’s project (2019)
11. O’Shaughnessy, D.: Speech Communication. Addison Wesley, Human and Machine (1987)
12. Roberts, L.: Understanding the Mel Spectrogram. Medium (2020)
13. Roberts, L.: Understanding the Mel Spectrogram. Medium (2020). https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53. Accessed 21 Jun 2022
14. Soleymani, M., Caro, M.N., Schmidt, E.M., Sha, C.-Y., Yang, Y.-H.: 1000 songs for emotional
analysis of music. In: Proceedings of the 2nd ACM international workshop on Crowdsourcing
for multimedia, pp. 1–6 (2013)
15. McFee, B., Raffel, C., Liang, D., Ellis, D.P., McVicar, M., Battenberg, E., Nieto, O.: librosa:
audio and music signal analysis in Python. In: Proceedings of the 14th Python in Science
Conference, pp. 18–25 (2015)
16. Spotify: Web API Reference | Spotify for Developers (2019). https://developer.spotify.com/documentation/web-api/reference/
17. Defferrard, M., Benzi, K., Vandergheynst, P., Bresson, X.: FMA: a dataset for music analysis.
arXiv preprint arXiv:1612.01840 (2016)
18. Olteanu, A.: GTZAN Dataset – Music Genre Classification. Kaggle.com (2019). https://www.kaggle.com/andradaolteanu/gtzan-dataset-music-genre-classification
Development of Portable Crack Evaluation
System for Welding Bend Test

Shigeru Kato1(B) , Takanori Hino1 , Tomomichi Kagawa1 , and Hajime Nobuhara2


1 Niihama College, National Institute of Technology, Niihama-City, Japan
{s.kato,t.hino,t.kagawa}@niihama-nct.ac.jp
2 University of Tsukuba, Tsukuba-City, Japan

[email protected]

Abstract. This paper describes a system to evaluate crack severity in welding
bend test fragments. The examination in the welding qualification test in Japan is
conducted by human visual inspection, and the burden of this inspection is a concern. The
authors constructed equipment to photograph the fragment specimens under
stable optical conditions. The proposed system is also designed to be portable,
to assist the evaluator in the field. We employed Resnet18 to evaluate the given
image. The image input layer of the original Resnet18 was remodeled from 224-
by-224 to 500-by-500 to capture crack features in detail. The output layer was
replaced with three classification nodes, "Bad," "Good," and "Neutral,"
expressing the crack severity levels. Experiments showed that 83% accuracy was
obtained, confirming that the CNN adequately captured the surface crack conditions.
The experimental details, remarks, and future work are discussed.

Keywords: Welding · Bend test · CNN · Machine learning · Machine vision

1 Introduction
In developed countries, skilled welding technicians are in short supply. This issue leads to
a deficit of veterans available to nurture young welders. Given this issue, several studies investigate
nurturing methods and construct educational systems for welding [1–3]. In Japan, the
welding qualification is certified by passing a practical examination stipulated by the JWES
(Japanese Welding Engineering Society) policy [4]. A veteran judge conducts
visual inspection of the bend test fragments shown in Fig. 1. As shown in Fig. 1, a
welding plate made by a beginner is cut to a specific width and then bent at the welding joint
[5].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 133–144, 2023.
https://doi.org/10.1007/978-3-031-18461-1_9
134 S. Kato et al.

Fig. 1. Schematic of forming a weld bend test fragment.

As Fig. 2 illustrates, the veteran (skilled welder) visually inspects the surface cracks
of the bend test fragments. However, the evaluators must carefully check many bend
test specimens, and this burden is a concern.

Fig. 2. Visual inspection of bend test fragments.

This study aims to build an automatic evaluation system using a Convolutional
Neural Network (CNN) model [6] to address the above problem. The schematic of the
proposed system is shown in Fig. 3.

Fig. 3. Schematic of proposed system.

The input to the CNN is the specimen's image, and the CNN outputs the classification
result as "Bad", "Good", or "Neutral". Such a system will support the evaluators' decisions.
Therefore, we intend to construct the system in a compact size so that we can carry
it to the inspection site. If a computer judges "Bad" and "Good" specimens
correctly, the evaluator's workload is reduced. As a result, the evaluator can
concentrate on evaluating "Neutral" specimens.
Various CNN models have been developed, such as VGG [7], GoogLeNet [8], Resnet
[9], Densenet [10], and EfficientNet [11]. These CNNs achieve high accuracy in the task
of classifying 1000 different object classes in RGB color images. Therefore, these models
are compatible with the bend test fragment's RGB color image. As our initial attempt,
we chose Resnet [9] in the present paper because its structure is the simplest among Directed
Acyclic Graph (DAG) networks [8–11]. By contrast, VGG [7] is a classical style
compared with DAG networks because all its layers are arranged one after the other.
We found excellent achievements in evaluating welding defects using CNNs [12–17].
However, our study differs from other similar studies in that it focuses on the evaluation
of specific crack severity features on the bend test fragment.

2 System Construction

We constructed equipment to photograph specimens under stable optical conditions. In
the following subsections, we describe the portable equipment used to photograph the
specimens. The photographed specimen is extracted from the background using simple
image processing based on color markers. The extracted specimen image is then evaluated
by a CNN customized to fulfill our goal. We employed the ResNet-18 CNN [9] in the present paper.

2.1 Equipment to Photograph Specimens

Figure 4 shows the equipment used to photograph bend test fragments. The digital camera
is fixed to the top end of the arm frame of the photographing stand. The photographing
stand is placed inside a compact box to keep the optical conditions stable. The optical box
can be folded and is portable. Through the hole in the ceiling of the optical box, we can
confirm the camera's view. LED lamps are attached to the ceiling to maintain stable optical
conditions.

Fig. 4. Photographing stage is inside the shooting box to take the bend test fragment’s picture
using the fixed digital camera on the top of the arm.

As shown in Fig. 5, the specimen is placed at the center of the markers so that only
the specimen part is extracted. The extracted image is input to the CNN.

Fig. 5. Image processing to extract the bent welding part.

The bend test fragment comes from the roller bending test defined by JIS (Japanese
Industrial Standards) [18]. The plate thickness is mainly 9 mm; however, a few 10 mm
specimens are included. The material is SS400 (steel) as specified by JIS. Note that we
photographed bend test fragments made from flat plates or pipes; the fragments are parts
of those flat plates or pipes, bent either backward or faceward. Since the focus of the
present study is on surface crack severity, the thickness and bending direction are not important.

2.2 Proposed CNN

Figure 6 shows the structure of the proposed CNN based on ResNet-18 [9]. The image
input layer of the original ResNet-18 has [224, 224, 3] resolution, and the output layer
comprises 1000 classification nodes. In contrast, the proposed CNN is remodeled so that
the image input layer has [500, 500, 3] resolution and the output layer has three
classification nodes representing the probability of "Bad", "Good", and "Neutral" crack
severity. The class with the highest probability becomes the classification result. Before
training in the following experiment to validate the proposed CNN, the kernel values of
all convolutional layers were the same as those of the original ResNet-18.
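Because only the input and output layers are changed, the effect of the larger input can be estimated from the standard convolution output-size formula, floor((n + 2p - k)/s) + 1. The sketch below is our own illustration (not the authors' code), assuming the standard ResNet-18 layout: a 7x7 stride-2 stem convolution, a 3x3 stride-2 max pool, and three stride-2 stage transitions. It shows that a 500x500 input yields a 16x16 final feature map instead of 7x7; the global average pooling before the output layer then still produces a fixed-size vector.

```python
# Sketch: feature-map sizes of a ResNet-18-style network for two input sizes.
# Assumes the standard ResNet-18 downsampling layout (an assumption of this
# illustration; the paper does not list layer hyperparameters explicitly).

def conv_out(n, kernel, stride, pad):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * pad - kernel) // stride + 1

def resnet18_feature_map(n):
    n = conv_out(n, 7, 2, 3)   # stem convolution, stride 2
    n = conv_out(n, 3, 2, 1)   # max pooling, stride 2
    for _ in range(3):         # stages 2-4 each downsample by 2
        n = conv_out(n, 3, 2, 1)
    return n

print(resnet18_feature_map(224))  # 7  (original input size)
print(resnet18_feature_map(500))  # 16 (proposed input size)
```

The larger 16x16 map is what lets the customized network see crack details at finer resolution before pooling.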

Fig. 6. Proposed CNN to classify crack severity.

3 Experiment

We visited JWES in Niihama City, Japan, to photograph bend test fragment pictures
using the portable photographing equipment on 17 Feb 2022 and 23 Mar 2022. All
bend test fragment images were then extracted automatically along the pink
markers from the pictures, as shown in Fig. 7. All images were converted to
[500,500] resolution RGB images compatible with the proposed CNN input layer size.
In total, 105 pictures were obtained.

Fig. 7. Examples of extracted images.

The images were classified as "Bad", "Good", or "Neutral" depending on the
crack severity level. Figure 8 shows classification examples. The numbers of "Bad",
"Good", and "Neutral" images were 29, 59, and 17, respectively (105 = 29 + 59 + 17).
140 S. Kato et al.

Fig. 8. Classification depending on crack severity.

To validate the CNN, we carried out repeated random subsampling validation, similar
to bootstrap validation [19, 20], which is reliable even when only a small amount of data
is available. First, 5 "Bad" and 5 "Good" images were taken out at random from all 105
images comprising 29 "Bad", 59 "Good", and 17 "Neutral" images (105 = 29 +
59 + 17). The 10 withheld images (10 = 5 "Bad" + 5 "Good") were left out for
testing CNN accuracy. The remaining 95 (95 = 105 - 10) images were used for
training the CNN. The training was performed under the conditions shown in Table 1.
After training the CNN, the 10 images not used for training were input into the trained CNN
to validate its accuracy. This routine was performed over 20 trials, as Table 2 shows.
Table 2 enumerates the test data set and accuracy for each trial.
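The validation routine above can be summarized in a few lines. The sketch below is our own illustration, not the authors' code: `train_and_evaluate` is a hypothetical stand-in for the CNN training and testing step, and image IDs are represented as simple tuples.

```python
# Repeated random subsampling validation: per trial, hold out 5 "Bad" and
# 5 "Good" images for testing and train on the remaining 95 images.
import random

def subsampling_validation(bad_ids, good_ids, neutral_ids,
                           train_and_evaluate, trials=20, rng=random):
    accuracies = []
    for _ in range(trials):
        held_out = rng.sample(bad_ids, 5) + rng.sample(good_ids, 5)
        training = [i for i in bad_ids + good_ids + neutral_ids
                    if i not in held_out]
        accuracies.append(train_and_evaluate(training, held_out))
    return sum(accuracies) / trials  # mean accuracy over all trials

# 29 "Bad", 59 "Good", and 17 "Neutral" images, as in the experiment
bad = [("bad", i) for i in range(29)]
good = [("good", i) for i in range(59)]
neutral = [("neutral", i) for i in range(17)]
```

Each trial therefore trains on 95 images and tests on 10, matching the split described in the text.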
As Table 1 shows, the images were augmented with random left-right and up-down
reflections while training the CNN to avoid overfitting. Figure 9 shows the loss and accuracy
changes for all 20 trials. Since the loss and accuracy improved as the iterations proceeded,
the training went well. Figure 10 shows the accuracy in each trial and the confusion matrix
for all 20 trials.
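The reflection augmentation of Table 1 can be expressed compactly. The following pure-Python sketch is our own illustration (the paper's experiments were run in MATLAB): an image is modeled as a list of pixel rows, and each reflection is applied independently with probability 0.5.

```python
# Random left-right and up-down reflection augmentation, as in Table 1.
import random

def augment(image, rng=random):
    if rng.random() < 0.5:                     # left-right reflection
        image = [row[::-1] for row in image]
    if rng.random() < 0.5:                     # up-down reflection
        image = image[::-1]
    return image
```

Because a crack's severity does not depend on its orientation, both flips produce equally valid training samples.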

Table 1. CNN training condition.

Method/Value
Solver SGDM (Stochastic Gradient Descent with Momentum)
Learn Rate 10−4
Max Epochs 150
Mini batch Size 32
Total Iterations 300
Augmentation Left-Right and Top-Down Reflection
CPU Intel core i9 10980XE
Main Memory 98 GB
OS Windows 10 64bit
Development Language MathWorks, MATLAB (R2022a)
GPU Nvidia RTX A6000 (VRAM 48GB, 10752 cuda cores)

Fig. 9. Training loss and accuracy.



Fig. 10. Accuracy in test data set.

Table 2. Test data set and accuracy of CNN in each trial.

Trial Test for “Bad” image ID Test for “Good” image ID Accuracy
(Bad 1–29) (Good 1–59)
1 4 9 12 24 26 21 47 49 53 58 0.8000
2 1 9 13 26 29 10 16 23 24 38 0.9000
3 9 18 23 27 28 10 15 36 40 49 0.9000
4 1 7 11 16 23 13 14 19 47 57 0.8000
5 6 14 17 27 29 2 6 23 36 40 0.7000
6 2 4 10 17 18 1 5 24 51 59 0.9000
7 7 9 17 18 28 5 18 24 38 41 1.0000
8 7 15 19 23 27 13 18 24 41 46 0.6000
9 3 7 13 23 29 10 30 38 58 59 0.9000
10 2 14 20 27 28 33 35 42 50 56 0.7000
11 9 16 18 22 23 1 17 23 45 47 0.9000
12 10 17 20 24 29 11 18 28 38 49 0.9000
13 14 15 17 26 28 41 45 54 56 57 0.8000
14 4 6 10 25 26 4 24 38 52 56 1.0000
15 6 16 18 27 29 25 29 32 46 51 0.8000
16 14 17 18 21 28 1 24 39 53 56 0.8000
17 3 16 22 24 27 3 21 36 52 55 0.8000
18 1 4 7 24 29 7 10 17 40 58 0.8000
19 6 12 21 23 26 19 26 34 37 49 0.9000
20 6 7 8 13 14 6 15 17 29 58 0.7000
- - - Average 0.8300

The mean accuracy was 83% in total. Therefore, the proposed CNN captured the
crack severity features. However, 22 "Bad" images were misjudged as "Good," as
Fig. 10(b) shows. In the future, we will experiment not only with the ResNet-18 CNN but
also with various high-performance CNNs such as VGG [7], GoogLeNet [8], DenseNet
[10], and EfficientNet [11].

4 Conclusions

This paper proposed an automatic evaluation method for welding bend test crack
severity to assist human visual inspection. We constructed equipment to photograph
the fragment specimens under stable optical conditions and employed the ResNet-18
CNN to evaluate the resulting images. The input layer of ResNet-18 was customized
from 224-by-224 to 500-by-500 to capture the crack features in more detail. The output
layer consists of three nodes expressing the crack severity levels "Bad", "Good",
and "Neutral". In the experiment, the mean accuracy was 83%. In the future, we will
experiment with other CNNs and compare their results.

Acknowledgments. The authors would like to thank Ueno of MathWorks for technical advice.
This work was supported by a Grant-in-Aid from JWES (The Japan Welding Engineering Society).

References
1. Asai, S., Ogawa, T., Takebayashi, H.: Visualization and digitation of welder skill for education
and training. Welding in the world 56, 26–34 (2012)
2. Byrd, A.P., Stone, R.T., Anderson, R.G., Woltjer, K.: The use of virtual welding simulators
to evaluate experimental welders. Weld. J. 94(12), 389–395 (2015)
3. Hino, T., et al.: Visualization of gas tungsten arc welding skill using brightness map of backside
weld pool. Trans. Mat. Res. Soc. Japan 44(5), 181–186 (2019)
4. The Japan Welding Engineering Society. http://www.jwes.or.jp/en/. Accessed 28 Mar 2022
5. Wan, Y., Jiang, W., Li, H.: Cold bending effect on residual stress, microstructure and mechan-
ical properties of Type 316L stainless steel welded joint. Engineering Failure Analysis
117,104825 (2020)

6. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document
recognition. Proc. IEEE 86(11), 2278–2324 (1998)
7. Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image
Recognition. arXiv preprint arXiv:1409.1556 (2014)
8. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778
(2016)
10. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely Connected Convolutional
Networks. In: CVPR, 1(2), p. 3 (2017)
11. Tan, M., Le, Q.V.: EfficientNet: rethinking model scaling for convolutional neural
networks. arXiv preprint arXiv:1905.11946 (2019)
12. Park, J.-K., An, W.-H., Kang, D.-J.: Convolutional neural network based surface inspection
system for non-patterned welding defects. Int. J. Precision Eng. Manufacturing 20(3), 363-374
(2019)
13. Dung, C.V., Sekiya, H., Hirano, S., Okatani, T., Miki, C.: A vision-based method for crack
detection in gusset plate welded joints of steel bridges using deep convolutional neural
networks. Automation in Construction 102, 217-229 (2019)
14. Zhang, Z., Wen, G., Chen, S.: Weld image deep learning-based on-line defects detection
using convolutional neural networks for Al alloy in robotic arc welding. J. Manuf. Process.
45, 208–216 (2019)
15. Dai, W., et al.: Deep learning assisted vision inspection of resistance spot welds. J. Manuf.
Process. 62, 262–274 (2021)
16. Abdelkader, R., Ramou, N., Khorchef, M., Chetih, N., Boutiche, Y.: Segmentation of x-ray
image for welding defects detection using an improved Chan-Vese model. Materials Today:
Proceedings 42(5), 2963–2967 (2021)
17. Zhu, H., Ge, W., Liu, Z.: Deep learning-based classification of weld surface defects. Appl.
Sci. 9(16), 3312 (2019)
18. The Japanese Industrial Standards Committee. https://www.jisc.go.jp/eng/index.html.
Accessed 29 Mar 2022
19. Priddy, K.L., Keller, P.E.: Artificial Neural Networks - An Introduction, Chapter 11, pp. 101–
105. Dealing with Limited Amounts of Data. SPIE Press, Bellingham, WA, USA (2005)
20. Ueda, N., Nakano, R.: Estimating expected error rates of neural network classifiers in small
sample size situations: a comparison of cross-validation and bootstrap. In: Proceedings of
ICNN’95 - International Conference on Neural Networks, 1, pp.101–104 (1995)
CVD: An Improved Approach of Software
Vulnerability Detection for Object
Oriented Programming Languages Using
Deep Learning

Shaykh Siddique1(B) , Al-Amin Islam Hridoy2 , Sabrina Alam Khushbu2 ,


and Amit Kumar Das2
1
Prairie View A&M University, Prairie View, TX 77446, USA
[email protected]
2
East West University, Dhaka, Bangladesh

Abstract. Software vulnerability poses a significant security threat to
the ongoing expansion of the digital revolution. With increasing
numbers of software products and vulnerabilities, detecting vulnerabilities
accurately is a substantial challenge. Various static and deep learning
approaches have been developed to make the task more manageable, but detection
accuracy remains a significant concern. In this paper, we introduce the
Common Vulnerability Detector (CVD), a deep learning-based vulnerability
detection system that can analyze source code written in Object-Oriented
Programming (OOP) languages and detect vulnerabilities with high accuracy.
To achieve this, we implemented a highly optimized Convolutional Recurrent
Neural Network (CRNN) for source code analysis. By applying this model to a
SARD dataset of C# source codes, CVD successfully detected six common and
dangerous vulnerability types with an accuracy of 96.10% and an F1 score of
96.40%. We compared CVD with the known and popular methods, and CVD
outperformed all of them. According to this performance and these results,
our proposed CVD model is a promising step in vulnerability detection.
Furthermore, this model can be a stepping stone toward something revolutionary
in the world of vulnerability detection.

Keywords: Software vulnerability · Software security · Convolutional


recurrent neural networks · Code analyzer · Deep learning

1 Introduction

Software vulnerabilities are a huge issue nowadays. Hacking, cracking, malware
attacks, data leaks, cyberattacks, and many other major security incidents can
stem from vulnerabilities in software. So, detecting the vulnerabilities of software
is a must. But detecting all the vulnerabilities by hand is an arduous and
time-consuming task. In 2010, about 4,600 common vulnerabilities were uniquely indexed
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 145–164, 2023.
https://doi.org/10.1007/978-3-031-18461-1_10
146 S. Siddique et al.

by Common Vulnerabilities and Exposures (CVE) [1]. By 2016, the number had risen
to about 6,500 [1]. An automatic detection system can reduce this effort.
So, machine learning and deep learning approaches are being used in this area
to solve the problem. Several works on vulnerability detection use the deep
learning approach, but there are still limitations in detection performance.
Manual human labor and high false-negative rates are also significant
drawbacks of existing solutions. Despite the effort expended in the pursuit of
safe programming, software vulnerabilities remain, and will continue to be, a
significant concern. Object-Oriented Programming (OOP) languages are now popular
among all types of developers. Much of the software we see in our daily
lives is made with an Object-Oriented Programming (OOP) language.
There are many popular Object-Oriented Programming (OOP) languages, such as
C# (C-Sharp), Java, Python, Ruby, PHP, C++, VB.NET, and JavaScript.
One of the core objectives of OOP languages is data security and encapsulation.
A single line of vulnerable code can ruin a software product's entire security,
causing various kinds of damage to the user. For example, Banshee, a famous
cross-platform media player, was written in the OOP language C# [36]. It was the
default media player in the Linux Mint operating system for a long time. It had
a vulnerability, CVE-2010-3998, in version 1.8.0 and earlier that allowed any local
user to gain privileges using a Trojan horse [33]. Banshee was later replaced by
Rhythmbox, which also had a vulnerability, CVE-2012-3355, in version 0.13.3
and earlier that allowed local users to execute arbitrary code via a symlink
attack [31]. So, for the safety of users, vulnerability detection is a crucial
task. It needs to be done so that users know what they are using and developers can
understand what they are missing. It should also be done so that a large number of
vulnerabilities can be detected in a short time without much manual
human labor.
We introduce CVD (Common Vulnerability Detector), a complete deep learning-based
model for determining vulnerabilities from the source codes
of various OOP-based programs and software. Specifically, we present a deep
learning-based model that can detect and pinpoint the common vulnerabilities
of OOP source code without any manual labor, with high accuracy and a low
false-negative rate. The code analyzers used in deep learning models so far are
mainly based on text classifiers, which are not fully effective
for this particular purpose. We present a model that works directly with a code
analyzer that analyzes codes using a code classifier. The major contributions
made in this paper are summarized below:

• A fully optimized OOP parser is applied to the OOP languages. All OOP languages
have a broadly similar structure, so this OOP parser can break any OOP source
code into small pieces and tokenize them so that they can be processed accordingly.
• The Common Vulnerability Detector (CVD) is designed as an intelligent vulnerability
detector that detects common vulnerabilities in any OOP source code
with high accuracy and a low false-negative rate. To do so, we present a fully
CVD - Common Vulnerability Detector 147

optimized convolutional recurrent neural network (CRNN) capable of analyzing
the source codes.
• We compare the Common Vulnerability Detector (CVD) model with other
existing machine learning algorithms to show CVD's performance and optimization.

The rest of the article is organized as follows. The most related works are in Sect. 2.
The details of the dataset are in Sect. 3. Section 4 describes our
research, and the deep learning-based models' methods are explained in Subsect.
4.2, where our proposed CVD model with the convolutional recurrent neural
network is presented. The performance, results, and accuracy are evaluated in
Sect. 5, and in Sect. 6 we finish with conclusions.

2 Related Works
Several experiments have been conducted on source code analysis. They fall into two
main branches: static analysis tools and dynamic machine learning-based analysis.
Static Application Security Testing (SAST) [20], Attackflow [4], and dotTEST by
Parasoft [3] are examples of static source code vulnerability analysis tools.
Using machine learning algorithms for source code analysis and bug/vulnerability
detection is a fairly new concept in computer science. TAP is a token- and deep
learning-based source code analysis model for PHP [16]. Its authors designed a
custom tokenizer and a new method to perform data flow analysis. They used deep
learning technology (LSTM) to solve the problem, and TAP became the only machine
learning model dealing with 7 vulnerability categories of PHP. To achieve their
goal, they collected their dataset from SARD, with 42,212 total samples comprising
29,258 good and 12,954 vulnerable samples. That dataset contains 12 categories of
CWE vulnerabilities. After tokenizing all the source codes, they used Word2vec,
which transforms the tokens into a vector space using a neural network. In the main
experiment, they used LSTM over RNN because LSTM is an improved version of RNN
that deals with long-term dependencies, and thus LSTM outperforms RNN. They then
used all the created vector spaces as the input of the LSTM layer, and
the output layer decoded the vulnerability labels. Though TAP achieved good
results, it is not perfect. Many targeted labels from CWE-862 were unspecified
in the results, and nearly all targeted labels from CWE-95 were unidentified.
The lack of samples in these categories may be a potential explanation.
Yongcheng Liu and his team proposed a new model that uses neural networks
to embed vulnerability-related text and derive its implicit meaning [17]. Feature
merging, support for text features, and other implicit features
are the significant components of the proposed model for vulnerability classification.
To obtain a more accurate tacit representation of vulnerabilities, the concerned
text is lemmatized, stop words are eliminated, and embedding neural networks are
used. The text feature passed through the neural network, the embedded
representation, or the classified result of the trained model determines whether a bug
is exploitable or not. Finally, the obtained results are joined with other robust
features, and a two-class ML model is used to detect the possibility of attackers
exploiting vulnerabilities. Open-source intelligence data from different databases
such as NVD, SecurityFocus, and Exploits are used to extract the fastEmbed system's
characteristics. Prior work on predicting exploitation has been
replicated, and the replication is used as a benchmark. This model beats
the baseline model in both generalization and prediction
ability by training the vulnerability-related text classification on highly biased
datasets. Moreover, this model also beat the baseline model's performance for
predicting vulnerabilities with a 33.577% improvement [17].
A system named 'VulDeePecker', short for Vulnerability Deep Pecker, was built [28].
The first vulnerability dataset ever was presented there. VulDeePecker was later able
to find four new vulnerabilities in three different software programs that were not
listed in the National Vulnerability Database (NVD). To accomplish this, the
Recurrent Neural Network (RNN), Bidirectional Recurrent Neural Network (BRNN),
and Gated Recurrent Unit (GRU) were ruled out because Long Short-Term Memory (LSTM)
outperforms all of them; Bidirectional LSTM (BLSTM) was chosen because a
unidirectional network is not sufficient in this case. The process assumes that
the programs' source codes are available and written in C/C++. To train the BLSTM,
the library/API calls and similar code slices are first extracted and turned
into code gadgets. Code gadgets are groups of code lines or instructions
that are matched with one another. The code gadgets are then labeled 0 or 1,
indicating whether they contain a vulnerability. After that, the code gadgets are
transformed into symbolic representations and then converted into vectors. The
vectors are the input of the BLSTM neural network. The program was tested
with 19 well-known C/C++ products. After testing, VulDeePecker detected 4
new vulnerabilities in 3 different known products that were not listed in the NVD.
There are some limitations too: it can only analyze C/C++ source code; the
implementation works only with the BLSTM, and a better approach may be possible; and
the dataset contains only buffer error and resource management error
vulnerabilities. Some Chinese researchers proposed a new automatic
vulnerability classification model named TFI-DNN [24]. This model combines
Term Frequency-Inverse Document Frequency (TF-IDF), Information
Gain (IG), and a Deep Neural Network (DNN). After applying this model to
a dataset from NVD, it acquired an accuracy of 87%; the recall is 0.82, the precision
is 0.85, and the F1-score is 0.81. Another research team presented three deep
learning models for software vulnerability detection and compared their performances
with a traditional model. The three models they used are a Convolutional
Neural Network (CNN), Long Short-Term Memory (LSTM), and a combination
of both named CNN-LSTM. After implementation, they compared these with the
traditional Multi-layer Perceptron (MLP) method [35]. The prediction accuracy
of their proposed method is 83.6%, which outperformed MLP. A study was performed
by Chakraborty et al., where they used CNN+RF, BLSTM, BGRU, and
GGNN deep learning models to build a vulnerability prediction framework [10].
Natural language processing is also used for analyzing source code [26].
Encoder-decoder, LSTM, and RNN deep learning models are operated with n-gram
tokenization. Suffix tree classification [11], graph-based feature analysis [29]
for neural learning, and vulnerability detection with code metrics [37] are also
very well known in the field of source code vulnerability detection.

3 Dataset and Preprocessing


For the supervised machine learning technique of classification, the dataset consists
of purely compilable object-oriented source code. The samples are collected from
the Software Assurance Reference Dataset (SARD), created under
National Institute of Standards and Technology (NIST) projects [6]. The
dataset contains 32,000 C# (C-Sharp) source code files following the OOP
paradigm. The source codes include real, production-level software applications
with known bugs and vulnerabilities to reflect real software behavior.
The C# test suites cover 6 classes of the Common Weakness Enumeration (CWE)
[2]. For each type of vulnerable CWE, we also have safe source codes, which helps
train the model to distinguish valid source codes from the corresponding
CWE-known vulnerable source codes. For the training and testing phases,
the samples are split in an 80:20 ratio. The dataset consists of 13,333 good samples
and 18,667 vulnerable samples.
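As a quick consistency check of the figures quoted above (our own arithmetic, not taken from the paper):

```python
# 13,333 good + 18,667 vulnerable C# samples, split 80:20 into train/test.
good_samples, vulnerable_samples = 13_333, 18_667
total = good_samples + vulnerable_samples
train_count = int(total * 0.8)
test_count = total - train_count
print(total, train_count, test_count)  # 32000 25600 6400
```

So the 80:20 split corresponds to 25,600 training samples and 6,400 test samples.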

• CWE-22: Improper limitation of a pathname to a restricted directory (Path Traversal).
• CWE-78: Improper neutralization of special elements used in an OS command (OS Command Injection).
• CWE-89: Improper neutralization of special elements used in an SQL command (SQL Injection).
• CWE-90: Improper neutralization of special elements used in a Lightweight Directory Access Protocol query (LDAP Injection).
• CWE-91: XML Injection (aka Blind XPath Injection).
• CWE-327: Use of a broken or risky cryptographic algorithm.

For each of the six CWEs, there are also corresponding safe source codes. These
invulnerable source codes are labeled with the "good" target class. So overall, we have
seven unique target classes to predict from the source codes. Each vulnerable
source code contains a different type of flaw.

Listing 1.1. Vulnerable source code sample

1  public static void Main(string[] args) {
2      string tainted_2 = null;
3      tainted_2 = Console.ReadLine();
4      // no filtering
5      string query = "SELECT * FROM '" + tainted_2 + "'";
6      string connectionString = "server=localhost; uid=mysql_user; password=mysql_password; database=dbname";
7      MySqlConnection dbConnection = null;
8      try {
9          dbConnection = new MySqlConnection(connectionString);
10         dbConnection.Open();
11         MySqlCommand cmd = dbConnection.CreateCommand();
12         cmd.CommandText = query;
13         MySqlDataReader reader = cmd.ExecuteReader();
14         while (reader.Read()) {
15             Console.WriteLine(reader.ToString());
16         }
17         dbConnection.Close();
18     } catch (Exception e) {
19         Console.WriteLine(e.ToString());
20     }
21 }

Two example source code files are shown here. Listing 1.1 is a vulnerable
source file: in line 5 there is no filtering, just direct use of the input variable in
the SQL query. It is possible to inject SQL queries here, so the sample is labeled CWE-89
(SQL Injection). Alongside the vulnerable code samples, there are also good code
samples for the corresponding CWE. Listing 1.2 shows a sample of safe code
for CWE-89. Before the SQL query is applied, the input variable is preprocessed
and all special characters are removed to make the query safe. This input-filtering
preprocessing appears in code lines 6 to 13, where all the characters that
could be used to form an SQL query are replaced.
Listing 1.2. Safe source code sample

1  public static void Main(string[] args) {
2      string tainted_2 = null;
3      string tainted_3 = null;
4      tainted_2 = Console.ReadLine();
5      tainted_3 = tainted_2;
6      string pattern = @"/^[0-9]*$/";
7      Regex r = new Regex(pattern);
8      Match m = r.Match(tainted_2);
9      if (!m.Success) {
10         tainted_3 = "";
11     } else {
12         tainted_3 = tainted_2;
13     }
14     string query = "SELECT * FROM '" + tainted_3 + "'";
15     string connectionString = "Server=localhost; port=1337; User Id=postgre_user; Password=postgre_password; Database=dbname";
16     NpgsqlConnection dbConnection = null;
17     try {
18         dbConnection = new NpgsqlConnection(connectionString);
19         dbConnection.Open();
20         NpgsqlCommand cmd = new NpgsqlCommand(query, dbConnection);
21         NpgsqlDataReader dr = cmd.ExecuteReader();
22         while (dr.Read()) {
23             Console.Write("{0}\n", dr[0]);
24         }
25         dbConnection.Close();
26     } catch (Exception e) {
27         Console.WriteLine(e.ToString());
28     }
29 }

The simpler the parser, the more general its support for different
(object-oriented structured) programming languages. In this step, all
the source codes are cleaned and comments are removed. We designed a custom source
code parser to analyze the object definitions of the source code files. We tried to
identify the structural patterns of the source code files to keep the parser simple;
each source code is parsed as a single file, meaning a single test case, with the CWE id
as the targeted class of that sample.

Fig. 1. OOP structural parser

All of the source code files share the same structure, as defined in Fig. 1.
Our object-oriented model parser unifies the tokens to extract features from
the dataset. For the method implementation section of the figure, the parser parses
word by word and removes all special characters (code syntax). The dataset
is treated as natural language for analyzing source codes with deep
learning algorithms. Using regular expressions, the OOP parser is kept simple
so that it can support OOP-based source code datasets of different languages.
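The cleaning described above can be sketched with two regular expressions. This is an illustrative approximation we wrote, not the authors' parser: it strips C#-style comments, replaces everything that is not an identifier character, and splits the remainder into word tokens.

```python
# Regex-based cleaning sketch: remove comments, drop code syntax, tokenize.
import re

def parse_source(code):
    # remove // line comments and /* ... */ block comments
    code = re.sub(r"//.*?$|/\*.*?\*/", " ", code, flags=re.S | re.M)
    # remove special characters (code syntax), keeping identifier characters
    code = re.sub(r"[^A-Za-z0-9_]+", " ", code)
    return code.split()

tokens = parse_source('string query = "SELECT * FROM x"; // build query')
print(tokens)  # ['string', 'query', 'SELECT', 'FROM', 'x']
```

The resulting token stream is what downstream steps treat as "natural language" text.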

4 Methods

Different types of classification algorithms are applied to find the most suitable
algorithm for the source code dataset. Depending on the algorithm structures,
the samples are trained and tested on both static classification prediction
algorithms and deep learning-based algorithms.

4.1 Static Classification


Some static classification algorithms are applied to analyze performance,
since this is a supervised learning problem. To make the static algorithms perform
better, we use the Natural Language Toolkit (NLTK) [30] for classification. The
most common static artificial intelligence classification algorithms are
applied here. The k-nearest neighbors algorithm [19] generally requires a large
number of costly distance computations. A decision tree [18] is a structured tree
based on different conditions, such as chances, events, and outcomes, that helps
predict from new samples. Random forests (random decision forests) are an ensemble
learning technique for classification; they correct decision trees' habit of
overfitting their training set. Random forests are sometimes used
as "black-box" models, as they produce accurate predictions over a wide variety
of data [21]. Logistic regression [23] is a statistical model that, in its basic
form, uses a logistic function to model a binary dependent variable; both
regression and classification problems can be solved using logistic regression.
Like other optimization algorithms, the Stochastic Gradient Descent (SGD)
classifier [7] shows impressive performance and smoothness on
large-scale data samples. Naive Bayes is a probabilistic classifier based on Bayes'
theorem. A Support Vector Machine (SVM) conceptually maps input vectors
non-linearly to a very high-dimensional feature space [15].

Algorithm 1: Feature Extraction for each sample of Static Classification

allsyntaxs ← word_tokenize(SourceCode);
features ← BinaryAssociativeArray(False);
foreach syntax ∈ VocabularyFrequencyMatrix do
    if (syntax ∈ allsyntaxs) then
        features[syntax] = True;
    end
end

In Algorithm 1, VocabularyFrequencyMatrix is the list of all frequently found syntax words from the source codes. The static classification algorithms are then applied to these features with the targeted classes.
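As a hedged sketch (the paper does not publish its implementation, and the tokenizer below is a simplified stand-in for NLTK's `word_tokenize`), Algorithm 1 amounts to building a binary bag-of-words dictionary over a fixed vocabulary, which is the feature format NLTK's classifiers accept:

```python
def word_tokenize(source_code):
    # Simplified stand-in for nltk.word_tokenize: split on whitespace after
    # replacing common C# punctuation so identifiers and keywords separate.
    for ch in "(){};.,=<>+-*/[]":
        source_code = source_code.replace(ch, " ")
    return source_code.split()

def extract_features(source_code, vocabulary):
    """Algorithm 1: binary bag-of-words features over a fixed vocabulary."""
    all_syntaxes = set(word_tokenize(source_code))
    # BinaryAssociativeArray(False): every vocabulary entry starts as False
    features = {syntax: False for syntax in vocabulary}
    for syntax in vocabulary:
        if syntax in all_syntaxes:
            features[syntax] = True
    return features

# Hypothetical vocabulary of frequent tokens and a toy C# snippet
vocabulary = ["SqlCommand", "ExecuteReader", "string", "Console"]
sample = 'var cmd = new SqlCommand("SELECT * FROM users WHERE id=" + id);'
print(extract_features(sample, vocabulary))
```

Feature dictionaries of this form, paired with class labels, can then be passed to classifiers such as `nltk.NaiveBayesClassifier.train`.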

4.2 Shallow to Deep


In addition to static classification, deep learning algorithms are applied to measure performance, as pointed out in Fig. 2. Here, we apply different neural network models with deep, dense output layers.
After the preprocessing described in Sect. 3, the cleaned source codes are tokenized into sequences. To make all sequence lengths equal, the sequences are padded with zeros; the padding length equals the maximum sequence length.

Fig. 2. Workflow of CVD vulnerability detector system

The dataset now consists of sequences for multiclass classification. Text classification algorithms usually place extra weight on sequence order, but a bug can appear anywhere in code, so classification cannot rely on sequence order alone. We propose an Optimized Convolutional Recurrent Neural Network model for source code analysis and name the overall system the Common Vulnerability Detector (CVD). To investigate the performance of the CVD source code analyzer, we apply four popular deep learning models: Long Short-Term Memory (LSTM) [22], a Gated Recurrent Unit (GRU), a Recurrent Neural Network (RNN) for text classification, and a Convolutional Neural Network (CNN). All five models are built with equal numbers of training parameters and similarly shaped hidden layers to keep the performance evaluation simple.
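The zero-padding step can be sketched in plain Python (an illustrative right-padding variant; Keras' `pad_sequences`, for comparison, pads at the front by default):

```python
def pad_sequences(sequences, pad_value=0):
    """Right-pad every sequence with zeros to the maximum length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

# Three tokenized source-code samples of unequal length (made-up token ids)
seqs = [[5, 12, 7], [3, 9], [8, 1, 4, 6, 2]]
padded = pad_sequences(seqs)
print(padded)  # [[5, 12, 7, 0, 0], [3, 9, 0, 0, 0], [8, 1, 4, 6, 2]]
```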

Long Short-Term Memory (LSTM)


Long Short-Term Memory is a gradient-based neural network algorithm presented by Sepp Hochreiter and Juergen Schmidhuber in 1997 [22]. The architecture in Fig. 3 uses three kinds of units: an input unit, an output unit, and a layer of memory cell blocks (here, three blocks of size 1). The LSTM can add information to, or remove information from, the cell state. The algorithm first decides which information to remove from the cell state, and then decides which new information to add to it.
To do that, a layer called the sigmoid layer (Eq. 1) first decides which values need to be updated. Then another layer, the hyperbolic tangent (tanh) layer (Eq. 2), creates a vector of new candidate values.

σ(x) = 1 / (1 + e^{−x})    (1)

tanh(x) = (e^{x} − e^{−x}) / (e^{x} + e^{−x}) = (e^{2x} − 1) / (e^{2x} + 1)    (2)

Fig. 3. Long Short-Term Memory (LSTM)

These two parts are then combined to create an update to the state, after which the output is produced. The output is based on the cell state, but in a filtered version: the sigmoid layer is run to decide which parts will be output, the cell state is put through the tanh layer, and the result is multiplied by the output of the sigmoid layer to obtain the decided output.
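As a minimal NumPy sketch of this update (illustrative only: it combines the sigmoid gate of Eq. 1 with the tanh candidate vector of Eq. 2, omits the forget and output gates as well as bias terms, and uses random placeholder weights):

```python
import numpy as np

def sigmoid(x):
    # Eq. 1: squashes values into (0, 1), acting as a soft gate
    return 1.0 / (1.0 + np.exp(-x))

def lstm_state_update(c_prev, x, W_i, W_c):
    """One simplified cell-state update: gate * candidate added to the state."""
    i = sigmoid(W_i @ x)         # input gate: which values to update
    c_tilde = np.tanh(W_c @ x)   # Eq. 2: vector of new candidate values
    return c_prev + i * c_tilde  # combine the two parts into the new state

rng = np.random.default_rng(0)
c = np.zeros(3)                  # cell state with 3 units
x = rng.normal(size=4)           # one input vector
W_i, W_c = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
c_new = lstm_state_update(c, x, W_i, W_c)
print(c_new.shape)  # (3,)
```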

Gated Recurrent Unit (GRU)


The Gated Recurrent Unit is a neural network model proposed in 2014 by Kyunghyun Cho et al. [12]. It is also a type of RNN and is very similar to the LSTM, but unlike the LSTM it maintains a hidden unit (hidden state) instead of a separate cell state. It has two gates, a reset gate and an update gate. The reset gate decides whether to ignore the previous hidden state, and the update gate decides how much information from the previous hidden state is carried over to the current hidden state. Though this hidden unit is largely motivated by the LSTM unit, it is simpler to compute and implement, and faster to train than the LSTM [12].
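The reset/update gating described above can be sketched in NumPy as follows (an illustrative single step with random placeholder weights and no bias terms; note that conventions for whether z or 1−z weights the old state vary across papers):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x, W_r, U_r, W_z, U_z, W_h, U_h):
    """One GRU step: gates blend the previous and candidate hidden states."""
    r = sigmoid(W_r @ x + U_r @ h_prev)              # reset gate
    z = sigmoid(W_z @ x + U_z @ h_prev)              # update gate
    h_tilde = np.tanh(W_h @ x + U_h @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde          # interpolate old/new

rng = np.random.default_rng(1)
hidden, inputs = 3, 4
W = lambda m, n: rng.normal(size=(m, n))             # placeholder weights
h = gru_step(np.zeros(hidden), rng.normal(size=inputs),
             W(hidden, inputs), W(hidden, hidden),
             W(hidden, inputs), W(hidden, hidden),
             W(hidden, inputs), W(hidden, hidden))
print(h.shape)  # (3,)
```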

Recurrent Neural Network (RNN)


There are many neural network models for text classification, and the Recurrent Neural Network (RNN) is one of them. RNN models are built by integrating different hidden layers, and there have been several works on text classification using RNNs [34]. The RNN text classification model used here consists of several layers: a Long Short-Term Memory (LSTM) layer with 128 neurons/cores, a Time Distributed Dense layer with 256 neurons activated by the tanh function, and another Dense layer for the output of the model, activated by softmax. Overall, inputs are given to the input layer and passed through all layers of the model in a sequential architecture. Inputs are parsed and tokenized according to the position and meaning of the words; the tokens are then passed through all layers, finishing at the output layer. Thus, the RNN serves as a persuasive text classifier.

Convolutional Neural Network (CNN)


Yann LeCun proposed the Convolutional Neural Network in the 1980s; the algorithm is mainly used for image recognition and similar tasks [5]. A CNN is composed of several layers of artificial neurons whose primary purpose is to calculate a weighted sum of their inputs and produce an activation value. A CNN usually consists of several convolution layers. When an image is fed into the model, each CNN layer generates multiple activation maps that highlight relevant features of the given image pixels. The bottom layers usually find low-level features such as horizontal, vertical, and diagonal edges. The final layer of a CNN is the classification layer, which takes the output of the final convolution layer as its input. During training, the CNN takes inputs from a large dataset and processes them with randomly initialized weights; when the output does not match the labels provided in the dataset, the model learns and repeats the process with corrections. This is how the training process runs. After several runs, the model is ready for testing on a held-out dataset, and once it achieves good accuracy on the test dataset, it is ready for real-life use. CNNs have achieved great success and reputation in supervised learning tasks such as image recognition, EMG recognition, video analysis, natural language processing [14], anomaly detection, and drug detection [27].

Optimized Convolutional Recurrent Neural Network for CVD


Figure 4 shows the six-layer optimized Convolutional Recurrent Neural Network (CRNN) for CVD. The sequences are given as input to the model's embedding layer in a three-dimensional shape; the embedding layer turns positive integers into dense vectors of fixed size. In the sequential model architecture, the output of the embedding layer feeds the inputs of the next Time Distributed layer.
Time Distributed layers are wrappers that apply a layer to every temporal slice of the input. Convolution layers with 32 kernels then extract the best features from the embedded matrix.

Fig. 4. CVD neural network model

Input = ⎡ a b c d ⎤      Kernel = ⎡ w x ⎤
        ⎢ e f g h ⎥               ⎣ y z ⎦
        ⎣ i j k l ⎦

Filtered output = ⎡ aw+bx+ey+fz   bw+cx+fy+gz   cw+dx+gy+hz ⎤
                  ⎣ ew+fx+iy+jz   fw+gx+jy+kz   gw+hx+ky+lz ⎦
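The filtered output above can be checked numerically with a small valid 2-D cross-correlation (an illustrative sketch, not the authors' code; the numeric input stands in for a…l and the kernel for w, x, y, z):

```python
import numpy as np

def conv2d_valid(inp, kernel):
    """Valid 2-D cross-correlation: slide the kernel over the input."""
    kh, kw = kernel.shape
    oh, ow = inp.shape[0] - kh + 1, inp.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(inp[i:i + kh, j:j + kw] * kernel)
    return out

inp = np.arange(12, dtype=float).reshape(3, 4)  # stands in for a..l
kernel = np.array([[1.0, 0.0], [0.0, 1.0]])     # stands in for w, x, y, z
print(conv2d_valid(inp, kernel))
# [[ 5.  7.  9.]
#  [13. 15. 17.]]
```

A 3×4 input with a 2×2 kernel yields a 2×3 output, matching the shape of the filtered-output matrix above.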
The output is resampled by the MaxPooling layer, which takes the maximum value within each window. Once we have the super-feature matrix (after the CNN is applied), we reshape it and apply a Simple RNN with 32 units to maintain the sequence order. This composite structure is what makes our model stand out: the CNN identifies the super features (the exact bug), as shown in Fig. 5, while the RNN maintains the sequence order. Another dense layer generates the output layer of predicted classes, where the kernel size equals the number of targeted classes. Finally, the output is passed through the softmax activation function to make it smooth and normalized.

Fig. 5. Super features extraction

softmax(x)_i = exp(x_i) / Σ_j exp(x_j)    (3)

In Eq. 3, softmax is a normalized multiclass function that returns the arguments of the maximum. The sigmoid function can only normalize binary classes, which is the motivation for using the softmax function here. Another useful behavior of softmax is that it makes the predicted classes probabilistic: the sum over all output classes is exactly equal to 1 (a probability vector). From the output, we can therefore quickly determine the predicted class by finding the highest probability. On each back-propagation pass, the loss is calculated and minimized. Since this is multiclass classification, the categorical cross-entropy function is used.

L(y, ŷ) = − Σ_x y(x) log ŷ(x)    (4)

In the loss function of Eq. 4, y is the actual label and ŷ is the predicted label. To minimize the loss, we use Adam [25] as the optimization algorithm for each epoch.
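Equations 3 and 4 can be checked together in NumPy (an illustrative sketch, not the paper's training code; the logits and one-hot label below are made up):

```python
import numpy as np

def softmax(x):
    # Eq. 3: subtract the max for numerical stability, then normalize
    e = np.exp(x - np.max(x))
    return e / e.sum()

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # Eq. 4: negative sum over classes of y(x) * log(y_hat(x))
    return -np.sum(y_true * np.log(y_pred + eps))

logits = np.array([2.0, 1.0, 0.1, -1.0, 0.5, 0.0])  # 6 vulnerability classes
probs = softmax(logits)
y_true = np.array([1.0, 0, 0, 0, 0, 0])             # one-hot actual label

print(round(float(probs.sum()), 6))       # 1.0: outputs sum to exactly one
print(probs.argmax() == y_true.argmax())  # True: highest probability wins
loss = categorical_cross_entropy(y_true, probs)
```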

5 Experiments and Results


To compare performance fairly, the number of training parameters is made equal across all models. To identify the contribution of our optimized convolutional recurrent neural network for source code analysis, several parameters are fixed. Because the environment strongly affects the amount of optimization that can be assessed, Table 1 describes the environment setup for this study. The required time per epoch is found to range from a minimum of 28 s to a maximum of 31 s for all the models described in Sect. 4, the Methods.
We use the statistical measures accuracy, precision, recall, and F1 score for the performance analysis of the training and testing periods. Accuracy reflects the agreement between predicted and actual output. True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) are counted to compute these measures.
Precision = TP / (TP + FP)    (5)

Recall = TP / (TP + FN)    (6)

F1 Score = 2 · (Precision · Recall) / (Precision + Recall)    (7)

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (8)
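Equations 5–8 follow directly from the four counts; the counts below are hypothetical, for illustration only:

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)                          # Eq. 5
    recall = tp / (tp + fn)                             # Eq. 6
    f1 = 2 * precision * recall / (precision + recall)  # Eq. 7
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. 8
    return precision, recall, f1, accuracy

# Hypothetical counts from a binary vulnerable/clean split
p, r, f1, acc = classification_metrics(tp=90, tn=80, fp=10, fn=20)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f} accuracy={acc:.3f}")
# precision=0.900 recall=0.818 f1=0.857 accuracy=0.850
```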

Table 1. Experimental setup of environment

Operating System: Cinnamon 4.4.5 (Linux)


CPU: Intel(R) Core(TM) i5-5200U CPU @ 2.20 GHz
Number of CPU(s): 4
Bus: 2248
Cache Memory L1d: 32KiB
Cache Memory L1i: 32KiB
Cache Memory L2: 256KiB
Cache Memory L3: 3MiB
System Memory: 8GiB

The F1 score is an important measurement in software security [32]. Performance evaluation of deep learning-based classification models is divided into two parts: the training phase and the testing phase. Cosine similarity (Eq. 9), the Area Under the Curve (AUC) of accuracy, and losses are evaluated in the training phase.
Similarity(A, B) = (A · B) / (‖A‖ ‖B‖) = ( Σ_{i=1}^{n} A_i B_i ) / ( √(Σ_{i=1}^{n} A_i²) · √(Σ_{i=1}^{n} B_i²) )    (9)

Hamming loss [9] is calculated for the test data samples. The Jaccard score, introduced by Paul Jaccard, measures the agreement between the actual and the predicted data [8]. The Cohen Kappa score [13] is also evaluated in the testing phase to find our designed model's inter-rater reliability.

5.1 Performances of Static Classification


According to the formulas for accuracy (Eq. 8), precision (Eq. 5), recall (Eq. 6), and F1 score (Eq. 7), Table 2 shows that the Decision Tree and SVM classifiers give the highest accuracies, 93.80% and 93.73% respectively.

Table 2. Classification performances for static algorithms

Algorithm Precision Recall F1 score Accuracy (%)


K Nearest Neighbors 0.858 0.842 0.810 81.20
Decision Tree 0.931 0.947 0.937 93.80
Random Forest 0.900 0.928 0.915 91.52
Logistic Regression 0.648 0.686 0.717 74.71
SGD Classifier 0.933 0.949 0.936 93.70
Naive Bayes 0.763 0.904 0.739 76.77
SVM Classifier 0.931 0.933 0.937 93.73

Static classification algorithms thus give good performance, but there are some usual barriers. The feature extraction used here is described in Subsect. 4.1 and in Algorithm 1.
• Memory limitations are a significant barrier. Every single word is treated as a feature, so as the number of training samples grows, extra memory is needed to store the binary word features.
• Time complexity is another limitation of static classification algorithms. Because source code patterns vary widely, many binary word features must be generated, and training static models with a large number of word features requires extra processing power.

5.2 Performance of Deep Learning Models


In the training phase, we evaluate the performance of the deep learning algorithms per epoch and plot the results in Fig. 7; training accuracy and loss are measured here. A learning rate of 0.001 provides the best performance, as shown in Fig. 6a. All algorithms are trained for 100 epochs. To identify the best number of units for our recurrent neural network, we test different unit counts and find the most suitable size, as shown in Fig. 6b. These parameters make our model an optimized, fast learner. Using more than 32 RNN units does not improve performance at all, but increases the number of training parameters and the required processing power, which is why our model uses 32 RNN units.

Fig. 6. Performance optimization based on the defined parameters

Table 3 compares accuracy, loss, AUC, and cosine similarity on the training samples. From Table 3 we can see that GRU has the worst accuracy, 78.23%, and CVD has the best vulnerability-detection performance at 96.64%; among the baseline algorithms, CNN comes closest. The loss shows the same pattern: CVD has the lowest loss at 0.067, GRU the highest at 0.452, and CNN, at 0.086, is nearest to our designed CVD model. The same holds for AUC and cosine similarity, where CVD is again the best of all.

Table 3. Training experimental performance for deep learning algorithms

Algorithm Loss AUC Cosine similarity Accuracy (%)


GRU 0.452 0.976 0.833 78.23
LSTM 0.137 0.998 0.957 94.30
Simple RNN 0.155 0.997 0.949 93.43
CNN 0.086 0.998 0.967 95.44
CVD 0.067 0.999 0.973 96.64

Fig. 7. Performance of training phase

Figure 7 shows that our designed convolutional recurrent neural network, the CVD model, gives the smoothest performance as a source code analyzer. Our deep learning model reaches 95% accuracy within 10 to 12 epochs, whereas the nearest algorithm, CNN, reaches 95% accuracy only after more than 70 epochs; none of the other algorithms achieve 95% accuracy within 100 epochs. Since the number of training parameters is equal for all algorithms, the epoch durations are equivalent, so the model we designed is a comparatively swift source code learner.
Table 4 shows the experimental testing performance of the deep learning algorithms. Our target is to minimize loss for better accuracy. In terms of Hamming loss, the CVD model is the lowest, at about 0.043, which means the best error minimization; GRU cannot minimize its losses as well, which lowers its performance. According to precision, recall, and F1 score, our Common Vulnerability Detector model CVD performs more strongly than all the other algorithms.

Table 4. Testing experimental performance for deep learning algorithms

Algorithm     Hamming loss   Precision   Recall   F1 score   Accuracy   Jaccard score   Cohen Kappa score
GRU 0.218 0.735 0.749 0.732 0.781 0.647 0.672
LSTM 0.067 0.899 0.960 0.926 0.932 0.866 0.902
Simple RNN 0.083 0.761 0.794 0.775 0.916 0.709 0.873
CNN 0.054 0.796 0.818 0.807 0.945 0.763 0.918
CVD 0.043 0.949 0.973 0.964 0.961 0.929 0.932

The F1 score of CVD is about 96.40%, which reflects our model's overall performance; the nearest F1 score, 92.60%, is obtained by the LSTM model. According to Cohen Kappa reliability, CNN achieves 91.80%, surpassed by CVD at about 93.20%. The recall of CVD is about 97.30%, and high recall means that an algorithm returns most of the relevant results.

Fig. 8. Normalized confusion matrix of CVD

Figure 8 is the normalized confusion matrix of predicted versus actual labels; its entries are output probabilities. The true-positive (TP) rate yields an excellent result, averaging 95.72%. The true-negative rates reflect the risks of using our model in practice; the maximum true-negative (TN) rate is about 5.6% for both CWE-22 and CWE-89. When a sample contains no bugs but is marked as vulnerable by our system, that is a false positive; the maximum false-positive (FP) rate is 1.20%, for CWE-22.

6 Conclusion and Future Work

This paper presents an effective vulnerability detection model, CVD (Common Vulnerability Detector), which detects vulnerabilities in OOP source code such as C# using deep learning algorithms like CNN and RNN. To do this, we designed an OOP parser and a code analyzer for OOP languages, and we provided a fully optimized CRNN approach for the model. The model can detect six common vulnerabilities with an accuracy of 96% in a short training time, which is higher than the other deep learning-based models we evaluated. We compared CVD with existing algorithms such as CNN, LSTM, GRU, and RNN, and CVD outperformed all of them, which makes this model superior and more effective than these existing models.
For future work, our main target is to cover more programming languages. CVD is a very straightforward model and can be used for any language with slight adjustments, but in this paper we implemented it only for C#. In the future, we look forward to adjusting CVD so that it can process all common and popular languages at once. Right now, CVD can successfully detect six common vulnerabilities from source code; we look forward to improving it so that it can detect any common vulnerability with even higher accuracy. CVD is also currently limited to analyzing source code; in further research, we aim to make CVD detect vulnerabilities directly from executable files. Finally, CVD proposes several new implementations and a completely new vulnerability-detection model with outstanding performance for source code analysis. By implementing all these future works, we hope CVD will soon be able to work in complex real-world environments.

References
1. Common Vulnerabilities Exposures (CVE) (2017). https://cve.mitre.org. Accessed
18 Oct 2020
2. Common Weakness Enumeration (CWE) (2017). https://cve.mitre.org. Accessed
18 Oct 2020
3. Efficiently Achieve Compliance With C# Testing Tools for.NET Development
(2020). https://www.parasoft.com/products/parasoft-dottest. Accessed 18 Oct
2020
4. Identify all vulnerabilities in your source code (2020). https://www.parasoft.com/
products/parasoft-dottest. Accessed 18 Oct 2020

5. Bengio, Y., LeCun, Y., Henderson, D.: Globally trained handwritten word rec-
ognizer using spatial representation, convolutional neural networks, and hidden
Markov models. In: Advances in Neural Information Processing Systems, pp. 937–
944 (1994)
6. Black, P.E.: A software assurance reference dataset: thousands of programs with
known bugs. J. Res. Nat. Instit. Stand. Technol. 123, 1 (2018)
7. Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Pro-
ceedings of COMPSTAT 2010, pp. 177–186. Springer (2010). https://doi.org/10.
1007/978-3-7908-2604-3 16
8. Bouchard, M., Jousselme, A.-L., Doré, P.-E.: A proof for the positive definiteness
of the Jaccard index matrix. Int. J. Approximate Reason. 54(5), 615–626 (2013)
9. Butucea, C., Ndaoud, M., Stepanova, N.A., Tsybakov, A.B., et al.: Variable selec-
tion with hamming loss. Ann. Stat. 46(5), 1837–1875 (2018)
10. Chakraborty, S., Krishna, R., Ding, Y., Ray, B.: Deep learning based vulnerability
detection: are we there yet. IEEE Trans. Softw. Eng. 1 (2021)
11. Chernis, B., Verma, R.: Machine learning methods for software vulnerability detec-
tion. In: Proceedings of the Fourth ACM International Workshop on Security and
Privacy Analytics, pp. 31–39 (2018)
12. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recur-
rent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
13. Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Measur.
20(1), 37–46 (1960)
14. Conneau, A., Schwenk, H., Barrault, L., Lecun, Y.: Very deep convolutional net-
works for natural language processing. arXiv preprint arXiv:1606.01781, 2 (2016)
15. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297
(1995)
16. Fang, Y., Han, S., Huang, C., Runpu, W.: TAP: a static analysis model for PHP
vulnerabilities based on token and deep learning technology. PLoS ONE 14(11),
e0225196 (2019)
17. Fang, Y., Liu, Y., Huang, C., Liu, L.: FastEmbed: predicting vulnerability exploita-
tion possibility based on ensemble machine learning algorithm. PLoS ONE 15(2),
e0228439 (2020)
18. Friedl, M.A., Brodley, C.E.: Decision tree classification of land cover from remotely
sensed data. Remote Sens. Environ. 61(3), 399–409 (1997)
19. Fukunaga, K., Narendra, P.M.: A branch and bound algorithm for computing k-
nearest neighbors. IEEE Trans. Comput. C-24(7), 750–753 (1975)
20. Guaman, D., Sarmiento, P.A., Barba-Guamán, L., Cabrera, P., Enciso, L.: Sonar-
qube as a tool to identify software metrics and technical debt in the source code
through static analysis. In: 7th International Workshop on Computer Science and
Engineering, WCSE, pp. 171–175 (2017)
21. Ho, T.K.: Random decision forests. In: Proceedings of 3rd International Conference
on Document Analysis and Recognition, vol. 1, pp. 278–282. IEEE (1995)
22. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8),
1735–1780 (1997)
23. Hosmer Jr, D.W., Lemeshow, S., Sturdivant, R.X.: Applied Logistic Regression,
vol. 398. John Wiley & Sons (2013)
24. Huang, G., Li, Y., Wang, Q., Ren, J., Cheng, Y., Zhao, X.: Automatic classification
method for software vulnerability based on deep neural network. IEEE Access 7,
28291–28298 (2019)
25. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)

26. Le, T.H.M., Chen, H., Babar, M.A.: Deep learning for source code modeling and
generation: models, applications, and challenges. ACM Comput. Surveys (CSUR)
53(3), 1–38 (2020)
27. LeCun, Y.: Deep learning & convolutional networks. In: 2015 IEEE Hot Chips 27
Symposium (HCS), pp. 1–95. IEEE Computer Society (2015)
28. Li, Z., et al.: VulDeePecker: a deep learning-based system for vulnerability detec-
tion. arXiv preprint arXiv:1801.01681 (2018)
29. Lin, G., Wen, S., Han, Q.-L., Zhang, J., Xiang, Y.: Software vulnerability detection
using deep neural networks: a survey. Proc. IEEE 108(10), 1825–1848 (2020)
30. Loper, E., Bird, S.: NLTK: the natural language toolkit. arXiv preprint cs/0205028,
cs.CL/0205028 (2002)
31. Manadhata, P.K., Wing, J.M.: An attack surface metric. IEEE Trans. Softw. Eng.
37(3), 371–386 (2010)
32. Pendleton, M., Garcia-Lebron, R., Cho, J.-H., Shouhuai, X.: A survey on systems
security metrics. ACM Comput. Surv. (CSUR) 49(4), 1–35 (2016)
33. Sharma, V.: An analytical survey of recent worm attacks. Int. J. Comput. Sci.
Network Secur. (IJCSNS) 11(11), 99–103 (2011)
34. Siddique, S., Ahmed, T., Talukder, M.R.A., Uddin, M.M.: English to Bangla
machine translation using recurrent neural network. Int. J. Future Comput. Com-
mun. 9(2) (2020)
35. Wu, F., Wang, J., Liu, J., Wang, W.: Vulnerability detection with deep learning.
In: 2017 3rd IEEE International Conference on Computer and Communications
(ICCC), pp. 1298–1302. IEEE (2017)
36. Xinogalos, S.: Studying students’ conceptual grasp of OOP concepts in two interac-
tive programming environments. In: Lytras, M.D., et al. (eds.) WSKS 2008. CCIS,
vol. 19, pp. 578–585. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-
540-87783-7 73
37. Zagane, M., Abdi, M.K., Alenezi, M.: Deep learning for software vulnerabilities
detection using code metrics. IEEE Access 8, 74562–74570 (2020)
A Survey of Reinforcement Learning
Toolkits for Gaming: Applications,
Challenges and Trends

Charitha Sree Jayaramireddy, Sree Veera Venkata Sai Saran Naraharisetti,


Mohamad Nassar, and Mehdi Mekni(B)

University of New Haven, West Haven, CT 06516, USA


[email protected]
http://laser.newhaven.edu

Abstract. The gaming industry has become one of the most exciting and creative industries. Its annual revenue has crossed $200 billion in recent years, and it has created many jobs globally. Many games use Artificial Intelligence (AI), and techniques like Machine Learning (ML) and Reinforcement Learning (RL) have gained popularity among researchers and the game development community for enabling smart games involving AI-based agents at a faster rate. Although many toolkits are available, a framework to evaluate, compare, and advise on these toolkits is still missing. In this paper, we present a comprehensive overview of ML/RL toolkits for games with an emphasis on their applications, challenges, and trends. We propose a qualitative evaluation methodology, discuss the obtained analysis results, and conclude with future work and perspectives.

Keywords: Game design and development · Artificial Intelligence ·


Machine Learning · Reinforcement Learning · Deep Learning

1 Introduction
Computer gaming has positioned itself as an important source of audiovisual education and entertainment due to its dynamism and accessibility, in addition to stimulating the imagination of players. It is a fast-growing market, showing a global revenue increase of 8.7% from 2019 to 2021 and projected to reach $218.7 billion in 2024 [38]. Many games have multiple non-player characters (NPCs), which play with the player, play against them, or take a neutral position within the game. They play an essential part in video games in increasing the player experience and should therefore be supplied with fitting behavior by creating a fitting Artificial Intelligence (AI) for them [57]. They can take multiple roles, such as providing a challenge for the player to fight against or representing a trusted ally with whom they fought many battles [62]. It is therefore important that the field of game design and development finds new ways to build their intelligence and let them play their role inside the game [64].
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 165–184, 2023.
https://doi.org/10.1007/978-3-031-18461-1_11
166 C. S. Jayaramireddy et al.

There are different AI techniques in use in modern computer games. Ever since the turn of the 21st century, various sorts of video games, online and offline, have undergone rapid changes with the development of artificial and computational intelligence [21]. The roots of AI application in game design and development can be traced back to the 1950s, when Claude Shannon (information theory) and Alan Turing (theory of computation) began to write AI logic for chess programs [50]. In 1997, the famous computer "Deep Blue", which represented the pinnacle of AI techniques, beat the chess master Garry Kasparov in a publicized match [31].
It is widely accepted that decision-making and pattern recognition are basic skills for humans; however, they can be challenging for computers. Sequential decision-making is a core topic in Machine Learning (ML), and a sequence of decisions taken to achieve a given goal in an environment gives rise to the concept of Reinforcement Learning (RL). The ability to let an AI decide on its own is a fascinating concept, and it is progressively being worked on in every field, including gaming [56].
Yannakakis and Togelius [63] identified various research areas standing out within the application of AI in the gaming field. Their work aimed to offer a higher-level overview of AI applications in gaming and focused on the interactions among these applications as well as the influences they had on each other. One critical limitation of this work is that it does not capture the recent advances in ML and RL, and hence does not provide a current source for studying AI applications in game design and development. More recently, Shao et al. [51] provided a survey of the progress of deep RL methods and compared their main techniques and properties. A major shortcoming of this study is that it focuses exclusively on deep RL and leaves the scientific community without a current state of the art of ML and RL applications specific to game design and development.
Motivated by the quality of these initiatives, this paper aims to address the
indicated limitations and presents an insight into AI implementation in game
development with an emphasis on ML and RL toolkits. It also addresses the
lack of a comprehensive evaluation framework to support the game development
community. In this study, we examine the applications of ML and RL toolkits
in gaming, their challenges, as well as their trends.
The remainder of this paper is organized as follows. Section 2 provides an overview of the evolution of the global gaming industry. Section 3 introduces the fundamental concepts of AI and its sub-fields. Section 4 details the state-of-the-art
of available ML and RL toolkits. Section 5 presents our qualitative evaluation
methodology articulated around a specific set of technical criteria. Section 6 out-
lines the key evaluation analysis findings. Finally, Sects. 7 and 8 discuss this study
and conclude with future work.

2 Overview of the Global Gaming Industry


The rise of the computer games industry dates back to the 1970s with the introduction of arcade machines and game consoles [42]. As computer components became more affordable, companies began to explore such market opportunities in game design and development [13,45]. "Video games" is a generic term for all types of digital games played and used on some type of screen. This includes arcade
of digital games, played and used on some type of screen. This includes arcade
machines, handheld devices, game consoles (i.e., Xbox, PlayStation, Game Boy),
and computer games [22]. Stanford University in the USA hosted the first gam-
ing tournament in 1972 giving rise to competitive video games [28]. Following
attempts to increase the popularity of gaming were made during the 1980s and
1990s with the organization of national tournaments and world championships.
Companies such as Atari or Nintendo used these events as a marketing tool to
promote its video games, while fostering a gaming culture [15].
During the 1990s, with the development of the internet and further multiplayer capabilities, video games experienced significant growth, making it possible not only to connect with external players but also to compete with them [44]. Further
multiplayer tournaments began proliferating, as well as the tournament organi-
zations across the globe (i.e., Cyberathlete Professional League (CPL) and the
AMD Professional Gamers League (PGL) in the USA, the Deutsche Clanliga
(DeCL) in Germany, among many others in different countries and over the
years) [46]. Asia-Pacific is easily the world’s biggest region by games revenues,
with $88.2 billion in 2021 alone, making up 50.2% of all game revenues. With
its contribution of $45.6 billion, China is by far the primary driver here. North

Fig. 1. An overview of the global gaming market [38]



America remains 2021’s second-biggest region, boasting game revenues of $42.6


billion (mainly from the U.S.) (see Fig. 1a).
The recent pandemic has had a profound impact on game development and
publishing in terms of delays, which are affecting revenues across the board in
2021-mostly on the console side but also on PC. Compared to mobile, console and
PC games tend to have bigger teams, higher production values, and more cross-
country collaborations (see Fig. 1b). There will be close to 3.0 billion players
across the globe in 2021. This is up 5.3% year on year from 2020, showcasing
that 2020’s gaming boom has led to a lasting increase in players, with room
for further growth (see Fig. 1c). Looking ahead, the global number of players
will pass the 3-billion milestone next year in 2022. This number will continue to
grow at a 5.6% of the compound annual growth rate (2015–2024) to 3.3 billion
by 2024 (see Fig. 1d).
Along with the growth of the global gaming industry and advancements in AI research, the need to tackle tough problems in existing game design and development using current benchmarks for designing, developing, and training AI models (see Fig. 1) has also increased. However, as these challenges are “solved”, the need for novel interactive environments, engaging gameplay, and smart NPCs arises. Yet, creating such environments is often time-intensive and requires specialized computational and AI domain knowledge. In the following section, we introduce the fundamental AI and related ML concepts aimed at boosting the game design and development field.

3 Artificial Intelligence Concepts


Artificial Intelligence (AI) and Machine Learning (ML) are very closely related and connected. Because of this relationship, when we study AI and ML concepts, we are really looking into their interconnection. AI is the capability of a computer system to mimic human cognitive functions such as learning and problem-solving. Through AI, a computer system uses math and logic to simulate the reasoning that people use to learn from new information and make decisions. In the following subsections, we provide an overview of Machine Learning, Reinforcement Learning, and Deep Reinforcement Learning concepts.

3.1 Machine Learning


ML is the art of making computer programs learn from experience. A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E [34]. For example, the task T can be playing checkers, the experience E is playing thousands of checkers games, and the performance P is the fraction of games won against human opponents. We can divide the learning problems into three classes:
– Learning is called supervised if the experience E takes the form of a labeled dataset (x, y). The task is to learn a function that maps x to y.

– Learning is called unsupervised if E takes the form of an unlabeled dataset. The task is to learn the underlying structure.
– Learning is called reinforced when the experience E takes the form of state-action pairs with corresponding rewards and next states. The task is to maximize future rewards over a number of time steps.

Tasks are usually described in terms of how ML should process a data item
(i.e. an example). If the desired behavior is to assign the input data item to one
category among several, this is a classification task, e.g. object recognition. Other
examples of tasks are machine translation, transcription, anomaly detection, etc.
[11,33].

3.2 Reinforcement Learning (RL)

Reinforcement Learning (RL) is particularly interesting for playing games since its task involves interaction with an environment, committing actions and receiving rewards for these actions [59]. In RL, the experience is a set of episodes. Each episode is a sequence of tuples (State, Action, Reward, Next State), the performance measure is the discounted total reward, and the task basically consists of playing (Fig. 2). A more precise description of playing is adopting a policy that maps states, or representations of states of the game, to actions. If this mapping takes the form of a neural network, a deep one in particular, we refer to Deep Reinforcement Learning (DRL).
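This episode structure can be sketched with a minimal, self-contained loop. The ChainEnv below is a hypothetical toy environment invented for illustration (it is not part of any toolkit discussed here); the loop collects (State, Action, Reward, Next State) tuples and computes the discounted total reward:

```python
import random

class ChainEnv:
    """Toy environment: walk right from state 0 to state 4 to earn a reward."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 moves right, action 0 moves left (floored at 0)
        self.state = min(self.state + 1, 4) if action == 1 else max(self.state - 1, 0)
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

random.seed(0)
env = ChainEnv()
state, done, episode = env.reset(), False, []
while not done:
    action = random.choice([0, 1])                       # random policy placeholder
    next_state, reward, done = env.step(action)
    episode.append((state, action, reward, next_state))  # (S, A, R, S') tuple
    state = next_state

gamma = 0.9   # discount factor
discounted_return = sum(gamma ** t * r for t, (_, _, r, _) in enumerate(episode))
```

A learned policy would replace the `random.choice` line with a mapping from state to action.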

Fig. 2. Classic agent-environment loop [16]



3.3 Deep Reinforcement Learning (DRL)


Given an agent that interacts with an environment through percepts (observations) and actions, the goal of reinforcement learning is to find an optimal policy π* that maximizes the expected total sum of rewards the agent receives during a run, while starting from an initial state s0 ∈ S. Usually, the performance of a given policy π is evaluated as:

eval(π | s0) = E_ρ(π) [ Σ_{t=0}^{τ} γ^t r(s_{t+1}) ] = E_ρ(π) [R0 | s0]

where γ is a discount factor, and the expectation is over all the possible runs (or traces) allowed by the policy π. R0 is the total reward for t = 0. Among the most popular algorithms to reach an optimal policy in this context are value iteration and Q-learning.
Value iteration assumes that the reward model and the transition model are known a priori or passively learned [32]. The reinforcement learning problem is then reduced to a Markov Decision Process (MDP) that can be represented using Bellman equations. Policy iteration builds on similar ideas but exploits the fact that convergence to the optimal policy happens long before the convergence of the utility values.
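As a concrete sketch, value iteration repeatedly applies the Bellman backup until the utilities converge. The 5-state chain world below is a hypothetical example with known transition and reward models, not a benchmark from the surveyed toolkits:

```python
# Value iteration on a 5-state chain: deterministic transitions,
# reward 1 for entering the terminal state 4, discount gamma = 0.9.
S, A, GAMMA = range(5), (0, 1), 0.9   # actions: 0 = left, 1 = right

def step(s, a):                        # known transition and reward models
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 4 else 0.0)

V = [0.0] * 5
for _ in range(100):                   # iterate the Bellman backup to convergence
    V = [0.0 if s == 4 else
         max(r + GAMMA * V[s2] for s2, r in (step(s, a) for a in A))
         for s in S]

# Greedy policy with respect to the converged utilities
policy = [0 if s == 4 else
          max(A, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
          for s in S]
```

On this chain the utilities converge to V = [0.729, 0.81, 0.9, 1.0, 0.0] and the greedy policy always moves right.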
Q-learning is called off-policy learning and actively learns a utility function for (State, Action) pairs by alternating exploitation and exploration actions [51]:

Q(st, at) = E[Rt | (st, at)]

and the policy is extracted as follows:

π*(s) = argmax_a Q(s, a)
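A minimal tabular sketch of this off-policy update, alternating ε-greedy exploration and exploitation on a hypothetical 5-state chain world (invented here for illustration), might look like:

```python
import random

random.seed(1)
N_STATES, ACTIONS, ALPHA, GAMMA, EPS = 5, (0, 1), 0.5, 0.9, 0.2
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):                          # toy deterministic chain dynamics
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 4 else 0.0)

for _ in range(500):                     # episodes mixing exploration/exploitation
    s = 0
    while s != 4:
        a = random.choice(ACTIONS) if random.random() < EPS \
            else max(ACTIONS, key=lambda a2: Q[(s, a2)])
        s2, r = step(s, a)
        target = r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])   # TD update toward the target
        s = s2

# Extract the policy: pi*(s) = argmax_a Q(s, a)
pi = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(4)}
```

After training, the extracted policy moves right in every non-terminal state.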

By combining these ideas from reinforcement learning with the recently reinvented neural networks, a new set of algorithms emerges, dubbed DRL. One of the seed contributions in this area is value learning. In [36], a Convolutional Neural Network (CNN) was trained to play Atari with a variant of Q-learning. The CNN approximates the utility function of Q-learning, taking raw pixels as input and producing an estimation of future reward as output. The loss function for value learning is

L = E[Q_real(s, a) − Q_predicted(s, a)]

where Q_predicted(s, a) is the output of the neural network and Q_real(s, a) is the actual Q value:

Q_real(s, a) = r + γ max_{a′} Q(s′, a′)

Another approach is policy learning, where the policy is learned directly through training a neural network, without passing through value learning. Policy learning is shown to be very successful at addressing the challenges of (1) large or continuous action spaces, such as in self-driving, and (2) stochastic transition and reward models. Policy learning is based on a set of policy gradient methods with the goal of learning a probability distribution over the actions given a state, P(a|s). The training is performed through continuous running of episodes, simply increasing the probability of actions that resulted in high reward and decreasing the probability of actions that resulted in low reward. The loss function is:

L = E[−log P(a|s) R]

Table 1. Evolution of DRL for playing board games

DRL                Go  Chess  Shogi  Atari  Human Play  Domain Knowledge  Known Rules
AlphaGo [52]       ✓                        ✓           ✓                 ✓
AlphaGo Zero [54]  ✓                                                      ✓
AlphaZero [53]     ✓   ✓      ✓                                           ✓
MuZero [47]        ✓   ✓      ✓      ✓
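The loss L = E[−log P(a|s) R] can be illustrated numerically. The sketch below is a bare-bones computation for a single episode, using made-up action probabilities and returns rather than a real network:

```python
import math

# One hypothetical episode: probability the policy assigned to the action
# actually taken at each step, and the reward-to-go that followed it.
probs   = [0.8, 0.6, 0.9]    # P(a_t | s_t) for the chosen actions
returns = [1.0, 1.0, 1.0]    # return R_t observed after each step

# L = E[-log P(a|s) * R]: minimizing this loss pushes up the log-probability
# of actions followed by high reward, and pushes it down for low reward.
loss = sum(-math.log(p) * R for p, R in zip(probs, returns)) / len(probs)
```

In a real policy gradient method the gradient of this loss with respect to the network weights drives the update; here the scalar value alone shows the shape of the objective.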

3.4 Applications in Gaming


ML, RL and DRL are heavily used in gaming to develop not only competitive agents but also collaborative agents and NPCs. AlphaGo beat the top human player at Go in 2016. DeepMind introduced AlphaZero in 2017, a single system that taught itself through self-play how to master the games of chess, Shogi (Japanese chess), and Go [1]. MuZero, a general-purpose algorithm, was able to master Go, chess, Shogi and Atari without needing to be told the rules, thanks to its ability to plan winning strategies in unknown environments [47]. A summary of these algorithms as per [3] is depicted in Table 1.
Similarly, AlphaStar, a multi-agent RL system, was developed to play StarCraft II at Grandmaster level [7]. OpenAI developed Dota 2 AI agents, called OpenAI Five, and made them learn by playing over 10,000 years of games against themselves. The agents demonstrated the ability to defeat world champions in Dota 2 [5]. Using the same RL model as OpenAI Five, boosted with additional techniques, OpenAI trained a pair of neural networks to solve the Rubik's Cube with a human-like robot hand. Facebook and Carnegie Mellon built the first AI agent that beats pros in 6-player poker [2].

4 Reinforcement Learning Toolkits


4.1 Unity ML-Agents
The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents [6]. The training of agents is performed using ML techniques including reinforcement learning, imitation learning, and neuroevolution [27]. There are three main kinds of objects in a learning environment in Unity ML-Agents:
– Agent: Each agent can have a unique set of states and observations, take unique actions within the environment, and receive unique rewards for events within the environment. An agent's actions are decided by the brain it is linked to.
– Brain: Each brain defines a specific state and action space, and is responsible for deciding which actions each of its linked agents will take.
– Academy: Each environment contains a single academy, which defines the scope of the environment in terms of engine configuration, frameskip, and global episode length.
With the Unity ML-Agents toolkit, a variety of training scenarios are possible, depending on how agents, brains, and rewards are connected. Despite the lack of detailed studies on Unity ML-Agents, a few games have been implemented using Unity and its ML-Agents package, where the training has been done using reinforcement learning, including imitation learning and self-play. Figure 3 illustrates the Unity ML-Agents learning environment [26]. An AI-based agent has been implemented in the Connect 4 game using Unity ML-Agents [8]. The agent training was performed using the Proximal Policy Optimization (PPO) algorithm. Moreover, an RL model using the Hierarchical Critics (RLHC) algorithm has been implemented in Unity ML-Agents, whose performance was compared with the PPO model using two different competitive games, Soccer and Tennis [17].

Fig. 3. Unity ML-agents learning environment [26]



4.2 OpenAI
OpenAI is a research lab whose mission is to ensure that artificial general intelligence benefits all of humanity [4]. OpenAI provides various tools to support applications of RL and ML in scientific research and game design and development.

OpenAI Gym. Gym is an open-source toolkit for developing and comparing reinforcement learning algorithms [14]. The OpenAI Gym toolkit encompasses a collection of tasks, called environments, including Atari games, board games, as well as 2D and 3D physical simulations for serious games [55]. It is used to train agents by implementing and comparing various ML and RL algorithms through shared interfaces. Therefore, OpenAI Gym is mainly used for standardization and benchmarking purposes.
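The shared interface is the key idea: every Gym environment exposes `reset()` and `step(action)`. The sketch below mimics that interface with a stdlib-only stand-in (a hypothetical guessing game, used here instead of `gym.make(...)` to keep the example dependency-free); the 4-tuple returned by `step` follows the classic Gym API, which newer releases have since revised:

```python
import random

class GuessEnv:
    """Stand-in with a Gym-like interface: guess a hidden number in [0, 9]."""
    def __init__(self):
        self.target = None

    def reset(self):
        self.target = random.randrange(10)
        return 0                                   # initial observation

    def step(self, action):
        done = action == self.target
        reward = 1.0 if done else -0.1             # small penalty per wrong guess
        return action, reward, done, {}            # (obs, reward, done, info)

random.seed(0)
env = GuessEnv()
obs, done, total = env.reset(), False, 0.0
while not done:
    action = random.randrange(10)                  # placeholder agent
    obs, reward, done, info = env.step(action)
    total += reward
```

Because any algorithm only touches `reset`/`step`, the same agent code can be benchmarked across every environment that implements the interface.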

OpenAI Safety Gym. Safety Gym is a suite of environments and tools for RL agents with safety constraints enforced during training. Safety is often given little attention while training RL agents, yet in certain settings it is an important concern that must be considered. To address the safety challenges of training RL agents and to accelerate research on safe exploration, OpenAI introduced Safety Gym. It consists of two components:
– An environment builder for creating a new environment by choosing from a wide range of physics elements, goals, and safety requirements.
– A suite of pre-configured benchmark environments to choose from.
Safety Gym uses OpenAI Gym for instantiating and interfacing with the RL environments, and the MuJoCo physics simulator to construct and forward-simulate each environment [43].

OpenAI Baselines. OpenAI Baselines is a set of high-quality implementations of RL algorithms. These algorithms make it easier for the research community to replicate, refine, and identify new ideas, and create baselines to build research on top of. Such algorithms include Deep Q-Network (DQN) and its variants, Actor Critic using Kronecker-Factored Trust Region (ACKTR), Advantage Actor Critic (A2C), and Asynchronous Advantage Actor Critic (A3C) [20].

OpenAI Universe. OpenAI Universe is an extension of Gym. It provides the ability to train and evaluate agents on a wide range of environments, from simple to real-time and complex. It has unlimited access to many gaming environments. Using Universe, any program can be turned into a Gym environment without access to program internals, source code, or APIs, as Universe works by launching the program automatically behind a virtual network computing (VNC) remote desktop. With support from EA, Microsoft Studios, Valve, Wolfram, and many others, OpenAI has already secured permission for Universe AI agents to freely access games and applications such as Portal, Fable Anniversary, World of Goo, RimWorld, Slime Rancher, Shovel Knight, SpaceChem, Wing Commander III, Command & Conquer: Red Alert 2, Syndicate, Magic Carpet, Mirror's Edge, Sid Meier's Alpha Centauri, and Wolfram Mathematica.

OpenAI Gym Retro. OpenAI Gym Retro enables the conversion of classic retro games into OpenAI Gym compatible environments and has integrations for around 1,000 games. The emulators used in OpenAI Gym Retro support the Libretro API, which makes it possible to create games and support various emulators [39]. It is useful primarily as a means to train RL agents on classic video games, though it can also be used to control those video games using Python scripts.

4.3 PettingZoo

PettingZoo is a Python library for conducting research in multi-agent environments. PettingZoo is a multi-agent version of OpenAI Gym: what OpenAI Gym has done for single-agent reinforcement learning environments, PettingZoo was developed to do for multi-agent environments. PettingZoo's API, while inheriting many features of OpenAI Gym, is unique amongst Multi-Agent Reinforcement Learning (MARL) APIs. PettingZoo models environments as Agent Environment Cycle (AEC) games, in order to cleanly support all types of multi-agent RL environments under one API and to minimize the potential for certain classes of common bugs. PettingZoo includes 63 default environments [58].
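The AEC model iterates over one agent at a time rather than stepping all agents at once. The sketch below imitates that cycle with a hypothetical two-player turn game written from scratch (it is not an actual PettingZoo environment, whose real API revolves around `env.agent_iter()` and `env.last()`):

```python
class TurnGame:
    """Toy AEC-style environment: two agents alternate; first to 3 points wins."""
    def __init__(self):
        self.agents = ["player_0", "player_1"]
        self.scores = {a: 0 for a in self.agents}
        self.turn = 0

    def agent_iter(self):
        # Yield one agent per cycle step until some agent reaches 3 points.
        while max(self.scores.values()) < 3:
            yield self.agents[self.turn % 2]
            self.turn += 1

    def step(self, agent, action):
        reward = 1 if action == "score" else 0     # toy dynamics
        self.scores[agent] += reward
        return reward

env = TurnGame()
for agent in env.agent_iter():                      # exactly one agent acts per step
    action = "score" if agent == "player_0" else "pass"
    env.step(agent, action)
```

Cycling agents one at a time is what lets a single API cover turn-based, simultaneous, and mixed games without the race-condition bugs of stepping everyone at once.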

4.4 Google Dopamine

Dopamine is a TensorFlow-based research framework for fast prototyping of reinforcement learning algorithms. Dopamine supports multiple agents such as DQN and SAC, implemented using JAX, a Python library for high-performance ML research. Dopamine supports the Atari environments and OpenAI's MuJoCo environments [18].

5 Evaluation Methodology
In this study, we propose a qualitative evaluation methodology that uses a set of eleven specific technical criteria (see the following subsections). Each candidate ML/RL toolkit introduced in Sect. 4 is evaluated based on the following qualitative data collection techniques: (1) interviews with game design and development experts; (2) technical experimentation and observations; and (3) documentation, including scientific publications and technical reports. Moreover, the outcomes of the common ML, RL, and DRL algorithms implemented in the identified toolkits are summarized in Table 2.

Table 2. Common ML/RL/DRL algorithms implemented in RL toolkits

Algorithms Category Unity ML-Agents OpenAI Petting Zoo Dopamine


A2C [35] policy gradient, on-policy
ACER [60] policy gradient, off-policy
ACKTR [61] policy gradient, on-policy
DDPG [29] policy gradient, off-policy
DQN [37] value-based, off-policy
GAIL [25] policy gradient, off-policy
PPO [49] policy gradient, on-policy
SAC [23] policy gradient, off-policy
C51 [10] value-based, off-policy
Rainbow [24] value-based, off-policy
IQN [19] value-based, off-policy
D4PG [9] policy gradient, on-policy UKN
PGQ [41] policy gradient, off-policy
TRPO [48] policy gradient, on-policy

5.1 Portability

Portability in ML/RL toolkits is the usability of the same toolkit in different environments. The prerequisite for portability is a generalized abstraction between the toolkit logic and its interfaces. When an ML/RL toolkit with the same functionality is developed for several environments, portability is the key issue for development cost reduction.

5.2 Interoperability

Interoperability refers to the capability of different ML/RL toolkits to communicate with one another and with game engines freely and easily. Toolkits that are interoperable exchange information in real-time, without the need for specialized or behind-the-scenes coding.

5.3 Performance

The training speed of agents in an ML/RL toolkit depends on the complexity and analysis of the algorithm used to train those agents. Booth et al. provide a comparison study of different algorithms, including PPO in ML-Agents and the A2C, ACKTR, and PPO2 algorithms of OpenAI Baselines [12].

5.4 Multitask Learning

Multi-task learning is an ML/RL approach in which we try to learn multiple tasks simultaneously, optimizing multiple loss functions at once. Rather than training independent models for each task, we allow a single model to learn to complete all of the tasks at once. In this process, the model uses all of the available data across the different tasks to learn generalized representations of the data that are useful in multiple contexts.

5.5 Multi-agent Environments


An environment might contain a single agent or multiple agents. In the case of multiple agents, each agent might have a different set of actions to perform, and the agents might need to interact with each other as the training goes on [40]. This requires a different training methodology from training a single agent (see Fig. 4).

Fig. 4. Multi-agent model [40]

5.6 Usability

Usability is a measure of how well a specific user in a specific context can use an ML toolkit to design and develop games effectively, efficiently, and satisfactorily. Game designers usually measure a toolkit design's usability throughout the development process, from wireframes to the final deliverable, to ensure maximum usability.

5.7 Documentation and Support

ML/RL toolkit documentation is written text or illustration that accompanies toolkits or is embedded in the source code. The documentation either explains how the toolkit operates or how to use it. Documentation is an important part of game design and development when using ML/RL toolkits. Types of documentation include: (1) Requirements: statements that identify attributes, capabilities, characteristics, or qualities of a toolkit; (2) Architecture/Design: an overview of the toolkit design, including relations to an environment and construction principles to be used; (3) Technical: documentation of code, algorithms, interfaces, and APIs; (4) End user: manuals for the end user, administrators, and support staff; and (5) Marketing: how to market the product and analysis of the market demand.

5.8 Learning Strategies

The learning strategies are the different techniques ML/RL toolkits and frameworks use to train agents in game design and development. These strategies are translated through machine learning algorithms including:
– Naïve Bayes Classifier Algorithm (Supervised Learning - Classification) is based on Bayes' theorem and classifies every value as independent of any other value. It allows one to predict a class/category, based on a given set of features, using probability.
– K Means Clustering Algorithm (Unsupervised Learning - Clustering) is a type
of unsupervised learning, which is used to categorise unlabelled data, i.e. data
without defined categories or groups. The algorithm works by finding groups
within the data, with the number of groups represented by the variable K. It
then works iteratively to assign each data point to one of K groups based on
the features provided.
– Support Vector Machine Algorithm (Supervised Learning - Classification) analyses data used for classification and regression analysis. It essentially filters data into categories, which is achieved by providing a set of training examples, each marked as belonging to one or the other of two categories. The algorithm then works to build a model that assigns new values to one category or the other.
– Linear Regression (Supervised Learning/Regression) is the most basic type of
regression. Simple linear regression allows us to understand the relationships
between two continuous variables.
– Logistic Regression (Supervised learning - Classification) focuses on estimat-
ing the probability of an event occurring based on the previous data provided.
It is used to cover a binary dependent variable, that is where only two values,
0 and 1, represent outcomes.
– Artificial Neural Networks (Reinforcement Learning) comprises ‘units’
arranged in a series of layers, each of which connects to layers on either side.
ANNs are essentially a large number of interconnected processing elements,
working in unison to solve specific problems.
– Random Forests (Supervised Learning - Classification/Regression) is an
ensemble learning method, combining multiple algorithms to generate better
results for classification, regression and other tasks. Each individual classifier
is weak, but when combined with others, can produce excellent results. The
algorithm starts with a ‘decision tree’ (a tree-like graph or model of decisions)
and an input is entered at the top. It then travels down the tree, with data
being segmented into smaller and smaller sets, based on specific variables.

– Nearest Neighbours (Supervised Learning): the K-Nearest-Neighbour algorithm estimates how likely a data point is to be a member of one group or another. It essentially looks at the data points around a single data point to determine what group it is actually in.
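Several of these strategies take only a few lines of code. As an illustration, here is a bare k-nearest-neighbour classifier over 2-D points, using toy made-up data and the standard library only:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dist = lambda p, q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy training set: two well-separated clusters labeled "A" and "B".
train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
         ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
```

A query near the origin is voted into class "A" by its three nearest neighbours; a query near (5, 5) lands in "B".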

5.9 Reward Strategy


Reward functions describe how the agent “ought” to behave. They are an incentive mechanism that tells the agent what is correct and what is wrong using reward and punishment. The goal of agents in RL is to maximize the total reward; sometimes immediate rewards must be sacrificed in order to maximize the total reward. The reward strategy depends on the parameters a game developer sets up during the creation of a game environment.

5.10 Precision and Recall


Precision is one indicator of a machine learning model's performance: the quality of a positive prediction made by the model. Precision refers to the number of true positives divided by the total number of positive predictions (i.e., the number of true positives plus the number of false positives). It helps us measure the model's ability to classify positive samples. Precision and recall are two important model evaluation metrics. While precision refers to the percentage of relevant results, recall refers to the percentage of total relevant results correctly classified by the ML/RL algorithm. Recall helps measure how many positive samples were correctly classified by the model.
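Both metrics reduce to counts of true and false positives and negatives. A minimal sketch with illustrative labels and predictions:

```python
def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)

# Toy labels: 4 positive samples; the model finds 3 of them plus 1 false alarm.
p, r = precision_recall([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0])
```

Here both precision and recall come out to 3/4: three true positives against one false positive and one false negative.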

5.11 Visual Observations


Visual observation extends ML/RL toolkits to allow both novice and expert game developers to quickly and easily build and deploy highly accurate and explainable ML/RL models for agents in games using image-based data. Observation gathers data through visual or technological means. Visual observation is 'direct', allowing game developers to witness the agents' behaviours firsthand in their environment.

6 Evaluation Analysis
Table 3 illustrates the outcomes of the proposed qualitative evaluation analysis with respect to the technical criteria detailed in Sect. 5. It is important to note that OpenAI is an open-source platform and Unity is a commercial platform. Nevertheless, Unity offers its ML-Agents as an open-source toolkit. With respect to the proposed set of eleven technical criteria, it is obvious that the Unity ML-Agents toolkit provides full support for most of these criteria, with some limitations with regard to multitask learning and learning strategies. On the other hand, OpenAI (including its various tools), PettingZoo, and Google Dopamine suffer from a critical lack of visual observation support. Moreover, OpenAI and its tools fail to fully support multi-agent environments.

Table 3. Overview of the evaluation of reinforcement learning toolkits

                      Unity ML-Agt  OpenAI                               Petting Zoo  Dopamine
                                    Gym  Safety Gym  Univ  Gym Retro
Portability
Interoperability
Performance
Multitask Lear.
Multi-Agent Env.
Doc. & Support
Learning Strategies
Reward Strategy
Visual Observations
(Fully supported) (Partially supported) (Not supported)
The underlying software architectures of OpenAI Gym and Unity ML-Agents are very similar, and both provide comparable functionalities to game developers. In the scientific community, OpenAI enjoys larger popularity than Unity ML-Agents, as it was developed with the intent of developing, analysing, and comparing reinforcement learning algorithms, whereas Unity's main purpose is to develop and produce enterprise-level games. OpenAI Gym and Unity ML-Agents have been used widely for the implementation of RL algorithms in recent years. OpenAI Gym does not restrict itself to gaming, and has been used in various fields such as telecommunications, optical networks, and other engineering domains. Because of its wide range of options, OpenAI Gym has been used more widely than Unity ML-Agents to perform research and establish ML/RL model benchmarking results.
The training of game agents can be performed both in Gym and in Unity. However, Gym only supports reinforcement learning for training agents, whereas with ML-Agents it is possible to train games using reinforcement learning, imitation learning, and curriculum learning. A comparison between Unity ML-Agents' PPO and OpenAI Baselines' PPO2 showed that the latter scored 50% higher while training 14% slower. The Actor Critic using Kronecker-Factored Trust Region (ACKTR) algorithm and the Advantage Actor Critic (A2C) algorithm of OpenAI Baselines trained 33% faster than Unity ML-Agents [12].
Unity has a rich visual platform which is very helpful for building environments even with little programming experience. It has components designed for each asset that can be easily configured. On the other hand, OpenAI Gym is compatible with TensorFlow and provides rich graphs. To train more robust agents that react in real time to dynamic variations of the environment, such as changes to objects' attributes, Unity provides random sampling of environment parameters during training (also called Environment Parameter Randomization). This technique is based on Domain Randomization, which enables training agents with randomized rendering. Unity ML-Agents also allows the use of multiple cameras for visual observation. This enables agents to learn and integrate information from multiple visual streams.

7 Discussion

Unity ML-Agents offers a rich visual interface to create environments and place assets. Consequently, it offers more usability for game developers. Moreover, the abundant technical and functional documentation, case studies, tutorials, and technical support increase the popularity of this platform among the game design and development community. The OpenAI Gym platform allows users to compare the performance of their ML/RL algorithms. In fact, the aim of the OpenAI Gym scoreboards is not to design and develop games, but rather to foster collaboration in the scientific community by sharing projects and enabling meaningful ML/RL algorithm benchmarks [16].
On the one hand, the Unity ML-Agents toolkit allows multiple cameras to be used for observations per agent. This enables agents to learn to integrate information from multiple visual streams. This technique leverages CNNs to learn from the input images. The image information from the visual observations provided by the CameraSensor is transformed into a 3D tensor which can be fed into the CNN of the agent policy. This allows agents to learn from spatial regularities and terrain topology in the observation images. In addition, it is possible to use visual and vector observations with the same agent in Unity ML-Agents. This powerful feature provides access to vector observations such as raycasting, real-time visualization, and parallelization. Such a feature is designed with the intent of rapid AI agent implementation in video games, not for scientific research. This hinders its application to more realistic, complex, real-world use cases and serious games.
OpenAI Gym lacks the ability to configure the simulation for multiple agents.
In contrast, Unity ML-Agents supports dynamic multi-agent interaction where
agents can be trained using RL models through a straightforward Python API. It
also provides MA-POCA (MultiAgent POsthumous Credit Assignment), which
is a novel multi-agent trainer.
PettingZoo provides a multi-agent policy gradient algorithm where agents learn a centralized critic based on the observations and actions of all agents. However, it suffers from a performance limitation when dealing with large-scale multi-agent environments. In fact, the input space of Q grows linearly with the number of agents N [30].
Finally, Unity ML-Agents does not support multitask learning. However, it offers multiple interacting agents with independent reward signals sharing common Behavior Parameters. This technique offers game developers the ability to mimic multitask learning by implementing a single agent model and encoding multiple behaviors using HyperNetworks.

8 Conclusion and Future Work


In this paper, we provided an overview of the main ML and RL toolkits for game design and development. OpenAI and its rich suite of tools provide a solid option for AI-based agent implementation and training, with respect to a large panel of supported RL algorithms. Yet, Unity ML-Agents remains a recommended toolkit for rapid AI-based game development using limited yet pre-trained RL models. The proposed qualitative evaluation methodology used a set of specific technical criteria. Each candidate toolkit has been evaluated based on qualitative data collection techniques including interviews, observations, and documentation. Qualitative methodologies provide contextual data to explain complex issues by explaining the “why” and “how” behind the “what”. However, the limitations of such a methodology include the lack of generalizability and the time-consuming and costly nature of data collection, in addition to the difficulty and complexity of objective data analysis and interpretation.
To address the limitations of our qualitative evaluation approach, our future work will focus on empirical and quantitative evaluations to verify, validate, and confirm our qualitative findings. A mixed-method design with both qualitative and quantitative data will involve statistical assessments of existing RL toolkits to measure complexity, CPU and memory usage, scalability, and other relevant software quality attributes.

References
1. AlphaZero: shedding new light on chess, shogi, and go. https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go
2. Facebook, Carnegie Mellon build first AI that beats pros in 6-player poker. https://ai.facebook.com/blog/pluribus-first-ai-to-beat-pros-in-6-player-poker/
3. MIT 6.S191: Introduction to deep learning. https://introtodeeplearning.com/
4. OpenAI
5. OpenAI Five defeats Dota 2 world champions. https://openai.com/blog/openai-five-defeats-dota-2-world-champions/
6. Unity machine learning agents
7. Arulkumaran, K., Cully, A., Togelius, J.: AlphaStar: an evolutionary computation perspective. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 314–315 (2019)
8. Baby, N., Goswami, B.: Implementing artificial intelligence agent within connect 4
using unity3D and machine learning concepts. Int. J. Recent Technol. Eng. 7(6S3),
193–200 (2019)
9. Barth-Maron G., et al.: Distributed distributional deterministic policy gradients.
arXiv preprint arXiv:1804.08617, 2018
10. Bellemare, M. G., Dabney, W., Munos, R.: A distributional perspective on rein-
forcement learning. In: International Conference on Machine Learning, pp. 449–
458. PMLR (2017)
182 C. S. Jayaramireddy et al.

11. Bertens, P., Guitart, A., Chen, P. P., Periáñez, Á.: A machine-learning item recom-
mendation system for video games. In: 2018 IEEE Conference on Computational
Intelligence and Games (CIG), pp. 1–4. IEEE (2018)
12. Booth J., Booth, J.: Marathon environments: multi-agent continuous control
benchmarks in a modern video game engine. arXiv preprint arXiv:1902.09097
(2019)
13. Bornemark, O.: Success factors for e-sport games. In: Umeå’s 16th Student Con-
ference in Computing Science, pp. 1–12 (2013)
14. Borovikov, I., Harder, J., Sadovsky, M., Beirami, A.: Towards interactive training
of non-player characters in video games. arXiv preprint arXiv:1906.00535 (2019)
15. Borowy, M., et al.: Pioneering eSport: the experience economy and the marketing
of early 1980s arcade gaming contests. Int. J. Commun. 7, 21 (2013)
16. Brockman, G., et al.:Openai gym. arXiv preprint arXiv:1606.01540 (2016)
17. Cao, Z., Lin, C. -T.: Reinforcement learning from hierarchical critics. IEEE Trans.
Neural Netw. Learn. Syst. (2021)
18. Castro, P. S., Moitra, S., Gelada, C., Kumar, S., Bellemare, M. G.: A Research
framework for deep reinforcement learning, dopamine (2018)
19. Dabney, W., Ostrovski, G., Silver, D., Munos, R.: Implicit quantile networks for dis-
tributional reinforcement learning. In: International conference on machine learn-
ing, pages 1096–1105. PMLR (2018)
20. Dhariwal, P., et al.: OpenAI Baselines, Szymon Sidor (2022)
21. Frank, A. B.: Gaming AI without AI. J. Defense Mod. Simul., p.
15485129221074352 (2022)
22. Moreno, S. E. G., Montalvo, J. A. C., Palma-Ruiz, J. M.: La industria cultural
y la industria de los videojuegos. JUEGOS Y SOCIEDAD: DESDE LA INTER-
ACCIÓN A LA INMERSIÓN PARA EL CAMBIO SOCIAL, pp. 19–26 (2019)
23. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maxi-
mum entropy deep reinforcement learning with a stochastic actor. In: International
conference on machine learning, pp. 1861–1870. PMLR (2018)
24. Hessel, M., et al.: Rainbow: combining improvements in deep reinforcement learn-
ing. In: Thirty-second AAAI conference on artificial intelligence (2018)
25. Ho, J., Ermon, S.: Generative adversarial imitation learning. Adv. Neural Info.
Proc. Syst. 29 (2016)
26. Juliani, A., et al.: Unity: a general platform for intelligent agents. arXiv preprint
arXiv:1809.02627 (2018)
27. Lanham, M.: Learn Unity ML-Agents-Fundamentals of Unity Machine Learning:
Incorporate New Powerful ML Algorithms Such as Deep Reinforcement Learning
for Games. Packt Publishing Ltd., Birmingham (2018)
28. Li, R.: Good Luck Have Fun: The Rise of eSports. Simon and Schuster, New York
(2017)
29. Lillicrap, T. P.: Continuous control with deep reinforcement learning. arXiv
preprint arXiv:1509.02971 (2015)
30. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., Mordatch, I.: Multi-
agent actor-critic for mixed cooperative-competitive environments. arXiv preprint
arXiv:1706.02275 (2017)
31. Lyle, D., et al.: Chess and strategy in the age of artificial intelligence. In: Lai, D.
(eds) US-China Strategic Relations and Competitive Sports, pages 87–126. Pal-
grave Macmillan, Cham (2022). https://doi.org/10.1007/978-3-030-92200-9 5
32. Mekni, M.: An artificial intelligence based virtual assistant using conversational
agents. J. Softw. Eng. Appl. 14(9), 455–473 (2021)
A Survey of RLTs for Gaming 183

33. Mekni, M., Jayan, A.: Automated modular invertebrate research environment using
software embedded systems. In: Proceedings of the 2nd International Conference
on Software Engineering and Information Management, pp. 85–90 (2019)
34. Mitchell, T. M., et al.: Machine learning (1997)
35. Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: Inter-
national Conference on Machine Learning, pp. 1928–1937. PMLR (2016)
36. Mnih, V., et al.: Playing atari with deep reinforcement learning. arXiv preprint
arXiv:1312.5602 (2013)
37. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature
518(7540), 529–533 (2015)
38. Newzoo: Global games market report (2021)
39. Nichol, A., Pfau, V., Hesse, C., Klimov, O., Schulman J.: Gotta learn fast: a new
benchmark for generalization in RL. arXiv preprint arXiv:1804.03720 (2018)
40. Nowé, A., Vrancx, P., De Hauwere, Y. M.: Game theory and multi-agent reinforce-
ment learning. In: Wiering, M., van Otterlo, M. (eds) Reinforcement Learning, pp.
441–470. Springer, Berlin (2012). https://doi.org/10.1007/978-3-642-27645-3 14
41. O’Donoghue, B., Munos, R., Kavukcuoglu, K., Mnih, V.: Combining policy gradi-
ent and Q-learning. arXiv preprint arXiv:1611.01626 (2016)
42. Palma-Ruiz, J. M., Torres-Toukoumidis, A., González-Moreno, S. E., Valles-Baca,
H. G.: An overview of the gaming industry across nations: using analytics with
power bi to forecast and identify key influencers, p. e08959. Heliyon (2022)
43. Ray, A., Achiam, J., Amodei, D.: Benchmarking safe exploration in deep reinforce-
ment learning, p. 7. arXiv preprint arXiv:1910.01708 (2019)
44. Saiz-Alvarez, J.M., Palma-Ruiz, J.M., Valles-Baca, H.G., Fierro-Ramı́rez, L.A.:
Knowledge management in the esports industry: sustainability, continuity, and
achievement of competitive results. Sustainability 13(19), 10890 (2021)
45. Samara, F., Ondieki, S., Hossain, A. M., Mekni, M.: Online social network inter-
actions (OSNI): a novel online reputation management solution. In: 2021 Interna-
tional Conference on Engineering and Emerging Technologies (ICEET), pp. 1–6.
IEEE (2021)
46. Scholz, T. M., Scholz, T. M., Barlow: eSports is Business. Springer (2019)
47. Schrittwieser, J., et al.: Mastering atari, go, chess and shogi by planning with a
learned model. Nature 588(7839), 604–609 (2020)
48. Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.:. Trust region policy
optimization. In: International Conference on Machine Learning, pp. 1889–1897.
PMLR (2015)
49. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov O.: Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
50. Shabbir, J., Anwer, T.: Artificial intelligence and its role in near future (2018)
51. Shao, K., Tang, Z., Zhu, Y., Li, N., Zhao, D.: A survey of deep reinforcement
learning in video games. arXiv preprint arXiv:1912.10944 (2019)
52. Silver, D., et al.: Mastering the game of go with deep neural networks and tree
search. Nature 529(7587), 484–489 (2016)
53. Silver, D., et al.: A general reinforcement learning algorithm that masters chess,
shogi, and go through self-play. Science 362(6419), 1140–1144 (2018)
54. Silver, D., et al.: Mastering the game of go without human knowledge. Nature
550(7676), 354–359 (2017)
55. Silver, T., Chitnis, R.:. PDDLGym: Gym environments from PDDL problems.
arXiv preprint arXiv:2002.06432 (2020)
56. Sweetser, P., Wiles, J.: Current AI in games: a review. Australian J. Intell. Info.
Proc. Syst. 8(1), 24–42 (2002)
184 C. S. Jayaramireddy et al.

57. Tazouti, Y., Boulaknadel, S., Fakhri, Y.: Design and implementation of ImA-
LeG serious game: behavior of non-playable characters (NPC). In: Saeed, F., Al-
Hadhrami, T., Mohammed, E., Al-Sarem, M. (eds.) Advances on Smart and Soft
Computing. AISC, vol. 1399, pp. 69–77. Springer, Singapore (2022). https://doi.
org/10.1007/978-981-16-5559-3 7
58. Terry, J., et al. Pettingzoo: Gym for multi-agent reinforcement learning. Adv. Neu-
ral Inf. Proc. Syst. 34 (2021)
59. Tucker, A., Gleave, A., Russell, S.: Inverse reinforcement learning for video games.
arXiv preprint arXiv:1810.10593 (2018)
60. Wang, Z., et al.: Sample efficient actor-critic with experience replay. arXiv preprint
arXiv:1611.01224 (2016)
61. Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., Ba, J.: Scalable trust-region method
for deep reinforcement learning using Kronecker-factored approximation. Adv.
Neural Inf. Proc. Syst. 30 (2017)
62. Yannakakis, G. N.: Game AI revisited. In: Proceedings of the 9th Conference on
Computing Frontiers, pp. 285–292 (2012)
63. Yannakakis, G.N., Togelius, J.: A panorama of artificial and computational intel-
ligence in games. IEEE Trans. Comput. Intell. AI in Games 7(4), 317–335 (2014)
64. Yohanes, D.N., Rochmawati, N.: Implementasi algoritma collision detection dan
a*(a star) pada non player character game world of new normal. J. Inf. Comput.
Sci. (JINACS) 3(03), 322–333 (2022)
Pre-trained CNN Based SVM Classifier for Weld
Joint Type Recognition

Satish Sonwane(B), Shital Chiddarwar, M. R. Rahul, and Mohsin Dalvi

Visvesvaraya National Institute of Technology, S Ambazari Rd, Ambazari, Nagpur, Maharashtra 440010, India
[email protected], [email protected]

Abstract. Manual real-time recognition and classification of weld joints from welding images is tedious, takes skill, and might be biased. Also, because most welding robot applications follow the teach-and-playback approach, they must be reconfigured each time they take on a new task. This takes time, and welding settings must be re-tuned for each new weld job. Hence, this study addresses these concerns by proposing an alternative way of automatically recognizing weld joint types. This paper suggests an effective way to classify the weld joint type using a feature extraction technique. This research aims to create a fusion model that uses Convolutional Neural Networks (CNN) and Support Vector Machines (SVM) to recognize welding joints in a dataset. The suggested hybrid model incorporates the essential characteristics of both the CNN and the SVM classifier: in this fusion model, the CNN is an automated feature extractor, while the SVM serves as the classifier. The model is trained and tested using the Kaggle weld joint dataset (for Butt and Tee joints) and an in-house dataset (for Vee and Lap weld joints). The collection comprises a variety of weld joint photos captured from various perspectives. The CNN's receptive field aids the automated extraction of the most distinguishing aspects of these images. The experimental findings show that the suggested framework is successful, with a recognition accuracy of 99.7% over the mentioned dataset, determined using the k-fold cross-validation method, where k = 10.

Keywords: Welding joint type recognition · SVM classifier · Feature extraction · Convolutional neural networks

1 Introduction
Since its inception, welding using robots has always been an essential part of Advanced
Manufacturing techniques [1]. Robot-assisted welding holds the key to making the oper-
ation more precise, cutting turnaround time, increasing worker safety, and lowering
resource waste [2]. Welders are currently required to monitor and supervise the opera-
tion while equipment executes welding. They must adapt themselves to the procedure
in specific ways based on the geometry of the parts to be connected. Four labels classify
practical methods of parts connection: Butt, Vee, Lap, or Tee joint [3]. Determining the
form of the weld joint is necessary before extracting weld joint features and directing

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 185–197, 2023.
https://doi.org/10.1007/978-3-031-18461-1_12
186 S. Sonwane et al.

the robot to follow it without human involvement [4]. Furthermore, the algorithms used
to determine location data for different weld joints vary; also, the relevant parameters
guiding the process differ depending upon the type of weld joint [5]. Physically updating the criteria and system elements according to the weld joint just before each welding session is highly inefficient. As a result, the system's productivity and ease of use
will be significantly enhanced if it can recognize the weld joint autonomously before
beginning the welding.
Welders still oversee and monitor robotic welding, a more advanced variant of mechanized welding in which machines execute the weld. In some cases, robots find it difficult, if not impossible, to adapt to complex circumstances as precisely and quickly as humans. Automated robotic welding strives to solve this issue and is now an important
field of research in industrial automation [6]. In the area of Weld joint type identification,
a few methods have been suggested, such as SVM [7, 8], deep learning classifiers [9],
Ensemble learning [10], Local Thresholding [11] and Hausdorff distance [12]. However,
even though such techniques have exhibited satisfactory detection performance, weld
joint identification remains an ongoing research subject that requires the development
of novel strategies and processes for enhancing recognition performance, run-time, and
computing cost.
Few researchers have presented conforming single weld joint recognition algorithms
in recent years. Shah et al. [11] studied the detection and identification of butt weld joints
using image segmentation and local thresholding. Fan et al. [13] researched the recog-
nition of narrow butt weld joints using a structured light vision sensor. Zou et al. [14]
studied feature recognition of lap joints using the TLD object tracking algorithm and
CCOT algorithm. In later work, Chen et al. [15] considered image-based recognition of fillet joints, employing a weld-likelihood calculation with a custom convolutional kernel. Because each of these methods was scoped to a single weld joint type, each can identify only that type. However, several weld joint types may occur in practical welding situations; as a result, recognizing different weld joints is critical in a real welding scenario.
Recently, some researchers investigated multiple kinds of weld joint identification.
Xiuping et al. [3] recognized four types of weld joints using features extracted from laser structured light vision and a Probabilistic Neural Network (PNN). The authors pre-processed images using reduction, Wavelet Transform (WT), and binarization.
positional element connection of the extremities and the junctions of weld joints was
used for weld joint categorization [9]. This method works; however, it has difficulty dis-
criminating amongst different weld joints with similar elemental features. Wang et al.
[10] used laser stripes features and BP-Adaboost and KNN-Adaboost to recognize six
types of weld joints. The method suffers from the fact that it requires costly hardware and
is computationally and mathematically intensive. Li et al. [12] used the Hausdorff distance to measure how closely laser stripes match a standard template. This technique, however, suffers from high computational cost and is inflexible.
Academic work has been done on extracting handcrafted characteristics. However, manually extracting characteristics is resource-hungry and needs expert knowledge. Furthermore, the handcrafted feature extraction technique involves a compromise between economy and precision, since computing unimportant features may raise computing
Pre-trained CNN Based SVM Classifier for Weld Joint 187

expenses, leading to poor identification performance of the system. Non-handcrafted feature extraction instead extracts features directly from raw images, removing the requirement for prior domain knowledge.
The idea underlying transfer learning for classification tasks is that a network trained on an extensive and diverse dataset can function as a generic baseline model of the visual world. Therefore, one can reuse its learnt feature maps instead of training a large model from scratch on a huge dataset. Considering this, we have
developed a pre-trained CNN-SVM classifier system for Weld joint identification. The
research aims to use CNN to extract features from input weld joint photos of a dataset.
These learnt characteristics are subsequently submitted to the SVM classifier for the
suggested weld joint type recognition experiment. The main benefit of adopting the CNN
model is that it takes advantage of the spatial information in the input and is resistant
to simple transformations such as rotation and translation [16]. Multi-layer perceptron models, on the other hand, do not consider the topological information in the input and are therefore unsuitable for complicated situations.
The current effort is driven by the SVM classifier's strong performance when presented with image features for classification [17]. Fan et al. [8] established a framework for designing an SVM classifier by generating a subset of features from the weld joint ends; it surpasses other methods in recognition accuracy and computational cost but is tedious to apply. Zeng et al. [7] extracted two types of features from feature maps to enhance the identifiable details and used SVM to create a weld joint type identification system that reduces processing costs while increasing recognition accuracy; five-fold cross-validation is performed to discover the ideal SVM model parameters. The advantage of the proposed method over those of [8] and [7] is that the model, instead of trying to encode features directly, learns them from examples of inputs and desired outputs. This avoids considerable mathematical complexity and the expensive imaging and scanning equipment otherwise needed to acquire images from which features could be extracted. Table 1 summarizes the approaches used by various authors and the limitations of each approach.
This paper's key contributions are as follows:

1. A weld joint type recognition approach is proposed to enhance the robotic welding automation level, based on features extracted from the ResNet18 'pool5' layer.
2. An SVM-based weld joint type classification model is built to detect different types
of weld joints.

2 Proposed Methodology
2.1 Pre-Trained CNN
For the categorization of weld joint type datasets, a pre-trained CNN based SVM classifier is developed. The presented hybrid model incorporates the essential characteristics of both the CNN and the classifier. Figure 1 shows the approach considered in this paper. A CNN comprises numerous fully connected layers and uses supervised learning. Like the human visual system, a CNN can learn invariant local characteristics quite
effectively. Therefore, it can extract the most distinguishing data from raw weld joint
photos. Convolutional neural networks are deep learning network architectures that learn

Table 1. Related research and limitations

Reference | Approach | Limitations
[3] | WT and PNN | Limited capability with T joint and corner joint
[8] | Laser sensor assisted single feature extraction and SVM | Tedious application, requirement of costly hardware
[7] | Laser sensor assisted two feature extraction and SVM | Tedious application, requirement of costly hardware, mathematically intensive
[9] | Vision sensor with silhouette mapping | Difficulty discriminating amongst different weld joints with equal elemental features
[10] | Structured light vision and ensemble learning | Computationally and mathematically intensive
[11] | Image segmentation and local thresholding | Single weld joint detection
[12] | Hausdorff distance for measuring the match of laser stripes from the standard template | Inflexible
[13] | Laser light and fuzzy PID controller | Single weld joint detection
[14] | TLD object tracking algorithm and CCOT algorithm | Single weld joint detection
[15] | Weld likelihood calculation, preselection and reexamination | Single weld joint detection

directly from data, removing manual feature extraction requirements. A kernel/filter is employed in the proposed system to extract the most recognizable characteristics from the raw input photos. Each layer's output is fed to the input of the following layer [18].
Convolutional layers are the foundational components of CNNs. A convolution is simply the application of a filter, which yields an output called an activation. Repeatedly applying the same filter across an input produces a feature map (see Fig. 2), which displays the locations and intensities of a detected feature in the input, such as an image. Convolutional neural networks are distinguished by rapidly extracting, in parallel, a large set of features specific to a training dataset, within the constraints of a particular predictive modelling objective such as image classification [19]. Hence, very particular characteristics can be spotted anywhere in the input images.
The convolutional layer comprises a filter, for example 7 × 7 × 3, which "slides" over the horizontal and vertical directions of an input picture, calculating the dot product of the input image region and the learned weight parameters. This results in a 2D activation map composed of the filter's responses at each spatial location. Consequently, the pooling layer shrinks the size of the input pictures based on the results of a convolution

Fig. 1. Procedure of the CNN-SVM modelling method

filter. This reduces the model's number of parameters, a process known as down-sampling. Finally, an activation function introduces nonlinearities into the model [17].
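The sliding-filter computation described above can be made concrete with a toy example (pure Python, illustrative only; the filter values are made up): a small kernel is slid over a 2D input, and the dot product at each position fills the activation map:

```python
def conv2d(image, kernel):
    """Valid-mode 2D convolution: slide `kernel` across `image` and take
    the dot product at each position, producing an activation map."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out

# A 2x2 filter that responds to horizontal intensity changes.
image = [[0, 0, 1, 1] for _ in range(4)]
kernel = [[1, -1], [1, -1]]
feature_map = conv2d(image, kernel)  # 3x3 map; strongest response at the edge
```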

Fig. 2. Sliding of convolution filter to obtain image features

2.2 ReLU
The ReLU function is the most widely used activation function today. ReLU is preferred over tanh and sigmoid because it accelerates stochastic gradient descent
convergence compared to the other two functions [20]. Furthermore, unlike tanh and sigmoid, which need substantial calculation, ReLU is computed by simply setting negative matrix values to zero. Figure 3 shows the activation of the ReLU function as an example.
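As a minimal illustration (not code from the paper), elementwise ReLU simply clamps negative entries to zero:

```python
def relu(matrix):
    """Elementwise ReLU: f(x) = max(0, x); negative values become zero."""
    return [[max(0, x) for x in row] for row in matrix]

activations = relu([[-2, 0.5], [3, -0.1]])  # [[0, 0.5], [3, 0]]
```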

Fig. 3. ReLU is a piecewise-linear function that outputs negative values as zero.

2.3 ResNet18

The pre-trained CNN architecture used in this study is ResNet18 [21]. ResNet18 comprises 72 layers in total, of which 18 are deep learnable layers. This network's architecture was designed to operate a high number of convolutional layers efficiently. The main principle behind ResNet is the use of skip links, also known as bypass connections [22]. These connections jump across several layers, establishing shortcuts between them. Creating these bypass connections solved the vanishing gradient problem that deep networks commonly face: by reusing the activations from a prior layer, the bypass keeps gradients flowing. The link-skipping technique effectively shortens the network, allowing it to learn quicker. Because of its complicated, layered design, in which layers receive input from and send output to multiple other layers, the network is classified as a Directed Acyclic Graph (DAG) network.
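The skip connection can be summarized as output = ReLU(F(x) + x), where F is the stack of layers being bypassed. A schematic sketch follows (the `double` transform is a stand-in; real residual blocks use convolutions and batch normalization):

```python
def residual_block(x, transform):
    """Add the block's input to its transformed output, so the identity
    path lets gradients bypass the transform during backpropagation."""
    fx = transform(x)
    return [max(0.0, f + xi) for f, xi in zip(fx, x)]  # ReLU(F(x) + x)

double = lambda v: [2.0 * e for e in v]  # stand-in for the skipped layers
out = residual_block([1.0, -3.0], double)  # ReLU([3.0, -9.0]) -> [3.0, 0.0]
```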

2.4 SVM

Vapnik [23] developed the binary classifier SVM. Its goal is to discover the best-fitting hyperplane f(w, x) = w · x + b that distinguishes the two categories in a target dataset with features x ∈ R^m. The parameters are learned by solving Eq. (1):

min (1/p) z^T z + P Σ_{i=1}^{p} max(0, 1 − z_i(w^T x_i + b))    (1)

where
z^T z is the Manhattan norm (L1 norm) term,
P is the penalty parameter,
z_i is the actual label, and
w^T x + b is the predictor function.

The L2-SVM (Eq. 2) produces more stable results:

min (1/p) ||z||_2^2 + P Σ_{i=1}^{p} max(0, 1 − z_i(w^T x_i + b))^2    (2)

where ||z||_2 is the Euclidean norm (L2 norm), and the hinge loss is squared.
Each data item in the SVM is represented as a point in an n-dimensional space (where n is the number of features). SVM seeks a representation of multi-dimensional datasets in which a hyperplane separates data items into distinct classes. On unseen data, the SVM classifier can minimize the generalization error. SVM is effective for binary classification but performs poorly on noisy data. However, multiple SVMs can be combined in a one-versus-one scheme to achieve multi-class classification. We have used the Error-Correcting Output Codes (ECOC) multi-class model classifier in this study [24]. This paper employs SVM as the classifier, replacing the layers that follow the CNN's pooling layer; the CNN serves as a feature extractor. The SVM classifier uses the attributes of the input weld joint types acquired in the 'pool5' layer as input and is trained using these newly produced picture features. Finally, the learned SVM classifier is utilized to recognize the weld joints.
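The role of the hinge loss in Eq. (1) can be illustrated with a small computation (pure Python; the weights, bias, and data points are invented, with labels in {−1, +1}):

```python
def l1_svm_objective(w, b, X, y, penalty=1.0):
    """Evaluate an L1-SVM-style objective: the regularizer w^T w scaled
    by 1/p, plus a penalty times the summed hinge losses, as in Eq. (1)."""
    p = len(X)
    reg = sum(wi * wi for wi in w)
    hinge_sum = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        hinge_sum += max(0.0, 1.0 - margin)  # zero when margin >= 1
    return reg / p + penalty * hinge_sum

X = [[2.0, 0.0], [-1.0, 1.0]]  # both points classified with margin >= 1
y = [1, -1]
obj = l1_svm_objective([1.0, 0.0], 0.0, X, y)  # only the regularizer remains
```

Because both invented points sit outside the margin, the hinge terms vanish and only the regularizer contributes to the objective.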

3 Implementation and Findings

All experiments in this study were conducted on a PC with an Intel i5 12th-gen CPU @ 2.50 GHz and 8 GB of DDR4 RAM. The following stages are included in the experimental
setup:

3.1 Data Set Used

The dataset used in this study combines the Kaggle dataset1 [25] and a custom dataset
generated by the authors. Table 2 shows the dataset distribution across the various weld joint types.

Table 2. Data distribution of various weld joint types

Butt Joint | V Joint | T Joint | Lap Joint
550 | 550 | 550 | 550

The images from the Kaggle dataset and those from the dataset generated by the authors are in .png format with a size of 640 × 480 pixels. Figure 4 shows sample weld joint images from the dataset.

1 https://www.kaggle.com/datasets/derikmunoz/weld-joint-segments.

Fig. 4. Sample weld joint images from the dataset

3.2 Feature Extraction

When performing feature extraction, this approach begins with a pre-trained network and changes only the last-layer parameters from which we get predictions. Because we use the pre-trained CNN as a static feature extractor and simply change the output layer, the process is called feature extraction; there is no need to train the entire network. Figure 5 depicts the first 25 characteristics acquired by the final pooling layer ('pool5') for illustration purposes.
The suggested approach derives attributes from input photos of weld joints. First, a weld joint image is fed into the proposed hybrid model. The network accepts images with a resolution of 224 × 224 × 3 and computes the output for every supplied specimen joint image. The input to each successive layer is formed by combining the previous hidden layer's results with trainable weights and a bias term [21].
A feature map is thus generated at the pool5 layer and then reduced to a single data block. For the recognition task, the SVM classifier is supplied this vector of distinct attributes.
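The reduction of the pooling-layer feature map to a single vector can be pictured as follows; in this sketch (dimensions are illustrative, not the network's), each channel's spatial map is averaged to one entry of the vector handed to the SVM:

```python
def global_average_pool(feature_maps):
    """Collapse each HxW channel map to its mean, yielding a flat
    feature vector with one entry per channel."""
    vector = []
    for channel in feature_maps:
        total = sum(sum(row) for row in channel)
        count = len(channel) * len(channel[0])
        vector.append(total / count)
    return vector

# Two toy 2x2 channel maps standing in for a pooling-layer output.
features = global_average_pool([[[1, 3], [5, 7]], [[0, 0], [2, 2]]])
```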

3.3 Classification Methodology

After completing the feature extraction steps, the SVM classifier is used to classify weld
joint pictures. SVM classifier training was carried out using feature vectors recorded
in matrix form. The joints have been tested using the results of training. In contrast,
the automatically produced features in the hybrid CNN-SVM model are supplied to the
SVM classifier for training and evaluating the weld joint dataset. An error-correcting
output codes (ECOC) model simplifies a three-class classification challenge to a col-
lection of binary classification problems [26]. The coding scheme is a matrix whose
elements decide which classes each binary learner learns, i.e., how the multi-class issue
is converted to a sequence of binary problems.
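One-versus-one coding trains a binary learner for every pair of classes and lets the learners vote; a schematic sketch with a stand-in pairwise rule (nearest made-up prototype, not the trained SVMs):

```python
from itertools import combinations

def ovo_predict(classes, decide_pair, x):
    """One-vs-one voting: each (a, b) learner votes for a or b, and the
    class collecting the most votes becomes the prediction."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[decide_pair(a, b, x)] += 1
    return max(votes, key=votes.get)

# Stand-in binary rule: pick the class whose 1D prototype is nearer x.
prototypes = {"butt": 0.0, "vee": 1.0, "tee": 2.0, "lap": 3.0}
decide = lambda a, b, x: a if abs(x - prototypes[a]) <= abs(x - prototypes[b]) else b
label = ovo_predict(list(prototypes), decide, 1.2)  # -> "vee"
```

With four joint classes this yields six pairwise learners, matching the number of binary problems an OVO coding matrix specifies.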

Fig. 5. First 25 features learned by pool5 layer

Using hyperparameter optimization, SVM parameters such as the box constraint level and kernel function are determined, because they directly influence SVM classifier performance. The proposed model employs a cubic kernel function with a box constraint value of 1. This study also uses the OVO (one-versus-one) coding design, which covers all possible permutations of class pair assignments. Figures 6 and 7 show the confusion matrices for validation and testing, respectively. In both figures, the columns

Fig. 6. Validation confusion matrix



on the figures' right show the true positive rate (TPR) and false negative rate (FNR). Figure 8 shows sample output of the program.

Fig. 7. Testing confusion matrix

Fig. 8. Classification output of the program



4 Conclusion and Future Scope


This research suggests a pre-trained CNN-based SVM classifier model for weld joint
type detection, including automated feature creation with CNN and output prediction
with SVM. In identifying weld joint pictures, the model incorporates the advantages of
pre-trained-CNN and SVM classifiers. Furthermore, the strategy highlights the use of
non-handcrafted features rather than designed features.
According to the experimental findings, our proposed approach achieved a classification accuracy of 99.7% for the custom dataset. Table 3 compares our technique with related research work, and the results demonstrate that the proposed technique achieves competitive classification performance. Furthermore, the experiments indicate that the method is practical for automated robotic welding due to its ease of use and modest computational requirements, and its equipment costs may be lower than those of earlier techniques.

Table 3. Comparative image recognition results.

Feature extraction algorithm | Accuracy
Features from pool5 layer of ResNet18 passed to SVM classifier (this paper) | 99.7%
Reference [7] | 98.4%
Reference [8] | 89.2%

The pre-trained CNN based SVM classifier study is in its early stages and may be developed further. In the future, the suggested model can be enhanced to recognize additional weld joints such as corner, single-V, U, and double-U joints. Furthermore, specific optimization strategies may be examined to improve classification performance, and the method can be cross-validated using different materials under different lighting conditions. As a further extension of the study, the model's performance under the influence of splash noise will be tested and improved. In addition, the findings could be enhanced if the image preparation applied to the datasets and the underlying CNN were more sophisticated than those used in this research.

References
1. Reisgen, U., Mann, S., Middeldorf, K., Sharma, R., Buchholz, G., Willms, K.: Connected,
digitalized welding production—industrie 4.0 in gas metal arc welding. Welding in the World
63(4), 1121–1131 (2019). https://doi.org/10.1007/s40194-019-00723-2
2. Mahadevan, R., Jagan, A., Pavithran, L., Shrivastava, A., Selvaraj, S.K.: Intelligent welding
by using machine learning techniques. Mater. Today Proc. 46, 74027410 (2021). https://doi.
org/10.1016/j.matpr.2020.12.1149
3. Xiuping, W., Fan, X., Ying, F.: Recognition of the Type of Welding Joint Based on Line
Structured Light Vision, pp. 4403–4406 (2015)
4. Chen, X., Chen, S., Lin, T., Lei, Y.: Practical method to locate the initial weld position using
visual technology. Int. J. Adv. Manuf. Technol. 30(7–8), 663–668 (2006). https://doi.org/10.
1007/s00170-005-0104-z
196 S. Sonwane et al.

Pre-trained CNN Based SVM Classifier for Weld Joint 197

A Two-Stage Federated Transfer Learning
Framework in Medical Images
Classification on Limited Data:
A COVID-19 Case Study

Alexandros Shikun Zhang1(B) and Naomi Fengqi Li2


1 Department of Computer Science, New Mexico State University, Las Cruces, NM 88005, USA
[email protected]
2 Las Cruces, NM 88005, USA

Abstract. The COVID-19 pandemic has spread rapidly and caused a shortage of global medical resources, so the efficiency of COVID-19 diagnosis has become highly significant. As deep learning and convolutional neural networks (CNNs) have been widely utilized and verified in analyzing medical images, they have become powerful tools for computer-assisted diagnosis. However, medical image classification with deep learning faces two significant challenges. One is the difficulty of acquiring enough samples, which may lead to model overfitting. The other mainly stems from privacy concerns, since medical records are often deemed patients' private information and are protected by laws such as the GDPR and HIPAA. Federated learning ensures that model training is decentralized across different devices and that no data is shared among them, which guarantees privacy. However, with data located on different devices, the data accessible to each device could be limited. Since transfer learning has been verified to perform well with limited data, in this paper we implement federated learning and transfer learning techniques with CNNs to classify COVID-19 using lung CT scans. We also explore the impact of the dataset distribution at the client side in federated learning and of the number of local training epochs. Finally, we obtain very high performance with federated learning, demonstrating our success in balancing accuracy and privacy.

Keywords: COVID-19 detection · Deep learning · Transfer learning

1 Introduction
Caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), coronavirus disease 2019 (COVID-19) has become an ongoing pandemic after it was first identified at the
N. F. Li—Independent Scholar.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 198–216, 2023.
https://doi.org/10.1007/978-3-031-18461-1_13

end of the year 2019; owing to its fast spread and infection rate, the World Health Organization (WHO) designated it a pandemic [51]. It is critical to find tools, processes, and resources to rapidly identify infected individuals.
According to several previous studies [9,20,28,53,65,67], computed tomography (CT) offers high diagnostic and prognostic value for COVID-19: CT scans of individuals with COVID-19 often revealed bilateral lung lesions comprised of ground-glass opacity [2], and in some cases, abnormalities and changes were observed [41]. Since CT scans are a popular diagnostic technique that is simple and quick to obtain without significant expense, incorporating CT imaging into the development of a sensitive diagnostic tool may expedite the diagnosis process while also serving as a complement to RT-PCR [2,21,72]. Moreover, utilizing CT imaging to forecast a patient's individualized prognosis may identify prospective high-risk individuals who are more likely to develop severe disease and need immediate medical attention. Researchers have therefore realized that developing effective methods to assist diagnosis is critical.
As a key machine learning method, deep learning has evolved in recent years and has achieved astonishing success in the field of medical image processing [25,44,61]. Because of the superior capability of convolutional neural networks (CNNs) in medical image classification, researchers have begun to concentrate their attention on applying CNNs to an increasing number of medical image processing issues, and previous studies have demonstrated the great capability of CNNs in computer-assisted diagnosis [5,16,49,57,63,64,76]. Previous studies have also achieved exciting results in COVID-19 classification [11,18,22,23,30–32,35,47,55,59]. However, since patients' medical records have always been deemed private and are protected by laws such as the GDPR in the European Union (EU) and HIPAA in the US, collecting the data needed for building high-quality classifiers, such as CT scans, becomes extremely difficult. In some other studies [10,15,19,38,48,56,73,77], the authors built their COVID-19 detection or classification techniques utilizing federated learning. Federated learning is a decentralized computation approach for training a neural network [6,24,46,68] that is able to address privacy concerns in training neural networks. In federated learning, rather than gathering and keeping data in one place for centralized training, participating clients process their own data and communicate model updates to the server, which collects and combines weights from the clients to create a global model [6,24,46,68]. Although federated learning can be used to handle privacy concerns, since data are distributed across clients' devices, the data size that each client can access may be limited, which may compromise overall model performance. Transfer learning is designed to address the issue caused by limited data by transferring knowledge learned from a source task to a target task [52,66,70]; with transfer learning, previous studies have also achieved decent COVID-19 classification or detection performance [3,4,14,17,29,33,34,42,50,54,69,75]. Even though all of the papers

Table 1. Number of data in different categories

Category         Healthy   Covid pneumonia   Non-covid pneumonia
Number of data   10192     3616              4273

mentioned above have proposed different methods and frameworks for COVID-19 classification, there is still a lack of a framework that balances privacy and accuracy by integrating the two-stage transfer learning introduced in [76] with federated learning [6,24,46,68]. Therefore, in this paper, we make a further attempt to balance accuracy and privacy in classifying COVID-19 CT images by combining two-stage transfer learning [76] and federated learning techniques [6,24,46,68].
In this paper, the datasets used are obtained from the COVID-19 Radiography Database [13,58] and Chest X-Ray Images (Pneumonia) [36]. From these two public databases, we obtained 10192 healthy CT scans, 4273 CT scans of bacterial or viral pneumonia not caused by COVID-19, and 3616 COVID-19 CT scans, as shown in Table 1, Fig. 1 and Fig. 2.

Fig. 1. CT scan examples of (a) COVID-19 infection, (b) non-covid pneumonia, and (c) healthy ones

2 Our Contributions
We propose a novel two-stage federated transfer learning approach for classifying lung CT scans, with the first stage classifying Pneumonia versus Healthy and the

Fig. 2. Histogram of data size in different categories

second stage differentiating Covid Pneumonia from Non-Covid Pneumonia; the accuracy achieved over limited data is noteworthy.
We take privacy concerns into consideration by performing model training in a decentralized way with a federated learning technique. Since federated learning requires data to remain on the participating edge devices, combining federated learning with transfer learning further addresses the issue of limited data.
We thoroughly evaluate the model performance of centralized transfer learning and federated transfer learning by measuring sensitivity and specificity, as well as ROC curves and AUC values, and show that our proposed approach has an excellent capability to balance accuracy and privacy.

3 Paper Organization
The rest of this paper is organized as follows. In Sect. 4, we describe the deep learning, transfer learning, and federated learning methodologies, and present the algorithm used in this paper. In Sect. 5, we present our experimental results and discussion. We then conclude in Sect. 6 and discuss future directions in Sect. 7.

4 Theory and Methodology


4.1 Deep Learning, CNNs and Transfer Learning
Deep learning techniques, such as convolutional neural networks (CNNs) [40], are used to generate predictions about future data. CNNs contain several different kinds of layers, such as convolutional layers, pooling layers, and fully connected layers [43,74]; each layer consists of many individual units known as neurons, which simulate the neurons of the human nervous system [26,45]. Figure 3 shows a simple example architecture of

CNNs; in CNNs, each neuron takes inputs, performs a weighted calculation, and passes the result to other neurons through an activation function [43,60]. To construct a decent classifier, CNNs are trained on previously collected data [8,37,39,71]. A large amount of high-quality data is an essential factor in achieving better model training and testing performance, but collecting and labeling data is always resource-consuming, which makes the entire model training process less efficient; transfer learning [52,70] can handle this issue. According to [52,70,76], transfer learning attempts to learn knowledge from source tasks and apply it to a target task, in contrast to the conventional model training process, which attempts to learn each new task from scratch. In this paper, we utilize deep learning and transfer learning techniques to assist us in the classification task.
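The neuron computation described above, a weighted sum of inputs plus a bias passed through an activation function, can be written out directly. A toy sketch with ReLU as the activation; the helper name is ours, not from the paper:

```python
# Toy illustration of a single neuron: weighted sum of inputs plus bias,
# passed through a ReLU activation. Purely illustrative; the paper's
# networks use full Keras layers, not this helper.
import numpy as np

def neuron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through ReLU."""
    z = float(np.dot(inputs, weights) + bias)
    return max(0.0, z)

print(neuron([1.0, 2.0], [0.5, -0.25], 0.1))  # weighted sum 0.0 + bias 0.1 -> 0.1
```

A negative pre-activation is clamped to zero by the ReLU, which is what lets stacked layers model non-linear decision boundaries.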

Fig. 3. Example of a CNN architecture: an input layer, convolutional layers, fully connected layers, and an output layer

4.2 Federated Learning

As a distributed machine learning technique, federated learning allows machine learning models to be trained using decentralized data stored on devices such as mobile phones and computers [6,24,46,68], which addresses the fundamental issues of privacy, ownership, and localization of data.
A neural model may be trained using federated learning: weights from a large number of clients, each trained on its local dataset, are aggregated by the server and combined to build an accurate global model [6,24,46,68].
In [46], the authors proposed FederatedAveraging, a method utilized on the server to aggregate clients' locally updated weights and generate the weights of a global model. According to [46,68], the current global model weights

are sent to a set of clients at the beginning of each training round; clients then start training local models on their locally accessible data, initialized from the received weights. In the particular case of t = 0, all clients start from the same weights obtained from the server, which have either been randomly initialized or pre-trained on other data, depending on the configuration.

4.3 Two-Stage Federated Transfer Learning Framework

The two-stage transfer learning method was first proposed by the authors in [76], who achieved very high performance in classifying lung nodules. We further propose our two-stage federated transfer learning framework, which closely references and is based on the algorithms proposed by the authors in [46,68], as shown in Algorithm 1. In the first stage, CT scans are classified into Healthy and Pneumonia, while in the second stage, we further classify Pneumonia into Covid Pneumonia and Non-Covid Pneumonia. At the beginning of the framework, we first conduct stage-one model training in a federated format, and the weights are saved as a loadable file for transfer learning use in stage two.
As shown in Algorithm 1, the training round count t_e is the number of global federated training rounds given by the user, and the federated averaging interval τ_e is the number of local training epochs each client runs before sending its local weights to the GlobalServer for the federated averaged weights calculation in each training round. w(t) denotes the federated averaged weights obtained at the end of training round t. Before the first training round, at t = 0, we initialize w(0) to a vector containing random values for stage one; for stage two, we initialize w(0) to the pre-trained weights obtained from stage one. At the beginning of each training round at the GlobalServer, the federated averaged weights from the previous training round, w(t − 1), are sent to all clients, and each client i starts local training on its local data D_i based on the received weights in Procedure TrainClient. At training round t, after finishing local training for τ_e epochs, each client sends its weights w_{τ_e}^i(t) to the GlobalServer for the federated averaged weights calculation. As discussed in [46,68], we also take the size of each client's local dataset into consideration and perform a weighted average when calculating the federated averaged weights w(t). If the currently running task is stage one, then after all training rounds end, w(t_e) is saved as a loadable file to be used in stage two. Please note that in this two-stage federated transfer learning approach, stage one must be run prior to stage two in order to generate the pre-trained weights for transfer learning in stage two.
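The aggregation step of Algorithm 1 is a dataset-size-weighted average, w(t) = Σ_i |D_i| · w_{τ_e}^i(t) / |D|. A minimal NumPy sketch under the assumption that each client's weights are represented as a list of arrays (one per layer); the function name is illustrative:

```python
# Minimal sketch of the FederatedAveraging aggregation step:
# w(t) = sum_i |D_i| * w_i / |D|, a dataset-size-weighted average of
# per-client weight lists (one array per model layer).
import numpy as np

def federated_average(client_weights, client_sizes):
    """Weighted average of per-client layer weights by local dataset size."""
    total = sum(client_sizes)
    averaged = []
    for layer_idx in range(len(client_weights[0])):
        acc = np.zeros_like(client_weights[0][layer_idx], dtype=float)
        for weights, size in zip(client_weights, client_sizes):
            acc += (size / total) * weights[layer_idx]
        averaged.append(acc)
    return averaged

# Two clients with |D_1| = 3 and |D_2| = 1: the larger client dominates.
avg = federated_average([[np.array([1.0, 3.0])], [np.array([5.0, 7.0])]], [3, 1])
print(avg[0])  # [2. 4.]
```

With equal client sizes this reduces to a plain mean, which is the balanced-distribution case evaluated in Sect. 5.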

5 Experiments and Results

5.1 Dataset Preparation

As the proposed federated transfer learning framework contains two stages, two different datasets with overlapping data need to be prepared. To create the dataset for stage one, we combined the aforementioned 3616 Covid CT scans

Algorithm 1: Two-Stage Federated Transfer Learning

1  Input: Total Training Rounds t_e, Federated Averaging Interval τ_e, Number of Clients N, Stage Indicator S, Data D with size |D|, Learning Rate η, Batch Size b;
2  Variable: Training Round Counter t, Local Training Epoch Counter τ, Client Index i;
3  Loss Function: l;
4  Output: w(t_e).
5
6  Procedure GlobalServer:
7      if S = stage one then
8          Initialize w(0) as a vector that contains random values;
9      else
10         if S = stage two then
11             Initialize w(0) to the pre-trained weights from stage one;
12         end
13     end
14     for t ← 1, 2, ..., t_e do
15         Send w(t − 1) to all clients;
16         for i ← 1, 2, ..., N do
17             w_{τ_e}^i(t) ← TrainClient(w(t − 1), i);
18         end
19         w(t) ← ( Σ_{i=1}^{N} |D_i| · w_{τ_e}^i(t) ) / |D|;   ▷ calculate the federated averaged weights at the end of this training round
20     end
21     if S = stage one then
22         Save the final federated averaged weights w(t_e) as a loadable file;
23     end
24
25 Procedure TrainClient(w(t), i):
26     Receive w(t) from GlobalServer;
27     w_0^i(t + 1) ← w(t);   ▷ set the initial local model weights of training round t + 1 to the received federated averaged weights
28     Initialize τ ← 1;
29     for τ ← 1, 2, ..., τ_e do
30         w_τ^i(t + 1) ← Optimizer(w_{τ−1}^i(t + 1), η, l, D_i, b);   ▷ update weights based on the previous epoch's weights, learning rate, loss function, local dataset, and batch size, using the chosen optimizer, e.g. gradient descent, SGD, or Adam
31     end
32     Send w_{τ_e}^i(t + 1) to GlobalServer;

Note: This algorithm references and is based on the algorithms proposed in [46,68].

and 4273 CT scans of non-covid bacterial or viral pneumonia into a new category named Pneumonia, which contains 7889 CT scans in total; the other category, Healthy, consists of 10192 CT scans. As for stage two, the 3616 covid CT scans are in the category named Covid Pneumonia, and the other category, Non-Covid Pneumonia, contains 4273 CT scans of pneumonia not caused by Covid, as shown in Table 2. All CT scans are pre-processed into grayscale and resized to 28 by 28 pixels when creating the datasets, in order to be usable by the LeNet model [40].
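The grayscale conversion and 28 × 28 resize can be sketched in plain NumPy; this is a stand-in (channel mean plus block averaging) for the PIL or OpenCV calls a real pipeline would more likely use, and the helper name is ours:

```python
# Sketch of the preprocessing described above: convert an RGB scan to
# grayscale and downscale to 28x28. NumPy stand-in (channel mean and
# block averaging) for typical PIL/OpenCV resizing.
import numpy as np

def preprocess(image, size=28):
    gray = image.mean(axis=2)                     # RGB -> grayscale
    h, w = gray.shape
    gray = gray[: h - h % size, : w - w % size]   # crop to a multiple of size
    bh, bw = gray.shape[0] // size, gray.shape[1] // size
    return gray.reshape(size, bh, size, bw).mean(axis=(1, 3))

img = np.random.rand(299, 299, 3)                 # dummy scan
print(preprocess(img).shape)  # (28, 28)
```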

Table 2. Dataset used in the two-stage federated transfer learning framework

Dataset for stage one:  Healthy 10192, Pneumonia 7889
Dataset for stage two:  Covid pneumonia 3616, Non-covid pneumonia 4273

Table 3. Dataset used in stage one: classifying Healthy and Pneumonia and stage two:
classifying Covid Pneumonia and Non-Covid Pneumonia

Stage one: training set (80%) 14464 (Healthy 8108, Pneumonia 6356); testing set (20%) 3617 (Healthy 2084, Pneumonia 1533)
Stage two: training set (80%) 6311 (Non-covid pneumonia 3408, Covid pneumonia 2903); testing set (20%) 1578 (Non-covid pneumonia 865, Covid pneumonia 713)

5.2 CNN Model: LeNet

In this paper, we utilize LeNet as the model, first classifying CT scans into Healthy and Pneumonia in stage one, then classifying Pneumonia into Covid Pneumonia and Non-Covid Pneumonia in stage two. LeNet is one of the most classic CNN architectures, developed by Yann LeCun [40], and was originally used to classify data from the MNIST dataset [40]. The LeNet architecture we use contains two convolutional layers, two max-pooling layers, and two fully connected layers, with softmax [7] used in the output layer.
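A sketch of this LeNet variant in Keras (the library used in Sect. 5.3). The filter counts, kernel sizes, dense-layer widths, and activations follow classic LeNet-5 and are our assumptions; the paper fixes only the layer types, the softmax output, and the 28 × 28 grayscale input:

```python
# Sketch of the LeNet-style model described above: two convolutional
# layers, two max-pooling layers, two fully connected layers, and a
# softmax output for 28x28 grayscale inputs. Filter counts, kernel sizes
# and activations are assumptions in the spirit of LeNet-5.
import tensorflow as tf

def build_lenet(num_classes=2):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),   # grayscale CT scan
        tf.keras.layers.Conv2D(6, 5, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(16, 5, activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(120, activation="relu"),
        tf.keras.layers.Dense(84, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
```

Each stage is a binary task, so num_classes = 2; for stage two, the saved stage-one weights would be loaded before training, as in Algorithm 1.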

5.3 Experiment Results

In this paper, we conducted our experiments in simulation on a single computer with a GTX 1070 Ti GPU; TensorFlow [1] and Keras [12] were utilized to construct the CNN model. We utilized 80% of the dataset as the training set and the remaining 20% as the testing set, as shown in Table 3.
Before performing our proposed federated transfer learning, we first implemented two-stage centralized transfer learning, as discussed in [76]. Centralized learning is the traditional training format, where the dataset is located on only one device and the model is trained on all data points. The results of the centralized learning format are used as a baseline against which our proposed two-stage federated transfer learning framework is compared. As for the model training configuration, we first train our model for stage one with the training epochs set to 20, the batch size set to 32, and the learning rate set to 0.001; for stage two, since the weights from stage one are transferred, we reduce the training epochs to 10, while the batch size and learning rate remain unchanged.

After training models in the centralized setting, we then trained models using the proposed federated transfer learning framework. In federated learning, the weights of all clients are sent to the GlobalServer for federated averaging [46] in each training round, after being trained at the client side for a certain number of local training epochs; we therefore take the effect of the federated averaging interval into consideration. As shown in Algorithm 1, in our proposed framework, the federated averaging interval controls the number of epochs a local model is trained at the client side. In our experiments, we create five clients and train our models with the federated averaging interval set from 1 to 10, in order to explore how it relates to performance. Data distribution at each client may also be a key factor for overall performance; therefore, in our experiments, we explore the influence of data distribution by training models in two scenarios, as shown in Table 4: (1) distributing the training set to the clients evenly, with 20% of the data for each client, which is marked as balanced, and (2) distributing the data to the clients unevenly, with the five clients having access to 30%, 25%, 20%, 15%, and 10% of the data, respectively, which is marked as unbalanced. Please note that in the federated model training process, the number of training rounds is set to 20 in stage one and 10 in stage two, the learning rate is set to 0.001, and the batch size is set to 32 in both stages, which corresponds to the parameters of the aforementioned centralized training.

Table 4. Balanced and unbalanced dataset distribution at 5 clients

Balanced Unbalanced
Client 1 20% Client 1 30%
Client 2 20% Client 2 25%
Client 3 20% Client 3 20%
Client 4 20% Client 4 15%
Client 5 20% Client 5 10%
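The balanced and unbalanced schemes in Table 4 amount to slicing a shuffled index list by the given fractions. A sketch over the stage-one training set of 14464 samples; the helper name and seed are illustrative, since the paper does not publish its partitioning code:

```python
# Sketch of distributing the stage-one training set (14464 samples) across
# five clients using the balanced (20% each) and unbalanced
# (30/25/20/15/10%) schemes of Table 4.
import numpy as np

def partition(n_samples, fractions, seed=0):
    """Split shuffled sample indices into one shard per fraction."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    cuts = np.cumsum([int(f * n_samples) for f in fractions[:-1]])
    return np.split(idx, cuts)  # last shard absorbs any rounding remainder

balanced = partition(14464, [0.20] * 5)
unbalanced = partition(14464, [0.30, 0.25, 0.20, 0.15, 0.10])
print([len(p) for p in unbalanced])  # [4339, 3616, 2892, 2169, 1448]
```

Each shard then serves as one client's local dataset D_i in Algorithm 1.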

To evaluate the performance, we tested our models on the testing set. Due to the data imbalance between categories, traditional accuracy may be biased by the size of each category, so we utilize the ROC curve and AUC value for a more robust evaluation of model performance. ROC curves of models trained with balanced data distribution are shown in Fig. 4, Fig. 5, Fig. 6 and Fig. 7, and ROC curves of models trained with unbalanced data distribution are shown in Fig. 8, Fig. 9, Fig. 10 and Fig. 11. All AUC values are recorded, and we have also calculated precision, sensitivity, and specificity. When calculating AUC, precision, and sensitivity, we consider Pneumonia as positive and Healthy as negative in stage one, while in stage two, Covid Pneumonia is considered positive and Non-Covid Pneumonia negative. Precision is calculated using the following Eq. 1,

Precision = TruePositive / (TruePositive + FalsePositive)    (1)

while sensitivity is calculated as shown in Eq. 2,

Sensitivity = TruePositive / (TruePositive + FalseNegative)    (2)

and the following Eq. 3 calculates specificity.

Specificity = TrueNegative / (TrueNegative + FalsePositive)    (3)
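Eqs. 1–3 can be evaluated directly from confusion-matrix counts. As a sanity check, the sketch below plugs in the centralized stage-one counts from Table 5 and reproduces the corresponding row of Table 6:

```python
# Precision, sensitivity and specificity (Eqs. 1-3) from confusion-matrix
# counts, checked against the centralized stage-one row of Tables 5 and 6
# (TP=1423, TN=1925, FP=159, FN=110).
def precision(tp, fp):
    return tp / (tp + fp)          # Eq. 1

def sensitivity(tp, fn):
    return tp / (tp + fn)          # Eq. 2

def specificity(tn, fp):
    return tn / (tn + fp)          # Eq. 3

tp, tn, fp, fn = 1423, 1925, 159, 110
print(round(precision(tp, fp), 4))    # 0.8995
print(round(sensitivity(tp, fn), 4))  # 0.9282
print(round(specificity(tn, fp), 4))  # 0.9237
```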
The recorded confusion matrix values are shown in Table 5, and the AUC, precision, sensitivity, and specificity values are shown in Table 6. Please note that the values in Table 6 are rounded to four decimals; as a result, some values in Table 6 appear identical even though they differ before rounding.
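The AUC values in Table 6 can also be computed without plotting an ROC curve: the AUC equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one (ties counted half). A minimal rank-based sketch, with toy labels and scores rather than the paper's model outputs:

```python
# The AUC of an ROC curve equals the probability that a random positive
# sample is scored higher than a random negative one (ties count half).
# Illustrative stand-in for sklearn's roc_curve/auc calls.
def auc_score(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_score([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # perfect separation -> 1.0
```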

Fig. 4. ROC curves of stage one with balanced data distribution

Fig. 5. Upper left zoom-in of ROC curves of stage one with balanced data distribution

Fig. 6. ROC curves of stage two with balanced data distribution

Fig. 7. Upper left zoom-in of ROC curves of stage two with balanced data distribution

5.4 Discussion

The results of the experiments show that our proposed two-stage federated transfer learning framework achieves excellent accuracy in both stages. Comparing balanced and unbalanced data distributions at the client side, we can see that the dataset distribution at the client side may not affect overall model performance in the current two-stage classification task. Additionally, we observed that the models achieved very high classification performance in stage two even with the federated averaging interval set to 1. However, the results of the stage-one classification showed that increasing the federated averaging interval might help
Fig. 8. ROC curves of stage one with unbalanced data distribution



Fig. 9. Upper left zoom-in of ROC curves of stage one with unbalanced data distribu-
tion

Fig. 10. ROC curves of stage two with unbalanced data distribution

Fig. 11. Upper left zoom-in of ROC curves of stage two with unbalanced data distri-
bution

the model achieve better performance, as can be observed from the sensitivity values. Nevertheless, the performance may not always be positively correlated with the federated averaging interval, as too many local training epochs could result in overfitting.

Table 5. Confusion matrix values of all models

Stage      Training setting  Data dist.  Fed. averaging interval  TP    TN    FP   FN
Stage one  Centralized       N/A         N/A                      1423  1925  159  110
Stage two  Centralized       N/A         N/A                      700   858   7    13
Stage one Federated Balanced 1 1260 1964 120 273
2 1316 2005 79 217
3 1381 1980 104 152
4 1359 2005 79 174
5 1377 2010 74 156
6 1436 1959 125 97
7 1383 2006 78 150
8 1402 1992 92 131
9 1388 1978 106 145
10 1412 1976 108 121
Stage two Federated Balanced 1 698 850 15 15
2 701 857 8 12
3 699 857 8 14
4 696 859 6 17
5 697 861 4 16
6 703 853 12 10
7 698 857 8 15
8 700 858 7 13
9 700 854 11 13
10 700 861 4 13
Stage one Federated Unbalanced 1 1287 1972 112 246
2 1335 2002 82 198
3 1409 1989 95 124
4 1397 1978 106 136
5 1393 1989 95 140
6 1385 1974 110 148
7 1400 1980 104 133
8 1425 1981 103 108
9 1413 1976 108 120
10 1419 1986 98 114
Stage two Federated Unbalanced 1 689 857 8 24
2 697 857 8 16
3 704 861 4 9
4 699 859 6 14
5 698 858 7 15
6 697 861 4 16
7 695 859 6 18
8 697 860 5 16
9 702 859 6 11
10 702 858 7 11

Table 6. AUC, pre. (precision), sen. (sensitivity) and spe. (specificity) values of all
models

Stage Training setting Data dist. Fed. averaging interval AUC Pre. Sen. Spe.
Stage one Centralized N/A N/A 0.9801 0.8995 0.9282 0.9237
Stage two Centralized N/A N/A 0.9992 0.9901 0.9818 0.9919
Stage one Federated Balanced 1 0.9626 0.9130 0.8219 0.9424
2 0.9761 0.9434 0.8584 0.9621
3 0.9790 0.9300 0.9008 0.9501
4 0.9812 0.9451 0.8865 0.9621
5 0.9829 0.9490 0.8982 0.9645
6 0.9807 0.9199 0.9367 0.9400
7 0.9836 0.9466 0.9022 0.9626
8 0.9818 0.9384 0.9145 0.9559
9 0.9793 0.9290 0.9054 0.9491
10 0.9820 0.9289 0.9211 0.9482
Stage two Federated Balanced 1 0.9986 0.9790 0.9790 0.9827
2 0.9987 0.9887 0.9832 0.9908
3 0.9980 0.9887 0.9804 0.9908
4 0.9972 0.9915 0.9762 0.9931
5 0.9977 0.9943 0.9776 0.9954
6 0.9981 0.9832 0.9860 0.9861
7 0.9987 0.9887 0.9790 0.9908
8 0.9987 0.9901 0.9818 0.9919
9 0.9992 0.9845 0.9818 0.9873
10 0.9986 0.9943 0.9818 0.9954
Stage one Federated Unbalanced 1 0.9655 0.9199 0.8395 0.9463
2 0.9779 0.9421 0.8708 0.9607
3 0.9816 0.9368 0.9191 0.9544
4 0.9808 0.9295 0.9113 0.9491
5 0.9815 0.9362 0.9087 0.9544
6 0.9796 0.9264 0.9035 0.9472
7 0.9821 0.9309 0.9132 0.9501
8 0.9847 0.9326 0.9295 0.9506
9 0.9808 0.9290 0.9217 0.9482
10 0.9825 0.9354 0.9256 0.9530
Stage two Federated Unbalanced 1 0.9988 0.9885 0.9663 0.9908
2 0.9981 0.9887 0.9776 0.9908
3 0.9978 0.9944 0.9874 0.9954
4 0.9983 0.9915 0.9804 0.9931
5 0.9974 0.9901 0.9790 0.9919
6 0.9972 0.9943 0.9776 0.9954
7 0.9981 0.9914 0.9748 0.9931
8 0.9994 0.9929 0.9776 0.9942
9 0.9979 0.9915 0.9846 0.9931
10 0.9979 0.9901 0.9846 0.9919

6 Conclusion
In this paper, we proposed a two-stage federated transfer learning framework to address privacy concerns while achieving high accuracy. We also explored the relationship between performance and the number of epochs for which local models are trained. The results of our experiments showed that the accuracy of the proposed framework is surprisingly good compared to centralized learning.

7 Future Direction
In our current work, due to hardware limitations, the simulation experiments of our proposed framework were run only on the LeNet model. Future work may focus on running the proposed framework on other, much more complicated CNNs, such as AlexNet [37], VGG [62], and ResNet [27]. We may also further explore the time and other resources consumed when increasing the number of local training epochs at the client side, and focus on achieving high accuracy in resource-constrained environments.

References
1. Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous sys-
tems (2015). http://tensorflow.org/
2. Ai, T., et al.: Correlation of chest CT and RT-PCR testing for coronavirus disease
2019 (COVID-19) in China: a report of 1014 cases. Radiology 296(2), E32–E40
(2020)
3. Altaf, F., Islam, S., Janjua, N.K.: A novel augmented deep transfer learning for
classification of COVID-19 and other thoracic diseases from X-rays. Neural Com-
put. Appl. 33(20), 14037–14048 (2021)
4. Aslan, M.F., Unlersen, M.F., Sabanci, K., Durdu, A.: CNN-based transfer learning-
BiLSTM network: a novel approach for COVID-19 infection detection. Appl. Soft
Comput. 98, 106912 (2021)
5. Barbu, A., Lu, L., Roth, H., Seff, A., Summers, R.M.: An analysis of robust
cost functions for CNN in computer-aided diagnosis. Comput. Methods Biomech.
Biomed. Eng. Imaging Vis. 6(3), 253–258 (2018)
6. Bonawitz, K., et al.: Towards federated learning at scale: system design. Proc.
Mach. Learn. Syst. 1, 374–388 (2019)
7. Bridle, J.: Training stochastic model recognition algorithms as networks can lead
to maximum mutual information estimation of parameters. In: Advances in Neural
Information Processing Systems, vol. 2 (1989)
8. Çalik, R.C., Demirci, M.F.: Cifar-10 image classification with convolutional neural
networks for embedded systems. In: 2018 IEEE/ACS 15th International Conference
on Computer Systems and Applications (AICCSA), pp. 1–2. IEEE (2018)
9. Carotti, M., et al.: Chest CT features of coronavirus disease 2019 (COVID-19)
pneumonia: key points for radiologists. Radiol. Med. 125(7), 636–646 (2020)
10. Cetinkaya, A.E., Akin, M., Sagiroglu, S.: A communication efficient federated learn-
ing approach to multi chest diseases classification. In: 2021 6th International Con-
ference on Computer Science and Engineering (UBMK), pp. 429–434. IEEE (2021)
A Two-Stage Federated Transfer Learning Framework 213

11. Chen, J.I.-Z.: Design of accurate classification of COVID-19 disease in X-ray images
using deep learning approach. J. ISMAC 3(02), 132–148 (2021)
12. Chollet, F., et al.: Keras (2015). https://keras.io
13. Chowdhury, M.E.H., et al.: Can AI help in screening viral and COVID-19 pneu-
monia? IEEE Access 8, 132665–132676 (2020)
14. Das, N.N., Kumar, N., Kaur, M., Kumar, V., Singh, D.: Automated deep trans-
fer learning-based approach for detection of COVID-19 infection in chest X-rays.
IRBM (2020)
15. Dayan, I., et al.: Federated learning for predicting clinical outcomes in patients
with COVID-19. Nat. Med. 27(10), 1735–1743 (2021)
16. Duran-Lopez, L., Dominguez-Morales, J.P., Conde-Martin, A.F., Vicente-Diaz, S.,
Linares-Barranco, A.: PROMETEO: a CNN-based computer-aided diagnosis sys-
tem for WSI prostate cancer detection. IEEE Access 8, 128613–128628 (2020)
17. El Gannour, O., Hamida, S., Cherradi, B., Raihani, A., Moujahid, H.: Performance
evaluation of transfer learning technique for automatic detection of patients with
COVID-19 on X-ray images. In: 2020 IEEE 2nd International Conference on Elec-
tronics, Control, Optimization and Computer Science (ICECOCS), pp. 1–6. IEEE
(2020)
18. Elzeki, O.M., Shams, M., Sarhan, S., Abd Elfattah, M., Hassanien, A.E.: COVID-
19: a new deep learning computer-aided model for classification. PeerJ Comput.
Sci. 7, e358 (2021)
19. Feki, I., Ammar, S., Kessentini, Y., Muhammad, K.: Federated learning for
COVID-19 screening from chest X-ray images. Appl. Soft Comput. 106, 107330
(2021)
20. Francone, M., et al.: Chest CT score in COVID-19 patients: correlation with disease
severity and short-term prognosis. Eur. Radiol. 30(12), 6808–6817 (2020)
21. Gietema, H.A., et al.: CT in relation to RT-PCR in diagnosing COVID-19 in the
Netherlands: a prospective study. PLoS ONE 15(7), e0235844 (2020)
22. Gilanie, G., et al.: Coronavirus (COVID-19) detection from chest radiology images
using convolutional neural networks. Biomed. Signal Process. Control 66, 102490
(2021)
23. Gupta, A., Gupta, S., Katarya, R., et al.: InstaCovNet-19: a deep learning classifi-
cation model for the detection of COVID-19 patients using chest X-ray. Appl. Soft
Comput. 99, 106859 (2021)
24. Hard, A., et al.: Federated learning for mobile keyboard prediction. arXiv preprint
arXiv:1811.03604 (2018)
25. Haskins, G., Kruger, U., Yan, P.: Deep learning in medical image registration: a
survey. Mach. Vis. Appl. 31(1), 1–18 (2020). https://doi.org/10.1007/s00138-020-01060-x
26. He, J., Yang, H., He, L., Zhao, L.: Neural networks based on vectorized neurons.
Neurocomputing 465, 63–70 (2021)
27. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)
28. Herpe, G., et al.: Efficacy of chest CT for COVID-19 pneumonia in France. Radi-
ology 298(2), E81–E87 (2020)
29. Horry, M.J., et al.: COVID-19 detection through transfer learning using multimodal
imaging data. IEEE Access 8, 149808–149824 (2020)
30. Hussain, E., Hasan, M., Rahman, M.A., Lee, I., Tamanna, T., Parvez, M.Z.:
CoroDet: a deep learning based classification for COVID-19 detection using chest
X-ray images. Chaos Solit. Fractals 142, 110495 (2021)

31. Ibrahim, A.U., Ozsoz, M., Serte, S., Al-Turjman, F., Yakoi, P.S.: Pneumonia clas-
sification using deep learning from chest X-ray images during COVID-19. Cogn.
Comput. 1–13 (2021)
32. Ibrahim, D.M., Elshennawy, N.M., Sarhan, A.M.: Deep-chest: multi-classification
deep learning model for diagnosing COVID-19, pneumonia, and lung cancer chest
diseases. Comput. Biol. Med. 132, 104348 (2021)
33. Jaiswal, A., Gianchandani, N., Singh, D., Kumar, V., Kaur, M.: Classification of
the COVID-19 infected patients using DenseNet201 based deep transfer learning.
J. Biomol. Struct. Dyn. 39(15), 5682–5689 (2021)
34. Katsamenis, I., Protopapadakis, E., Voulodimos, A., Doulamis, A., Doulamis, N.:
Transfer learning for COVID-19 pneumonia detection and classification in chest X-
ray images. In: 24th Pan-Hellenic Conference on Informatics, pp. 170–174 (2020)
35. Keidar, D., et al.: COVID-19 classification of X-ray images using deep neural net-
works. Eur. Radiol. 31(12), 9654–9663 (2021)
36. Kermany, D.S., et al.: Identifying medical diagnoses and treatable diseases by
image-based deep learning. Cell 172(5), 1122–1131 (2018)
37. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. In: Advances in Neural Information Processing Sys-
tems, vol. 25 (2012)
38. Kumar, R., et al.: Blockchain-federated-learning and deep learning models for
COVID-19 detection using CT imaging. IEEE Sens. J. 21(14), 16301–16314 (2021)
39. Kussul, E., Baidyk, T.: Improved method of handwritten digit recognition tested
on MNIST database. Image Vis. Comput. 22(12), 971–981 (2004)
40. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
41. Lee, E.Y.P., Ng, M.-Y., Khong, P.-L.: COVID-19 pneumonia: what has CT taught
us? Lancet Infect. Dis. 20(4), 384–385 (2020)
42. Li, C., Yang, Y., Liang, H., Boying, W.: Transfer learning for establishment
of recognition of COVID-19 on CT imaging using small-sized training datasets.
Knowl.-Based Syst. 218, 106849 (2021)
43. Li, Z., Liu, F., Yang, W., Peng, S., Zhou, J.: A survey of convolutional neural
networks: analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn.
Syst. (2021)
44. Litjens, G., et al.: A survey on deep learning in medical image analysis. Med. Image
Anal. 42, 60–88 (2017)
45. McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous
activity. Bull. Math. Biophys. 5(4), 115–133 (1943)
46. McMahan, B., Moore, E., Ramage, D., Hampson, S., Arcas, B.A.: Communication-
efficient learning of deep networks from decentralized data. In: Artificial Intelli-
gence and Statistics, pp. 1273–1282. PMLR (2017)
47. Narin, A., Kaya, C., Pamuk, Z.: Automatic detection of coronavirus disease
(COVID-19) using X-ray images and deep convolutional neural networks. Pattern
Anal. Appl. 24(3), 1207–1220 (2021)
48. Nguyen, D.C., Ding, M., Pathirana, P.N., Seneviratne, A., Zomaya, A.Y.: Feder-
ated learning for COVID-19 detection with generative adversarial networks in edge
cloud computing. IEEE Internet Things J. (2021)
49. Okamoto, T., et al.: Feature extraction of colorectal endoscopic images for
computer-aided diagnosis with CNN. In: 2019 2nd International Symposium on
Devices, Circuits and Systems (ISDCS), pp. 1–4. IEEE (2019)
50. Oluwasanmi, A., et al.: Transfer learning and semisupervised adversarial detection
and classification of COVID-19 in CT images. Complexity 2021 (2021)

51. World Health Organization: Laboratory testing for coronavirus disease (COVID-
19) in suspected human cases: interim guidance, 19 March 2020. Technical report,
World Health Organization (2020)
52. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng.
22(10), 1345–1359 (2009)
53. Parry, A.H., Wani, A.H., Shah, N.N., Yaseen, M., Jehangir, M.: Chest CT features
of coronavirus disease-19 (COVID-19) pneumonia: which findings on initial CT can
predict an adverse short-term outcome? BJR Open 2, 20200016 (2020)
54. Pathak, Y., Shukla, P.K., Tiwari, A., Stalin, S., Singh, S.: Deep transfer learning
based classification model for COVID-19 disease. IRBM (2020)
55. Pham, T.D.: A comprehensive study on classification of COVID-19 on computed
tomography with pretrained convolutional neural networks. Sci. Rep. 10(1), 1–8
(2020)
56. Qayyum, A., Ahmad, K., Ahsan, M.A., Al-Fuqaha, A., Qadir, J.: Collaborative
federated learning for healthcare: multi-modal COVID-19 diagnosis at the edge.
arXiv preprint arXiv:2101.07511 (2021)
57. Qiu, Y., et al.: A new approach to develop computer-aided diagnosis scheme of
breast mass classification using deep learning technology. J. X-ray Sci. Technol.
25(5), 751–763 (2017)
58. Rahman, T., et al.: Exploring the effect of image enhancement techniques on
COVID-19 detection using chest X-ray images. Comput. Biol. Med. 132, 104319
(2021)
59. Sakib, S., Tazrin, T., Fouda, M.M., Fadlullah, Z.M., Guizani, M.: DL-CRC: deep
learning-based chest radiograph classification for COVID-19 detection: a novel app-
roach. IEEE Access 8, 171575–171589 (2020)
60. Sharma, S., Sharma, S., Athaiya, A.: Activation functions in neural networks.
Towards Data Sci. 6(12), 310–316 (2017)
61. Shen, D., Guorong, W., Suk, H.-I.: Deep learning in medical image analysis. Annu.
Rev. Biomed. Eng. 19, 221–248 (2017)
62. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)
63. Sun, W., Zheng, B., Qian, W.: Computer aided lung cancer diagnosis with deep
learning algorithms. In: Medical Imaging 2016: Computer-Aided Diagnosis, vol.
9785, pp. 241–248. SPIE (2016)
64. Tanaka, H., Chiu, S.-W., Watanabe, T., Kaoku, S., Yamaguchi, T.: Computer-
aided diagnosis system for breast ultrasound images using deep learning. Phys.
Med. Biol. 64(23), 235013 (2019)
65. Tenda, E.D., et al.: The importance of chest CT scan in COVID-19: a case series.
Acta Med. Indones. 52(1), 68–73 (2020)
66. Torrey, L., Shavlik, J.: Transfer learning. In Handbook of research on machine
learning applications and trends: algorithms, methods, and techniques, pp. 242–
264. IGI Global (2010)
67. Ufuk, F., Savaş, R.: Chest CT features of the novel coronavirus disease (COVID-
19). Turk. J. Med. Sci. 50(4), 664–678 (2020)
68. Wang, S., et al.: When edge meets learning: adaptive control for resource-
constrained distributed machine learning. In: IEEE INFOCOM 2018-IEEE Con-
ference on Computer Communications, pp. 63–71. IEEE (2018)
69. Wang, S.-H., Nayak, D.R., Guttery, D.S., Zhang, X., Zhang, Y.-D.: COVID-19 clas-
sification by CCSHNet with deep fusion using transfer learning and discriminant
correlation analysis. Inf. Fusion 68, 131–148 (2021)

70. Weiss, K., Khoshgoftaar, T.M., Wang, D.D.: A survey of transfer learning. J. Big
Data 3(1), 1–40 (2016). https://doi.org/10.1186/s40537-016-0043-6
71. Wu, M., Chen, L.: Image recognition based on deep learning. In: 2015 Chinese
Automation Congress (CAC), pp. 542–546. IEEE (2015)
72. Xie, X., Zhong, Z., Zhao, W., Zheng, C., Wang, F., Liu, J.: Chest CT for typical
coronavirus disease 2019 (COVID-19) pneumonia: relationship to negative RT-
PCR testing. Radiology 296(2), E41–E45 (2020)
73. Yan, B., et al.: Experiments of federated learning for COVID-19 chest X-ray
images. In: Sun, X., Zhang, X., Xia, Z., Bertino, E. (eds.) ICAIS 2021. CCIS,
vol. 1423, pp. 41–53. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-78618-2_4
74. Zhang, Q.: Convolutional neural networks. In: Proceedings of the 3rd International
Conference on Electromechanical Control Technology and Transportation, pp. 434–
439 (2018)
75. Zhang, R., et al.: COVID19XrayNet: a two-step transfer learning model for the
COVID-19 detecting problem based on a limited number of chest x-ray images.
Interdiscip. Sci. Comput. Life Sci. 12(4), 555–565 (2020)
76. Zhang, S., et al.: Computer-aided diagnosis (CAD) of pulmonary nodule of thoracic
CT image using transfer learning. J. Digit. Imaging 32(6), 995–1007 (2019)
77. Zhang, W., et al.: Dynamic-fusion-based federated learning for COVID-19 detec-
tion. IEEE Internet Things J. 8(21), 15884–15891 (2021)
Graph Emotion Distribution Learning
Using EmotionGCN

A. Revanth(B) and C.P. Prathibamol

Department of Computer Science, Amrita Vishwa Vidyapeetham, Kerala, India


[email protected], [email protected]

Abstract. A person's emotion can be identified from patterns in the way
they think, respond, communicate, and behave in a social environment.
Emotion strongly influences decision making and behavior toward the
surroundings and the well-being of others. Identifying a person's emotion
is especially useful in the medical domain. A key difficulty in detecting
emotions from such data is that a single image can evoke different
emotions in different people, each from their own perspective. With
advances in computer vision, deep convolutional networks make it possible
to build a convolutional neural network model that learns these emotions
from the given input, together with a graph convolutional network that
estimates the probability distribution over all emotions as well as the
distribution of each emotion separately. The distribution of each emotion
can be converted into graph-based data, so that it can be stored and
reused to train new models without long training times. These graph data
are more portable and pave the way for further psychological analysis to
extract patterns from them.

Keywords: CNN · GCN · Modularity · Degree distribution · Sparse diagonal matrix

1 Introduction

Emotion recognition is one of the important problems in the field of psychology.
Much psychological research involves recording real-time human activity, such
as physical or mental condition. Recording such emotional data through a
human-computer interface (HCI) is effective, but when the number of test
samples is high, the process becomes expensive. So there is a need to obtain
the highest-quality emotional data under a cost constraint. Even though
expressing an emotion depends on a person's mental condition, it is reflected
on the faces of the test subjects. Owing to advances in technology, various
machine learning and deep learning algorithms are used in these tests to
capture the facial emotions of the test subjects [12]. The neural network
architecture most commonly used for image processing is the convolutional
neural network (CNN). A CNN has automatic feature selection capabilities,
which make it
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 217–229, 2023.
https://doi.org/10.1007/978-3-031-18461-1_14
218 A. Revanth and C. P. Prathibamol

well suited to working with image-based data. For emotion recognition, the
CNN is trained with a set of images containing faces showing different
emotions. However, the selected features are hidden, and the number of network
layers is assigned by trial and error [19]. This is a supervised learning
approach, as training requires labels for the different emotions in the
dataset. The distribution formed after training the CNN is hidden and cannot
be reused; as the data grow, this leads to overfitting of the model.
Traditional emotion recognition algorithms mainly focus on personalized
emotion categories obtained from psychological models such as Mikel's wheel.
Facial emotion recognition is used in digital forensics to detect suspicious
activity, in psychological research, and elsewhere. The CNN is a high-accuracy,
state-of-the-art model for image processing [6]. It enables automatic feature
selection and outputs a one-dimensional vector that is easily passed through
an activation function. Emotions are highly subjective, so a CNN is a natural
choice. The graph convolutional network (GCN) operates like a CNN, but on
generic graphs, followed by a linear transformation: for each node it
aggregates the features of all adjacent nodes, and it is used in particular to
identify categorical relationships and semantic embeddings [11]. It
automatically selects a fixed number of neighboring nodes for each feature.
As with any human behavior, expressing a personalized emotion differs from
person to person, and when the number of classes is large, high correlation
between two or more emotions leads to confusion. The traditional CNN used for
image emotion (personality) recognition has multiple layers, and the number of
layers must be high to obtain good performance, which requires high
computational cost [20]. It is therefore also difficult to fine-tune for a
specific application. Moreover, the distribution obtained from a CNN after
training is not in a suitable form to analyze or compare with other,
unspecified emotions. Hence there is a need for a model that considers the
whole probability distribution of each emotion, so that it can be easily
compared with the distribution of the training data.
The main motivation of this research is to ease the process of image emotion
recognition by determining the graph probability ground-truth distribution for
psychological analysis and for training other emotion recognition models [23].
These graph probability distributions are a general representation of each
emotion in psychological models such as Mikel's wheel. They are represented as
undirected graphs with varying degrees, nodes, and edges. To extract more
information, these graphs are further analyzed with graph analysis tools.

2 Related Work
As our work targets finding and analyzing the emotion correlation graph, we
briefly discuss the following architectures:
Two-Model Architecture. The CNN-GCN architecture consists of a convolutional
neural network for feature extraction from the images and a graph
convolutional network for weight generation from the annotated text [3]. The
problem with this model is that it combines two different models, one of which
performs feature selection whose output is fed back as input features of the
CNN model. The training time is therefore large, and if each emotion must be
trained separately, the same process is repeated for each one, which consumes
a lot of time [2]. If the psychological model changes, retraining is required
to keep up with the changed pattern. The major drawback of this model is that
the dataset must contain both the images and text captions, with an
established relation between each text and its corresponding image, and such
datasets are difficult to scale for further training [1].
Text-Based Models. Many emotion recognition models rely on text-based
annotations to determine the emotion of an image, which requires text-based
models. The most widely used word embedding models are word2vec and GloVe
(Global Vectors); both provide pretrained word vectors [8]. Word embedding
models capture the context of words, whereas earlier NLP models treat each
word as a separate feature, so with a very large vocabulary the number of
features also grows. In word embedding, each word is represented as a vector;
any word in the corpus can then be represented by a single vector with, say,
ten features, although individual features no longer map to specific words.
Every NLP model must convert each word to a numeric form. The word2vec model
can be used as a continuous bag-of-words (CBOW) model or a skip-gram model
[5]. In the CBOW model, the context words are given and the target word is
predicted; in the skip-gram model, the target word is given and the context
words are predicted from it. The word2vec model considers only the local
properties of words, whereas the GloVe model also considers global properties
[4]. For implementing these models, the graph convolutional network model is
used [7].
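The difference between the two word2vec training modes described above can be illustrated by how training pairs are generated. The toy corpus, window size, and helper function below are illustrative assumptions, not part of the paper.

```python
# Toy illustration of the two word2vec modes: CBOW predicts the target word
# from its surrounding context, while skip-gram predicts each context word
# from the target. (Corpus and window size here are made up.)

def make_pairs(tokens, window=1):
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        if context:
            cbow.append((tuple(context), target))           # context -> target
            skipgram.extend((target, c) for c in context)   # target -> context
    return cbow, skipgram

cbow, sg = make_pairs(["faces", "express", "human", "emotion"])
```

For the four-token corpus above, CBOW produces one (context, target) pair per position, while skip-gram expands each position into one pair per context word, so the skip-gram list is longer.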
Graph Convolutional Network. Graph convolutional networks are architectures
for graphs based on message passing. The algorithm resembles label
propagation, which passes labels via message passing, whereas the GCN passes
input features via message passing [8]; the GCN thus provides feature
smoothing. The GCN first gathers the attribute vectors from the neighboring
nodes and applies an aggregation function; the feature size remains the same
after aggregation [9]. The aggregated feature vector is then passed through a
dense neural network layer, whose output is the new vector representation of
the corresponding node. This process is repeated for every other node of the
graph. If the GCN has more than one layer, the updated vector is again
aggregated with its neighbors and passed to a dense layer to form the
replacement vector of that node [10]. The dense-layer processing is the same
as in a convolutional neural network; the only difference is the
pre-processing step before each convolution operation using an aggregation
function. Here, the size

of the node vectors coming out of the GCN layer is determined by the number
of units in the neural network layer [13].
The document modelling architecture uses a GCN to extract features from text.
Again, the major drawback is that the dataset must contain both images and
text captions with an established relation between each text and its
corresponding image, and such datasets are difficult to scale for further
training [14]. Here three separate models form a continuous pipeline, so the
training time is high and the dataset must contain captions related to the
image dataset [15].
The number of GCN layers should not be large, even when an optimization
algorithm is used. Since the probability distribution is expressed in terms of
linear data, when the dataset becomes large the distribution is effectively
treated as a Gaussian and the differences between personalities cannot be
determined [16]. A word corpus is also required to train the network, which is
unavailable in most raw image datasets. Moreover, this model does not involve
any human emotions [17].

3 Method

3.1 Preprocessing

Pre-processing the given data is mandatory for this model. As the model
architecture consists only of GCNs, it accepts only graph-based data as
features. In the pre-processing step, the given image dataset is therefore
converted into a network graph data structure: all pixel intensity values are
converted to graph nodes, edges are formed between them, and nodes with
self-loops are removed using a feature selection process [18,25]. The
pre-processing algorithm is as follows:

1. Label-encode the given dataset according to its classes.
2. Convert the images to grayscale, since color has no effect on emotion
recognition.
3. Replace each pixel intensity with the average intensity along the axis,
subject to a threshold constraint.
4. Determine the sparse diagonal matrix for the intensities and convert it to
a networkx graph object.
5. Remove nodes without edges to obtain the training data.
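The pixel-to-graph steps above can be sketched without networkx using a plain adjacency dict. The 4-neighbour connectivity rule and the intensity threshold below are illustrative assumptions rather than the paper's exact construction.

```python
# Minimal sketch of the image-to-graph pre-processing: each pixel becomes a
# node, neighbouring pixels are linked when their mean intensity passes a
# threshold, and isolated nodes never enter the adjacency dict (step 5).
# The 4-neighbour rule and the threshold value are assumptions.

def image_to_graph(pixels, threshold=150):
    h, w = len(pixels), len(pixels[0])
    adj = {}
    for r in range(h):
        for c in range(w):
            for dr, dc in ((0, 1), (1, 0)):      # right and down neighbours
                nr, nc = r + dr, c + dc
                if nr < h and nc < w:
                    # keep the edge only if the mean intensity is high enough
                    if (pixels[r][c] + pixels[nr][nc]) / 2 >= threshold:
                        adj.setdefault((r, c), set()).add((nr, nc))
                        adj.setdefault((nr, nc), set()).add((r, c))
    return adj

g = image_to_graph([[200, 200, 10],
                    [200,  10, 10]])
```

In this toy 2 × 3 image, only the three bright pixels end up connected; the dark pixels produce no edges and are therefore dropped, mirroring step 5.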

Figure 1 gives a pictorial representation of how an image is converted.

3.2 Algorithm

The architecture of this algorithm consists of a set of single-layer GCNs,
which are used to determine the graph distribution of each emotion. This approach

Table 1. Observation from the analysis

No Properties Angry Happy Sad Surprise


1 Nodes 784 784 784 784
2 Edges 250706 267144 242263 300470
3 Average degree 639.556 681.49 618.018 766.505
4 Average weighted degree 555.401 596.786 534.257 677.929
5 Network diameter 2 2 3 2
6 Graph density 0.817 0.87 0.789 0.979
7 Modularity 0.074 0.053 0.091 0.014
8 Connected components 1 1 1 1
9 Average clustering coefficients 0.898 0.924 0.9 0.982
10 Average path length 1.183 1.13 1.211 1.021
11 Number of clusters formed 2 2 2 3

considers only four emotions: angry, happy, sad, and surprise. By splitting
the data based on the labels and training separately, the final graph
distribution can be obtained for each emotion. These four models are evaluated
separately to analyze their performance. A separate two-layer GCN model is
trained on the whole dataset of all emotions, and can be compared with the
ground-truth distribution to determine how well the dataset performs with the
GCN model. The ground-truth distribution is simply the graph trained on the
consecutive dense layers [21]. The algorithms for the single-layer and
two-layer GCN are explained below:
1. For an image I_i, its feature f_i is determined by f_i = F(I_i).
2. The emotion correlation matrix g is obtained from the sparse diagonal
matrix of the graph and fed to the GCN.
3. The operation performed by each GCN layer is

   f(H^(l), A) = σ(A H^(l) W^(l))    (1)

4. From this, the final weight matrix W = G(W_p, g) is determined.
5. Compute the loss and proceed to back-propagation.
6. The optimized parameters are updated in the GCN model.
7. This process is repeated for a fixed number of epochs.
8. After training is completed, the emotion distribution graph is generated
for each emotion in the emotion wheel.
9. The final form of the multilayer GCN is
Z = softmax(Ã σ(Ã X W^(0)) W^(1)), where Ã is the normalized adjacency matrix
D̂^(−1/2) Â D̂^(−1/2).

3.3 Analysis of Algorithm


From [10], let the graph generated from the pixels of the image dataset be
G = (V, E), where V is the set of vertices (nodes) and E the set of edges. This

Fig. 1. The pre-processing stages of EmotionGCN. The dataset used for training
consists only of monochrome images, but even if the images are in color, an
exception-handling algorithm converts the three-channel pictures (red, green,
and blue) to single-channel (black-and-white) images. Each pixel is then
replaced by a node, and the arrangement of the nodes along with the edges
forms a three-dimensional graph representation of the image. The distance
between objects in the image is represented by a real distance between the
graphs. Edges and nodes with little correlation to the face (surroundings) are
removed using a threshold [22].

graph structure is stored as a sparse diagonal matrix. This produces a
node-level matrix in which each row represents the output feature f_i. The
operation performed by every GCN layer is:

f(H^(l), A) = σ(A H^(l) W^(l))    (2)

where:
– H^(l) is the input node feature matrix,
– W^(l) is the transformation matrix with learnable weights,
– σ is the nonlinear activation function.
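A minimal numpy sketch of the layer operation in Eq. (2). The 3-node path graph, one-hot features, and random weight matrix are illustrative placeholders, and ReLU is assumed as the nonlinearity σ.

```python
import numpy as np

# One GCN layer: aggregate neighbour features (A @ H), transform them with
# the learnable weights W, then apply the nonlinearity (ReLU here).

def gcn_layer(H, A, W):
    return np.maximum(A @ H @ W, 0.0)

rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])   # adjacency of a 3-node path graph
H = np.eye(3)                  # one-hot input node features
W = rng.normal(size=(3, 2))    # maps 3 input features to 2 output features
H_next = gcn_layer(H, A, W)
```

As noted in the text, the output feature size is set by the width of W (here 2), while aggregation alone leaves the feature size unchanged.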
In practice, the performance of a single-layer GCN is not up to the mark, so
multiple GCN layers can be stacked to form a multilayer GCN, where the output
features of the i-th GCN layer are normalized and fed to the (i + 1)-th layer.
The output feature vectors of the final GCN layer are passed through a
nonlinear activation function such as softmax or ReLU to perform the specified
task [24]. The output of the two-layer GCN has the form:

Z = softmax(Ã σ(Ã X W^(0)) W^(1))    (3)

where Ã is the normalized adjacency matrix D̂^(−1/2) Â D̂^(−1/2).
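The two-layer form in Eq. (3) can be sketched in numpy as follows. Adding self-loops before normalizing is standard GCN practice but an assumption here, and the graph, input features, and random weights are placeholders.

```python
import numpy as np

# Sketch of Eq. (3): Z = softmax(A~ relu(A~ X W0) W1), with
# A~ = D^(-1/2) (A + I) D^(-1/2). Self-loops are added as is standard
# for GCNs (an assumption; the paper does not state this explicitly).

def normalize_adj(A):
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # diagonal of D^(-1/2)
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def two_layer_gcn(X, A, W0, W1):
    A_norm = normalize_adj(A)
    H = np.maximum(A_norm @ X @ W0, 0.0)           # first layer + ReLU
    logits = A_norm @ H @ W1
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)        # row-wise softmax

rng = np.random.default_rng(1)
A = np.array([[0., 1.], [1., 0.]])                 # two connected nodes
Z = two_layer_gcn(np.eye(2), A,
                  rng.normal(size=(2, 4)), rng.normal(size=(4, 3)))
```

Each row of Z is a probability distribution over the output classes, which is what the softmax in Eq. (3) guarantees.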

After the graph convolution operation, the next step is to construct the
emotion correlation graph, which can be built by following one of

Fig. 2. The complete workflow of the EmotionGCN algorithm. The preprocessed
sets of images, converted to graph structures, are used as training data for
the single-layer graph convolutional network. The convolution operation leads
to a single graph with a certain number of edges. These graphs are the
distributions, which are saved and analyzed using NetworkX and Gephi.

the psychological models. Each psychological model represents a different set
of emotions; here the emotions angry, happy, sad, and surprise are considered,
based on Mikel's wheel. The emotion correlation matrix is determined by
finding the distance matrix for each emotion separately using the cosine
distance metric. The self-connections are then removed to form an undirected
graph. This graph is called the emotion correlation graph for that emotion,
and the process is

Fig. 3. The evaluation metrics used in the models: (a) the confusion matrix of
the two-layer GCN model trained on all four emotions (angry, happy, sad,
surprise); (b) the confusion matrix of the fully connected neural network
model trained on all four emotions; (c) the confusion matrix of the
multi-layer convolutional neural network model trained on all four emotions.

repeated for all other emotions in the set [26,27]. This method is similar to
finding the probability of the i-th emotion given the j-th emotion, p(i | j),
which is defined as

p(i | j) = (1 / (√(2π) σ)) exp(−d(i, j)² / (2σ²))    (4)

where d(i, j) is the pairwise cosine distance between emotions i and j.
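Eq. (4) can be sketched directly: convert pairwise cosine distances between emotion feature vectors into conditional probabilities with a Gaussian kernel, then zero the diagonal to remove self-connections. The four feature vectors and the value of σ below are made-up placeholders.

```python
import numpy as np

# Sketch of Eq. (4): p(i|j) = exp(-d(i,j)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma),
# with d the cosine distance. Self-connections are removed as in the text.

def cosine_dist(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def emotion_correlation(features, sigma=1.0):
    n = len(features)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d = cosine_dist(features[i], features[j])
            P[i, j] = np.exp(-d**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    np.fill_diagonal(P, 0.0)   # drop self-connections
    return P

feats = np.array([[1., 0.], [0.8, 0.6], [0., 1.], [0.5, 0.5]])
P = emotion_correlation(feats)
```

Because cosine distance is symmetric, the resulting matrix is symmetric, which is consistent with the undirected emotion correlation graph described above.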


When the input matrix E and correlation graph g is fed to the two layer GCN
network, then the two layer GCN can be expressed as,
 
W(1) = ReLU  gEU(0) (5)
 
W(2) = Softmax gW(1) U(1) (6)

where,

Fig. 4. Degree distributions of the four emotions: angry, happy, sad, and surprise.

– g̃ is the normalized version of g,
– U^(0), U^(1) are the learnable weights in the GCN,
– W^(2) is the final output weight matrix of the distribution prediction.
After every GCN layer operation, the weight matrix should be normalized before
it is applied to the next GCN layer. In this approach, l2 normalization along
the rows of the weight matrix W is performed. The l2 normalization of W is

w_ij = w_ij / √(Σ_{j=1}^{d} w_ij²)    (7)
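The row-wise l2 normalization of Eq. (7) is a one-liner in numpy; the toy weight matrix below is a placeholder, and the guard against all-zero rows is an added safety assumption.

```python
import numpy as np

# Row-wise l2 normalization as in Eq. (7): each entry is divided by the
# l2 norm of its row, so every row ends up with unit norm.

def l2_normalize_rows(W):
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.where(norms == 0, 1.0, norms)   # guard all-zero rows

W = np.array([[3.0, 4.0],
              [0.0, 2.0]])
W_norm = l2_normalize_rows(W)
```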

4 Experimentation
4.1 Dataset
The experiments are conducted on the FER-2013 dataset, which consists of
35,685 examples of 48 × 48-pixel grayscale face images. The dataset contains
seven emotions; for this research, only four (angry, happy, sad, and surprise)
are considered, due to a shortage of resources. Of these data, 80% are used
for training and 20% for testing [28]. All images in the dataset are
monochromatic, but an explicit color-to-grayscale conversion is included in
the pre-processing algorithm to avoid errors.

4.2 Evaluation Metrics


The emotion distributions are evaluated with commonly used metrics such as the confusion matrix and accuracy. Further analysis and benchmarking are performed by computing graph parameters with standard graph algorithms: average degree, average weighted degree, network diameter, graph density, modularity, connected components, average clustering coefficient, average path length, and the numbers of edges, nodes and clusters formed.
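A few of these graph statistics can be computed directly from a binary adjacency matrix; the sketch below is a small NumPy stand-in for the corresponding Gephi/networkX built-ins, not the paper's actual tooling.

```python
import numpy as np

def graph_metrics(a):
    """Average degree, density, average clustering coefficient and edge count
    for an undirected graph given as a binary adjacency matrix."""
    a = np.asarray(a, dtype=float)
    n = a.shape[0]
    deg = a.sum(axis=1)
    edges = a.sum() / 2.0
    density = 2.0 * edges / (n * (n - 1))
    # Local clustering: triangles through node i over possible wedges at i.
    # diag(A^3) counts closed 3-walks, i.e. twice the triangles per node.
    tri = np.diagonal(a @ a @ a) / 2.0
    wedges = deg * (deg - 1) / 2.0
    local = np.divide(tri, wedges, out=np.zeros(n), where=wedges > 0)
    return {"avg_degree": deg.mean(), "density": density,
            "avg_clustering": local.mean(), "n_edges": edges}

# Toy graph: a triangle 0-1-2 plus a pendant node 3 attached to node 2.
a = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (0, 2), (2, 3)]:
    a[i, j] = a[j, i] = 1
m = graph_metrics(a)
```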

4.3 Implementation Details


The first step of the implementation is to pre-process the image data into graph-based data. In pre-processing, each image in the four classes (happy, sad, angry, surprise) is first converted to grayscale, since color has no effect on emotion recognition. The pixels of the monochrome images are then converted to nodes connected by edges. Self-loops and nodes without any edges are removed to reduce the complexity of the input. Four separate single-layer GCN models are then trained on the graph networks of the four emotions. After training, a single graph distribution representing each emotion label in the dataset is obtained. Self-loop removal and threshold filtering of the nodes then yield distributions with an optimal number of nodes and edges.
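One plausible reading of this pixel-to-graph step is sketched below; the intensity threshold and the 4-neighbourhood connectivity are assumptions, since the text does not pin them down.

```python
import numpy as np

def image_to_graph(img, threshold=0.5):
    """Pixels above an intensity threshold become nodes, 4-neighbouring kept
    pixels become edges, and self-loops / isolated nodes are dropped.
    Threshold value and neighbourhood choice are illustrative assumptions."""
    h, w = img.shape
    keep = img >= threshold
    edges = set()
    for y in range(h):
        for x in range(w):
            if not keep[y, x]:
                continue
            for dy, dx in ((0, 1), (1, 0)):            # right & down neighbours
                ny, nx = y + dy, x + dx
                if ny < h and nx < w and keep[ny, nx]:
                    u, v = y * w + x, ny * w + nx
                    if u != v:                         # no self-loops
                        edges.add((u, v))
    nodes = {u for e in edges for u in e}              # isolated nodes removed
    return nodes, edges

img = np.array([[0.9, 0.9, 0.1],
                [0.1, 0.9, 0.1],
                [0.1, 0.1, 0.8]])
nodes, edges = image_to_graph(img)
```

In this toy image the bright pixel at the bottom-right corner has no kept neighbours, so its node is discarded as isolated.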
For small networks, networkX is sufficient for visualization, but larger graphs are impossible to visualize this way. Gephi is open-source software with a built-in 3D render engine for visualizing and analyzing large network graphs. The render engine can display graphs and their data structures in real time, and it is used to explore, manipulate, cluster, analyze, spatialize, filter, and export any kind of graph. It includes built-in algorithms such as degree centrality, HITS, and clustering. Before applying any built-in graph processing algorithm, we need to see how the nodes and edges of the graph are arranged. For this, the degree distribution of each emotion distribution is shown in Fig. 4. Figure 2 gives the top-to-bottom pipeline from the post-preprocessing step to the analysis.

4.4 Experimental Results


A two-layer GCN is trained on all four emotions, and the ground-truth distribution is determined by training a dense layer (fully connected network) on the same four emotion types. The confusion matrices of both models are given in Fig. 3. Compared to the ground-truth distribution, the two-layer GCN performs well in classifying the emotions based on the confusion matrix. To compare with the state-of-the-art CNN, the same data is trained on an optimized CNN model with two convolution layers. The confusion matrix of this CNN model is also given in Fig. 3. Comparing these confusion matrices, the CNN model performs better in classification than the GCN model, as the CNN
model architecture is designed specifically to solve image-processing problems, and it also has an automatic feature selection capability which helps boost its performance compared to the GCN. However, a distribution cannot be obtained using the CNN model. The GCN model can also be improved by fine-tuning and adding more layers, but the choice of model depends on the application. For a deployment-oriented end-to-end application, the CNN can be used to perform efficient classification. For psychological research that analyzes patterns in the emotion distributions of different sets of emotions, or to visualize the distribution on which the model is trained, the GCN is the better option. For the purpose of analysis, eleven different properties of each emotion graph are identified and recorded in Table 1.

5 Conclusion and Future Work


5.1 Conclusion
In this paper, the correlation between emotions is used to propose the EmotionGCN model based on a graph convolutional neural network. From the results obtained on the pre-processed graphs of the FER-2013 dataset, the four emotions can be easily classified from the degree distributions and from the number and formation of clusters under a minimum node threshold. From this cluster pattern and the degree distribution, three-dimensional graphical properties of the images can be found efficiently. In future work, instead of dealing with only four emotions, the number of emotion classes can be increased following psychological models such as Mikel's wheel, and the number of graph convolutional layers can be increased to better capture the correlations between emotions.

References
1. Borth, D., Ji, R., Chen, T., Breuel, T., Chang, S.F.: Large-scale visual sentiment
ontology and detectors using adjective noun pairs. In: Proceedings of the 21st ACM
International Conference on Multimedia, pp. 223–232 (2013)
2. Chiang, W.L., Liu, X., Si, S., Li, Y., Bengio, S., Hsieh, C.J.: Cluster-GCN: an
efficient algorithm for training deep and large graph convolutional networks. In:
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining, pp. 257–266 (2019)
3. Miranda-Correa, J.A., Abadi, M.K., Sebe, N., Patras, I.: Amigos: a dataset for
affect, personality and mood research on individuals and groups. IEEE Trans.
Affect. Comput. 12(2), 479–493 (2018)
4. Farnadi, G., et al.: Computational personality recognition in social media. User Model. User-Adap. Inter. 109–142 (2016). https://doi.org/10.1007/s11257-016-9171-0
5. Gao, H., Zhengyang, W., Shuiwang, J.: Large-scale learnable graph convolutional
networks. In: Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining, pp. 1416–1424 (2018)

6. Gautam, K.S., Senthil Kumar, T.: Video analytics-based facial emotion recognition
system for smart buildings. Int. J. Comput. Appl. 43(9), 858–867 (2021)
7. Giannopoulos, P., Perikos, I., Hatzilygeroudis, I.: Deep learning approaches for
facial emotion recognition: a case study on FER-2013. In: Hatzilygeroudis, I.,
Palade, V. (eds.) Advances in Hybridization of Intelligent Methods. SIST, vol.
85, pp. 1–16. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-66790-4_1
8. Grattarola, D., Alippi, C.: Graph neural networks in tensorflow and keras with
spektral. arXiv preprint arXiv:2006.12138 (2020)
9. Jonathon, S.H., Paul, H.L.: Automatically annotating the mir flickr dataset: exper-
imental protocols, openly available data and semantic spaces. In: Proceedings of
the International Conference on Multimedia Information Retrieval, pp. 547–556
(2010)
10. He, T., Xiaoming, J.: Image emotion distribution learning with graph convolutional
networks. In: Proceedings of the 2019 on International Conference on Multimedia
Retrieval, pp. 382–390 (2019)
11. Keshari, T., Palaniswamy, S.: Emotion recognition using feature-level fusion of
facial expressions and body gestures. In: 2019 International Conference on Com-
munication and Electronics Systems (ICCES), pp. 1184–1189. IEEE (2019)
12. Kumar, M.P., Rajagopal, M.K.: Facial emotion recognition system using entire
feature vectors and supervised classifier. In: Deep Learning Applications and Intel-
ligent Decision Making in Engineering, pp. 76–113. IGI Global (2021)
13. Li, G., Zhang, M., Li, J., Lv, F., Tong, G.: Efficient densely connected convolutional
neural networks. Pattern Recognit. 109, 107610 (2021)
14. Majumder, N., Poria, S., Gelbukh, A., Cambria, E.: Deep learning-based document
modeling for personality detection from text. IEEE Intell. Syst. 32(2), 74–79 (2017)
15. Melekhov, I., Juho, K., Esa, R.: Siamese network features for image matching. In:
2016 23rd International Conference on Pattern Recognition (ICPR), pp. 378–383.
IEEE (2016)
16. Pennington, J., Richard, S., Christopher, D.M.: Glove: global vectors for word
representation. In: Proceedings of the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
17. Pinson, M.H., Choi, L.K., Bovik, A.C.: Temporal video quality model accounting
for variable frame delay distortions. IEEE Trans. Broadcast. 60(4), 637–649 (2014)
18. Prathibhamol, C.P., Ashok, A.: Solving multi label problems with clustering and
nearest neighbor by consideration of labels. In: Advances in Signal Processing
and Intelligent Recognition Systems. AISC, vol. 425, pp. 511–520. Springer, Cham
(2016). https://doi.org/10.1007/978-3-319-28658-7_43
19. Raj, K.S., Kumar, P.: Automated human emotion recognition and analysis using
machine learning. In: 2021 12th International Conference on Computing Commu-
nication and Networking Technologies (ICCCNT), pp. 1–9. IEEE (2021)
20. Sachin Saj, T.K., Babu, S., Reddy, V.K., Gopika, P., Sowmya, V., Soman, K.P.:
Facial emotion recognition using shallow CNN. In: Thampi, S., Trajkovic, L., Li,
KC., Das, S., Wozniak, M., Berretti, S. (eds.) Machine Learning and Metaheuristics
Algorithms, and Applications. SoMMA 2019. Communications in Computer and
Information Science, vol. 1203, pp. 144–150. Springer, Singapore (2019). https://doi.org/10.1007/978-981-15-4301-2_12
21. Subramanian, R., Julia, W., Abadi, M.K., Vieriu, R.L., Winkler, S., Sebe, N.:
Ascertain: emotion and personality recognition using commercial sensors. IEEE
Trans. Affect. Comput. 9(2), 147–160 (2016)

22. Thushara, S., Veni, S.: A multimodal emotion recognition system from video.
In: 2016 International Conference on Circuit, Power and Computing Technologies
(ICCPCT), pp. 1–5. IEEE (2016)
23. Sai Prathusha, S., Suja, P., Tripathi, S., Louis, R.: Emotion recognition from facial
expressions of 4D videos using curves and surface normals. In: Basu, A., Das,
S., Horain, P., Bhattacharya, S. (eds.) IHCI 2016. LNCS, vol. 10127, pp. 51–64.
Springer, Cham (2017). https://doi.org/10.1007/978-3-319-52503-7_5
24. Wang, M., et al.: Deep graph library: a graph-centric, highly-performant package
for graph neural networks. arXiv preprint arXiv:1909.01315 (2019)
25. Wang, X., Yufei, Y., Abhinav, G.: Zero-shot recognition via semantic embeddings
and knowledge graphs. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 6857–6866 (2018)
26. Wang, Y., Yanzhao, X., Yu, L., Lisheng, F.: G-cam: graph convolution network
based class activation mapping for multi-label image recognition. In: Proceedings
of the 2021 International Conference on Multimedia Retrieval, pp. 322–330 (2021)
27. Wang, Y., Yanzhao, X., Yu, L., Ke, Z., Xiaocui, L.: Fast graph convolution network
based multi-label image recognition via cross-modal fusion. In: Proceedings of the
29th ACM International Conference on Information & Knowledge Management,
pp. 1575–1584 (2020)
28. Yang, J., Dongyu, S., Ming, S.: Joint image emotion classification and distribution
learning via deep convolutional neural network. In: IJCAI, pp. 3266–3272 (2017)
On the Role of Depth Predictions for 3D
Human Pose Estimation

Alec Diaz-Arias1(B) , Dmitriy Shin1 , Mitchell Messmore1 , and Stephen Baek1,2


1 Inseer Inc., Charlottesville, Virginia, USA
[email protected]
2 University of Virginia, Charlottesville, Virginia, USA

Abstract. Following the successful application of deep convolutional neural networks to 2D human pose estimation, the next logical problem to solve is static 3D human pose estimation from monocular images. While previous solutions have shown some success, they do not fully utilize the depth information from the 2D inputs. With the goal of addressing this depth ambiguity, we build a system that takes 2D joint locations as input along with their estimated depth values and predicts their 3D positions in camera coordinates. Our system outperforms comparable frame-by-frame 3D human pose estimation networks on the largest publicly available 3D motion data set, Human 3.6M. To provide further evidence for the usefulness of predicted depth values in the 3D pose estimation problem, we perform an extensive statistical analysis showing that even with potentially noisy depth predictions there is still a statistically significant correlation between the predicted depth value and the true depth value.

Keywords: Convolutional neural network · Pose estimation · 3D · Depth · Monocular images · Machine learning

1 Introduction
3D Human Pose Estimation (HPE) is the process of producing 3D body landmarks from sensor input that match the spatial position and configuration of the individuals of interest. In the case of single-view HPE, the sensor input is one image or camera view containing human subjects, and the goal of HPE is to predict the 3D coordinates of the joints of the subjects' skeletons. There are many approaches to single-view human pose estimation; among them we highlight two important dichotomies: top-down vs. bottom-up and frame-by-frame vs. sequence-to-sequence.
Top-down [4,25,27,32,35,39,40,48] approaches first create a bounding box for each subject in an image and then apply a pose estimation module to the bounded image. Bottom-up [12,18,26,29,49,53] approaches start by predicting 3D joint locations and then assign these key-points to individual actors using clustering algorithms. Top-down approaches are more common for 3D HPE, as most publicly available data sets contain only one person per frame.
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 230–247, 2023.
https://doi.org/10.1007/978-3-031-18461-1_15
Depth in 3D Pose Estimation 231

Frame-by-frame (also referred to as static) [4,12,18,25–27,29,32,35,39–41,43,48,49,53] HPE networks predict 3D joint locations from only one input frame, and thus can be applied in a broad context. However, when the input frame comes from a video sequence of human motion, the information of neighboring frames is not taken into account. Sequence-to-sequence networks [10,17,22,33,54] take as input a sequence of frames from video and output a single frame or a whole sequence of 3D pose estimations. By using neighboring frame data, sequence-to-sequence networks often produce smoother outputs and encode important temporal information, but they are in general more complex and less applicable to single-context images.
Depth plays an important part in the 3D reconstruction process and should implicitly be learned by a network. However, depth information has only recently [24,34,42] been explicitly used as input. This successful inclusion of depth information has led us to examine the effectiveness of predicted depth in 3D pose estimation. While we hypothesize that depth predictions play an important role in any 3D HPE network, for simplicity we focus in this work on single-view, top-down, frame-by-frame networks. However, in order to compare our results to newer networks and a larger number of previous works overall, we compare against both frame-by-frame and sequence-to-sequence networks. To support the use of depth values, we conduct a hypothesis-driven study of depth in the 3D HPE problem, as outlined below. The value of this study is that it is the first of its kind to explicitly focus on providing evidence of the importance of depth information in 3D human pose estimation.

Hypothesis 1. Depth values are useful for two-dimensional to three-dimensional


reconstruction tasks. Intuitively, any information that is “orthogonal” to the loca-
tions in pixel coordinates can naturally improve the accuracy of a reconstruction
network. We hypothesize that including depth estimates for all joints will result
in increased accuracy. To test this theory, we feed a modified network, based on
the architecture in [25], joint locations in pixel coordinates along with estimated
depth values. This network achieves state-of-the-art results on the largest publicly
available 3D pose estimation benchmark, Human 3.6M (H36M [16]).

Hypothesis 2. Depth estimations that are more strongly correlated with the
z-coordinates of joint locations lead to a lower average mm error. A higher cor-
relation between depth values and joint z-coordinates indicates a lower degree
polynomial can be used to model this relationship. In this case, a lower capacity
network can be used to model this dependence, which is displayed in our investi-
gations of hypothesis one. When compared to the network developed in [27], our
network which utilizes depth values is less complex, yet produces more accurate
results.
We test and validate this hypothesis by calculating correlation and statisti-
cal significance levels for individual joints sampled by camera and action. Sub-
sequently we compare the same joints average mm error for high correlation
sub-samples versus low correlation sub-samples. This statistical analysis lends
232 A. Diaz-Arias et al.

credence to our hypothesis. We discuss shortcomings of the data, such as occlusions, and suggest directions to further explore this theory. Our experiments provide evidence for the idea that the community should utilize correlation as a metric by which to gauge the efficacy of depth estimators, so as not to lose sight of the desired utility of depth maps.
The results of our investigation emphasize the role of depth in 3D reconstruc-
tion. Our contributions can be summarized as follows:

1. We introduce orthogonal data input (depth values) to supplement the 2D


data inputs into our 3D human pose estimation model.
2. We conduct an extensive correlation analysis that motivates and justifies the use of depth information extracted from an RGB image.
3. We show that our network significantly outperforms previous top-down net-
works on the H36M ground truth detections.

1.1 Previous Work

The introduction of deep convolutional neural networks has led to a steady


increase in performance of depth map estimation, but significant problems still
exist both in quality and resolution of the depth maps. We do not seek to improve
upon the state of the art in the area of depth estimation, but we aim to demon-
strate that substantial improvement in this area would lead to large impact in
the area of 3D human pose estimation.
3D pose estimation is divided into two major camps: single-stage and two-stage approaches. Single-stage methods attempt to reconstruct the three-dimensional skeleton directly from the input image, while two-stage methods utilize the high accuracy of 2D pose estimators to first locate the skeleton in pixel coordinates and then learn the camera intrinsic parameters, as well as depth, indirectly. We note that the works of [20,27,32] are based on the single-stage approach, while [8,9,25,31,46] are based on the two-stage approach.
In [20], the authors propose a multitask framework that simultaneously trains the reconstruction and the pixel-coordinate detection networks. The work of [41] introduced an over-complete autoencoder to embed the skeletal representation in a high-dimensional space, which influenced our own choice of a high-dimensional embedding. Similarly, the work of [39] introduced what the authors refer to as a compositional loss, encouraging the network to pay attention to the joint connection structure. The works [25,31] both estimate the 2D pose
the joint connection structure. The works [25,31] both estimate the 2D pose
from the image and then directly regress the 3D pose. Pertinent to our design
choices, the authors of [25] proposed a simple architecture that was state-of-the-
art. Finally, we mention that [27] introduces a novel approach to multi-person
pose estimation by performing reconstruction on each subject individually in the
root centered camera reference frame while simultaneously learning the absolute
root positions to perform translation.
There is a distinct approach to all of the above. In the papers above, as well as in our work, the networks directly estimate the joint locations in Cartesian coordinates. It is well known that points in three-space admit many representations, e.g. Euler angles, polar coordinates, etc. In [3,5,30,56] the authors choose to estimate the angular representation of the skeleton directly from RGB images. This has a possible advantage: not all joints have three degrees of freedom in the angular representation, so it has lower dimensionality. Furthermore, these approaches are less susceptible to varying limb lengths. We have not experimented with this representation, as our depth estimates that increase the accuracy of the network are related to the z-coordinate in the camera reference frame.
Lastly, there has been recent work [24,34,42] in this direction utilizing
depth maps to aid in three-dimensional human pose estimation, although their
approaches have differed in many key ways. The work in [34] utilizes only a sin-
gle view depth map as an input and instead of directly regressing the 3D joint
locations, applies an iterative update to some mean pose. In [42], a notion of
weak depth supervision is introduced, i.e. a model that can accept either RGB
or RGB-D images as inputs and achieved state-of-the-art on MuPoTs-3D data
set by using a robust occlusion loss. In [24], the authors introduce a deep neural network called Deep Depth Pose that directly regresses the camera-centered three-dimensional pose. In our view, there are two main distinctions between our approach and our predecessors. First, none of these approaches utilize predicted depth maps and thus do not include an in-depth statistical analysis. In our view this reduces the problem to 2D human pose estimation alongside learning the camera intrinsics, rather than focusing on the interpolation between predicted depth and the true depth. Second, the 3D pose estimation networks themselves in these examples are not lightweight, in the sense that they use RGB-D or 2.5D images as their input, thus increasing the dimensionality.

2 Methodology
Our goal is to estimate three-dimensional body joint locations given a three-dimensional input whose third dimension is "orthogonal" to the pixel-coordinate joint locations. We will argue later that this increased dimensionality is critical to our network's success. Additionally, we show this extra dimension is correlated with the z-coordinate, thus justifying its utility. Formally, let x ∈ R^{3J} denote our input and y ∈ R^{3J} the output, where J is the number of joints to be estimated. We seek to learn a function f : R^{3J} → R^{3J} that minimizes the joint reconstruction error:
\min_f \frac{1}{J} \sum_{i=1}^{J} \| f(x_i) - y_i \|_2^2 .

In practice, x_i may be obtained using an off-the-shelf 2D pose detector [7,44] and a depth map estimation algorithm on monocular images. For simplicity, in this work we obtain x_i from the ground truth labelings given in the H36M data set. More precisely, let (u, v) denote the pixel-coordinate location of joint j_i and let D denote the depth map; then x_i = (u, v, D(u, v)). We estimate the 3D joint locations in the camera reference frame. We aim to approximate the reconstruction function f using a neural network as a function estimator.
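Assembling the input x_i = (u, v, D(u, v)) amounts to sampling the depth map at each 2D joint location. A minimal sketch follows; nearest-pixel sampling and the toy depth map are assumptions for illustration.

```python
import numpy as np

def build_inputs(joints_2d, depth_map):
    """Assemble x_i = (u, v, D(u, v)) for each joint: the 2D pixel location
    plus the depth-map value sampled at the nearest pixel."""
    h, w = depth_map.shape
    out = []
    for u, v in joints_2d:
        r = min(max(int(round(v)), 0), h - 1)   # row index from v
        c = min(max(int(round(u)), 0), w - 1)   # column index from u
        out.append((u, v, depth_map[r, c]))
    return np.asarray(out, dtype=float)

depth = np.arange(12, dtype=float).reshape(3, 4)   # toy 3x4 depth map
x = build_inputs([(1.0, 2.0), (3.2, 0.4)], depth)
```

The flattened x then has dimension 3J, matching the network input above.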
Figure 1 provides a diagram with the basic building blocks of our network, which is modeled after [25]. The network uses low-dimensional input (i.e. the Cartesian coordinates of the skeleton) and is based on a deep multi-layer neural network with batch normalization (BN) and dropout (Drop) to reduce overfitting. Our network has approximately 7 million parameters, but due to the low-dimensional input it is easily integrated into a real-time system.

Fig. 1. The proposed 3D pose estimation network architecture, where BN denotes batch normalization and act denotes a non-linear activation function.
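A single residual block of this kind can be sketched as follows. Batch normalization is folded away (an inference-style simplification), and the hidden width of 1024 is an assumption borrowed from [25], not a value stated here.

```python
import numpy as np

def dense_block(x, w1, w2, drop_keep=1.0, rng=None):
    """One residual block in the spirit of Fig. 1 / Martinez et al. [25]:
    two Linear -> (BN) -> ReLU -> Dropout stages plus a skip connection.
    Dropout is active only when drop_keep < 1 and an rng is supplied."""
    h = np.maximum(x @ w1, 0.0)                        # linear + ReLU
    if rng is not None and drop_keep < 1.0:
        h *= (rng.random(h.shape) < drop_keep) / drop_keep   # inverted dropout
    h = np.maximum(h @ w2, 0.0)                        # second linear + ReLU
    return x + h                                       # residual connection

rng = np.random.default_rng(0)
d = 1024                    # hidden width assumed from [25]
x = rng.standard_normal((2, d))
w1 = rng.standard_normal((d, d)) * 0.01
w2 = rng.standard_normal((d, d)) * 0.01
y = dense_block(x, w1, w2)
```

Several such blocks stacked between an input and an output linear layer give a network of roughly this parameter count.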

2.1 Pixel, Camera and World Coordinates

Recall that given a 3D point P_{u,v,w} in homogeneous coordinates in the world reference frame, the camera extrinsic parameters R ∈ SO(3) and t ∈ R^3, and a projection P ∈ M_3(R), one can move to the camera reference frame by P_{x,y,z} = R P_{u,v,w} + t. Furthermore, moving to pixel coordinates amounts to applying the following perspective projection transformation: P_{r,s} = P_{per}(P_{x,y,z}), where

P_{per} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix},

with f_x = s_x f and f_y = s_y f, where f is the focal length, s_x and s_y denote the effective pixel sizes, and c_x and c_y are the coordinates of the principal point.
Now 3D HPE seeks to reverse this process: starting with a 2D point (u, v) in pixel coordinates, it must produce a 3D point in the pre-image that best approximates the joint's true location. Since the pre-image P_{per}^{-1}((u, v)) is uncountable for any pixel coordinate, our problem is extremely ill-posed; there is no analytical solution. We do note that converting back from pixel coordinates to the camera reference frame can be achieved using the following system of equations, derived from the equations above:


x = \frac{z}{f_x} (r - c_x)

y = \frac{z}{f_y} (s - c_y)

z = z_{depth}.
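The back-projection, together with the forward perspective projection, can be checked with a round trip. The intrinsic values below are purely illustrative, not the H36M calibration.

```python
import numpy as np

def pixel_to_camera(r, s, z, fx, fy, cx, cy):
    """Invert the perspective projection given a depth value z:
    x = z (r - cx) / fx,  y = z (s - cy) / fy,  z = z_depth."""
    return np.array([z * (r - cx) / fx, z * (s - cy) / fy, z])

def camera_to_pixel(p, fx, fy, cx, cy):
    """Forward projection r = fx x/z + cx, s = fy y/z + cy, for a sanity check."""
    x, y, z = p
    return fx * x / z + cx, fy * y / z + cy

fx = fy = 1145.0            # illustrative intrinsics (assumed values)
cx, cy = 512.5, 515.5
p = pixel_to_camera(600.0, 430.0, 4.2, fx, fy, cx, cy)
r, s = camera_to_pixel(p, fx, fy, cx, cy)
```

Projecting the recovered camera-frame point lands back on the original pixel, confirming the inversion is consistent with P_per.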

We further note that there are four intrinsic parameters required for the reconstruction (assuming only one camera is used), but the z-value varies depending on the pixel-coordinate inputs. This is, in our opinion, the major hurdle to achieving high-quality reconstruction. We believe that, independent of the complexity or novelty of the network topology, such networks will under-perform any counterpart using additional input that is "orthogonal" to the pixel-level information. In the presence of the true depth value the problem is still ill-posed, but significantly more tractable. What we exploit instead is a noisy depth value whose correlation with the true depth varies across joints and across our sampling procedure (from moderately high to weak). Nonetheless, we still achieve state-of-the-art results on the H36M dataset.

2.2 Depth Estimation


We use a simple encoder-decoder architecture to estimate depth, which is common practice in depth estimation [13]. We train on the publicly available NYU V2 [38] data set, where the ground-truth depth maps are generated using the Kinect V2. A sample input image and output depth map are shown in Fig. 2. Recall that the focus of this paper is not to improve upon the current state-of-the-art in depth map estimation, but rather to generate depth maps with some correlation and make a strong case for this being a missing component of the reconstruction problem. We simply note that even with suboptimal depth values, a simple neural network with the additional orthogonal input can outperform the previous state-of-the-art in 3D human pose estimation. Therefore, 3D pose estimation can be further improved by creating better-quality depth maps, i.e. depth maps with stronger correlations with the z-coordinate of the camera coordinate system.

Fig. 2. Example of a depth map predicted on the H36M data set.



We train the depth estimation network with respect to the following losses:

L_{MSE}(y, \hat{y}) = \frac{1}{n} \sum_{p=1}^{n} \| y_p - \hat{y}_p \|_2^2

and the gradient loss defined over the depth image, i.e.,

L_{grad}(y, \hat{y}) = \frac{1}{n} \sum_{p=1}^{n} \left( \| \nabla_x (y_p, \hat{y}_p) \|_2^2 + \| \nabla_y (y_p, \hat{y}_p) \|_2^2 \right),

where y is the ground truth, \hat{y} is the predicted output, n is the number of pixels in the depth image, and \nabla_x, \nabla_y are the gradients in the x and y directions, respectively.
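A sketch of the two losses; the gradient loss is read here as penalizing finite differences of the residual y − ŷ, which is one common interpretation of the notation above rather than a detail confirmed by the paper.

```python
import numpy as np

def mse_loss(y, y_hat):
    """Per-pixel mean squared error over the depth map."""
    return np.mean((y - y_hat) ** 2)

def grad_loss(y, y_hat):
    """Gradient loss over the depth image: penalize the spatial gradients of
    the residual between ground-truth and predicted depth (assumed reading)."""
    diff = y - y_hat
    gx = np.diff(diff, axis=1)      # horizontal finite differences
    gy = np.diff(diff, axis=0)      # vertical finite differences
    return np.mean(gx ** 2) + np.mean(gy ** 2)

y = np.array([[0.0, 1.0], [2.0, 3.0]])
y_hat = np.array([[0.0, 1.5], [2.0, 3.0]])
total = mse_loss(y, y_hat) + grad_loss(y, y_hat)
```

The gradient term penalizes predictions whose depth varies differently from the ground truth, encouraging sharper, structure-preserving depth maps.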

2.3 Data Preprocessing and Training Details


We apply normal standardization to the 2D inputs and 3D outputs of the network: we subtract the mean pose and divide by the standard-deviation pose of the training data set. We further zero-center the root joint (pelvis). A similar normalization is applied to the depth values.
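The standardization and root-centering step can be sketched as follows; the root-joint index of 0 and the 17-joint skeleton are assumptions for illustration.

```python
import numpy as np

def standardize(poses, mean=None, std=None, root=0):
    """Zero-center each pose at the root joint (pelvis, index assumed 0),
    then apply normal standardization with training-set statistics. Pass the
    returned mean/std back in to transform validation or test data."""
    poses = poses - poses[:, root:root + 1, :]      # root centering
    if mean is None:
        mean, std = poses.mean(axis=0), poses.std(axis=0) + 1e-8
    return (poses - mean) / std, mean, std

rng = np.random.default_rng(0)
train = rng.standard_normal((50, 17, 3))            # 17 joints, 3D (illustrative)
norm, mu, sd = standardize(train)
```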
We train our network for only 70 epochs using the Adam optimizer with an initial learning rate of 0.01 and a batch size of 1,024. The weights are initialized using the Xavier uniform initialization scheme [14]. We implemented our code in native TensorFlow [1]; a forward and backward pass takes around 392 ms per batch, trained on an NVIDIA GeForce RTX 2080 Max-Q design. The forward pass takes around 4 ms on the same RTX 2080 for a single sample. This implies that our network, used in conjunction with a standard off-the-shelf 2D pose detector, could be implemented as a real-time pixels-to-3D-coordinates system. One epoch of training on the entire H36M dataset takes roughly 10 min. Hence our network achieves state-of-the-art results while being a lower-capacity network.

3 Results
Before moving to the numerical results, we first provide examples of our reconstructions relative to the ground truth from differing camera angles and distinct poses.

3.1 Experimental Protocol


H36M Data Set. There are many widely used evaluation protocols for the H3.6M data set. We report using Protocol 1 [9,25,27,28,39,40,47,55], the most common when reporting on ground-truth 2D joint input. Protocol 1 uses five subjects for training (S1, S5, S6, S7, S8) and two for testing (S9, S11), with the mean per joint position error (MPJPE) [21] as the evaluation metric. Within Protocol 1 there are two methods for calculating the position error: after aligning the root

Fig. 3. Blue skeletons are ground truth poses from H36M while the corresponding red
skeletons are the predicted poses produced by our network. (Color figure online)

joints using a rigid alignment (RA) and without rigid alignment (W/O RA). We
note that when no rigid alignment is used, the root joint is centered for both the
prediction and the ground truth 3D pose.
We perform rigid alignment by finding R ∈ SO(3) and t ∈ R^3 such that \sum_{i=1}^{J} \| (R p_i + t) - y_i \|_2^2 is minimized, where p_i is the predicted joint location and y_i is the ground truth.
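This minimization has the classical Kabsch/Procrustes closed-form solution via an SVD. A sketch follows, together with the MPJPE metric used under Protocol 1; the toy 17-joint pose is illustrative.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: mean Euclidean distance over joints."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def rigid_align(pred, gt):
    """Find R in SO(3), t in R^3 minimizing sum ||(R p_i + t) - y_i||^2
    (the Kabsch/Procrustes solution), then return the aligned prediction."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    h = (pred - mu_p).T @ (gt - mu_g)               # 3x3 cross-covariance
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))          # keep det(R) = +1
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return (r @ pred.T).T + (mu_g - r @ mu_p)

rng = np.random.default_rng(0)
gt = rng.standard_normal((17, 3))
theta = 0.3
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
pred = (rot @ gt.T).T + np.array([0.1, -0.2, 0.05])   # rotated + shifted copy
aligned = rigid_align(pred, gt)
```

Since the toy prediction is an exact rigid transform of the ground truth, alignment recovers it exactly; real predictions retain a residual error, which is the RA-MPJPE.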
Results on H36M. Qualitative results on H36M are shown in Fig. 3, while Table 1 provides a quantitative comparison between previous methods' errors on the H3.6M data set, subdivided by action. Amongst the frame-by-frame methods, our network outperforms in a majority of the action sequences and achieves the lowest average error by over 1 mm. We also compare our network to sequence-to-sequence methods. This approach is becoming more prevalent, as it uses temporal information, i.e. neighboring video frames, to improve 3D reconstruction results. Intuitively, the data of neighboring frames should provide extra depth information to the network, but it is not as clear through which mechanisms this is achieved. Given the advantage of temporal networks, our frame-by-frame approach compares favorably to the state-of-the-art sequence-to-sequence methods. Lastly, we note that our network architecture is largely based on [25], yet the inclusion of depth information provides a significant decrease in 3D reconstruction error, namely a reduction of over 10 mm.
3.2 Correlation Analysis of Predicted Depth Vs Z-Coordinate

Next, we move to the correlation analysis that led us to using depth values of
specific joints as additional input. While noisy depth has intuitive utility and

Table 1. Detailed results on the H3.6M data set [16] under Protocol 1. We report with rigid alignment (RA) and without rigid alignment (W/O RA). MPJPE is subdivided by action, and the average across all actions is provided. We boldface the values that are best among frame-by-frame methods, while underlined values denote the best results among all methods. We indicate sequence-to-sequence methods using an input window size greater than 1 with †.

Methods Dir Dis Eat Gre Phon Pose Pur Sit SitD Smo Phot Wait Walk WalkD WalkP Avg
RA
Zhou [52] 37.8 49.4 37.6 40.9 45.1 41.4 40.1 48.3 50.1 42.2 53.5 44.3 40.5 47.3 39.0 43.8
Ours 23.5 27.2 27.6 27.2 26.2 27.9 24.8 27.9 41.4 32.9 39.4 28.6 20.1 37.2 25.0 29.1
W/O RA
Zhou [55] 54.8 60.7 58.2 71.4 62.0 65.5 53.8 55.6 75.2 112 64.2 66.1 51.4 63.2 55.3 64.9
Moreno [28] 53.5 50.5 65.8 62.5 56.9 60.6 50.8 56.0 79.6 63.7 80.8 61.8 59.4 68.5 62.1 62.2
Chen [9] 53.3 46.8 58.6 61.2 56.0 58.1 48.9 55.6 73.4 60.3 76.1 62.2 35.8 61.9 51.1 57.5
Pavllo [33] 47.1 50.6 49.0 51.8 53.6 61.4 49.4 47.4 59.3 67.4 52.4 49.5 55.3 39.5 42.7 51.8
Martinez [25] 37.7 44.4 40.3 42.1 48.2 54.9 44.4 42.1 54.6 58.0 45.1 46.4 47.6 36.4 40.4 45.5
Zheng [54] † 49.2 49.7 38.7 42.7 40.0 40.9 50.7 42.2 47.0 46.1 43.4 46.7 39.8 36.4 38.0 43.5
Hossain [15] 35.2 40.8 37.2 37.4 43.2 44.0 38.9 35.6 42.3 44.6 39.7 39.7 40.2 32.8 35.5 39.2
Lee [19] † 32.1 36.6 34.3 37.8 44.5 49.9 40.9 36.2 44.1 45.6 35.3 35.9 30.3 37.6 35.5 38.4
Pavllo [33] † 35.2 40.2 32.7 35.7 38.2 45.5 40.6 36.1 48.8 47.3 37.8 39.7 38.7 27.8 29.5 37.8
Cai [6] † 32.9 38.7 32.9 37.0 37.3 44.8 38.7 36.1 41.0 45.6 36.8 37.7 37.7 29.5 31.6 37.2
Xu [45] 35.8 38.1 31.0 35.3 35.8 43.2 37.3 31.7 38.4 45.5 35.4 36.7 36.8 27.9 30.7 35.8
Shan [36] † 34.8 38.2 31.1 34.4 35.4 37.2 38.3 32.8 39.5 41.3 34.9 35.6 32.9 27.1 28.0 34.8
Liu [23] † 34.5 37.1 33.6 34.2 32.9 37.1 39.6 35.8 40.7 41.4 33.0 33.8 33.0 26.6 26.9 34.7
Zhan [51] † 31.2 35.7 31.4 33.6 35.0 37.5 37.2 30.9 42.5 41.3 34.6 36.5 32.0 27.7 28.9 34.4
Zeng [50] † 34.8 32.1 28.5 30.7 31.4 36.9 35.6 30.5 38.9 40.5 32.5 31.0 29.9 22.5 24.5 32.0
Ours 29.9 33.1 32.3 30.8 33.5 37.1 27.1 34.5 48.7 35.7 49.3 37.3 23.5 40.6 27.8 34.8

ground truth depth maps have been used in the literature – it was unclear to
what extent noisy depth maps could aid in the reconstruction problem.
We conducted a statistical analysis to assess the extent of possible correlation
between depth values and the z-coordinate of joints in the camera reference
frame. We chose to sub-sample our data by camera and action, as this allows us
to observe the extremes of the correlation values. This method also better indicates
the effect of joint occlusion on the reconstruction. Thus, we believe this is a
natural sampling procedure for obtaining an accurate picture of how much
correlation is present.
First, we explored descriptive statistics of the distributions to check the
normality assumption; if a distribution is non-normal, we are forced to apply
non-parametric methods. Since the accuracy of different normality tests can
depend on sample size and other characteristics of the distribution, we decided
to perform several tests for normality, namely the Shapiro-Wilk, Anderson-Darling
and D’Agostino tests [2,11,37].
ples of sizes 500, 1,000, 5,000 and 100,000.
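The tests themselves come from standard statistical packages. As an illustration of the moment-based checks (skewness and excess kurtosis) that also supported the conclusion below, here is a self-contained sketch on synthetic data; the bimodal sample is a stand-in for the multi-modal depth distributions of Fig. 5, not the actual data:

```python
import math
import random

def skewness(xs):
    """Sample skewness: third standardized central moment (0 for a normal)."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum((x - m) ** 3 for x in xs) / (n * s ** 3)

def excess_kurtosis(xs):
    """Fourth standardized central moment minus 3 (0 for a normal)."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 4 for x in xs) / (n * s2 ** 2) - 3.0

random.seed(0)
normal = [random.gauss(0, 1) for _ in range(5000)]
# Two well-separated modes: a clearly non-normal, bimodal shape.
bimodal = ([random.gauss(-2, 0.3) for _ in range(2500)]
           + [random.gauss(2, 0.3) for _ in range(2500)])

print(round(excess_kurtosis(normal), 2))   # near 0 for a normal sample
print(round(excess_kurtosis(bimodal), 2))  # strongly negative: flat, bimodal shape
```

A kurtosis far from 0 (in either direction) is exactly the kind of shape mismatch that leads the formal tests to reject normality.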
Depth in 3D Pose Estimation 239

Tables 2 and 3 present the results of the tests for a distribution of 5,000 depth
values by specific joints, in an attempt to achieve the most accurate p-values. As
shown by the Shapiro-Wilk and D’Agostino tests, the reported p-values warranted
rejection of the null hypothesis of normality, i.e., the p-values of almost all tests
were lower than the confidence level α = 0.05. There were only two cases in which
the D’Agostino test produced p-values indicating that the distribution was normal
(see the underlined values in Tables 2 and 3). All test statistics of the Anderson-
Darling tests were greater than the corresponding critical values, which indicated
a violation of the normality assumption. The consensus conclusion was that the
depth values were not normally distributed. This conclusion was also supported
by the skewness and kurtosis: all kurtosis tests confidently rejected the null
hypothesis that the shape of the distribution matched that of the normal
distribution (i.e., a peaked shape with light tails). It can also be seen from the
histograms in Fig. 5 that some distributions were multi-modal.

Table 2. Results of statistical tests for normality for depth values by joints. The upper
value for all tests is the test statistic. The lower value for the Shapiro-Wilk and
D’Agostino tests is the p-value, and for the Anderson-Darling test it is the critical value.

Joint Root RH RK RA LH LK LA Thor

Shapiro 0.99 0.97 0.99 0.98 0.99 0.98 0.98 0.98
p-value 3e–6 0.0 1e–4 0.0 3e–6 0.0 0.0 0.0
Anderson 3.09 7.58 1.99 5.69 3.31 4.93 4.46 4.78
crit. val. 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78
D’Agostino 12.64 68.5 1.94 86.7 10.65 23.9 53.8 13.4
p-value 2e–3 0.0 0.37 0.0 5e–3 6e–6 0.0 1e–3

Table 3. Results of statistical tests for normality for depth values by joints (cont.)

Joint Neck Nose Head LS LE LW RS RE RW

Shapiro 0.98 0.97 0.94 0.97 0.99 0.98 0.96 0.92 0.91
p-value 2e–6 0.0 0.0 0.0 4e–5 0.0 0.0 0.0 0.0
Anderson 4.17 10.4 24.3 9.9 2.67 4.18 12.94 29.92 29.04
crit. val. 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78
D’Agostino 9.87 44.88 64.91 31.61 4.19 63.12 39.03 88.54 89.22
p-value 0.007 0.0 0.0 0.0 0.12 0.0 0.0 0.0 0.0

Normality tests for a distribution of z-coordinate values produced similar
results, and we concluded that the distribution was not normal. As a result
of these tests, we were forced to apply non-parametric methods to assess the
extent of correlation between depth values and z-coordinates. For this, we used
Spearman’s rank correlation and Kendall’s tau rank correlation tests.
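Both statistics are rank-based and straightforward to compute. The stdlib sketch below uses illustrative depth/z pairs (not the actual data); note this tau is the simple tau-a without tie correction, so it can differ slightly from a statistics package's default tau-b on tied data:

```python
def ranks(xs):
    """1-based ranks, with average ranks assigned to ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs, O(n^2)."""
    n = len(xs)
    c = d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                c += 1
            elif s < 0:
                d += 1
    return (c - d) / (n * (n - 1) / 2)

depth = [1.0, 2.0, 3.0, 4.0, 5.0]  # predicted depth at a joint (illustrative)
zs = [1.1, 1.9, 3.2, 3.9, 5.3]     # camera-frame z-coordinate (illustrative)
print(spearman(depth, zs), kendall_tau(depth, zs))  # 1.0 1.0 for a monotone pair
```

Because both measures depend only on ranks, they make no normality assumption, which is why they are applicable here.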

Table 4. Examples of moderately high Spearman and Kendall tau rank correlation.
RK, LK, and LA denote right-knee, left-knee and left-ankle, respectively, and (C,A)
denotes camera and action.

Jt. (C,A) RK(1,14) RK(4,16) LK(3,14) LA(4,16) LK(3,12) RK(3,14)
Spear 0.5925 0.5692 0.5574 0.5495 0.5487 0.5311
p-value 0.0 0.0 0.0 0.0 0.0 0.0
Kendall 0.4086 0.3838 0.3903 0.3817 0.3955 0.3791
p-value 0.0 0.0 0.0 0.0 0.0 0.0

Table 5. Examples of weak Spearman and Kendall tau rank correlations that are
statistically insignificant. RA, LA, RW, LW, and LE denote right-ankle, left-ankle,
right-wrist, left-wrist and left-elbow, respectively, and (C,A) denotes the camera
and action.

Jt. (C,A) RA(2,5) LA(1,10) RW(3,3) LW(1,5) LE(2,6) LE(2,4)
Spear 0.0102 0.0097 0.0095 0.0089 0.0085 0.0078
p-value 0.51 0.52 0.32 0.56 0.47 0.57
Kendall 0.0080 0.0115 0.0058 0.0088 0.0081 0.0081
p-value 0.45 0.26 0.36 0.41 0.32 0.38

Table 6. Examples of high negative Spearman and Kendall tau rank correlations
that are statistically significant. TH, RH, LE, RK, MH, and RW denote top-of-head,
right-hip, left-elbow, right-knee, mid-head and right-wrist, respectively, and (C,A)
denotes the camera and action.

Jt. (C,A) TH(4,2) RH(2,10) LE(3,9) RK(1,9) MH(4,2) RW(4,2)
Spear –0.526 –0.489 –0.462 –0.459 –0.457 –0.455
p-value 0.0 0.0 0.0 0.0 0.0 0.0
Kendall –0.356 –0.339 –0.296 –0.320 –0.298 –0.297
p-value 0.0 0.0 0.0 0.0 0.0 0.0

From Tables 4, 5, and 6 we see a wide variation of correlations and significance
levels. The whole dataset is partitioned by the 15 actions, 4 cameras, and 17
joints, for which the correlation statistics and significance levels were calculated.
We demonstrate above the extremes witnessed within the model, which can be
attributed to the noise inherent in depth map estimation, which is not robust in
the presence of lighting changes and occlusion, to name a few factors. Nonetheless,
a large portion of the sub-samples have moderately high correlations that are
statistically significant at all levels. We note that around 80% of the joints, when
partitioned by action and camera, have statistically significant correlation at all
levels, while 10% are significant at α = 0.05 and the remaining 10% are either
insignificant or significant only at a higher level.

We note that it is of interest that a small portion of the correlation values,
∼10%, were negative. This is not intuitive, given that predicted depth should
increase as the z-coordinate increases. We do note that most depth estimation
algorithms are susceptible to noise under varying lighting conditions, i.e. when
lighting is non-uniform; thus, these negative correlations attest to the need for
improvement in depth map prediction. Furthermore, another plausible explanation
for the negative correlations is that joints occluded by other body parts or
objects will have a smaller predicted depth value.
The overall conclusion of our statistical correlation analysis is that the depth
values for most actions and cameras have at least weak correlation, with a large
portion having moderate correlation, defined as correlation greater than 0.3.
We hypothesize that even this wide range of correlations is a substantial
contributor to the MPJPE reduction achieved by our model relative to previous
methods. In future work, we aim to develop a probabilistic model to assess the
chance of occlusion, perhaps by analyzing skeleton angles and camera positions.
High occlusion probabilities will enable us to remove the corresponding data
points and improve the assessment of the correlation. Furthermore, we believe a
concentrated effort by the community to generate depth maps that correlate
highly with the camera-frame z-coordinate will lead to a substantial reduction
in MPJPE.

Fig. 4. Example plots of correlation vs. average millimeter error, with best-fit lines
demonstrating the desired negative trend

We close the section by commenting on Hypothesis 2. Figure 4 demonstrates
negative trend lines between the average millimeter error and the correlation, i.e.
as correlation increases, the average error decreases. This provides evidence in
support of Hypothesis 2, and we believe that further improvement of depth map
estimation, a data set with uniform lighting, and the omission of occlusions will
further support our hypothesis.

Fig. 5. Depth value distributions

4 Discussion

This work has begun to uncover the role of depth in the 3D human pose estima-
tion problem. Yet, from both an analytic and a statistical perspective, more work
is needed to fully understand the extent to which depth can improve
reconstruction. While we study the amount of correlation present between a
depth value and a joint’s z-coordinate, further investigation is required to
understand what occurs in individual frames that causes certain joints to have
lower correlation than others. We hypothesize that a large contributor to an
individual joint having vastly different correlation values is joint occlusion
occurring due to the camera location relative to the action. It is well known
that depth maps are imprecise with respect to occlusions. The presence of
additional inputs, e.g. temporal information, can further shrink the search
space of the network, leading to more robust 3D pose estimation.

We uncovered that in the event of perfect correlation, i.e. taking the depth
values to be the ground-truth z-coordinates, our network can achieve 11 mm
average joint reconstruction error on the validation set. This value acts as an
absolute minimum for the reconstruction error; thus, there is room for up to a
70% improvement for our network. We are optimistic that a better understanding
of the relationship between depth values and the z-coordinate will help close
this gap.
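As a quick sanity check of the 70% figure, taking our 34.8 mm average without rigid alignment (Table 1) as the reference against the 11 mm floor:

```python
floor_mm = 11.0    # average error when ground-truth z is supplied as depth
current_mm = 34.8  # our average MPJPE, W/O RA protocol (Table 1)

# Fraction of the current error that could still be eliminated.
headroom = (current_mm - floor_mm) / current_mm
print(f"{headroom:.0%}")  # 68%, i.e. roughly the 70% quoted above
```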

5 Conclusions
To the best of our knowledge, it is atypical to report the average millimeter
error on a per-joint basis, so it is difficult to compare our network in these terms
to our predecessors. However, given the substantial reduction we have seen, it
is relatively safe to assume that on a per-joint basis there was a substantial
lowering of the per-joint errors. We note that a joint having the highest correlation
value across the entire data set, e.g. the right-ankle (RA) with a value of 0.345,
does not necessarily imply that it will have the lowest millimeter error: RA has
an error of 48.79 mm under protocol 1, while the minimum of 9.99 mm is observed
for the right and left hips, which had a correlation of 0.276. This is easily
explained by the variances of the joint positions, which are naturally much
larger for extremities than for joints located near the root.
The proposed system outperforms previous frame-by-frame top-down 3D
pose estimators by a significant margin. It furthermore compares favorably to
state-of-the-art Sequence-to-Sequence methods when using input sequences of
length one. We hope that this study invigorates researchers to improve upon
the state-of-the-art in depth map estimation so that the full potential of human
pose estimation can be realized.

5.1 Future Work

In future work, we plan to further implement the concept of including estimated
depth information as input to 3D human pose estimation networks. Specifically,
we are interested in studying the role of depth in 3D pose estimation networks
that predict on submovies, i.e. sequences of frames, rather than single frames
[6,19,23,33,36,50,51,54]. Intuitively, depth is easier to gauge from submovies
than from a single frame, as the movement of subjects provides more depth cues
to the network. Thus, it appears that networks taking submovies as input
already take advantage of depth information, but to what extent? We plan to
investigate this question, and we hypothesize that further improvements can be
made by providing depth information alongside the submovie input.

References
1. Abadi, M., et al.: Tensorflow: large-scale machine learning on heterogeneous dis-
tributed systems. arXiv preprint arXiv:1603.04467 (2016)
2. Anderson, T.W., Darling, D.A.: Asymptotic theory of certain “goodness of fit”
criteria based on stochastic processes. Ann. Math. Stat. 23, 193–212 (1952)
3. Barrón, C., Kakadiaris, I.A.: Estimating anthropometry and pose from a single
uncalibrated image. Comput. Vis. Image Underst. 81(3), 269–284 (2001)
4. Benzine, A., Chabot, F., Luvison, B., Pham, Q.C., Achard, C.: Pandanet: anchor-
based single-shot multi-person 3d pose estimation. In: Proceedings of the IEEE
and CVF Conference on Computer Vision and Pattern Recognition (CVPR), June
2020
5. Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep It
SMPL: automatic estimation of 3D human pose and shape from a single image. In:
Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp.
561–578. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46454-1 34
6. Cai, Y., et al: Exploiting spatial-temporal relationships for 3d pose estimation
via graph convolutional networks. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision, pp. 2272–2281 (2019)
7. Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., Sheikh, Y.: Openpose: realtime multi-
person 2d pose estimation using part affinity fields. IEEE Trans. Pattern Anal.
Mach. Intell. 43(1), 172–186 (2019)
8. Chang, J.Y., Kyoung, M.L.: 2d–3d pose consistency-based conditional random
fields for 3d human pose estimation. Comput. Vis. Image Underst. 169, 52–61
(2018)
9. Chen, C.-H., Deva, R.: 3d human pose estimation= 2d pose estimation+ match-
ing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 7035–7043 (2017)
10. Cheng, Y., Yang, B., Wang, B., Yan, W., Tan, R.T.: Occlusion-aware networks
for 3d human pose estimation in video. In: Proceedings of the IEEE and CVF
International Conference on Computer Vision, pp. 723–732 (2019)
11. D’Agostino, R.B.: Transformation to normality of the null distribution of g1.
Biometrika 679–681 (1970)
12. Fabbri, M., Lanzi, F., Calderara, S., Alletto, S., Cucchiara, R.: Compressed volu-
metric heatmaps for multi-person 3d pose estimation. In: Proceedings of the IEEE
and CVF Conference on Computer Vision and Pattern Recognition, pp. 7204–7213
(2020)
13. Garg, R., B.G., V.K., Carneiro, G., Reid, I.: Unsupervised CNN for single view
depth estimation: geometry to the rescue. In: Leibe, B., Matas, J., Sebe, N.,
Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 740–756. Springer, Cham
(2016). https://doi.org/10.1007/978-3-319-46484-8 45
14. Glorot, X., Yoshua, B.: Understanding the difficulty of training deep feedforward
neural networks. In: Proceedings of the thirteenth international conference on arti-
ficial intelligence and statistics, pp. 249–256. JMLR Workshop and Conference
Proceedings (2010)
15. Hossain, M.R.I., Little, J.J.: Exploiting temporal information for 3d human pose
estimation. In: Proceedings of the European Conference on Computer Vision
(ECCV), pp. 68–84 (2018)
16. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: large scale
datasets and predictive methods for 3d human sensing in natural environments.
IEEE Trans. Pattern Anal. Mach. Intell. 36(7), 1325–1339 (2013)
17. Kanazawa, A., Zhang, J.Y., Felsen, P., Malik, J.: Learning 3d human dynamics
from video. In: Proceedings of the IEEE and CVF Conference on Computer Vision
and Pattern Recognition, pp. 5614–5623 (2019)
18. Kundu, J.N., Revanur, A., Waghmare, G.V., Venkatesh, R.M., Babu, R.V.: Unsu-
pervised cross-modal alignment for multi-person 3D pose estimation. In: Vedaldi,
A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12358, pp.
35–52. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58601-0 3
19. Lee, K., Lee, I., Lee, S.: Propagating LSTM: 3d pose estimation based on joint
interdependency. In: Proceedings of the European Conference on Computer Vision
(ECCV), pp. 119–135 (2018)
20. Li, S., Chan, A.B.: 3D human pose estimation from monocular images with deep
convolutional neural network. In: Cremers, D., Reid, I., Saito, H., Yang, M.-H.
(eds.) ACCV 2014. LNCS, vol. 9004, pp. 332–347. Springer, Cham (2015). https://
doi.org/10.1007/978-3-319-16808-1 23
21. Li, S., Zhang, W., Chan, A.B.:. Maximum-margin structured learning with deep
networks for 3d human pose estimation. In: Proceedings of the IEEE International
Conference on Computer Vision, pp. 2848–2856 (2015)
22. Li, W., Liu, H., Ding, R., Liu, M., Wang, P., Yang, W.: Exploiting temporal con-
texts with strided transformer for 3d human pose estimation. IEEE Trans. Multi-
media (2022)
23. Liu, R., Shen, J., Wang, H., Chen, C., Cheung, S.C., Asari, V.: Attention mecha-
nism exploits temporal contexts: real-time 3d human pose reconstruction. In: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion, pp. 5064–5073 (2020)
24. Marin-Jimenez, M.J., Romero-Ramirez, F.J., Munoz-Salinas, R., Medina-Carnicer,
R.: 3d human pose estimation from depth maps using a deep combination of poses.
J. Vis. Commun. Image Representation 55, 627–639 (2018)
25. Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for
3d human pose estimation. In: Proceedings of the IEEE International Conference
on Computer Vision, pp. 2640–2649 (2017)
26. Mehta, D., et al.: Xnect: real-time multi-person 3d motion capture with a single
RGB camera. ACM Trans. Graph. (TOG) 39(4), 1–82 (2020)
27. Moon, G., Chang, J.Y. and Lee, K.M.: Camera distance-aware top-down approach
for 3d multi-person pose estimation from a single RGB image. In: Proceedings of
the IEEE and CVF International Conference on Computer Vision, pp. 10133–10142
(2019)
28. Moreno-Noguer, F.: 3d human pose estimation from a single image via distance
matrix regression. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 2823–2832 (2017)
29. Nie, X., Feng, J., Zhang, J., Yan, S.: Single-stage multi-person pose machines. In:
Proceedings of the IEEE and CVF International Conference on Computer Vision,
pp. 6951–6960 (2019)
30. Parameswaran, V., Chellappa, R.: View independent human body pose estimation
from a single perspective image. In: Proceedings of the 2004 IEEE Computer Soci-
ety Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004,
vol. 2, pp. II–II. IEEE (2004)
31. Park, S., Hwang, J., Kwak, N.: 3D human pose estimation using convolutional
neural networks with 2d pose information. In: Hua, G., Jégou, H. (eds.) ECCV
2016. LNCS, vol. 9915, pp. 156–169. Springer, Cham (2016). https://doi.org/10.
1007/978-3-319-49409-8 15
32. Pavlakos, G., Zhou, X., Derpanis, K.G., Daniilidis, K.: Coarse-to-fine volumetric
prediction for single-image 3d human pose. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 7025–7034 (2017)
33. Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.: 3d human pose estimation in
video with temporal convolutions and semi-supervised training. In: Proceedings of
the IEEE and CVF Conference on Computer Vision and Pattern Recognition, pp.
7753–7762 (2019)
34. Peng, B., Luo, Z.: Multi-view 3d pose estimation from single depth images.
Technical report, Stanford University, USA (2016)
35. Rogez, G., Weinzaepfel, P., Schmid, C.: LCR-net++: multi-person 2d and 3d pose
detection in natural images. IEEE Trans. Pattern Anal. Mach. Intell. 42(5), 1146–
1161 (2019)
36. Shan, W., Lu, H., Wang, S., Zhang, X., Gao, W.: Improving robustness and accu-
racy via relative information encoding in 3d human pose estimation. In: Proceed-
ings of the 29th ACM International Conference on Multimedia, pp. 3446–3454
(2021)
37. Shapiro, S.S., Wilk, M.B.: An analysis of variance test for normality (complete
samples). Biometrika 52(3/4), 591–611 (1965)
38. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support
inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y.,
Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg
(2012). https://doi.org/10.1007/978-3-642-33715-4 54
39. Sun, X., Shang, J., Liang, S., Wei, Y.: Compositional human pose regression. In:
Proceedings of the IEEE International Conference on Computer Vision, pp. 2602–
2611 (2017)
40. Sun, X., Xiao, B., Wei, F., Liang, S., Wei, Y.: Integral human pose regression.
In: Proceedings of the European Conference on Computer Vision (ECCV), pp.
529–545 (2018)
41. Tekin, B., Katircioglu, I., Salzmann, M., Lepetit, V., Fua, P.: Structured prediction
of 3d human pose with deep neural networks. arXiv preprint arXiv:1605.05180
(2016)
42. Véges, M., Lőrincz, A.: Multi-person absolute 3d human pose estimation with
weak depth supervision. In: Farkaš, I., Masulli, P., Wermter, S. (eds.) ICANN
2020. LNCS, vol. 12396, pp. 258–270. Springer, Cham (2020). https://doi.org/10.
1007/978-3-030-61609-0 21
43. Wang, C., Li, J., Liu, W., Qian, C., Lu, C.: HMOR: hierarchical multi-person
ordinal relations for monocular multi-person 3d pose estimation. In: Vedaldi, A.,
Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12348, pp.
242–259. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58580-8 15
44. Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines.
In: Proceedings of the IEEE conference on Computer Vision and Pattern Recog-
nition, pp. 4724–4732 (2016)
45. Xu, T., Takano, W.: Graph stacked hourglass networks for 3d human pose esti-
mation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 16105–16114 (2021)
46. Yang, W., Ouyang, W., Wang, X., Ren, J., Li, H., Wang, X.: 3d human pose esti-
mation in the wild by adversarial learning. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 5255–5264 (2018)
47. Yasin, H., Iqbal, U., Kruger, B., Weber, A., Gall, J.: A dual-source approach for
3d pose estimation from a single image. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 4948–4956 (2016)
48. Zanfir, A., Marinoiu, E., Sminchisescu, C.: Monocular 3d pose and shape estima-
tion of multiple people in natural scenes-the importance of multiple scene con-
straints. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 2148–2157 (2018)
49. Zanfir, A., Marinoiu, E., Zanfir, M., Popa, A.I., Sminchisescu, C.: Deep network
for the integrated 3d sensing of multiple people in natural images. Adv. Neural Inf.
Process. Syst. 31 (2018)
50. Zeng, A., Sun, X., Huang, F., Liu, M., Xu, Q., Lin, S.: SRNet: improving gener-
alization in 3d human pose estimation with a split-and-recombine approach. In:
Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol.
12359, pp. 507–523. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-
58568-6 30
51. Zhan, Y., Li, F., Weng, R., Choi, W.: Ray3d: ray-based 3d human pose estimation
for monocular absolute 3d localization. arXiv preprint arXiv:2203.11471 (2022)
52. Zhao, L., Peng, X., Tian, Y., Kapadia, M., Metaxas, D.N.: Semantic graph convo-
lutional networks for 3d human pose regression. In: Proceedings of the IEEE and
CVF Conference on Computer Vision and Pattern Recognition, pp. 3425–3435
(2019)
53. Zhen, J., et al.: SMAP: Single-Shot Multi-person Absolute 3D Pose Estimation.
In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol.
12360, pp. 550–566. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-
58555-6 33
54. Zheng, C., Zhu, S., Mendieta, M., Yang, T., Chen, C., Ding, Z.: 3d human pose
estimation with spatial and temporal transformers. In: Proceedings of the IEEE
and CVF International Conference on Computer Vision, pp. 11656–11665 (2021)
55. Zhou, X., Huang, Q., Sun, X., Xue, X., Wei, Y.: Weakly-supervised transfer for 3d
human pose estimation in the wild. In: IEEE International Conference on Com-
puter Vision, ICCV, vol. 3, p. 7 (2017)
56. Zhou, X., Sun, X., Zhang, W., Liang, S., Wei, Y.: Deep kinematic pose regression.
In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9915, pp. 186–201. Springer,
Cham (2016). https://doi.org/10.1007/978-3-319-49409-8 17
AI-Based QOS/QOE Framework
for Multimedia Systems

Laeticia Nneka Onyejegbu(B), Ugochi Adaku Okengwu, Linda Uchenna Oghenekaro,
Martha Ozohu Musa, and Augustine Obolor Ugbari

Computer Science Department, Faculty of Computing,
University of Port Harcourt, Port Harcourt, Nigeria
{laeticia.onyejegbu,ugochi.okengwu,linda.oghenekaro,martha.musa,
augustine.ugbari}@uniport.edu.ng

Abstract. Streaming applications have grown exponentially in recent years to
become the most dominant contributor to global internet traffic, because they
satisfy a wide variety of customer needs such as video conferencing, video
surveillance, and stored-video streaming. Recent studies show that the
telecommunication industry loses millions of dollars due to the poor QoE
experienced by end users. Accenture also carried out a recent survey showing
that about 82% of customer defection was due to poor QoE. In this work, a QoS
framework that delivers a QoE able to meet users’ expectations was developed.
The system quantitatively measured QoE on multimedia using different variables
that affect QoS. Deep learning algorithms and ground truth datasets were used
in this work to map the QoS features, such as delay, bandwidth, packet loss rate
and throughput, which serve as input, to the output QoE. A controlled experiment
methodology and an active learning approach were used. A Multilayer Perceptron
(MLP) and a Deep Belief Network (DBN) were used to maintain, update and
regulate the states of the network model. The DBN incorporated artificial
features obtained at its Restricted Boltzmann Machine (RBM) stage. This research
produced a hybrid deep learning approach to optimize QoS and QoE in multimedia
applications. A novel predictive QoE model, identifying the relevant QoS
parameters and how they influence QoE, is also presented, and finally a rich
QoS-QoE dataset is presented, which can further be used as a framework to ensure
responsible AI in multimedia systems.

Keywords: QoS/QoE · Deep learning · Multimedia

1 Introduction
The increased availability of the Internet and new technologies such as LTE have enabled
the use of all types of services, including video streaming, which come with stringent
requirements in terms of network performance and capacity demands. Streaming appli-
cations have grown exponentially in recent years to be the most dominant contributor to
global internet traffic, and this is because they satisfy a wide variety of customer needs
such as video conferencing, video surveillance, e-learning and stored-video streaming.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 248–259, 2023.
https://doi.org/10.1007/978-3-031-18461-1_16

With the proliferation of streaming applications, users are continuously raising their
expectation on better services while telecommunication industries jostle to ensure good
quality of experience (QoE), which is a principal measure of users’ perceived quality of
mobile Internet service. Consequently, the issue of adequately supporting all those users
and all those services is a nontrivial one for network operators, as users expect certain
levels of quality to be upheld. This leads to improving both the Quality of Service
(QoS) that operators monitor and manage and the Quality of Experience (QoE) that
users get, through the implementation of deep learning models for optimizing the
overall network quality as experienced by telecommunication users [7]. QoS defines
the overall characteristics of a service, such as communication, that affect the
satisfaction of users’ service needs [9]. QoS deals with performance aspects of
physical systems and must be ensured by the network providers. QoE is the overall
acceptability of an application or service, as perceived subjectively by the end
user, including complete end-to-end system effects [5].
Recent studies show that the telecommunication industry loses millions of dollars
due to the poor QoE experienced by end users. Accenture also carried out a recent
survey showing that about 82% of customer defection was due to poor QoE. Hence,
with 185 million active subscribers in Nigeria, as recorded in December 2019 by
the Nigerian Communications Commission (NCC), the task of improving the QoE of
consumers is inevitable. An improved QoE would generate huge revenue for the
country and further sustain the trust and satisfaction of consumers. The aim of
this work is to quantitatively measure the quality of users’ experience (QoE) on
multimedia using different variables that affect quality of service (QoS); that is,
predicting QoE from QoS. The benefit of this work, when deployed, is that it will
help the telecommunication industry determine the factors that give customers
satisfaction while using multimedia systems.
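The QoS-to-QoE mapping stated as the aim above can be pictured as a small feed-forward network. The sketch below is illustrative only: the weights are untrained placeholders (not the model developed in this work), and the four QoS features are assumed to be normalized to [0, 1]:

```python
import math

def mlp_qoe(qos, w1, b1, w2, b2):
    """One hidden tanh layer mapping 4 QoS features to a MOS-like QoE score."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, qos)) + b)
              for row, b in zip(w1, b1)]
    raw = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 + 4.0 / (1.0 + math.exp(-raw))  # squash into the MOS range [1, 5]

# Illustrative (untrained) weights: 4 inputs -> 3 hidden units -> 1 output.
w1 = [[-0.8, 0.2, -0.9, 0.4],
      [0.3, 0.5, -0.2, 0.6],
      [-0.4, 0.1, -0.7, 0.3]]
b1 = [0.0, 0.1, -0.1]
w2 = [0.9, 0.7, 0.5]
b2 = 0.2

# Normalized QoS features: [delay, bandwidth, packet_loss_rate, throughput].
good_network = [0.1, 0.9, 0.05, 0.8]
bad_network = [0.9, 0.2, 0.60, 0.1]
print(mlp_qoe(good_network, w1, b1, w2, b2))  # higher predicted MOS
print(mlp_qoe(bad_network, w1, b1, w2, b2))   # lower predicted MOS
```

Training such a network amounts to fitting the weights so that the predicted score matches subjective MOS labels from a ground-truth QoS-QoE dataset.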

2 Review of Related Work

In the literature, some work has been done to improve the performance of streaming
applications using deep learning. The research in [8], on data analysis of video
streaming QoE over mobile networks, used k-means clustering and a logistic
regression method to improve QoE prediction accuracy. The author in [10] used a
genetic algorithm (GA) and random neural networks (RNN) to maximise the QoE of
video traffic streaming over LTE networks; the QoS inputs were SINR, delay
(minimum and maximum) and throughput, and the QoE measure was MOS. The authors
of [3], in their work on the effect of network QoS on user QoE for a mobile video
streaming service using the H.265/VP9 codec, introduced two new QoS factor
variables, initial delay and buffering delay; the other QoS parameters used
include packet loss, jitter and bandwidth. The work in [7], modelling the QoE of
internet video streaming by controlled experimentation and machine learning, used
bandwidth and latency parameters; machine learning was used to reduce the number
of experiments while maintaining accuracy.
The authors in [12], in their research evaluating the quality of service in VoIP
and comparing various encoding techniques, used throughput, load, delay, MOS,
jitter and packet loss parameters on small- and large-scale networks. The encoding
techniques used were G.711, G.723 and G.729, and they were used to evaluate the
QoS of VoIP technology.
In [1], QoS for Multimedia Applications with Emphasis on Video Conferencing, the author
used packet loss, E2E delay and packet delay variation metrics to analyse QoS performance
and its effects when video is streamed over GBR (guaranteed bit rate) and non-GBR bearers
on the LTE network standard.
Implementation of Basic QoS Mechanisms on a Videoconferencing Network Model [2] used
packet loss, delay and jitter to create a model for quality-of-service performance
measurement based on the CARNet network technique, implemented with a scheduling
algorithm.
The author in [11] performed a multi-dimensional prediction of QoE based on
network-related SIFs (system influence factors) using machine learning, in the work
titled Machine learning-based QoE prediction for video streaming over LTE network. The
QoS parameters used were delay, jitter and loss.
The authors of [6] measured and predicted the quality of experience of DASH-based video
streaming over LTE, using two machine learning models to improve the design of a DASH
adaptive bitrate control algorithm over the LTE network.
The authors in [4], Subjective and Objective QoE Measurement for H.265/HEVC Video
Streaming over LTE, examined the impact of media-related system factors on QoE for video
streaming and compared the results of subjective and objective QoE measurements.

3 Role of the Proposed Techniques: Deep Belief Networks and Multilayer Perceptron
Deep Belief Networks (DBN), conceived by Geoff Hinton, are an alternative to
backpropagation. Although the network structure of a DBN is identical to that of a
Multilayer Perceptron (MLP), their training differs, and this difference makes the DBN
well suited to QoS classification. A DBN is a stack of Restricted Boltzmann Machines
(RBM). The units of an RBM are divided into a visible and a hidden layer, where each unit
in the visible layer is linked to every unit in the hidden layer. In this layout there
are no connections between units on the same layer, nor between visible groups or hidden
ones. A DBN stacks multiple RBMs in such a way that the hidden layer of one RBM becomes
the visible layer of the one above it, as illustrated in Fig. 1. The DBN is a
probabilistic generative model containing multiple layers of hidden variables, where each
layer captures the correlations between the activities of the hidden features on the
layer below, as shown in Fig. 1. In the DBN, each layer consists of a set of units with
binary or real values. Although the DBN has a hierarchical structure with high
representational power, it can be trained efficiently in a greedy, layer-by-layer
fashion, one RBM at a time. Another advantage of the DBN is that, because it is a
generative model, it can generate samples based on the features learned during training.
The DBN architecture also has the benefit that each layer learns more complex features
than the layers before it.
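To make the stacked-RBM training concrete, the following is a minimal NumPy sketch of a
mean-field contrastive-divergence (CD-1) update for a single RBM; the layer sizes,
learning rate and data are illustrative assumptions, not the configuration used in this
work. In a DBN, the hidden activations of one trained RBM would become the visible input
of the next RBM in the stack.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_vis, n_hid = 6, 4                       # toy layer sizes, not from the paper
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)

def cd1_step(v0, lr=0.1):
    """One mean-field contrastive-divergence (CD-1) update on a batch."""
    global W, b_vis, b_hid
    h0 = sigmoid(v0 @ W + b_hid)          # hidden activations given the data
    v1 = sigmoid(h0 @ W.T + b_vis)        # reconstruction of the visibles
    h1 = sigmoid(v1 @ W + b_hid)          # hidden activations given reconstruction
    W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (h0 - h1).mean(axis=0)
    return float(((v0 - v1) ** 2).mean()) # reconstruction error for monitoring

data = (rng.random((32, n_vis)) < 0.5).astype(float)
errors = [cd1_step(data) for _ in range(100)]
```

The reconstruction error typically decreases as the RBM learns the statistics of its
input batch, which is the signal monitored during greedy layer-wise pre-training.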
AI-Based QOS/QOE Framework for Multimedia Systems 251

Fig. 1. Deep belief networks architecture (Yulita, Fanany, and Arymurthy 2017)

3.1 Multilayer Perceptron

A fully connected multi-layer neural network is called a Multilayer Perceptron (MLP). It
is a variant of the original perceptron model proposed by Rosenblatt in the 1950s, with
one or more hidden layers between its input and output layers. The neurons are organized
in layers, and the connections are always directed from lower layers to upper layers.
Neurons in the same layer are not interconnected, as seen in Fig. 2.
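The forward pass just described can be sketched in a few lines of NumPy; the layer sizes
here are illustrative assumptions, with one hidden layer and a linear output head of the
kind used for regression targets such as QoE scores.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# One hidden layer between input and output; layers are fully connected and
# units within a layer are not interconnected, matching the description above.
rng = np.random.default_rng(1)
W1, b1 = 0.1 * rng.standard_normal((5, 8)), np.zeros(8)   # input(5) -> hidden(8)
W2, b2 = 0.1 * rng.standard_normal((8, 1)), np.zeros(1)   # hidden(8) -> output(1)

def forward(x):
    # Linear output head, suitable for a regression problem.
    return relu(x @ W1 + b1) @ W2 + b2

pred = forward(rng.standard_normal((3, 5)))   # 3 samples, 5 features each
```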

Fig. 2. Architecture of multi-layer perceptron (https://miro.medium.com)



4 Experimental Methodology
Quantitative measurement was used in this work.

1. Data Collection Method: The datasets used for both QoS and QoE were collected online.
2. Sampling Method: Active sampling of the experimental space was adopted to ensure a
   lower training cost without compromising the accuracy of the QoS-QoE model.

5 Experimental Setups and Results

In this work, a quantitative measurement of the quality of users' experience (QoE) of
multimedia was carried out using the different variables that affect quality of service
(QoS). The datasets used were collected online from http://jeremie.leguay.free.fr/qoe.
Thirteen QoE targets were available in the datasets, of which seven were used to measure
the users' quality of experience (QoE): 'StartUpDelay', 'StdDownloadRate',
'AvgBufferLevel', 'AvgQualityIndex', 'AvgVideoBitRate', 'AvgVideoQualityVariation' and
'AvgDownloadBitRate'.

Fig. 3. First few rows of the datasets (QoS and QoE). Source http://jeremie.leguay.free.fr/qoe/

Figure 3 shows the first few rows of the QoS and QoE datasets used in training the
proposed model.
Scatter plots of each QoS feature against the QoE targets considered were produced, as
shown in Figs. 4 and 5. The scatter plots helped visualize the relationship between the
variables and the targets.

Fig. 4. Scatter plot of QoS features against one QoE (StartUpDelay)

Fig. 5. Scatter plot of QoS features against one QoE (AvgVideoBitRate)

The correlation matrix was plotted to visualize and detect multicollinearity, as seen in
Fig. 6. Some multicollinearity was found, and since there were many features, the
variance inflation factor (VIF) approach was employed to address it.

Fig. 6. Correlation matrix plot

The rectified linear (ReLU) activation function was used for the prediction since this is
a regression problem. The regression of the model was plotted for all the QoS features on
each of the selected QoE targets. The actual versus predicted values for StartUpDelay and
AvgBufferLevel are shown in Figs. 7 and 8, respectively. From Figs. 7 and 8 it can be
seen that multicollinearity exists among the QoS features.

Fig. 7. Plot of actual versus predicted value for StartUpDelay



Fig. 8. Plot of actual versus predicted value for AVGBufferLevel

The VIF criterion requires that any feature whose VIF is above 5, or equivalently whose
tolerance (defined as 1/VIF) is below 0.2, be removed. The VIF analysis was carried out,
the features whose VIF exceeded 5 were identified, and they were removed to obtain a more
accurate prediction.
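The VIF screening rule can be sketched as follows; this is an illustrative NumPy
implementation on synthetic data, not the code used in this work, but the threshold of 5
matches the criterion stated above.

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: regress each feature on the
    remaining features (with an intercept) and return 1 / (1 - R^2)."""
    n, k = X.shape
    values = []
    for j in range(k):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1.0 - resid.var() / y.var()
        values.append(1.0 / max(1.0 - r2, 1e-12))
    return np.array(values)

# Synthetic features: column 2 is almost a copy of column 0, so both get a
# large VIF; column 1 is independent and stays near 1.
rng = np.random.default_rng(0)
a, b = rng.standard_normal(200), rng.standard_normal(200)
X = np.column_stack([a, b, a + 0.05 * rng.standard_normal(200)])
v = vif(X)
keep = [j for j in range(X.shape[1]) if v[j] <= 5]   # drop VIF > 5 (tolerance < 0.2)
```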
In Fig. 9, we predicted the value of StartUpDelay using the test data and compared the
predicted values on the right with the actual ground-truth values on the left. The
predicted values are close to the actual ones.
As Fig. 10 shows, after removing multicollinearity the model behaved better than when
multicollinearity was present. From the prediction using some QoE measurements, we were
able to obtain an accuracy of 99%, as shown in Figs. 10 and 11 for StartUpDelay and
AvgDownLoadBitRate, respectively, using the DBN.
A Multilayer Perceptron (MLP) was also built with two dense layers, and all its
hyperparameters were fine-tuned before fitting. Its results are visualized alongside
those of the DBN in Figs. 10 and 11. The observed results are equally good, with varying
degrees of accuracy across the different QoE measurements. From the results and the
plots, shown in Figs. 12 and 13 for the MLP, greater accuracy was achieved with the DBN
than with the MLP.
Figures 12 and 13 show the MLP plots of actual versus predicted values for StartUpDelay
and AvgVideoBitRate.

Fig. 9. The actual and the predicted value of the Startupdelay

Fig. 10. Plot of actual versus predicted value for StartUpDelay using DBN

Fig. 11. Plot of actual versus predicted value for AvgDownLoadBitRate

Fig. 12. Plot of actual versus predicted value for StartUpDelay using MLP

Fig. 13. Plot of actual versus predicted value for AvgVideoBitRate



6 Performance Evaluation of the Model


The performance of the model was evaluated using the R-squared metric.
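For reference, R-squared is one minus the ratio of the residual sum of squares to the
total sum of squares; a small self-contained sketch with made-up values, not figures from
this work:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy targets and predictions: a close fit gives a score near 1.
score = r_squared([3.0, 5.0, 7.0], [2.9, 5.1, 7.0])
```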

7 Discussion of Result
In this work, we predicted QoE using the QoS variables required to give good predictions,
as seen in Figs. 10 and 11. This result is significant because multimedia service
providers need to work with the right variables for better outputs, thereby improving
quality of experience (QoE). The observed results are equally good, with varying degrees
of accuracy across the different QoE measurements. Greater accuracy was achieved with the
DBN algorithm than with the MLP algorithm.

8 Conclusion
This research explored two deep learning techniques, the Deep Belief Network (DBN) and
the Multilayer Perceptron (MLP), to optimize QoS and QoE in multimedia applications. A
novel predictive QoE model identifying the relevant QoS features and how they influence
QoE was presented, along with a rich QoS-QoE dataset that can further serve as a
framework to ensure responsible AI in the area of multimedia systems. The Deep Belief
Network algorithm performed better than the Multilayer Perceptron algorithm.

9 Suggestion for Future Work and Recommendation


We suggest that further work be done on the prediction of QoE using QoS audio features.
This work should be deployed in the telecommunication industry for real-time use.

References
1. Khalifeh, A., Gholamhosseinian, A., Hajibagher, N.Z.: QoS for multimedia applications
with emphasis on video conferencing. Modern Communication System and Networks. Halmstad
University (2011)
2. Caslay, L., Zagar, D., Job, J.: Implementation of basic Qos mechanisms on videoconferencing
network model. Techn. Gazette 19(1), 123–130 (2012)
3. Debajyoti, P., Vajirasak, V.: Effect of network QoS on user QoE for a mobile video streaming
service using H.265/VP9 codec. In: Procedia of Computer Science 8th International Confer-
ence on Advances in Information Technology, IAIT, Macau, China, vol. 111, pp. 214–222
(2017)
4. Baraković Husić, J., Baraković, S., Osmanović, I.: Subjective and objective QoE measurement
for H.265/HEVC video streaming over LTE. In: Avdaković, S., Mujčić, A., Mujezinović, A.,
Uzunović, T., Volić, I. (eds.) IAT 2019. LNNS, vol. 83, pp. 428–441. Springer, Cham (2020).
https://doi.org/10.1007/978-3-030-24986-1_34
5. Javier, L., Nieto, A.: Security and QoS Tradeoffs: towards a FI perspective (2014)

6. Jia, K., Guo, Y., Chen, Y., Zhao, Y.: Measuring and predicting quality of experience
of DASH-based video streaming over LTE. In: 19th International Symposium on Wireless
Personal Multimedia Communications (2016)
7. Khokhar, M.J.: Modelling quality of experience of internet video streaming by controlled
experimentation and machine learning. Networking and Internet Architecture. Publisher Hal
(2020)
8. Wang, Q., Dai, H.-N., Wu, D., Xiao, H.: Data analysis on video streaming QoE over mobile
networks. EURASIP J. Wirel. Commun. Netw. 2018(1), 1 (2018). https://doi.org/10.1186/
s13638-018-1180-8
9. Rana, F., Ghani, A., Sufiuh, A.: Quality of experience metric of streaming video: a
survey. Iraqi J. Sci. 59(3), 1531–1537 (2018)
10. Tarik, G., HadiLarijania, O., Ali, S.: QoE-aware optimization of video stream
downlink scheduling over LTE networks using RNNs and genetic algorithm. Proc. Comput.
Sci. 94, 232–239 (2016). 11th International Conference on Future Networks and
Communications
11. Tarik, B., Barakovic, S., Jasmina, B.: Machine learning-based QoE prediction for video
streaming over LTE network. In: 17th International Symposium Infoteh-Jahorina (2018)
12. Vadivelu, S.: Evaluating the quality of service in VOIP and comparing various
encoding techniques. MSc thesis, University of Bedfordshire (2011)
Snatch Theft Detection Using Deep Learning
Models

Nurul Farhana Mohamad Zamri1 , Nooritawati Md Tahir1,2,3(B) ,


Megat Syahirul Amin Megat Ali4 , and Nur Dalila Khirul Ashar5
1 School of Electrical Engineering, College of Engineering, Universiti Teknologi MARA,
Shah Alam, Selangor, Malaysia
[email protected]
2 Institute for Big Data Analytics and Artificial Intelligence (IBDAAI), Universiti Teknologi
MARA, Shah Alam, Selangor, Malaysia
3 Integrative Pharmacogenomics Institute (iPROMISE), Universiti Teknologi MARA, Shah
Alam, Selangor, Malaysia
4 Microwave Research Institute (MRI), Universiti Teknologi MARA, Shah Alam, Selangor,
Malaysia
5 School of Electrical Engineering, Universiti Teknologi MARA, Shah Alam, Johor, Malaysia

Abstract. It is vital to combat crime by predicting and detecting its occurrence,
especially in urban cities. Hence, this study investigates the capability of six deep
learning models, namely AlexNet, GoogleNet, ResNet18, ResNet50, ResNet101 and
InceptionV3, to determine the optimal model for snatch theft detection. A database of
13000 images in two categories, snatch theft and non-snatch activities, was generated
from 120 videos obtained from the Google and YouTube platforms. These images were used
for training and testing the six DL models, with data augmentation applied during
training to avoid overfitting. However, the training and testing accuracy plots showed
that overfitting still occurred, and it was therefore decided to re-train the models
using an early stopping method. Upon completion of re-training, all six models showed a
good-fit condition, with ResNet50 attaining the highest testing accuracy of 98.9% and
100% sensitivity. As for specificity, ResNet101 showed the highest value, at 97.7%.

Keywords: Deep learning neural network · Snatch theft detection · Overfitting

1 Introduction

Crime takes many forms, such as drive-by killings, murder, drug trafficking, money
laundering, black markets and fraud. There is a need to understand these categories of
crime to reveal the most prevalent crimes and the states with the highest crime
frequencies. This is a necessary precursor to further research into the prediction and
detection of crime in urban cities [1–3].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 260–274, 2023.
https://doi.org/10.1007/978-3-031-18461-1_17
Snatch Theft Detection Using Deep Learning Models 261

Crime analysis is a function that involves the systematic identification and analysis of
patterns and trends in crime. Information on crime patterns helps law enforcement
agencies deploy resources more effectively and helps officers identify and apprehend
suspects. Crime analysis also plays a part in formulating solutions to crime problems and
developing crime prevention schemes. Crime analysts review crime reports, arrest reports
and police calls for service to quickly diagnose emerging patterns, series and trends
[4]. They analyze these phenomena for all significant factors, occasionally predict or
forecast future circumstances, and produce documents such as reports, newsletters and
bulletins alerting authorities and related bureaus [5]. This is also one of the
strategies or action plans to combat crime and law-breaking.
Conversely, it is also important to understand why crimes happen in specific regions.
Crime does not happen at random; it is intentional and premeditated. Crime transpires
when a victim's activity space intersects with the criminal's activity space. The
activity space includes daily locations such as the office, home, shopping vicinity,
school and entertainment areas [7]. Crime will occur if a site provides the opportunity
to commit a felony and it exists within an offender's awareness space. Shopping malls,
recreation areas and restaurants have higher crime rates. Thus, analysts can gain
information on the theory of crime patterns in a more systematic way and further
investigate and analyze behavior patterns.
One of the most common crimes in urban cities is street crime, that is, criminal acts in
public spaces. Examples of street crime are pickpocketing, mugging and snatch theft [8].
Snatch theft is the forceful stealing of a pedestrian's personal belongings, such as a
necklace, mobile phone or handbag, using run-and-rob tactics. Typically, two men work
together: one drives the motorcycle and the other does the stealing [9]. This is a
worrisome problem among pedestrians because it can cause accidents as well as fear,
distress and anxiety. Pedestrians are sometimes inattentive to their belongings, which
creates an opportunity for criminals to commit a crime. The government authorities have
made several efforts to raise awareness of street crime, but the crime rate shows
otherwise. As technology proliferates, the style of crime is also advancing. One approach
is surveillance, in which an area is observed using CCTV cameras. Many regions have
already installed CCTV that allows users to monitor or record daily activities, but
watching the whole prolonged CCTV video manually is tiresome, and incidents can easily be
overlooked or missed. Hence it can be quite challenging to detect a crime scene when it
happens.
Recently, numerous studies have been conducted in crime-related areas. Each study used
its own distinctive features, which can be considered physical traits. The most commonly
used cues are facial information, as in Y. Xia et al. [10], and the scene of the
incident, as in R. Mandal et al. [11]. Classifying the incident area can provide an
advantage because it often does not require the subject's awareness and attention. Since
snatch theft offenders usually wear helmets while committing a crime, extracting facial
information cannot be applied.
As for intelligent techniques, Artificial Intelligence (AI) [6] is one of the methods
that can be used for crime detection and pattern prediction. AI can readily learn to
classify snatch theft videos or images, and the crime pattern of snatch theft can be used
as a feature in snatch theft classification. This paper is organized as follows: Sect. 1
presents the introduction to crime activities, and Sect. 2 elaborates on previous
research related to crime and recognition. Next, Sect. 3 explains the theories of the
transfer learning methods proposed for snatch theft detection and each algorithm used in
this study. Section 4 details the experimental protocol along with data collection and
optimization, followed by classification. The discussion of the experimental analysis and
results is given in Sect. 5. Finally, Sect. 6 concludes our findings.
262 N. F. M. Zamri et al.

2 Related Work

Traditionally, police officers conduct routine patrols in randomly chosen areas. However,
some studies have been conducted to assist officers in planning strategic patrol routes
[12–19]. As mentioned earlier, CCTV is extensively utilized for monitoring public areas
with a high potential for crime. CCTV images are used for probing after a crime has
happened, specifically as part of the post-incident investigation, since the CCTV records
all activities, including the crimes, and can serve as evidence. Additionally, a CCTV
system has many cameras and can thus also deter crime. CCTV allows users to monitor
activities at several locations at one time, and the recordings can be accessed remotely.
However, this method of surveillance demands extensive human resources and is not
economical. Hence, an intelligent technique is vital to replace the conventional CCTV
approach for monitoring, identifying or detecting a subject, which is time-consuming
since the extensive recorded CCTV videos must be inspected to detect any anomalous
activities.
Firstly, Norazlin et al. [20] utilized the optical flow vector to analyze the
perpetrator's movement. In their study, the total optical flow is computed for every
frame sequence. If two people move independently, the flow vector of each subject
contributes to the full optical flow. However, the optical flow vector value gradually
decreases at the intersection point between the two subjects: the flow of the movement
skeleton before the juncture is higher than during the intersection. Meanwhile, Suriani
et al. [21] used the sum of vector (SOV) flow movement to observe interaction or
overlapping. The snatching action occurs as the subjects move toward one another, not
while the subjects are separated or scattered. For a movement to be considered reliable,
the flow is usually constant before and after the intersection; it is recognizably
different once snatching occurs, with the flow abruptly disrupted at the intersection.
Here, the Kalman filter approach was utilized to determine the frames where the subjects
started to cross paths and where the intersections ended [20]. As for classification, a
supervised SVM was used to validate the accuracy on both snatch and non-snatch activities
based on the extracted features. The accuracy obtained was 91% in [20], while [21], which
utilized the motion vector flow (MVF), obtained a classification accuracy of 89.43%.
Conversely, deep learning (DL) is a promising method for predicting and detecting crimes.
Umair Muneer et al. [22] implemented DL to detect and predict snatch theft crimes. The
proposed method used a Convolutional Neural Network (CNN) model based on the VGG19
architecture, with the collected image dataset categorized as snatch and non-snatch. The
proposed model obtained an accuracy of 81% in detecting snatch theft. The authors found
that scenes with numerous moving targets that imitate the movement and speed seen in
anomalous images are more challenging to detect. The present study focuses on detecting
and predicting snatch theft events by developing a new DL-based model using crime event
videos. Based on the limitations and research gaps identified in previous work, six CNN
models will be evaluated and validated: AlexNet, GoogleNet, ResNet18, ResNet50, ResNet101
and InceptionV3. The best DL model for snatch theft detection will then be determined.

3 Transfer Learning Techniques for Snatch Theft Detection

As we know, transfer learning is a machine learning technique that reuses knowledge
learned in one setting to enhance performance on a related or new task [23]. With
transfer learning, a dense machine learning model can be established with a reasonable
amount of training data, since the model has already gone through a pre-training session
and much information is transferred from the earlier task to the trained model.
In this study, a pre-trained series network and pre-trained directed acyclic graph (DAG)
networks are used, with both fine-tuning and feature extraction algorithms. The
fine-tuning algorithm complements the feature extraction algorithm by updating the
weights on the new dataset while maintaining the convolutional base of the pre-trained
CNN during training. The pre-trained DAG networks use both fine-tuning and feature
extraction algorithms, although a DAG network has a distinctive hierarchical
architecture. The hierarchical architecture of a series network makes it easier to train
and generalize.

3.1 Transfer Learning with Series Network

A pre-trained series network, specifically AlexNet as in Fig. 1, is selected to be
remodeled for the detection of both normal and anomaly scenes. Generally, the final three
layers of the dense base in a pre-trained CNN are configured for 1000 classes. These
layers are replaced with three new layers for the two categories of snatch theft scenes,
as in Table 1, without adding any other new layers, so as to stay synchronized with the
pre-trained layers. The learning rate hyperparameters of the newly added layers are set
to a higher value to ensure that the weights and biases of the new fully connected layer
learn faster.

Fig. 1. Basic architecture of AlexNet [24]



Table 1. Transfer learning of series and DAG networks

Pre-trained series network                 | Remodelled pre-trained series network
Layer (end-3): Fully connected layer       | Layer (end+1): Fully connected layer
  Weight learning rate: 1                  |   Weight learning rate: 10
  Bias learning rate: 2                    |   Bias learning rate: 10
  Weight regularisation factor: 1          |   Weight regularisation factor: 1
  Bias regularisation factor: 0            |   Bias regularisation factor: 0
Layer (end-2): Softmax layer               | Layer (end+2): Softmax layer
Layer (end-1): Output classification layer | Layer (end+3): Output classification layer
Originally, the last three layers of AlexNet consist of the fc8, prob and output layers,
which classify up to 1000 classes. These last three layers are therefore removed and
three new layers are added, since the network will now classify two categories, namely
'anomaly', representing a snatch theft occurrence, and 'normal' otherwise. Figure 2(a)
shows the original layers of AlexNet. The three layers labelled fc1000, prob and output
are removed, and a new fully connected layer (FullyCon), softmax layer and classification
layer (classoutput) are added, as in Fig. 2(b).
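The head replacement and the higher learning-rate factors of Table 1 can be sketched in
PyTorch as follows. This is a hypothetical stand-in (the study used MATLAB-style layers,
and the convolutional base is omitted here): the dense head mimics AlexNet's classifier
sizes, the final 1000-class layer is swapped for a 2-class layer, and the 10x
learning-rate factor on the new layer is expressed through optimizer parameter groups.

```python
import torch
from torch import nn

# Stand-in dense head in place of a full pretrained AlexNet; the sizes follow
# AlexNet's classifier (9216 = 256*6*6 flattened conv features).
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(9216, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),               # original head: 1000 ImageNet classes
)

# Remodel: swap the final fully connected layer for a new 2-class head
# ('anomaly' vs 'normal').
net[-1] = nn.Linear(4096, 2)

# Mirror Table 1's learning-rate factors: the new layer's parameters get
# their own optimizer group with 10x the base rate so they learn faster.
base_lr = 1e-4
optimizer = torch.optim.SGD([
    {"params": net[:-1].parameters(), "lr": base_lr},
    {"params": net[-1].parameters(), "lr": base_lr * 10},
])
```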

Fig. 2. (a) Original layer of AlexNet; (b) New three last layers for AlexNet

3.2 Transfer Learning with DAG Network


There are five pre-trained DAG networks that include GoogLeNet as an example in
Fig. 3, Inception-v3, ResNet18, ResNet50 and ResNet101 selected to be remodeled for
detecting the anomaly and normal scenes. For the pre-trained DAG networks, the process
of transfer learning is the same as in Table 1. Similar to Alexnet, the last three layers of
dense base in pre-trained CNNs can be used to constitute for 1000 classes. These layers
are replaced with the three new layers as well for constructing two categories of snatch
theft scene. The three last layer is removed and new three layers are replaced and added
in classifying the two new categories, snatch theft as ‘anomaly’ or else as ‘normal’.

Fig. 3. Basic architecture of GoogLeNet [25]

3.3 Verifying Overfitting in DL Models for Snatch Theft Detection


Overfitting occurs when a statistical model fits precisely against its training data and
the algorithm cannot perform accurately against unseen data. The generalization of a
model to new data ultimately allows machine learning algorithms to make predictions
and classify data. As we know, machine learning algorithms are constructed that leverage
certain dataset for training the model. However, when the models are trained for too long
on sample data or when the model is too complex, the model memorizes the noise and fits
too closely to the training set and overfitting occurs which further unable for the model
to generalize to new data [26, 27] that may lead to imperfect classification. For instance,
Imanol Bilbao et al. [28] reported small training error and large validation error occur
at the same time which indicated overfitting. Therefore, in this study, the occurrence of
overfitting is prevented via data augmentation and early stopping method in determining
the optimum time and number of iterations for training the model.
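A minimal sketch of patience-based early stopping of the kind described above; the
validation-loss list stands in for losses a real training loop would produce, and the
patience value is an illustrative assumption, not a setting reported in the paper.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch of the best validation loss, stopping the scan once
    the loss has failed to improve for `patience` consecutive epochs."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break                     # pause training before noise is learned
    return best_epoch

# A made-up validation-loss curve: improvement stalls after epoch 2.
stop_at = early_stopping_epoch([1.0, 0.8, 0.7, 0.72, 0.75, 0.74, 0.73])
```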

3.4 Performance Measure Metrics for Snatch Theft Detection


The optimum deep transfer learning evaluated for snatch theft detection will be rated
based on the performance measures for both classes specifically accuracy (Acc), sen-
sitivity (Sens) and, specificity (Spec). Acc resembles the DL accuracy, while the Sens
represents the snatch theft scenario are classified as ‘anomaly’ and the Spec represents
the ‘normal’ scenario without occurrence of snatch theft are correctly classified [29]. The
datasets were divided into 70% for training and for testing is 30%. All trained models
were evaluated and tested using 1950 as unseen images for both classes as either ‘snatch
theft (anomaly)’ or ‘non-snatch (normal)’.

4 Experiment Protocol
In this study, the experimental analysis was implemented on a Lenovo Legion 720T desktop
with 16 GB of RAM and an 8 GB graphics card. The protocol includes the data acquisition
stage and the training and testing stages for all the CNN models.

4.1 Data Acquisition and Pre-processing

The process of data collection was divided into three parts: data searching, conversion
into images, and data sorting. YouTube and Google were the online platforms used in this
study to find videos related to snatch theft, and a thorough search was done across both.
Overall, 120 videos related to snatch theft or otherwise (normal) were identified and
compiled, each lasting between six and eight seconds. The condition for including a
snatch theft video was that the scene must show an act of snatching carried out from a
motorbike or on the run. Figure 4 shows some examples of these two categories of
activities. Overall, a total of 13000 images were obtained after converting the videos
into frames.

Fig. 4. Example images acquired for the experimental and analysis

Next, the data was sorted into two categories: normal and anomaly. As depicted in Fig. 5,
'normal' covers, for example, a person riding a motorcycle along a busy road, while
'anomaly' covers the snatch theft incident itself, depicting the perpetrator trying to
snatch. Overall, there are 6500 images in each category.

Fig. 5. Example of normal database (Left) and anomaly database (Right)

4.2 Data Augmentation for Snatch Theft Detection

As stated earlier, data augmentation can be used to help prevent overfitting of the DL
models. In this study, the augmentation methods used were rotation, Y reflection,
translation and scaling. Figure 6 shows some samples of the data augmentation used while
training the DL models.

Fig. 6. Examples of the augmentation methods used: normal, random scaling, random
translation and random Y reflection
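These augmentation operations can be sketched in NumPy as follows; this is an
illustrative stand-in, not the augmenter actually used for training, and rotation is
omitted for brevity. The shift range and scale range are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))                   # stand-in for one video frame

def augment(frame):
    """Apply random Y reflection, translation and scaling to one frame."""
    out = frame
    if rng.random() < 0.5:                      # random Y (left-right) reflection
        out = out[:, ::-1, :]
    dy, dx = rng.integers(-5, 6, size=2)        # random translation in pixels
    out = np.roll(out, (int(dy), int(dx)), axis=(0, 1))
    s = rng.uniform(0.9, 1.1)                   # random scale factor
    h, w = out.shape[:2]
    rows = np.clip((np.arange(h) / s).astype(int), 0, h - 1)
    cols = np.clip((np.arange(w) / s).astype(int), 0, w - 1)
    return out[rows][:, cols]                   # nearest-neighbour resample

aug = augment(img)                              # same shape as the input frame
```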



5 Experimental Results and Discussion


This section discusses the results obtained along with the process of snatch theft clas-
sification using the CNN models namely AlexNet, GoogleNet, InceptionV3, ResNet18,
ResNet50, and ResNet101. Next, the comparison in accuracy and performance measure
was performed in order to find the best CNN models to classify snatch theft activities
as well as to confirm that overfitting is not occurred. A total of 6500 dataset of images
in each category, anomaly and normal, were used for all the CNN models. The database
was separated randomly into 70% for training and the remainder as testing. Table 2
showed the result for each CNN model. The ResNet101attained accuracy of 100% for
both training and testing. It is observed also that all the CNN models obtained 100% for
sensitivity except InceptionV3. This resembled that these CNN models are capable to
classify all non-snatch database as ‘normal’. Meanwhile based on specificity of 100%,
only ResNet101 can classify perfectly the snatch theft database as ‘anomaly’ while other
models have misclassified these snatch theft databases as ‘normal’ instead.

Table 2. Performance measure of each DL model with random data augmentation method during
training

DLNN model   Training accuracy (%)   Testing accuracy (%)   Sens (%)   Spec (%)
AlexNet 99.7 99.3 100 98.6
GoogleNet 99.7 99.7 100 99.4
InceptionV3 99.7 99.7 99.6 99.8
ResNet18 98.7 98.8 100 97.6
ResNet50 99.8 99.8 100 99.6
ResNet101 100 100 100 100

Further, Fig. 7 shows examples of the training and testing accuracy and loss plots for AlexNet and GoogleNet. Although data augmentation was utilized, overfitting still occurred in these two models, as well as in the other four. These models were tested using the 30% unseen portion of the database. Hence, to overcome the overfitting, each model was re-trained using an early stopping criterion, which involves pausing the training process before the model starts learning the noise within the data. The results obtained after re-training the models are tabulated in Table 3. The highest testing accuracy was achieved by ResNet50, together with 100% sensitivity, while the highest specificity of 97.7% still belonged to ResNet101, consistent with its results before re-training.
Next, example plots of the re-trained models, namely AlexNet and GoogleNet, in Fig. 8 show that after re-training, overfitting no longer occurred. All models showed similar testing curves, indicating a good fit of the testing accuracy. Thus, combining data augmentation with an early stopping criterion successfully overcame the overfitting issue in these CNN models.
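The early stopping criterion used here can be sketched as a simple patience rule on the validation loss; the patience value below is an illustrative assumption, not a setting reported in the paper:

```python
def early_stopping_index(val_losses, patience=5):
    """Return the iteration at which training should stop.

    Training is paused once the validation loss has failed to improve on its
    best value for `patience` consecutive evaluations, i.e. before the model
    starts fitting noise in the training data.
    """
    best, best_i, wait = float("inf"), 0, 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_i, wait = loss, i, 0
        else:
            wait += 1
            if wait >= patience:
                return i          # stop here; the best weights were at best_i
    return len(val_losses) - 1    # never triggered: train to completion

# A loss curve that improves, then degrades as overfitting sets in
losses = [1.0, 0.7, 0.5, 0.45, 0.44, 0.46, 0.48, 0.50, 0.55, 0.60, 0.70]
print(early_stopping_index(losses, patience=5))  # → 9 (the best loss was at index 4)
```

In practice one also restores the weights saved at the best iteration, which is what the "iteration during early stopping" column in Table 3 records for each model.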

Table 3. Performance measures of each DL model with the early stopping criterion applied to combat overfitting

DLNN model    Iteration at early stopping  Training accuracy (%)  Testing accuracy (%)  Sens (%)  Spec (%)
AlexNet       701                          89.1                   89.1                  100       82.5
GoogleNet     451                          99.9                   89.4                  100       82.5
InceptionV3   803                          95.4                   93.9                  100       89.1
ResNet18      721                          97.9                   92.4                  100       86.8
ResNet50      930                          96.2                   98.9                  100       96.9
ResNet101     930                          99.8                   98.7                  99.8      97.7

Based on the experimental analysis, training each DL model with the random data augmentation method alone still resulted in overfitting. Hence, the early stopping criterion was utilized to combat this issue.

Fig. 7. Training and testing accuracy plots for AlexNet (top) and GoogleNet (bottom), showing overfitting even though data augmentation was utilized

Fig. 8. AlexNet (top) and GoogleNet (bottom) after completion of the re-training process, showing a good fit of the testing accuracy under the early-stopping criterion

6 Conclusion
In conclusion, snatch theft detection using six DL models was conducted in this study to evaluate and validate the ability of each model to classify snatch theft and non-snatch activities. Datasets of snatch theft and non-snatch theft activities were extracted and pre-processed from Google and YouTube, comprising 120 videos that generated 13000 images across both categories. These images served as the database to train and test the DL models. Based on the original training and testing accuracy, ResNet101 showed perfect accuracy during both training and testing. In addition, all models excluding InceptionV3 obtained perfect sensitivity as well, indicating that these CNN models classified every non-snatch sample as 'normal'. However, the plots of training and testing accuracy revealed that overfitting happened for all models. To overcome this, all models were re-trained with an early stopping criterion, and the results obtained after re-training showed that the overfitting was overcome. Finally, it can be concluded that the highest testing accuracy of 98.9% was obtained by ResNet50, along with perfect sensitivity. As for specificity, ResNet101 attained the highest with 97.7%, matching its results before re-training. Future work includes validating and testing these CNN models in a real-time environment.

Acknowledgment. This research was funded by the Ministry of Higher Education (MOHE)
Malaysia, Grant No: 600-IRMI/FRGS 5/3 (394/2019), Sponsorship File No: FRGS/1/2019/
TK04/UITM/01/3. The authors would like to thank the College of Engineering, Universiti
Teknologi MARA (UiTM), Shah Alam, Selangor, Malaysia for the facilities provided in this
research.

Deep Learning and Few-Shot Learning
in the Detection of Skin Cancer: An Overview

Olusoji Akinrinade1(B) , Chunglin Du1 , Samuel Ajila2 , and Toluwase A. Olowookere3


1 Tshwane University of Technology, Pretoria, South Africa
[email protected]
2 Carleton University, Ottawa, Canada
3 Department of Computer Science, Redeemer’s University, Ede, Nigeria

Abstract. Skin cancer is a severe condition that should be detected early. The
two most prevalent types of skin cancer are melanoma and non-melanoma. Melanoma has been identified as the most dangerous skin cancer, yet discriminating melanoma lesions from non-melanoma lesions has proven challenging.
Several artificial intelligence-based strategies have been introduced in the litera-
ture to handle skin cancer detection, including deep learning and few-shot learning
strategies. According to the evidence in the literature, deep learning algorithms
are reported to perform well when trained on large datasets. However, they are
only effective when the target domain has enough labeled samples; they do not
ensure adequate network activation variables to adjust to new target regions rapidly
when the target domain has insufficient data. Consequently, few-shot learning
paradigms have been presented in the literature to promote learning from such
limited amounts of labeled data. A search on PubMed from inception to 7 June
2022 for studies investigating the review of the application of deep learning and
few-shot learning in the detection of skin cancer was performed via the use of title
terms “Deep Learning” AND “Few-Shot Learning” AND “Skin Cancer Detec-
tion” AND “Review,” combined with title terms or MeSH terms “Deep Learning”
AND “Few-Shot Learning” AND “Skin Cancer Detection” AND “Review,” with
no limits on language or date of publication. We found no paper that has reviewed
the application of deep learning and few-shot learning in detecting skin cancer.
This paper, therefore, presents a brief overview of some of the most critical appli-
cations of deep learning and few-shot learning schemes in the detection of skin
cancer lesions from skin image data.

Keywords: Artificial intelligence · Deep learning · Few-shot learning ·


Melanoma · Skin cancer detection

1 Introduction
Cancer, a medical disease characterized by unregulated cell proliferation in bodily tis-
sues, is considered one of the foremost healthcare burdens worldwide [1]. Because the
skin is the largest organ in the body, skin cancer diseases are among the most common and hazardous of the several cancer types [2]. Unfixed deoxyribonucleic

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 275–286, 2023.
https://doi.org/10.1007/978-3-031-18461-1_18

acid (DNA) in skin cells causes genetic defects or skin changes, leading to skin cancer [3]. Skin cancer is grouped into two broad classes: melanoma and non-melanoma [4]. While the majority of skin cancer cases fall into the non-melanoma category and have minimal likelihood of spreading to other regions of the body, melanoma cancers are an uncommon, dangerous, and often fatal form of skin cancer that develops in surface cells called melanocytes. According to the American Cancer Society, melanoma accounts for only about 1% of all skin cancer cases but is associated with a greater mortality rate, since it can be treated effectively only when detected early [5]. As a result, it is preferable to recognize it from the onset, to ensure it is curable and to reduce the cost of therapy. According to Esteva [6], around 5.4 million new cases of skin cancer are documented per year in the United States, and the rate at which this number is growing is deeply troubling on a global scale.
Considering the gravity of the situation, scientists have devised technologies for skin cancer detection that can aid early diagnosis with good accuracy and sensitivity. Dermoscopy imaging is used to identify melanoma in human skin by detecting pigmented skin lesions. The procedure is non-invasive and detects lesions early on. Experts (dermatologists) examine the dermoscopic images to determine whether skin lesions are present. Like many other medical fields, dermatology has immensely benefitted from advances in the computer vision sub-field of artificial intelligence, which focuses on enabling computer systems to extract information from digital images and automate the identification tasks that human visual systems can do [6]. The performance of deep learning algorithms on computer vision tasks has improved over the years since the first convolutional neural network emerged [7], and deep learning has recently been applied in computer-aided skin cancer recognition tasks [5]. The availability of vast amounts of labeled image data has greatly helped deep learning algorithms achieve these improvements in computer vision tasks on medical images over the last decade [8].
However, when there are only limited amounts of labeled image data at the disposal of deep learning algorithms, deep networks trained on such small amounts of image data tend to fail because they overfit and are less likely to generalize appropriately. As a result, to aid learning from small amounts of labeled data, few-shot classification techniques have emerged [9–12]. With only a few training instances, few-shot learning approaches give computer vision models the ability to quickly adjust to novel tasks and settings.
A search on PubMed [13] from inception to 7 June 2022 for studies investigating
the review of the application of deep learning and few-shot learning in the detection
of skin cancer was performed via the use of title terms “Deep Learning” AND “Few-
Shot Learning” AND “Skin Cancer Detection” AND “Review”, combined with title
terms or MeSH terms “Deep Learning” AND “Few-Shot Learning” AND “Skin Cancer
Detection” AND “Review”, with no limits on language or date of publication. We found
no paper that has reviewed the application of deep learning and few-shot learning in the
detection of skin cancer. This paper, therefore, presents a review on the application of
deep learning and few-shot learning approaches in the identification and detection of
skin cancer (melanoma) which has a very small amount of data available. Section 2 of
this paper presents a brief overview of the general application of deep learning in skin

cancer detection, while Sect. 3 highlights the basis of the few-shot learning concept and Sect. 4 reviews the application of few-shot learning in detecting skin cancer diseases.

2 Deep Learning in Skin Cancer Detection

Artificial intelligence (AI), a field of computer science that employs technologies and programs to simulate human cognitive abilities, has a variety of applications in health care, including dermatology. Machine learning is an AI paradigm that permits computers to learn from data without explicit programming. In other words, the purpose of machine learning is to create systems that learn automatically from observations of the real world (referred to as "training data"), without the need for people to explicitly define rules or reasoning [14]. Machine learning (ML) plays a vital role in skin cancer diagnosis. A subclass of machine learning is deep learning, which builds deep neural network models with many parameters and layers between input and output. The Convolutional Neural Network (CNN) is a deep learning method that may be used to solve difficult computer vision problems like skin lesion analysis and that overcomes the limitations of traditional machine learning approaches [15]. Deep learning models readily learn patterns from large amounts of data and are well suited to learning the inherent patterns in large clinical imaging datasets, including cancer lesion images [16].
Melanoma or probable skin lesions are identified primarily using dermoscopy imaging, which detects pigmented skin lesions. The procedure is non-invasive and detects lesions early on. Dermatologists can evaluate skin lesions with their own eyes, as dermoscopic images have good resolution and great visual quality [17]. The examination process by dermatologists takes time, necessitates a high level of professional expertise, and is sometimes subjective. Convolutional neural networks are just as good as dermatologists at detecting melanoma or skin lesions, if not better [18].
There are various deep learning architectures, including AlexNet, VGGNet, MobileNet, and ResNet, among others. Figure 1 depicts the design of the ResNet-50 deep learning architecture used for the diagnosis of skin cancer by Medhat et al. [19]. In the ResNet-50 design, each stage has two kinds of blocks, a convolution block and an identity block, each containing three convolution layers. When transfer learning is used, the fully connected layer and the classification output layer are replaced by three new layers for two classes.
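The transfer-learning recipe described above (freeze the pretrained backbone, replace the final layers with a new head for the two classes) can be illustrated framework-agnostically. In this sketch a fixed random projection stands in for the frozen ResNet-50 feature extractor, and the new head is fitted by least squares on one-hot targets for brevity rather than by backpropagation; all shapes and data are illustrative assumptions, not details from Medhat et al. [19]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen ResNet-50 backbone: a fixed, untrained projection
# producing a 2048-d feature vector, the size of ResNet-50's pooled output.
W_backbone = rng.normal(size=(2048, 64)) / 8.0

def frozen_features(x):
    # Backbone weights are never updated (the "frozen" part of transfer learning).
    return np.maximum(W_backbone @ x, 0.0)

# Synthetic two-class data standing in for the lesion images.
centers = rng.normal(scale=2.0, size=(2, 64))
xs = np.array([centers[i % 2] + rng.normal(size=64) for i in range(40)])
ys = np.array([i % 2 for i in range(40)])

# The replaced final layers: a new two-class linear head. Only this head is
# fitted; here in closed form by least squares on one-hot targets.
F = np.array([frozen_features(x) for x in xs])        # (40, 2048) features
targets = np.eye(2)[ys]                               # (40, 2) one-hot labels
W_head, *_ = np.linalg.lstsq(F, targets, rcond=None)  # (2048, 2) head weights

preds = np.argmax(F @ W_head, axis=1)
accuracy = float(np.mean(preds == ys))
```

Because only the small head is fitted while the backbone stays fixed, very few labeled examples are needed, which is the economy that makes transfer learning attractive for medical imaging.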
Several existing artificial intelligence (AI) algorithms, especially deep learning algo-
rithms, have been proposed in the literature for the identification and detection of skin
cancer. Some of these works are here presented; firstly, by identifying some works
that simply used deep learning in the diagnosis of skin cancer and then identifying
some studies that show that deep learning methods are capable of outperforming human
dermatologists in the detection of skin cancer.
For melanoma detection, a very deep CNN was proposed in [20]. To improve efficiency, a fully convolutional residual network (FCRN) with 16 residual blocks was utilized in the segmentation stage. For the classification stage, the proposed scheme averaged the outputs of SVM and softmax classifiers. It achieved a melanoma classification accuracy of 85.5% with the segmentation stage and 82.8% without it.
The work in [21] proposed vision-based classification of melanoma using deep learning, specifically VGG-16, a pre-trained deep Convolutional Neural Network architecture comprising 5 convolutional blocks and 3 fine-tuned layers. The VGG-16 model recognized lesion images as melanoma skin cancer with 78% accuracy. 1200 normal skin photos and 400 photographs of skin lesions were used to train the deep learning model, and the suggested system categorized the input pictures into two elementary categories, normal skin images and lesion images, with 86.67% accuracy.

Fig. 1. A diagram of the ResNet-50 CNN architecture for skin cancer diagnosis (Source: Medhat
et al. [19])

For skin lesion categorization, [22] suggested a technique for extracting deep features from several pre-trained convolutional neural networks. The deep feature generators employed in the study included pre-trained VGGNet-16, AlexNet, and ResNet-18, whose features were then fed into a multi-class Support Vector Machine model. Finally, to carry out classification, the classifiers' results were pooled. On the ISIC 2017 dataset, the suggested technique classified seborrheic keratosis (SK) and melanoma with 97.55% and 83.83% AUC, respectively.
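A minimal sketch of this feature-extraction pipeline follows, with a fixed random projection standing in for the pre-trained VGG/AlexNet/ResNet feature generators (loading real weights is omitted) and scikit-learn's SVC playing the role of the multi-class Support Vector Machine; the data are synthetic:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in for a frozen pre-trained CNN: a fixed projection to a 256-d
# "deep feature" vector (real pre-trained weights are omitted here).
W = rng.normal(size=(256, 32))

def deep_features(img_vec):
    return np.maximum(W @ img_vec, 0.0)

# Synthetic three-class data, e.g. melanoma / seborrheic keratosis / benign.
X, y = [], []
for label, center in enumerate(rng.normal(scale=3.0, size=(3, 32))):
    for _ in range(30):
        X.append(deep_features(center + rng.normal(size=32)))
        y.append(label)
X, y = np.array(X), np.array(y)

# Multi-class SVM on the extracted deep features (one-vs-one internally).
clf = SVC(kernel="linear", C=1.0)
clf.fit(X[::2], y[::2])                 # train on half the samples...
accuracy = clf.score(X[1::2], y[1::2])  # ...and evaluate on the other half
```

In the paper's setup, features from several backbones would be extracted this way and the resulting SVM decisions pooled; the single-backbone version above shows the core extract-then-classify structure.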
The study in [23] proposed a deep learning strategy for automatically recognising and segmenting melanoma lesions that circumvents certain constraints. Each network contained two new interconnected subnetworks; this eased feature learning and extraction by bringing the semantic depth of the encoder feature maps closer to the space of the decoder. The technique used softmax classifiers to categorize melanoma tumors at the pixel level using a multi-stage, multi-scale methodology. The authors also developed a novel component known as the lesion classifier, which divides skin lesions into melanoma and non-melanoma categories based on the results of the pixel classification. The proposed technique clearly outperformed several advanced techniques, as demonstrated on two popular datasets, from the Hospital Pedro Hispano (PH2) and the International Symposium on Biomedical Imaging (ISBI) 2017 challenge. The method achieved accuracies of 95% on the ISBI 2017 dataset and 92% on the PH2 dataset, with Dice coefficients of 95% and 93%, respectively.
The work in [24] focused on deep learning techniques for the detection and diagnosis of skin cancer. The purpose of this work was to improve a CNN model for skin cancer diagnosis that could distinguish between separate types of skin cancer and aid timely detection. The segmentation and recognition approaches were implemented in Python using Keras and TensorFlow as the foundation. The prototype was created and verified using a series of network topologies and layer sizes, including convolutional layers, dropout layers, pooling layers, and dense layers, and transfer learning techniques were employed to enable rapid convergence. The data was compiled from the archives of an International Skin Imaging Collaboration (ISIC) competition and was used to build and evaluate the model. By merging the ISIC 2018 and ISIC 2019 databases, a new dataset was generated; it was cleaned up, and the most common types of skin lesions were kept: Melanocytic Nevus, Basal Cell Carcinoma, Benign Keratosis, Melanoma, Dermatofibroma, Actinic Keratosis, Vascular Lesion, and Squamous Cell Carcinoma. The models were capable of learning more rapidly as a consequence of the greater range of instances provided per class. The input pictures were modified to make the model more resilient to unknown data, which led to a large improvement in testing accuracy, and each image was normalised. The CNN was built using transfer learning with ImageNet classification weights. The model was trained and evaluated on advanced CNNs, such as Inception V3, ResNet50, VGG16, Inception ResNet, and MobileNet, to accomplish seven-class categorization of skin lesion photos. The Inception V3 and Inception ResNet models obtained a favorable assessment, with 90% and 91% accuracy, respectively, and proved sufficiently robust to classify lesion pictures into one of seven categories [22].
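The normalisation step mentioned above is commonly a per-channel standardisation; a minimal sketch follows (computing the statistics from the batch itself is one common choice, not necessarily the one used in [24]):

```python
import numpy as np

def normalise_batch(images):
    """Standardise a batch of HxWxC images to zero mean, unit variance per channel."""
    mean = images.mean(axis=(0, 1, 2), keepdims=True)   # one mean per channel
    std = images.std(axis=(0, 1, 2), keepdims=True)     # one std per channel
    return (images - mean) / (std + 1e-7)               # epsilon avoids divide-by-zero

batch = np.random.default_rng(0).uniform(0, 255, size=(8, 64, 64, 3))
normed = normalise_batch(batch)
```

Bringing raw 0–255 pixel values to a comparable zero-mean scale is what makes the input "resilient" across images and helps gradient-based training converge.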
The research in [25] utilised a fine-grained classification concept to develop a classifier model that addresses the difficulties inherent in the automatic recognition of dermoscopy image lesions, which arise from the complex image background and lesion features. The model was constructed using MobileNet and DenseNet and incorporated two standard feature extraction components derived from a lesion classification approach, as well as a feature discriminating network. In the suggested technique, two types of training images were fed into the identification model's feature extraction module. The resulting two sets of feature maps were used to construct both binary classification networks and feature discriminating networks for the detection job. Using this strategy, the identification methodology can extract more discriminative lesion characteristics and increase the model's performance with a small number of model parameters. With certain model parameters tuned, the proposed method was shown to produce better segmentation results, achieving 96.2% accuracy.
Research work in [26] suggested a multi-scale Convolutional Neural Network based on an ImageNet-trained Inception V3 CNN. The authors fine-tuned the pre-trained Inception V3 to perform skin cancer classification on coarse and fine scales of the lesion images. The coarse-scale image resolution was used to capture the shape attributes of lesions as well as their overall context, while the finer-scale image resolution captured textural information about the lesion in order to distinguish amongst diverse forms of skin lesions.
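Preparing the coarse and fine input scales for such a multi-scale network can be sketched with simple block-average downsampling; the scale factors here are illustrative assumptions, not values from [26]:

```python
import numpy as np

def downsample(img, factor):
    """Block-average an HxWxC image by an integer factor (a simple stand-in
    for proper image resizing)."""
    h, w, c = img.shape
    h2, w2 = h // factor, w // factor
    return img[:h2 * factor, :w2 * factor].reshape(
        h2, factor, w2, factor, c).mean(axis=(1, 3))

lesion = np.random.default_rng(0).random((256, 256, 3))
fine = downsample(lesion, 2)     # finer scale: keeps texture detail
coarse = downsample(lesion, 8)   # coarser scale: shape and overall context
# fine.shape == (128, 128, 3), coarse.shape == (32, 32, 3)
```

Each scale would then be fed to its own fine-tuned branch of the network, with the branch outputs combined for the final lesion prediction.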
To compare how deep learning techniques perform side-by-side with human dermatologists, researchers have done a number of studies. Using the International Skin Imaging Collaboration ISIC-2016 dataset, Codella et al. [27] constructed an ensemble of deep learning techniques and contrasted them with the abilities of 8 dermatologists in identifying whether 100 skin lesions were malignant or benign. The ensemble, which included convolutional neural networks, deep residual networks, and a fully convolutional U-Net architecture, was able to segment skin lesions and detect melanoma in the identified area and surrounding tissue. The aggregated deep learning algorithms' performance exceeded that of the dermatologists, with 76% accuracy and 62% specificity as against the respective 70.5% and 59% for the dermatologists [27].
Haenssle et al. [28] trained a deep learning model using the popular InceptionV4 architecture on a huge dermoscopic dataset containing over a hundred thousand benign and malignant lesion images, and 58 dermatologists were used to evaluate the deep learning model's performance. Two levels of diagnosis were used. Only dermoscopy was employed at the first level, while dermoscopy was employed in conjunction with medical data and patient photographs at the second level. In the study, dermatologists documented 86.6% sensitivity and 71.3% specificity at the first level. At the second level, the sensitivity and specificity climbed to 88.9% and 75.7%, respectively. The specificity enhancement was significant (p = 0.05), while the increase in sensitivity was statistically insignificant (p = 0.19). The CNN model had vastly greater specificity than the dermatologists at both the first level (p = 0.01) and the second level (p = 0.01). The CNN surpassed several dermatologists in the trial, implying a possible role in the identification of melanoma via dermoscopic images [28].
In a similar study, Haenssle et al. [29] evaluated an InceptionV4-based deep learning architecture against dermatologists on a dermoscopic test set of 100 examples. There were two levels to this study: a dermoscopy photo was provided in stage 1, while a clinical close-up photo, a dermoscopy photo, and patient information were provided in stage 2. The sensitivity and specificity of the dermatologists in stage 1 were 89% and 80.7%, respectively, compared to 95% and 76.7% for the CNN system. The dermatologists' mean sensitivity climbed to 94.1% with the additional information in stage 2, while their average specificity remained unchanged [29].
Similar findings were reported in another investigation by Brinker et al. [30]. The researchers employed ResNet-50, a CNN architecture, to evaluate against the performance of 157 dermatologists on 100 dermoscopy photos. The dermatologists achieved 74.1% overall sensitivity and 60% specificity, while the deep learning model achieved 84.2% sensitivity and 69.2% specificity. In a head-to-head matchup, the deep learning model performed better than 86.6% of the dermatologists in the study. As a result, it was concluded that the deep learning model has huge potential to help physicians make accurate melanoma diagnoses [30].
Research in [31] utilized a heterogeneous dataset of 7895 dermoscopic images and 5829 close-up lesion images to detect non-pigmented skin cancers via InceptionV3 and ResNet50 CNNs. The findings of the CNNs were evaluated against those of 95 dermatologists who were divided into three categories based on their level of expertise. The CNN algorithms outperformed the beginner and intermediate rater categories and achieved accuracy comparable to expert human raters. The findings revealed that the CNNs made correct diagnoses in a greater number of instances than the dermatologists in the study taken together, but not when matched against professionals with over a decade of experience [31].
With the participation of 112 German dermatologists, [32] investigated the specificity and sensitivity of a ResNet50 CNN for multi-class classification of skin lesions. On the primary end-point, the dermatologists detected skin lesions with about 74.4% sensitivity and 59.8% specificity; at a comparable level of sensitivity, the deep learning algorithm's specificity was 91.3%. On the secondary end-point of correctly assigning a particular picture to one of the five considered classes, the dermatologists had 56.5% sensitivity and 89.2% specificity, while the deep learning algorithm had 98.8% specificity at the same sensitivity level. The deep learning system outperformed the dermatologists significantly (p < 0.001). The deep learning system outperformed the 112 dermatologists in all classes except basal cell carcinoma, where it performed similarly to them [32].
From the evidence in the literature, it can be seen that deep learning algorithms perform well when trained on big data. However, they are effective only when there are enough labeled samples in the domain of interest, and they do not guarantee network configurations that can swiftly adapt to new domains of interest when data in the target domain is insufficient [8].

3 The Basis for the Few-Shot Learning Paradigm


As it has been emphasized in this paper, the limited availability of labeled data in the
target domain has been one of the most challenging aspects of applying deep learning
to medical applications. This is primarily due to the high cost of hiring a professional
clinician to evaluate the patient's health condition [33]. Data augmentation is one way of dealing with this constraint; it involves creating artificial instances from the original data, although it does not fully solve the problem [34, 35]. Transfer learning is another extensively utilized approach. This method involves training a network with knowledge gleaned from a related domain, then migrating the weights and biases to a new network and fine-tuning it on the target domain [33]. The disadvantage of this strategy is that it performs poorly whenever the quantity of target data is insufficient or the target data has even a modest distribution shift [36].
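As an illustration of the transfer learning idea described above, the following minimal Python sketch freezes a feature extractor standing in for source-domain pretraining and fine-tunes only a new classification head on a small target set. All data, dimensions, learning rates, and the random "pretrained" weights are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained feature extractor: a fixed projection whose
# weights are "migrated" from a source domain and then frozen.
W_frozen = rng.normal(size=(16, 8))

def features(x):
    return np.tanh(x @ W_frozen)  # never updated during fine-tuning

# Small labeled target-domain set (e.g., a handful of lesion images).
X = rng.normal(size=(20, 16))
v_true = rng.normal(size=8)                    # hidden labeling rule
y = (features(X) @ v_true > 0).astype(float)

# Fine-tune only the classification head (logistic regression).
w, b = np.zeros(8), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(features(X) @ w + b)))
    grad = p - y                               # dL/dlogit for cross-entropy
    w -= 0.1 * features(X).T @ grad / len(X)
    b -= 0.1 * grad.mean()

acc = (((features(X) @ w + b) > 0) == y.astype(bool)).mean()
print(f"training accuracy after fine-tuning the head: {acc:.2f}")
```

Only the head parameters `w` and `b` are updated; a real pipeline would fine-tune some or all of the pretrained layers as well, which is exactly where the distribution-shift weakness noted above appears.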
Few-shot learning paradigms have therefore been proposed in the literature to promote learning from limited quantities of labeled data, mimicking the ability of humans to generalize new knowledge from a small number of examples [37]. With only a few training images, few-shot strategies create models that are capable of quickly adapting to new tasks and settings. The basic concept is to learn initial parameters for the model such that it performs well on a novel task once those parameters have been updated using one or more gradient steps estimated from the small quantity of information acquired from that task [8].
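The gradient-step adaptation idea can be illustrated with a deliberately tiny sketch: a toy one-parameter regression task stands in for a novel lesion class with only five labeled examples, and a few gradient steps from an initial parameter value sharply reduce the task loss. The task, data, and step size are invented; a real meta-learner such as MAML would additionally learn the initialization itself:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "new task": 1-D linear regression with only K = 5 labeled examples,
# standing in for a rare lesion class with few images.
true_slope = 2.5
x = rng.normal(size=5)
y = true_slope * x

def loss(theta):
    return ((theta * x - y) ** 2).mean()

# theta plays the role of the meta-learned initial parameters; here it is
# just a fixed starting value for illustration.
theta = 0.0
losses = [loss(theta)]
for _ in range(3):                      # a few gradient steps, as in an inner loop
    grad = (2 * (theta * x - y) * x).mean()
    theta -= 0.2 * grad
    losses.append(loss(theta))

print("loss before vs after adaptation:", losses[0], losses[-1])
```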
282 O. Akinrinade et al.

The goal of few-shot learning is to acquire accurate class representations from a small number of training examples [10, 11, 38].
Some approaches to few-shot learning include matching networks [10], which develop an attention mechanism over the support set to anticipate query set labels for novel classes, and prototypical networks [38], which train embeddings and centroid representations (as class prototypes) together to categorize novel samples using Euclidean distance. In both works [10, 38], embeddings are trained end-to-end and training utilizes an episodic sampling approach [39].
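The classification rule of prototypical networks [38], in which class prototypes are mean support embeddings and queries are assigned to the nearest prototype under squared Euclidean distance, can be sketched as follows. The 2-D "embeddings" are synthetic stand-ins for the output of a learned encoder:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy embeddings: in a real prototypical network these come from a trained
# CNN; here each class is a Gaussian cluster in a 2-D embedding space.
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])   # 3 classes
support = {c: centers[c] + 0.3 * rng.normal(size=(5, 2)) for c in range(3)}  # 5-shot

# Class prototypes are the mean support embedding per class.
prototypes = np.stack([support[c].mean(axis=0) for c in range(3)])

def classify(query):
    # Nearest prototype under squared Euclidean distance, as in [38].
    d = ((prototypes - query) ** 2).sum(axis=1)
    return int(d.argmin())

query = centers[1] + 0.3 * rng.normal(size=2)   # a new sample of class 1
print("predicted class:", classify(query))
```

During training, the same distances are fed through a softmax and optimized with cross-entropy, so the encoder learns embeddings that cluster by class.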

4 Research Progress in the Application of Few-Shot Learning to Skin Cancer Detection
Some researchers have applied the few-shot learning techniques in detecting skin cancer.
This section focuses on recent research on the application of few-shot learning to the identification and detection of skin cancer (melanoma), for which only a very small amount of labeled data is available. An architecture for few-shot learning was put forward in the work
of Mahajan, Sharma, and Vig [8]. The diagram of the few-shot meta-learning concept
they designed is depicted in Fig. 2. In the architecture, they proposed a pipeline with
two phases: meta-training and meta-testing. The meta-training phase comprises a meta-learner that trains a neural network on a considerable number of few-shot image recognition tasks generated from a set of labeled training classes of skin cancer types, finding optimal network initialization parameters for a prototype model.
Usually, a distance-metric-based Prototypical network or meta-gradient approach may be
used as the meta-learning technique. The model is then updated to classify images on a
fresh collection of unseen/unusual classes with relatively few samples in the meta-testing
stage.
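The episode generation underlying such a meta-training pipeline can be sketched as follows. The class and image identifiers are placeholders, and a real pipeline would yield image tensors rather than ids:

```python
import random

# A hypothetical labeled pool: class name -> list of image ids. In practice
# these would be dermoscopic images of common disease classes.
pool = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)}

def sample_episode(pool, n_way=5, k_shot=5, n_query=3, seed=None):
    """Build one N-way K-shot episode: a support set to adapt on and a
    query set to evaluate on, mimicking meta-test conditions."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(pool), n_way)
    support, query = {}, {}
    for c in classes:
        picks = rng.sample(pool[c], k_shot + n_query)
        support[c] = picks[:k_shot]       # used to form prototypes / adapt
        query[c] = picks[k_shot:]         # used to compute the episode loss
    return support, query

support, query = sample_episode(pool, seed=0)
print(len(support), "classes;", len(next(iter(support.values()))), "shots each")
```

Repeatedly sampling such episodes during meta-training exposes the model to the same "few examples, unseen classes" regime it will face at meta-test time.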

Fig. 2. An overview of few-shot meta-learning architecture based on prototypical network. (Source: Mahajan, Sharma, and Vig [8])

With the focus on using few-shot learning for the diagnoses of dermatological dis-
ease, Prabhu et al. [39] stated that conventional off-the-shelf methods of identifying
a dermatological disease from photographs confront two major obstacles. One problem
is that real-world dermatological data populations are frequently long-tailed, and there is
a great deal of intra-class heterogeneity. They characterized the first problem as low-shot
learning, wherein a base classifier must adapt quickly to detect new situations after being
deployed with a minimal number of labeled examples. They demonstrated Prototypical Clustering Networks (PCN), a Prototypical Networks extension [36], which successfully reflects intra-class variability by learning a mixture of "prototypes" for each class. Prototypes for every class were initialized using clustering and then refined using an online updating technique. Samples were classified by matching their resemblance, within each class, to a weighted blend of prototypes, with the weights representing the predicted cluster assignments. They trained a 50-layer ResNet-v2, an advanced CNN architecture
for image categorization, using ImageNet pre-training. They demonstrated the capability
of the suggested method for effective identification on a real dataset of dermatological illnesses. They reported the mean class accuracy (MCA) on the test set over the top 200 classes available at test time, for two low-shot setups with 5 and 10 shots in the training set and 5 shots at test time. Their strategy, PCN, achieved an MCA of 30.0 ± 2.8 for novel classes in the 5-shot task and an MCA of 49.6 ± 2.8 for new classes in the 10-shot task. Since the approach leverages episodic training to build discriminative visual features that can be extrapolated to new classes with minimal sample sizes, it achieves substantially better (around nine percent) improvements over previous approaches in generalizing to new categories.
In low-data and heavy-tailed data distribution domains, [8] investigated meta-
learning-based algorithms like Prototypical networks, a distance metric-based learn-
ing approach, and Reptile, a gradient-based approach for identifying skin lesions from
clinical images. The suggested network is called “Meta-DermDiagnosis,” and it uses a
meta-learning approach to allow deep neural networks learned on the dataset of popu-
lar diseases to quickly adapt to unusual conditions with considerably less labeled data.
It comprises a meta-learner that trains the neural network on a series of few-shot pic-
ture classification tasks using a baseline set of class labels. To select optimal network
initialization weights, class labels are sampled from the top of the class distribution,
and the model is then customized to classify images on a new set of unknown classes
with only a few occurrences. Additionally, they showed that in the case of skin lesion
image classification, utilizing Group Equivariant convolutions (G-convolutions) in Meta-
DermDiagnosis considerably enhances the network’s efficiency because orientation is
often not a significant aspect in such images. Using three publicly accessible skin lesion
classification datasets; Derm7pt, SD-198, and the ISIC 2018, they evaluated the per-
formance of the proposed technique employing Reptile and Prototypical networks and
compared it to the pre-trained transfer learning baseline. The study's findings showed that Reptile with G-convolutions outperformed the other techniques, such as pre-training and Prototypical networks, in low-data skin lesion classification, with an average accuracy of 82.1 on the ISIC 2018 dataset, 76.9 on the Derm7pt dataset, and 83.7 on the SD-198 dataset.
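The Reptile outer update used in such gradient-based meta-learning is simple enough to sketch in a few lines: adapt a copy of the shared initialization on one sampled task with ordinary SGD, then move the initialization a fraction of the way toward the adapted weights. The toy linear-regression tasks, dimensions, and step sizes below are invented, standing in for few-shot lesion-classification episodes:

```python
import numpy as np

rng = np.random.default_rng(3)

def adapt(theta, x, y, lr=0.1, steps=5):
    """A few SGD steps on one task's data (the inner loop)."""
    for _ in range(steps):
        grad = 2 * x.T @ (x @ theta - y) / len(x)   # MSE gradient
        theta = theta - lr * grad
    return theta

theta = np.zeros(4)                     # shared initialization
eps = 0.5                               # Reptile meta step size
for _ in range(50):                     # meta-iterations over sampled tasks
    w_task = rng.normal(size=4)         # each task has its own target weights
    x = rng.normal(size=(10, 4))
    y = x @ w_task
    theta_adapted = adapt(theta, x, y)
    theta = theta + eps * (theta_adapted - theta)   # Reptile outer update

print("norm of meta-learned initialization:", np.linalg.norm(theta))
```

Unlike MAML, Reptile never differentiates through the inner loop, which is why it is attractive when compute or data per task is limited.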
In the work “Few-shot learning for skin lesion image classification”, [40] used the
improved Relational Network for metric learning to achieve the categorization of skin
disease based on a limited amount of annotated skin lesion image data. The technique
employed a relative position network (RPN) and a relative mapping network (RMN), in
which the RPN collects and extracts feature representations via an attention mechanism,
and the RMN uses a weighted sum of attention mapping distance to determine image
categorization similarity. On the public ISIC melanoma dataset, the average classification
accuracy obtained is 85%, demonstrating the technique’s efficacy and practicability.
For skin lesion segmentation, [41] proposed a few-shot segmentation network that
only needs minimal pixel-level labeling. Firstly, the co-occurrence area between the
support image and the query image was collected, and this was then utilized as a prior
mask to remove extraneous background areas. Secondly, the findings were combined and submitted to the inference module, which predicts the query image segmentation. Thirdly,
using the symmetrical structure, the network was retrained by inverting the support and
query roles. Extensive tests on ISIC-2017, ISIC-2019, and PH2 show that the method
provides a promising framework for few-shot skin lesion segmentation.
Lastly, building on the Internet of Medical Things, [42] developed a few-shot prototype network to alleviate the paucity of annotated samples. First, a contrastive learning branch was designed to improve the feature extractor's capabilities. Second, a unique technique for creating positive and negative sample pairings for contrastive learning was proposed, which was reported to remove the need to explicitly maintain a sample queue. Third, a dissimilarity learning branch was utilized to correct corrupted data and develop the category prototype. Finally, to increase classification accuracy and convergence speed, a hybrid loss combining prototype and contrastive losses was applied. On the mini-ISIC-2i and mini-ImageNet datasets, their technique was reported to have performed considerably well [42].
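The hybrid-loss idea, a weighted combination of a prototype classification loss and a contrastive loss, can be sketched as follows. The embeddings, the InfoNCE-style contrastive term, and the weighting factor `lam` are illustrative assumptions rather than the exact formulation of [42]:

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax_xent(logits, target):
    logits = logits - logits.max()                  # numerical stability
    log_p = logits - np.log(np.exp(logits).sum())
    return -log_p[target]

# Toy embeddings for a 2-way, 3-shot episode plus one query.
emb = {0: rng.normal(size=(3, 2)), 1: 3.0 + rng.normal(size=(3, 2))}
query, q_label = emb[1].mean(axis=0) + 0.1, 1

# Prototype loss: cross-entropy over negative distances to class prototypes.
protos = np.stack([emb[c].mean(axis=0) for c in (0, 1)])
proto_logits = -((protos - query) ** 2).sum(axis=1)
loss_proto = softmax_xent(proto_logits, q_label)

# Contrastive loss: pull the query toward a same-class sample, push it away
# from a different-class sample (a minimal InfoNCE-style pair).
sim = np.array([emb[1][0] @ query, emb[0][0] @ query])  # positive first
loss_contrast = softmax_xent(sim, 0)

lam = 0.5                                # hypothetical weighting factor
hybrid = loss_proto + lam * loss_contrast
print(f"prototype {loss_proto:.3f} + {lam} * contrastive {loss_contrast:.3f} = {hybrid:.3f}")
```

Minimizing the combined objective shapes the embedding space with both the prototype geometry and instance-level discrimination, which is credited with the faster convergence reported in [42].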

5 Conclusion
This paper has provided a brief overview of some of the relevant applications of deep
learning and few-shot learning techniques in the diagnosis of skin cancer lesions using
skin imaging data. As pointed out from evidence in the literature and in practice, it has
been identified that deep learning algorithms perform admirably when trained on huge
datasets. The deep learning algorithms have, however, been seen to be successful mostly when the target domain has enough annotated instances; they do not ensure network parameters that can swiftly adapt to novel target domains when the target domain lacks data. This makes deep learning unsuitable for model
building in circumstances where data is limited and thus necessitates the development
of appropriate techniques for such data-scarce situations. In this regard, this paper has
explored the application of few-shot learning methods in the detection of various classes
of melanoma, banking on the ability of the few-shot learning techniques to learn from
limited amounts of labeled data in the classes. The models trained in these few-shot-
based approaches have been shown to have considerably good detection performance.
In future work, attempts will be made at employing few-shot learning potentials and
deep-learning feature extraction capabilities in the skin cancer detection domain with a
view to improving the detection performance.
References
1. Das, K., et al.: Machine learning and its application in skin cancer. Int. J. Environ. Res. Public
Health 18, 1–10 (2021)
2. Ferlay, J., et al.: Cancer statistics for the year 2020: an overview. Int. J. Cancer (2021). https://
doi.org/10.1002/ijc.33588
3. Ashraf, R., et al.: Region-of-interest based transfer learning assisted framework for skin cancer
detection. IEEE Access 8, 147858–147871 (2020)
4. Elgamal, M.: Automatic skin cancer images classification. Int. J. Adv. Comput. Sci. Appl. 4
(2013)
5. Dildar, M., et al.: Skin cancer detection: a review using deep learning techniques. Int. J.
Environ. Res. Public Health 18 (2021)
6. Li, C.X., et al.: Artificial intelligence in dermatology: past, present, and future. Chin. Med. J.
132, 2017–2020 (2019)
7. Voulodimos, A., Doulamis, N., Doulamis, A., Protopapadakis, E.: Deep learning for computer
vision: a brief review. Comput. Intell. Neurosci. 2018 (2018)
8. Mahajan, K., Sharma, M., Vig, L.: Meta-dermdiagnosis: few-shot skin disease identification
using meta-learning. In: IEEE Computer Society Conference on Computer Vision and Pattern
Recognition Workshops, pp. 3142–3151, June 2020
9. Koch, G.: Siamese neural networks for one-shot image recognition (2015)
10. Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K.: Matching networks for one shot learning. In: Advances in Neural Information Processing Systems (2016)
11. Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T.: One-shot learning with memory-augmented neural networks (2016)
12. Snell, J., Swersky, K., Zemel, R.S.: Prototypical networks for few-shot learning. In: Advances in Neural Information Processing Systems (2017)
13. National Center for Biotechnology Information (NCBI) [Internet]. No Title. Bethesda (MD):
National Library of Medicine (US), National Center for Biotechnology Information
14. Khan, S., Rahmani, H., Shah, S.A.A., Bennamoun, M.: A guide to convolutional neural
networks for computer vision. Synth. Lect. Comput. Vis. 8, 1–207 (2018)
15. Indolia, S., Goswami, A.K., Mishra, S.P., Asopa, P.: Conceptual understanding of convo-
lutional neural network-a deep learning approach. Procedia Comput. Sci. 132, 679–688
(2018)
16. Oyetade, I.S., Ayeni, J.O., Ogunde, A.O., Oguntunde, B.O., Olowookere, T.A.: Hybridized
deep convolutional neural network and fuzzy support vector machines for breast cancer
detection. SN Comput. Sci. 3(1), 1–14 (2021). https://doi.org/10.1007/s42979-021-00882-4
17. Usmani, U.A., Watada, J., Jaafar, J., Aziz, I.A., Roy, A.: A reinforcement learning algorithm
for automated detection of skin lesions. Appl. Sci. 11 (2021)
18. Esteva, A., et al.: Dermatologist-level classification of skin cancer with deep neural networks.
Nature 542, 115–118 (2017)
19. Medhat, S., Abdel-Galil, H., Aboutabl, A.E., Saleh, H.: Skin cancer diagnosis using convo-
lutional neural networks for smartphone images: a comparative study. J. Radiat. Res. Appl.
Sci. 15, 262–267 (2022)
20. Yu, L., Chen, H., Dou, Q., Qin, J., Heng, P.-A.: Automated melanoma recognition in dermoscopy images via very deep residual networks. IEEE Trans. Med. Imaging 36, 994–1004
(2017)
21. Kalouche, S.: Vision-based classification of skin cancer using deep learning. Stanford’s
machine learning course (CS 229) (2016)
22. Mahbod, A., Schaefer, G., Wang, C., Ecker, R., Ellinge, I.: Skin lesion classification using
hybrid deep neural networks. In: ICASSP, IEEE International Conference on Acoustics,
Speech and Signal Processing - Proceedings, pp. 1229–1233, May 2019
23. Adegun, A.A., Viriri, S.: Deep learning-based system for automatic melanoma detection.
IEEE Access 8, 7160–7172 (2020)
24. Nahata, H., Singh, S.P.: Deep learning solutions for skin cancer detection and diagnosis. In:
Jain, V., Chatterjee, J.M. (eds.) Machine Learning with Health Care Perspective. LAIS, vol.
13, pp. 159–182. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-40850-3_8
25. Wei, L., Ding, K., Hu, H.: Automatic skin cancer detection in dermoscopy images based on
ensemble lightweight deep learning network. IEEE Access 8, 99633–99647 (2020)
26. DeVries, T., Ramachandram, D.: Skin lesion classification using deep multi-scale convolu-
tional neural networks (2017)
27. Codella, N.C.F., et al.: Deep learning ensembles for melanoma recognition in dermoscopy
images. IBM J. Res. Dev. 61, 1–28 (2017)
28. Haenssle, H.A., et al.: Man against machine: diagnostic performance of a deep learning
convolutional neural network for dermoscopic melanoma recognition in comparison to 58
dermatologists. Ann. Oncol. 29, 1836–1842 (2018)
29. Haenssle, H.A., et al.: Man against machine reloaded: performance of a market-approved
convolutional neural network in classifying a broad spectrum of skin lesions in comparison
with 96 dermatologists working under less artificial conditions. Ann. Oncol. 31, 137–143
(2020)
30. Brinker, T.J., et al.: Deep learning outperformed 136 of 157 dermatologists in a head-to-head
dermoscopic melanoma image classification task. Eur. J. Cancer 113, 47–54 (2019)
31. Tschandl, P., et al.: Expert-level diagnosis of nonpigmented skin cancer by combined
convolutional neural networks. JAMA Dermatol. 155, 58–65 (2019)
32. Maron, R.C., et al.: Systematic outperformance of 112 dermatologists in multiclass skin cancer
image classification by convolutional neural networks. Eur. J. Cancer 119, 57–65 (2019)
33. Garcia, S.I.: Meta-learning for skin cancer detection using deep learning techniques, pp. 1–7
(2021)
34. Wong, S.C., Gatt, A., Stamatescu, V., McDonnell, M.D.: Understanding data augmentation for
classification: when to warp? In: 2016 International Conference on Digital Image Computing:
Techniques and Applications, DICTA 2016 (2016). https://doi.org/10.1109/DICTA.2016.7797091
35. Mikołajczyk, A., Grochowski, M.: Data augmentation for improving deep learning in image
classification problem. In: 2018 International Interdisciplinary PhD Workshop, IIPhDW 2018,
pp. 117–122 (2018). https://doi.org/10.1109/IIPHDW.2018.8388338
36. Kumar, V., Glaude, H., de Lichy, C., Campbell, W.: A closer look at feature space data augmen-
tation for few-shot intent classification. In: DeepLo@EMNLP-IJCNLP 2019 - Proceedings
of the 2nd Workshop on Deep Learning Approaches for Low-Resource Natural Language
Processing, pp. 1–10 (2021). https://doi.org/10.18653/v1/d19-6101
37. Duan, R., et al.: A survey of few-shot learning: an effective method for intrusion detection.
Secur. Commun. Netw. 2021 (2021)
38. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: Advances
in Neural Information Processing Systems, pp. 4078–4088, December 2017
39. Prabhu, V., et al.: Few-shot learning for dermatological disease diagnosis. Proc. Mach. Learn.
Res. 106, 1–15 (2019)
40. Liu, X.J., Li, K., Luan, H., Wang, W., Chen, Z.: Few-shot learning for skin lesion image
classification. Multimedia Tools Appl. (2022). https://doi.org/10.1007/s11042-021-11472-0
41. Xiao, J., Xu, H., Zhao, W., Cheng, C., Gao, H.: A prior-mask-guided few-shot learning for
skin lesion segmentation. Computing (2021)
42. Xiao, J., Xu, H., Fang, D., Cheng, C., Gao, H.: Boosting and rectifying few-shot learning
prototype network for skin lesion classification based on the internet of medical things. Wirel. Netw. (2021)
Enhancing Artificial Intelligence Control
Mechanisms: Current Practices, Real Life
Applications and Future Views

Usman Ahmad Usmani1 , Ari Happonen2(B) , and Junzo Watada3


1 Department of Computer and Information Science, Universiti Teknologi Petronas, 79
LakeVille Seri Iskandar, 32610 Seri Iskandar, Perak, Malaysia
2 LUT School of Engineering Science, LUT University, Yliopistonkatu 34, 53850
Lappeenranta, Finland
[email protected]
3 Waseda University, 1 Chome-104 Totsukamachi, Shinjuku City, Tokyo 169-8050, Japan

Abstract. The popularity of Artificial Intelligence has grown lately with the potential it promises for revolutionizing a wide range of different sectors. To achieve this change, the whole community must overcome the explainability barrier of Machine Learning (ML), an inherent obstacle of current sub-symbolic approaches such as Deep Neural Networks, which did not exist during the previous AI wave of expert and rule-based systems. Due to lack of transparency, privacy concerns, biased systems, and lack of governance and accountability, our society demands toolsets for creating responsible AI solutions that enable unbiased AI systems. These solutions will help business owners create AI applications that are trust-enhancing, open, transparent, and explainable. Properly made systems will enhance trust among employees, business leaders, customers, and other stakeholders. The process of overseeing artificial intelligence usage and its influence on related stakeholders belongs to the context of AI governance. Our work gives a detailed overview of a governance model for Responsible AI, emphasizing fairness, model explainability, and responsibility in large-scale AI technology deployment in real-world organizations. Our goal is to provide model developers in an organization with a comprehensive governance framework for understanding Responsible AI, outlining the different roles and their key responsibilities. The results serve as a reference for future research and are aimed at encouraging area experts from other disciplines to embrace AI in their own business sectors without interpretability shortcomings and biases.

Keywords: AI governance · Responsible AI · Real life applications · eXplainable AI · Three lines model

1 Introduction
From business to healthcare, sustainability, product design, and industrial and educational contexts alike, innovations in AI and Industry 4.0 are delivering new opportunities to improve people's lives all across the globe [1, 3, 10, 19, 23, 26, 32].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 287–306, 2023.
https://doi.org/10.1007/978-3-031-18461-1_19

This growth does, though, raise problems of fairness and inclusion, of system security, and of how to bring privacy into these systems effectively [2]. User-centered AI systems should consider both general issues and issues specific to machine learning (ML) [4–8, 37]. Understanding the genuine impact of a system's estimates, suggestions, operational models, process principles, and decisions depends on how users interact with systems and operations [8, 9, 52, 53]. Design attributes such as adequate disclosures, clarity, and control are also required for a good user experience, and are the usual parameters behind estimations of the revenue streams and value of AI solutions [11]. Considering augmentation and assistance, a single solution is suitable if it is designed to serve a broad range of users and use cases. In certain circumstances, giving the user a limited set of alternatives is beneficial to the system. Precision across many answers is significantly more challenging to achieve than accuracy over a single solution [9]. To evaluate overall system performance and short- and long-term product health, consider metrics such as click-through rate and customer lifetime value, as well as subgroup-specific false positive and false negative rates [12]. It should be ensured that the metrics are suitable for the context and purpose of the system; e.g., a fire alarm system should have a high recall, even if there are some false alarms [13, 14]. Since ML models reflect the data they are trained on, the comprehension of the raw data should be double-checked. If this is not possible, such as with sensitive raw data, it is advised to make the most of the information while maintaining privacy by distributing aggregated summaries.
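The metric checks mentioned above, overall recall for a fire-alarm-like system plus subgroup-specific false positive and false negative rates, can be sketched with invented data as follows:

```python
# Minimal sketch of the metric checks discussed above. The subgroup labels
# and predictions are invented purely for illustration.
records = [  # (subgroup, true_label, predicted_label)
    ("A", 1, 1), ("A", 1, 0), ("A", 0, 0), ("A", 0, 1),
    ("B", 1, 1), ("B", 1, 1), ("B", 0, 0), ("B", 0, 0),
]

def rates(rows):
    tp = sum(1 for _, t, p in rows if t == 1 and p == 1)
    fn = sum(1 for _, t, p in rows if t == 1 and p == 0)
    fp = sum(1 for _, t, p in rows if t == 0 and p == 1)
    tn = sum(1 for _, t, p in rows if t == 0 and p == 0)
    recall = tp / (tp + fn)   # fire-alarm systems want this high
    fpr = fp / (fp + tn)      # false positive rate
    fnr = fn / (fn + tp)      # false negative rate
    return recall, fpr, fnr

overall = rates(records)
by_group = {g: rates([r for r in records if r[0] == g]) for g in ("A", "B")}
print("overall recall/FPR/FNR:", overall)
print("per subgroup:", by_group)
```

Comparing the per-subgroup rates against the overall numbers is what exposes the kind of disparity a single aggregate metric hides.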
The simplest model that satisfies the performance objectives should be used [15]. Users also need to be made aware of system limits; e.g., an ML application made to identify selected bird species should reveal that the model was trained on a tiny sample of images from a particular area of the globe [16]. The quantity of feedback increases when people are better educated about the product or application. Sound test processes and quality engineering practices should be adopted from software engineering to ensure that the AI system operates as expected and can be trusted. To include a wide variety of consumer needs in development cycles, iterative user testing should be done [17]. A quality engineering approach then builds quality checks into the system, ensuring that unplanned errors are prevented or handled swiftly (for example, if a critical feature is suddenly absent, the AI system will not generate a prediction). The model should be constantly monitored against real-world performance and user input (e.g., happiness tracking surveys, the HEART framework) [13, 18].
By definition, all models of the world are flawed; there is no such thing as a 100% perfect system. It is recommended to make time in the product strategy for troubleshooting, and both short- and long-term solutions should be taken into account. While a fast remedy can temporarily solve an issue, it is usually not a long-term answer to the problem, so long-term learning solutions should nevertheless be linked with quick fixes. Before altering a deployed model, the variation between the candidate and the deployed model must be considered, along with how the change will affect overall system quality and user experience [20, 21]. The critical point is that AI success is contingent on a group, not a single person or position [22]. According to Collins, diverse disciplines increase awareness of the scientific process and of thinking in complicated, interacting systems, and their members generally have the critical thinking abilities needed to conduct good experiments and analyze the results of ML applications. Having a varied staff has several advantages [14].

Fig. 1. An overview of the three lines model. [13]

Sitting back and hoping for diversity to come to an individual isn’t a realistic team-
building approach [7, 24, 25]. The following are the contributions of this manuscript:

• We explain the techniques and moral principles of AI ethics, which relate to the responsible use of AI technology and are intended to inform development.
• AI governance is explained from a perspective of enabling organizations to trust
AI-powered outputs to automate current or new business processes for gains in time-
to-market advantage at any stage of the real-life application development process.
• A detailed overview of the Three Lines Model (shown in Fig. 1) for AI governance is given to help organizations explain and reinforce the fundamental principles, broaden the scope, and illustrate how key organizational activities interact to improve governance and risk management when building a safe AI application.

2 Literature Review
AI’s ethical deployment and governance are essential for permitting the large-scale
deployment of AI systems required to improve people’s welfare and safety. Yet, AI
development typically outpaces regulatory development in many areas. The technique
also enables AI developers to address typical challenges in automated systems, such as
reducing social bias reinforcement, keeping people’s jobs and talents, resolving respon-
sibility to ensure confidence in an algorithm’s results, and more [27]. Commercial AI
systems in radiation clinics have just lately been developed, in contrast to the aerospace
sector, with efforts concentrated on showing performance in academic or clinical set-
tings, as well as product approval [28, 29]. Until recently, commercial AI systems for radiation were only available as static products, allowing cancer specialists to analyze their effectiveness. Although an agile lifecycle management strategy, where AI-based
segmentation models are updated with new patient data regularly, sounds appealing, it
is unlikely to be accessible anytime soon.
Continuous quality monitoring of linear accelerators and vital software systems for
treatment planning and radiotherapy department operations benefit from the same qual-
ity assurance and monitoring. In general, it is a good practice to study different fields
for their monitoring practices [38, 47, 49], in addition of the specific field where AI
is applied to. Before an AI can be deployed, it must first go through a thorough and
transparent examination of the ethical implications of its proposed activities, particu-
larly in terms of social impact, but also in terms of safety and bias, before moving on
to a five-layer high-frequency checking system to ensure that the AI’s decisions are
correct and trustworthy. These characteristics are expectation confinement, synthetic
data exercise, independence, comprehensiveness, and data corruption assurance. Dose
calculation in therapeutic decision support systems, atlas-based auto-segmentation, and
magnetic resonance imaging benefit from similar methodologies [30].
In recent years, (deep) neural networks and machine learning (ML) approaches have
complimented and, in some cases, surpassed symbolic AI solutions. As a consequence,
its social importance and impact have skyrocketed, bringing the ethical discussion to a
far larger audience. The argument has focused on AI ethical principles (nonmaleficence,
autonomy, fairness, beneficence, and explainability) rather than acts the “how”. Even if
the AI community is becoming more aware of potential issues, it is still in the early stages
of being able to take steps to mitigate the risks. The purpose of this study is to bridge the
gap between principles and practices by developing a typology that can help developers
apply ethics at each level of the Machine Learning development pipeline while also
alerting researchers to areas where further research is needed. Although the research is
limited to Machine Learning, the findings are predicted to be readily transferrable to other
AI domains. The following is the difference between ethics via design and pro-ethical
design: Because it does not rule out a course of action and requires agents to choose
it, the nudge is less paternalistic than pro-ethical design. The nudge is less paternalistic
than pro-ethical design since it does not prohibit but rather compels agents to choose a
path of action. A simple illustration may help you comprehend the distinction. Because
it enables drivers to pay a fine, a speed camera is both pro-ethical and nudging in the
event of an emergency. Speed bumps, on the other hand, are a kind of traffic-calming
device used to slow down automobiles and increase safety. They may seem to be a good
concept, but they need a long-term route adjustment, leaving motorists with few options.
This implies that even while responding to emergency, emergency vehicles such as a
medical ambulance, police car, or fire engine must slow down.

3 AI Governance Development

While AI moral governance offers promise, it also has limits and risks becoming ineffectual if not applied correctly. These limits must be understood and addressed: e.g., who has the last say on what constitutes “ethical” AI? Companies are coming up with
their ideas and methodologies for establishing what it means to use these technologies
ethically and what “ethical” AI implies for society. Those at the top of organizations,
mostly white males, set the tone [4, 31]. The AI ethics board should be diverse, reflecting
the views of the people whom AI systems may impact. Bias reduction suffers when the company's and leadership's aims are at odds: being first to market is highly prized by businesses, while eradicating prejudice and building responsible AI conflicts with this aim, necessitating extra procedures and stop points that lengthen the time it takes to bring a product to market. Traditional goals thus clash with ethical and responsible AI goals, placing the company in jeopardy of losing money [33].
In addition to governance within companies, governments and unions of countries should also step in and put their views on the table for the ethics, governance, and methodologies of regulating AI, just as they have done in manufacturing and waste management, to prevent events like the Boeing 737 MAX groundings, effectually a result of failed industry self-regulation practices. Examining current goals and making sure that responsible AI is a significant focus is critical. Finally, the interplay between ethical and economic gains demonstrates a fundamental understanding of market success.
Cutting shortcuts on ethics has long-term and legal ramifications, particularly essen-
tial given AI’s quickly shifting regulatory framework. In inadequate accountability and
training, when it comes to putting ideas into practice, there is sometimes a lack of precise
instruction. Furthermore, there is no responsibility when a company’s ideals are broken.
In many circumstances, corporate culture and the market in general [34] prioritizes effi-
ciency above fairness and prejudice reduction, making it impossible to put the ideas into.
When combining bias and fairness criteria, it is critical to what is “fair” for a particular
AI system and who defines “fair”.
The EGAL brief on ML fairness dives further into the subject, laying out methods and roadblocks. Guarding against over-reliance on technical solutions and "ethical washing" should be a priority: most principles and associated ideas are based on the assumption that technology can solve problems, and they tend to have a technical bias. For example, a variety of qualitative techniques are included in the EGAL brief on ML fairness, which can be helpful [35]. Particularly for higher-risk applications, initial high-level assessments of the technology's potential for damage, as well as a record of decisions made throughout the AI system's construction, are critical. The terms management and governance are not interchangeable: governance is responsible for supervising how decisions are made, while management is in charge of making them. By extending the same concept to AI governance, we arrive at a definition of AI governance for companies. The more dangerous an AI application is (for a description of some of these threats, see AI hazards), the more critical AI governance becomes [36, 37]. Because AI-enabled robots collect data and information on a continuous basis, biases or unwanted outcomes, such as a chatbot that learns inappropriate or violent language over time, are quite probable. If the data a system receives is biased, it will produce skewed outcomes. Individual subject experts must first inspect the data that enters these machines for biases, and then maintain governance to guarantee that biases do not arise over time.
Companies may use additional visibility, a better understanding of their data, and AI governance to assess a machine's business rules or learnt patterns before adopting and spreading them to staff and consumers. When it comes to ethical AI applications, it is all about trust. Customers, companies, and government authorities all want to know that these smart systems are assessing data and making the best judgments they can. They seek assurance that the business outcomes produced by these machines are
292 U. A. Usmani et al.

in everyone's best interests. Some of the tactics recommended in this article may assist businesses in becoming more trustworthy. They can also enhance how AI offers options to customers, aid with regulatory compliance, improve command and control, and provide total transparency and the ability to make the best decisions possible. In this section, we explain the Three Lines Model (shown in Fig. 2), the role of the different execution lines, and how governance can be made effective using an end-to-end governance model for responsible AI.

3.1 The Three Lines Model

A few of the key "red flag" AI applications include facial recognition, AI for recruitment, and bias in AI-based assessments and recommendations. There should be a way to keep track of everything that is going on. If AI is covered by any current law or by expert bodies for autonomous self-governance, it is desirable to control the development and approval process accordingly; if there is no precedent to follow, there can be additional risks that have not been considered. Corporate governance and risk management are ideas that have been around for a long time [44]. Standard procedures, norms, and conventions are typically used to guarantee that businesses run smoothly. The Federation of European Risk Management Associations (FERMA) and the European Confederation of Institutes of Internal Auditing (ECIIA) developed the three lines of defense concept in 2008–10 as guidance on Article 41 of the 8th EU Company Law Directive.

Fig. 2. Three lines model for AI governance [25]



In a position paper titled The Three Lines of Defense in Effective Risk Management and Control, released in 2013, the Institute of Internal Auditors (IIA) endorsed this strategy. Risk analysis and management, as well as governance adoption, have since become industry standards [39]. In June 2020, the IIA revised its recommendations in a position paper on the IIA's Three Lines Model. This model covers the six basic governance principles, the essential roles in the three-line model, the connections between roles, and how to apply the model; it includes information on the tasks and obligations of management, the internal audit function, and the governing body. The three-line model has been revised to consider a wider variety of financial services organizations, including technology and model risk. Banks often use these three defense lines to manage credit risk, market risk, and operational risk models. We modified this paradigm for AI governance by creating a new governance organization, methodology, roles, and responsibilities. The first line comprises the creators, executors, and operations teams, who develop, build, deploy, and execute the data, AI/ML models, and software; the data, software, and models are maintained and monitored by the operations team. The second line comprises managers, supervisors, and quality assurance staff, who are responsible for identifying and managing the risks connected to data, AI/ML models, automation, and software; continuous monitoring is the focus of this second line of defense, which is also responsible for ensuring that the first line's systems are configured correctly.
The auditors, as the third line, are responsible for ensuring that the organization's rules, regulations, and goals are followed and that technology is utilized responsibly and ethically [40]. The ethics board consists of a diverse group of corporate leaders and workers, and specific organizations can appoint external members to the board of directors. Companies will have to work with external auditors, other assurance providers, and regulators, in addition to their internal duties. Figure 2 shows the features of the numerous roles and the critical tasks of each function.
It is also important to monitor the IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems. The next stage is to devise a strategy [44]. Software and AI models, as previously said, need entirely distinct techniques. While some individual models may not prove beneficial, a portfolio strategy increases the likelihood that at least some of them will. At any one time, firms with established AI adoption have a portfolio of models in different stages of development, including conception, testing, deployment, production, and retirement. The ROI must be monitored and adjusted as required across the portfolio to ensure the best mix of business use cases, efficiency versus effectiveness initiatives, and so on. Ten human talents and four intelligences are necessary to profit from human-centered AI. The distribution strategy must be carefully evaluated because of the convergence of data, software, and AI models.
To deliver AI-embedded software, or Software 2.0, both waterfall and agile software development methodologies must be updated and interlaced. The essential indicators that must be recorded and monitored for supervision will be identified in the specific delivery plan. The next level of governance is the ecosystem as a whole: the ecosystem in which AI models will be incorporated, as well as the context in which they will be used by personnel both within and outside the company. The societal impact of the company's AI should also be evaluated; in this area, IEEE's Well-being Measures are a strong contender [41].
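The portfolio stages named above (conception, testing, deployment, production, retirement) suggest a simple lifecycle state machine through which governance gates can be enforced. The transition rules below are an illustrative assumption, not a prescribed process:

```python
from enum import Enum

class Stage(Enum):
    CONCEPTION = "conception"
    TESTING = "testing"
    DEPLOYMENT = "deployment"
    PRODUCTION = "production"
    RETIREMENT = "retirement"

# Illustrative transition rules: a model may only advance one stage at a
# time, and may be retired from any stage after conception.
ALLOWED = {
    Stage.CONCEPTION: {Stage.TESTING},
    Stage.TESTING: {Stage.DEPLOYMENT, Stage.RETIREMENT},
    Stage.DEPLOYMENT: {Stage.PRODUCTION, Stage.RETIREMENT},
    Stage.PRODUCTION: {Stage.RETIREMENT},
    Stage.RETIREMENT: set(),
}

def advance(current: Stage, target: Stage) -> Stage:
    """Move a model to a new lifecycle stage, enforcing governance gates."""
    if target not in ALLOWED[current]:
        raise ValueError(f"transition {current.value} -> {target.value} not permitted")
    return target
```

Encoding the gates this way makes it impossible for a model to reach production without passing through the testing and deployment reviews.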

3.2 Role of Different Execution Lines

The governing board is in charge of setting the company’s vision, mission, values, and
organizational appetite for risk. Following that, management is given the duty of fulfilling
the organization’s objectives and obtaining the necessary resources. The governing body
receives management reports on planned, actual, and projected performance and risk
and risk management information. The degree of overlap and separation between the
governing body’s and management’s tasks varies depending on the organization. The
governing body can be more or less "hands-on" when it comes to strategic and operational
matters. The strategic plan can be created solely by the governing body or management,
or it can be a shared effort. Also, the CEO can be a member of the board of directors and
even the chairman. Effective communication between management and the governing
body is required in every circumstance. Although the CEO is often the principal point
of contact for this communication, other senior executives also interact regularly with
the governing body.
Second-line executives, such as a Chief Risk Officer (CRO) and a Chief Compliance
Officer (CCO), are sought and required by organizations and authorities. This is entirely
compatible with the Three Lines Model's concepts. Management encompasses occupations on the first and second lines. Internal audit's independence from management allows it to plan and carry out its activities without fear of being influenced or interfered with; it has unrestricted access to the people, resources, and information it requires, and it makes recommendations to the governing body. Independence, however, does not imply isolation: internal audit and management must interact regularly for internal audit's work to be relevant and consistent with the organization's strategic and operational goals. Through all its activities, internal audit broadens its expertise and understanding of the firm, increasing the assurance and direction it can offer as a trusted and strategic partner [26]. Coordination and communication between the first and second lines of management and internal audit are essential to minimize excessive duplication, overlap, or gaps. Because it reports to the governing body, internal audit is frequently referred to as the organization's "eyes and ears".
The governing body is in charge of internal audit oversight, which includes hiring
and firing the Chief Audit Executive (CAE), approving and resourcing the audit plan,
receiving and considering CAE reports, and providing the CAE with unrestricted access
to the governing body, including private sessions without management present. Second-line roles can be given a degree of autonomy from first-line employees and senior management by delegating essential responsibilities and establishing reporting lines to the governing body. The Three Lines Model allows for as many reporting lines between management and the governing body as are required; for compliance or risk management, for example, as many persons as necessary may report directly to the board, organized to give a degree of independence. Second-line roles offer much the same advice, monitoring, analysis, reporting, and assurance as third-line roles, but with less independence. Lower-level employees who make risk management choices (devising and implementing policies, setting boundaries, establishing targets, and so on) are plainly "in the kitchen making sausages" and part of management's actions and responsibilities. Some organizations, most notably regulated financial institutions, must have these arrangements in place to maintain true independence.

Risk management remains the duty of first-line management in these instances. Second-line responsibilities include risk management monitoring, counseling, directing, testing, assessing, and reporting. Second-line jobs are a component of management's responsibilities: they are never truly independent of management, regardless of reporting lines, since they assist and challenge those in first-line positions, and they are critical to management decisions and actions. Third-line occupations are distinguished by their independence from management. Internal audit's independence, which distinguishes it from other activities and lets it offer distinct assurance and recommendations, is enshrined in the Three Lines Model principles. Internal audit preserves its independence by declining to make decisions or take actions that are part of management's responsibilities, such as risk management, and by declining to provide assurance over activities for which internal audit is currently or was previously responsible. The CAE may be expected to take on decision-making responsibilities for jobs that require similar skills, such as statutory compliance or enterprise risk management (ERM), especially to uphold best practices for company performance [42].

3.3 End-to-End Governance Model for Learning Responsible AI Practices


Other sources of assurance may also be available. Effective governance requires accurate task assignment and strong activity alignment via cooperation, collaboration, and communication. Internal audit should provide the governing body with confidence that governance structures and processes are well organized and operating as intended. Organizations are human endeavors that operate in a constantly turbulent, multifaceted, interconnected, and chaotic environment. They generally involve many stakeholders, all of whom have varying, competing, and often conflicting interests. Stakeholders entrust supervision to a governing body, which in turn empowers management with the resources and authority to make crucial choices, including risk management. For these and other reasons, businesses need efficient structures and processes to achieve their objectives while retaining good governance and risk management. The governing body and management rely on internal audit to provide independent, objective assurance and advice on all issues and to inspire and promote innovation and growth, since the governing body receives management reports on actions, results, and predictions. The governing body is ultimately accountable for governance, which is implemented via the actions of the governing body, management, and internal audit. The Three Lines Model assists businesses in establishing structures and procedures that support good governance and risk management while also helping them achieve their objectives [43, 44, 45].
The model can be used by any organization and is bolstered by: taking a principles-based approach and customizing the model to the aims and circumstances of the organization; treating risk management as a means to achieve objectives and create value, as well as a "defensive" safeguard for assets; representing roles and duties together with their interrelationships; and putting processes in place to guarantee that actions and goals align with stakeholders' main concerns. The first principle concerns governance. Governance is made up of structures and procedures that enable accountability to stakeholders, with monitoring by a governing body exercised through integrity, leadership, and transparency. Management operations include risk-based decision-making and resource allocation, including risk management, to accomplish goals [27]. Through

thorough investigation and intelligent communication, an independent internal audit position provides assurance and guidance to offer clarity and confidence and to encourage and support continual growth [46].
The governing body ensures the establishment of the relevant institutions and processes for efficient governance, so that the organization's aims and operations are in sync with its stakeholders' significant objectives. The governing body assigns tasks and resources to management to achieve its goals while adhering to legal, regulatory, and ethical obligations, and establishes and maintains an unbiased, objective, and competent internal audit department to provide clarity and confidence in progress toward goals. Management, as well as first- and second-line employees, are covered by Principle 3: management is accountable for aligning both first- and second-line functions with organizational goals. First-line jobs, which are most directly engaged in providing goods and/or services to the organization's customers, include support responsibilities; second-line posts help with risk management. It is possible to combine or split the first and second lines.
Specialists are appointed to specialized second-line occupations to give extra knowledge, supervision, and challenge to first-line activities. Second-line risk management jobs may concentrate on internal control; information and technology security; sustainability; and quality assurance, among other risk management goals. While second-line jobs, such as enterprise risk management (ERM), may have more responsibilities, risk management is still a part of first-line operations and is managed by management. Because of external factors at work, there is always risk in the retail business. Customer credit is an example of an external factor with a significant influence on a business's profitability. If a business performs a customer credit risk analysis and finds that things are not going as planned, it can lower its risk by stopping invoice extensions for clients whom the organization deems high risk. Take the manufacturing business, for example: a company wants to develop a new product. They must
do a comprehensive risk analysis before beginning production to determine the degree
of risk that the firm may face. They may then decide if the advantages of producing
a new product exceed the dangers. Internal audit offers unbiased assurance and advice on the appropriateness and effectiveness of governance and risk management. This is accomplished through the skillful use of rigorous and disciplined methodologies, knowledge, and insight. To encourage and support future development, it communicates its findings to management and the governing body, and it may evaluate assurance from various internal and external sources throughout this process.
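The customer-credit example above amounts to a simple decision rule: score each client's credit risk and halt invoice extensions above a threshold. A minimal sketch follows, where the scoring weights and the 0.7 cutoff are invented purely for illustration:

```python
def credit_risk_score(days_overdue: int, utilization: float) -> float:
    """Toy credit risk score in [0, 1]; the weights are illustrative assumptions."""
    overdue_component = min(days_overdue / 90.0, 1.0)  # saturates at 90 days overdue
    return 0.6 * overdue_component + 0.4 * min(utilization, 1.0)

def may_extend_invoice(days_overdue: int, utilization: float,
                       cutoff: float = 0.7) -> bool:
    """Halt invoice extensions for clients the organization deems high risk."""
    return credit_risk_score(days_overdue, utilization) < cutoff
```

A real credit model would of course be fitted to historical repayment data rather than hand-weighted, but the governance point is the same: the decision rule is explicit and auditable.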
The fifth principle is the independence of the third line: internal audit's impartiality, authority, and credibility all depend on its independence from management obligations. This includes accountability to the governing body, unrestricted access to the people, resources, and data it needs to perform its job, and independence from prejudice or involvement in the design and delivery of audit services. Creating and preserving value is the sixth principle: all functions work together to develop and maintain value when they are coordinated and prioritize stakeholder interests. Communication, cooperation, and collaboration are used to align activities. This maintains the consistency, coherence, and transparency of information, which is essential for risk-based decision-making. Responsibilities are distributed differently in different organizations. The high-level actions that follow, on the other hand,

help to emphasize the Three Lines Model’s Principles. Stakeholders understand that the
governing body is in charge of organizational monitoring.
Internal audit's unbiased assurance and advice on the appropriateness and effectiveness of governance and risk management rest on the skillful use of rigorous and disciplined methodologies, knowledge, and insight [50], and its findings are communicated to management and the governing body to encourage and support future development. Formulation, execution, and continuous improvement of risk management procedures, including internal control at the process, system, and entity levels, are further examples of second-line jobs [48].
Second-line roles support the achievement of risk management goals, such as adherence to laws, norms, and accepted ethical behavior; internal control; information and technology security; sustainability; and quality assurance, and they analyze and report on risk management's effectiveness and appropriateness, including internal control. Internal audit is separate from management and reports to the governing body. For example, the President reports to the Audit and Risk Management Committee of the Board of Governors, while others report to the Executive Director, University Governance, and the University Secretary, respectively. Assurance and advice on the adequacy and usefulness of governance, risk management, and internal control are communicated to management and the governing body independently and objectively, to support the achievement of organizational objectives and to promote and facilitate continuous improvement. Any threats to impartiality or independence are brought to the attention of the governing body, which takes appropriate action. Further assurance is provided to meet legal and regulatory duties that safeguard stakeholders' interests, and internal assurance sources are augmented to meet management and governing body needs.
Data architects are also vital in the governance of AI systems. In order to model AI, businesses must have a solid data or metadata pipeline; the success of AI is contingent on a well-organized data architecture devoid of mistakes and noise. Data standards, data governance, and business analytics will all be needed. The development of the AI governance function requires human resources: companies may, for example, seek employees who "fit" into the company's present AI framework and provide existing staff with training tools to help them learn how to build ethical AI applications. When AI technology is deployed, it is critical to guarantee that no legal boundaries are breached and that AI solutions meet organizational and industry-specific regulatory standards. There is no one-size-fits-all plan that takes into account all legal and regulatory issues; customers' perceptions of ethical behavior in the financial services industry, for example, may differ significantly from corporate ethics. Integrating legal and regulatory teams into the AI governance function gives a diverse set of decision-making inputs. Marketing, sales, human resources, supply chain, and finance efforts all realize the advantages of AI, so subject knowledge is required not just for app creation but also for app evaluation. As a consequence,

having a strong business presence on the core AI governance council may aid in improving outcomes. People from various backgrounds should be represented on a company's governing board; this also contributes to inclusive and smooth governance by considering all of the company's issues. Product-based businesses provide a diverse variety of AI-enabled products. When a business purchases a product that is not primarily based on AI, it often falls outside the purview of the AI regulatory agency. But what if the business introduces an AI-assisted process, service, or product? Procurement and finance departments should ideally have AI professionals on staff to help with product onboarding. A well-functioning AI governance function will provide a framework for monitoring AI algorithms and products more effectively. In addition, developing an agile and cross-functional AI governance committee would bring a diverse set of perspectives to the table and help spread AI knowledge.

4 Future Views
Despite the development of ethical frameworks, AI systems are nevertheless being quickly implemented across a wide range of vital areas in the public and commercial sectors, including healthcare, education, criminal justice, and many more, with no protections or accountability procedures in place. There are a number of challenges that must be addressed, and no single endeavor, nation, or corporation will be able to address them alone. Emerging technologies are becoming more cross-border, and if the norms and practices shaping technical development and implementation in various countries do not align, significant possibilities can be missed (WTO, 2019) [46].
New conflicts can erupt both within and between states in a divided globe. In terms of economic prosperity, the development of certain technical systems may grow more costly, delaying innovation. This can lead to injustice and new divides between technologically advanced and technologically disadvantaged nations or regions.
Additionally, major differences in how new technologies (particularly AI) are handled
and utilized in terms of human rights can make guaranteeing people’s equal access to
rights and opportunities across borders more difficult. New technologies can be used
as new digital surveillance tools, allowing governments to automate citizen monitoring
and tracking; they can also help policymakers allocate public goods and resources more
efficiently; and they can even be powerful mechanisms for private companies to forecast
our behavior [50].
Personal data can be retained and used for AI in an open or hidden manner. It can be voluntarily offered as a kind of remuneration, or it can be taken without the agreement or knowledge of the owner. Overall, arguments about who has access to our data, who has the right to make decisions about it, and who has the instruments to enforce that authority haunt the path to the digital future. This is not to say that all technological governance should be done at the global level; it is critical for regions, states, and cities to be able
to adapt to their residents’ social, economic, and cultural needs. While the majority of
research has focused on wealthy nations, there is a need for additional information about
the geographically specific effect of AI systems on developing countries, as well as how
new technology can perpetuate historical inequalities in these areas.
Global processes, on the other hand, are essential even if they do not result in integrated systems, since inequity thrives in the absence of universal laws. To manage the

digital transition and achieve social inclusion, it will be necessary to create internationally consistent ethical, humanitarian, legal, and political normative frameworks. Furthermore, while taking into account geopolitical and cultural disparities, there will be a growing need to focus on algorithmic criteria rather than ethical principles alone. The G20's role in aligning interests and organizing such projects will be critical in the coming years.
The G20 brings together some of the most powerful political and economic forces on the planet; it spans the whole globe and includes some of the world's strongest economies. Since it is a crucial venue for dialogue and involvement, both executive and legislative, it is the ideal place to examine the future of digital governance and respond to one of the most significant contemporary difficulties facing our world today [51]. Right now, there is no one-size-fits-all solution for the best AI technique, but there are many options, and we must all work together to determine which choice will benefit the most people. By participating in and leading this discussion, the G20 has the potential to become the spinal column of a new architecture for the 21st century, ensuring a brighter future for everyone.
Although recognition and classification are not the only tasks given to AI systems, they are the most popular. The flexibility of AI approaches might be seen as a sign of variety. However, the predominance of a few activities may pose a risk when research focuses on small, local gains in well-known, well-suited tasks like identification and classification and then applies the effort to comparable difficulties over a wide variety of domains. It is also feasible that the consequences of a system failure will be completely unexpected: failures may result in a wide variety of symptoms, from mild discomfort to death. This accords with prior research showing that AI piques people's interest in a broad variety of topics, regardless of whether they are beneficial. On the other hand, mechanisms that protect people's privacy or remove the potential for discrimination are few and far between. Furthermore, not every system requires extensive testing: a system whose failure causes discomfort is far more forgiving than one whose failure causes death. As a result, the severity of a system failure should be considered while designing a system. Despite their significant dispersion throughout a broad range of categories, the systems are primarily defined by their limited application within those categories. In each facet, just a few systems, if any, represented several categories; several domains, for example, were considerably underrepresented, and inconsistently so.
In a variety of industries, e.g. agriculture and medical applications, robots have lately begun to take over tasks from humans. The most critical system tasks were recognition and classification. This might be due to researchers' access to resources like robots, the overwhelming popularity of particular applications like self-driving vehicles, or technology's increasing capacity to tackle these more difficult fields. However, less well-known challenges should not be overlooked in future research, since less well-known does not always imply lower value, and AI might be useful in situations like crop maturity assessment, disease diagnosis, and MRI scan analysis. This is particularly significant since practical research in these potentially highly beneficial areas may aid in the dissemination of findings and should be included in studies. Software engineering research has not always prospered in this discipline. Similarly, the majority of the tasks entrusted to the systems are of moderate complexity.

The most significant category, 'recognition & classification,' for example, is a difficult task, yet it is often required because of the problems it solves: Is there any mold on the product? Is the person in both pictures the same person? Is it now time to reap the benefits of your labor? Assembly, by comparison, combines the two: identifying key components and actively assembling them. Overall, it seems that only a small portion of genuine AI system development is concentrated on very difficult challenges. Many model-centered validations, as well as data-driven ML testing in general, are plagued by data problems [61]. If sufficient high-quality data is available, on the other hand, model-centered methods allow more efficient and perhaps faster validation of model-centered systems, or at least of the models used in those systems. If a system has several components, we suggest carefully assessing whether the model-centered approach is the optimal validation technique. In general, the research seems to place a higher priority on initial validation than on ongoing validation. This is not surprising; it has always been done this way: a system is tested before being released, once it seems to be functional. Given the differing demands of AI and traditional systems, however, this may not suffice [62].
The first validation of a self-configuring system often includes self-configurability
and first configuration validation. If the system reconfigures itself during deployment, it
should almost certainly be reassessed to ensure that it continues to satisfy the system's needs.
The same can be said for a video streaming platform’s recommendation algorithm, which
may have a constantly shifting user base: without continuing validation, the system
may fail to meet its requirements, which developers and users may be unaware of. Our
validation and continuing validation categories are based on original research. As a
consequence, the completeness of the categories will not be examined in this study.
Different taxonomies can be beneficial for recognizing and grasping AI validation; as a
consequence, researching, expanding, and improving these taxonomies for validity and
utility will be a future job. Monitoring the system’s outputs, thresholds, and other factors
on a continuous basis may help AI systems improve their accuracy and efficiency. The
first step in the oversight process could be to create a list of all AI systems in use at the
company, along with their specific uses, techniques used, names of developers/teams and
business owners, and risk ratings – such as calculating the potential social and financial
risks that could arise if such a system fails. Examining the AI system’s inputs and outputs,
as well as the AI system itself, may need a different methodology. Although data quality
standards aren’t exclusive to AI/ML, they do have an influence on AI systems that learn
from data and provide output depending on what they’ve learned. Training data may be
used to assess a data collection’s quality and biases.
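The inventory described in the oversight step above can be sketched as a simple record type. This is an illustrative assumption, not a standard schema; the field names and the 0–5 risk scale are chosen here only to make the idea concrete:

```python
from dataclasses import dataclass, field

@dataclass
class AISystemRecord:
    """One row of a hypothetical AI-system inventory: the specific use,
    techniques used, responsible people, and a coarse risk rating."""
    name: str
    use_case: str
    techniques: list = field(default_factory=list)
    owner_team: str = ""
    business_owner: str = ""
    social_risk: int = 0      # illustrative scale: 0 (none) .. 5 (severe)
    financial_risk: int = 0

    def risk_rating(self):
        # A simple aggregate: the worse of the two risk dimensions.
        return max(self.social_risk, self.financial_risk)
```

Such a register gives the oversight process a concrete starting point: every deployed system appears once, with its owners and a rating that can drive the depth of later review.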
If practical and appropriate, benchmarking of other models and existing approaches
to improve model interpretability can be incorporated in AI system assessment. Under-
standing the elements influencing AI system outputs helps to boost AI system confi-
dence. Drift in AI systems might cause a plethora of problems. A shifting link between
goal variables and independent variables, can lead to poor model accuracy. As such,
drift detection is useful tool in AI problems, e.g. in the security, privacy fairness of a
model, as avoidance measures. By evaluating whether input data varies considerably
from the model’s training data, monitoring may assist discover “data drift”. Accounting
for the model’s data collected in production and analyzing the model’s correctness is
Enhancing Artificial Intelligence Control Mechanisms 301

one way for acquiring insight into the model’s “accuracy drift”. In lending institutions,
compliance, fair lending, and system governance teams are prevalent, and they seek
for signs of bias in input variables and procedures. As a consequence of technology
advancements and the deployment of de-biasing AI, a portion, if not the majority, of
this labor can be automated and simplified in future. Fair AI, on the other hand, may
need a human-centered approach. The generalist knowledge and experience of a well-
trained and varied group probing for discriminatory bias in AI systems is unlikely to
be completely replaced by an automated procedure. As a result, human judgment might
be utilized as a first line of defense against biased artificial intelligence. According to
recent research, discrimination-reducing approaches have been found to minimize disparities
in class-control contexts while still keeping good predictive quality. In order to reduce
inequities, mitigation algorithms design the "optimal" system for a certain degree of
quality and discriminating measures.
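The data-drift check described here can be sketched as a two-sample Kolmogorov–Smirnov comparison between training and production feature values. This is a minimal illustration, not the paper's implementation; the function names and the 0.2 alert threshold are illustrative assumptions:

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    values = sorted(set(a) | set(b))
    max_gap = 0.0
    for v in values:
        cdf_a = sum(x <= v for x in a) / len(a)
        cdf_b = sum(x <= v for x in b) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

def detect_data_drift(training, production, threshold=0.2):
    """Flag features whose production distribution deviates from training.
    The 0.2 threshold is an illustrative choice, not a universal constant."""
    return {name: ks_statistic(training[name], production[name])
            for name in training
            if ks_statistic(training[name], production[name]) > threshold}
```

In practice a statistics library's two-sample KS test (with p-values) would typically replace the hand-rolled statistic; the point is only that data-drift monitoring reduces to comparing distributions feature by feature.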
The algorithms look for alternatives when there isn’t another system with a greater
degree of quality for a certain level of discrimination. However, no solution has been
devised that completely removes bias for any given level of quality. More testing and
validation studies are needed before such algorithms are used in production environments.
There are two sorts of methodologies: traditional searches over algorithms and feature
specifications for valid and less discriminating systems, and more modern approaches that
adjust the input data or the algorithms' optimization functions themselves. To
reduce disparate impact, feature selection may be used, which involves removing one or
two disparate-effect components from the system and replacing them with a few addi-
tional variables. In complicated AI/ML systems, these tactics have been demonstrated to
be ineffective. For bias reduction, new strategies are needed in pre-processing, inside
the decision-making phase of the algorithm, and continuing into the output post-processing
phase. The legal context in which technology is used, as well as how it is used, has an
impact on whether specific tactics are allowed in a certain circumstance.
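The disparate-impact measurement mentioned above can be illustrated with the common "four-fifths rule" ratio of favorable-outcome rates between a protected group and a reference group. This is a standard fairness heuristic rather than the paper's own procedure; the names and the 0.8 cutoff follow the usual convention:

```python
def disparate_impact_ratio(outcomes, groups, protected, reference):
    """Ratio of favorable-outcome rates: protected group vs. reference group.
    outcomes: list of 1 (favorable) / 0 (unfavorable) decisions.
    groups:   list of group labels aligned with outcomes."""
    def rate(label):
        selected = [o for o, g in zip(outcomes, groups) if g == label]
        return sum(selected) / len(selected)
    return rate(protected) / rate(reference)

# A ratio below ~0.8 (the "four-fifths rule") is a common warning sign
# that the system may have a disparate impact on the protected group.
```

A compliance team could compute this ratio per input variable or per decision procedure, which is exactly the kind of check the text describes being partially automated.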
Accuracy drift detection might be useful in the business sector since it can detect
a decrease in model accuracy before it has a major effect on the company. Precision
drift may lead to a loss of precision in the model. Data drift, on the other hand, aids
companies in determining how data quality varies over time. It may be challenging for
many businesses to guarantee that AI/ML explanations are both accurate and useful
(explainability). AI/ML explanations, like the underlying AI/ML systems, may be poor
approximations, wrong, or inconsistent. In the financial services industry, consistency
is crucial, especially when it comes to unfavorable action letters for credit lending deci-
sions. To lessen explainability issues, explanatory procedures may be tested for accuracy
and stability in human assessment studies or on simulated data, depending on individual
implementations. According to a new study, providing explanations and forecasts about
how AI systems function may aid criminal actors.
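Accuracy-drift monitoring of the kind described can be sketched as a sliding-window comparison of production accuracy against a baseline. The window size and tolerance margin below are illustrative assumptions, not values from the paper:

```python
from collections import deque

class AccuracyDriftMonitor:
    """Track model accuracy over a sliding window of labeled production
    samples and raise a flag when it falls below the baseline by more
    than a chosen margin."""
    def __init__(self, baseline_accuracy, window=100, margin=0.05):
        self.baseline = baseline_accuracy
        self.margin = margin
        self.window = deque(maxlen=window)  # 1 = correct, 0 = wrong

    def observe(self, prediction, actual):
        self.window.append(1 if prediction == actual else 0)

    def drifted(self):
        if not self.window:
            return False
        current = sum(self.window) / len(self.window)
        return current < self.baseline - self.margin
```

Because the window is bounded, old production samples age out automatically, so the flag reflects recent behavior rather than the model's lifetime average.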
Businesses should only share information with customers when they directly
request it or when it is mandated by law, to prevent security concerns. Traditional security
techniques such as real-time anomaly detection, user authentication, and API throttling
may be employed to secure AI/ML systems trained on sensitive data and producing
predictions available to end users, depending on the implementation and management
environment. In AI applications, traditional robust technologies, as well as cyber safeguards, may be effective risk mitigators. As adversarial learning improves, it might be
utilized to help construct safe machine learning systems.
Despite the fact that this is a relatively young subject of study, the technology sector
is considering a variety of possible mitigation techniques. Differential privacy has been
proposed as a means of keeping personal information, including training information,
secret. Differential privacy anonymizes data by infusing it with random noise, allowing
statistical analysis without revealing personally identifiable information. As a consequence,
the system produces nearly the same results even if a single user/data element record
is removed. Strong technological and cyber controls may be an effective mitigation
depending on implementations and context, whereas mitigation methods for the AI/ML
threats are still being investigated. Even though effective information security processes
may prevent model extraction attacks, watermarking can be used to identify an extracted
model. The AI/ML system is taught to give unique outputs for certain inputs in water-
marking. If another system delivers the exact unique result for the same precise inputs,
it might be a sign of intellectual property theft.
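The noise-injection idea behind differential privacy can be sketched with the classic Laplace mechanism, a standard technique rather than the paper's own implementation; the function names are illustrative:

```python
import math
import random

def laplace_noise(scale):
    """Draw Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon):
    """Epsilon-differentially-private count query (Laplace mechanism).
    A count has sensitivity 1 -- adding or removing one record changes
    it by at most 1 -- so noise with scale 1/epsilon suffices."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)
```

The reported count stays close to the truth, yet the presence or absence of any single record is statistically masked, which is the "same results even if a single record is removed" property described above.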

5 Conclusions

The utilization of AI promises a brighter future, especially in the context of new-generation
industries and business settings where data is available at wide scale. These new contexts
include, e.g., big fleets with lots of data [56, 57], business models operating at the platform
level [58], and a wide variety of data contributed by huge groups that has to be processed, as in
digital citizen science activities [59, 60]. The same holds for many traditional industries and
processes, such as accelerating product design [3], boosting sustainability and circularity [19],
and, e.g., predicting asset maintenance based on collected big data [54]. But for this utilization
to work, different actors, asset owners, and platform utilizers have to be willing to share
the gains and pains in shared development efforts [55], and also the costs of keeping
the system and its intelligent parts developing. In the ethical AI sense, actors who
cooperate fairly give extra incentives (reward and/or punishment) for ethical AI
development, in addition to the reasons that exist now or may exist by default in the
future. Such actors intrinsically value good compliance with norms, and they prefer to
reward those who comply with appropriate criteria. On the other hand, one might want
to create an incentive for oneself to act in a certain way, as a commitment mechanism;
one can use incentives as a supplement to other governance tools such as behavior
monitoring and direct regulation; or one may want to influence the incentives of a large
number of people. There are a variety of incentives to consider, such as a rapid and high
return on investment, meeting the needs of socially beneficial developments, and so on.
Creating engagement incentives for essential stakeholders can help address issues
such as public funding, international cooperation, etc. at once.
This research focused on the responsible use of AI, which has lately been highlighted
as an essential requirement for ML technology adoption in real-world applications. Our
research concerned an end-to-end governance model for the responsible use of AI, emphasizing
fairness and responsibility in large-scale AI technology deployment in real-world organizations.
The key deliverables or artefacts that the three lines provide were examined
in order for AI to be used ethically and effectively. We infer that the model generated can
aid businesses in the task of identifying the structures, processes, and responsibilities
that best support goal attainment while simultaneously ensuring robust governance and
risk management. The concept of our proposed model can help manage and regulate
risks effectively.

References
1. Collins, C., Dennehy, D., Conboy, K., Mikalef, P.: Artificial intelligence in information sys-
tems research: a systematic literature review and research agenda. Int. J. Inf. Manag. 60,
102383 (2021)
2. Smuha, N.A.: Beyond a human rights-based approach to AI governance: promise, pitfalls,
plea. Philos. Technol. 34(1), 91–104 (2021)
3. Ghoreishi, M., Happonen, A.: New promises AI brings into circular economy accelerated
product design: a review on supporting literature. In: E3S Web Conference, vol. 158, pp. 1–10
(2020). https://doi.org/10.1051/e3sconf/202015806002
4. Tigard, D.W.: Responsible AI and moral responsibility: a common appreciation. AI Ethics
1(2), 113–117 (2020). https://doi.org/10.1007/s43681-020-00009-0
5. Shneiderman, B.: Responsible AI: bridging from ethics to practice. Commun. ACM 64(8),
32–35 (2021)
6. Berlin, S.J., John, M.: Particle swarm optimization with deep learning for human action recog-
nition. Multimedia Tools Appl 79(25–26), 17349–17371 (2020). https://doi.org/10.1007/s11
042-020-08704-0
7. Rakova, B., Yang, J., Cramer, H., Chowdhury, R.: Where responsible AI meets reality:
practitioner perspectives on enablers for shifting organizational practices. Proc. ACM Hum.
Comput. Interact. 5(CSCW1), 1–23 (2021)
8. Wearn, O.R., Freeman, R., Jacoby, D.M.: Responsible AI for conservation. Nat. Mach. Intell.
1(2), 72–73 (2019)
9. Arrieta, A.B., et al.: Explainable artificial intelligence (XAI): concepts, taxonomies, oppor-
tunities and challenges toward responsible AI. Inf. Fusion 58, 82–115 (2020)
10. Ghoreishi, M., Happonen, A., Pynnönen, M.: Exploring industry 4.0 technologies to enhance
circularity in textile industry: role of Internet of Things. In: Twenty-first International Working
Seminar on Production Economics, Austria, 24–28 February 2020, pp. 1–16 (2020). https://
doi.org/10.5281/zenodo.3471421
11. Metso, L., Happonen, A., Rissanen, M.: Estimation of user base and revenue streams for novel
open data based electric vehicle service and maintenance ecosystem driven platform solution.
In: Karim, R., Ahmadi, A., Soleimanmeigouni, I., Kour, R., Rao, R. (eds.) IAI 2021. LNME,
pp. 393–404. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-93639-6_34
12. Usmani, U.A., Haron, N.S., Jaafar, J.: A natural language processing approach to mine online
reviews using topic modelling. In: Chaubey, N., Parikh, S., Amin, K. (eds.) COMS2 2021.
CCIS, vol. 1416, pp. 82–98. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-767
76-1_6
13. Trocin, C., Mikalef, P., Papamitsiou, Z., Conboy, K.: Responsible AI for digital health: a
synthesis and a research agenda. Inf. Syst. Front., 1–19 (2021)
14. Peters, D., Vold, K., Robinson, D., Calvo, R.A.: Responsible AI—two frameworks for ethical
design practice. IEEE Trans. Technol. Soc. 1(1), 34–47 (2020)
15. Clarke, R.: Principles and business processes for responsible AI. Comput. Law Secur. Rev.
35(4), 410–422 (2019)
16. Usmani, U.A., Watada, J., Jaafar, J., Aziz, I.A., Roy, A.: A reinforcement learning based
adaptive ROI generation for video object segmentation. IEEE Access 9, 161959–161977
(2021)
17. Sambasivan, N., Holbrook, J.: Toward responsible AI for the next billion users. Interactions
26(1), 68–71 (2018)
18. Butler, L.M., Arya, V., Nonyel, N.P., Moore, T.S.: The Rx-HEART framework to address
health equity and racism within pharmacy education. Am. J. Pharm. Educ. 85(9) (2021)
19. Ghoreishi, M., Happonen, A.: Key enablers for deploying artificial intelligence for circu-
lar economy embracing sustainable product design: three case studies. In: AIP Conference
Proceedings 2233(1), 1–19 (2020). https://doi.org/10.1063/5.0001339
20. Usmani, U.A., Watada, J., Jaafar, J., Aziz, I.A., Roy, A.: A reinforcement learning algorithm
for automated detection of skin lesions. Appl. Sci. 11(20), 9367 (2021)
21. Dignum, V.: The role and challenges of education for responsible AI. Lond. Rev. Educ. 19(1),
1–11 (2021)
22. Leslie, D.: Tackling COVID-19 through responsible AI innovation: five steps in the right
direction. Harv. Data Sci. Rev. (2020)
23. Ghoreishi, M., Happonen, A.: The case of fabric and textile industry: the emerging role of
digitalization, Internet-of-Things and industry 4.0 for circularity. In: Yang, X.-S., Sherratt,
S., Dey, N., Joshi, A. (eds.) Proceedings of Sixth International Congress on Information
and Communication Technology. LNNS, vol. 216, pp. 189–200. Springer, Singapore (2022).
https://doi.org/10.1007/978-981-16-1781-2_18
24. Wang, Y., Xiong, M., Olya, H.: Toward an understanding of responsible artificial intelligence
practices. In: Proceedings of the 53rd Hawaii International Conference on System Sciences,
pp. 4962–4971. Hawaii International Conference on System Sciences (HICSS), January 2020
25. Cheng, L., Varshney, K.R., Liu, H.: Socially responsible AI algorithms: issues, purposes, and
challenges. J. Artif. Intell. Res. 71, 1137–1181 (2021)
26. Happonen, A., Ghoreishi, M.: A mapping study of the current literature on digitalization and
industry 4.0 technologies utilization for sustainability and circular economy in textile indus-
tries. In: Yang, X.-S., Sherratt, S., Dey, N., Joshi, A. (eds.) Proceedings of Sixth International
Congress on Information and Communication Technology. LNNS, vol. 217, pp. 697–711.
Springer, Singapore (2022). https://doi.org/10.1007/978-981-16-2102-4_63
27. Ashok, M., Madan, R., Joha, A., Sivarajah, U.: Ethical framework for artificial intelligence
and digital technologies. Int. J. Inf. Manag. 62, 102433 (2022)
28. Usmani, U.A., Watada, J., Jaafar, J., Aziz, I.A., Roy, A.: A reinforced active learning algorithm
for semantic segmentation in complex imaging. IEEE Access 9, 168415–168432 (2021)
29. Maree, C., Modal, J.E., Omlin, C.W.: Towards responsible AI for financial transactions.
In: 2020 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 16–21. IEEE,
December 2020
30. Rockall, A.: From hype to hope to hard work: developing responsible AI for radiology. Clin.
Radiol. 75(1), 1–2 (2020)
31. Constantinescu, M., Voinea, C., Uszkai, R., Vică, C.: Understanding responsibility in responsi-
ble AI. Dianoetic virtues and the hard problem of context. Ethics Inf. Technol. 23(4), 803–814
(2021). https://doi.org/10.1007/s10676-021-09616-9
32. Happonen, A., Santti, U., Auvinen, H., Räsänen, T., Eskelinen, T.: Digital age business model
innovation for sustainability in university industry collaboration model. In: E3S Web of Con-
ferences, vol. 211, Article no. 04005, pp. 1–11 (2020). https://doi.org/10.1051/e3sconf/202
02110400
33. Al-Dhaen, F., Hou, J., Rana, N.P., Weerakkody, V.: Advancing the understanding of the role
of responsible AI in the continued use of IoMT in healthcare. Inf. Syst. Front., 1–20 (2021)
34. McDonald, M.L., Keeves, G.D., Westphal, J.D.: One step forward, one step back: white
male top manager organizational identification and helping behavior toward other executives
following the appointment of a female or racial minority CEO. Acad. Manag. J. 61(2), 405–439
(2018)
35. Usmani, U.A., Roy, A., Watada, J., Jaafar, J., Aziz, I.A.: Enhanced reinforcement learning
model for extraction of objects in complex imaging. In: Arai, K. (ed.) Intelligent Computing.
LNNS, vol. 283, pp. 946–964. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-
80119-9_63
36. Lee, M.K., et al.: Human-centered approaches to fair and responsible AI. In: Extended
Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–8,
April 2020
37. Yang, Q.: Toward responsible AI: an overview of federated learning for user-centered privacy-
preserving computing. ACM Trans. Interact. Intell. Syst. (TiiS) 11(3–4), 1–22 (2021)
38. Hirvimäki, M., Manninen, M., Lehti, A., Happonen, A., Salminen, A., Nyrhilä, O.: Evaluation
of different monitoring methods of laser additive manufacturing of stainless steel. Adv. Mater.
Res. 651, 812–819 (2013). https://doi.org/10.4028/www.scientific.net/AMR.651.812
39. Sen, P., Ganguly, D.: Towards socially responsible AI: cognitive bias-aware multi-objective
learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 03,
pp. 2685–2692, April 2020
40. de Laat, P.B.: Companies committed to responsible AI: from principles towards implementation
and regulation? Philos. Technol. 34(4), 1135–1193 (2021)
41. Happonen, A., Tikka, M., Usmani, U.: A systematic review for organizing hackathons and
code camps in COVID-19 like times: literature in demand to understand online hackathons
and event result continuation. In: 2021 International Conference on Data and Software
Engineering (ICoDSE), pp. 7–12 (2021). https://doi.org/10.1109/ICoDSE53690.2021.964
8459
42. Wangdee, W., Billinton, R.: Bulk electric system well-being analysis using sequential Monte
Carlo simulation. IEEE Trans. Power Syst. 21(1), pp. 188–193 (2006)
43. Usmani, U.A., Usmani, M.U.: Future market trends and opportunities for wearable sensor
technology. IACSIT Int. J. Eng. Technol. 6(4), 326–330 (2014)
44. Dignum, V.: Ensuring responsible AI in practice. In: Dignum, V. (ed.) Responsible Artifi-
cial Intelligence. Artificial Intelligence: Foundations, Theory, and Algorithms, pp. 93–105.
Springer, Cham (2019). https://doi.org/10.1007/978-3-030-30371-6_6
45. Amershi, S.: Toward responsible AI by planning to fail. In: Proceedings of the 26th ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining, p. 3607, August
2020
46. Cath, C.: Governing artificial intelligence: ethical, legal and technical opportunities and
challenges. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 376(2133), 20180080 (2018)
47. Eskelinen, T., Räsänen, T., Santti, U., et al.: Designing a business model for environmental
monitoring services using fast MCDS innovation support tools. TIM Rev. 7(11), 36–46 (2017).
https://doi.org/10.22215/timreview/1119
48. Truby, J.: Governing artificial intelligence to benefit the UN sustainable development goals.
Sustain. Dev. 28(4), 946–959 (2020)
49. Happonen, A., Salmela, E.: Automatic & unmanned stock replenishment process using scales
for monitoring. In: Proceedings of the Third International Conference on Web Information
Systems and Technologies - (Volume 3), Barcelona, Spain, 3–6 March 2007, pp. 157–162
(2007). https://doi.org/10.5220/0001282801570162
50. Braun, B.: Governing the future: the European central bank’s expectation management during
the Great moderation. Econ. Soc. 44(3), 367–391 (2015)
51. Nitzberg, M., Zysman, J.: Algorithms, data, and platforms: the diverse challenges of governing
AI. J. Eur. Public Policy (2021)
52. Salmela, E., Santos, C., Happonen, A.: Formalisation of front end innovation in supply net-
work collaboration. Int. J. Innov. Reg. Dev. 5(1), 91–111 (2013). https://doi.org/10.1504/
IJIRD.2013.052510
53. Piili, H., et al.: Digital design process and additive manufacturing of a configurable product.
Adv. Sci. Lett. 19(3), 926–931 (2013). https://doi.org/10.1166/asl.2013.4827
54. Metso, L., Happonen, A., Rissanen, M., Efvengren, K., Ojanen, V., Kärri, T.: Data openness
based data sharing concept for future electric car maintenance services. In: Ball, A., Gelman,
L., Rao, B.K.N. (eds.) Advances in Asset Management and Condition Monitoring. SIST, vol.
166, pp. 429–436. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-57745-2_36
55. Happonen, A., Siljander, V.: Gainsharing in logistics outsourcing: trust leads to success in
the digital era. Int. J. Collab. Enterp. 6(2), 150–175 (2020). https://doi.org/10.1504/IJCENT.
2020.110221
56. Kärri, T., Marttonen-Arola, S., Kinnunen, S-K., Ylä-Kujala, A., Ali-Marttila, M., et al.: Fleet-
based industrial data symbiosis, title of parent publication: S4Fleet - service solutions for fleet
management, DIMECC Publications series No. 19, 06/2017, pp. 124–169 (2017)
57. Kinnunen, S.-K., Happonen, A., Marttonen-Arola, S., Kärri, T.: Traditional and extended
fleets in literature and practice: definition and untapped potential. Int. J. Strateg. Eng. Asset
Manag. 3(3), 239–261 (2019). https://doi.org/10.1504/IJSEAM.2019.108467
58. Metso, L., Happonen, A., Ojanen, V., Rissanen, M., Kärri, T.: Business model design ele-
ments for electric car service based on digital data enabled sharing platform, Cambridge.
In: International Manufacturing Symposium, Cambridge, UK, 26–27 September 2019, p. 6
(2019). https://doi.org/10.17863/CAM.45886
59. Palacin, V., Gilbert, S., Orchard, S., Eaton, A., Ferrario, M.A., Happonen, A.: Drivers of
participation in digital citizen science: case studies on Järviwiki and safecast. Citiz. Sci.
Theory Pract. 5(1), Article no. 22, pp. 1–20 (2020). https://doi.org/10.5334/cstp.290
60. Palacin, V., et al.: SENSEI: harnessing community wisdom for local environmental monitoring
in Finland. In: CHI Conference on Human Factors in Computing Systems, Glasgow, Scotland
UK, pp. 1–8 (2019). https://doi.org/10.1145/3290607.3299047
61. Zhang, D., Yin, C., Zeng, J., Yuan, X., Zhang, P.: Combining structured and unstructured data
for predictive models: a deep learning approach. BMC Med. Inform. Decis. Mak. 20(1), 1–11
(2020)
62. Vassev, E., Hinchey, M.: Autonomy requirements engineering. In: Vassev, E., Hinchey, M.
(eds.) Autonomy Requirements Engineering for Space Missions. NASA Monographs in Sys-
tems and Software Engineering, pp. 105–172. Springer, Cham (2014). https://doi.org/10.
1007/978-3-319-09816-6_3
A General Framework of Particle Swarm
Optimization

Loc Nguyen1(B) , Ali A. Amer2 , and Hassan I. Abdalla3


1 Loc Nguyen’s Academic Network, Long Xuyên, Vietnam
[email protected]
2 Computer Science Department, TAIZ University, Taiz, Yemen
3 College of Technological Innovation, Zayed University, P.O. Box 144534, Abu Dhabi, UAE

Abstract. Particle swarm optimization (PSO) is an effective algorithm for solving
the optimization problem in cases where the derivative of the target function does
not exist or is difficult to determine. Because PSO has many parameters and variants, we
propose a general framework of PSO called GPSO which aggregates important
parameters and generalizes important variants so that researchers can customize
PSO easily. Moreover, two main properties of PSO are exploration and exploitation.
The exploration property aims to avoid premature convergence so as to reach
the global optimal solution, whereas the exploitation property aims to motivate PSO to
converge as fast as possible. These two aspects are equally important. Therefore,
GPSO also aims to balance the exploration and the exploitation. It is expected that
GPSO supports users in tuning parameters not only to solve the premature convergence
problem but also to achieve fast convergence.

Keywords: Global optimization · Particle Swarm Optimization (PSO) · Exploration · Exploitation

1 Introduction to Particle Swarm Optimization (PSO)

The particle swarm optimization (PSO) algorithm was developed by James Kennedy (a social
psychologist) and Russell C. Eberhart (an electrical engineer). This section follows
the article "Particle swarm optimization: An overview" by Riccardo Poli, James
Kennedy, and Tim Blackwell. The main idea of PSO is based on social intelligence:
it simulates how a flock of birds searches for food. Given a target function known as
the cost function f(x), the optimization problem is to find the minimum point x*, known
as the minimizer or optimizer, so that f(x*) is minimal. In PSO theory, f(x) is also called
the fitness function; thus, when f(x) is evaluated at x0, f(x0) is called a fitness
value, which represents the best food source for which a flock of birds searches. If x* is an
optimizer, f(x*) is called the optimal value, best value, or best fitness value. As a convention,
the optimization problem is a global minimization problem when x* is searched over the entire
domain of f(x). For global maximization, it suffices to change our viewpoint slightly.

x* = argmin_x f(x)
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 307–316, 2023.
https://doi.org/10.1007/978-3-031-18461-1_20
308 L. Nguyen et al.

Traditional local optimization methods such as Newton-Raphson and gradient descent,
along with global optimization methods, require that f(x) be differentiable. Alternately,
PSO does not require the existence of a derivative. PSO scatters a population of
candidate solutions (candidate optimizers) for x*; such a population is called a swarm,
whereas each candidate optimizer is called a particle in the swarm. PSO is an iterative
algorithm running over many iterations in which every particle is moved at each itera-
tion so that it approaches the global optimizer x* . Movement of all particles is attracted
by x* . In other words, such movement is attracted by minimizing f (x) so that f (x) is
small enough. In PSO, x is considered as position of particle. The movement of each
particle is affected by its best position and the best position of the swarm. Note, the
closer to x* , the better the position is.
As a formal definition, consider the swarm of particles, and let xi and pi be the current
position and best position of particle i. Note, pi is called the local best position. Moreover,
the movement speed of particle i is specified by its velocity vi . Let pg be the global best
position of the entire swarm. The closer to x*, the better the positions pi and pg are. It is
expected that pg equals x* or approximates x*. The ultimate purpose of PSO
is to determine pg.

pg ≈ argmin_x f(x)

Of course, xi = (xi1, xi2, …, xin)T, pi = (pi1, pi2, …, pin)T, and pg = (pg1, pg2, …, pgn)T
are n-dimensional points, and vi = (vi1, vi2, …, vin)T is an n-dimensional vector, because
f(x) maps the real n-dimensional space Rn to the real space R. Following is the pseudo-code
of PSO [1] (Table 1).
Equation 1 is the heart of PSO and is called the velocity update rule. Equation 2 is
called the position update rule. There are two most popular termination conditions:

1. The cost function at pg, evaluated as f(pg), is small enough; for example, f(pg) is
smaller than a small threshold.
2. PSO has run over a large enough number of iterations.
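The pseudo-code referenced as Table 1 did not survive extraction, so the loop can be sketched as follows, using the inertia-weight velocity update of Eq. 3 and the position update rule, with the popular parameter values quoted in the text (ω = 0.7298, φ1 = φ2 = 1.4962). This is an illustrative reconstruction under those assumptions, not the authors' exact listing:

```python
import random

def pso(f, dim, n_particles=30, iterations=200,
        omega=0.7298, phi1=1.4962, phi2=1.4962, bounds=(-5.0, 5.0)):
    """Minimize f over R^dim with basic PSO (inertia-weight variant)."""
    lo, hi = bounds
    x = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    p = [xi[:] for xi in x]                       # local best positions
    p_fit = [f(xi) for xi in x]
    g = p[min(range(n_particles), key=lambda i: p_fit[i])][:]  # global best

    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1 = random.uniform(0, phi1)      # U(0, phi1) component
                r2 = random.uniform(0, phi2)      # U(0, phi2) component
                # velocity update rule with inertial weight (Eq. 3)
                v[i][d] = (omega * v[i][d]
                           + r1 * (p[i][d] - x[i][d])
                           + r2 * (g[d] - x[i][d]))
                x[i][d] += v[i][d]                # position update rule
            fit = f(x[i])
            if fit < p_fit[i]:                    # update local best
                p[i], p_fit[i] = x[i][:], fit
                if fit < f(g):                    # update global best
                    g = x[i][:]
    return g, f(g)

# Usage: minimize the sphere function; the swarm should approach the origin.
best, best_val = pso(lambda x: sum(t * t for t in x), dim=2)
```

Both termination conditions above are easy to add: stop early when `f(g)` drops below a threshold, or simply let the iteration budget run out as this sketch does.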

Function U (0, φ 1 ) generates a random vector whose elements are random numbers
in the range [0, φ 1 ]. Similarly, function U (0, φ 2 ) generates a random vector whose
elements are random numbers in the range [0, φ 2 ]. For example,

U (0, φ1 ) = (r11 , r12 , . . . , r1n )T where 0 ≤ r1j ≤ φ1

U (0, φ2 ) = (r21 , r22 , . . . , r2n )T where 0 ≤ r2j ≤ φ2

Note, the superscript "T" indicates the transposition operator of vectors and matrices. The
operator ⊗ denotes component-wise multiplication of two points [2, p. 3]. For example,
given random vector U(0, φ1) = (r11, r12, …, r1n)T and position xi = (xi1, xi2, …, xin)T,
their component-wise multiplication is:

U(0, φ1) ⊗ xi = (r11 xi1, r12 xi2, …, r1n xin)T

Table 1. Basic particle swarm optimization (PSO) algorithm. [table body not reproduced]
   
Two components U (0, φ1 ) ⊗ pi − xi and U (0, φ2 ) ⊗ pg − xi are considered as
 
attraction forces that push
 every  particle to move. Sources of force U (0, φ1 ) ⊗ pi − xi
and force U (0, φ2 ) ⊗ pg − xi are the particle i itself and its neighbors. Thus, two most
important parameters of PSO are φ 1 and φ 2 which represent the two attraction forces.
The popularvalues of  them are φ 1 = φ 2 = 1.4962. Parameter φ 1 along with the force
U (0, φ1 ) ⊗ pi − xi express
  exploitation of PSO whereas parameter φ 2 along with
the
the force U (0, φ2 ) ⊗ pg − xi express the exploration of PSO [2, p. 4]. The larger
parameter φ 1 is, the faster PSO converges but it trends to converge at local minimizer.
In opposite, if parameter φ 2 is large, convergence to local minimizer will be avoided in
order to achieve better global optimizer but convergence speed is decreased. Parameters
φ 1 and φ 2 are also called acceleration coefficients or attraction coefficients. Especially,
φ 1 is called cognitive weight and φ 2 is called social weight because φ 1 reflects thinking
of particle itself in moving and φ 2 reflects influence of entire swarm on every particle in
moving. In practical, velocity vi can be bounded in the range [–vmax , + vmax ] in order to
avoid out of convergence trajectories but the parameter vmax is not popular because there
are some other parameters such as inertial weight and constriction coefficient (mentioned
later) which are used to damp the dynamics of particles. Favorite values for the size of
swarm (the number of particles) are ranged from 20 to 50.
Because any movement has inertia, an inertial force is added to the two attraction forces. The inertial force is represented by a so-called inertial weight ω where 0 < ω ≤ 1. The velocity update rule becomes [2, p. 4]:

$$v_i = \omega v_i + U(0,\phi_1) \otimes (p_i - x_i) + U(0,\phi_2) \otimes (p_g - x_i) \quad (3)$$

The larger the inertial weight ω is, the faster particles move because their inertia is high, which leads PSO to explore for the global optimizer. Note that moving fast does not imply fast convergence. Conversely, a smaller ω leads PSO to exploit a local optimizer. In general, a large ω expresses exploration and a small ω expresses exploitation. The quantity 1 − ω is known as the friction coefficient. The popular value of ω is 0.7298 given φ1 = φ2 = 1.4962.
Pioneers in PSO [2, p. 5] recognized that if the velocities vi of particles are not restricted, their movements can depart from convergent trajectories to unacceptable levels. Therefore, they proposed a so-called constriction coefficient χ to damp the dynamics of particles. Note that χ is also called the constriction weight or damping weight, where 0 < χ ≤ 1. With the support
of the constriction coefficient, Eq. 1 becomes [2, p. 5]:

$$v_i = \chi \left( v_i + U(0,\phi_1) \otimes (p_i - x_i) + U(0,\phi_2) \otimes (p_g - x_i) \right) \quad (4)$$

It is easy to recognize that Eq. 3 is a special case of Eq. 4 when the expression χvi is considered equivalent to the expression ωvi. The popular value of the constriction coefficient is χ = 0.7298 given φ1 = φ2 = 2.05 and ω = 1. Note that the inertial weight ω is also a parameter that damps the dynamics of particles; this is why ω is set to 1 when χ is used. The constriction of χ is stronger than that of ω, because χ affects both the previous velocity and the two attraction forces, whereas ω affects only the previous velocity.
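As a concrete illustration, the constriction-type update of Eq. 4 can be sketched in a few lines of Python. This is a minimal sketch under our own naming (the function and argument names are not from the paper); U(0, φ) is sampled per component, as in the paper:

```python
import random

def update_velocity(v, x, p_i, p_g, phi1=2.05, phi2=2.05, chi=0.7298):
    """Constriction-type velocity update (one particle, element-wise).

    v, x, p_i, p_g are equal-length lists: current velocity, current
    position, personal best position, and global best position.
    """
    new_v = []
    for j in range(len(x)):
        r1 = random.uniform(0.0, phi1)  # jth component of U(0, phi1)
        r2 = random.uniform(0.0, phi2)  # jth component of U(0, phi2)
        new_v.append(chi * (v[j] + r1 * (p_i[j] - x[j]) + r2 * (p_g[j] - x[j])))
    return new_v
```

The position update xi = xi + vi then follows, as in the standard formulation of the basic algorithm.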
The structure of the swarm, which is determined by defining the neighbors and neighborhood of every particle, is called the swarm topology or population topology. Because pg is the best position of the entire swarm, the attraction force U(0, φ2) ⊗ (pg − xi) indicates that the movement of each particle is affected by all other particles; in other words, every particle connects to all remaining particles. Thus, the neighbors of a particle are all other particles, which is known as the fully connected swarm topology. For an easily understandable explanation, suppose particles are vertices of a graph; a fully connected swarm topology then means the graph itself is fully connected, with all vertices connected together. Alternatively, the swarm topology can be defined differently so that each particle i connects with only a limited number Ki of other particles. In other words, each particle has only
Ki neighbors. With a custom-defined swarm topology, Eq. 4 is written as follows [2, p. 6]:

$$v_i = \chi \left( v_i + \frac{1}{K_i} \sum_{k=1}^{K_i} U(0,\phi) \otimes (q_k - x_i) \right) \quad (5)$$

where qk is the best position of the kth neighbor of particle i. Of course, qk is pj of some particle j: qk = pj such that particle j is the kth neighbor of particle i.
Note that, in Eq. 5, particle i is also its own neighbor; that is, the set of Ki neighbors includes particle i. The two parameters φ1 and φ2 are reduced to a single parameter φ > 0, which implies that the strengths of all attraction forces from all neighbors on particle i are equal. The popular value of φ is 2.05 given χ = 0.7298. Equation 5 is known as Mendes' fully informed particle swarm (FIPS) method. The topology in the basic PSO specified by Eq. 1, Eq. 3, and Eq. 4 is known as the global best topology because only one best position pg of the entire swarm is tracked. However, Eq. 5 indicates that many best positions from the groups implied by neighbors are tracked. Hence, FIPS specifies a so-called local best topology, which converges slowly but avoids converging to a local optimizer. In other words, the local best topology aims at exploration rather than exploitation. As a compromise, however, FIPS makes the convergence of PSO slow because exploitation is sacrificed for exploration. Therefore, we propose a general framework of PSO in Sect. 2 which aims to balance exploration and exploitation. Moreover, in Sect. 2, we apply a probabilistic technique to parameter tuning in order to improve exploitation. PSO researchers are often concerned with the problem of premature convergence, but convergence speed is also important. Section 3 summarizes experimental results and Sect. 4 concludes the paper. In general, we hope that this research makes two contributions to the PSO research community: 1) the proposal of a general PSO framework, especially for new researchers, and 2) highlighting the importance of exploitation.
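For contrast with the basic update rules, Mendes' FIPS rule of Eq. 5 can be sketched as follows (a hedged sketch with our own names; each neighbor's best position qk contributes one attraction force, averaged over the Ki neighbors):

```python
import random

def fips_velocity(v, x, neighbor_bests, phi=2.05, chi=0.7298):
    """Fully informed velocity update: the average of attraction forces
    toward every neighbor's best position q_k replaces the p_i/p_g terms."""
    k_i = len(neighbor_bests)
    new_v = []
    for j in range(len(x)):
        pull = sum(random.uniform(0.0, phi) * (q[j] - x[j]) for q in neighbor_bests)
        new_v.append(chi * (v[j] + pull / k_i))
    return new_v
```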

2 General PSO with Probabilistic Constriction Coefficient


Recall that the two main aspects of PSO are exploration and exploitation. The exploration aspect aims to avoid premature convergence so as to reach the global optimizer, whereas the exploitation aspect aims to motivate PSO to converge as fast as possible. Besides, the exploitation property can help PSO converge more accurately, regardless of whether the target is a local or global optimizer. These two aspects are equally important. The two problems corresponding to exploration and exploitation are the premature problem and the dynamic problem, respectively. Solutions to the premature problem improve exploration, and solutions to the dynamic problem improve exploitation. The inertial weight and the constriction coefficient are common solutions for the dynamic problem; current solutions to the dynamic problem often involve tuning coefficients, which are PSO parameters. Solutions to the premature problem relate to increasing the dynamic ability of particles, such as:

– Dynamic topology.
– Change of fitness function.
– Adaptation, including tuning coefficients, adding particles, removing particles, and changing particle properties.
– Diversity control.

The proposed general framework of PSO, called GPSO, aims to balance exploration and exploitation, thereby addressing both the premature problem and the dynamic problem. Observing that the attraction force issued by the particle i itself is on a par with the attraction force from the global best position pg and the other attraction forces
from its neighbors qk, Eq. 5 is modified as follows:

$$v_i = \chi \left( \omega v_i + U(0,\phi_1) \otimes (p_i - x_i) + U(0,\phi_2) \otimes (p_g - x_i) + \frac{1}{K_i} \sum_{k=1}^{K_i} U(0,\phi) \otimes (q_k - x_i) \right) \quad (6)$$

In Eq. 6, the set of Ki neighbors does not include particle i, and so the three parameters φ1, φ2, and φ co-exist. The inertial weight ω is kept intact too. It is easy to recognize that Eq. 6 is the general form of the velocity update rule. In other words, GPSO is specified by Eq. 6, which balances the local best topology and the global best topology with the expectation that convergence speed is improved while convergence to a local optimizer can be avoided. That is, Eq. 6 aims to achieve both exploration and exploitation. The topology from Eq. 1, Eq. 3, Eq. 4, and Eq. 5 is static [2, p. 6] because it is kept intact over all iterations of PSO; neighbors and neighborhood in a static topology are fixed. However, in the GPSO specified by Eq. 6, it is possible to relocate the neighbors of a given particle at each iteration. Therefore, dynamic topology can be achieved in GPSO, depending on the individual application. This implies that the premature problem can be solved with GPSO so that PSO is not trapped in a local optimizer.
In PSO theory, solutions to the dynamic problem improve exploitation so that PSO can converge as fast as possible. The inertial weight and the constriction coefficient are common solutions for the dynamic problem. Hence, GPSO supports tuning coefficients; concretely, the constriction coefficient is tuned within GPSO. However, tuning a parameter does not mean that the parameter is modified arbitrarily at each iteration, because the modification must be well founded and based on valuable knowledge. Fortunately, James Kennedy and Russell C. Eberhart [2, p. 13], [3, p. 3], [4, p. 51] proposed bare bones PSO (BBPSO), in which they asserted that, given xi = (xi1, xi2, …, xin)^T, pi = (pi1, pi2, …, pin)^T, and pg = (pg1, pg2, …, pgn)^T, the jth element xij of xi follows a normal distribution with mean (pij + pgj)/2 and variance (pij − pgj)². Based on this valuable knowledge, we tune the constriction parameter χ with a normal distribution at each iteration.
Let zi = (zi1, zi2, …, zin)^T be a random vector corresponding to each position xi of particle i. Every jth element zij of zi is randomized according to the normal distribution

$$N\!\left( \frac{p_{ij}+p_{gj}}{2},\; \left(p_{ij}-p_{gj}\right)^2 \right) \quad (7)$$

with mean μi = (pij + pgj)/2 and variance σi² = (pij − pgj)², where

$$N\!\left(z_{ij};\, \mu_i, \sigma_i^2\right) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\!\left( -\frac{1}{2}\,\frac{(z_{ij}-\mu_i)^2}{\sigma_i^2} \right)$$

Note that N denotes the normal distribution. Let g(zij) be the pseudo probability density function of zij:

$$g(z_{ij}) = \exp\!\left( -\frac{1}{2}\,\frac{(z_{ij}-\mu_i)^2}{\sigma_i^2} \right) = \exp\!\left( -\frac{\left(z_{ij} - \frac{p_{ij}+p_{gj}}{2}\right)^2}{2\left(p_{ij}-p_{gj}\right)^2} \right) \quad (8)$$

Of course, we have:

$$g(z_{ij}) \sim N\!\left( \mu_i = \frac{p_{ij}+p_{gj}}{2},\; \sigma_i^2 = \left(p_{ij}-p_{gj}\right)^2 \right)$$

Note that the sign "∼" denotes proportionality. Let X = (χ1, χ2, …, χn)^T be the probabilistic constriction coefficient specified by Eq. 9:

$$\chi_j = \begin{cases} 0 & \text{if } p_{ij} = p_{gj} \text{ and } z_{ij} \ne \frac{p_{ij}+p_{gj}}{2} \\ 1 & \text{if } p_{ij} = p_{gj} \text{ and } z_{ij} = \frac{p_{ij}+p_{gj}}{2} \\ g(z_{ij}) & \text{if } p_{ij} \ne p_{gj} \end{cases} \quad (9)$$

Note that X is an n-dimensional vector. The GPSO velocity update rule specified by Eq. 6 is modified as follows:

$$v_i = X \otimes \left( \omega v_i + U(0,\phi_1) \otimes (p_i - x_i) + U(0,\phi_2) \otimes (p_g - x_i) + \frac{1}{K_i} \sum_{k=1}^{K_i} U(0,\phi) \otimes (q_k - x_i) \right) \quad (10)$$

In Eq. 10, the constriction coefficient χ is replaced by the probabilistic constriction coefficient X. Obviously, Eq. 10 is the most general form of the GPSO velocity update rule. According to Eq. 10 with probabilistic constriction coefficient X, the closer the local best position pi is to the global best position pg, the more dynamic the position xi is, which aims at exploration for converging to the global optimizer. The farther the local best position pi is from the global best position pg, the less dynamic the position xi is, which aims at exploitation for fast convergence. This is the purpose of adding the probabilistic constriction coefficient X to Eq. 6 for solving the dynamic problem. As a convention, the GPSO specified by Eq. 10 is called probabilistic GPSO. Source code of GPSO and probabilistic GPSO is available at https://github.com/ngphloc/ai/tree/main/3_implementation/src/net/ea/pso.
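The probabilistic constriction coefficient of Eqs. 7–9 can be sketched as follows. This is our own minimal illustration, not the implementation at the URL above; it samples zij from the bare-bones distribution of Eq. 7 and evaluates the pseudo density g of Eq. 8, returning 1 in the degenerate case pij = pgj (where the sampled zij always equals the mean):

```python
import math
import random

def prob_constriction(p_i, p_g):
    """Per-dimension probabilistic constriction coefficient X = (chi_1, ..., chi_n)."""
    chis = []
    for pij, pgj in zip(p_i, p_g):
        if pij == pgj:
            chis.append(1.0)  # degenerate distribution: z_ij equals the mean
        else:
            mu = (pij + pgj) / 2.0
            sigma = abs(pij - pgj)  # standard deviation; variance is (pij - pgj)^2
            z = random.gauss(mu, sigma)
            chis.append(math.exp(-0.5 * ((z - mu) / sigma) ** 2))
    return chis
```

Each χj lies in (0, 1] and multiplies the corresponding component of the bracketed term in Eq. 10.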

3 Experimental Results and Discussions


GPSO specified by Eq. 6 and probabilistic GPSO specified by Eq. 10 are tested against the basic PSO specified by Eq. 4. The cost function (fitness function) is [5, p. 24]:

$$f\!\left(x = (x_1, x_2)^T\right) = -\cos(x_1)\cos(x_2)\exp\!\left( -(x_1-\pi)^2 - (x_2-\pi)^2 \right) \quad (11)$$

The lower and upper bounds of positions in the initialization stage are lb = (−10, −10)^T and ub = (10, 10)^T. The termination condition is that the difference between the current global best value and the previous global best value is less than ε = 0.01. Parameters of GPSO are φ1 = φ2 = φ = 2.05, ω = 1, and χ = 0.7298. Parameters of probabilistic GPSO are φ1 = φ2 = φ = 2.05. Parameters of basic PSO are φ1 = φ2 = 2.05 and χ = 0.7298. The swarm size is 50. For the three PSO variants, a dynamic topology is established at each iteration by a so-called fitness distance ratio (FDR). Exactly, Peram [2, p. 8] defined the topology dynamically at each iteration by FDR. Given a target particle i and another particle j, their FDR is the ratio of the difference between f(xi) and f(xj) to the Euclidean distance
between xi and xj:

$$FDR(x_i, x_j) = \frac{\left| f(x_i) - f(x_j) \right|}{\left\| x_i - x_j \right\|} \quad (12)$$

Given a target particle i, if FDR(xi, xj) is larger than a threshold (> 1), particle j is a neighbor of the target particle i. Alternatively, the top K particles whose FDRs with xi are largest are the K neighbors of particle i.
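The FDR neighborhood rule above can be sketched as follows (a minimal sketch; the fitness function f is passed in, and the threshold variant is shown):

```python
import math

def fdr(f, xi, xj):
    """Fitness distance ratio of Eq. 12 between particles at xi and xj."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))
    return abs(f(xi) - f(xj)) / dist

def neighbors_by_fdr(f, xi, others, threshold=1.0):
    """Particles whose FDR with xi exceeds the threshold become neighbors of xi."""
    return [xj for xj in others if fdr(f, xi, xj) > threshold]
```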
From the experiment, basic PSO, GPSO, and probabilistic GPSO converge to best values −0.9842, −0.9973, and −0.9999 with global best positions (3.0421, 3.1151)^T, (3.1837, 3.1352)^T, and (3.1464, 3.1485)^T after 6, 18, and 18 iterations, respectively. The true best value of the target function specified by Eq. 11 is −1, whereas the true global optimizer is x* = (3.1416, 3.1416)^T. Therefore, the biases in best values (fitness biases) of basic PSO, GPSO, and probabilistic GPSO are 0.0158, 0.0027, and 0.0001, respectively, and the biases in best positions (optimizer biases) are (0.0995, 0.0265)^T, (0.0421, 0.0064)^T, and (0.0048, 0.0069)^T, respectively.
Table 2. Evaluation of PSO algorithms.

Algorithm            Fitness bias   Optimizer bias       Converged iteration
Basic PSO            0.0158         (0.0995, 0.0265)^T   6
GPSO                 0.0027         (0.0421, 0.0064)^T   18
Probabilistic GPSO   0.0001         (0.0048, 0.0069)^T   18

From Table 2, the fitness bias and optimizer bias of probabilistic GPSO are the smallest; therefore, probabilistic GPSO is the preeminent algorithm. Basic PSO converges soonest, after 6 iterations, but it suffers from the premature problem, as shown by its worst converged fitness value, whereas both GPSO and probabilistic GPSO mitigate the premature problem with better converged fitness values (−0.9973 and −0.9999) at the cost of more iterations (18). The reason GPSO outperforms basic PSO is the combination of the local best topology and the global best topology in GPSO. The fact that probabilistic GPSO outperforms GPSO shows that the probabilistic constriction coefficient can solve the dynamic problem. In terms of fitness bias, probabilistic GPSO is 27 times better than normal GPSO, which implies that exploitation is as important as exploration. In some situations where there are many local optimizers, reaching a good enough local optimizer can be acceptable and more feasible than insisting on the exact global optimizer. Practical PSO attracts researchers' attention because it sidesteps the complexity of pure mathematical global optimization, which gets stuck on how to find the global optimizer with certainty. Therefore, the fact that probabilistic GPSO improves convergence speed is meaningful. Moreover, it does not restrict the dynamics of particles; rather, it steers the dynamics of particles toward optimal trends with the support of a probability distribution. Thus, it also balances the two PSO properties, exploration and exploitation.

4 Conclusions
The first purpose of GPSO, to aggregate important parameters and generalize important variants, is completed with the general form of the velocity update rule. The second purpose, to balance the two PSO properties of exploration and exploitation, is reached at a moderate rate, although experimental results showed that GPSO and probabilistic GPSO are better than basic PSO thanks to the combination of the local best topology and the global best topology along with the definition of the probabilistic constriction coefficient, which proved an improvement of global convergence. The reason the balance is only moderate is that dynamic topology in GPSO is supported indirectly via the general form of the velocity update rule, which is impractical because researchers must modify the source code of GPSO in order to define a dynamic topology. Moreover, the premature problem is addressed by many other solutions such as dynamic topology, change of fitness function, adaptation (tuning coefficients, adding particles, removing particles, changing particle properties), and diversity control over iterations. In future work, we will implement dynamic solutions with the support of other evolutionary algorithms such as the artificial bee colony algorithm and the genetic algorithm. Moreover, we will research how to apply PSO to training neural networks.
References
1. Wikipedia: Particle swarm optimization. (Wikimedia Foundation), 7 March 2017. https://en.
wikipedia.org/wiki/Particle_swarm_optimization. Accessed 8 Apr 2017
2. Poli, R., Kennedy, J., Blackwell, T.: Particle swarm optimization. In: Dorigo, M. (ed.) Swarm
Intelligence, vol. 1, no. 1, pp. 33–57, June 2007. https://doi.org/10.1007/s11721-007-0002-0
3. Pan, F., Hu, X., Eberhart, R., Chen, Y.: An analysis of bare bones particle swarm. In: IEEE
Swarm Intelligence Symposium 2008 (SIS 2008), St. Louis, MO, US, pp. 1–5. IEEE, 21
September 2008. https://doi.org/10.1109/SIS.2008.4668301
4. al-Rifaie, M.M., Blackwell, T.: Bare bones particle swarms with jumps. In: Dorigo, M., et al.
(eds.) ANTS 2012. LNCS, vol. 7461, pp. 49–60. Springer, Heidelberg (2012). https://doi.org/
10.1007/978-3-642-32650-9_5
5. Sharma, K., Chhamunya, V., Gupta, P.C., Sharma, H., Bansal, J.C.: Fitness based particle
swarm optimization. Int. J. Syst. Assur. Eng. Manag. 6(3), 319–329 (2015). https://doi.org/10.
1007/s13198-015-0372-4
How Artificial Intelligence and Videogames Drive Each Other Forward

Nathanel Fawcett and Lucien Ngalamoum(B)

Lewis University, Romeoville, IL 60446, USA


{nathanelvfawcett,ngalamlu}@lewisu.edu

Abstract. This paper details multiple areas in which artificial intelligence and video games interact, discusses how these fields can continue to grow together, and highlights some of the other fields that benefit from the combination. Artificial intelligence has a positive effect not just on video games; the two can also be combined for the benefit of fields such as education, the military, healthcare, and aerospace through the use of simulations. In turn, video games foster an environment in which artificial intelligence can grow, be it through competitions to create the best artificial intelligence player, the use of artificially intelligent characters in video games, or the production of video games.

Keywords: Artificial intelligence · Video games · Simulations

1 Introduction
Some level of artificial intelligence has been an integral feature of video games since the earliest days. Pong, which is commonly perceived as the first game ever made (although it is not), featured what could be interpreted as an intelligent opponent. Early Mario games had enemies with unique characteristics: the green koopas would walk forward until they hit a wall or fell off a cliff, whereas red koopas would turn away from cliffs. While not a real intelligence, the red koopas were perceived as more intelligent. The perception, or illusion, of intelligence has become a staple characteristic of video games. Modern video games have become more complex, requiring an equally more complex illusion of intelligence. In games such as the Fable franchise, non-player characters perceive the player as good or evil based on their deeds. In The Elder Scrolls V: Skyrim, non-player characters discuss the player's exploits before they inevitably go about their day. While these tricks have been good enough to satisfy gamers thus far, how long will a simple illusion of intelligence remain satisfactory in games? Artificial intelligence has already become a common topic of discussion, and while game AI may not be on the same level as artificial intelligence research, the gap is closing. More games feature some level of implementation. Players and researchers are creating artificial intelligence to play games. Studies are preparing for artificial
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 317–327, 2023.
https://doi.org/10.1007/978-3-031-18461-1_21
318 N. Fawcett and L. Ngalamoum

intelligence to take part in the production process. Video games and artificial intelligence even help other fields of study; these combined subjects are used in education and healthcare. Simulations are used for training purposes in a variety of fields where failure would be too much of a risk, including the military and aerospace.

2 The Problem

The primary intention of this paper is to discuss the progression of artificial intelligence in the video game industry. This paper will discuss how artificial intelligence and video games can still grow together, as well as how the fusion of these technologies has benefited a variety of other industries and can continue to do so. The goal of this paper is to bring attention to the growing functionality of these combined subjects in anticipation of the future. This includes examining:

1. How video games and AI are used in conjunction for competitions and education.
2. How AI can be used in game development as non-player characters.
3. How AI can be used at various stages in the production of games.
4. How AI and video games can function together to create tools in different fields.
5. How AI can benefit from being used in video games.

This study proposes that video games and AI are a combination that will continue to benefit each other as well as other fields.

3 Intro to AI - AI and Video Games in Education


It is typical in higher education for a student interested in computer science to learn about artificial intelligence. AI course material is frequently populated with video game topics. One common task found in AI courses is the use of the minimax (MM) algorithm to solve a game of tic-tac-toe; MM is a recursive algorithm used to maximize gain and minimize loss. Game development subjects are present in AI courses not only because the combination is engaging, but because, given its reliance on the illusion of intelligence, game development is becoming inseparable from AI. From the first games ever made, AI has been the characteristic that created a challenge for players. The case study by C. T. Pozzer and B. Karlsson [8] recommends using game development to teach many advanced computer science topics, as this keeps students engaged with the course work. Video games are the reason that some students have found an interest in computer science. While game development is very much a process of its own, creating a game is seen as an enjoyable and engaging method of learning. Certain topics of game development translate easily to other subjects in computer science, like artificial intelligence.
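The minimax exercise mentioned above can be made concrete with a short sketch (our own toy version, not from any cited course): terminal tic-tac-toe boards are scored +1 for an 'X' win, −1 for an 'O' win, and 0 for a draw, and the recursion maximizes for 'X' and minimizes for 'O':

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Best achievable score for 'X' with optimal play from both sides."""
    w = winner(board)
    if w is not None:
        return 1 if w == 'X' else -1
    free = [i for i, cell in enumerate(board) if cell == ' ']
    if not free:
        return 0  # draw
    scores = []
    for i in free:
        board[i] = player
        scores.append(minimax(board, 'O' if player == 'X' else 'X'))
        board[i] = ' '  # undo the move
    return max(scores) if player == 'X' else min(scores)
```

With optimal play from both sides, the empty board evaluates to 0: tic-tac-toe is a draw.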
Artificial Intelligence and Videogames 319

4 Artificial Intelligence Players


Computer science students will encounter a number of different competitions, including hacking, game development, and AI competitions. AI player tournaments, like most software-based competitions, have grown in popularity since their introduction. These tournaments are typically judged on win/loss ratio and normally pit AI player against AI player. The issue with this kind of tournament is that, since it focuses on artificial opponents, the resulting AI players may not perform sufficiently well against human players. Human players can understand more abstract details about a game and behave entirely outside what is a predictable perimeter for an AI. M. Kim, K. Kim, S. Kim and A. K. Dey [1] evaluated 18 AI players in a game of StarCraft, a popular real-time strategy (RTS) game by Blizzard. RTS games focus on controlling resources in order to gain an advantage over opponents and eventually defeat them in a top-down game of war. The authors in [1] had 45 human players each rate their artificial opponents in the categories of human likeness, decision making, prediction, operation, build order, micro control, and performance. In this study, AI players with high win ratios did not frequently place near the top of the categories judged. The findings did not show a correlation between the player-rated characteristics and the win ratios of the AI players. This could be a result of years of focusing the implementation of the AI players on beating previous AI competition winners, which can be expected from this style of competition. Measuring only the win/loss ratio may be satisfactory for an AI-to-AI competition; however, it would be more beneficial to measure a whole list of features where human interaction is concerned. Obviously, these AIs do not need to pass a Turing test to win an AI vs. AI competition, but there may be room for new kinds of competitions. Competitions of this nature are useful for fostering an interest in AI development [2], but their winners would not be immediately functional as AI for non-player characters.

5 Artificial Intelligence as NPCs


Non-player characters (NPCs) are an essential aspect of almost every game, be it enemies, support characters, or minor characters that just function as shopkeepers and townsfolk. This section will discuss the usage of AI for each of these subcategories of NPCs.

5.1 AI as Enemies and Rivals

It is a necessary part of a game to offer a player some sort of challenge. Challenges


most commonly appear as enemies in video games. In this regard, the term AI
frequently deviates from the academic definition of AI. However, there have been
cases of machine learning implemented as enemies in video games. This trend
is becoming more evident. The article by J. Švelch [13] concludes that Alien:
Isolation used several tricks to make a difficult, intelligent monster. The game
utilized a foreground AI system that actually consisted of a decision tree, and


a background AI, or a game director that would place the alien near the player
when he or she starts believing that the alien might be gone. The combination of
these tactics created a more horrifying experience with an unknowable enemy.
Alien: Isolation was fairly well received, but some players felt that the mon-
ster cheated, due to the monster’s adaptiveness. Players all had totally unique
experiences in this game depending on how they played, an especially appealing
feature in horror games. This method of unfair adaptability works well in a horror game but would require more humanistic flaws in other settings. Enemies and rivals in most genres can equally benefit from similar AI implementation; however, one could argue that such an overwhelming advantage is only appealing in horror games. Normal NPCs should not necessarily be designed to win, only to challenge the player. One such example may be a first-person shooter (FPS) game, where enemies are penalized for defeating the player, but rewarded
for lowering the player’s health. Rivals in games can feature a more dynamic
interaction with the player. Consider a good and evil scale, similar to the one
implemented in the Fable franchise. Rivals could come up with a whole host
of reasons to oppose the player and learn how to defeat him or her, based on
the choices that the player makes and how they play the game. Theoretically, a
character that wants to stop the player at any cost would know what weapons
to bring to the fight.

5.2 AI as Support Characters


Support characters, sometimes companions, such as Lydia from The Elder Scrolls
V: Skyrim (shown in Fig. 1) are the characters that help the player through the
story.

Fig. 1. Screenshot of The Elder Scrolls V: Skyrim



Providing support characters with a neural network (NN) would create a


dynamic relationship between the character and the player. The support char-
acter could create tactics to better assist the player even changing playing styles
as needed. A companion character could know that if the player’s style of play
harbors a weakness, they could compensate by automatically adapting their play
style. For example, in a fantasy game a companion might choose to use a bow or
cast spells if the player only ever uses melee weaponry. A companion character
could additionally come up with various unique dialogues, including suggestions
and tips for the player or a joke regarding events that occurred in the game. The
companion could also voice unique grievances with the player. Assuming that
the player role plays as an evil character, the companion would have a dynamic
response, in place of the commonplace method of leaving the party once the
player surpasses a point on a gauge.

5.3 AI as Minor Characters


Minor characters, such as shopkeepers and townsfolk do not necessitate the most
robust of AIs. Whether or not the NPC likes the player, perhaps an observation
of what the player has on hand, or recognition of the player’s reputation would
seem like more than enough. Some video games have implemented more inter-
actions between these otherwise insignificant characters. In those video games,
a player could follow a NPC around and learn that they were programmed as
if to actually have a life. It would be quite a feature to maintain a game where
every NPC had unique characteristics and could learn and change the way they
interact, but providing some level of immersion in this area could add a whole
new dynamic to a game. An alternative to providing every NPC with an unique
AI would be to have a single overarching background AI or game director control
the socioeconomic characteristics of the game. This second approach could be
used to set a dynamic economy like that of the game franchise Fable, even if
Fable’s economy was easily exploited. The use of a NN would be able to consider
the consequences of actions that the player took. Is the player performing heroic
deeds, selling some really good equipment, or destroying a town? At the time
of this writing, most games do not utilize a high-level economic system. Current implementations usually consist of the ability to buy and sell goods either without limit or limited by how much currency the shopkeeper has.

5.4 Game Director


Artificial intelligence does not have to be implemented as a player character or NPC; it can also be implemented in the game director, dynamically changing the difficulty of parts of the game to provide support to novice players or to create a greater challenge for more experienced players. It goes without saying that there should be an override for dynamic difficulty; however, this feature can be further implemented to generate a baseline for players. The baseline can then be used to assess when players are performing inconsistently and to offer tips or suggest that the player take a break. Dynamic difficulty is already a feature in games like Left 4 Dead and Metal Gear Solid 5. Dynamic difficulty and player analysis would make it possible for the player to receive tips and strategies to improve at a game. A game director AI would additionally be able to provide more dynamic responses to negative player actions and redirect them towards a more productive activity, instead of simply relying on a sweeping ban; a point made by M. O. Riedl and A. Zook [12].

5.5 Long Term AI Agents


Long-term AI agents would be AI elements used in either a long-lived game, such as a massively multiplayer online (MMO) game, or a series of games. There are no current implementations of this feature; however, it would be able to provide a dynamic series. The article by M. O. Riedl and A. Zook [12] discusses how a long-term agent would be able to analyze how a player interacts in a game. Not only can
a long term agent inform a player how they could improve, but can even make rec-
ommendations for other video games that the player might be interested in, based
on their play style. The utilization of long term agents could provide players with a
push to advance as competitive gamers. Long term agents could create a baseline
for the respective player, so that it can tell if they are underperforming. The agent
could then suggest that the player take a break or provide them with a tip in order
to improve their playstyle. This baseline feature would also be beneficial to gamers
playing VR games, as some players report motion sickness from playing too long.
The agent could remind the player to take a break.

6 Artificial Intelligence Game Production


6.1 Dynamic Scripting
Dynamic scripting can be used to further provide players with an adaptive chal-
lenge. In this concept, an artificial intelligence will have a training set composed
of both manually and automatically designed tactics. M. J. V. Ponsen, H. Muñoz-
Avila, P. Spronck, and D. W. Aha [3] analyze how dynamic scripting can be used
in an RTS game called Wargus, which is a clone of a Warcraft game, but with the
added benefit of having an open source engine. The implementation of dynamic
scripting would result in an AI agent that is able to dynamically respond to
human strategies. The AI would function by providing a weight to each possible
action and choosing the best option, although a number of these actions will
have to be predefined. Use of this method could allow game developers to spend
less time on game AI, spending that time instead on gameplay logic and design.
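The weight-adjustment loop described above can be sketched in a few lines of Python. This is a minimal illustration of the dynamic-scripting idea, not the Wargus implementation: the rule names, learning rate, and weight bounds are invented for the example.

```python
import random

def choose_action(weights):
    """Roulette-wheel selection: pick a rule with probability proportional to its weight."""
    total = sum(weights.values())
    r = random.uniform(0, total)
    acc = 0.0
    for action, w in weights.items():
        acc += w
        if r <= acc:
            return action
    return action  # floating-point fallback: return the last rule

def update_weights(weights, used_actions, won, delta=0.2, w_min=0.1, w_max=10.0):
    """Reward the rules used in a winning episode and penalize them after a loss,
    clamping every weight to [w_min, w_max] so no rule is ever ruled out."""
    for action in used_actions:
        change = delta if won else -delta
        weights[action] = min(w_max, max(w_min, weights[action] + change))
    return weights

# Predefined tactics start with equal weights; play reshapes them over episodes.
tactics = {"rush": 1.0, "defend": 1.0, "expand": 1.0, "tech_up": 1.0}
episode = [choose_action(tactics) for _ in range(3)]
update_weights(tactics, episode, won=True)
```

Over many episodes the weights drift toward the tactics that beat the current opponent, which is the adaptive behavior the authors describe.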

6.2 Automated Understanding and Game Development


AI already makes an appearance in level generation, although in this case the
AI must be provided with art. Randomly generated levels, a feature that
is common in roguelike games, have surged in popularity. Automated under-
standing is one concept that can be applied to game development.
M. Guzdial, B. Li, and M. O. Riedl [7] created a first-of-its-kind artificial intel-
ligence that replicates a game engine, or predicts the game’s backend rules. To
accomplish this task, the model is provided with a spritesheet, or a 2D image
of artwork used in the game, and videos of gameplay. In the case of this project, the
team chose to use a classic: Super Mario Bros (shown in Fig. 2).

Fig. 2. Screenshot from Super Mario Bros.

The project had some shortcomings, as it could not make conclusions regard-
ing level transitions and losing a life. Because automated understanding is a new
field, flaws are to be expected, but as this technology matures, implementation in
game development should follow. Once significantly improved, automated
understanding would allow game developers to spend less time writing
backend rules. As an added benefit, losing a game engine would be a thing of
the past. M. J. Nelson and M. Mateas [9] detailed a prototype that would cre-
ate micro games with user-requested themes, similar to those of WarioWare, a
popular Nintendo game franchise that consists of many minigames that have the
player do short tasks, including dodging something or filling a gauge. This area
of research has significant room for growth. While neither of the works mentioned
attempts to interact with an advanced gameplay engine, it can be inferred that
continued strides will lead to more complex engines.

6.3 Narrative Intelligence in Game Development


Narrative intelligence is a human-centered goal for artificial intelligence. Nar-
rative intelligence, the ability to tell stories, is seen as the characteristic that
separates humankind from animals. It has been a goal since artificial
intelligence’s earliest days to implement this human characteristic in AI.
Riedl [10] discussed the enculturation of artificial intelligence and how narra-
tive intelligence will allow humans and AI to better communicate. Many articles
claim that strides have been made in this area of research; however, there is no
clear indication that AI has a firm grasp of narrative intelligence at the time
of this writing. Developments in narrative intelligence can directly translate to
video game narratives.

6.4 AI in Art

Artificial intelligence does not yet hold a prominent position in art.
Still, strides are being made to generate 3D models. These models are normally
generated from a static image of a real-life object; however, work is being done to
improve them. The authors in [5] demonstrate an extension to existing
software that uses a variety of images to improve the 3D model. The method
in [6] achieves high-quality reconstruction of facial features
from a still partial image of a face. These technologies do not illustrate artistic abil-
ity in a natural sense; however, advancement in these fields can lead to generating
environments and character models for use in video games.

6.5 Using AI to Test Video Games

An essential part of development is testing. The author in [14] discusses a variety
of automated testing methods that are present in video games today. Currently,
video game studios largely hire testers or use beta testing methods that can either
be open to the public or selected by some criteria, usually lottery or system specs. It
is not unlikely that a company chooses to use a combination of AI testing, in-house
testing, and beta testing. The paper categorizes the approaches of automated test-
ing into three categories: human imitation, scenario-based, and goal-
based. As an example of human imitation, an AI agent would attempt to play through
select game content in a simulation of a human player to see if bugs would appear as
it makes its way through the level. An artificial neural network (ANN) would better
simulate human behaviors in playtests. An example of a scenario-based approach
is dynamic scripting (DS). In this version of DS, the agent would play a level with
its previous knowledge, but it would select a randomizer for the new playthrough.
Goal-based approaches include hyper-heuristics, which utilize a general game-
playing agent that is not specialized in any game. It chooses the best way to play a
game based on its knowledge.

7 Using AI and Video Games in Other Industries

7.1 Using AI and Video Games in Healthcare

Video games incorporated with artificial intelligence can be used to create per-
sonalized rehabilitation programs for patients. S. Sadeghi Esfahlani, J. Butt, and
H. Shirvani [4] used a combination of an armband, a Kinect, and foot pedals to
create an engaging fruit-grabbing game that functions to provide patients with
extended program duration and increase the amount of movements performed by
the patients per session. The implementation of video games and AI would serve
to provide individuals requiring rehabilitation sessions with a method that is
engaging, personalized, and potentially both cheaper and performed from home.
The video game aspect keeps the patients engaged, and the AI aspect evolves
the workouts, analyzes the progress, and continues to challenge the patient until
they are deemed to have completed the treatment.

7.2 Advantages of Simulations


Simulations are already being used in a number of different fields, including mil-
itary and aerospace. R. J. Stone, P. Panfilov, and V. Shukshunov [11] conclude
that the benefits of simulations in the aerospace community include an
improved awareness of the situation, a reduction in the time needed to train, and an
enhanced ability to recall information. Implementing AI into these simulations
can generate a more responsive training simulation. Individuals going through
simulations will be able to train to deal with situations that have dynamic
responses.
R. Kamath and R. Kamat [16] discuss the field of intelligent virtual envi-
ronment (IVE), which is the convergence of AI and virtual reality (VR). This
combination has many exciting areas of research. Notably, the paper suggests
that VR based robotic systems provide a method to operate and train to operate
complex machinery. This same technology can be used to immerse the user in
various studies that would be otherwise too expensive or risky.

8 Using Video Games for Real Life AI Agents

Game development is not the only field that benefits from AI. The reverse
can be true as well. While game AI is an essential part of game development,
the utilization of AI in video games can additionally be applicable to real life
AI. Lample and Chaplot [15] augmented a deep recurrent Q-network (DRQN)
with game information in order to implement an AI agent that can operate
in a partially observable scenario. Deep reinforcement learning has done well
in 2D games, but this is considered a fully observable scenario. Lample and
Chaplot applied deep learning methods to a copy of the Doom game engine,
where their agent outperformed human players. The agent’s tasks were separated into
two categories: navigation and action. Lample and Chaplot make a point that
the former is applicable to robotics, as deciphering a 3D environment from a
partially observable perspective is similarly a challenge in robotics.

9 Conclusion and Further Work

Numerous benefits regarding the amalgamation of artificial intelligence and video
games have been detailed in this paper, as well as how artificial intelligence can
continue to grow and enhance video games. At the time of this paper, most
video games take an illusion-of-intelligence approach to game AI, instead of a
full implementation. A full AI implementation can lead to longer lasting games.
Players will replay the same games to see how different actions interact with and
modify the AI agents in the game.
Game directors and long term AI agents can help to improve a player’s skill at
a game. Players could participate at a competitive level that they normally could
not, thanks to the training provided by AI agents assisting in their gameplay.
The presence of AI in video games is increasing all the time in all forms.
AI player competitions will continue not just to bring people into the fusion of
high level AI and video games, but to indicate that AI presence in video games
is an ever growing milestone that needs to be further acknowledged by game
developers. Games that have a greater perception of intelligence and immersion
are generally both well received and anticipated. Moving forward, more exper-
imental games with high-functioning AI will begin getting published, and with
that popularity, larger companies will begin incorporating these features, similar
to how a game feature or genre experiences a wave of popularity. This is traceable to
the introduction of quick time events, or the popularity that horror games
gained after the release of Slenderman.
AI in production is a feature that will continue to gain steam. Anyone who
has dabbled in developing a game can see that popular game engines like
Unity and Unreal have incorporated numerous tools that have already started
automating some of the processes of game development. Like every industry,
automation will inevitably take a significant hold. It is not a reach to assume
that Unity and Unreal will begin featuring more automation. Automated testing
will likely become a feature included in popular game engines, as it has
been becoming an essential part of game testing for some game development
studios.

References
1. Kim, M., Kim, K., Kim, S., Dey, A.K.: Performance evaluation gaps in a real-
time strategy game between human and artificial intelligence players. IEEE
Access 6, 13575–13586 (2018). https://ieeexplore.ieee.org/stamp/stamp.jsp?tp&
arnumber=8276283&isnumber=8274985. Accessed 28 Jun 2022. https://doi.org/
10.1109/ACCESS.2018.2800016
2. Ram, A., Ontañón, S., Mehta, M.: Artificial intelligence for adaptive computer
games. In: Twentieth International FLAIRS Conference on Artificial Intelligence,
7–9 May 2007, Key West, FL. Palo Alto, CA, AAAI Press (2007). https://www.
aaai.org/Papers/FLAIRS/2007/Flairs07-007.pdf. Accessed 28 Jun 2022
3. Ponsen, M.J.V., Muñoz-Avila, H., Spronck, P., Aha, D.W.: Automatically acquiring
domain knowledge for adaptive game AI using evolutionary learning. In: The
Seventeenth Annual Conference on Innovative Applications of Artificial Intelli-
gence: IAAI-05, 9–13 July 2005, Pittsburg, PA. Palo Alto, CA, AAAI Press (2005).
https://www.aaai.org/Papers/IAAI/2005/IAAI05-012.pdf. Accessed 28 Jun 2022
4. Sadeghi Esfahlani, S., Butt, J., Shirvani, H.: Fusion of Artificial Intelli-
gence in neuro-rehabilitation video games. IEEE Access 7, 102617–102627
(2019). https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8752216&
isnumber=8600701. Accessed 28 Jun 2022. https://doi.org/10.1109/ACCESS.
2019.2926118
5. Martin-Brualla, R., Radwan, N., Sajjadi, M.S.M., Barron, J.T., Dosovitskiy,
A., Duckworth, D.: NeRF in the wild: neural radiance fields for unconstrained
photo collections. In: 2021 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 19–25 June 2021. Computer Vision Founda-
tion (2021). https://openaccess.thecvf.com/content/CVPR2021/papers/Martin-
Brualla_NeRF_in_the_Wild_Neural_Radiance_Fields_for_Unconstrained_Photo_
CVPR_2021_paper.pdf. Accessed 28 Jun 2022
6. Saito, S., Wei, L., Hu, L., Nagano, K., Li, H.: Photorealistic facial texture infer-
ence using deep neural networks. In: 2017 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pp. 2326–2335 (2017). Accessed 28 Jun 2022.
https://doi.org/10.1109/CVPR.2017.250
7. Guzdial, M., Li, B., Riedl, M.O.: Game engine learning from video. In: Twenty-
Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), 19–
25 August 2017, Melbourne, Australia, IJCAI 2017. https://www.ijcai.org/
proceedings/2017/0518.pdf. Accessed 28 Jun 2022
8. Pozzer, C.T., Karlsson, B.: Teaching AI concepts by using casual games: a
case study. In: 8th International Conference on Intelligent Games and Simula-
tion: GAME-ON 2007, 20 November 2007, Bologna, Italy. Belgium: EUROSIS-
ETI Publications, 2007. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.
1.1.219.5083&rep=rep1&type=pdf. Accessed 28 Jun 2022
9. Nelson, M.J., Mateas, M.: Towards automated game design. In: Basili, R., Pazienza,
M.T. (eds.) AI*IA 2007. LNCS (LNAI), vol. 4733, pp. 626–637. Springer, Heidel-
berg (2007). https://doi.org/10.1007/978-3-540-74782-6_54
10. Riedl, M.O.: Computational narrative intelligence: a human-centered goal for arti-
ficial intelligence. In: CHI 2016: CHI Conference on Human Factors in Computing
Systems, 7–12 May 2016, San Jose, California. New York, NY, Association for
Computing Machinery (2016). https://arxiv.org/pdf/1602.06484.pdf. Accessed 28
Jun 2022
11. Stone, R.J., Panfilov, P.B., Shukshunov, V.E.: Evolution of aerospace simulation:
from immersive virtual reality to serious games. In: Proceedings of 5th International
Conference on Recent Advances in Space Technologies - RAST2011, pp. 655–662
(2011). https://doi.org/10.1109/RAST.2011.5966921
12. Riedl, M.O., Zook, A.: AI for game production. IEEE Conf. Comput. Intell. Games
(CIG) 2013, 1–8 (2013). https://doi.org/10.1109/CIG.2013.6633663
13. Švelch, J.: Should the monster play fair?: Reception of artificial intelligence in alien:
isolation. Int. J. Comput. Game Res. 20(2), 243–260 (2020). http://gamestudies.
org/2002/articles/jaroslav_svelch. Accessed 28 Jun 2022
14. Imants, Z.: Analysis of artificial intelligence applications for automated testing of
video games. In: Environment. Technology. Resources. Proceedings of the Interna-
tional Scientific and Practical Conference, vol. 2, p. 170 (2019). https://doi.org/
10.17770/etr2019vol2.4158
15. Lample, G., Chaplot, D.S.: Playing FPS games with deep reinforcement learning.
In: The Thirty-First AAAI Conference on Artificial Intelligence, 4–9 February 2017,
San Francisco, CA. Palo Alto, CA, AAAI Press (2017). https://www.aaai.org/ocs/
index.php/AAAI/AAAI17/paper/viewPaper/14456. Accessed 28 Jun 2022
16. Kamath, R., Kamat, R.: Integrating artificial intelligence and virtual reality
- a feasibility study. In: National Conference on Latest Advances, Trends
in Electronic Science and Technology (2014). https://www.researchgate.net/
publication/270396913_Integrating_Artificial_Intelligence_and_Virtual_Reality_-_A_
Feasibility_Study. Accessed 28 Jun 2022
Bezier Curve-Based Shape Knowledge
Acquisition and Fusion for Surrogate Model
Construction

Peng An1(B) , Wenbin Ye1 , Zizhao Wang1 , Hua Xiao2 , Yongsong Long2 , and Jia Hao1
1 Beijing Institute of Technology, Beijing 100089, China
[email protected], [email protected]
2 Jiang Nan Design & Research Institute of Machinery & Electricity, Guiyang 550009, China

Abstract. Surrogate model technology is a key technology in the field of engi-
neering design with limited data. Fusion of engineering knowledge into surrogate
models is an effective method to improve the prediction accuracy. However, engi-
neering knowledge in this field describes the complex relationship between vari-
ables, which makes it difficult to obtain quantitative knowledge. Therefore, the
engineering knowledge acquisition and fusion technology based on Bezier Curve
for complex equipment design was proposed, which covered the entire process
from knowledge acquisition to filtering and fusion. Finally, through the verification
of the Unmanned Vehicle Truss design case and test functions, the experimental
results show that the technology can achieve the effective acquisition of complex
curve knowledge and represent multi-knowledge information effectively.

Keywords: Surrogate models · Complex equipment · Knowledge representation ·
Engineering knowledge

1 Introduction
Surrogate models, as an effective optimization technique, are widely used in engineering
design fields such as materials and aerospace. The core idea is to use data-driven models
to replace the time-consuming and high-cost physical simulation engines [1–4]. How-
ever, in the design of complex equipment, because of multidisciplinary coupling and
high-dimensional design variables, it is difficult to build high-precision surrogate mod-
els with limited data. There are two main ideas to solve this problem. On the one hand,
some scholars proposed methods such as Transfer Learning [5–7] and Data Augmenta-
tion [8, 9] to increase the amount of data to improve the accuracy; On the other hand,
some scholars proposed that increasing the constraints of the model itself can improve
the accuracy. Fortunately, experts have accumulated a lot of design experience in the
design process for years and have a deep understanding of how design variables affect
product performance, which makes it a great potential to integrate design knowledge
[4, 10–14] into surrogate models. However, design knowledge has the characteristics of
multiple types, large quantities and heterogeneous representation. Not all knowledge can be
integrated into surrogate models, and there is a lack of a unified approach to integration. There-
fore, it is very important to carry out research on the acquisition, modeling and fusion
methods of design knowledge for the construction of surrogate models.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 328–342, 2023.
https://doi.org/10.1007/978-3-031-18461-1_22
Research on knowledge acquisition has been conducted for many years. In the 1960s, scholars began
to study the acquisition and integration of design knowledge. Booker et al. [15] studied
the way of knowledge extraction and the application of fuzzy theory in expert informa-
tion; Keeney [16] discussed the knowledge extraction method and process based on the
use of knowledge in nuclear reaction systems and determined the knowledge acquisi-
tion process. Gruber [17] sorted out the three principles that must be followed in the
process of knowledge acquisition, and gradually developed the main knowledge acqui-
sition technologies such as the Interview method [18], Observation method [19] and the
Knowledge Graph [20]. The probabilistic representation of knowledge information is
an important method of knowledge fusion. John [21] summarized the method of con-
verting statistical feature information given by experts into probability distributions and
discussed the basic steps of fusing multiple pieces of knowledge to form a probability distribution;
Marcello [22] predicted the probability distribution of knowledge-fusion information by
introducing Steiner points of the knowledge distribution among experts.
However, when dealing with complex equipment design problems, design knowledge
is both numerous and miscellaneous: numerous means that the number
of design experts is large, which makes the amount of knowledge large; miscellaneous
means that the understanding of knowledge is biased due to the subjective cognitive
differences of experts. This makes it difficult to effectively apply the acquisition and
characterization methods used for structured knowledge to shape knowledge. Furthermore,
it is very important to consider that experts’ subjective cognitive differences make
their curve shape recognition inconsistent.
Therefore, how to identify the knowledge with large deviation among the numerous
design knowledge and find the consensus information of the knowledge to the greatest
extent are the core problems of dealing with the multi-knowledge fusion in the engineer-
ing field. For the problems above, this paper proposed the shape knowledge acquisition
and fusion technology for surrogate models construction, including the completion of
knowledge acquisition based on Bezier curve, the completion of knowledge filtering
based on Hausdorff Distance index, and the completion of fusion knowledge based on
Fermat Points for Finite Point Sets. The acquisition of knowledge was realized based on
the Bezier curve, and then the Hausdorff Distance and Fermat Points for Finite Point Sets
were used to reduce the amount of knowledge and effectively represent the information
of knowledge.
The rest of the paper is structured as follows: in Sect. 2, the proposed method is
explained in detail. In Sect. 3, several experiments are conducted to verify the proposed
method. A discussion based on empirical results is presented in Sect. 4, and this work
is summarized in Sect. 5.

2 Method
Product design knowledge has the characteristics of multiple types, large quantities and het-
erogeneous representation. Not all knowledge can be integrated into the surrogate model
to improve the accuracy. Our previous work [12] defined design knowledge as the map-
ping relationship between design variables and key performances from the perspective of
knowledge-assisted surrogate models construction, and divided design knowledge into
types of monotonic and shape (Table 1). Among them, monotonic knowledge describes
the monotonic relationship between performance variables and design variables; shape
knowledge is based on monotonic knowledge, and the variables satisfy a more complex
curve relationship (such as a parabola). This paper mainly studies the shape knowledge.

Table 1. The Definition of Monotonic Knowledge and Shape Knowledge

Type        Description                              Example
Monotonic   IF x1 INCREASEs,                         x1 INCREASEs in [3, 4],
            THEN y1 INCREASEs or DECREASEs           y1 INCREASEs in [10, 15]
Shape       Type S; Type U                           (curve sketches)
Fig. 1. The main steps of the method.

The Bezier curve has the advantage of drawing curves accurately, so this paper develops
an interactive tool based on its principle to acquire knowledge (step 1). Due to the large num-
ber of experts involved in the design of complex equipment, much of the acquired
knowledge carries large cognitive bias, which means the knowledge needs
to be filtered. This paper proposes a design knowledge filtering technology based on the
average of Hausdorff Distance (step 2). After the knowledge is filtered by the indicator,
the amount of retained knowledge is still large. It is very important to find the maximum
consensus of knowledge among multiple experts. This paper proposes a design knowl-
edge fusion method based on Fermat Points for Finite Point Sets (step 3) to get the final
fusion knowledge curve. The overall technical process is shown in Fig. 1.

2.1 Acquisition Knowledge Based on Bezier Curve


Bezier curve is a mathematical curve commonly used in two-dimensional graphs. It has
been proved that all continuous functions in a certain interval can be approximated by a
fitting polynomial of control points (Eq. 1):

B(t) = \sum_{i=0}^{n} C_n^i P_i (1 - t)^{n-i} t^i   (1)

where B(t) is the position of a discrete point at time t (the point on the Bezier curve moves
from the start point to the end point as t changes from 0 to 1), P_i is the position of the
control points, and C_n^i is the coefficient of the fitting curve.
When using a set of B(t) to define the shape knowledge, the shape knowledge can
be defined by several control points. In this paper, an interactive knowledge acquisition
tool directly operated by experts was developed using this characteristic. Fig. 2
shows the knowledge acquisition tool interface. When dragging control point 2 from

Fig. 2(a) to the position shown in Fig. 2(b), the fitting curve changes from Fig. 2(a) to (b);
dragging control point 4 in turn forms Fig. 2(c), and then dragging point 6 forms the fitting
curve of Fig. 2(d).

Fig. 2. The interface of knowledge acquisition (panels (a)–(d)).
Experts can thus define the knowledge they have in mind and check the fitting curve in real time
through the Bezier curve formula by increasing or decreasing the number of control
points and dragging them to appropriate locations. Therefore, the shape knowledge can
be represented by a set of point definitions.
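Eq. 1 can be evaluated directly to turn a handful of control points into a discretized knowledge curve. A minimal sketch (the control points below are illustrative, not taken from the tool):

```python
from math import comb

def bezier_point(control_points, t):
    """Evaluate B(t) = sum_i C(n,i) * P_i * (1-t)^(n-i) * t^i for one t in [0, 1]."""
    n = len(control_points) - 1
    x = sum(comb(n, i) * px * (1 - t) ** (n - i) * t ** i
            for i, (px, _) in enumerate(control_points))
    y = sum(comb(n, i) * py * (1 - t) ** (n - i) * t ** i
            for i, (_, py) in enumerate(control_points))
    return (x, y)

def bezier_curve(control_points, num=50):
    """Discretize the knowledge curve into `num` points by sweeping t from 0 to 1."""
    return [bezier_point(control_points, k / (num - 1)) for k in range(num)]

# Dragging a control point in the tool amounts to re-evaluating this curve.
curve = bezier_curve([(0.0, 0.0), (0.5, 1.0), (1.0, 0.0)], num=5)
```

Adding or removing a control point changes the degree n, which is why experts can refine the curve shape incrementally.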

2.2 Filtering Knowledge Based on Hausdorff Distance


Knowledge filtering is designed to filter out acquired knowledge whose shape is
significantly different from the others, so as to avoid decreasing the accuracy of surrogate
models. The consistency between pieces of shape knowledge can be analyzed through the
similarity of their corresponding curves. The Hausdorff Distance can measure how closely
the shapes of two curves agree. Therefore, the Hausdorff Distance is
selected as the basis for the consistency measurement between knowledge, and the aver-
age Hausdorff Distance between a single piece of knowledge and all the other knowledge
is calculated to define its overall similarity level.

Hausdorff Distance
Given two finite point sets A = {a1, a2, . . . , ap}, B = {b1, b2, . . . , bq} (Fig. 3):

Fig. 3. The definition of Hausdorff distance.

The Hausdorff Distance is defined as:

H(A, B) = \max(h(A, B), h(B, A))   (2)

h(A, B) = \max_{a \in A} \min_{b \in B} \|a - b\|   (3)

h(B, A) = \max_{b \in B} \min_{a \in A} \|a - b\|   (4)

where \|a − b\| is the Euclidean distance between a and b. The function h(A, B) is called
the directed Hausdorff Distance from A to B, which identifies the point a ∈ A that is
farthest from any point of B and measures the distance from a to its nearest neighbor in
B.
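Eqs. 2–4 translate into a brute-force computation over the discretized curve points; a small self-contained sketch:

```python
from math import dist

def directed_hausdorff(A, B):
    """h(A, B): the farthest that any point of A is from its nearest neighbor in B (Eq. 3)."""
    return max(min(dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """H(A, B) = max(h(A, B), h(B, A)) (Eq. 2)."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))
```

The directed distance is asymmetric, which is why Eq. 2 takes the maximum of both directions before comparing two knowledge curves.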
Measure Index
The similarity between any two pieces of knowledge can be characterized by the Haus-
dorff Distance between their knowledge curves. Therefore, the overall similarity level
for a single piece of knowledge can be defined by the average Hausdorff Distance of its
curve with all the remaining curves (Eq. 5):

Similarity_{Exp_i} = \frac{1}{n-1} \sum_{j=1, j \neq i}^{n} H(Exp_i, Exp_j)   (5)

where n is the number of acquired knowledge and H(Exp_i, Exp_j) is the Hausdorff Distance
between the knowledge curves Exp_i and Exp_j.

Filtering Method
After the measurement process of the above steps, each knowledge has an indicator that
characterizes its overall knowledge similarity value. The filtering of knowledge revolves
around these values, and knowledge with large values is filtered out. The Three Sigma
principle (3σ) is a commonly used error judgment principle, where σ represents the standard
deviation and μ represents the mean; Fig. 4 shows the 3σ principle:

Fig. 4. The Principle Diagram of 3Sigma.

The probability of a value falling in (μ − 1σ, μ + 1σ) is 0.6826; in (μ − 2σ, μ + 2σ) it is
0.9544; and in (μ − 3σ, μ + 3σ) it is 0.9974.
It can be considered that the values are almost all concentrated in (μ − 3σ, μ + 3σ); the
probability of falling beyond this range is less than 0.3%, and such values should be eliminated
as abnormal. This paper proposes a filtering method based on the 3σ principle, and the main
process is divided into two steps:

Step 1: For the overall similarity values Similarity_{Exp_i}, calculate the mean μ and
the standard deviation σ, and then obtain a probability distribution N(μ, σ).
Step 2: Calculate the difference between the overall similarity value of each knowledge
and the mean, and remove any knowledge whose deviation exceeds 3 times the standard
deviation (|Similarity_{Exp_i} − μ| > 3σ).
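Given a precomputed matrix of pairwise Hausdorff Distances, Eq. 5 and the two filtering steps can be sketched as follows (the matrix values and helper names are illustrative):

```python
from statistics import mean, pstdev

def similarity_scores(H):
    """Eq. 5: average Hausdorff Distance of each curve to all the others.
    H is a symmetric n x n matrix of pairwise distances with H[i][i] == 0."""
    n = len(H)
    return [sum(H[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]

def filter_3sigma(scores):
    """Keep the indices whose similarity score lies within mu +/- 3*sigma."""
    mu, sigma = mean(scores), pstdev(scores)
    return [i for i, s in enumerate(scores) if abs(s - mu) <= 3 * sigma]

# Curves whose average distance to the rest is an outlier are dropped.
pairwise = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
kept = filter_3sigma(similarity_scores(pairwise))
```

Note that with very few experts the 3σ threshold is rarely exceeded, so in practice this step mainly removes gross outliers from a large knowledge pool.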

2.3 Fusion Knowledge Based on Fermat Points for Finite Point Sets
After all knowledge are measured and filtered by the measure index and 3σ principle, the
abnormal knowledge is eliminated, but the amount of retained knowledge is still large. It
is important to find a fusion curve Expcombine from a set of knowledge curves Expi , which
can not only reduce the amount of knowledge but also ensure the effective information
of knowledge. Since the design knowledge gained in this paper is derived from
a series of points fitted by Bezier formulas (Eq. 1), this paper proposes a knowledge
fusion method based on Fermat Points for Finite Point Sets, which minimizes the average
distance between the points on the fusion curve and the filtered knowledge curves while
approximating their shape by bringing the point coordinates as close as possible.
The main steps are as follows:

Step 1: Divide the Knowledge Filtered Curve into Several Sets of Points
The number of points on the knowledge curve is uniformly determined by parame-
ter t according to Eq. 1. We divide the points on knowledge filtered curves accord-
ing to the value of each t to form a data set Pointsi , which has m coordinate points
from m pieces of knowledge curves. Finally, we obtain several data sets Pointsfermat =
[Points1 , Points2 , . . . , Pointst ] with the number of t samples.

Step 2: Solving for the Fermat Points of a Finite Set of Points


This paper adopts the method of numerical solution and iterative calculation to carry
out, which is as follow:
Define the Objective Function

f(x, y) = \min \sum_{i=1}^{n} \sqrt{(x - x_i)^2 + (y - y_i)^2}   (6)

where x and y are the targets we want, and x_i and y_i are the coordinates of the point on the
i-th knowledge curve.
Derivation of the Objective Function
The equation shows that f(x, y) is a convex function, and the zero point of its first
derivative is the solution. The problem is converted to finding the Fermat Point_j(x, y)
which makes ∇f(x, y) = 0 (Eq. 7); the expressions in y and x are analogous.

\frac{\partial f(x, y)}{\partial x} = \sum_{i=1}^{n} \frac{x - x_i}{\sqrt{(x - x_i)^2 + (y - y_i)^2}} = 0   (7)

Iterative calculation

We set a function g(x, y) = ∇f(x, y) + (x, y); the problem is then further transformed
into the fixed-point equation (x, y) = g(x, y), solved by making (x, y) = g(. . . g(g(x, y)))
through constant iterative calculation.
The expression of the finite point set Fermat Point_j(x, y) is:

x = \frac{\sum_{i=1}^{n} x_i / \sqrt{(x - x_i)^2 + (y - y_i)^2}}{\sum_{i=1}^{n} 1 / \sqrt{(x - x_i)^2 + (y - y_i)^2}}, \qquad
y = \frac{\sum_{i=1}^{n} y_i / \sqrt{(x - x_i)^2 + (y - y_i)^2}}{\sum_{i=1}^{n} 1 / \sqrt{(x - x_i)^2 + (y - y_i)^2}}   (8)

Step 3: Combine the Fermat Points_j(x, y) to Form the Fusion Curve Exp_combine
The t Fermat points computed from the t finite point sets constitute the final fusion
knowledge curve Exp_combine = [Point_1, Point_2, . . . , Point_t].
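The iteration of Eq. 8 is a fixed-point (Weiszfeld-style) update; a sketch for one point set Points_t, with an arbitrary centroid start and tolerance chosen for the example. Applying it to each of the t point sets yields the fused curve Exp_combine.

```python
from math import hypot

def fermat_point(points, iters=200, eps=1e-9):
    """Iterate Eq. 8 from the centroid until the estimate stops moving."""
    x = sum(p[0] for p in points) / len(points)
    y = sum(p[1] for p in points) / len(points)
    for _ in range(iters):
        num_x = num_y = den = 0.0
        for px, py in points:
            d = hypot(x - px, y - py)
            if d < eps:            # current estimate coincides with a data point
                return (px, py)
            num_x += px / d
            num_y += py / d
            den += 1.0 / d
        nx, ny = num_x / den, num_y / den  # Eq. 8 update for x and y
        if hypot(nx - x, ny - y) < eps:
            break
        x, y = nx, ny
    return (x, y)

# One fused point per parameter value t; repeating over all Points_t gives Exp_combine.
fused = fermat_point([(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)])
```

The returned point minimizes the total Euclidean distance to the input points, which is exactly the consensus property Sect. 2.3 asks of each fused curve point.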

3 Experiments
The proposed method is tested with two benchmark functions and one engineering
problem to verify whether the method can reduce the amount of knowledge while
ensuring the effective information of knowledge. The knowledge for the benchmark
functions is obtained by deriving the test functions and is acquired by an experimenter
with the interactive knowledge acquisition tool. The knowledge for the engineering
problem is the subjective judgment obtained by asking the designer. This section details
these experimental cases and the experimental design.

3.1 Experimental Case


Benchmark Function
Function Matyas
Function Matyas is often used to test two-dimensional functions, and the expression
is:

f(x) = 0.26(x_1^2 + x_2^2) − 0.48 x_1 x_2,  x_1 ∈ [−10, 10], x_2 ∈ [−10, 10]   (9)

Fig. 5 shows the shape knowledge, which is the change relationship between f(x) and x1 when x2 = 0.5 and x1 is set to [−10, 10]. The curve shape is simple, but the curve coordinates cover a wide range.

Fig. 5. Function Matyas baseline.


336 P. An et al.

Function Branin

The Branin function is also used as a two-dimensional test function, and its expression is:

f(x) = a(x2 − b·x1² + c·x1 − r)² + s(1 − t)·cos(x1) + s   (10)

where a = 1, b = 5.1/(4π²), c = 5/π, r = 6, s = 10, t = 1/(8π), x1 ∈ [−5, 10], x2 ∈ [0, 15].

Figure 6 shows the shape knowledge, which is the change relationship between f(x) and x1 when x2 = 5 and x1 is set to [−5, 10]. The shape of the curve is complex, and the coordinates of the curve cover a wide range.

Fig. 6. Function Branin baseline.
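For reference, the two benchmarks of Eqs. 9 and 10 can be implemented directly; the parameter values below follow the standard definitions of the Matyas and Branin functions (including the negative sign of the Matyas cross term, which is an assumption about the garbled original):

```python
import math

def matyas(x1, x2):
    # Eq. 9: global minimum f = 0 at (0, 0).
    return 0.26 * (x1**2 + x2**2) - 0.48 * x1 * x2

def branin(x1, x2, a=1.0, b=5.1 / (4 * math.pi**2), c=5 / math.pi,
           r=6.0, s=10.0, t=1 / (8 * math.pi)):
    # Eq. 10 with the standard parameterization; one global
    # minimizer is (pi, 2.275), where f is approximately 0.397887.
    return a * (x2 - b * x1**2 + c * x1 - r)**2 \
        + s * (1 - t) * math.cos(x1) + s
```

Sampling either function along a line (e.g. fixing x2 and sweeping x1) yields the baseline shape knowledge curves shown in Figs. 5 and 6.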

Engineering Problem-Unmanned Vehicle Truss (UVT)


The engineering problem is the design of the truss body of an unmanned vehicle (UVT), which faces multiple working conditions due to high task requirements. As shown in Fig. 7, the truss body structure model of the unmanned vehicle has nearly 100 structural parameters (including global overall parameters, local reinforcements, and beam structures). To simplify the model in the early stage of design, this paper only considers the key design parameters. Since the truss body structure is mainly composed of transverse and longitudinal rib beams, four parameters are used to characterize the overall body parameters (Fig. 7(a)). The truss body structure is mainly welded from three profiles with different side lengths and thicknesses, so the side length and thickness of the rib beam cross-section are used to characterize the local body parameters (Fig. 7(b), (c)).
The knowledge curve obtained in this case revolves around the thickness of the shell and the maximum stress performance. After normalization, the thickness of the shell and the maximum stress both range over [0, 1]. As shown in Fig. 8, the expert draws the knowledge curve of the two variables. The shape of the curve is complex, but the range covered by the curve coordinates is small.

Fig. 7. Unmanned vehicle truss body structure model.

Fig. 8. Shape knowledge of maximum stress performance with thickness of the shell in UVT.

3.2 Experimental Design

Step 1: Knowledge Generation.


In the knowledge acquisition method based on the Bezier curve, the knowledge curve is determined by the coordinates of the control points, and different control-point coordinates correspond to different shapes. Based on the baseline in each case above, the experiment simulates the differences in expert cognition by adding noise to the control points of the baseline. The added noise is Gaussian noise N(0, σ_noise) · base, where base refers to the standard deviation of the coordinates of the control points of the baseline. For each experimental benchmark curve, we take values according to σ_noise = [0.5, 1, 2], repeat 20 rounds each time, and generate 50 knowledge curves in each round. Therefore, each case includes 3 kinds of noise, with 20 groups of experiments for each kind of noise and 50 samples of design knowledge for each group of experiments.
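A minimal sketch of this noise-injection step, with an illustrative baseline control polygon standing in for the Bezier control points of a real baseline curve:

```python
import random
import statistics

def perturb_control_points(points, sigma_noise, rnd):
    """Simulate differences in expert cognition by adding Gaussian
    noise N(0, sigma_noise) * base to every control-point coordinate,
    where base is the standard deviation of the baseline
    control-point coordinates."""
    coords = [c for p in points for c in p]
    base = statistics.pstdev(coords)
    return [(x + rnd.gauss(0.0, sigma_noise) * base,
             y + rnd.gauss(0.0, sigma_noise) * base) for x, y in points]

rnd = random.Random(0)
# Illustrative baseline control polygon (four Bezier control points).
baseline = [(0.0, 0.0), (0.3, 0.8), (0.7, 0.2), (1.0, 1.0)]
# One round of 50 noisy knowledge curves at sigma_noise = 0.5; the full
# experiment runs 3 noise levels x 20 rounds x 50 curves.
curves = [perturb_control_points(baseline, 0.5, rnd) for _ in range(50)]
```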

Step 2: Knowledge Selection.


For each case, each group of 50 pieces of knowledge is measured according to the knowledge measurement method based on the average value of the Hausdorff distance in Eq. 5 in Sect. 2.2. The 3σ principle is used to analyze the measurement indices of the knowledge, and the curves that satisfy the 3σ criterion are selected for the next fusion step. In the control group, all knowledge curves are retained for the next fusion without filtering.
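The 3σ selection rule can be sketched as follows, assuming the measurement index (the mean Hausdorff distance of Eq. 5) has already been computed for each of the 50 curves:

```python
import statistics

def three_sigma_filter(indices):
    """Return the positions of the curves whose measurement index
    lies within mean +/- 3 * std of the index distribution."""
    mu = statistics.mean(indices)
    sigma = statistics.pstdev(indices)
    return [i for i, v in enumerate(indices)
            if abs(v - mu) <= 3 * sigma]
```

A single gross outlier among otherwise consistent indices is dropped, while the consistent curves pass on to the fusion step.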
Step 3: Knowledge Fusion.
According to the fusion method based on Fermat points of finite point sets in Sect. 2.3, the knowledge filtered in Step 2 is fused.
Step 4: Error Checking.
Since the abscissas of the fused knowledge curve and the baseline cannot be unified, we further model the coordinate information of the fused knowledge curve with a Gaussian process. The abscissas of the benchmark curve of each case are used as the input for prediction, and the error test is carried out against the ordinates of the points on the benchmark line. Since the prediction of the Gaussian process includes both a mean and a variance, we use the Root Mean Square Error with a penalty term operator w^i_penalty (Eq. 11) as the final index RMSE_p (Eq. 12):


w^i_penalty = { 1,                                                          if 1.96·σ_i ≥ |μ^i_predict − y^i_baseline|
              { 1 + 1 / (1 + e^(−|μ^i_predict − y^i_baseline| / (1.96·σ_i))),  if 1.96·σ_i < |μ^i_predict − y^i_baseline|   (11)

RMSE_p = √( Σ_{i=1}^{N} ((μ^i_predict − y^i_baseline) · w^i_penalty)² / N )   (12)

where μ^i_predict is the mean predicted by the Gaussian process model at the abscissa of point i on the baseline, σ_i is its standard deviation, and y^i_baseline is the ordinate of point i. The penalty term operator w^i_penalty indicates that the error is multiplied by an additional factor when the predicted value falls outside the 95% confidence interval of the Gaussian process mean.
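A sketch of the penalized error index: the penalty weight equals 1 when the prediction error lies within the 95% confidence band (1.96σ) of the Gaussian process prediction and grows above 1 otherwise; the function names are illustrative:

```python
import math

def penalty_weight(mu, y, sigma):
    """Penalty operator of Eq. 11: weight 1 inside the 95% band,
    1 + 1/(1 + exp(-|mu - y| / (1.96 * sigma))) outside it."""
    err = abs(mu - y)
    band = 1.96 * sigma
    if band >= err:
        return 1.0
    return 1.0 + 1.0 / (1.0 + math.exp(-err / band))

def rmse_p(mu_pred, y_base, sigmas):
    """Penalized RMSE of Eq. 12 over N baseline points."""
    n = len(mu_pred)
    total = sum(((m - y) * penalty_weight(m, y, s)) ** 2
                for m, y, s in zip(mu_pred, y_base, sigmas))
    return math.sqrt(total / n)
```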

Step 5: Repeating Steps 2–4, we complete 3 noise experiments for each case, with 20 rounds of experiments per noise level, and calculate the average of the 20 groups of prediction errors as the final error of each noise experiment.

4 Result and Discussion

As shown in Tables 2, 3, and 4, the prediction error values of the unmanned vehicle truss body case, the Matyas function, and the Branin function are listed in order. Fused knowledge based on Fermat points of finite point sets shows low error under various noise levels and curve shape complexities, relative to the range covered by the knowledge curve.

Table 2. The mean of error for 20 rounds in case UVT with different noise.

UVT                   N(0,0.5)   N(0,1)    N(0,2)
Hausdorff distance    0.0128     0.0096    0.0216
No filtering          0.0098     0.0146    0.0242
Error reduction (%)   −30.61     34.25     10.74

Table 3. The mean of error for 20 rounds in case Matyas with different noise.

Matyas                N(0,0.5)   N(0,1)    N(0,2)
Hausdorff distance    0.5188     0.7710    1.2970
No filtering          0.7151     1.0186    2.1728
Error reduction (%)   27.45      24.31     40.31

Table 4. The mean of error for 20 rounds in case branin with different noise.

Branin                N(0,0.5)   N(0,1)    N(0,2)
Hausdorff distance    1.942      2.848     5.519
No filtering          2.686      4.104     8.184
Error reduction (%)   27.70      30.60     32.56

The filter index based on the mean of the Hausdorff distance can reduce the error by at least 10%. Although in the UVT case the index performs worse than retaining all knowledge under lower noise, the errors of the two methods are both extremely low. Therefore, if the coordinates of the shape knowledge cover a small range, keeping all knowledge for fusion is still a good option.
As shown in Fig. 9, it can be found that the uncertainty range of the Gaussian process
prediction can effectively cover the baseline under the 95% confidence level, ensuring
the effective information of the baseline in all cases.

Fig. 9. Gaussian process baseline prediction plot, panels (a)–(c).

5 Conclusion

This paper proposes a shape-based design knowledge acquisition and fusion technique for surrogate model construction, addressing the lack of a technical basis for the acquisition, representation, and fusion of design knowledge under the framework of surrogate model construction with fused knowledge. The type of knowledge targeted is mainly shape design knowledge represented by curve shapes. This paper develops an interactive knowledge acquisition tool for experts, which helps experts quickly acquire

knowledge and store it quantitatively. The proposed knowledge measurement index can effectively filter abnormal knowledge, and the knowledge fusion method can significantly reduce the amount of knowledge while preserving its effective information, effectively avoiding problems such as redundancy and difficult model convergence caused by integrating multiple pieces of knowledge into the surrogate models.
However, some problems remain to be solved. First, monotonic knowledge, a special case of shape knowledge that can be represented by straight lines, is usable in the method of this paper, but we have not carried out extensive application verification for it. In addition, the contribution degree and reliability of each expert are not considered in the knowledge measurement process. It is worth considering the introduction of expert cognitive weight coefficients to weight the importance of knowledge before the measurement steps proposed in this paper, which also serves as a direction for further research.

References
1. Mai, H.T., Kang, J., Lee, J.: A machine learning-based surrogate model for optimization of
truss structures with geometrically nonlinear behavior. Finite Elements Anal. Des. 196 (2021)
2. Karen, İ, Kaya, N., Öztürk, F.: Intelligent die design optimization using enhanced differential
evolution and response surface methodology. J. Intell. Manuf. 26(5), 1027–1038 (2013).
https://doi.org/10.1007/s10845-013-0795-1
3. Ögren, J., Gohil, C., Schulte, D.: Surrogate modeling of the CLIC final-focus system using
artificial neural networks. J. Instrument. 16 (2021)
4. Gorissen, D., Couckuyt, I., Demeester, P., Dhaene, T., Crombecq, K.: A surrogate modeling and adaptive sampling toolbox for computer based design. J. Mach. Learn. Res. 11, 2051–2055 (2010)
5. Zhao, X., Gong, Z., Zhang, J., Yao, W., and Chen, X.A surrogate model with data augmentation
and deep transfer learning for temperature field prediction of heat source layout. Struct.
Multidiscip. Optim. 64(4), 2287–2306 (2021)
6. Tian, K., Li, Z., Zhang, J., Huang, L., Wang, B.: Transfer learning based variable-fidelity
surrogate model for shell buckling prediction. Compos. Struct. 273, 114285 (2021)
7. Ma, Y., Wang, J., Xiao, Y., Zhou, L., and Kang, H.: Transfer learning-based surrogate-assisted
design optimization of a five- phase magnet-shaping PMSM. IET Electr. Power Appl. 15
(2021)
8. Liu Y., T.W., Li S.: Meta-data Augmentation Based Search Strategy Through Generative
Adversarial Network for AutoML Model Selection (2021)
9. Li, K., Wang, S., Liu, Y., Song, X.: An integrated surrogate modeling method for fusing noisy
and noise-free data. J. Mech. Des. 144, 1–23 (2021)
10. Zhang, Z., Nana, C., Liu, Y., Xia, B.: Base types selection of product service system based on
apriori algorithm and knowledge-based artificial neural network. IET Collab. Intell. Manuf.
1, 29–38 (2019)
11. Hao, J., Ye, W., Jia, L., Wang, G., Allen, J.: Building surrogate models for engineering
problems by integrating limited simulation data and monotonic engineering knowledge. Adv.
Eng. Inform. 49 (2021)
12. Hao, J., Zhou, M., Wang, G., Jia, L., Yan, Y.: Design optimization by integrating limited
simulation data and shape engineering knowledge with Bayesian optimization (BO-DK4DO).
J. Intell. Manuf. 31(8), 2049–2067 (2020). https://doi.org/10.1007/s10845-020-01551-8

13. Hao, J., Ye, W., Wang, G., Jia, L., Wang, Y.: Evolutionary Neural Network-based Method for
Constructing Surrogate Model with Small Scattered Dataset and Monotonicity Experience
(2018)
14. Aguirre, L.A., Furtado, E.C.: Building dynamical models from data and prior knowledge: the
case of the first period-doubling bifurcation 76, 046219 (2007)
15. Meyer, M.A.A.B., Jane M.: Eliciting and Analyzing Expert Judgment (2001)
16. Keeney, R., Winterfeldt, D.: Eliciting probabilities from experts in complex technical
problems. IEEE Trans. Eng. Manage. 38, 191–201 (1991)
17. Gruber, T.R.: Automated knowledge acquisition for strategic knowledge. In: Marcus, S. (ed.)
Knowledge Acquisition: Selected Research and Commentary: A Special Issue of Machine
Learning on Knowledge Acquisition, pp. 47–90. Springer, Boston (1990)
18. Nue, B., Win, S.: Knowledge acquisition based on repertory grid analysis system. J. Trend
Sci. Res. Dev. 3(6) (2019)
19. do Rosário, C.R., Kipper, L.M., Frozza, R., and Mariani, B.B.: Modeling of tacit knowledge
in industry: Simulations on the variables of industrial processes. Expert Syst. Appl. 42(3),
1613–1625 (2015)
20. Ji, S., Pan, S., Cambria, E., Marttinen, P., Yu, P.S.: A survey on knowledge graphs: represen-
tation, acquisition, and applications. IEEE Trans. Neural Netw. Learn. Syst. 33(2), 494–514
(2022)
21. John Paul, G., Anthony, O.H., Jeremy, E.O.: Nonparametric elicitation for heavy-tailed prior
distributions. Bayesian Anal. 2(4),693–718 (2007)
22. Basili, M., Chateauneuf, A.: Aggregation of experts’ opinions and conditional consensus
opinion by the Steiner point. Int. J. Approx. Reason. 123, 17–25 (2020)
Path Planning and Landing
for Unmanned Aerial Vehicles Using AI

Elena Politi, Antonios Garyfallou, Ilias Panagiotopoulos, Iraklis Varlamis(B) ,


and George Dimitrakopoulos

Department of Informatics and Telematics, Harokopio University of Athens,


Kallithea, Greece
{politie,it21577,ipanagio,varlamis,gdimitra}@hua.gr

Abstract. Latest trends, societal needs and technological advances have


led to an unparalleled expansion in the use of Unmanned Aerial Vehicles
(UAV) for military and civilian applications. Such systems are becoming
increasingly popular in many operations, since they reduce costs, facili-
tate activities and can increase the granularity of surveillance or deliv-
ery. Beyond the Visual Line of Sight (BVLOS) capabilities have recently
become a pivotal aspect for the UAV industry, raising the demand for
extended levels of autonomy in order to increase the efficiency of flight
operations. The present study examines two main aspects of BVLOS
operations, namely trajectory planning and self-landing, and demon-
strates how well-established path planning techniques, such as the A*
and Dijkstra algorithms, can be used to ensure the shortest trajectory
length from point A to point B for a UAV under multiple obstacles and
constraints and the least number of error corrections. Extensive simula-
tion results showcase the effectiveness of the proposed method. It also
provides evidence of the use of computer vision algorithms for detecting
the landing site and assisting the UAV to safely land.

Keywords: Unmanned aerial vehicles · Path planning algorithms · A* and Dijkstra algorithms · Reinforcement learning · Self-landing approach

1 Introduction

Unmanned Aerial Vehicles (UAVs) commonly referred to as drones, have recently


taken center stage in various business processes. Beyond their traditional role in
military applications, their use extends to a wide range of applications. In this
direction, BVLOS capabilities have recently become a pivotal aspect of the drone industry. The extended levels of autonomy, in addition to the increased efficiency of such operations, have given potential to even more applications in the field.
The scope of UAV applications ranges from agriculture and farming, to safety
inspection and emergency response. Although drones usually operate in open
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 343–357, 2023.
https://doi.org/10.1007/978-3-031-18461-1_23
344 E. Politi et al.

environments such as parks and open fields, there are cases where they will have
to fly indoors or in restricted environments, such as within urban or construction environments [13]. Drones' ability to fly high and move above obstacles makes them a great solution for open-air tasks such as patrolling wide and hardly accessible areas (e.g. forests), road traffic monitoring, field spraying, etc. However, there is still room for improvement in tasks that impose a flight height limit, as is the case in more complex environments, such as the shipping of items within a city, or the inspection of a building under construction.
The correct positioning of the UAV, as well as the generation and continuous
update of its trajectory are crucial for its correct navigation, especially in the
case of dynamic environments, where the landscape composition is not known
in advance and obstacles may appear as the planned trajectory is executed. An
integral part of this trajectory refers to the landing of the drone at the last
segment of its trajectory, after the drone arrives in the predefined landing area
[9]. The aforementioned tasks require the ability of the drone to frequently re-
evaluate the situation (position and scene perception) and re-adjust the planned
trajectory, in order to avoid obstacles and land safely.
With the vision to solve the above problems, the present study aims to find
the appropriate trajectory that safely leads a UAV from its initial position to
its final destination. In specific, the present analysis examines the optimization
problem of obtaining the optimal trajectory for a UAV in an environment with
obstacles.
In order to validate our claim and demonstrate how the problem of
autonomous UAV navigation can be solved using a combination of well-
established path planning algorithms for simplicity and deep neural network-
based approaches for scene perception and fine grained navigation, we perform
our experiments on the virtual flight simulation environment of AirSim1. This
virtual verification allows algorithm testing with minimum cost and complete
safety for UAVs, and provides useful feedback for the actual testing of their
navigation in the real environment.
The contributions of this study comprise:

– A comparison of performance of the A* and Dijkstra algorithms for trajectory


planning, in an iterative manner using depth information collected by an on-
board LiDAR.
– An implementation of a self-landing module that detects the exact location
of the landing site using a down-facing on-board camera.
– An experimental validation of the implementations on the AirSim virtual
drone environment simulator.

In Sect. 2 that follows we summarize the main research areas related to the
problems that we study. Section 3 provides details on the simulation platform,
the algorithms we employ for drone navigation and the solution we used for
properly locating a clear landing site. Section 4 demonstrates the results from
the experimental evaluation of our approach, in the direction of comparing the
1 https://microsoft.github.io/AirSim/.
Path Planning and Landing for Unmanned Aerial Vehicles Using AI 345

efficiency of the two algorithms in quickly finding the shortest navigation path, following this dynamic update approach. It also presents two alternative landing scenarios. Finally, Sect. 5 provides a discussion on the results achieved so far and describes our next steps.

2 Related Work
UAVs are revolutionizing data collection and environmental exploration tasks.
They enhance remote monitoring capabilities, increase efficiency, and lower costs, giving potential to numerous applications in various areas, such as aerial photography, infrastructure inspection, search and rescue, commercial delivery, and surveillance for law enforcement [29].
Efficient perception of the UAV surroundings, and safe and fast navigation
are critical in BVLOS operations. With respect to their autonomy, UAVs should
be able to dynamically revise their path planning strategy according to the
environmental constraints [22]. For this purpose, a wide range of technologies is utilized by a UAV to generate informed trajectories for the exploration of unknown environments and scenes [36], which include, as shown in Fig. 1:

Fig. 1. The Different Tasks that Relate to the Autonomous Navigation of UAVs.

– Sensing and sensor fusion: This mainly refers to the combined use of visual
(e.g. simple or stereoscopic cameras) and non-visual (e.g. LiDAR, proximity)
sensors.
– Scene perception: Signal processing and computer vision algorithms that pro-
vide perception of the surrounding environment and obstacle detection.

– Map generation and path planning: They refer to techniques for scene repre-
sentation, with the use of volumetric or 2-D maps, and the continuous update
of the map or path in real time, using the output of the perception module.
– Localization: estimating the position of the UAV itself. This can typically be treated jointly with mapping as a simultaneous localization and mapping (SLAM) problem.
– Path Planning: This involves the generation of a navigation path based on a
given map and the actual position of the UAV, the location of its target and
the detected obstacles.

2.1 UAV Path Planning Solutions


Literature surveys on path planning in various environments, from underwater
[21] to indoor [18], outdoor [3] and the air [28], result in a wide range of techniques
including:
– graph-based space search algorithms,
– bio-inspired and genetic algorithms,
– simulated annealing,
– reinforcement learning.
Graph-based shortest path finding algorithms, such as the Dijkstra’s algo-
rithm [6] and its variations [5], the A* path search algorithm [11], or the Fast
Marching (FM) [25] Potential Fields [15] and Rapidly exploring Random Tree
[17] methods, have been extensively implemented for solving path planning prob-
lems. Such methods provide low complexity solutions when the state space is
considered finite and therefore, all the alternative paths are completely known
and predictable [16].
Genetic algorithms and other bio-inspired techniques have also been
employed for UAV path planning [33]. The respective path planning approaches
either assume various kinds of obstacles and their avoidance or not [14]. They
avoid constructing complex environment models and search for a near optimal
path based on stochastic approaches, so they provide efficient solutions to NP-
hard problems with many variables and nonlinear objective functions [24]. On the other hand, they still consume considerable computation time and processing resources, so there is room for research on their energy-related optimization [27].
Simulated Annealing (SA) is a meta-heuristic algorithm that is usually com-
bined with Genetic Algorithms to cope with route planning [19] or for obtaining
nearly optimal paths for multiple UAVs in constrained environments [32]. The
main disadvantage of SA methods is their high complexity in finding a global minimum, which can be overkill, especially when they are executed repeatedly.
Finally, Reinforcement learning (RL) allows UAV navigation in highly
dynamic environments. At each step of the path, the UAV re-evaluates its state
and takes a decision for an action. Consequently, the algorithm gives a positive
reinforcement value when the action is correct (e.g. moves closer to the target)
or penalises with a negative value when the action is wrong (e.g. when an obsta-
cle is hit or detected in the taken path). Deep reinforcement learning (DRL)

techniques that rely on training a neural network using an RL strategy are becoming popular in UAV navigation [2]. As with many other deep neural network
algorithms, DRLs suffer from high memory usage, computational complexity,
and sample complexity [12].
The simplicity of graph-based space search approaches makes them the most commonly used methods in autonomous mobile robots and, consequently, a preferable choice for drone navigation. The Dijkstra algorithm provides the shortest distance
between any two nodes in a graph, given that the pair-wise distance between all
nodes is known in advance. In a grid-based navigation, the distances between all
neighboring grids are equal, so the detection of obstacles simply removes some
edges from the neighbor graph. Because of its graph-based nature, it has also been used for drone path planning in GPS-denied indoor environments [20].
A-star (A*) is a heuristic search algorithm that relies on an estimate, a
heuristic h, to drive the graph exploration to the most favourable areas providing
low computational expenses [34]. In A*, a grid-shape graph represents the search
space with the edges labeled indicating the cost of travelling from a vertex to one
of its neighbors. The A* algorithm has been widely implemented in path planning
problems as it exhibits good real-time performance in the path search process.
The A* algorithm was enhanced with geometric rules through an interpolation
algorithm, in order to plan a collision-free, smooth path with the least cost in
[30]. Improvements of the A* algorithm for path planning of UAVs have been also
proposed in [4], [35]. A weighted A* search was used to generate the footstep
plan with energy considerations in sight [10]. The A* algorithm shows good
performance in simple navigating scenarios, however if the map size is too large
or the environment is too complicated, the A* algorithm will seriously affect the
efficiency [34].

2.2 Autonomous UAV Landing Techniques


The landing of a UAV is a more critical and complex task, especially when it has to be done on a moving target [7]. It requires careful estimation of the landing site position, the UAV position, and the distance between them. Among the techniques found in
the related literature [9], [1], camera-based techniques, also known as vision-
based techniques, use the input from one or two cameras (stereoscopic image)
to detect the position and distance from the landing site. Image processing algo-
rithms either compare both images or consecutive images from the same camera
and detect both values. LiDAR is another remote sensing method that is used to
estimate the range between the UAV and the landing target, and is frequently
combined with camera input to provide a safe landing. Low range distance sen-
sors (range finder laser sensors) are also employed in the case of emergency
landing, or landing in a slope [31].
In our setup, we employ the input of a single down-facing camera and a pre-
trained Tiny YOLO model, which has been fine-tuned to detect the landing site
mark. This allows to properly position the UAV above the designated landing
mark and then proceed with a vertical landing movement. Although the UAV is

already equipped with a LiDAR, this is facing forward and thus does not provide
proper information for detecting the landing site and positioning the UAV above
it. In addition, replacing the LiDAR with cameras is also among our objectives,
and we are currently investigating depth detection with the use of camera input.

3 Tools and Methods


3.1 Trajectory Planning Using Scene Perception and the A*
Algorithm
Using a (2-D or 3-D) grid representation of the navigation space is the first step
for finding a navigation path. The start and end position of the route are also
mapped to the respective cells in the grid as shown in Fig. 2a. The next step
is to rotate the drone so that it faces towards the end of the route and collect
information from the LiDAR, as shown in Fig. 2b. Using the LiDAR input, the
obstacle detection module marks all the cells of the grid that seem to be occupied
by an obstacle. The third step is to run the path planning algorithm (e.g. Dijkstra
or A-star) and find the shortest path from the current drone position to the end,
as shown in Fig. 2c. The last step is repetitive: it is executed when the drone arrives at the last cell of its route that the routing algorithm is sure is clear of obstacles. As shown in Fig. 2d, the LiDAR input is then processed again to mark additional cells as occupied and to recompute the shortest path from the current position.

Algorithm 1. A Simple, Repetitive, Navigation Algorithm


Require: start, end, grid = {ci }, ci ∈ {occupied, empty, unknown}
Ensure: position = end
position ← start
∀ci ∈ grid, ci ← unknown
while position ≠ end do
mark ci ∈ grid as occupied using LiDAR
path ← get path(grid, position, end)
for p ∈ path do
if p is empty then
position ← p
else
break
end if
end for
end while

The aforementioned technique, which is summarized in Algorithm 1, works


when the environment is static and obstacles do not move or appear during the
path execution. This means that all cells are initially set to the unknown state
and, when the LiDAR is used, some of them are marked as empty or occupied

permanently. In the opposite case, the repetitive step takes place whenever the
drone detects that the next cell of its route is not empty. The simple environment setup, shown from a top view in Fig. 2, is the one used in our experiments. It consists of a large building block and a second smaller block on top of it. The drone could rise above the two blocks and move from start to end, but we chose to test our navigation algorithm in 2-D and fly around the two blocks.
The method get_path in Algorithm 1 can be either the A-star or Dijkstra's shortest path algorithm.
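As an illustration of the get_path step, a minimal A* over a 4-connected occupancy grid is sketched below (Dijkstra's algorithm is obtained by simply setting the heuristic to zero); the grid encoding, with 1 marking an occupied cell, is an assumption for this sketch:

```python
import heapq

def a_star(grid, start, end):
    """A* over a 4-connected grid; grid[r][c] == 1 marks an occupied
    cell. Returns the list of cells from start to end, or None."""
    rows, cols = len(grid), len(grid[0])

    def h(p):  # Manhattan distance heuristic (admissible on a grid)
        return abs(p[0] - end[0]) + abs(p[1] - end[1])

    open_set = [(h(start), start)]
    came_from = {start: None}
    g_cost = {start: 0}
    while open_set:
        _, cur = heapq.heappop(open_set)
        if cur == end:
            path = []  # walk the parent links back to the start
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g_cost[cur] + 1
                if ng < g_cost.get(nxt, float("inf")):
                    g_cost[nxt] = ng
                    came_from[nxt] = cur
                    heapq.heappush(open_set, (ng + h(nxt), nxt))
    return None  # end unreachable with the currently known obstacles
```

In the repetitive scheme of Algorithm 1, this search is rerun each time the LiDAR marks new cells of the grid as occupied.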

Fig. 2. The Representation of the Space in a 2-D Grid

3.2 Detection of the UAV Landing Site Using Deep Neural


Networks

The last segment of a drone trajectory is the safe landing in a designated landing
area. In order to break down this task into smaller tasks and take advantage of
software solutions, we assume that the drone safely approaches the landing area,

but due to several reasons (e.g. deviations in the drone position caused by wind or other external conditions, or because the landing site has moved) the exact position of the landing site is not determined.
For this purpose, the drone takes advantage of a down facing camera that
covers the landing area from a certain height, and a deep learning model for
image analysis and perception. As shown in Fig. 3, the landing site detection
module finds the exact position of the landing site, navigates the drone right above it, and begins the landing. The same module keeps checking whether the landing site is clear for landing at all times. Otherwise, the drone can change altitude and try to approach the landing site again, or notify the drone operator when there is no clear solution to the situation.

Fig. 3. The Three Stages of the Landing Sequence: i) the Drone Enters the Landing Area (Left), ii) the Drone Approaches the Landing Site (Middle) and iii) Activates the Down Facing Camera in Order to Detect the Exact Position of the Landing Site (Right)

The Tiny-YOLO deep learning neural network, a variation of the original


“You Only Look Once” (YOLO) object detector proposed in [23], has been used
for the detection of the landing site, from the input of the downward facing
camera. Tiny YOLO network has 9 convolutional layers followed by 2 fully con-
nected layers. It uses alternating 1×1 convolutional layers to reduce the feature
space between layers. The last layer has been retrained to distinguish between
the landing site and other objects that appear in the camera stream.
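To illustrate how the detector output can drive the positioning step, the sketch below computes the normalized offset of the detected landing-mark bounding box from the image center; the drone is nudged until this offset is approximately zero, after which the vertical landing movement can start. The bounding-box format (x, y, w, h in pixels) and the tolerance are assumptions, not details from the paper:

```python
def centering_offset(bbox, img_w, img_h):
    """Normalized offset in [-1, 1] of the detected landing-mark
    bounding box (x, y, w, h in pixels) from the image center."""
    cx = bbox[0] + bbox[2] / 2.0
    cy = bbox[1] + bbox[3] / 2.0
    return (2.0 * cx / img_w - 1.0, 2.0 * cy / img_h - 1.0)

def is_centered(bbox, img_w, img_h, tol=0.05):
    """True when the drone is (approximately) right above the mark,
    so the vertical landing movement can begin."""
    dx, dy = centering_offset(bbox, img_w, img_h)
    return abs(dx) <= tol and abs(dy) <= tol
```

In a closed loop, the offset would be translated into small lateral corrections (for example via AirSim's moveByVelocityAsync call) until is_centered holds.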

4 Experimental Evaluation of the Proposed Approach


In this section, we present two implemented scenarios: one for path planning
and one for landing. In the path planning scenario of the UAV, we examine
navigation only in the 2-D environment with static obstacles. As explained in
Sect. 3, the main objective of a path following algorithm is to guide the vehicle
from a starting point to a final destination area, whilst detecting and avoiding
all obstacle areas. Consequently, the performance of various algorithms is tested
with respect to the following optimization criteria: path length and execution
time. For this reason, we evaluate the two navigation algorithms (i.e. A-star and
Dijkstra’s shortest path) in terms of the total path length and the total time
needed to execute the path from the starting point to the landing area. We do
not consider the time or effort made to find the exact position of the landing
site and perform landing, which is further examined in Sect. 4.3.
Path Planning and Landing for Unmanned Aerial Vehicles Using AI 351

4.1 The AirSim Flight Simulation Platform

AirSim is an open-source, cross-platform simulator for drones, cars, boats, etc.,
which allows a connection between simulation and reality when autonomous vehi-
cles are examined. The platform is built as an Unreal Engine plugin and supports
navigation with popular autopilots, such as the PX4 open source autopilot2,
which is also used in real drones. AirSim offers high-fidelity physical and visual
simulation, which makes it possible to generate realistic scenarios and conditions. It is thus
possible to generate large quantities of training data without cost and risk, in
order to better train and evaluate the performance of various artificial intelli-
gence and machine learning techniques for the different tasks related to UAV
autonomy [26].
The configuration of AirSim is fairly simple, and is performed by describing,
in a JSON file (as shown in Fig. 4), the different types of sensors that are on-
boarded to the vehicle, and their characteristics. The same JSON file can be used
to choose simulation environments of various complexity and load them to the
Unreal engine. Such environments range from simple block-based setups to more
complex environments comprising whole cities, like the CiThruS environment
[8]. Finally, AirSim exposes APIs that allow external code to interact with the
vehicle within the simulation environment, to collect data from its sensors as
well as the state of the vehicle and its environment.

Fig. 4. LiDAR Settings written in a JSON File

2 https://px4.io/.
352 E. Politi et al.

The code in this work is implemented in the Python language and employs
the respective AirSim library to connect with the simulator in real time. With
regards to the sensors, a single LiDAR sensor is placed at the center of the drone
and is used to identify any obstacles located in front of it at every moment. The
sensor has a horizontal FOV which stretches from −90° (left) to 90°
(right) and a vertical FOV from 0° to −15° (upwards). Also, the sensor
has 16 emitters/receivers and each of them emits 10,000 laser pulses per second.
The input from the LiDAR is used by the trajectory planning algorithms as
explained in Sect. 3.1.
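The LiDAR configuration described above can be sketched in AirSim's settings.json format (an illustrative fragment, not the file used in this work; the keys follow AirSim's documented settings schema, and PointsPerSecond is assumed to be the per-sensor total, i.e. 16 channels × 10,000 pulses):

```json
{
  "SettingsVersion": 1.2,
  "SimMode": "Multirotor",
  "Vehicles": {
    "Drone1": {
      "VehicleType": "SimpleFlight",
      "Sensors": {
        "LidarSensor1": {
          "SensorType": 6,
          "Enabled": true,
          "NumberOfChannels": 16,
          "PointsPerSecond": 160000,
          "HorizontalFOVStart": -90,
          "HorizontalFOVEnd": 90,
          "VerticalFOVUpper": 0,
          "VerticalFOVLower": -15
        }
      }
    }
  }
}
```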
In addition to the LiDAR sensor, the drone is equipped with a down-facing
camera, which is activated once the drone approaches the designated landing
area. The input from this camera is used by the computer vision module that we
developed for locating the position of the landing site, as explained in Sect. 3.2.

4.2 Navigation Scenario Setup

For the navigation scenario, we split the environment using a square grid with a
size of 15 × 15 cells. The rectangular environment as shown in Fig. 5 has a length
of 87 m on each side. There are four obstacles located in the environment with
known position and dimension. Two large cubic blocks were placed one on top of
the other in the middle of the environment, thus allowing the drone to navigate
around the large block. Another obstacle has the shape of a cone and it is located
at the far left of the environment. The last obstacle is a sphere positioned at
the far right of the environment. At the beginning of each episode, the drone
(quadrotor) takes off from a starting point in the environment, denoted with a
capital letter and has to navigate through the obstacles to another designated
point as depicted in Fig. 5.

Navigation with A-Star or Dijkstra's Shortest Path. In this scenario, we
investigated the performance of the A-star algorithm with respect to trajectory
acquisition, path length and total travelling time, in the aforementioned envi-
ronment with static obstacles. The starting point of the vehicle is set to be at
point A. To explore different angles of approach to the target area, we repeat
the test for two directions, start to target and target to start. To achieve this,
two different goal points are set at points E or E’. We add two more scenarios
with the start being at point A and the target at point T or T’. Moreover, in
order to remove any randomness in the results, we repeat each of the four nav-
igation scenarios five times and report the mean value for the execution time
and path length. This will give us a better understanding of the behaviour of
the algorithm when run on this map.
We repeat the methodology described above, this time using the Dijkstra’s
algorithm and compare results. The corresponding numerical results of our sim-
ulations are presented in Table 1.

Fig. 5. Simulation Setup

Table 1. Performance of the Two Navigation Algorithms.

Route        A-star                              Dijkstra's
         Path length (m)   Execution time (s)   Path length (m)   Execution time (s)
A-T          168.2             111.5                168.2             114.6
T-A          168.2              69                  168.2              75.2
A-T'         110.2              71.5                110.2              72.9
T'-A         110.2              49.6                110.2              47
B-E          139.2              94.8                139.2              99.1
E-B          139.2              59.5                139.2              64.1
B-E'         110.2              72.8                110.2              73.9
E'-B         110.2              49.1                110.2              49.1
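The behaviour behind Table 1 can be reproduced in miniature with an illustrative grid implementation (not the exact code used in the experiments): Dijkstra's algorithm is the special case of A-star with a zero heuristic, which is why the two return paths of equal length while A-star typically expands fewer cells.

```python
import heapq

def grid_search(grid, start, goal, heuristic):
    """A-star on a 4-connected grid; pass heuristic=lambda *_: 0 for Dijkstra."""
    rows, cols = len(grid), len(grid[0])
    frontier = [(heuristic(start, goal), 0, start, None)]
    came_from, cost = {}, {start: 0}
    expanded = 0
    while frontier:
        _, g, cell, parent = heapq.heappop(frontier)
        if cell in came_from:        # skip stale queue entries
            continue
        came_from[cell] = parent
        expanded += 1
        if cell == goal:             # reconstruct the path backwards
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1], expanded
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < cost.get((nr, nc), float("inf")):
                    cost[(nr, nc)] = ng
                    heapq.heappush(
                        frontier,
                        (ng + heuristic((nr, nc), goal), ng, (nr, nc), cell))
    return None, expanded

manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])

grid = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0]]   # 1 marks an obstacle cell
astar_path, astar_exp = grid_search(grid, (0, 0), (3, 3), manhattan)
dijk_path, dijk_exp = grid_search(grid, (0, 0), (3, 3), lambda *_: 0)
```

On this small map both planners return a path of the same optimal length; the `expanded` counter is what separates them on larger maps, mirroring the execution-time gap in Table 1.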

4.3 Landing Using the Down-Facing Camera

The second experiment aims to test the ability of the landing site detection
module once the drone approaches the landing area. In this case we test two
different scenarios. The first scenario relates to a landing spot that randomly
spawns each time at a slightly different spot within the greater landing area.
Using the YOLO object detection model, it was easy to detect the landing spot
when it was not covered by other objects, or even when it was partially covered.
The drone managed to fly above the landing spot and start landing. In order
to validate the ability of Tiny-YOLO to efficiently detect the landing spot, we
performed more than twenty experiments in which the landing spot was randomly
spawned at different positions within the predefined landing cell. In the
experiments we tried to distort the landing spot (shaped as shown in Fig. 3), to
add some synthetic noise, or to partially cut it, in order to increase the difficulty
of its detection. However, the landing spot was correctly detected in all
experiments.
The second scenario examined a way to land on the detected spot even if it
was partially covered by a static obstacle (e.g. by a shed at a certain height).
In this case, the drone has to gradually change its height until it manages to
find a clear way to land in the spot. For this purpose it repeatedly moves away
from the landing spot, lowers in height in order to get below the obstacle, and
tries to move again above the detected landing spot. The LiDAR is employed in
every loop to detect whether the position right above the landing spot is empty
or not, and to retrieve the obstacle height if possible. The process repeats until
the drone manages to position itself above the landing site and get a clear view of it.
Depending on the height of the obstacle that covers the spot and the step by which
the drone lowers its height, the duration of this process may vary.
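The retry loop just described can be summarised with a toy simulation (the numbers and the stored obstacle height are hypothetical; the real module queries the LiDAR at each iteration instead):

```python
def descend_until_clear(start_altitude, obstacle_height, step):
    """Lower the approach altitude step by step until the drone can pass
    below the obstacle and regain a clear view of the landing spot."""
    altitude, attempts = start_altitude, 0
    while altitude >= obstacle_height:   # LiDAR still reports the spot covered
        altitude -= step                 # move away, descend, and try again
        attempts += 1
    return altitude, attempts

# A shed at 6 m, an approach starting at 10 m, descending 1.5 m per retry.
final_alt, retries = descend_until_clear(10.0, 6.0, 1.5)
```

The number of retries, and hence the duration of the manoeuvre, grows with the obstacle height and shrinks with the descent step, as noted above.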

5 Conclusions

In this work, we examined the problem of UAV path planning, route execution
and landing, with the use of camera and LiDAR sensor input. We performed an
overview of the various techniques that exist in the literature for the navigation
of autonomous vehicles in various environments and highlighted the pros and
cons of each group of techniques. We selected two popular search techniques
that model the navigation space using a grid and represent the search space as
a graph. The two techniques, namely A-star and Dijkstra’s shortest path, have
been embedded in a path planning strategy that updates the UAV path when
the vehicle cannot move to a new cell, or when a cell in the path is detected
to be occupied by an obstacle.
The experimental results showed that the proposed techniques can find an opti-
mal path for the vehicle, for a given granularity of the grid space, which is the
same for both path algorithms. The A-star algorithm uses a heuristic function,
which gives priority to cells that are supposed to compose a shorter path than
others, while Dijkstra's simply explores all possible paths. For this reason, A-star's
best-first strategy performs faster than Dijkstra's.
The simplicity and efficiency of the developed solution has been demonstrated
experimentally in the simulated environment of AirSim. Successful execution of
the path planning and navigation within AirSim, with the use of its sensors
and its autopilot, guarantees that minimum effort will be needed to port the
algorithms to a real UAV case. Our work also performed an initial study on
the task of UAV landing, employed a computer vision approach for locating
the exact landing spot and navigating the drone to it, and examined various
scenarios with the use of the Tiny-YOLO object detector module, which can
easily run on edge devices with minimal resource requirements.

5.1 Future Work


We acknowledge that this work must be further tested in more complex scenarios,
with more complex environment setups that will challenge the algorithms in 3-D
path planning. This also raises the need for a more detailed experimentation
that will also consider the energy efficiency of the proposed solutions and will
put them into practice on a real drone.
A number of exciting work areas open up based on the results of this work.
Our current steps comprise the training of a reinforcement learning model for
finding the optimal navigation path. The model implements the Deep
Q-Learning policy and is currently at the stage of hyper-parameter tuning. As
future work, the authors plan to enhance the scenarios with various RL and ML
algorithms. The next steps comprise the evaluation of the reinforcement learning
technique and of its efficiency and simplicity against the methods proposed in this
work. The same model can be employed for the landing part of the trajectory,
using a different reward mechanism.
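For reference, the tabular Q-learning update that the planned Deep Q-Learning model generalises (by replacing the table with a neural network) can be sketched as follows; the states, actions and rewards here are hypothetical:

```python
def q_update(Q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[next_state].values()) if next_state in Q else 0.0
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

# Two grid cells with actions "forward"/"land"; landing in cell B yields reward 1.
Q = {"A": {"forward": 0.0, "land": 0.0}, "B": {"forward": 0.0, "land": 0.0}}
q_update(Q, "B", "land", 1.0, "terminal")   # terminal state: no future value
q_update(Q, "A", "forward", 0.0, "B")       # value propagates back from B
```

A different reward mechanism (e.g. rewarding a clear view of the landing spot) would reuse exactly this update for the landing part of the trajectory.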

Acknowledgments. This work is a part of the ADACORSA project, which has received
funding from the ECSEL Joint Undertaking (JU) under grant agreement No 876019.
The JU receives support from the European Union's Horizon 2020 research and inno-
vation program and the national authorities of Germany, the Netherlands, Austria,
Sweden, Portugal, Italy, Finland, and Turkey.

References
1. Alam, M.S., Oluoch, J.: A survey of safe landing zone detection tech-
niques for autonomous unmanned aerial vehicles (UAVs). Expert Syst. Appl. 179,
115091 (2021)
2. Azar, A.T., et al.: Drone deep reinforcement learning: a review. Electronics 10(9),
999 (2021)
3. Cabreira, T.M., Brisolara, L.B., Ferreira, P.R.: Survey on coverage path planning
with unmanned aerial vehicles. Drones 3(1), 4 (2019)
4. Cai, Y., Xi, Q., Xing, X., Gui, H., Liu, Q.: Path planning for UAV tracking tar-
get based on improved a-star algorithm. In: 2019 1st International Conference on
Industrial Artificial Intelligence (IAI), pp. 1–6 (2019)
5. Deng, Y., Chen, Y., Zhang, Y., Mahadevan, S.: Fuzzy Dijkstra algorithm for short-
est path problem under uncertain environment. Appl. Soft Comput. 12(3), 1231–
1237 (2012)
6. Dijkstra, E.W., et al.: A note on two problems in connexion with graphs. Numer.
Math. 1(1), 269–271 (1959)
7. Feng, Y., Zhang, C., Baek, S., Rawashdeh, S., Mohammadi, A.: Autonomous land-
ing of a UAV on a moving platform using model predictive control. Drones 2(4),
34 (2018)
8. Galazka, E., Niemirepo, T. T., Vanne, J.: CiThruS2: Open-source photorealistic
3D framework for driving and traffic simulation in real time. In: 2021 IEEE Inter-
national Intelligent Transportation Systems Conference (ITSC), pp. 3284–3291.
IEEE (2021)
9. Gautam, A., Sujit, P. B., Saripalli S.: A survey of autonomous landing tech-
niques for UAVs. In: 2014 International Conference on Unmanned Aircraft Systems
(ICUAS), pp. 1210–1218. IEEE (2014)

10. Gupta, G., Dutta, A.: Trajectory generation and step planning of a 12 DoF biped
robot on uneven surface. Robotica 36(7), 945–970 (2018)
11. Hart, P.E., Nilsson, N.J., Raphael, B.: A formal basis for the heuristic determi-
nation of minimum cost paths. IEEE Trans. Syst. Sci. Cybernet. 4(2), 100–107
(1968)
12. Hodge, V.J., Hawkins, R., Alexander, R.: Deep reinforcement learning for drone
navigation using sensor data. Neural Comput. Appl. 33(6), 2015–2033 (2020).
https://doi.org/10.1007/s00521-020-05097-x
13. Kawabata, S., Lee, J. H., Okamoto, S.: Obstacle avoidance navigation using hor-
izontal movement for a drone flying in indoor environment. In: 2019 Interna-
tional Conference on Control, Artificial Intelligence, Robotics & Optimization
(ICCAIRO), pp. 1–6. IEEE (2019)
14. Raza Khan, M.T., Saad, M.M., Ru, Y., Seo, J., Kim, D.: Aspects of unmanned
aerial vehicles path planning: overview and applications. Int. J. Commun Syst
34(10), e4827 (2021)
15. Khatib, O.: Real-time obstacle avoidance for manipulators and mobile robots. In:
Autonomous Robot Vehicles, pp. 396–404. Springer, New York (1986). https://doi.org/10.1007/978-1-4613-8997-2_29
16. LaValle, S. M.: Planning Algorithms. Cambridge University Press, Cambridge
(2006)
17. LaValle, S. M., Kuffner, J. J., Donald, B. R., et al.: Rapidly-exploring random
trees: progress and prospects. Algorithmic and Computational Robotics, vol. 5,
pp. 293–308 (2001)
18. Li, F., Zlatanova, S., Koopman, M., Bai, X., Diakité, A.: Universal path planning
for an indoor drone. Autom. Constr. 95, 275–283 (2018)
19. Meng, H., Xin, G.: UAV route planning based on the genetic simulated annealing
algorithm. In: 2010 IEEE International Conference on Mechatronics and Automa-
tion, pp. 788–793. IEEE (2010)
20. Mirzaeinia, A., Shahmoradi, J., Roghanchi, P., Hassanalian, M.: Autonomous rout-
ing and power management of drones in GPS-denied environments through Dijkstra
algorithm. In: AIAA Propulsion and Energy 2019 Forum, p. 4462 (2019)
21. Panda, M., Das, B., Subudhi, B., Pati, B.B.: A comprehensive review of path
planning algorithms for autonomous underwater vehicles. Int. J. Autom. Comput.
17(3), 321–352 (2020)
22. Politi, E., Panagiotopoulos, I., Varlamis, I., Dimitrakopoulos, G.: A survey of UAS
technologies to enable beyond visual line of sight (BVLOS) operations. In VEHITS,
pp. 505–512 (2021)
23. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified,
real-time object detection. In: Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 779–788 (2016)
24. Roberge, V., Tarbouchi, M., Labonté, G.: Comparison of parallel genetic algorithm
and particle swarm optimization for real-time UAV path planning. IEEE Trans.
Industr. Inform. 9(1), 132–141 (2012)
25. Sethian, J.A.: A fast marching level set method for monotonically advancing fronts.
Proc. National Acad. Sci. 93(4), 1591–1595 (1996)
26. Shah, S., Dey, D., Lovett, C., Kapoor, A.: AirSim: high-fidelity visual and physical
simulation for autonomous vehicles. In: Hutter, M., Siegwart, R. (eds.) Field and
Service Robotics. SPAR, vol. 5, pp. 621–635. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-67361-5_40

27. Shivgan, R., Dong, Z.: Energy-efficient drone coverage path planning using genetic
algorithm. In: 2020 IEEE 21st International Conference on High Performance
Switching and Routing (HPSR), pp. 1–6. IEEE (2020)
28. Souissi, O., Benatitallah, R., Duvivier, D., Artiba, A., Belanger, N., Feyzeau, P.:
Path planning: a 2013 survey. In: Proceedings of 2013 International Conference on
Industrial Engineering and Systems Management (IESM), pp. 1–8. IEEE (2013)
29. Tan, L.K.L., Lim, B.C., Park, G., Low, K.H., Yeo, V.C.S.: Public acceptance of
drone applications in a highly urbanized environment. Technol. Soc. 64, 101462
(2021)
30. Tang, G., Tang, C., Claramunt, C., Xiong, H., Zhou, P.: Geometric a-star algo-
rithm: an improved a-star algorithm for AGV path planning in a port environment.
IEEE Access 9, 59196–59210 (2021)
31. Tsintotas, K.A., Bampis, L., Taitzoglou, A., Kansizoglou, I., Gasteratos, A.:
Safe UAV landing: a low-complexity pipeline for surface conditions recognition.
In: 2021 IEEE International Conference on Imaging Systems and Techniques (IST),
pp. 1–6. IEEE (2021)
32. Turker, T., Sahingoz, O. K., Yilmaz, G.: 2D path planning for UAVs in radar
threatening environment using simulated annealing algorithm. In: 2015 Interna-
tional Conference on Unmanned Aircraft Systems (ICUAS), pp. 56–61. IEEE
(2015)
33. Yang, Q., Yoo, S.-J.: Optimal UAV path planning: sensing data acquisition over
IoT sensor networks using multi-objective bio-inspired algorithms. IEEE Access 6,
13671–13684 (2018)
34. Zhang, Z., Zhao, Z.: A multiple mobile robots path planning algorithm based on
a-star and dijkstra algorithm. Int. J. Smart Home 8(3), 75–86 (2014)
35. Zhang, Z., Tang, C., Li, Y.: Penetration path planning of stealthy UAV based on
improved sparse a-star algorithm. In: 2020 IEEE 3rd International Conference on
Electronic Information and Communication Technology (ICEICT), pp. 388–392
(2020)
36. Zhou, X., Yi, Z., Liu, Y., Huang, K., Huang, H.: Survey on path and view planning
for UAVs. Virtual Reality Intell. Hardware 2(1), 56–69 (2020)
Digital Ticketing System for Public Transport
in Mexico to Avoid Cases of Contagion Using
Artificial Intelligence

Jose Sergio Magdaleno-Palencia(B), Bogart Yail Marquez, Ángeles Quezada,
and J. Jose R. Orozco-Garibay

Maestría en Tecnologías de Información, Tecnológico Nacional de México campus Tijuana,
Av Castillo de Chapultepec 562, Tomas Aquino, 22414 Tijuana, B.C., Mexico
{jmagdaleno,bogart,angeles.quezada}@tectijuana.edu.mx

Abstract. In this project, a proposal is made to contribute to the reduction of
COVID-19 infections by investigating how to implement an electronic and/or digital
fare collection system for public transport in Tijuana, Mexico, in addition to providing
a more versatile intelligent system for users. This proposal also seeks to reduce
soil contamination by using fewer physical inputs. Due to the development of the
pandemic in recent years, thousands of people have been infected by the COVID-19
virus (SARS-CoV-2), and its rapid spread has caused many deaths, so alternatives
have been sought to reduce infections. The objective is to reduce the use of physical
tickets to avoid contagion through touch; the transport routes of the locality have
been contemplated and will be analyzed through artificial intelligence, in order to
implement the system on all public transport routes. A digital ticketing system can
also be implemented.

Keywords: Artificial intelligence · Q-learning · Digital ticketing system

1 Introduction
At the beginning of the pandemic, several prevention strategies were implemented to
avoid contagion of the virus, such as the “Stay at home” initiative, the use of face masks,
constant hand washing, and the disinfection of commonly used items, in addition to
areas prone to the virus. To reduce contagion, a new way of continuing to work had to
be found so that the economy would not suffer many consequences, which led to remote
work and the temporary or permanent closure of schools [1].
In recent months, essential places have been reopened as the epidemiological
traffic light goes down; this leads to a flow of people on public transport, which is where
a problem lies: the low or non-existent hygiene measures for the prevention of contagion.
Even with the vaccination campaigns, the contingency situation is still in force, since the virus
has not been 100% eradicated. This causes insecurity in a certain part of the population,
as they must go out and use public transport daily [2]. Digital transportation collection
systems have been implemented around the world, proving to be an efficient and convenient
system for users. With the use of a digital system for public transport, the contact between
driver and passenger could be reduced.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 358–367, 2023.
https://doi.org/10.1007/978-3-031-18461-1_24

1.1 Objectives and Justification

The objective is to develop a web system of digital tickets for public transportation in
the Tijuana area, which will be an alternative to physical tickets, in order to reduce the
number of infections due to the health contingency presented in recent months.
The specific goals are: the creation of digital tickets through the web system, seeking
the reduction of physical tickets; serving as an alternative solution for the reduction of
infections through physical contact between driver and passenger; reducing the use of
paper, contributing to the reduction of soil contamination; and using the web system
through electronic tablets. The reason that led us to develop this project is the need to
have a digital ticket system for public transport that operates in an agile way and without
the need for interaction between users and drivers.
The main reason to carry it out currently is the contingency due to the pandemic,
which is still active, and the increased risk that it entails for the staff and users of these
services. Another reason is to seek to reduce soil contamination, since many times
these tickets are discarded by the passenger, thrown out of the windows and even
inside the transport itself. Through this system of electronic tickets, it is intended to avoid
the manipulation of physical tickets in public transport, since this can lead to contagion
by contact with tickets, as with cash (which is left as something optional), in addition to
avoiding physical contact by reducing the interaction of the passenger with the driver.

2 Design of the Theoretical Framework

Ticket inspection is the supervision of the validity of the tickets used in transport;
normally these are contactless cards, but the control of other types of tickets can be
implemented, for example payment through NFC, SMS tickets, 2D codes on mobile phone
screens, and tickets with barcodes or QR codes, among others. The system is designed in
such a way as to make the work of the inspectors easier and more efficient and to eliminate
attempts to avoid payment of the corresponding trip. Regarding electronic tickets in public
transport, with the passage of time and due to the different needs that have arisen,
collection systems have evolved, allowing effort, time, and margin of error to be reduced.
Reliability and efficiency have been among the most important characteristics in the
design of this type of system, which is why the intervention of digital devices facilitates the task.
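As an illustration of how such a digital ticket can be made verifiable by an inspector's device, the QR payload can carry an HMAC signature computed with a key held by the operator. This is a sketch rather than the system proposed in this paper; all names, fields and the key are hypothetical:

```python
import hashlib
import hmac

SECRET_KEY = b"operator-secret"  # hypothetical key held by the transport operator

def issue_ticket(ticket_id: str, route: str, timestamp: int) -> str:
    """Build the string that would be encoded into the ticket's QR code."""
    payload = f"{ticket_id}|{route}|{timestamp}"
    signature = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{signature}"

def verify_ticket(qr_payload: str) -> bool:
    """Inspector-side check: recompute the signature and compare in constant time."""
    payload, _, signature = qr_payload.rpartition("|")
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected)

ticket = issue_ticket("T-0001", "ruta-664", 1700000000)
```

A scheme of this kind lets the inspector validate tickets offline, which matches the goal of making the inspectors' work easier while eliminating attempts to avoid payment.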
Various technologies and service models have been developed to provide mobile
payment service in different contexts, including e-wallets, operator billing (SMS) and
contactless payments, for example with NFC (Near Field Communication) technology
[3, 4].
In the context of public transport, different technologies have been implemented and
adapted to create new service models that respond to the challenges of this sector [5, 6].
Technology is also closely related to the structure of the service; some of the options
360 J. S. Magdaleno-Palencia et al.

that can be found are pay per use, fixed tickets (from one specific station to another) and
subscriptions.
Self-ticketing: It is one of the most popular technologies, given the ease of imple-
mentation. By using an app, the user can buy tickets from a specific origin to a specific
final station, which means that the travel route is fixed. The result of the transaction is
a QR code, or barcode, that can be viewed on the phone; there are two options for the
execution of this service model. Both in Europe and in other countries around the world,
different digital collection and ticketing systems for public transport have been created
and implemented [7]. When stations do not have gates, or when boarding vehicles, the
ticket must be activated prior to boarding. At that moment, the ticket begins to show
an animated background, which can be represented by a QR code, which the driver can
easily check during boarding through the device that the user is using (their cell phone,
for example). On most services there is an extra check done by an inspector on board
[8].
Closed stations: In stations with gates, the challenge is to open the doors with the
mobile device; for that purpose the doors are usually equipped with QR scanners.
NFC (Near Field Communication): It is a wireless data transfer method that detects
and then enables nearby technology to communicate without the need for an internet
connection. It is easy, fast and works automatically [9]. It means that two devices can
transfer data to each other without being connected to Wi-Fi, or using a pairing code
as in Bluetooth. Due to the encryption protocol, the chips embedded in most high-tech
smartphones are secure enough to be used for payments like a contactless card.
NFC technology is today one of the most used technologies in mobile phones in
general. Mainly used for payments in physical stores and other services. However, the
implementation of this technology has great challenges that directly affect the opportu-
nities in public transport. According to RFID Journal [10, 11] the biggest challenge is
related to the slow adoption process due to lack of infrastructure, complex ecosystem of
stakeholders and standards. Several NFC service models have been developed to adapt
the technology to the specifications and needs of public transport. Some of the NFC
applications are NFC + Sim Card [12]. Most gates today are not equipped with NFC
readers, as they were used in the field before NFC emerged as a standard. In addition,
there is a wide variety of encryption systems according to the different mobile models.
For this reason, this solution proposes a SIM card, plus an NFC chip, which emulates
the protocol of the chip that is integrated in the doors and posts.
Then, it is possible to open the doors and track the user's transactions as with a transport
card, as well as to recharge with an app. This model is currently used in Hong Kong.
NFC to scan the card: In this case, given the limited infrastructure of the readers, NFC
is used to scan the service card (for example, the OV-chipcard or Octopus Card) and to
top up via an app transaction.
Mobile wallet: This is by far the most widespread use of NFC for mobile
payment, due to the launch of Apple Pay and Android Pay in most countries. By using
an application that stores credit and debit card information, a smartphone can be used
as a means of payment instead of the physical card.
There are other technologies which are not widely used. Hop-on: This technology
was developed by an Israeli start-up as an alternative way of paying for public transport

with a mobile phone. The technology sends information through ultrasonic sound waves
transmitted from the mobile to the reader. It is said to be safe, low cost and fast.
Projects such as Route 664 have served as examples; it is a project that emerged as a
school idea within the Communication program of the Faculty of Humanities and Social
Sciences at UABC.
The purpose of the platform is to make known to the citizens of Tijuana the routes of
both buses and taxis through an interactive map. Other tools such as videos, photographs
and graphics are also used; the latter is made up of a photograph of the transport, the
schedule, the fare and the place where said route can be taken.
A field investigation was carried out, in which some streets of the city center were visited.
While there are more than 170 routes, the interactive map project covers 70 routes, and
it is the users who feed in the missing information on routes that are needed. The idea
is to build on this type of project. In Fig. 1 you can see the transportation routes of the
city of Tijuana.
Figure 1 shows the current map taken from the project “Route 664: Public
Transportation Platform in Tijuana” and Fig. 2 shows Tijuana's current transport
routes.

Fig. 1. Map taken from the project “Route 664: Public Transportation Platform in
Tijuana” https://www.sandiegored.com/es/noticias/98691/Ruta-664-Plataforma-de-Transporte-Publico-en-Tijuana.

GPS, the global positioning system is a satellite navigation system that allows locat-
ing the position of an object, vehicle, person, or ship around the world with precision
of up to a few centimeters, although it is usual to have a margin of error of meters.
GPS works through a network of 32 satellites, 28 operational and 4 backups, orbiting
20,200 km above the planet, with synchronized trajectories. To determine the position,
the receiver locates and uses at least 3 satellites of the network, which deliver a series of
identification signals and the time of each one [13]. With this, the device synchronizes
the GPS clock and calculates the time it takes for the signals to reach the equipment,
which allows us to know the distances to the satellites through triangulation. Triangulation
consists of determining the distance of each satellite from the measurement point
[14]. The useful data for the GPS receiver, which allows it to determine its position, is
called the ephemeris. Each satellite emits its own ephemeris, which includes the position
in space of each satellite, whether it should be used or not, its atomic time, Doppler
information, etc.
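The distance-and-triangulation computation described above can be illustrated in two dimensions (a simplified sketch: real GPS solves in three dimensions with an extra receiver clock-bias unknown, which is why at least four satellites are used). Each range r comes from a signal travel time t as r = c·t; subtracting the circle equations pairwise leaves a linear system for the position:

```python
import math

def trilaterate_2d(p1, r1, p2, r2, p3, r3):
    """Recover (x, y) from three anchor positions and measured distances.

    Subtracting the circle equations (x-xi)^2 + (y-yi)^2 = ri^2 pairwise
    cancels the quadratic terms and yields a 2x2 linear system.
    """
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    a, b = 2 * (x2 - x1), 2 * (y2 - y1)
    c = r1**2 - r2**2 - x1**2 + x2**2 - y1**2 + y2**2
    d, e = 2 * (x3 - x2), 2 * (y3 - y2)
    f = r2**2 - r3**2 - x2**2 + x3**2 - y2**2 + y3**2
    den = a * e - b * d
    return (c * e - b * f) / den, (a * f - c * d) / den

# Hypothetical 2-D anchors; in GPS the ranges would be c * travel_time.
sats = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
true_pos = (3.0, 4.0)
ranges = [math.dist(s, true_pos) for s in sats]
x, y = trilaterate_2d(sats[0], ranges[0], sats[1], ranges[1],
                      sats[2], ranges[2])
```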
A QR code (Quick Response code) is a method of representing and storing infor-
mation in a two-dimensional dot matrix [15]. This 2D symbology has its origin in 1994
in Japan [16], when the company Denso Wave, a subsidiary of Toyota, developed it to
improve the traceability of the vehicle manufacturing process. It was designed with the
main objective of achieving a simple and fast decoding of the information contained.
They are common in Japan and increasingly widespread worldwide (thanks to their use
to encode Internet URLs and to existing decoding applications for mobile phones with
cameras), and they are characterized by having three squares in the corners, which
facilitate the reading process.
The Secretariat of Infrastructure, Urban Development and Territorial Reorganization
(SIDURT) carries out studies on the Tijuana-Tecate railway, as well as the works and
replacement of tracks; after these studies, the Tijuana-Tecate interurban train will be
able to start operating in 2024, mobilizing an average of 40 thousand people, reordering
transportation routes and thus reducing traffic in Tijuana.
Figure 2 shows the current public transportation network in Tijuana with 114 routes
for mass transportation and 125 routes for enroute taxi transportation.
As can be seen in Fig. 2, there are 34 routes running in parallel in a single corridor.

Fig. 2. Current transport routes https://cadenanoticias.com/regional/2021/09/trafico-podria-disminuir-en-tijuana-hasta-2024-con-tren-interurbano

Use of electronic tickets in public transport in different parts of the world.
An example in Europe is the Dutch system called OV chipcard. This system consists
of exchanging physical tickets for an electronic system that reads cards that the passenger
uses to pay for their ticket.

There are other systems in Europe, such as the London Oyster card. This RFID-based
card is used on buses, the subway, and suburban trains, and it took a few years to be
adopted by millions of users. In Asia, South Korea introduced a smart card system for
public transport called T-money, which is used throughout the country. The card pays
for public transport services on buses, taxis, and trams, and is also accepted for other
purchases, such as at gas stations and vending machines. In Hong Kong, the Octopus card
serves as a means of electronic payment for public transport. Like the T-money card, the
Octopus card can be used to pay for public transport services, as well as for payments in
supermarkets, restaurants, and other businesses. In Germany, a service called “Touch and
Travel” was launched, which used GPS to track the user’s journey; users were required
to check in and out on the app when getting on and off. The service started in 2008 in
Berlin and was later available throughout Germany, but it was discontinued at the end
of 2016 [17].
Artificial intelligence (AI) is increasingly present in our lives. It can be defined as a
combination of algorithms that try to simulate certain human actions or, better yet, go
beyond human intelligence. It is open to many fields and can provide solutions, as in the
case of customer service chatbots, which, according to the consulting firm Gartner,
would be implemented by 2020 in 84% of the companies consulted, increasing investment
in this type of technology [18].
For these and many other reasons, this research focuses on the development of an
artificial intelligence: how such systems work and how to create one.

3 Methodology
3.1 Applications

The sample will be obtained probabilistically, selecting a portion of the population in
areas of the city that correspond to a transport route or a transport stop and whose
residents are users of public transport. For faster data collection and greater accuracy,
stratified random sampling will be used, proportional to the population near the transport
route or main stops that uses transport regularly. In other words, the universe is all
users of public transportation in the city of Tijuana. Because it is not possible to
locate or interview all of them, a sample of the population will be chosen at the most
accessible points.
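The proportional allocation behind stratified sampling can be sketched in Python, the language used later in this project. The strata names and population figures below are hypothetical, chosen only to illustrate the computation.

```python
def allocate_sample(strata_populations, total_sample):
    """Return per-stratum sample sizes proportional to stratum population."""
    total_population = sum(strata_populations.values())
    return {
        stratum: round(total_sample * population / total_population)
        for stratum, population in strata_populations.items()
    }

# Hypothetical user counts near three main stops (for illustration only)
strata = {"Centro": 5000, "Otay": 3000, "La Mesa": 2000}
allocation = allocate_sample(strata, total_sample=60)
print(allocation)  # {'Centro': 30, 'Otay': 18, 'La Mesa': 12}
```

Each stratum receives a share of the 60-person sample proportional to its share of the population near the stops.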
It was determined that this will be an experimental investigation: it builds on existing
means, and an analysis will be made to obtain a result for the problem and to assess the
possible effects of the changes to be made. The tool used will be questionnaires composed
of closed questions, so that respondents can answer briefly and specifically, yielding
the information needed for the study. The survey will be carried out before and after the
changes in order to obtain a point of comparison and to determine whether there was an
improvement after
364 J. S. Magdaleno-Palencia et al.

applying the changes and what percentage of improvement was obtained, based on the
efficiency of all the processes involved in getting around on public transport.

3.2 Fieldwork
As described above, the sample will be obtained probabilistically, with stratified random
sampling proportional to the population near the transport routes and main stops that
uses transport regularly. The survey technique will be used: with this collection
technique, there is direct communication, through questionnaires, with the people
selected for information gathering. The surveys will be administered by persons
previously trained and informed about what is to be done.
This person, whom we will call the “field worker”, must have contact with the population
close to the problem in question; the worker personally asks the questions, conducts
the survey, and records the answers, from which more general results are later obtained.
The initial contact is of the utmost importance, since the population must be convinced
that their participation matters. When posing the questions, the order in which they
appear in the survey must be respected; it is also advisable to read the questions slowly,
for better understanding by the interlocutor, and to repeat them if necessary. When
recording the answers, the respondents’ responses must not be summarized or paraphrased.
Once the survey data are obtained, they will be analyzed and classified for better
understanding and to identify the users’ needs. The surveys will have to be validated
to confirm that the work was done as established and to detect fraud or failures by the
interviewer (field worker), such as: recording answers incorrectly, not following the
order of the questions, or paraphrasing the answers obtained. Afterwards, the answers
will be listed so the work can proceed more quickly. Once the data have been analyzed
and processed, this information will be converted into a graphic presentation.

3.3 Development of the Methodology


Population and sample. At this point the entire population, the citizens of Tijuana who
are users of public transport, could not be considered, since that would require covering
the whole city. A sample of 60 people was chosen, men and women between 16 and 60 years
of age. Each team member covered the areas within their reach; if a member could not
carry out the survey in person, the option of conducting it through Google Forms was
considered.
Analysis techniques. Statistical techniques will be used to assess the quality of the
data, check the hypotheses, and draw conclusions. It is expected that most of the answers
regarding the implementation of a ticketing system for public transport will be positive,
supporting the implementation of that system and thus verifying the hypothesis proposed
in earlier parts of the research.

For this, a tabulation system will be used for the questions, allowing us to generate
graphs and assign a value to each answer.
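A minimal tabulation of closed-question answers can be sketched with the Python standard library; the responses below are hypothetical, used only to show how counts become the percentage values that feed the graphs.

```python
from collections import Counter

# Tally closed-question answers and convert the counts into percentages,
# the values used to build the graphs. The responses are illustrative only.
responses = ["Yes", "Yes", "No", "Yes", "Undecided", "Yes", "No"]

counts = Counter(responses)
percentages = {answer: round(100 * n / len(responses), 1)
               for answer, n in counts.items()}
print(percentages)  # {'Yes': 57.1, 'No': 28.6, 'Undecided': 14.3}
```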
The algorithm used in the Q-Learning method is fairly involved, but it is well suited to
what we need.
Qnew (st , at ) ← (1 − α) ∗ Q(st , at ) + α ∗ (rt + γ ∗ maxa Q(st+1 , a)) (1)

where:
• Q(st , at ): old value
• α: learning rate
• rt : reward
• γ: discount factor
• maxa Q(st+1 , a): estimate of the optimal future value
• rt + γ ∗ maxa Q(st+1 , a): learned value

Before learning begins, Q is initialized to an arbitrary constant value (chosen by us).
Then, each time the agent selects an action, it observes a reward and enters a new state
(depending on the previous state and the selected action), and Q is updated. The core of
the algorithm is a simple value-iteration update: the weighted average of the old value
and the new information.
In short, Q is computed from the other variables, while the reward is generated from the
results of the artificial intelligence training.
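The update rule of Eq. (1) can be sketched in Python, the language used in this project. The tiny 3-state, 2-action table and the chosen values below are purely illustrative.

```python
# One Q-learning update step, following Eq. (1):
# Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a')).

def q_update(Q, state, action, reward, next_state, alpha, gamma):
    old_value = Q[state][action]
    future = max(Q[next_state])          # estimate of optimal future value
    learned = reward + gamma * future    # learned value
    Q[state][action] = (1 - alpha) * old_value + alpha * learned
    return Q[state][action]

Q = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # arbitrary initial values
new_q = q_update(Q, state=0, action=1, reward=10.0, next_state=1,
                 alpha=0.5, gamma=0.9)
print(new_q)  # 5.0  (= 0.5 * 0 + 0.5 * (10 + 0.9 * 0))
```

With all future values still at zero, only half of the immediate reward (α = 0.5) is written into the table, exactly the weighted average described above.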
The programming language used is Python, version 3.8.8. Development environment: this
project is developed on the Ubuntu operating system, version 20.04.
For the artificial intelligence environment, the Python distribution provided by
Anaconda will be used, because developing artificial intelligence is fairly involved
and Anaconda “encapsulates” a Python programming environment that gives us greater
control over everything handled in the language: the version of the language itself as
well as the packages and their versions.

4 Analysis and Results


To train our artificial intelligence, we must set three different variables. Alpha: the
learning rate. It determines the extent to which newly acquired information overwrites
old information. A factor of 0 means that the agent does not learn (relying only on prior
knowledge), while a factor of 1 makes the agent consider only the most recent information
(ignoring prior knowledge to explore possibilities). Gamma: the discount factor. It
determines the importance of future rewards; a factor of 0 makes the agent “myopic” (or
short-sighted), considering only current rewards. Epsilon: the exploration rate, which
governs how often the agent chooses a random action instead of the best known one in the
Q-learning procedure.
Reward: usually represented as rt, it is the reward received when moving from one state
to another, or when finishing one iteration and starting another.

The agent’s goal is to maximize its total reward. It does this by adding the maximum
attainable reward in future states to the reward for reaching its current state,
effectively letting the potential future reward influence the current action. This
potential reward is a weighted sum of the expected rewards of all future steps starting
from the current state.
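The role of epsilon can be illustrated with the standard ε-greedy selection rule, a common choice in Q-learning; this is a sketch, not necessarily the exact rule used in the project, and the Q-values are hypothetical.

```python
import random

def choose_action(q_row, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                   # explore: random action
    return max(range(len(q_row)), key=q_row.__getitem__)   # exploit: best action

q_row = [0.2, 0.8, 0.1]  # illustrative Q-values for one state
print(choose_action(q_row, epsilon=0.0))  # 1 (pure exploitation)
```

With ε = 0 the agent always exploits its current knowledge; with ε = 1 it explores at random, which matches the trade-off described for Alpha and Gamma above.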
Our population is the set of results produced by the artificial intelligence tests each
time it is sent to “train”, and our sample is not a fixed number of these results but
rather selected results, chosen depending on the variables set for the training and on
how long the training runs.
For the analysis, we document the data resulting from the artificial intelligence
training; these data are grouped by training run, and based on the results of each run
we can determine which set of variables is best for training the artificial intelligence
on the given route.

5 Conclusions

Artificial intelligences are complex to develop. The theory that must be understood
before starting to develop an AI is quite abstract, because it involves a great deal of
information and many intricate mathematical calculations; developing the artificial
intelligence itself is even harder, because the theory and calculations must be
implemented in code. But once we understand, to a certain extent, how the learning method
works (in this case, Q-Learning), everything becomes clearer: how it works and why.
Although the development know-how can be expressed in a few ideas, the specific workings
of artificial intelligence learning are not so easy to express, due to their intrinsic
complexity.

References
1. Hamidi, S., Zandiatashbar, A.: Compact development and adherence to stay-at-home order
during the COVID-19 pandemic: a longitudinal investigation in the United States. Landsc.
Urban Plan. 205, 103952 (2021)
2. Hernández Bringas, H.: COVID-19 en México: un perfil sociodemográfico. Notas Poblacion
(2021)
3. Coskun, V., Ozdenizci, B., Ok, K.: A survey on near field communication (NFC) technology.
Wirel. Pers. Commun. 71(3), 2259–2294 (2013)
4. Rahul, A., Rao, S., Raghu, M.E.: Near field communication (NFC) technology: a survey. Int.
J. Cybern. Inform. 4(2), 133 (2015)
5. Pelletier, M.-P., Trépanier, M., Morency, C.: Smart card data use in public transit: a literature
review. Transp. Res. Part C Emerg. Technol. 19(4), 557–568 (2011)
6. Połom, M., Wiśniewski, P.: Implementing electromobility in public transport in Poland
in 1990–2020: a review of experiences and evaluation of the current development
directions. Sustainability 13(7), 4009 (2021)
7. LeRouge, C., Nelson, A., Blanton, J.E.: The impact of role stress fit and self-esteem on the
job attitudes of IT professionals. Inf. Manag. 43(8), 928–938 (2006). https://doi.org/10.1016/
j.im.2006.08.011

8. Finžgar, L., Trebar, M.: Use of NFC and QR code identification in an electronic ticket sys-
tem for public transport. In: SoftCOM 2011, 19th International Conference on Software,
Telecommunications and Computer Networks, pp. 1–6 (2011)
9. Faulkner, C.: Secure commissioning for ZigBee home automation using NFC. Jan 23, 1–3
(2015)
10. Chen, J., Hines, K., Leung, W., Ovaici, N., Sidhu, I.: NFC mobile payments. Cent. Entrep.
Technol. Technical report, vol. 28 (2011)
11. Du, H.: NFC technology: today and tomorrow. Int. J. Futur. Comput. Commun. 2(4), 351
(2013)
12. Aziza, H.: NFC technology in mobile phone next-generation services. In: 2010 Second
International Workshop on Near Field Communication, pp. 21–26 (2010)
13. Bahmani, K., Nezhadshahbodaghi, M., Mosavi, M.R.: Optimisation of doppler search space
to improve acquisition speed of GPS signals. Surv. Rev., 1–17 (2022)
14. Peng, H., et al.: Analysis of precise orbit determination for the HY2D satellite using onboard
GPS/BDS observations. Remote Sens. 14(6), 1390 (2022)
15. Julham, M.L., Lubis, A.R., Al-Khowarizmi, I.K.: Automatic face recording system based on
quick response code using multicam. Int. J. Artif. Intell. 11(1), 327–335 (2022)
16. Pan, J.-S., Liu, T., Yan, B., Yang, H.-M., Chu, S.-C.: Using color QR codes for QR code secret
sharing. Multimedia Tools Appl., 1–19 (2022). https://doi.org/10.1007/s11042-022-12423-z
17. Gerpott, T.J., Meinert, P.: Who signs up for NFC mobile payment services? Mobile network
operator subscribers in Germany. Electron. Commer. Res. Appl. 23, 1–13 (2017)
18. Bosch-Sijtsema, P., Claeson-Jonsson, C., Johansson, M., Roupe, M.: The hype factor of digital
technologies in AEC. Constr. Innov. (2021)
To the Question of the Practical Implementation
of “Digital Immortality” Technologies: New
Approaches to the Creation of AI

Akhat Bakirov1,2 , Ibragim Suleimenov1 , and Yelizaveta Vitulyova2(B)


1 National Engineering Academy of the Republic of Kazakhstan,
Almaty, Republic of Kazakhstan
2 Almaty University of Power Engineering and Telecommunications named after Gumarbek
Daukeyev, Almaty, Republic of Kazakhstan
[email protected]

Abstract. On the basis of the principle of dialectical symmetry, put forward within the
framework of the philosophy of dialectical positivism, it is shown that Jung’s scheme of
personality structure should be refined. It should include an element that makes the
scheme symmetrical: the collective conscious (a term formed by analogy with the
collective unconscious). This approach allows us to distinguish between the concepts of
intellect, mind and consciousness. In particular, the intellect is interpreted as the
structural component of the personality most closely adjacent to the collective
conscious. It is shown that it is this structural component that can already be converted
into digital format at this stage of research, by using methods for decoding the
operating algorithms of convolutional and similar neural networks, which we previously
proposed on the basis of new digital signal processing methods that use non-binary Galois
fields. It is shown that the digital reconstruction of a single component of the
personality, the intelligence, can be considered the first step toward implementing
digital immortality technologies.

Keywords: Artificial intelligence · Digital immortality · Collective unconscious ·
Personality structure · Essence of intelligence · Convolutional neural networks ·
Explainable neural networks

1 Introduction
Currently, a number of technologies have already been implemented that provide an
imitation of digital immortality. Thus, neural networks make it possible to synthesize
video messages from people who have already died; popular show business stars give
concerts and appear in films after death; and Microsoft has patented a technology for
creating an interactive chatbot of a specific person.
There is no doubt that technologies of this kind are far from digital immortality
proper; they are really nothing more than a kind of imitation. But their very appearance
and wide distribution not only once again demonstrates a person’s desire for individual

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 368–377, 2023.
https://doi.org/10.1007/978-3-031-18461-1_25
To the Question of the Practical Implementation 369

immortality, but also shows the direction of the further development of information
technologies in general and artificial intelligence (AI) in particular.
The vector of AI development is obvious: it will increasingly approach human
intelligence. Consequently, all research (philosophical, neurophysiological,
psychological, etc.) that can, to one degree or another, contribute to understanding the
essence of intelligence as such will gain relevance. From the most general considerations
of practical philosophy, it follows that building artificial intelligence systems that
approach human intelligence is inseparable from the problem of digital immortality. If
we, mankind, understand what intelligence is, at the level of a consistent philosophical
interpretation and a correct mathematical description, we will understand how to transfer
it to a computer or other non-biological carrier.
As noted in [1], published at the very beginning of this century, estimates of the
information performance of computers, even at the level of the representations of that
time, already suggested that it was sufficient to “transfer a personality to a
non-biological carrier”, which resulted, in particular, in numerous discussions regarding
the possibility of creating an “e-creature”.
Consequently, the point is no longer the level of development of computer technology;
the point is to comprehend the essence of such information objects as the human
intellect. This factor makes the thesis about the convergence of natural science,
technical and humanitarian knowledge more than relevant.
An excellent illustration of this thesis is the judgment expressed in the review article
[2], written by one of the most prominent experts in the field of mathematical logic and
the philosophy of logic.
“Gabbay predicts that the day is not far off when the computer scientist will wake
up with the realization that his professional line of work belongs to formal philosophy.”
The further development of artificial intelligence systems, which no one doubts, even
regardless of the issue of digital immortality, already raises and will continue to raise
questions for “techies” that were previously predominantly within the competence of the
social sciences/humanities. The main one, obviously, is the question of the essence of
intelligence as such. Without an answer to it, all discussions about whether this particular
system can be considered artificial intelligence or not become pointless [3, 4].
This paper proposes a new non-trivial approach to the development of artificial
intelligence systems, which is most closely related to the problems of digital immortality
and the problems of Jungian psychology.
Specifically, the problem of “digital immortality” cannot be solved all at once; we aim
to show that it can be solved step by step. The basis for this approach is that the
structure of personality, as this work demonstrates, is very complex. There is no point
in trying to transfer the “personality” to a non-biological carrier entirely and
immediately, especially since modern science has not reached the level of understanding
of the essence of the intellect, mind and consciousness of a person that would allow
writing an adequate technical task for programmers.
We argue that the prerequisites already exist for transferring to a non-biological
carrier individual components of the personality, mainly those associated with the
concept of “intelligence”; it should be emphasized that the concepts of “intelligence”,
370 A. Bakirov et al.

“mind” and “consciousness”, although they have overlapping semantic spectra, are by no
means identical.

2 Literature Review

As shown in [3, 4], the intellect, consciousness and mind of a person should, first of
all, be considered as information processing systems. However, these concepts are by no
means identical. A distinction between them can be made starting from the conclusion
about the dual nature of the intellect, consciousness and mind of a person [5]. The cited
work showed, in particular, that the intellect and consciousness of a person are only
relatively independent; in fact, their nature is dual, i.e. human intelligence
simultaneously has both a collective and an individual component.
This conclusion, as well as the principle of dialectical symmetry put forward in [4],
makes it possible to overcome some of the methodological contradictions inherent in
the views of Jung and his followers on the collective unconscious.
Namely, as shown in the cited works, the collective unconscious, understood according to
Jung, is a consequence of the formation of transpersonal information structures, which
arise because the exchange of signals between neurons takes place not only within the
brain of an individual [3, 4].
Any interpersonal communication de facto comes down to an exchange of signals between
neurons belonging to independent fragments (localized within the brain of each person)
of a common neural network. Consequently, along with such information objects as the
intellect, mind and consciousness of a person (the individual level), transpersonal
information structures are also formed, likewise generated by the exchange of signals
between neurons that are part of the global neural network, which can be identified with
the noosphere as understood by V.I. Vernadsky.
This mechanism, in particular, allows revealing the essence of the collective uncon-
scious as an objectively existing information system. Moreover, it radically changes the
view of what should be understood as the structure of personality.
It should be noted that at present there are many psychological schools (not only
Jungian) whose representatives have proposed various schemes of personality structure
and interpretations of the phenomenon of the collective unconscious [6, 7]. Moreover,
the question of the practical use of Jung’s ideas has been raised, for example in
politics [8] and in marketing [9].
However, the dual nature of human intellect and consciousness is reflected in these
schemes inconsistently. In our opinion, this is because psychologists created these
models on a purely empirical basis. The theory of neural networks has so far found only
limited application in psychology, which, as emphasized in [5], is due to a lack of
understanding of the functioning algorithms of neural networks themselves. Recall that
the vast majority of neural networks currently in use are de facto the result of computer
experiments: the algorithms by which artificial neural networks are trained are known,
but it is most often impossible to predict the result of training, and even more so to
reveal the actual algorithm by which the network functions.
It is this factor (the logical opacity of neural networks) that led to the emergence of
the thesis about the need to develop explainable neural networks [10, 11] and explainable
artificial intelligence [12–14]. As shown in [15], this problem is completely solvable.


In particular, thanks to the use of projective geometry methods [16], it is possible to
establish specific logical operations performed by formal neurons and even show that
the number of such operations is de facto limited.
Moreover, as follows from the materials of [5], in modern conditions the intellect and
consciousness of a person are evolving fairly rapidly. This is because, due to the rapid
development of the telecommunications industry, the noosphere is de facto being converted
into a human-machine system. We emphasize that the impact of the Internet on society has
been studied in many works, for example [17, 18], with considerable attention paid to its
impact on society as a whole at the level of sociology [19, 20]. It has been shown that
this effect is very significant; however, the conclusion that the intellect itself is
transformed was not formulated clearly enough.
At the same time, the shift to the Internet of a significant part of the interpersonal
communications that give rise to the collective component of a person’s intellect, mind
and consciousness creates new opportunities not only for comprehending the essence of
intellect, but also for studying it quantitatively. The corresponding trends are already
visible [21], but the conclusions made in [5] allow us to pose the issue much more
broadly, translating the discussion about the possibility of achieving digital
immortality into a practical plane.
Thus, even a cursory review of the literature shows, firstly, that the personality
structure is very complex (which already allows us to speak of transferring some of its
simplest components to a non-biological carrier). Secondly, the existence of personality
components associated with collective effects allows us to raise the question of
establishing the real mechanisms of the functioning of the intellect and consciousness
without trying to penetrate “deep” into the human brain.
Put simply, the immersion of the individual in the communication environment makes it
possible to reveal at least some algorithms of the functioning of consciousness and
intellect while treating the brain as a “black box”, i.e. to reveal the relationship
between “input” and “output” even when the true physiological mechanisms remain unknown.

3 Research Methodology
To substantiate new approaches to the implementation of digital immortality (in a limited
format at the first stage), this paper uses the method of bringing Jung’s personality
structure scheme to a form that meets the principle of dialectical symmetry [3, 4].
From this point of view, the collective unconscious certainly cannot be considered an
element of the personality structure (at least not in the full sense of the term).
Personality should be considered only as something relatively independent; in fact, what
is called a personality is the result of a complex interaction of an individual with
society, or rather with the noosphere. It is in this respect that the intellect and
consciousness of a person are interpreted as entities of a dual nature, containing both
collective and individual components; let us emphasize this again.
In this case, the collective component is generated by transpersonal information
structures, more precisely, the collective component of intelligence is a certain projection
of transpersonal information structures onto a relatively independent fragment of the
noosphere, localized within the brain of an individual.
We emphasize that this conclusion correlates clearly both with the provisions of Soviet
non-classical psychology [22, 23] (later developed to the post-non-classical level [24])
and with the Marxist thesis about the essence of man as a set of social relations.
Thus, we use a general methodological approach in order to demonstrate the adequacy of
posing the question of a partial solution to the problem of digital immortality already
at the present stage of research. The main tool used is applied philosophy, which allows
this problem to be considered from a general methodological standpoint, bringing
together results obtained in the fields of neural networks, sociology, etc.

4 Results

Based on the principle of dialectical symmetry and the methodological conclusion above,
it should be recognized that along with the collective unconscious, understood in Jung’s
sense, there is also a collective conscious. It is formed, in particular, by scientific
theories, political doctrines and everything that a person can be taught and learns
throughout life. Consequently, the well-known scheme of Jung’s personality structure
(presented in Fig. 1 in simplified form) should be supplemented to make it symmetrical
(Fig. 2).
From this conclusion, in turn, it follows that a quite definite distinction should be
made between the intellect, consciousness and mind of a person; these are by no means
synonymous terms. A detailed consideration of this issue is beyond the scope of this
work; it is only important to note that the intellect is the component of the personality
structure (considered as a subsystem of the noosphere) that is most closely in contact
with the collective conscious.

Fig. 1. A simplified version of Jung’s personality structure diagram

It is precisely this part of the personality structure that can already be “digitized”,
i.e. transferred to a non-biological storage medium in the foreseeable future.

Fig. 2. Bringing the personality structure scheme according to Jung to a form that meets the
principle of dialectical symmetry (simplified version).

We should emphasize once again that at the modern stage of investigation all the
prerequisites exist, associated mainly with the development of the theory of artificial
neural networks (ANN) and AI. In particular, as noted above, considerable attention is
currently being paid to research in the field of explainable neural networks [10, 11].
Note that ANNs are often contrasted with systems whose algorithms are explicitly
prescribed. This is because the algorithm of an ANN, formed in the process of learning,
most often remains uninterpretable, which is expressed by the thesis of the logical
opacity of ANNs.
Obviously, overcoming the logical opacity of neural networks is the basis for further
progress in understanding the essence of intelligence, including human intelligence.
From a general methodological point of view this seems almost obvious, especially if
both trained neural networks and human intelligence are considered as a “black box”.
Having revealed the functioning algorithms of neural networks in this vein, one can come
closer to understanding the functioning of more complex systems.
An important step in this direction was made in [25], where a “digital” analogue of the
convolution theorem was applied to the description of signals whose model is built on
functions taking values in Galois fields.
It is appropriate to emphasize, firstly, that this theorem is a direct analogue of the
convolution theorem widely used in applications (in Fourier optics, for example [26])
and, secondly, that algebraic structures of this kind are currently actively used in
coding theory [27, 28]. The theorem considered in [25] exhaustively describes the
functioning of a certain type of ANN (convolutional neural networks [29, 30]), just as
the apparatus of transfer functions exhaustively describes the behavior of any linear
system invariant with respect to time shift.
Consequently, the thesis about the logical opacity of ANNs has been overcome for at
least one, and rather important, type of network.
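The classical convolution theorem whose Galois-field analogue is invoked here can be checked numerically. The sketch below (NumPy; the signals are illustrative only) verifies that the DFT of a circular convolution equals the pointwise product of the DFTs.

```python
import numpy as np

def circular_convolution(x, y):
    """Direct O(n^2) circular convolution of two equal-length signals."""
    n = len(x)
    return np.array([sum(x[k] * y[(i - k) % n] for k in range(n))
                     for i in range(n)])

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.5, -1.0, 0.0, 2.0])

lhs = np.fft.fft(circular_convolution(x, y))   # DFT of the convolution
rhs = np.fft.fft(x) * np.fft.fft(y)            # pointwise product of DFTs
print(np.allclose(lhs, rhs))  # True
```

The digital analogue in [25] replaces the complex field with a non-binary Galois field, but the structural identity illustrated here is the same.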

This allows one to raise the question concerning at least partial preservation of the
human intellect on some non-biological carrier through the decoding of those biological
neural network operations that are associated with this particular component of the
personality.
Simplifying somewhat, intelligence is not only memory; it is primarily an algorithm (in the
broad sense of the term), a product of information self-organization processes. More
precisely, the intellect is a system of information processing, but this system is built on
certain rules, and it is they that need to be revealed.
Intelligence from this point of view can be considered as a “black box”, the real
structure of which (neurophysiological processes) remains unknown. However, its algorithm can be reconstructed from an array of data reflecting its response to external
influences. In particular, as applied to convolutional ANNs, such decoding can be carried
out already now, based on the digital analogue of the convolution theorem [25].
Recall that the equivalent electronic circuit of any linear system that processes time-
dependent signals and has the property of invariance (with respect to time shift) can
be established based on the analysis of its amplitude-frequency characteristic, using the
classical convolution theorem. In a completely similar way, the logic of a convolutional
neural network is decoded based on the digital convolution theorem, if there is a sufficient
data array, to which each set of values characterizing the state of the inputs of the ANN
is associated with a set of data characterizing the state of the outputs.
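The identification procedure described above can be sketched in a few lines: probe an unknown linear, shift-invariant "black box" with a known input, observe the output, and recover the equivalent impulse response by spectral division. The hidden system and the probe signal below are illustrative stand-ins, not data from any actual network.

```python
import cmath

N = 4

def dft(x, sign=-1):
    """Direct complex DFT of a length-N sequence (sign=+1 for the inverse kernel)."""
    return [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def idft(X):
    return [v / N for v in dft(X, sign=+1)]

# An unknown "black box": circular convolution with a hidden impulse response.
hidden_h = [0.5, 0.3, 0.2, 0.0]

def black_box(x):
    return [sum(x[m] * hidden_h[(n - m) % N] for m in range(N))
            for n in range(N)]

# Probe the box with a known input whose spectrum has no zeros...
x = [1.0, 2.0, 3.0, -1.0]
y = black_box(x)

# ...and reconstruct the equivalent system by spectral division H = Y / X.
H = [Y / X for X, Y in zip(dft(x), dft(y))]
recovered = [v.real for v in idft(H)]
assert all(abs(a - b) < 1e-8 for a, b in zip(recovered, hidden_h))
```

The same input/output logic, transplanted to the Galois-field transform above, is what lets the behavior of a convolutional network be described without opening the box.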
This example clearly shows that for any technology aimed at even a partial transfer
of intelligence to a non-biological carrier (the first clearly visible step towards ensuring
digital immortality), data that merely records the user's behavior (say, an array of video
information, etc.) is obviously not enough. In any case, at this stage of research it is not
obvious how such an array of data could be used for the purposes under consideration.
The technologies of digital reconstruction of intelligence must obviously record not
only the behavior of the user, but also the circumstances that cause this or that reaction,
these or those judgments.
The most convenient area in which such reactions can be tracked is obviously related
to the educational process (more broadly, professional activity).
Here already now it is possible to offer a whole range of possible technical solutions
that are quite realizable at the present level of development of programming.
One of them is related to recording users' reactions to digital educational resources
and/or specialized (scientific, popular, etc.) literature, which serves as a known external
stimulus (the input signals).
In fact, the prerequisites already exist for recreating a professional in a particular field
(for example, a teacher) from the appropriate data array, which is largely formed as a
result of his professional activity.
There is no need to create a completely artificial intelligence system designed to
teach students if you can reconstruct the digital image of a real teacher (of course, you
need to choose the most talented and experienced ones) who will conduct the classes.
Of course, such an image can only partly be interpreted as digital immortality; however,
this step is already clearly visible. In addition, it is able to stimulate further research in
the field of understanding the essence of intelligence, i.e. in an area that de facto remains
the borderline between information technology and applied philosophy.
To the Question of the Practical Implementation 375

Of course, it should be emphasized that we considered convolutional neural networks
as nothing more than an illustration of the main result of this work, which, in essence,
was obtained by methods of applied philosophy.
Intelligence, as well as any other information processing system, can be considered
as a “black box”. This means that in order to reproduce such a system on another storage
medium, it is not at all necessary to copy the device of the “black box” itself in detail.
Just as a certain linear radio engineering device can be replaced by its equivalent circuit,
which is built on the basis of identifying connections between the “input” and “output”,
so the individual components of the personality can be reconstructed without trying to
understand exactly how the individual’s brain is physiologically arranged.
Moreover, applied philosophy suggests that, on the contrary, excessive detailing can
have a negative impact. It is important to isolate the essence of the laws of thinking
that generate intelligence as such, and the elemental base on which it is implemented is
secondary.
Further, consideration of the intellect (and even more so the consciousness of a
person) on the basis of the analogy with the “black box” may turn out to be overly
complicated (at least at the current stage of research). The task, however, is greatly
simplified by the fact that the personality has a complex, layered structure. Accordingly, the
first step towards the practical implementation of the concept of digital immortality is
to transfer only the simplest components of personality to a non-biological medium.
This is the main message of this work: using the methods of applied philosophy,
it is possible to show that the problem of digital immortality should be solved step by
step, taking into account the complex structure of the personality. First of all, one should
focus on deciphering (in the above sense) its first layer, which becomes available for
analysis due to the development of telecommunication technologies.

5 Conclusion
Thus, on the basis of the principle of dialectical symmetry, put forward in the framework
of the philosophy of dialectical positivism, Jung’s scheme of personality structure should
be modernized by adding a new component (the collective conscious).
This component of the structure of personality most closely adjoins the intellect; it is
formed by all those concepts, theories, views, etc. which a person is able to consciously
explore throughout his life.
The present report also shows that the question of digital immortality ceases to lie in
the plane of the unattainable.
Moreover, this problem can be solved sequentially; the basis for such an approach is the
existence of a complex structure of personality as an element of the enclosing system,
the noosphere.
The first step corresponds to the deciphering of that component of the personal-
ity that most closely adjoins the collective conscious, and even the limited success of
technologies for this purpose will inevitably give impetus to further work in this direction.

References
1. Bostrom, N.: How long before superintelligence? Int. J. Futures Stud., 2 (1998)
2. Karpenko, A.S.: Modern research in philosophical logic. Quest. Philos. 9, 54–75 (2003)
3. Suleimenov, I.E., Vitulyova, Y.S., Bakirov, A.S., Gabrielyan, O.A.: Artificial intelligence:
what is it? In: ACM International Conference Proceeding Series, pp. 22–25 (2020). https://doi.org/10.1145/3397125.3397141
4. Vitulyova, Y.S., Bakirov, A.S., Baipakbayeva, S.T., Suleimenov, I.E.: Interpretation of the
category of complex in terms of dialectical positivism. IOP Conf. Ser. Mater. Sci. Eng. 946(1),
012004 (2020). https://doi.org/10.1088/1757-899X/946/1/012004
5. Bakirov, A.S., Vitulyova, Y.S., Zotkin, A.A., Suleimenov, I.E.: Internet users’ behavior from
the standpoint of the neural network theory of society: prerequisites for the meta-education
concept formation. In: The International Archives of the Photogrammetry, Remote Sensing
and Spatial Information Sciences, vol. XLVI-4/W5-2021, pp. 83–90 (2021). https://doi.org/10.5194/isprs-archives-XLVI-4-W5-2021-83-2021
6. Hunt, H.T.: A collective unconscious reconsidered: Jung’s archetypal imagination in the light
of contemporary psychology and social science. J. Anal. Psychol. 57(1), 76–98 (2012)
7. Mills, J.: Jung’s metaphysics. Int. J. Jungian Stud. 5(1), 19–43 (2013)
8. Odajnyk, V.W.: Jung and politics: the political and social ideas of C.G. Jung. iUniverse (2007)
9. Woodside, A.G., Megehee, C.M., Sood, S.: Conversations with (in) the collective unconscious
by consumers, brands, and relevant others. J. Bus. Res. 65(5), 594–602 (2012)
10. Assaf, R., Schumann, A.: Explainable deep neural networks for multivariate time series
predictions. In: IJCAI, pp. 6488–6490 (2019)
11. Angelov, P., Soares, E.: Towards explainable deep neural networks (xDNN). Neural Netw.
130, 185–194 (2020)
12. Arrieta, A.B., et al.: Explainable artificial intelligence (XAI): concepts, taxonomies, oppor-
tunities and challenges toward responsible AI. Inf. Fus. 58, 82–115 (2020)
13. Gunning, D., et al.: XAI—explainable artificial intelligence. Sci. Robot. 4(37) (2019)
14. Došilović, F.K., Brčić, M., Hlupić, N.: Explainable artificial intelligence: a survey. In: 2018
41st International Convention on Information and Communication Technology, Electronics
and Microelectronics (MIPRO), pp. 0210–0215. IEEE (2018)
15. Suleimenov, I.E., Bakirov, A.S., Matrassulova, D.K.: A technique for analyzing neural
networks in terms of ternary logic. J. Theor. Appl. Inf. Technol. 99(11), 2537–2553 (2021)
16. Vitulyova, Y.S., Bakirov, A.S., Shaltykova, D.B., Suleimenov, I.E.: Prerequisites for the anal-
ysis of the neural networks functioning in terms of projective geometry. IOP Conf. Ser. Mater.
Sci. Eng. 946(1), 012001 (2020)
17. Yang, Z., et al.: Understanding retweeting behaviors in social networks. In: Proceedings
of the 19th ACM International Conference on Information and Knowledge Management,
pp. 1633–1636 (2010)
18. Benevenuto, F., et al.: Characterizing user behavior in online social networks. In: Proceedings
of the 9th ACM SIGCOMM Conference on Internet Measurement, pp. 49–62 (2009)
19. Roblek, V., Meško, M., Bach, M.P., Thorpe, O., Šprajc, P.: The interaction between internet,
sustainable development, and emergence of society 5.0. Data 5(3), 80 (2020)
20. Rutter, J.: From the sociology of trust towards a sociology of ‘e-trust.’ Int. J. New Prod. Dev.
Innov. Manag. 2(4), 371–385 (2001)
21. Hossain, S.: The Internet as a tool for studying the collective unconscious. Jung J. 6(2),
103–109 (2012)
22. Luria, A.: Language and Consciousness. Publishing House Peter, St. Petersburg, 336 p. (2020)
23. Kravtsova, E.E.: Non-classical psychology L.S. Vygotsky. Natl. Psychol. J. 1, 61–66 (2012)
24. Klochko, V.E.: The Problem of Consciousness in Psychology: A Post-non-Classical Per-
spective. Bulletin of the Moscow University. Series 14. Psychology, vol. 4, pp. 20–35
(2013)
25. Vitulyova, E.S., Matrassulova, D.K., Suleimenov, I.E.: Application of non-binary Galois fields
Fourier transform for digital signal processing: to the digital convolution theorem. Indones.
J. Electr. Eng. Comput. Sci. 23(3), 1718–1726 (2021)
26. Goodman, J.W.: Introduction to Fourier Optics. Roberts and Company Publishers (2005)
27. Hla, N.N., Aung, D., Myat, T.: Implementation of finite field arithmetic operations for large
prime and binary fields using Java BigInteger class. Int. J. Eng. Res. Technol. (IJERT) 6(08)
(2017)
28. Shah, D., Shah, T.: Binary Galois field extensions dependent multimedia data security scheme.
Microprocess. Microsyst. 77, 103181 (2020)
29. Afridi, M.J., Ross, A., Shapiro, E.M.: On automated source selection for transfer learning in
convolutional neural networks. Pattern Recogn. 73, 65–75 (2018)
30. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556 (2014)
Collaborative Forecasting Using
“Slider-Swarms” Improves Probabilistic
Accuracy

Colin Domnauer, Gregg Willcox, and Louis Rosenberg(B)

Unanimous AI, San Francisco, CA 94115, USA
[email protected]

Abstract. Artificial Swarm Intelligence (ASI) is a powerful method for amplifying the collective intelligence of decentralized human teams and quickly improving their decision-making accuracy. Previous studies have shown that ASI tools,
such as the Swarm® software platform, can significantly amplify the collabora-
tive accuracy of decentralized groups across a wide range of tasks from forecast-
ing and prioritization to estimation and evaluation. In this paper, we introduce
a new ASI method for amplifying group intelligence called the “slider-swarm”
and show that networked human groups using this method were 11% more accu-
rate in generating collaborative forecasts as compared to traditional polling-based
Wisdom of Crowds (WoC) aggregation methods (p < 0.001). Finally, we show
that groups using slider-swarm on three real-world forecasting tasks, including
forecasting the winners of the 2022 Academy Awards, produce collective fore-
casts that are 11% more accurate than a WoC aggregation. These results suggest
slider-swarms amplify group forecasting accuracy across a range of real-world
forecasting applications.

Keywords: Swarm intelligence · Artificial swarm intelligence · Collective intelligence · Wisdom of crowds · Hive minds · Collaboration

1 Introduction

In the field of Collective Intelligence (CI), it is well known that a large group, by aggregating the estimations, evaluations, and forecasts of its members, can significantly outperform any of its individual members. For well over a century, a wide variety of aggregation techniques have been
explored for harnessing the intelligence of human populations to enable more accurate
decisions [1–3]. Artificial Swarm Intelligence (ASI) is a recent real-time technique that’s
been shown to significantly amplify the decision-making accuracy of networked human
groups using intelligence algorithms modeled on biological swarms. Unlike votes, polls,
surveys, or prediction markets, which treat each participant as a separable datapoint for
statistical processing, the ASI process treats each individual as an active member of a
real-time dynamic system, enabling the full group to efficiently converge on solutions
as a unified intelligence [4, 5, 9].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 378–392, 2023.
https://doi.org/10.1007/978-3-031-18461-1_26
Collaborative Forecasting Using “Slider-Swarms” 379

For example, a recent study conducted at the Stanford University School of Medicine
showed that small groups of radiologists, when connected by real-time ASI algorithms,
could diagnose chest X-rays with 33% fewer errors than traditional methods of aggre-
gating human input [6, 7]. Researchers at Boeing and the U.S. Army recently showed
that small groups of military pilots, when using ASI technology, could more effec-
tively generate subjective insights about the design of cockpits than current methods [8].
Researchers at California Polytechnic published a study showing that networked busi-
ness teams increased their accuracy on a standard subjective judgment test by over 25%
when deliberating as real-time ASI swarms [9–11]. Also, researchers at Unanimous AI,
Oxford University, and MIT showed that small groups of financial traders, when fore-
casting the price of oil, gold, and stocks, increased their predictive accuracy by over
25% when using the ASI method [12–14]. And researchers at Unanimous AI showed that
networked human teams collaboratively responding to standard IQ question tests could
increase their collective IQ score by 14 points when working together using the Swarm®
software platform vs. WoC voting [15].
Groupwise decision-making is an increasingly important area of research as more
and more teams work remotely. In addition, the rise of Decentralized Autonomous
Organizations (DAOs) increases the need for powerful and precise tools for amplified
groupwise decision-making. While the ability of swarm-based systems to amplify group
intelligence has been validated across many disciplines, current methods do not allow
individuals to freely control their own input, which is desirable for some probabilistic
forecasting tasks. In this paper, we introduce a new swarming method, called a “slider-
swarm” designed for probabilistic forecasting. We first explain the mechanics of this
new interface, and then examine the effectiveness of groups using the interface when
answering a set of partial-knowledge group probabilistic forecasting questions.

2 Slider-Swarm

The slider-swarm methodology allows a group of networked users to collaboratively generate probabilistic forecasts as a real-time ASI system using standard web browsers.
It does this through a two-step process that includes (i) Personal Deliberation and (ii)
Groupwise Deliberation. During the personal deliberation phase, a forecasting prompt is
simultaneously shown to all participants on their individual computer screens along with
a graphical slider for entering a probabilistic forecast. Users are asked to enter forecasts
in isolation (i.e. without seeing any forecasting data from the other networked users.)
In practice this deliberation phase lasts 20 to 30 s and is coordinated by synchronized
countdown timers on all computer screens. Figure 1 below shows an example user
interface used during the Personal Deliberation phase in which each user views the
forecasting question and adjusts their own forecast on a probabilistic scale.
Personal Deliberation is immediately followed by the Groupwise Deliberation phase
during which each user is prompted to update their forecasts while being shown the real-
time forecasts of all other users in the form of a smoothed graphical histogram that
redraws continuously as participants react to each other’s changing forecasts. In this
way, an ASI swarming process is enabled in which all members of the population are
empowered to react to each other’s changing forecasts in real-time, creating a single
380 C. Domnauer et al.

Fig. 1. The slider-swarm interface during personal deliberation

dynamic system that converges on a final result. In practice, this Groupwise Delibera-
tion phase lasts between 20 and 40 s during which time participants continuously adjust
their forecast based on the behaviors of other participants in the real-time groupwise
process. After this window, each individual’s Final Answer is recorded, and an aggre-
gated groupwise forecast is generated algorithmically using the dynamic data collected
during the two-step process.
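The two-step flow can be sketched in a few lines. The deadband check and the mean-based aggregate follow the description above; the convergence dynamics, in which each user drifts toward the live group mean at a rate set by an assumed per-user "flexibility," are a hypothetical stand-in for real participant behavior, not the algorithm used in the Swarm platform.

```python
# Toy sketch of the two-phase slider-swarm flow. The drift-toward-the-mean
# dynamics and the per-user "flexibility" values are hypothetical.

def personal_phase(initial, deadband=(0.48, 0.52)):
    """Registered answers must lie outside the central deadband."""
    return [p for p in initial if not deadband[0] <= p <= deadband[1]]

def groupwise_phase(forecasts, flexibility, steps=25):
    """Each step, every user moves part-way toward the current group mean."""
    f = list(forecasts)
    for _ in range(steps):
        mean = sum(f) / len(f)
        f = [p + flex * (mean - p) * 0.1 for p, flex in zip(f, flexibility)]
    return f

# Five hypothetical users; all initial answers lie outside the deadband.
initial = [0.30, 0.40, 0.60, 0.70, 0.80]
flexibility = [0.9, 0.8, 0.5, 0.2, 0.1]   # confident users move less
final = groupwise_phase(personal_phase(initial), flexibility)
group_forecast = sum(final) / len(final)  # aggregate, as in the paper: the mean
```

In this toy run the two confident users (flexibility 0.2 and 0.1) barely move, so the group estimate drifts toward their side of the scale, mirroring the confidence-weighted convergence the paper reports in its results.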
Figure 2 shows a slider-swarm during a real-world probabilistic forecasting task
in which a group predicts which of two films is more likely to win an Oscar. In Fig. 2(a),
each user is presented with a forecasting prompt: “Which will win Best Documentary?”
and is asked to set their own individual forecast. Each user, working in isolation, must
move their probabilistic slider out of a highlighted deadband region in order for their
answer to be registered. Once they do, their slider turns green (Fig. 2(b)). After this initial
phase is complete, all users are simultaneously shown the distribution of responses from
the full population of participants (Fig. 2(c)). Each user is then asked to adjust their
forecast by considering the responses from other users. This is a simultaneous swarming
process in which all users can see the changing input from other users in real-time,
thereby creating a system in which users are acting, reacting, and interacting as a unified
whole. To ensure that all users provide some degree of change, they are again required to
move out of a 2% deadband region around their initial answer.
The group is provided 25 s for the swarming phase, during which time the real-time
movements of each user influence the behaviors of other users, often creating cascades of
change within the system. Figure 2(d) shows the final responses from this group: notice
the net-leftward movement of the group and that the mean answer has changed from
an initial collective forecast of 55% probability of the film Summer of Soul winning
to a final collective forecast of a 64% probability of Summer of Soul winning. The
real-time swarming process thereby encouraged this group to collectively change the
final collaborative answer by 9%, a move that was ultimately in the correct direction,
as Summer of Soul indeed won this category in 2022.
Fig. 2. A view of a user in a slider-swarm answering a question.

3 Experimental Design

To enable groups of human subjects to answer repeatable sets of forecasting questions in which their relative knowledge varies across the population in a predictable way, a
novel experimental methodology was developed. In this method, users are presented
with a simulated bag of 19 marbles, each of which is either RED or BLUE. Each user
is asked to predict the likelihood that the majority of the marbles are one color or the
other. As there are an odd number of marbles in the bag, there will always be a definitive
outcome. To vary the useful knowledge possessed by participants across the population,
each user is shown the color of a small number of marbles within the bag. One user,
for example, might be shown two randomly selected marbles, revealing that both are RED.
That provides the user with a small amount of insight, as the user has no idea what color
the other 17 marbles are. Another user might be shown a set of nine random marbles.
That user might discover that eight are BLUE and one is RED. This user would have
a strong insight that the majority of the marbles in the 19-marble bag is likely (but
not definitively) BLUE. Because each user is shown a different randomly selected set
of marbles, this method can assign varying levels of confidence across the forecasting
population, as would be expected in real-world forecasting situations where people have
different levels of insight into the question at hand.
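Our own illustration of the evidence each user holds (not part of the study protocol): assuming a uniform prior over the bag's red count, consistent with the instruction that the composition "is randomly selected before each question," the posterior probability of a RED majority given a revealed sample follows from a hypergeometric likelihood.

```python
from math import comb

BAG = 19  # total marbles; odd, so a majority color always exists

def p_majority_red(shown_red, shown_blue):
    """Posterior probability that most of the 19 marbles are RED,
    assuming a uniform prior over the bag's red count and a
    hypergeometric likelihood for the revealed sample."""
    weights = {}
    for r in range(BAG + 1):
        if shown_red <= r and shown_blue <= BAG - r:
            # Number of ways this bag composition produces the observed sample.
            weights[r] = comb(r, shown_red) * comb(BAG - r, shown_blue)
    total = sum(weights.values())
    return sum(w for r, w in weights.items() if r >= 10) / total

# A user shown two red marbles has only weak evidence...
weak = p_majority_red(2, 0)
# ...while one shown eight blue and one red has a strong signal for BLUE.
strong_blue = 1 - p_majority_red(1, 8)
assert strong_blue > weak > 0.5
```

Under these assumptions, seeing two red marbles already implies roughly a 0.89 probability of a red majority, while seeing eight blue and one red makes a blue majority a near certainty, which is exactly the spread of confidence levels the experiment is designed to induce.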
For this study, the distribution of revealed marbles was structured as follows: for
each trial, each user was randomly assigned to either the minority Group A with 40%
of participants or the majority Group B with 60% of participants. Participants were not
aware of the random assignment into groups, nor aware of the minority/majority structure
of the task at hand. Group A and Group B were each shown a different set of marbles,
with every member within the same group seeing the same thing. The marbles shown
to each group were structured so that one group always saw more blue marbles than
red marbles, and vice versa. This created a split in the population so that the majority B
supported one color while the minority A supported the other color.
Additionally, on each question, we controlled the confidence level of each group
by changing the composition of marbles shown (the difference in the number of red
and blue marbles), thereby creating one confident group and one less confident, flexible
group. The “correct” answer was always the answer favored by the confident group.
The majority was designed to be more confident on 5/15 (33%) of the questions in this
experiment, while the minority was designed to be more confident on 10/15 (67%) of the
questions in this experiment. For analysis purposes, the question type was further broken
down by the confidence level assigned to each group: a group was initialized with High
confidence if it saw a three- or four-marble color difference between Red and Blue (e.g.
six red, two blue), Medium confidence if it saw a two-marble color difference, and
Low confidence if it saw a one-marble color difference. No group was ever shown
a set of marbles with equal numbers of red and blue. Therefore, both question types
(Majority correct and Minority correct) consisted of three subcategories based on the
confidence levels assigned to group A and B: High vs. Medium confidence, High vs
Low confidence, and Medium vs. Low confidence.
Figure 3 shows two examples of the online interface we used for this study, each
from a different participant’s screen. The leftmost user, who was part of Group A, saw six marbles (five red and one blue) and therefore favored “red” as the correct answer. The rightmost user, who was part of Group B, saw seven marbles (three red and four blue) and therefore slightly favored “blue” as the correct answer.

Fig. 3. Two participants see different random draws from the same bag of marbles.

Using this methodology, three groups of between 30 and 36 Mechanical Turk users
were convened to answer a set of 15 probabilistic forecasting questions of this type, each
featuring a different simulated “bag” of marbles.

Participants were given the following instructions before the experiment began: (1)
Each question focuses on one bag of marbles hidden from view. You know three things
about each bag: (i) Each marble in this bag will be either RED or BLUE. (ii) There are
always 19 marbles in the bag. (iii) The fraction of RED and BLUE marbles in the bag
is randomly selected before each question. (2) While we’re all using the same bag of 19
marbles on each individual question, everyone will see a random selection of marbles
from this bag. Some people may see more marbles than others, but no one will see more
than half of the marbles. (3) We will swarm as a group to forecast whether the bag
contains more RED or BLUE marbles.
On each question, after privately observing their marbles and entering their initial
responses based on the marbles they saw, the group then worked together using the
slider-swarm interface to create a collective probabilistic forecast: “Are the majority of
the marbles in this bag RED or BLUE?”
To motivate participants to give reasonable forecasts and to pay attention to one
another’s forecasts, a $2 bonus was given out to each group that answered more than
80% of questions correctly.

4 Results

Results were computed for each of the three methods of probabilistic forecasting: as
(i) individuals, (ii) WoC, and (iii) slider-swarm. Individual answers were collected as
the final responses registered during the Personal Deliberation phase. As previously discussed, in this phase each participant gave an answer on the slider without seeing other users’ answers, essentially acting as a blind survey. Next, representing traditional group
aggregation methods, the WoC answers were computed as the mean of all individuals’
initial answers. Finally, the slider-swarm answers were computed as the mean of all
individual final answers at the end of the Group Deliberation phase.
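For concreteness, the three scoring pipelines can be sketched as follows, using the standard binary Brier score (squared error of the assigned probability against the 0/1 outcome). The answers are hypothetical values for a single question, not data from the study.

```python
def brier(forecast, outcome):
    """Brier score for a binary event: squared error of the probability
    assigned to the outcome (0 = perfect, 1 = worst possible)."""
    return (forecast - outcome) ** 2

# Hypothetical data for one question where RED (outcome = 1) was correct.
initial = [0.40, 0.55, 0.70, 0.45, 0.60]   # Personal Deliberation answers
final   = [0.55, 0.60, 0.72, 0.58, 0.65]   # after Groupwise Deliberation

woc          = sum(initial) / len(initial)   # mean of initial answers
slider_swarm = sum(final) / len(final)       # mean of final answers

# In this example the group moved toward the correct side, so the
# slider-swarm aggregate scores better than the WoC aggregate.
assert brier(slider_swarm, 1) < brier(woc, 1)
```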
Over the course of the slider-swarm, individuals reduced their mean Brier scores from
0.238 during the Personal Deliberation phase to 0.214 during the Group Deliberation
phase, a 10% reduction. Moreover, the group forecast had lower errors when working together in
a slider-swarm compared to using traditional WoC aggregation, significantly reducing
the mean Brier score from 0.212 (WoC) to 0.189 (slider-swarm), an 11% reduction.
To compute statistical significance in Brier score differences across the three categories (individual, WoC, slider-swarm), a bootstrapping analysis was performed to generate a confidence interval for the mean Brier score of each category, where the observed Brier scores for each forecasting method were resampled with replacement 1000 times (Fig. 4). As outlined in Table 1, across the full question set, the slider-swarm method achieved a significantly lower error as compared to both individuals (p < 0.001) and WoC (p < 0.001).
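A percentile bootstrap of the kind described can be sketched as follows; the per-question scores are hypothetical, and the resampling details beyond "1000 resamples with replacement" are our assumptions.

```python
import random

def bootstrap_mean_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean: resample with replacement,
    recompute the mean each time, take the central (1 - alpha) interval."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-question Brier scores for one forecasting method.
scores = [0.25, 0.18, 0.22, 0.30, 0.15, 0.20, 0.24, 0.19, 0.21, 0.17]
lo, hi = bootstrap_mean_ci(scores)
assert lo <= sum(scores) / len(scores) <= hi
```

Non-overlap of such intervals between two methods is one common way to read off the kind of significance reported in Fig. 4.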
To better understand how the composition of the group impacts the performance of the
slider-swarm, we next analyzed the confident-majority and confident-minority question
types separately. As described in Table 2 and illustrated in Fig. 5, on questions with a
more confident majority, the slider-swarm resulted in 22% lower Brier scores than the
WoC (p < 0.001). On questions with a confident minority, the slider-swarm moderately
outperformed traditional WoC aggregation, yielding a 3.7% reduction in error, but this

Fig. 4. Results of bootstrap analysis for mean Brier score across three categories: individual, WoC, and slider-swarm. Slider-swarm achieved a significant reduction in error compared to both individuals and WoC aggregation.

Table 1. Slider-swarm offers a significant reduction in error compared to both individual answers and traditional WoC aggregation.

Forecast method | Brier score | Error reduction vs slider-swarm (p-value)
Individual      | 0.238       | −20.8% (3.1 × 10⁻¹¹)
WoC             | 0.212       | −10.9% (4.3 × 10⁻⁵)
Slider-swarm    | 0.189       | n/a

result was not statistically significant (p = 0.12). Therefore, we can conclude that slider-
swarms produce better probabilistic forecasts than WoC aggregation on questions where
most people are confident and correct; this result may hold for questions where a minority
of people are confident and correct but the effect size is likely smaller and thus a larger
number of trials is needed to confirm the effect.
Exploring this issue further, we analyzed the unique subset of questions in which the
majority group was shown a set of marbles with only a one-marble color differential,
thereby inspiring low confidence, while the minority group had either Medium or High
confidence. For these questions, the slider-swarm produced significantly lower Brier
scores than a WoC survey (slider-swarm Brier = 0.207, WoC Brier = 0.223, p = 0.025).
In other words, when the majority possesses relatively low confidence in their answer
compared to the minority, the slider-swarm method enables the minority to influence
the population, causing the majority to switch to the minority position. Thus, the slider-
swarm system allows a correctly confident minority to more successfully influence the
group to converge upon the correct answer (rather than the most popular answer), even
when a large majority holds the opposing (incorrect) belief. This is an important result.

Table 2. Performance of WoC compared to slider-swarms by question type

Question type                     | Frequency | Slider-swarm accuracy | Brier change: slider-swarm vs. WoC | p-value: slider-swarm vs. WoC
1: Majority > Minority Confidence | 18        | 100% (18/18)          | −0.042 (−22%)                      | 3.0 × 10⁻¹¹
2: Majority < Minority Confidence | 27        | 59% (16/27)           | −0.009 (−3.7%)                     | 0.12

Fig. 5. Slider-swarm reduced the individual and WoC error both in cases of a confident majority
(left) and in cases of a confident minority (right)

Why is it that a slider-swarm can aggregate confidence better than the WoC on
these types of questions? The key difference is that the slider-swarm doesn’t merely ask
participants to report their confidence, but requires each participant to behave in real-
time while being exposed to the beliefs and behaviors of other members of the group. In
doing so, slider-swarm is a dynamic system that empowers the group to converge upon
the answer they are collectively the most confident in.
To examine how user behaviors allow slider-swarms to converge upon the answers
the group is collectively the most confident in, we examined the “flipping behavior”
of individuals: e.g. when they choose to switch from one side of 50% to the other. In
those questions with a confident minority, the slider-swarm permitted a large number
of individuals who were initially incorrect to switch to the correct side of 50%. In
fact, if a simple majority vote was taken at the end of the Personal Deliberation phase
(i.e., a traditional WoC survey) on these questions, the group would have answered
only 1/27 (3.7%) of questions correctly. However, if this vote was taken at the end of
the Group Deliberation phase, the group would have answered 15/27 (55.6%) questions
correctly, which translates to a 1500% increase in the voting accuracy enabled by the use
of slider-swarm. Finally, using the mean answer from the end of the Group Deliberation
phase, the slider-swarm probabilities were on the correct side of 50% on 17 out of 27
questions (62.9%). In other words, not only do individuals themselves become more
accurate through real-time dynamic Group Deliberation using slider-swarms, but also
the collective intelligence becomes even more accurate.
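The vote-flipping comparison above can be reproduced with a short sketch. Everything below is hypothetical illustration (made-up probability reports, not the study's data): it computes the fraction of questions on which a simple majority vote, or the mean reported probability, lands on the correct side of 50% before and after deliberation.

```python
import numpy as np

def majority_correct(probs, truth):
    """Fraction of questions where a simple majority vote lands on the
    correct side of 50%. probs: (questions, participants) array of reported
    probabilities that outcome A occurs; truth: 1 if A occurred, else 0."""
    votes = (probs > 0.5).mean(axis=1) > 0.5          # per-question majority
    return (votes == truth.astype(bool)).mean()

def mean_answer_correct(probs, truth):
    """Fraction of questions where the mean reported probability is on the
    correct side of 50% (the end-of-deliberation aggregate)."""
    return ((probs.mean(axis=1) > 0.5) == truth.astype(bool)).mean()

# Hypothetical reports: 4 questions, 5 participants; outcome A occurred on all.
personal = np.array([[0.4, 0.45, 0.3, 0.6, 0.48],
                     [0.2, 0.3, 0.4, 0.9, 0.1],
                     [0.7, 0.8, 0.6, 0.9, 0.55],
                     [0.3, 0.4, 0.45, 0.35, 0.2]])
group = np.clip(personal + 0.25, 0.0, 1.0)   # deliberation shifts reports toward A
truth = np.ones(4)

print(majority_correct(personal, truth), majority_correct(group, truth))   # 0.25 1.0
```

The same pattern applied to per-question data from a real session would yield vote counts like the 1/27 versus 15/27 figures reported above.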
How do individuals collectively become more accurate during the Group Delibera-
tion phase? As illustrated in Fig. 6, individuals belonging to the more confident group
on a given question move their slider significantly less, averaging 1.4% less motion
(p = 7 × 10^−4), as compared to individuals belonging to the less confident group. This result
gives an indication of how the slider-swarm works in practice: individuals who are less
confident in their initial answers are more likely to concede their position, driving the
dynamic system towards collective answers that are biased towards the more confident
sub-populations in the group.

Fig. 6. Individuals who are assigned into the more confident sub-population move significantly
less than individuals of the less confident sub-population.

5 Real World Forecasts


To further validate the effectiveness of slider-swarms beyond the controlled scenario of
the “Marbles Test,” additional experiments were conducted to determine the ability of
slider-swarms to forecast real-world scenarios as follows:
Experiment #1 was a subjective judgment task called the “Smile Test” [16]. In
this assessment, participants were shown a short (~3 s) video clip of a human smiling
and were subsequently required to determine if the smile was genuine (“REAL”) or
not (“FAKE”). Experiment #2 was a similar task of subtle facial discernment, showing
participants still images of human faces and asking them to decide if the face in question was
unaltered (“REAL”) or photoshopped (“FAKE”) [17]. Photoshopped images spanned
three levels of difficulty: easy (obvious change), medium, and hard (very subtle change).
Lastly, Experiment #3 consisted of a group of average movie fans (not experts) tasked
with forecasting the results of 15 categories in the 2022 Academy Awards approximately
one week ahead of the event.
Collaborative Forecasting Using “Slider-Swarms” 387

Again, results were computed for each of the three methods of probabilistic
forecasting: as (i) individuals, (ii) WoC, and (iii) slider-swarm. In each of these three real-world
experiments, the slider-swarm achieved a significant reduction in error compared to the
average individual as well as the WoC aggregation.
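The error metric used throughout this section is the Brier score. As a minimal sketch, assuming the standard binary form (mean squared difference between the forecast probability and the 0/1 outcome, lower is better):

```python
import numpy as np

def brier(forecast_probs, outcomes):
    """Mean Brier score for binary forecasts: mean of (p - outcome)^2,
    where outcome is 1 if the event occurred and 0 otherwise."""
    p = np.asarray(forecast_probs, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    return float(np.mean((p - o) ** 2))

# A confident correct forecast scores 0; an uninformative 50/50 hedge scores 0.25.
print(brier([1.0, 0.0], [1, 0]))   # 0.0
print(brier([0.5, 0.5], [1, 0]))   # 0.25
```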
In the Smile Test, over the course of the slider-swarm, individuals reduced their mean
Brier scores from 0.237 during the Personal Deliberation phase to 0.209 during the Group
Deliberation phase, achieving a 12% error reduction using the slider-swarm interface.
Moreover, the group forecasts were 9.6% more accurate (p = 2.5 × 10^−3) when the
group worked together as a slider-swarm, down to a Brier score of 0.191 (slider-swarm)
from 0.211 (WoC). These results are shown in Table 3, and a bootstrap analysis is shown
in Fig. 7.

Table 3. Slider-swarm offers reduction in error compared to both individual answers and
traditional WoC aggregation in subjective judgment “Smile Test.”

Smile test
Forecast method   Brier score   Error vs slider-swarm (p-value)
Individual        0.237         19.7% greater error (2.5 × 10^−8)
WoC               0.211         9.65% greater error (2.5 × 10^−3)
Slider-swarm      0.191         n/a

Fig. 7. For the Smile Test: results of bootstrap analysis for mean Brier score across three
categories: individual, WoC, and slider-swarm. Slider-swarm achieved a significant reduction in
error compared to both individuals and WoC aggregation.

In the Faces Test, over the course of the slider-swarm, individuals reduced their
mean Brier scores from 0.218 during the Personal Deliberation phase to 0.188 during the

Group Deliberation phase, a 14% reduction in forecast error. Moreover, when working
together in a slider-swarm, the group forecast error was reduced by 13.6% (p = 3.8 × 10^−4),
from 0.191 (WoC) to 0.165 (slider-swarm). A bootstrap analysis is shown in Fig. 8 and
these results are tabulated in Table 4.

Table 4. Slider-swarm offers reduction in error compared to both individual answers and
traditional WoC aggregation in photoshop recognition test

Real/fake faces test
Forecast method   Brier score   Error vs slider-swarm (p-value)
Individual        0.218         23.9% greater error (4.1 × 10^−9)
WoC               0.191         13.6% greater error (3.8 × 10^−4)
Slider-swarm      0.165         n/a

Fig. 8. Results of bootstrap analysis for mean Brier score across three categories: individual,
WoC, and slider-swarm. Slider-swarm achieved a significant reduction in error compared to both
individuals and WoC aggregation.

Finally, in the 2022 Academy Awards Test, individuals again reduced their mean
Brier scores from 0.211 during the Personal Deliberation phase to 0.184 during the
Group Deliberation phase, a 13% error reduction. Moreover, the group forecast Brier
score was reduced by 11.1% (p = 0.038) using slider-swarm, from 0.171 (WoC) to 0.152
(slider-swarm). A bootstrap analysis is shown in Fig. 9 and these results are tabulated in
Table 5.
In total, all the real-world experiments revealed slider-swarms to be the most accurate
method of group forecasting, significantly reducing the group’s Brier scores across
multiple datasets. As outlined in Table 6 and depicted in Fig. 10, we find that slider-swarms

Table 5. Slider-swarm offers reduction in error compared to both individual answers and
traditional WoC aggregation in predicting results of the 2022 Academy Awards.

2022 Academy Awards
Forecast method   Brier score   Error vs slider-swarm (p-value)
Individual        0.211         27.0% greater error (4.6 × 10^−5)
WoC               0.171         11.1% greater error (0.038)
Slider-swarm      0.152         n/a

Fig. 9. Results of bootstrap analysis for mean Brier score across three categories: individual, WoC,
and slider-swarm. Slider-swarm achieved a reduction in error compared to both individuals and
WoC aggregation.

produce significant reductions in Brier score not only in each experiment in isolation,
but also when all data is combined across multiple real-world scenarios.

Table 6. Slider-swarms achieve lowest error rate across real-world forecasting tests

Brier scores in real-world tests
                       Individual   WoC     Slider-swarm   Error reduction: slider-swarm vs. WoC (p-value)
Smile test             0.237        0.211   0.191          9.7% (2.5 × 10^−3)
Faces test             0.218        0.191   0.165          14% (3.8 × 10^−4)
2022 Academy Awards    0.211        0.171   0.152          11% (0.038)
Aggregate              0.225        0.197   0.175          11% (1.9 × 10^−6)

Fig. 10. Results of bootstrap analysis for mean Brier score across all real-world experiments in
three categories: individual, WoC, and slider-swarm. With all datasets combined, slider-swarm
achieved a significant reduction in error compared to both individuals and WoC aggregation.

6 Conclusion
In this paper we introduced a novel ASI method called slider-swarms for collaborative
probabilistic forecasting in networked groups and showed that this method improves
group accuracy on a challenging limited-information forecasting task by over 10% as
compared to a traditional Wisdom of the Crowd aggregation (p < 0.001). We further
showed this improvement was not limited to questions where the majority was more
confident than the minority: slider-swarms also produced better probabilistic
forecasts on questions where a 40% minority was more confident than a 60% majority, a
much harder domain, although this result was not statistically significant.
To explain how the slider-swarm allows groups to improve their collective accuracy,
we showed that the more confident individuals in slider-swarms tend to concede their
position less frequently than lower-confidence individuals. This means that individuals
in this system are not just reporting their confidence level at the outset, but instead
are actively adjusting their beliefs in real time in response to the displayed beliefs and
behaviors of other individuals in the group, enabling the swarming system to produce
collective answers that the group can better agree upon.
We then examined the performance of slider-swarms on three real-world forecasting
tasks: the Smile Test, the Real/Fake Faces Test, and forecasting the winners of the 2022
Academy Awards. We observed significant decreases in group error, as indicated by
superior Brier scores on each of these forecasts, when the groups used slider-swarms
as compared to a standard Wisdom of the Crowd aggregation. The results showed
between 9.7% and 14% superior accuracy on each task when using slider-swarms, and
a statistically significant 11% increase in accuracy overall.
These initial results suggest that the slider-swarm method is a viable and effective
tool for amplifying groupwise forecasting accuracy across a range of conditions, and that
slider-swarm can be successfully applied to real-world forecasting tasks to significantly
reduce group forecasting errors.

References
1. De Condorcet, N.: Essai sur l’application de l’analyse à la probabilité des décisions rendues
à la pluralité des voix. Cambridge University Press, Cambridge (2014)
2. Boland, P.J.: Majority systems and the Condorcet Jury theorem. Statistician 38, 181 (1989).
https://doi.org/10.2307/2348873
3. Larrick, R.P., Soll, J.B.: Intuitions about combining opinions: misappreciation of the averaging
principle. Manag. Sci. 52(1), 111–127 (2006)
4. Rosenberg, L.: Artificial Swarm Intelligence, a human-in-the-loop approach to A.I. In: Pro-
ceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI 2016), Phoenix,
Arizona, pp. 4381–4382. AAAI Press (2016)
5. Rosenberg, L.: Human Swarms, a real-time method for collective intelligence. In: Proceedings
of the European Conference on Artificial Life 2015, ECAL 2015, pp. 658–659. MIT Press,
York (2015). ISBN 978-0-262-33027-5
6. Halabi, S., et al.: Radiology SWARM: novel crowdsourcing tool for CheXNet algorithm
validation. In: SiiM Conference on Machine Intelligence in Medical Imaging (2018)

7. Rosenberg, L., Willcox, G., Halabi, S., Lungren, M., Baltaxe, D., Lyons, M.: Artificial swarm
intelligence employed to amplify diagnostic accuracy in radiology. In: 2018 IEEE 9th Annual
Information Technology, Electronics and Mobile Communication Conference (IEMCON),
Vancouver, BC (2018)
8. Befort, K., Baltaxe, D., Proffitt, C., Durbin, D.: Artificial swarm intelligence technology
enables better subjective rating judgment in pilots compared to traditional data collection
methods. Proc. Hum. Factors Ergon. Soc. Ann. Meet. 62(1), 2033–2036 (2018)
9. Askay, D., Metcalf, L., Rosenberg, L., Willcox, D.: Enhancing group social perceptive-
ness through a swarm-based decision-making platform. In: Proceedings of 52nd Hawaii
International Conference on System Sciences (HICSS-52). IEEE (2019)
10. Rosenberg, L., Willcox, G.: Artificial swarm intelligence. In: Bi, Y., Bhatia, R., Kapoor, S.
(eds.) IntelliSys 2019. AISC, vol. 1037, pp. 1054–1070. Springer, Cham (2020). https://doi.
org/10.1007/978-3-030-29516-5_79
11. Metcalf, L., Askay, D.A., Rosenberg, L.B.: Keeping humans in the loop: pooling knowledge
through artificial swarm intelligence to improve business decision making. Calif. Manag. Rev.
61(4), 84–109 (2019)
12. Rosenberg, L., Pescetelli, N., Willcox, G.: Artificial Swarm Intelligence amplifies accuracy
when predicting financial markets. In: 2017 IEEE 8th Annual Ubiquitous Computing, Elec-
tronics and Mobile Communication Conference (UEMCON), New York City, NY, pp. 58–62
(2017)
13. Willcox, G., Rosenberg, L., Schumann, H.: Group sales forecasting, polls vs. swarms. In:
Arai, K., Bhatia, R., Kapoor, S. (eds.) FTC 2019. AISC, vol. 1069, pp. 46–55. Springer,
Cham (2020). https://doi.org/10.1007/978-3-030-32520-6_5
14. Schumann, H., Willcox, G., Rosenberg, L., Pescetelli, N.: “Human Swarming” amplifies
accuracy and ROI when forecasting financial markets. In: 2019 IEEE International Conference
on Humanized Computing and Communication (HCC), Laguna Hills, CA, USA, pp. 77–82
(2019). https://doi.org/10.1109/HCC46620.2019.00019
15. Willcox, G., Rosenberg, L.: Short paper: swarm intelligence amplifies the IQ of collaborating
teams. In: 2019 Second International Conference on Artificial Intelligence for Industries
(AI4I), pp. 111–114 (2019). https://doi.org/10.1109/AI4I46381.2019.00036
16. Bernstein, M.J., Young, S.G., Brown, C.M., Sacco, D.F., Claypool, H.M.: Adaptive responses
to social exclusion: social rejection improves detection of real and fake smiles. Psychol. Sci.
19(10), 981–983 (2008). https://doi.org/10.1111/j.1467-9280.2008.02187.x
17. CIPLAB, Yonsei University: Real and Fake Face Detection (2019). https://www.kaggle.com/
datasets/ciplab/real-and-fake-face-detection. Accessed 26 Apr 2022
Learning to Solve Sequential Planning
Problems Without Rewards

Chris Robinson(B)

6000 Noah, Louisville, KY 40258, USA


[email protected]

Abstract. In this paper we present an algorithm, the Goal Agnostic


Planner (GAP), which combines elements of Reinforcement Learning
(RL) and Markov Decision Processes (MDPs) into an elegant, effective
system for learning to solve sequential problems. The GAP algorithm
does not require the design of either an explicit world model or a reward
function to drive policy determination, and is capable of operating on
both MDP and RL domain problems. The construction of the GAP lends
itself to several analytic guarantees such as policy optimality, exponential
goal achievement rates, reciprocal learning rates, measurable robustness
to error, and explicit convergence conditions for abstracted states.
Empirical results confirm these predictions, demonstrate effectiveness over a
wide range of domains, and show that the GAP algorithm performance
is an order of magnitude faster than standard reinforcement learning
and produces plans of equal quality to MDPs, without requiring design
of reward functions.

Keywords: Sequential planning · Unsupervised learning · Agents

1 Introduction
This paper presents an algorithm created specifically to solve arbitrary planning
problems without requiring a reward function or pre-defined transition
model. The impetus for such agents is based on two principles: (1) the idea
that crafting reward functions introduces a risk of bias; and (2) the idea that
objective-based learning models reduce the potential for knowledge re-use.
The Goal Agnostic Planning (GAP) algorithm applies these principles by
combining an MDP-like planner with an RL-based learning mechanism, integrated
with a composite datastructure combining a hypergraph, pointer arrays,
and linked lists. This datastructure is populated and updated throughout
learning so that Dijkstra’s algorithm may be used to find an optimal maximum
probability path between an observed current state and any reachable goal state.
GAP agents therefore require no modification to re-use already learned
domain knowledge when presented with an alternate goal state, and they do not
require manual construction of a transition graph or reward function.
With thanks to Joshua and Ellen Lancaster.
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 393–413, 2023.
https://doi.org/10.1007/978-3-031-18461-1_27
394 C. Robinson

In addition to achieving the objective of producing learning agents with the
previously described properties, we are also able to analytically prove several
valuable features of the algorithm, including exponentially bounded rates of
goal convergence; learning rates proportional to the reciprocal of epoch number;
polynomial computational complexity in both space and time; and explicit
conditions and performance impacts for agents learning under state abstractions
or uncertainty.
To demonstrate the effectiveness of GAP agents in practice, we explore three
example domains, demonstrating the accuracy of the analytic predictions. On
these problem domains, we evaluate the performance of the GAP agent against
a performance baseline set by an MDP planner and a learning baseline set by
Q-Learning.

1.1 Related Work and Literature Review


Planning for sequential problems is a very well-studied topic, and naturally
invites the question as to why another approach is necessary. In this section, we
discuss well-established methods for solving sequential problems, and highlight
limits and restrictions of these methods addressed by the GAP algorithm. Our
objective is to demonstrate that extant methods have design-related limitations
that are intimately tied to world design and reward function selection.
RL agents operate in a state/action framework, learning a quality function
for maximizing prospective rewards. A common through-line for reinforcement-based
systems is the necessity of reward function design for convergence. Performance
of an RL agent is predicated on the quality of this function, a relationship
explored in detail in [14]. An additional limitation, expressed both in [9] and [5],
is goal orientation. Reward functions are constructed in relation to a specific
objective, so training applies only to that goal. We remove both these limits
entirely by separating learning from both rewards and specific goals.
While considered more flexible than most machine learning systems, MDPs
are reliant on careful modeling of the system in question. [4] discusses the
construction of action sets for MDP formulation as a design methodology, illustrating
the presence of implicit optimality conditions. The author in [17] investigates
the use of reinforcement learning to supplant reward functions, showing
that reward design affects the success of the planner. The author in [16] discusses
problems associated with identifying probabilities for goal achievement
and reaching dead ends in the MaxProb problem. We are able to explicitly derive
these probabilities for the GAP agent. A special case of Markov Decision
Processes is the Stochastic Shortest Path (SSP) problem, most notably investigated
in [1]. SSP problems seek to identify an optimal policy for stochastically varying
costs, similar to MaxProb. The author in [6] discusses outstanding issues with
policy definition related to the implementation of SSP policy determination. We
find a polynomial-time, globally optimal solution in this special case of the SSP.
Efforts towards integrating learning and planning domains to improve
performance have seen success as well. [2] presents Graphplan, which operates on a
task graph, extended to probabilistic planning in [3]. However, they acknowledge
GAP Algorithm 395

limitations of overlapping action results, an issue the GAP algorithm resolves.
[8] presents a probability-based belief model in their Abstraction Augmentation,
a notion we extend and adapt to state abstraction as a transform. [12] combines
reinforcement learning with search-based planning, implementing their DARLING
algorithm. However, they still implement reward-based training and focus
on semantic planning, losing the computational benefits of graph-based systems.
We advance these concepts by unifying their properties and reducing detriments
by eliminating design-dependence.
Use of model simplification and abstraction to reduce state space size and
improve planning has also been investigated. The author in [7] evaluates state
abstractions as applied to tree search, similar to our state-mixing interpretation
of abstractions but lacking the analytic power of our model. The authors in [15]
evaluate model reductions for automated planning. They note a goal-state mapping
condition analogous to our convergence condition for abstracted domains,
and use connected component analysis to identify the presence of dead ends, a
method we simplify through our trap net analysis. In [11], the authors develop
a system to learn abstractions in a probabilistic planning domain. Their agents,
however, are designed for symbolic planning rather than graphical planning, and
are constructed in the standard reward-based framework for MDPs, lacking the
design-agnostic elements our design implements.
While success has been seen with these methods, we can see that there are
still inherent limits imposed by world construction and design of reward functions.
We address these problems in tandem by modeling the planning task as a
lower-order combinatorial problem operating on a 3-dimensional datastructure,
confining the space complexity to O(n³) and the time complexity to O(n²) using
Dijkstra’s algorithm. Our hypergraph data structure allows for planning of any
task within a domain and learns a representative model by observation without
human design influence. It combines aspects of prior work approaching these
outstanding problems in a coherent single system: removing the reward function
and implementing non-search planning using the maximally probable path as
the action policy.
The rest of the paper is organized as follows: In the next section we describe
the datastructures and algorithms which comprise the GAP system. In Sect. 3,
we present a thorough analysis of the optimality, efficacy, efficiency, learning, and
robustness to perturbations of GAP agents. Section 4 presents empirical results
on three different problem domains, including complex hierarchical spaces, and
performance comparisons to Q-Learning and Markov Decision Process planning
to establish performance baselines. Finally, Sect. 5 concludes the paper, summarizing
results and discussing limitations and future goals.

2 GAP Algorithm
In this section, we discuss the construction and operation of the GAP algorithm,
including the composite augmented hypergraph datastructure, use of Dijkstra’s
algorithm in the context of this algorithm, and the learning mechanism employed
in training.

Fig. 1. Array/Linked List Showing the Indexed Cell Locations within the Array, Containing
Pointers to the Corresponding Elements in the Sorted Linked List, which itself
Contains the Data Component Associated with each Array Cell, and is Organized into
Columns Containing the Same Number of Observed Instances.

Definitions: A GAP ‘agent’ is the portion of the system capable of making
decisions and affecting the world. It is defined by the capability to register a set
of perceptual states (denoted S) and take a set of actions (A), which can impact
the world and possibly alter the state. At any given point in time k, the agent
can observe an initial state, s_i ∈ S, and subsequently take an action a_l ∈ A,
resulting in a state change to a final state s_f (note that s_f may be identical to
s_i). Such a series is henceforth referred to as an occasion: o_k = a_l(s_i) → s_f, as
distinguished from a more traditional state/action pair.
We implement a learning system using these, recorded within a 3-dimensional
structure of size |S| × |S| × |A|, INC, in which the location INC[i, j, l] contains the
number of times occasion a_l(s_i) → s_j has been observed. Each action may have
multiple results, hence the construction of a hypergraph. From sums along slices
within this array the relative probability of differing occasions can be computed.
This structural change allows us to contend with the challenge of overlapping
action results identified in [3].
An ordered series of occasions we term a sequence. Solutions produced by
the planning algorithm are sequences, represented as two ordered lists
σ_og = [{s_1, s_2, ..., s_g}, {a_1, a_2, ..., a_{g−1}}], where a_1(s_1) → s_2, a_2(s_2) → s_3, and so on,
and we define the joint probability for the whole sequence:

    P(σ) = ∏_{o_k ∈ σ} P(a_l(s_i) → s_j)    (1)

However, as s_i is fixed but a_l and s_f are not, this presents two possibilities for
probability models, one referenced against resultant states and one referenced
against actions taken. In the first model, we elect to choose actions based on
the most probable outcome of taking an action from a given state; in the latter, the
most likely action to cause a transition:

    P(a_l(s_i) → s_j) = INC[s_i, s_j, a_l] / Σ_{∀s} INC[s_i, s, a_l] ,  or
    P(a_l(s_i) → s_j) = INC[s_i, s_j, a_l] / Σ_{∀a} INC[s_i, s_j, a]    (2)
GAP Algorithm 397

In the qualitative sense, the a priori probability model selects actions most
likely to result in goal achievement, and the a posteriori policy selects state
changes most likely to reach the goal. This model is simple, but allows an elegant
learning system to be designed around it.
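A minimal sketch of the INC structure and the two probability models of Eq. 2. The state/action sizes and the recorded occasions below are made up for illustration; only the counting and normalization scheme follows the text:

```python
import numpy as np

S, A = 4, 2                   # hypothetical state/action space sizes
INC = np.zeros((S, S, A))     # INC[i, j, l]: times occasion a_l(s_i) -> s_j seen

def record(si, sf, al):
    """Record one observed occasion a_l(s_i) -> s_f."""
    INC[si, sf, al] += 1

def p_apriori(si, sj, al):
    """Eq. 2, first form: probability referenced against outcomes of action a_l."""
    denom = INC[si, :, al].sum()
    return INC[si, sj, al] / denom if denom else 0.0

def p_aposteriori(si, sj, al):
    """Eq. 2, second form: probability referenced against actions causing s_i -> s_j."""
    denom = INC[si, sj, :].sum()
    return INC[si, sj, al] / denom if denom else 0.0

# Action 0 from state 0 reaches state 1 on 3 of 4 tries; action 1 reaches it twice.
for _ in range(3):
    record(0, 1, 0)
record(0, 2, 0)
record(0, 1, 1)
record(0, 1, 1)
print(p_apriori(0, 1, 0), p_aposteriori(0, 1, 0))   # 0.75 0.6
```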
It is possible to define subgraphs (denoted AFI) embedded within the hypergraph
which contain all edges of the optimal solution. One such subgraph
contains transitions (s_i, s_j), stored as an |S| × |S| × 2 array. In this array, the
component ⟨s_i, s_j, 0⟩ is the maximum probability associated with the s_i → s_j
transition, and component ⟨s_i, s_j, 1⟩ is the index of the corresponding action:

    AFI[s_i, s_j, 0] = INC[i, j, argmax_l{P(a_l(s_i) → s_j)}] / Σ_{∀s} INC[s_i, s, a_l]

    AFI[s_i, s_j, 1] = argmax_l{P(a_l(s_i) → s_j)}
The second contains maximally likely final states with respect to actions
taken. This graph can be represented on an |S| × |A| × 2 sized array, in which
members at ⟨s_i, a_l, 0⟩ represent the probability associated with the most likely
result of taking action a_l from state s_i, and ⟨s_i, a_l, 1⟩ represents the index
of s_f. Each of these represents a traditional graph, which we use for efficient
computation of solution sequences.
We represent these useful graphs with a datastructure that combines an array
with a parallel linked list, as illustrated in Fig. 1. Each element in the array is
a pointer to a member of the linked list. In this way, the linked list need not be
searched for member elements, and ordering of the list can be maintained using
single operations on the linked list members. We combine the array/linked lists
representing the maximal likelihood subgraphs with the full INC hypergraph to
form our composite datastructure for learning, in which each cell of the INC
array points to the linked list objects in the AFI objects corresponding to that
occasion, making for efficient addressing and updating of the learned relationships,
shown in Fig. 2.
Because the linked list members are sorted by increment counts, each link
may only move ahead one link in the list at any time. Using the convenient
sorting and addressing features of the array/linked list, each update to the
optimal subgraph can be incorporated using O(1) steps, as detailed in Algorithm 1.
The maximum likelihood subgraph is thus perpetually embedded within AFI.
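The O(1)-per-update property can be sketched with count buckets, the same idea as the columns of equal counts in Fig. 1: because a count only ever grows by one, promotion is a move from the count-c group to the count-(c+1) group, with no search and no re-sort. This is an illustrative variant, not the paper's exact pointer-array/linked-list implementation:

```python
from collections import defaultdict

class SortedCounts:
    """Occasion counts kept effectively sorted via per-count buckets."""
    def __init__(self):
        self.count = defaultdict(int)   # key -> observed count
        self.bucket = defaultdict(set)  # count -> keys with that count
        self.max_count = 0

    def increment(self, key):
        """O(1) update: move key from its count bucket to the next one up."""
        c = self.count[key]
        self.bucket[c].discard(key)
        self.count[key] = c + 1
        self.bucket[c + 1].add(key)
        self.max_count = max(self.max_count, c + 1)

    def argmax(self):
        """Any key holding the current maximum count."""
        return next(iter(self.bucket[self.max_count]))

sc = SortedCounts()
for key in ['a', 'b', 'a', 'a', 'c', 'b']:
    sc.increment(key)
print(sc.argmax())   # prints: a
```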
Using this embedded graph, we can infer a maximally likely sequence between
any state s_i and any goal state s_g using Dijkstra’s algorithm to find maximum
probability subtrees rooted at s_i. In Algorithm 2, we formalize this algorithm.
Due to the structure of the augmented hypergraph, the computational efficiency
of this method is O(|S|²) for the a priori probability model and O(|S| · |A|) for
the a posteriori model.
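Planning over the embedded subgraph thus reduces to a standard shortest-path computation: maximizing a product of edge probabilities in (0, 1] is equivalent to running Dijkstra's algorithm on negative-log edge weights. A hedged sketch, assuming a plain adjacency-list input rather than the AFI structure itself:

```python
import heapq
from math import exp, log

def max_prob_path(edges, start, goal):
    """Maximum-probability path via Dijkstra on -log(p) weights.
    edges: {u: [(v, p), ...]} with p in (0, 1]; returns (path, probability)."""
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float('inf')):
            continue                      # stale heap entry
        for v, p in edges.get(u, []):
            nd = d - log(p)
            if nd < dist.get(v, float('inf')):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        return None, 0.0                  # goal unreachable (a dead end)
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1], exp(-dist[goal])

# Toy graph with two routes to goal g: via a (0.9 * 0.8) or via b (0.5 * 1.0).
edges = {'s': [('a', 0.9), ('b', 0.5)],
         'a': [('g', 0.8)],
         'b': [('g', 1.0)]}
path, p = max_prob_path(edges, 's', 'g')
print(path, round(p, 3))   # ['s', 'a', 'g'] 0.72
```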
It is worth noting at this point that due to the graph-based nature of the
planning and the learning mechanism, conventional propositional aspects of
planning algorithms such as add/delete lists and preconditions are redundant:
GAP agents learn these relationships autonomously.

Fig. 2. Augmented Hypergraph Data Structure: A 3-Dimensional Array, Each Cell of
which Contains Pointers to Members of Two Array/Linked List Objects, Each Containing
a Pointer to the Corresponding Sorted List Associated with that State/Action
or State/State Pair, Allowing for Immediate Retrieval.

3 Analytical Evaluation
Phrased in terms of Markov Decision Processes, the GAP algorithm produces
a policy π(s_i) such that the action taken at any step is the first action in the
maximal probability sequence between s_i and s_g, or:

    π(s_i) = ( argmax_{σ_ig} ∏_{o_j ∈ σ_ig} P(o_j) )_{k=0}    (3)

the first action in the most-probable sequence σ_ig from state i to state g. We
can show that this policy is globally optimal, using the known optimality of
Dijkstra’s algorithm and the properties of the AFI array/linked lists.
Theorem 1. The policy illustrated by Eq. 3 produces the maximum likelihood
sequence for achieving a given goal state s_g.
Proof. We proceed by contradiction. Presume that there exists an optimal solution
sequence σ_og which contains an occasion not allocated in AFI. By either of
Eq. 2, AFI must be sorted in descending order. Because the probabilities are
in [0, 1], Eq. 1 is monotonically decreasing. The first node in the sequence will
have the maximum probability edge of all those leading from s_i to s_{i+1}, and thus
any alternate path to this node is bounded by that single probability. Because
probabilities are monotonically decreasing, the AFI-derived sequence will have
probability greater than or equal to that of the assumed solution, and thus either
σ_og is not optimal, or both paths are.
To analyze the behavior of GAP agents, we can use Markov process analysis
techniques, with Eq. 3 as the chosen policy. We build the tree of maximal
probability paths rooted in an arbitrary goal state, derived from AFI, and call it T_P(g),
illustrated in Fig. 3.

Algorithm 1. Linked List Subgraph Maintenance
function MaintainLL(INC, (si, sf, al))
    occasionLink = INC[si, sf, al, 0]
    if occasionLink.prev == None then
        return 1
    if occasionLink.prev.count > occasionLink.count then
        return 1
    if occasionLink.prev.count == occasionLink.count then
        occasionLink.prev = occasionLink.prev.prev
        occasionLink.post = occasionLink.current
        occasionLink.current = occasionLink.prev.current

Algorithm 2. Sequence Inference Algorithm
function SequenceInfer(AFI, (si, sg))
    bound ← si.edges
    perm ← [(si, 1.0)]
    edges ← []
    while sg ∉ perm do
        jointProb(sj) := perm[bound[j]][1] · bound[j].P
        smaxP ← argmax_{sj}(jointProb)
        perm ← (smaxP, jointProb(smaxP))
        bound = (bound ∪ smaxP.edges) − [e | e(1) = smaxP]
        edges ← bound[sj]
    solution = [edges[sg]]
    while solution[−1][0] ≠ si do
        solution ← edges[solution[−1][0]]
    return solution

Because each action is assumed to be non-deterministic, it will
include probabilities for arriving at non-intended states as well, a_l(s_i) :
{(s_j1 | P_j1), (s_j2 | P_j2), ...}, used to construct stochastic vectors t_i = AFI[s_i, π(s_i), :].
For s_g the operation of the agent effectively terminates, so we adopt the
absorbing-state method discussed in [16]: t_g = [0 0 ... 1 ... 0]^T. From these vectors, we
produce the transition table P_g:

    P_g = [ t_0  t_1  ...  t_g  ...  t_j ]    (4)

We then take the state distribution s_k, where k is step time. The state occupation
distribution as a function of time is given by s_k = P_g^k · s_0, which represents
the stochastic vector of probable states evolved from s_0.
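The evolution s_k = P_g^k · s_0 can be checked numerically. The sketch below uses a hypothetical 3-state chain (columns are the stochastic vectors t_i, with goal state 2 made self-absorbing as in Eq. 4) and confirms that the goal-occupation probability grows monotonically with k:

```python
import numpy as np

# Hypothetical transition table under a fixed policy; column i is t_i.
Pg = np.array([[0.1, 0.2, 0.0],
               [0.6, 0.1, 0.0],
               [0.3, 0.7, 1.0]])

def goal_prob(Pg, start, goal, k):
    """P(goal occupied at step k | start): evolve the indicator distribution
    s_0 by P_g^k and read off the goal component."""
    s0 = np.zeros(Pg.shape[0])
    s0[start] = 1.0
    sk = np.linalg.matrix_power(Pg, k) @ s0
    return float(sk[goal])

probs = [goal_prob(Pg, 0, 2, k) for k in range(1, 6)]
print([round(p, 3) for p in probs])   # strictly increasing toward 1
```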

Fig. 3. Conversion of Maximal Probability Path Tree to Markov Chain. Highlighting
States s_2, s_1, and g, with Maximal Probability Actions a_2 and a_1 Linking them,
and Additional Action-Caused Probability Links among Themselves and Other States
Shown in the Markov Network.

Presume that we order the stochastic vectors comprising P_g such that s_g
corresponds to the last element:

    P_g = [ T_s   0
            t_g   1 ]

where T_s is the transition matrix internal to only non-goal states, t_g is the
vector of transition probabilities from {s_i ∈ S | i ≠ g}, and the final column is
the stochastic vector of s_g. Then:

    P_g^k = [ T_s^k                              0
              t_g · Σ_{l=1}^{k−1} T_s^l + t_g    1 ]    (5)
From this, we can see that the probability of reaching the goal state at step
k is given by P(s_i → s_g | k) = t_g · Σ_{m=1}^{k−1} T_s^m + t_g. Because T_s is strictly positive
definite, T_s^k is as well, and consequently P(s_i → s_g | k) is monotonically increasing
in k, so P_g has no steady state. Thus s_g is an attractor state, as it is identical
to its own start-state distribution, and no other state can be an attractor unless
there is a zero probability of transitioning out from that state.
It is also possible for state sequences which are not attractor states, but
which present no path to the goal once reached, to be non-steady attractors (as
illustrated in Fig. 4). We define a subset of states, tnet, to represent such
trapped states. We can re-cast P_g in the following form, noting that
states in tnet can transfer between one another, but not to other states not in
tnet:
GAP Algorithm 401

$$P_g = \begin{pmatrix} T_{s\notin tnet} & 0 & 0 \\ T_{s\in tnet} & T_{tnet} & 0 \\ t_{g|i\notin tnet} & 0 & 1 \end{pmatrix};\qquad P_g^k = \begin{pmatrix} T_{s\notin tnet}^k & 0 & 0 \\ \sum_{j=0}^{k-1} T_{tnet}^j\, T_{s\in tnet}\, T_{s\notin tnet}^{k-1-j} & T_{tnet}^k & 0 \\ \sum_{j=0}^{k-1} t_{g|i\notin tnet}\, T_{s\notin tnet}^j & 0 & 1 \end{pmatrix} \tag{6}$$

Fig. 4. Illustration of a subgraph segment from which no path to the goal exists, yet which contains multiple transition state cycles. Such regions can present non-steady-state attractors from which the agent cannot progress to the goal, hence being considered 'trapped' in the subgraph.

From which we can see that $P(s_{i\in tnet} \to s_g \mid k) = 0$ for all $k$. Further, we will define a system parameter $L_{max}$: the longest minimum-length path between any two states. For any reachable state $s_i$, $P(s_i \to s_g \mid L_{max}) > 0$. In any graph the maximum path length is $|S|$, so it suffices to check $P_g^{|S|}$: any state $i$ for which $P(s_g \mid s_i, k=|S|) = 0$ is necessarily a member of a trap net. We can then use Eq. 6 to determine the probability at any point in time that the system has become stranded in a trap net:

$$P(s_t \in tnet \mid k) = \mathbf{1}_{1\times|tnet|} \cdot \begin{pmatrix} \sum_{j=0}^{k-1} T_{tnet}^j\, T_{s\in tnet}\, T_{s\notin tnet}^{k-1-j} & T_{tnet}^k & 0 \end{pmatrix} \cdot s_t \tag{7}$$
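The trap-net test described above, checking the goal entries of $P_g^{|S|}$, can be sketched in a few lines of pure Python; the four-state chain below is illustrative:

```python
def mat_mul(A, B):
    # Column-convention product: column j of (A @ B) is A applied to
    # column j of B, i.e. (A @ B)[j][r] = sum_i A[i][r] * B[j][i].
    n = len(A)
    return [[sum(A[i][r] * B[j][i] for i in range(n)) for r in range(n)]
            for j in range(n)]

# States: s0 and s1 can reach the goal (index 3); s2 is a self-looping
# trap.  Columns are stochastic outcome vectors as in Eq. 4.
Pg = [
    [0.2, 0.5, 0.1, 0.2],   # from s0
    [0.3, 0.2, 0.0, 0.5],   # from s1
    [0.0, 0.0, 1.0, 0.0],   # from s2: trap, never leaves
    [0.0, 0.0, 0.0, 1.0],   # goal: absorbing
]

n = len(Pg)
P = Pg
for _ in range(n - 1):       # P = Pg^{|S|}
    P = mat_mul(P, Pg)

# Any state whose goal entry of Pg^{|S|} is zero lies in a trap net.
trap_net = [i for i in range(n - 1) if P[i][n - 1] == 0.0]
print(trap_net)              # → [2]
```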

Attractor states and trap nets together represent all the 'dead ends' for a GAP algorithm, analytically identifiable from the form of $P_g$, making dead-end removal algorithms, such as FRET [10] or reachability analysis [15], unnecessary for GAP agents. This is a convergent behavior model in which the long-term behavior of the agent can be statistically parametrized, thus fully defining the goal-convergent behavior of the agent and resolving the problem discussed in [16].
Further, we may examine GAP behavior in terms of the $L_1$ norm of $T_s^k$. Because all columns are stochastic, the maximum absolute column sum is paired with the minimum-probability single-step goal transition, thus:

$$\|T_s^k\|_1 \leq \left(1 - \min_{\forall a,i} P_g[s_i, g, a]\right)^k$$

Often, $\min_{\forall a,i} P_g[s_i, g, a] = 0$; however, at $k = L_{max}$, all states from which the goal is reachable have a non-zero transition probability:

$$\|T_s^k\|_1 \begin{cases} = 1 & k < L_{max} \\ \leq \|T_s^{L_{max}}\|_1^{\,k-L_{max}} & \text{otherwise} \end{cases} \tag{8}$$

Allowing us to calculate threshold goal-achievement rates without projecting the system forward arbitrarily. For some minimum transition probability threshold $P_{thresh}$:

$$k_p \geq \frac{\log(1 - P_{thresh})}{\log(\|T_s^{L_{max}}\|_1)} + L_{max} \tag{9}$$

Rewriting the relation as $1 - \|T_s^{L_{max}}\|_1^{\,k_p - L_{max}} \geq P_{thresh}$, we can see that the probability of transition to goal is bounded by an exponential growth rate: the minimum probability threshold reached is limited by an exponential asymptotic function approaching unity.
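Equation 9 is directly computable; the norm, threshold, and $L_{max}$ values below are illustrative, not taken from the experiments:

```python
import math

def k_threshold(p_thresh, l_max, ts_lmax_norm):
    """Minimum step count k_p guaranteeing goal probability >= p_thresh,
    per Eq. 9: k_p >= log(1 - p_thresh)/log(||Ts^Lmax||_1) + Lmax."""
    return math.log(1.0 - p_thresh) / math.log(ts_lmax_norm) + l_max

# Illustrative numbers: with ||Ts^Lmax||_1 = 0.8 and Lmax = 5, a 99%
# goal-arrival guarantee needs k_p ≈ 26 steps.
k_p = k_threshold(0.99, 5, 0.8)
print(math.ceil(k_p))   # → 26
```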
Perturbation Model. To examine the effect of error and state abstractions, we use a probability transform acting on transition values. Consider a mapping $\alpha(\cdot)$ which transforms a state space $S$ into a more compact space $\alpha(S)$. This probabilistic mapping model is similar to that used by [8], but benefits from the structure of $AFI$. Presume that we have an $|\alpha| \times |S|$ transformation matrix $\alpha_T$, which contains in each cell $\alpha[j,i]$ the probability $P(\alpha(s_i) = \alpha(j))$ that the $i$-th 'true' state is mapped onto the $j$-th abstracted state. For a state vector $s_t$, the corresponding abstracted probability vector is $s_{\alpha t} = \alpha_T \cdot s_t$, or, for general time propagation: $s_{\alpha t} = \alpha_T \cdot P_g^t \cdot s_0$.
Given a learned AFI subgraph for the abstracted space, $P_\alpha$, we also have $s_{\alpha t} = P_\alpha^t s_{\alpha 0}$, and since $s_{\alpha 0} = \alpha_T s_0$ we can construct a relation from the equivalence $\alpha_T P_g^t = P_\alpha^t \alpha_T$:

$$P_\alpha^t = \alpha_T P_g^t \alpha_T^+ \qquad\qquad P_g^t = \alpha_T^+ P_\alpha^t \alpha_T$$

where $\alpha_T^+$ is the pseudoinverse of $\alpha_T$. It is notable that this transform does not allow for conversion into the true state space, even if $\alpha_T$ is known perfectly, as $\alpha_T^+$ cannot unmix states which are combined. Recognizing that both arrays must be stochastic transforms, due to the action on $s$:

$$\alpha_T = \begin{pmatrix} \alpha_{Ts} & \alpha_{Tg} \\ 1-\mathbf{1}\alpha_{Ts} & 1-\mathbf{1}\alpha_{Tg} \end{pmatrix}; \qquad \alpha_T^+ = \begin{pmatrix} \alpha_{Ts}^+ & \alpha_{Tg}^+ \\ 1-\mathbf{1}\alpha_{Ts}^+ & 1-\mathbf{1}\alpha_{Tg}^+ \end{pmatrix}$$
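As an illustration of why $\alpha_T^+$ cannot unmix combined states, consider a hard (0/1) aggregation abstraction. For such a matrix with orthogonal rows the Moore-Penrose pseudoinverse reduces to the transpose with rows rescaled by group size (a standard identity, not specific to this paper); pulling an abstract distribution back spreads mass uniformly over the merged true states:

```python
def pinv_aggregation(alpha):
    """Moore-Penrose pseudoinverse of a 0/1 aggregation matrix with
    orthogonal rows: A+ = A^T . diag(1/row_sums)."""
    rows, cols = len(alpha), len(alpha[0])
    sums = [sum(r) for r in alpha]
    return [[alpha[j][i] / sums[j] for j in range(rows)] for i in range(cols)]

# 2 abstract states covering 3 true states: {s0, s1} -> a0, {s2} -> a1.
alpha = [[1, 1, 0],
         [0, 0, 1]]
alpha_plus = pinv_aggregation(alpha)

# All abstract mass on a0 maps back to a uniform mix over s0 and s1:
# the combined states cannot be unmixed.
s_alpha = [1.0, 0.0]
s_true = [sum(alpha_plus[i][j] * s_alpha[j] for j in range(2))
          for i in range(3)]
print(s_true)   # → [0.5, 0.5, 0.0]
```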

Transforms between the probability spaces allow us to analyze convergence in the abstracted space. We assume that the abstracted model is convergent, and wish to show that the 'true' system will converge as well. Taking Eq. 5, where we annotate $t_{\alpha g}\cdot\sum_{l=1}^{k-1} T_{\alpha s}^l + t_{\alpha g} = V_p$:

$$\begin{pmatrix} T_s^k & 0 \\ P(s_i \to s_g \mid k) & 1 \end{pmatrix} = \begin{pmatrix} \alpha_{Ts}^+ & \alpha_{Tg}^+ \\ 1-\mathbf{1}\alpha_{Ts}^+ & 1-\mathbf{1}\alpha_{Tg}^+ \end{pmatrix} \cdot \begin{pmatrix} T_{\alpha s}^k & 0 \\ V_p & 1 \end{pmatrix} \cdot \begin{pmatrix} \alpha_{Ts} & \alpha_{Tg} \\ 1-\mathbf{1}\alpha_{Ts} & 1-\mathbf{1}\alpha_{Tg} \end{pmatrix}$$

Expanding $P_g^k$ lets us calculate the probability of goal transition in the true space, and using the relations $\mathbf{1}T_{\alpha s}^k = 1 - V_p$ and $\mathbf{1}\alpha_{Tg}^+ = \|\alpha_{Tg}^+\|$:

$$P(s_i \to s_g \mid k) = 1 + \|\alpha_{Tg}^+\|(\mathbf{1}\alpha_{Ts} - 1) - \|\alpha_{Tg}^+\| V_p\, \alpha_{Ts} - \mathbf{1}\alpha_{Ts}^+ T_{\alpha s}^k\, \alpha_{Ts}$$

We presumed that $P_\alpha$ is convergent, and thus we can note the limiting behavior of $T_{\alpha s}^k$ and $V_p$: $\lim_{k\to\infty} V_p = 1$ and $\lim_{k\to\infty} T_{\alpha s}^k = 0$, from which the limiting behavior of $P(s_i \to s_g \mid k)$ can be determined: $\lim_{k\to\infty} P(s_i \to s_g \mid k) = 1 - \|\alpha_{Tg}^+\|_1$.

Convergence of $P_g$ can be expressed as $P(s_i \to s_g \mid k) \to 1$, so:

$$\lim_{k\to\infty} P(s_i \to s_g \mid k) = 1 = 1 - \|\alpha_{Tg}^+\|_1 \;\to\; 0 = \|\alpha_{Tg}^+\| \tag{10}$$

Which shows that the convergence of the true system to the goal, given
convergence of the abstracted state, is predicated on the transform between
the true goal states and the abstracted goal states being onto, analogous to,
but distinct from, the convergence conditions derived in [15], and mirroring the
observability model utilized in [7].
Given this condition on the abstraction function, we can also determine the performance impact of the transform. Beginning with the relation $\|T_s^k\|_1 \leq \|T_s\|_1^k$ for the true state system:

$$\|T_s\|_1^k \geq \|\alpha_{Ts}^+ T_\alpha^k \alpha_{Ts}\|_1 + \|\alpha_{Tg}^+\|_1 \left(1 - \|T_\alpha^k\|_1 \|\alpha_{Ts}\|_1\right)$$

$\|T_\alpha^k\|_1$ and $\|\alpha_{Ts}\|_1$ are strictly in $[0, 1]$, but $\|\alpha_{Ts}^+\|_1$ is not, and so this derivation applies only to $P_\alpha \to P_g$. Convergence of the abstracted model implies convergence of the true model, but not the converse. From this inequality, we can then replicate the prior analysis for the abstracted case:

$$k_{p\alpha} \geq \frac{\log(1 - P_{thresh})}{\log(\|\alpha_{Ts}^+\|_1 \cdot \|T_\alpha^k\|_1 \cdot \|\alpha_{Ts}\|_1)} + L_{max} \tag{11}$$

Which describes how the inclusion of the abstraction modifies the minimum expected time to achieve the goal state. By examining this expression, we can make some inferences about the impact of $\alpha_T$ on convergence performance:

$$\begin{aligned} k_{p\alpha} > k_p &\quad\quad \|\alpha_{Ts}\|_1 \cdot \|\alpha_{Ts}^+\|_1 < 1 \\ k_{p\alpha} \leq k_p &\quad\quad \|\alpha_{Ts}\|_1 \cdot \|\alpha_{Ts}^+\|_1 \geq 1 \end{aligned} \tag{12}$$
We can use the product above as a rough measure of the 'quality' of an abstraction, the degree to which it affects performance, by $Q(\alpha_T) = \frac{1}{\|\alpha_{Ts}\|_1 \cdot \|\alpha_{Ts}^+\|_1}$, so that $Q(\alpha_T)$ is directly correlated to the impact $\alpha_T$ has on performance, resolving the metric problem discussed in [13]. Empirically, we can also approximate this using $k_p$ and the measured $k_{p\alpha}$:

$$\frac{k_{p\alpha} - k_p}{k_p - L_{max}} = \frac{\log(\|\alpha_{Ts}^+\|_1 \cdot \|\alpha_{Ts}\|_1)}{\log(\|T_\alpha^k\|_1)} \;\to\; \|T_\alpha^k\|_1^{\frac{k_p - k_{p\alpha}}{k_p - L_{max}}} = Q(\alpha_T) \tag{13}$$

Using this metric, we can measure the impact of the perturbation model, underwriting the effectiveness of the GAP for operating under an abstraction or uncertainty.
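A small sketch of the two ways of computing the quality measure; the function names and measurement values are ours, chosen for illustration, not taken from the paper's experiments:

```python
def q_direct(norm_alpha, norm_alpha_plus):
    # Q(alpha_T) = 1 / (||alpha_Ts||_1 * ||alpha_Ts+||_1)
    return 1.0 / (norm_alpha * norm_alpha_plus)

def q_empirical(k_p, k_p_alpha, l_max, t_alpha_norm):
    # Eq. 13: Q(alpha_T) ~= ||T_alpha^k||_1 ** ((k_p - k_p_alpha)/(k_p - l_max))
    return t_alpha_norm ** ((k_p - k_p_alpha) / (k_p - l_max))

# Hypothetical measurements: an abstraction that slows convergence
# (k_p_alpha > k_p) lands in the product-below-1 regime of Eq. 12.
q = q_empirical(k_p=20.0, k_p_alpha=26.0, l_max=8.0, t_alpha_norm=0.9)
print(round(q, 3))   # → 1.054
```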
Learning Model. We can model learning an abstraction which becomes more accurate as learning progresses, using a transform starting with the initial assumption of a uniform random distribution: $\alpha_{T1} = \frac{1}{|\alpha|}\cdot\mathbf{1}$ and $\alpha_{T1}^+ = \frac{1}{|S|}\cdot\mathbf{1}$. We can approximate expected learning curves with an amortized update at each step $k$: a single state has been visited $\frac{k}{|S|}$ times, and total counts can be expressed as $\frac{k}{|S|} s_{\alpha i}$. Combining the prior occasions with the new, for $\frac{k+1}{|S|}$ steps gives:

$$s_{\alpha i} = \left(\frac{k}{|S|}\, s_{\alpha i} + \frac{1}{|S|}\, s_i\right) \cdot \frac{|S|}{k+1} = \frac{k\, s_{\alpha i} + s_i}{k+1}$$
Which, in aggregate, gives the expression across the full transition array as a recurrence relation, $P_{\alpha(k+1)} = \frac{k P_{\alpha k} + P_g}{k+1}$, with $P_{\alpha 1} = \frac{1}{|S|^2}\cdot \mathbf{1}\cdot P_g\cdot \mathbf{1}$, or:

$$P_{\alpha k} = \frac{\mathbf{1}\cdot P_g\cdot \mathbf{1}}{k|S|^2} + \frac{k-1}{k}\, P_g = \alpha_{Tk}\, P_g\, \alpha_{Tk}^+ \tag{14}$$
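The recurrence of Eq. 14 admits an equivalent entrywise closed form, $(1/k)(1/|S| - P_g) + P_g$; a quick numerical check (illustrative entry value $P_g = 0.35$, $|S| = 4$) that unrolling the recurrence from the uniform start reproduces it:

```python
def p_alpha_closed(k, p, s_size):
    # Closed form for one entry of the transition array:
    # P_alpha_k = (1/k) * (1/|S| - P_g) + P_g
    return (1.0 / k) * (1.0 / s_size - p) + p

# Unroll the recurrence P_alpha_{k+1} = (k*P_alpha_k + P_g)/(k+1)
# from the uniform start P_alpha_1 = 1/|S| and compare at every step.
p_g, s_size = 0.35, 4
p_rec = 1.0 / s_size
for k in range(1, 100):
    assert abs(p_rec - p_alpha_closed(k, p_g, s_size)) < 1e-12
    p_rec = (k * p_rec + p_g) / (k + 1)
print("recurrence matches closed form")
```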
Which we can express in similar block fashion as above:

$$\alpha_{Tk} P_g = \frac{1}{k|S|}\begin{pmatrix} \mathbf{1} + |S|(k-1)T_s & \mathbf{1} \\ \mathbf{1} + |S|(k-1)P(g) & 1 + |S|(k-1) \end{pmatrix}$$

Using the general form for $\alpha_{Tk}$:

$$\frac{1-k|S|}{k|S|} + \frac{k-1}{k}T_s + \mathbf{1}\alpha_{Ts}^+ = \frac{\mathbf{1}|S|^{-1}}{k|S|} + \frac{k-1}{k}T_s - \alpha_{Tg}\mathbf{1} + \alpha_{Tg}\mathbf{1}\alpha_{Ts}^+$$

Simplifying, and taking the limit case where $k \to \infty$:

$$(1 - \alpha_{Tg}\mathbf{1})\,\alpha_{Ts}^+ = 1 - \alpha_{Tg}\mathbf{1}$$

$$\alpha_{Ts}^+ = I \;\to\; \alpha_{Ts} = I$$
Demonstrating conclusively that as $P_\alpha$ is learned, GAP agent training will be convergent. Equation 14 also allows us to determine the amortized form of the transition array over time; we can express it as:

$$P_{\alpha k} = \frac{1}{k}\left(\frac{1}{|S|} - P_g\right) + P_g$$

In which the terms $\frac{1}{|S|} - P_g$ and $P_g$ are clearly time invariant, thus the average learning curve will follow a reciprocal pattern $k_{p\alpha}(k) = A\frac{1}{k} + B$. $B$ is naturally the asymptotic average path-to-goal length, $k_p$. We can evaluate the initial behavior of the system given the form for $\alpha_{T1}$ and Eq. 14:

$$A = (k_p - L_{max})\,\frac{2\log(|S|) - 2\log(|S|-1)}{\log(\|T_{\alpha 1}\|_1)}$$

$\|T_{\alpha 1}\|_1$ can be directly calculated from $\alpha_{T1} P_g \alpha_{T1}^+$ as $\frac{|S|-1}{|S|}$, and thus:

$$k_{p\alpha}(k) = \frac{2(L_{max} - k_p)}{k} + k_p \tag{15}$$

Which establishes the average form of the learning curve for GAP agents as an offset reciprocal function of step number.
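Equation 15 is easy to evaluate; the asymptote $k_p$ and $L_{max}$ below are hypothetical, chosen only to show the shape of the curve:

```python
def learning_curve(k, k_p, l_max):
    """Predicted average steps-to-goal at training epoch k (Eq. 15):
    k_p_alpha(k) = 2*(l_max - k_p)/k + k_p."""
    return 2.0 * (l_max - k_p) / k + k_p

# Hypothetical asymptote k_p = 18 steps with l_max = 25: early epochs
# overshoot the asymptote and decay reciprocally toward it.
curve = [round(learning_curve(k, 18.0, 25.0), 2) for k in (1, 2, 5, 10, 50)]
print(curve)   # → [32.0, 25.0, 20.8, 19.4, 18.28]
```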

4 Empirical Experiments
In this section we demonstrate the effectiveness of the GAP algorithm in learning
across a diverse array of archetypal learning and planning domains.
Training Process. To train the agent, the AFI datastructure is initialized with
uniform random values. Upon observations of occasions, the corresponding INC
cells are updated and links in AFI sorted. We artificially induce error in some
trials by a random threshold process which executes a random non–planned
action. Simulation models are designed to output string states when polled for
information, and a simple hash algorithm generates a lookup table for the agent
to use.
We demonstrate the effectiveness of the GAP algorithm by measuring parameters related to its performance characteristics. We calculate best-fit equations ('$Ak^{-1}$') and a measure of their accuracy: the percentage off-linear ('%OL') average of linear regressions on the plots of $(\frac{1}{k}, k_p)$: $\frac{1}{N}\sum_{\forall n\in N} \frac{|k_p[n] - (A\frac{1}{n}+B)|}{k_p[n]}$. We also calculate and compare approximations of $k_p$ and $L_{max}$. For $k_p$: the fit curve for Eq. 15 ('$k_p$ I') and the average performance after convergence ('$k_p$ II'). $L_{max}$ comparisons are made between Eq. 13 ('$L_{max}$ I') and Eq. 9 ('$L_{max}$ II'). We ground performance with comparisons to Q-Learning and MDP policies, using the reward function $R(s_i, a_l) = \log(P(a_l(s_i) \to s_{i+1})) + \log(P(\sigma(i+1, g)))$ to mimic the probability-maximizing function. In our results, 'QL $k_p$' is the average number of steps to reach the goal for the trained QL agent, 'QL Ep.' is the number of epochs for the QL agent to converge (where 'NC' indicates failure to converge after 1000 epochs), and 'MDP $k_p$' is the average shortest path to the goal found by an MDP planner using Value Iteration, to set a performance floor.
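The %OL measure can be sketched as a least-squares fit on $(\frac{1}{k}, k_p)$ followed by the mean relative deviation; the synthetic data below follows Eq. 15 exactly (hypothetical $k_p = 18$, $L_{max} = 25$), so %OL comes out near zero:

```python
def percent_off_linear(ks, kps):
    """Least-squares fit of k_p ~ A*(1/k) + B, then the mean relative
    deviation (%OL) of the data from the fit."""
    xs = [1.0 / k for k in ks]
    n = len(xs)
    mx, my = sum(xs) / n, sum(kps) / n
    A = sum((x - mx) * (y - my) for x, y in zip(xs, kps)) / \
        sum((x - mx) ** 2 for x in xs)
    B = my - A * mx
    ol = sum(abs(y - (A / k + B)) / y for k, y in zip(ks, kps)) / n
    return A, B, 100.0 * ol

# Synthetic curve following Eq. 15 exactly: the fit recovers A and B
# and the off-linear percentage is ~0.
ks = [1, 2, 5, 10, 50]
kps = [2.0 * (25.0 - 18.0) / k + 18.0 for k in ks]
A, B, ol = percent_off_linear(ks, kps)
print(round(A, 2), round(B, 2), round(ol, 6))  # → 14.0 18.0 0.0
```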
STRIPS-Type Problems. First, we implement a STRIPS-type planning prob-
lem, as schematically represented in Fig. 5. The agent is in a world with move
operations that move it through space, a pair of world manipulating actions, and

Fig. 5. Illustration of a STRIPS World, Containing Linked Location States (Li ) and
Multiple Independent Actionable States (V1 and D1 )

Fig. 6. Learning Curves for the STRIPS Problem Across Levels of Induced Error from
0% to 50%

a state space including possession of an item, location, and status of the door,
for a total state space of size |S|max = 52.
Figure 6 showcases the learning curves of the GAP algorithm on this problem
at each induced error level independently. Each curve is the average performance
over 50 trials. We can see from these curves that the learning tends to follow
the same reciprocal pattern as the general curve, with variance in asymptotic
performance shifting due to the increase in error rate elevating the expected
number of steps to reach the goal.
To reinforce the reciprocal relationship, we also plot the linearization of these
curves along with the off–linear percent labeled for these curves. For each plot
but one, the deviation from linear fit is in the single digits, with the greatest
deviation being for the 30% curve, with a 15% average off–linear error. These
measurements serve to validate the prediction of Eq. 15 that the GAP algorithm
will express reciprocal learning curves.
In addition to these linearized plots showing correlation between 1/k and steps
to goal, we also highlight the correlations between kp as predicted by the asymp-
totic behavior of the data itself and the fit reciprocal curve, and calculate Lmax
from Eq. 15 and as predicted by the threshold in Eq. 8. Both measures are pre-
sented on Table 1, along with the corresponding percent errors. Here, we can see
that the differences between the asymptotic kp and the fit function are small,
ranging from 0.87% to 7.12% for induced error rates up to 40%, and the differ-
ence between the measured and predicted Lmax is 7.8%, indicating very close
correspondence between the observed performance and the predictions of Eqs. 15
and 8. These successive curves illustrate effective learning at high levels of error, with convergence occurring within 20 epochs even at 50% error.
Of note is the 50% error case, with error roughly twice that of the next largest.
However, an introduction of 50% error into the action of the agent is extremely
substantial, and it is reasonable to expect that the learning performance will
degrade. Qualitatively speaking, as the induced error rate increases, Pα behaves
more and more like a random uniform stochastic process. Referring back to
Eq. 13, we can see that the limit of kpα (k) will grow until the difference between
kpα (0) and the asymptotic performance is negligible. In more rigorous terms,

Fig. 7. An example of one randomly generated ill-conditioned maze used in these Maze/TAXI problems

limk→∞ kpα (k) → Lmax , and so the function kpα (k) no longer properly behaves
as a reciprocal, but as a constant function, exactly the expected behavior of an
attempt to learn a uniform random process.
Maze/TAXI Domain. The TAXI and Maze problems are canonical study
cases for machine learning systems. In the TAXI problem, the agent must visit
a list of locations, pick up a ‘passenger’, and deliver it to a specific destina-
tion. We complicate the problem by performing navigation in a maze. For the
agent, actions are cardinal direction movements, and pickup and drop off actions.
States include local observations of the maze topography, direction to the tar-
get ‘passenger’, and whether a passenger is currently carried. Additionally, note
that we do not perform training for fixed TAXI destinations and mazes, but
rather generate a random maze and passengers for each training epoch. We use
a relative measure, requiring the agents to learn broader patterns rather than a
rote problem. Rather than restricting ourselves to simple mazes, we allow non-
uniform spacing. Such a maze is illustrated in Fig. 7. As a result, the maximal
state space size is variable, however for the maze generation parameters used,
averages to |S|max = 18, 432.

Table 1. Comparison of measured and predicted values for analysis, calculated from the performance on the STRIPS problem learning curves, and comparisons to QL and MDPs

Pthresh  kp I   kp II  %E      Ak−1  %OL    QL kp  QL Ep  MDP kp
0%       18.53  18.10  2.30%   54.9  7.4%   38.5   18     17.0
5%       20.04  20.21  0.87%   62.1  6.8%   42.1   17     20.2
15%      22.81  22.76  2.11%   53.5  14.9%  43.5   19     20.1
30%      30.15  28.14  7.12%   25.9  5.9%   49.9   21     23.8
40%      43.82  42.07  4.15%   6.2   1.7%   53.4   29     59.2
50%      65.81  57.40  14.65%  0.08  2.9%   59.4   27     68.6

Lmax I: 25.30, Lmax II: 27.29, %E: 7.8%

Table 2. Comparison of measured and predicted Lmax across abstractions for the com-
plex Maze/TAXI domain with joint abstractions, along with QL and MDP performance
baselines.

kp I kp II %E Ak−1 %OL Lmax I Lmax II %E QL kp QL Ep MDP kp


AI wA 30.2 28.3 6.7% 69.9 9.2% 61.7 69.1 10.6% 127.3 265 30.4
AII wA 23.4 22.4 4.5% 8.1 6.3% 32.2 36.8 12.5% NC NC 23.5
AI w/oA 396.0 452 12.4% 505 7.2% 527.1 573.1 8.0% NC NC 297
AII w/oA 38.0 37.2 2.2% 24.9 8.4% 40.4 43.8 8.3% 221.4 269 33.5

In Fig. 8, we plot the learning curves for the Maze/TAXI problem across levels of
induced error ranging from 5% to 30%. We observe two trends: the asymptotic
kp ’s proportionality to the error rate, and the correlation between initial per-
formance and long term performance. We also note the presence of ‘adaptation
bumps’ between epochs 4 and 8. This is correlated to changes in effectiveness as
the agent encounters large changes in the random maze, and indicates adaptive
learning. Note that GAP agent learning consistently converges within 10 epochs.
To investigate abstraction performance, we use three versions of the state
definition. AI, representing the 8 neighborhood cells; AII is similar to AI, but
includes only the four cardinal directions; and wA, or ‘with Action’, adds the
additional information of the most recent action the agent has taken. We produce
four different state generation methods with these: ‘AI wA’, using AI and wA
together, ‘AII wA’, and AI and AII both without wA (nominally ‘AI w/oA’ and
‘AII w/oA’). By joining the different models in this way, we can compare the
relative impact of each transform using Eq. 11.
We measure the same indicators as before, tested at all six error levels and dis-
played on Table 2. Table 2 additionally presents the calculated values for Lmax .
We find that the pairs of values are typical for the GAP algorithm thus far, and
on the appropriate scale for the performance values observed. Further, the QL
agent fails to learn in either the AII wA or the AI w/oA case, indicating that
the GAP agent can effectively learn problems which QL cannot.

Fig. 8. Performance of the GAP algorithm across multiple levels of induced error on
the Maze/TAXI problem space

Table 3. Measured kpα and corresponding |α+ T α| estimates, with quadrants representing pairs of composed abstractions.

                 AI                 AII
Pthresh   kpα      |α+ T α|   kpα      |α+ T α|
1%        30.19    1.05       23.40    1.15
5%        54.16    1.09       24.58    1.11
10%       52.49    1.06       31.72    1.21
15%       67.30    1.57       35.43    1.77
20%       42.12    1.02       31.66    1.14
25%       58.45    1.05       29.70    1.08
wA        1.144 (±12.5%)      1.249 (±14.1%)
1%        396.17   1.01       38.00    0.99
5%        272.13   1.00       30.88    0.99
10%       285.07   1.00       37.76    1.00
15%       418.00   1.01       35.83    0.99
20%       729.29   0.99       36.64    1.00
25%       464.91   1.01       51.31    0.00
w/oA      1.004 (±0.3%)       0.996 (±0.2%)

Table 4. Calculated |α+ α| ratios across abstractions and predicted transform measure, derived from the entries in Table 6 and Eq. 11

Q(α)       AI     AII    I → II     AIwA → AIIw/oA
wA         1.144  1.249  1.091      Meas: 0.871
w/oA       1.004  0.996  0.992      Pred 1: 0.958 (+10%)
wA → w/oA  0.877  0.798             Pred 2: 0.791 (−9%)

A joint $P_\alpha$ can be constructed by multiplying transforms; via Eq. 11, we can use $k_p$ and $L_{max}$, along with the fit functions for $k_{p\alpha}$ as a function of $\log(1 - P_{thresh})$, to estimate $\|\alpha_{Ts}^+\|_1 \cdot \|T_\alpha^k\|_1 \cdot \|\alpha_{Ts}\|_1$. Table 3 contains the estimated values of the $L_1$ norm for $|\alpha^+ T \alpha|$ at each error level. These show little variance across error level, as expected for constant transforms.
Since the abstractions are in pairs, submultiplicativity of the $L_1$ norm lets us estimate the impact between pairs, from 'AI' to 'AII', both in the 'wA' and the 'w/oA' cases, and compare these. Table 4 presents these values, and the level of correspondence between the ratios. For a grounded measure, we compare the two-step transforms from 'AI wA' to 'AII w/oA' with the direct transform: the former estimates 0.958 and 0.791, both within 10% of the latter, which is 0.871. These results validate Eq. 11.
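The composition check can be reproduced arithmetically from the Table 4 entries; which two-step products correspond to 'Pred 1' and 'Pred 2' is our reading of the table:

```python
# Per-abstraction measures from Table 4 (first two rows) and the
# directly measured AI wA -> AII w/oA value.
ai_wa, aii_wa = 1.144, 1.249
ai_woa, aii_woa = 1.004, 0.996
measured_direct = 0.871

i_to_ii_wa = aii_wa / ai_wa         # ~1.091  (AI -> AII, with action)
i_to_ii_woa = aii_woa / ai_woa      # ~0.992  (AI -> AII, without action)
wa_to_woa_ai = ai_woa / ai_wa       # ~0.877  (wA -> w/oA, AI)
wa_to_woa_aii = aii_woa / aii_wa    # ~0.798  (wA -> w/oA, AII)

pred_1 = i_to_ii_wa * wa_to_woa_ai      # two-step estimate ~0.958
pred_2 = i_to_ii_woa * wa_to_woa_aii    # two-step estimate ~0.791
for pred in (pred_1, pred_2):
    # Both composed estimates land within ~10% of the direct measure.
    assert abs(pred - measured_direct) / measured_direct < 0.105
print(round(pred_1, 3), round(pred_2, 3))   # → 0.958 0.791
```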
Tower of Hanoi Domain. Conceptually simple, the puzzle consists of a number
of disks and three or more pegs on which these disks may be stacked in order of

Fig. 9. Illustration of a Traditional Tower of Hanoi (ToH) problem. This graphic shows
the 3-peg, 4-disk, variant of the problem, T oH3,4

size, as represented in Fig. 9. Problems are usually represented as $ToH_{p,d}$, where $p$ is the number of pegs and $d$ is the number of disks. The scope of the state space is $|S|_{max} = p^d$, and of the action space $|A|_{max} = p^2$.
We investigate three instances of the problem. Figure 10 plots the average
learning curves for T oH3,3 , T oH3,5 , and T oH4,5 over error rates ranging from
0% to 25%, and the reciprocal best fit curves for each. In addition to the curves,
we also present the errors associated with the reciprocal fit, showing agreement
to the model. Table 5 presents the results of these experiments, and the baseline
comparison values for QL and MDP agents. Note again that the GAP agents
converges after 15 epochs or less, while QL agents take in excess of 90, or do not
converge.
We also test four abstractions: 'AI', 'AII', 'AIII', and 'AIV'. AI directly converts the disks on each peg to a numerical state; AII encodes the sums of the disk indices on each peg; AIII lists pairs of the number of disks on each peg and the topmost index; and AIV lists only the number of disks.
Table 6 presents the essential measurements for the entire battery of experiments, spanning $ToH_{3,3}$, $ToH_{3,5}$, and $ToH_{4,5}$, and all four abstractions across error rates from 0 to 20%. In these experiments, we see several cases in which QL agents fail to converge; in particular, the $ToH_{4,5}$ experiment with AII is interesting, as the convergent performance takes approximately 15 times longer than the GAP agent to reach the goal, indicating that it is a threshold case where learning becomes ineffectual for QL agents. We also observe that the AIII and

Fig. 10. Average learning curves for the GAP algorithm over the three investigated
ToH domains, T oH3,3 , T oH3,5 , and T oH4,5 at varying error levels, along with reciprocal
Fit curves

Table 5. Chart of the correlation measures for the GAP algorithm learning the Tower of Hanoi problem, across error level and problem complexity class

Pthresh kp I kp II %E Ak−1 %OL QL kp QL Ep MDP kp


5% 16.5 15.6 5.5% 110 19.3% 19.2 141 11.4
ToH(3,3) 15% 47.3 47.8 0.9% 117 7.8% 36.7 138 35.2
20% 61.9 63.4 2.4% 126 1.8% NC NC 39.9
5% 32.7 30.1 11.4% 35 5.4% 37.2 95 30.8
ToH(3,5) 15% 34.8 37.1 6.4% 144 5.8% 100.2 137 34.3
20% 44.9 48.7 8.2% 388 16.1% NC NC 51.2
5% 206.4 201.5 2.4% 1678 12.3% 287.4 135 175.4
ToH(4,5) 15% 696.7 707.5 1.6% 1336 2.3% NC NC 599.1
20% 2052.2 2006.9 3.0% 133 1.2% NC NC 1766.6

Table 6. kp and Lmax comparisons for the GAP algorithm learning the ToH problem
with various abstractions and across complexity classes

Abst kp I kp II %E Lmax I Lmax II %E QL kp QL Ep MDP kp


AI 16.5 15.6 5.6% 15.6 17.5 11.4% 27.2 139 17.1
T oH3,3 AII 27.8 31.5 13.5% 8.1 8.8 7.3% 37.1 123 26.9
AIII 21.6 17.3 20.2% 17.22 14.8 16.3% N/A N/A 20.0
AIV 17.0 15.3 9.7% 31.5 35.1 10.3% N/A N/A 12.9
AI 35.3 34.2 3.1% 62.1 59.4 4.7% 41.8 115 32.6
T oH3,5 AII 31.4 35.1 11.8% 64.9 69.0 5.9% 36.1 163 37.5
AIII 31.0 31.0 0% 31 31 0% N/A N/A 31.8
AIV 31.0 31.0 0% 31 31 0% N/A N/A 31.1
AI 254.5 256.9 0.9% 112.06 104.5 6.8% 226.9 72 162.9
T oH4,5 AII 278.1 241.3 13.2% 101.9 112.6 10.5% 4068 43 273.6
AIII 1267.2 1354.6 6.9% 391.9 371.5 5.2% N/A N/A 1076
AIV 2017.7 1843.2 8.6% 719.9 749.8 4.1% N/A N/A 1899

AIV cases for the T oH3,5 case unilaterally converge to the optimal number of
steps, presumably incidentally.

5 Conclusions
In this paper we have presented the GAP algorithm, which uses an elegant
datastructure and carefully chosen action policy to efficiently learn solutions to
sequential planning problems without requiring design of a reward function or
world model. We highlighted the relationships between extant reward and mod-
eling based systems which indicate the detriments of using rewards to drive solu-
tion finding. We proposed to fill this gap which additionally allows for planning
between arbitrary states using the same training data, and operates in low-order
polynomial time thanks to the use of the augmented hypergraph datastructure.
We showed how the design of the algorithm creates useful properties, which
enable analytic proof for several valuable characteristics, including global optimality of the action policy, exponentially bounded goal achievement rates, pre-
cise identification of dead-end state probabilities, conditions for convergence
under abstracted, perturbed, and error transforms, a measure for the perfor-
mance impact of said transforms, learning convergence, and the form for the
average learning curve for the agent.
Batteries of experiments on three demonstration domains highlighted effi-
cient learning and convergence properties of the GAP agents, which consistently
learned an order of magnitude faster than QL agents, and to solve problems
which the QL agents failed to, with performance levels comparable to MDP
agents despite not being provided with a transition model or reward function.
We used the STRIPS domain problem to establish fundamental effectiveness of
the algorithm. The Maze/TAXI domain was used to illustrate the power of the
GAP algorithm in a complex hierarchical, relativistic, and dynamic domain with
over 18,000 states, and demonstrated validity of the L1 norm–based abstraction
performance analysis by comparing multiple composed transforms. We used the
Tower of Hanoi domain to illustrate effective performance over multiple levels of
single-domain complexity, across a range of error rates, and with multiple state
space transforms.
The GAP algorithm has outstanding limitations we would like to address.
Firstly, though the cubic-order hypergraph is less extensive than many world
models which grow exponentially, it is still relatively inefficient. A dynamically
allocated structure would improve performance. Additionally, the planning per-
formance can be improved, especially by implementing heuristics, such as A*.
Such an algorithm using non-biased, structural heuristics is a goal for future
development. Also, the familiarization phase (Eq. 14) can lead to bias during initial learning, but the introduction of an implicit learning rate may ameliorate
this.
There are also some other topics we would like to investigate going forward.
The abstraction mechanism allows opportunities to develop unsupervised hierarchical decomposition functions for state spaces. Alternatively, the action policy can be altered to use a statistical selection of actions, rather than argmax.
Finally, the effectiveness on a dynamic and relative domain suggests a rigorous
model for adaptation and learning re-use can be constructed.

References
1. Bertsekas, D.P., Tsitsiklis, J.N.: An analysis of stochastic shortest path problems.
Math. Oper. Res. 16(3), 580–595 (1991)
2. Blum, A.L., Furst, M.L.: Fast planning through planning graph analysis. Artif.
Intell. 90(1–2), 281–300 (1997)
3. Blum, A.L., Langford, J.C.: Probabilistic planning in the graphplan framework. In: Biundo, S., Fox, M. (eds.) LNCS (LNAI), vol. 1809, pp. 319–332. Springer, Heidelberg (2000). https://doi.org/10.1007/10720246_25
4. Dimitrov, N.B., Morton, D.P.: Combinatorial design of a stochastic markov decision
process. In: Operations Research and Cyber-Infrastructure (2009)

5. Grzes, M.: Reward shaping in episodic reinforcement learning (2017)


6. Guillot, M., Stauffer, G.: The stochastic shortest path problem: a polyhedral com-
binatorics perspective. Eur. J. Oper. Res. 285(1), 148–158 (2020)
7. Hostetler, J., Fern, A., Dietterich, T.: Sample-based tree search with fixed and
adaptive state abstractions. J. Artif. Intell. Res. 60, 717–777 (2017)
8. Hunter, A., Thimm, M.: Probabilistic reasoning with abstract argumentation
frameworks. J. Artif. Intell. Res. 59, 565–611 (2017)
9. Koenig, S., Simmons, R.G.: The effect of representation and knowledge on goal-
directed exploration with reinforcement-learning algorithms. Mach. Learn. 22(1),
227–250 (1996)
10. Kolobov, A., Mausam, M., Weld, D.S., Geffner, H.: Heuristic search for generalized stochastic shortest path MDPs. In: Twenty-First International Conference on Automated Planning and Scheduling (2011)
11. Konidaris, G., Kaelbling, L.P., Lozano-Perez, T.: Learning symbolic representations for abstract high-level planning: from skills to symbols. J. Artif. Intell. Res. 61, 215–289 (2018)
12. Leonetti, M., Iocchi, L., Stone, P.: A synthesis of automated planning and rein-
forcement learning for efficient, robust decision-making. Artif. Intell. 241, 103–130
(2016)
13. Lüdtke, S., Schröder, M., Krüger, F., Bader, S., Kirste, T.: State-space abstractions
for probabilistic inference: a systematic review. J. Artif. Intell. Res. 63, 789–848
(2018)
14. Matignon, L., Laurent, G.J., Le Fort-Piat, N.: Reward function and initial values: better choices for accelerated goal-directed reinforcement learning. In: Kollias, S.D., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4131, pp. 840–849. Springer, Heidelberg (2006). https://doi.org/10.1007/11840817_87
15. Pineda, L., Zilberstein, S.: Probabilistic planning with reduced models. J. Artif.
Intell. Res. 65, 271–306 (2019)
16. Steinmetz, M., Hoffmann, J., Buffet, O.: Goal probability analysis in probabilistic
planning: exploring and enhancing the state of the art. J. Artif. Intell. Res. 57,
229–271 (2016)
17. Szepesvári, C., Littman, M.L.: Generalized Markov decision processes: dynamic-programming and reinforcement-learning algorithms. In: Proceedings of International Conference on Machine Learning, vol. 96 (1996)
Systemic Analysis of Democracies
and Concept of Their Further
Human-Technological Development

Bernhard Heiden1,2(B) and Bianca Tonino-Heiden2

1 University of Applied Sciences, Villach, Austria
[email protected]
2 University of Graz, Graz, Austria
http://www.cuas.at

Abstract. Democracy was invented in Greece approximately 2500 years ago as a principle of dividing political power. Since then the basic principle has not been developed substantially further, although modern democracies have introduced improvements. The objective of this paper is to introduce a new understanding principle in the direction of a technology-driven democracy. After a problem-focused analysis, we use the axiomatic systemic method and natural-language logical argumentation for a basic system construction of a new general system of democracy: in general, by the personalization of nations or states, and more specifically, as an illustration, by division of the state dimensions into (a) person mapping and (b) territory. The personalized state or nation then culminates in a world-nation-trade-map for the personal implementation of market-value-oriented nation selection. When technologically implemented, this gives every person in the world the ability to choose any traded nation for their future valued membership. This further development of democracy then effectively means a phase reversal from world to nation in the human-rights direction of the individual, thereby fulfilling the highest ethical standards and giving control of the world back to each individual, together with personal responsibility, by making possible the choice of being part of one's favourite nation or state. Although this is only a very short sketch of a new kind of higher-ordered democracy, we invite people from all over the world to join this vision in the human-rights direction of a peaceful and powerful humanity.

Keywords: Democracy · Future technologies · Humanity · Human rights · World-nation-trade-map

1 Introduction
Change in Society. Today we feel the necessity of change in society, although the
words are yet missing for the phenomena that are arising around us rapidly. Past
technologies have led to an unprecedented growth in technology, communication,
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 414–425, 2023.
https://doi.org/10.1007/978-3-031-18461-1_28
Analysis and Concept of Democracies Human-Technological Development 415

knowledge and population. Yet one thing has remained the same: the terrain of
the earth. Humanity now has at least two problems with technology: the power
that individual humans and small groups have relative to each other, and the
task of distributing this power among individuals. Each problem is also
increasing as a consequence of technologies.
This means we have a power control problem of individuals, which is usually
called politics and which historically has had a multifaceted space of solutions.

Industry 4.0. Industry 4.0 can be regarded as one currently ruling paradigm of
industry and society (cf. [2–4]). As it affects the whole bandwidth of society
and technology, it also entails technologies in the digitalization area, which
are an essential part of the current transformation of humanity. In this paper
we focus on the widely applicable aspect of personalization, which is
represented in industry and society by personalized products, services,
machines and algorithms, to mention only a few; in each case, personalization
increases the value of the underlying item by giving choice, or power, to the
affected individual.

A New Democratic Principle. In this paper we introduce for the first time a
new democratic principle that further divides power into mainly (a) territorial
and (b) personalized membership. In the infinite regress, this leads to the
personalized state "in the pocket". This will solve some basic human problems
in a new fashion and, above all, put human rights on a new humanity standard
that, from the future perspective, would never have been possible without
current and future technologies. The optimistic outlook remains that, with the
fundamental basics of logic in every aspect of our daily lives as well as in
our systems, law, ecology and economy, we will reach the next necessary steps
of human progress.

Scientific Motivation. The scientific motivation of this paper can be seen from
the systemic perspective. As systems become larger and larger, some system
properties change, since every system in the world is subject to size scaling
as one universal law. We see the globalization of all sorts of effects in
economy and society, e.g. [19,21], and big data, e.g. [16,17]. In a globalized
world this means that interconnectivity increases on the one side, while on
the other side the roll-out velocity of models or prototypes, such as
businesses, technologies or products, accelerates; with regard to the basic
laws of mechanics, this can be identified, in a transcribed understanding, as
an increasing acceleration of this "model roll-out". This means that, together
with the driving forces, all dependent systems accelerate, and with this
negative trends are enhanced as well as positive ones. For this reason the
negative trends have to be better controlled globally. Concerning the global
trend of democracy: although it has been documented that the democracy index
is going down, and Karin Schmidlechner [20] seems pessimistic about the global
development of democracy, our opinion is that democracy is globally increasing
in the long run, as this is the only possibility to increase order, and hence
global economic and social or human efficiency, by means of cooperation. For
this sort of
416 B. Heiden and B. Tonino-Heiden

processes Bianca shaped the term "conficient" [12]. So today's deficiencies of
democracy, and of other, less decentralized state, company or people's
organizations, will increase the urgency of an efficiency reshape. In this
paper we therefore focus on one of the root causes of democracy, and of
decentral organization in general, which will increasingly be the new paradigm
of the future, and of future technologies, as central systems reach their
limits. These limits are manifold, but as the corresponding system limits come
nearer, we need concepts for reshaping all sorts of systems, and in the
technological sense force-driven systems, in the direction of highly dense
osmotic functional systems: in production [5,11,23], in information [26], and
in organizational, societal and political ones. We focus in this work on the
further development of democracy, but the concept can, when specifically
adapted, be applied to all sorts of the mentioned systems. In this sense we
introduce the close connection between democratic and decentral concepts,
which is the systemic generalization that then applies to these systems. Hence
with this new approach it may be valid to speak of democratic production,
information, organizations, societies and also politics. In this systemic
perspective, democratization can be regarded as the partitioning of forces,
their orgitonization according to origiton theory, which means that the system
is analyzed and partitioned into smaller coupled and decoupled cybernetic
units that potentially increase overall system order by means of an emergence
contraction process (see also [6]); that is, information density is increased
by simultaneously emerging properties of integrated higher-order power control
in the case applied here.

2 Used Methods and Goal


Scope. The scope of this work is to give a systemic analysis of democracy
which plausibly introduces a sound, new and potentially order-increasing
further development of the latter by means of guiding Axioms. We then pave the
way in the human-rights direction of evolving future technologies, intended as
a potential work program with the goal of an overall and effective increase of
humankind's order.

Method. In this paper we use the method of natural-language logical
argumentation, together with axiomatization, to form a theory set for the goal
we are aiming at.

Goal and Research Question. The goal of this paper is to translate the
principle of personalization, a core principle or paradigm of Industry 4.0, to
the more general context of democracy and the requirements of future
technologies. With this, we sketch shortly how and why, with the help of
current and future technology, this will possibly, as a projection, lead to a
higher-ordered, dynamically stabilizing and self-organizing world. The
dominating research question of this article is: How can power be divided
further in democratic or decentral systems, so as to decrease the absolute
individual power in the system and potentially increase order beyond the
previous system state?

Limits of the Work. As this work tries to enlarge the frame for future
technologies, it is naturally limited to the focused results, which can be
further generalized on the one side, but also need to be tested in different
possible systems, as they should be widely applicable. The most challenging
limit is how much room the generalization provided by this model shall take in
order to sufficiently "control" the system in question. The technical limits
are that only the most important ideas of these principles can be given here;
technical details will need future follow-up applications and research.

Content of the Work. In this work we first make, in Sect. 3, a systemic
analysis of democracy and formulate it with basic Axioms, which we later
enhance by further Axioms in the human-rights direction of a new democratic
paradigm. In Sect. 4 we then give an application example of a new democratic
implementation, including the basic power structure of the newly shaped
nation, state or system and the individual as an overall system of "trade
units". We further give a short sketch of how current technological
developments can support the implementation of this potential democratic
paradigm in all sorts of systems, from the individual over industrial
production to society. In Sect. 5 we give a conclusion and an outlook on
applications and future research in this field of order-enhanced democracy.

3 Systemic Analysis of Democracy: Future Technology - Personalized Democracy

According to Luhmann [14, p. 1022], "Freedom and equality are initially still
'natural' attributes of human individuals. Since they are not found realised in
civil society, they are upgraded to 'human rights', the observance of which can
be demanded - up to the human rights fundamentalism of these days." (translated
from German). This points out the development of human rights on the one side,
leading to the formulation of the UN Charter of Human Rights after WWII
[24,25]. This frozen state, which leads to an absolutism, indicates the border
of actual human rights. According to self-organization theory, the
far-from-"equilibrium" state of order can, in conjunction with chaos theory, be
reformulated as an order trajectory, which can potentially bifurcate into order
increase or decrease. It cannot stay fixed, as a fundamentalistic approach or
development would suggest. So the question is in which direction the future
development of humanity will go: in the direction of higher order, which means
an effective increase in human-rights standards, or of a decrease, which means
the opposite, a weakening of the latter. In fact, today's advance in weakening
these rights has effectively brought back the state of military confrontation
in Europe, which means that we have a partial, globally oscillating order
backshift behind the WWII status, while at the same time having highly advanced
technologies in all knowledge areas. The effective order decrease can be
observed, e.g., in the economic decline, which can be regarded as a rough
integrating function of all human activities on earth. We mention this greater
problem only from a side view,
as we focus on new democracy principles and their relatedness to technology.
It can be seen directly from the above argument that human rights and their
effective implementation are crucial for human development.

3.1 Basic Axioms

A fundamental basic principle should be that societal developments avoid, as
far as possible, any contradictions in their regulations. For instance, if
there is a rule "there will never be an x mandatory", a contradicting rule
"there has to be an x mandatory" can never be formulated, because of its
logical impossibility for human understanding, known since Aristotle [1] and
Leibniz [18], and Boole in newer days. The important point with regard to
information is that only in this way can information-dense systems be
achieved; this is thus also a means of optimizing a system by the scientific
goal function "truth", as, according to Luhmann, science operates in and with
the medium of truth.
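As a minimal toy illustration (not part of the paper's method), the
contradiction-freedom requirement can be read as a satisfiability condition on
a rule set: a body of regulations is consistent only if at least one state of
affairs satisfies all of its rules at once. The rule names below are invented
for this sketch.

```python
# Hypothetical sketch: brute-force consistency check over propositional rules.
# Each rule is a predicate over a truth assignment; a rule set is consistent
# iff at least one assignment satisfies every rule simultaneously.
from itertools import product

def consistent(rules, n_vars):
    """Return True if some truth assignment satisfies all rules."""
    return any(all(rule(v) for rule in rules)
               for v in product([False, True], repeat=n_vars))

never_x = lambda v: not v[0]  # "there will never be an x mandatory"
must_x = lambda v: v[0]       # "there has to be an x mandatory"

print(consistent([never_x], 1))          # prints True: consistent on its own
print(consistent([never_x, must_x], 1))  # prints False: contradictory pair
```

The brute-force check is exponential in the number of variables, which is
acceptable only for such toy rule sets; it merely makes the logical point of
the paragraph above concrete.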
We now formulate the first Axiom:

Axiom 1. Continued division (decoupling) is a necessary prerequisite for "fractal" (distributed) increase of order.

The principle of Axiom 1, which is similarly known from Machiavelli [15], can
be seen in Fig. 2, where the partition or division grade is depicted on the
y-axis, indicating the potential increase in power through the
fractionalization of processes. For the modern world down to the modern
individual this means, as a generalization, a functionalization of an
increasing number of tasks in one's life and of processes in the world. The
driving force behind this phenomenon can also be seen in technically available
machine solutions that increase the possibilities for decision, or in
technology-driven, human-enhancing cybernetizations. With regard to this
technological development we can now formulate the second Axiom:

Axiom 2. Personal power is increasing. This is partly due to Axiom 1.

There are several reasons for the increase in the power of humans. One main
reason is that the effectively available decision room is growing through
technological enhancements such as Artificial Intelligence (AI), ambient
intelligence, and similar solutions.
With regard to the human-right property of possible autonomy as part of the
specific living condition, we can formulate the third basic Axiom:

Axiom 3. Principle of circular economy or principle of non-exploitation.

Axiom 3 can be regarded as the principle that the functionality of humans
shall be increased, with regard to the personal dimension as well as the
material one. This Axiom comprises Axioms 1-2, as the economy is a
meta-function of material trade processes, which are subject to the power of
persons.

3.2 New Democratic Axioms


The following Axioms go into more detail on the basic functioning and
construction architecture of a personalized state or system. First we have the
basic guiding principle for all decisions, according to the meta-goal of being
functional or, with regard to calculations and human decision criteria,
rational:
Axiom 4. The valued elements of the state or system are the decision precon-
dition for the individuals' valuing process or condition.
First, the value of the elements of the state is subjected to a unit-ification
in the form of a measured value as a unit. This unit is then the base for the
consequent calculations and the valuation according to the
world-nation-(stock)-market (cf. Fig. 1).
Building on the value Axiom we can formulate the next Axiom:
Axiom 5. The state (s) or nation or system applies to citizens (c) for being
their state, etc., and not citizens apply to the state or nation or system to
be their citizens (phase reversal principle): s → c ∧ ¬(c → s).
This Axiom sketches the direction of the infinite regress in the
(human-rights) direction of full personalization of the nation, state or
system, which is (in principle) a paradox, due to the interaction of the whole
and the part, and the part being part of the whole. It may be mentioned here
that the citizen must have a right to be part of the basic world state, which
is then the new human right to be a world-nation citizen. The basic value is
the existential value, the right to live, to mention one of the then essential
basic human rights.
From this we can formulate an abstract Axiom related to groups of individuals
in the world-nation context, which is a consequence of the shift from the
state-to-individual (state2individual) relation, according to Axiom 5, to a
decision relation from the individual to the state (individual2state), and of
the underlying value proposition of Axiom 4:
Axiom 6. The “Völkerbewegung” (movement of people) is (increasing) virtual.
The virtual Völkerbewegung happens by a mouse click or "decision click" of
choice. Instead of moving to a territory, the territory is assigned by the
choice of the individual to be part of the nation, according to the choice in
the world-nation-trade-map. This in fact needs information technology to
manage the information flows across the global applications and the
stakeholders involved by contract and physicality, e.g. territory as physical
location or presence.
This virtualization of decision processes, together with the phase reversal
phenomenon, then not only accelerates, as a logical consequence, the decision
processes, but also makes the circular economy more feasible, as the freedom
of the personal individual-state relation increases, whereas the necessity of
physical transport processes decreases.
We can now introduce a higher-order Axiom related to the properties of
division and fractality (Axiom 1), which hence also allows better control
(cf. e.g. [7]):
Axiom 7. The state or system can be divided in the functional parts: territory,
state-contract and individual.
The division of the state or system into several parts according to Axiom 7
makes a general decoupling and fractionalization possible. This means that
each of the components can be distributed over the world. Hence this is a
decentral process that is intrinsically democratic in nature itself, a
second-order cybernetics or higher-ordered origiton. According to Axiom 4, the
uniting value is then also the self-organizing element, due to the
personalized decision of the individuals for the nation, state or system.
According to Axiom 1, this potentially increases order and leads to a
decoupling and thereby to a stabilization of processes (see also Fig. 2). This
complexity growth according to Fig. 2, as an increasingly potential ordering,
is also depicted by an increasing multidirectionality (cf. also [9]).
When we now reshape the arguments and sum them up in an overall
information-dense process, we can give the following integrating Axiom:
Axiom 8. The ethics of the world and of individuals increases by applying
increasing personalization of decisions.
Axiom 8 can be made plausible as follows. (a) To fulfil the condition of
separation of territory and state-contract, a higher order is needed, as the
organization has to be standardized, e.g. for the executive forces, and has to
be accountable over the whole world due to generally globally consented or
universally applicable laws. The contracts then guarantee the higher order,
and the construction of a state based on these fundamentals has, of necessity,
a higher order. Further, (b) world ethics is a meta-function of individual
ethics. This can hence be regarded as a multicriterial functional optimization
problem leading to a more integrated solution, in which the individual, as the
central state-stakeholder of the nation, state or system, is at the same time
making the decisions. Altogether, this is also a higher-order back-coupling
process, as the individual is, in a controllable fashion, self-affected by its
previously made decisions and will therefore easily be motivated to take on
responsibility for its own actions. This also builds the bridge to the Axiom
that follows naturally from the previous argumentation:
Axiom 9. Legal or system mandates have to be avoided as human understanding
has to be increased.
With Luhmann, this Axiom can be interpreted insofar as the explicit has to be
increased over the implicit, which is committed by the communication dimension
[13]. In Luhmann's communication theory, the triangle of (1) understanding,
(2) information and (3) communication forms one system. The elements (1-3) are
autopoietic systems, which means that they are cybernetically closed or
decoupled and structurally interwoven or coupled. So Axiom 9 states that the
personal dimension in communication is of fundamental importance and priority
in society. The choice to enforce rules by mandatory legislative obligations
could over-control a running system by dictating abstract rules over living
beings that could better decide locally and personally for their own good in
life. So the enforcing
mandate should be as minimal as possible, as otherwise the potential harm will
increase, because abstract rules are (a) possibly erroneous, (b) of second
order with regard to decision priority, as in mechanical rules no direct
consciousness is involved, and (c) right or wrong, delayed with regard to
their rightness, which necessitates a rights system that is always second
choice compared to an immediate human (own) decision.
This is especially true when the effects of the individual decisions are
increasingly back-coupling in nature via the further developed democratic
world system according to the enhanced democracy model sketched above, which,
triggered by the multiplicity of decision processes, needs future technologies
and rational decision-support solutions.

4 World Nation Trade Map


In Fig. 1 the world-nation-trade-map is depicted, which comprises a
fractalized nation, state or system and which is realized by means of future
technology according to Axioms 1-9. An important order criterion will be the
valuation. It will be crucial to calculate the value of each asset in order to
assess the economic dimension, but other dimensions may be addressed as well
that are important, e.g., for human life quality, working conditions, a
healthy environment, family friendliness, or engaging and inspiring learning
conditions.
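As a toy illustration of how such a multi-dimensional valuation could feed the
world-nation-trade-map, the sketch below scores hypothetical nations over
value dimensions with personal weights. All dimension names, scores, weights
and nation labels are invented for this sketch and are not taken from the
paper.

```python
# Hypothetical sketch of a world-nation-trade-map valuation: each nation,
# state or system gets a score per value dimension; an individual's weights
# express personal priorities, and the weighted sum ranks the nations.
def nation_value(scores, weights):
    """Normalized weighted sum of a nation's dimension scores."""
    total = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total

# Invented example data: scores and weights on a 0..1 scale.
weights = {"economy": 0.4, "life_quality": 0.3, "environment": 0.2, "education": 0.1}
nations = {
    "Nation A": {"economy": 0.9, "life_quality": 0.6, "environment": 0.5, "education": 0.7},
    "Nation B": {"economy": 0.6, "life_quality": 0.9, "environment": 0.8, "education": 0.8},
}

# Rank nations by this individual's weighted valuation, best first.
ranked = sorted(nations, key=lambda n: nation_value(nations[n], weights), reverse=True)
print(ranked)  # prints ['Nation B', 'Nation A'] with these weights
```

An individual would then pick the top-ranked entry; a different weighting
reflects different personal priorities and can reverse the ranking.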

4.1 Glocal Production, Economy, and Law


The production dimension can be derived from Fig. 2, from the intermediate
region between the individuals and the nation, state or system as a big
organization. The principle is also here the same, although the actual state
of companies and concerns today is already partly similar to the proposed
personalized nation, state, system or overall system arrangement. The economy
is the functional system of trade over all companies and nations, states or
systems involved. What we also observe today is that when nations, states or
systems and concerns1 begin to interact, a lockstepping as an order-decreasing
functionality can take place, which is closely linked to a spontaneous
coupling or negative resonance (cf., e.g., [8]). So one major part, in
production, economy and law as well as in politics and other functional
regions, is that an effective stabilizing decoupling takes place, as a
self-organizing multicriteria optimization process.
Especially with regard to law, this will also be crucial for stability, as
this special condition of a nation, state or system regulates the allowed or
forbidden conditions in the state, etc., and with this the possible consent of
the potential citizens. To adjust, compare and valuate in a way useful for the
customers of the future personalized state, we will need AI that then
precalculates all those value
1 Here compare also the meaning of "concern": "it is my concern", e.g., means
that the task is related to me. In this interpretation the state is structured
like a concern; it relates to me.

Fig. 1. World nation trade map - basic elements and composition principle.
[Figure 2 schematic: y-axis "partition or dimension grade" / "decoupling
grade", annotated with "personalized focus controllability", "possibilities of
choice" and "all functions"; x-axis "size" (number of persons, m² of land or
area); elements shown: individual, private property, work, state-contract,
people, territory, nation (territory and people), world.]

Fig. 2. Individual-nation-world; arrows indicate the bidirectionality grade.


composing factors, to make decisions possible that are increasingly rational,
e.g. with computationally enhanced logic AI, the lambda computatrix (cf. e.g.
[10]).

4.2 From Ambient Intelligence to the Glocal Smart World

The increasingly interconnected world means that we are surrounded by machines
which fit smartly into our world. The world will then be very smart and rich
in information wherever those structures are established. It is clear that the
order of the local system will depend strongly on those cybernetic structures,
as with Industry 4.0 we have the paradigm of (1) the smart machine, (2) the
smart product and (3) the enhanced or smart user (see, e.g., [22]).

5 Conclusion and Outlook


In this work we have first investigated the state of democracies today from
the systemic standpoint of globalization and potential borders. From this
analysis we have concluded the necessity of functionalizing the democratic
principle by merging systemic and self-organizational means to obtain, as an
approximation, a new form of democratic structure. We have developed nine
Axioms that describe the overall process, especially with regard to a
personalized, fractionalized decision process, which culminates in the
technology-driven world-nation-map that guarantees not only optimization by
the decisions of individuals but also increasingly high ethical standards.
With this, the role of future technological applications has been projected
into different research fields, from computation, communication and security
over production, economy and law to ambient intelligence and the smart world.
Finally, researchers around the globe are invited to contribute in research
and application to further develop and realize this vision of one increasingly
human, hypersocial, hyperfunctional, hyperrational and hyperintegrated world.
The outlook is the state, nation or system in the pocket, for a powerful and
truly, increasingly strong, neutral or thoroughly equilibrated world, as the
new "natural" state of the state, nation or system, which can be chosen by the
individual from the world-nation-(trade)-map. This will then be the basic
future technology of personalized future democracy.

References
1. Aristoteles. Organon. Hofenberg (2016)
2. Bauernhansl, T., ten Hompel, M., Vogel-Heuser, B. (eds.): Industrie 4.0 in Produktion,
Automatisierung und Logistik. Springer, Wiesbaden (2014). https://doi.org/10.
1007/978-3-658-04682-8
3. Granig, P., Hartlieb, E., Heiden, B. (eds.): Mit Innovationsmanagement zu Indus-
trie 4.0. Springer, Wiesbaden (2018). https://doi.org/10.1007/978-3-658-11667-5
4. Heiden, B.: Wirtschaftliche Industrie 4.0 Entscheidungen - mit Beispielen - Praxis
der Wertschöpfung. Akademiker Verlag, Saarbrücken (2016)
5. Heiden, B., Knabe, T., Alieksieiev, V., Tonino-Heiden, B.: Production organiza-
tion: some principles of the central/Decentral dichotomy and a witness application
example. In: Arai, K. (eds.) Advances in Information and Communication. FICC
2022. Lecture Notes in Networks and Systems, vol. 439, pp. 517–529. Springer,
Cham (2022). https://doi.org/10.1007/978-3-030-98015-3_36
6. Heiden, B., Tonino-Heiden, B.: Diamonds of the orgiton theory. In: 11th Interna-
tional Conference on Industrial Technology and Management (ICITM). Oxford,
UK (2022). Online
7. Heiden, B., Tonino-Heiden, B.: Emergence and solidification-fluidisation. In: Arai,
K. (ed.) IntelliSys 2021. LNNS, vol. 296, pp. 845–855. Springer, Cham (2022).
https://doi.org/10.1007/978-3-030-82199-9_57
8. Heiden, B., Tonino-Heiden, B.: Lockstepping conditions of growth processes: some
considerations towards their quantitative and qualitative nature from investiga-
tions of the logistic curve. In: Arai, K. (eds.) Intelligent Systems and Applications.
IntelliSys 2022. Lecture Notes in Networks and Systems, vol. 543, pp. 695–705.
Springer, Cham (2023). https://doi.org/10.1007/978-3-031-16078-3_48
9. Heiden, B., Tonino-Heiden, B., Alieksieiev, V.: System ordering process based on
Uni-, Bi- and multidirectionality – theory and first examples. In: Hassanien, A.E.,
Xu, Y., Zhao, Z., Mohammed, S., Fan, Z. (eds.) BIIT 2021. LNDECT, vol. 107, pp.
594–604. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-92632-8_55
10. Heiden, B., Tonino-Heiden, B., Alieksieiev, V., Hartlieb, E., Foro-Szasz, D.:
Lambda computatrix (LC)—towards a computational enhanced understanding of
production and management. In: Yang, X.-S., Sherratt, S., Dey, N., Joshi, A. (eds.)
Proceedings of Sixth International Congress on Information and Communication
Technology. LNNS, vol. 236, pp. 37–46. Springer, Singapore (2022). https://doi.org/10.1007/978-981-16-2380-6_4
11. Heiden, B., Volk, M., Alieksieiev, V., Tonino-Heiden, B.: Framing artificial intelligence
(AI) additive manufacturing (AM). In: Procedia Computer Science, vol. 186,
pp. 387–394. Elsevier B.V. (2021). https://doi.org/10.1016/j.procs.2021.04.161
12. Heiden, B., Walder, S., Winterling, J., Perez, V., Alieksieiev, V., Tonino-Heiden,
B.: Universal Language Artificial Intelligence (ULAI), chapter 3. Nova Science
Publishers, Incorporated, New York (2020)
13. Luhmann, N.: Die Wissenschaft der Gesellschaft, 3rd edn. Suhrkamp, Verlag, Berlin
(1994)
14. Luhmann, N.: Die Gesellschaft der Gesellschaft, 10th edn. Suhrkamp Verlag, Frank-
furt/Main (2018)
15. Machiavelli, N.: Der Fürst / Il Principe. Philipp Reclam jun. Verlag GmbH (1986)
16. Pentland, A.: Building a New Economy: Data as Capital, MIT Press, Cambridge
(2021)
17. Pentland, A., Lipton, A., Hardjono, T.: Building the New Economy Data as Cap-
ital. MIT Press, Cambridge (2021)
18. Russell, B.: Philosophie des Abendlandes - Ihr Zusammenhang mit der politischen
und der sozialen Entwicklung. Europa Verlag Zürich, 3rd edn., History of Western
Philosophy (Routledge Classics) (2011). (englisch)
19. Scharmer, O., Käufer, K.: Leading from the Emerging Future - From Ego-System
To Eco-System Economies - Applying Theory U to Transforming Business, Society,
and Self. Berrett-Koehler Publishers Inc., San Francisco (2013)
20. Schmidlechner, K.: Überlegungen zur Geschichte und aktuellen Situation von
demokratischen Gesellschaften. Institut für Kinderphilosophie. 14.-17. Oktober
2021
21. Senge, P., Scharmer, C.O., Jaworski, J., Flowers, B.S.: Presence - Exploring Pro-
found Change in People, Organizations and Society. Nicholas Brealey Publishing,
London (2007)
22. smartfactory. http://www.smartfactory.de/. (Accessed 04 Apr 2014)
23. Tonino-Heiden, B., Heiden, B., Alieksieiev, V.: Artificial life: investigations about
a universal osmotic paradigm (UOP). In: Arai, K. (ed.) Intelligent Computing.
LNNS, vol. 285, pp. 595–605. Springer, Cham (2021). https://doi.org/10.1007/
978-3-030-80129-8_42
24. UN: Statement of essential human rights presented by the delegation of
Panama (1946). https://digitallibrary.un.org/record/631107?ln=en. (Accessed 28
Sept 2022)
25. UN. Allgemeine Erklärung der Menschenrechte (1948). https://www.humanrights.
ch/de/ipf/grundlagen/rechtsquellen-instrumente/aemr/. (Accessed 28 Sept 2022)
26. Villari, M., Fazio, M., Dustdar, S., Rana, O., Ranjan, R.: Osmotic computing: a
new paradigm for edge/cloud integration. IEEE Cloud Comput. 3, 76–83 (2016)
Using Regression and Algorithms
in Artificial Intelligence to Predict
the Price of Bitcoin

Nguyen Dinh Thuan(B) and Nguyen Thi Viet Huong

University of Information Technology, VNU-HCM, Ho Chi Minh, Vietnam


[email protected], [email protected]

Abstract. Cryptocurrency is a topic that is no longer strange to the
investment world. Bitcoin is considered a very famous cryptocurrency
and attracts a large amount of investment across the globe. Therefore,
in recent years, the field of Bitcoin investment has attracted much
research aimed at helping investors maximize profits. In this study, we
use regression and artificial-intelligence algorithms such as K-Nearest
Neighbors (K-NN), Neural Network (NN), Decision Tree (DT), Support
Vector Machines (SVM), Random Forest (RF), and Linear Regression (LR)
to predict the opening price of Bitcoin. We use hybrid models of the LR
algorithm with the K-NN, NN, DT, SVM, and RF algorithms to improve
Bitcoin price-prediction performance. The study results show that most
algorithms predict well and that the hybrid models give better
prediction results. This shows that the hybrid model has the potential
to be applied in practice to improve the accuracy of Bitcoin
opening-price prediction.

Keywords: Cryptocurrency prediction · Bitcoin prediction · Machine learning · Hybrid model

1 Introduction
In today's era of rapidly developing technology, the introduction of
cryptocurrencies is an inevitable part of society. The trend of using
cryptocurrencies has appeared only in recent years but is gradually becoming
dominant and promises to replace cash in the future. Cryptocurrencies that are
becoming familiar to investors, such as Ethereum [24], Ripple [20], and
especially Bitcoin [12], introduced by Satoshi Nakamoto in October 2008, have
made crypto money stand out more than ever. Bitcoin was born and became famous
thanks to blockchain technology, through which transactions can be made
directly without an intermediary organization. As a result, Bitcoin can better
secure cryptocurrency at a lower cost. The value of Bitcoin has been
proliferating in recent years, attracting large volumes of transactions and
investments into the sector worldwide. This makes investors willing to invest
in Bitcoin to obtain significant profits from this digital currency. Big tech
companies have gradually accepted Bitcoin as payment. https://www.cnbc.com/
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 426–438, 2023.
https://doi.org/10.1007/978-3-031-18461-1_29
Predict the Price of Bitcoin 427

In February 2021, Tesla announced that it had purchased $1.5 billion in bitcoin
and accepted Bitcoin as a payment for its cars. This has made the cryptocur-
rency investment market, especially Bitcoin, more exciting in recent years. Bit-
coin price prediction is always a hot topic, attracting many researchers. Our use of regression and artificial-intelligence algorithms focuses on solving the following problems:
– First, find the best prediction algorithm among the single models.
– Second, use the best performing algorithm among individual algorithms to
hybridize with other algorithms to improve prediction performance and accu-
racy.
– Finally, compare the two-hybrid and single models to find the model with the
best accuracy and performance to predict the future Bitcoin price accurately.
Artificial-intelligence algorithms and hybrid models will be tested to predict the opening price of Bitcoin. The models' prediction results will be compared to find the model with the best accuracy for predicting the Bitcoin price. This study uses Bitcoin transaction history data for the prediction.
The rest of the paper is organized as follows: Sect. 2 describes related works; Sect. 3 presents the artificial-intelligence algorithms used in this study; Sect. 4 presents the hybrid method; Sect. 5 covers the experiments and evaluation of results; Sect. 6 concludes and outlines future development directions.

2 Related Works
Many studies on Bitcoin price prediction have been done, using methods such as regression, machine learning, and deep learning to predict the Bitcoin price trend. These predictions are made on Bitcoin's transaction history data. The machine learning method is widely used and achieves promising results, especially for short-term prediction of the Bitcoin market trend.
Zheshi Chen et al. [2] used RF, XGBoost, Quadratic Discriminant Analy-
sis, SVM, and Long Short-term Memory models to predict the Bitcoin 5-minute
interval price. The best result obtained was 67.2% which outperformed the sta-
tistical method. The SVM model is still used in the research of Dennys C.A.
Mallqui and Ricardo A.S. Fernandes [11], and the authors [11] further used the
ANN model to predict the maximum, minimum, and closing price direction of
Bitcoin. Research results show that the best prediction model improves accuracy by more than 10%. Besides, this study also reports a mean absolute percentage error between 1% and 2%. In addition, the authors [7] use time series prediction
models such as ARIMA, FBProphet, and XG Boosting. This study shows that
the best predictive model is ARIMA, with an RMSE score of 322.4 and an MAE
score of 227.3. The authors [9] use SVM and LR models to predict Bitcoin prices
using a time series that includes daily Bitcoin closing prices from 2012 to 2018.
As a result, the SVM model has a better performance than the LR model in
Bitcoin price prediction.
428 N. D. Thuan and N. T. V. Huong

Besides, Zhenyuan Wu [22] uses a Convolutional Neural Network (CNN) and compares it with other machine learning models such as LR and KNN on a data set from the Kaggle website (https://www.kaggle.com/). The purpose of the study was to compare the prices of cryptocurrencies based on their intrinsic relationship to Bitcoin. The study results show that the best model, with an accuracy of more than 0.95, is CNN. The authors [17] used
two machine learning models, SVM and LR, on the Ether cryptocurrency's daily closing price time series. The results obtained from
the experiment are that the accuracy of SVM (96.06%) is higher than that of
the LR method (85.46%). In addition, SVM can achieve up to 99% accuracy by
adjusting the parameters.
Deep learning is a method used by many researchers today to predict cryp-
tocurrency prices in general and Bitcoin in particular.
Muhammad Rizwan et al. [19] researched Bitcoin price prediction using deep learning algorithms, applying a Gated Recurrent Unit (GRU) network to forecast the Bitcoin price in USD; their best model achieved a result of 94.70%.
Suhwan Ji et al. [8] conducted a comparative study of Bitcoin price pre-
dictions using deep learning. They used a deep neural network (DNN), an LSTM model, a convolutional neural network, and a deep residual network to predict the Bitcoin price. In regression, LSTM predicts slightly better than the other algorithms; in classification, DNN predicts better than the other algorithms.
Besides, the hybrid method between statistics and machine learning or deep
learning is being used in many studies because most of these hybrid models give
better accuracy and performance results than the single models.
Yuze Li et al. [10] studied the prediction of Bitcoin and algorithmic trans-
actions using econometric models, machine learning models, and deep learning
models. They proposed a new two-dimensional combined deep learning model
based on data analysis, VMD-LMH-BiGRU. As a result, the VMD-LMH-BiGRU
outperforms other models.
The authors [4] propose a new approach in this study called MRC-LSTM,
which combines a Multi-scale Residual Convolutional neural network (MRC) and
LSTM to perform Bitcoin closing price predictions. This experiment showed that
MRC-LSTM performed significantly better than many other types of network
topologies. In addition, this study also forecasts the addition of two cryptocur-
rencies, Ethereum and Litecoin.
The authors [13] use ARIMA and machine learning algorithms to predict
the next day's closing price of Bitcoin, and also use hybrid models between ARIMA and the machine learning algorithms. The study results show that most of the hybrid models have significantly better performance than the single models.
In addition, in the last three years, there have been several typical articles
on cryptocurrency price prediction research, and especially on Bitcoin, such as: Jen-Peng Huang [5] (2021) studied Bitcoin profit prediction based on the Data Mining method, using 26 input variables and an associated label variable. The methods used in this study include SVM, Deep Learning, and RF. As a result, RF is the best performing algorithm, with 70.5% accuracy.
Hakan Pabuccu et al. [14] (2020) predicted Bitcoin price movements by applying machine learning algorithms such as SVM, ANN, Naive Bayes (NB), RF,
and LR. The research results are that RF has the highest predictive efficiency,
and NB has the lowest predictive performance in continuous data. ANN has the
best predictive performance in discrete data, and the worst-performing algorithm
is NB.
Haerul Fatah et al. [3] (2020) used data mining to predict cryptocurrency
prices. The three cryptocurrencies used in this study are Bitcoin, Ethereum,
and NEO. The machine learning algorithms used are KNN, SVM, RF, DT, NN,
and LR. Experimental results show that the most accurate prediction algorithm
is SVM.
Most studies on the Bitcoin price use single models, but some suggest using hybrid models. So in this study, we predict the next day's opening price of Bitcoin with single models and propose hybrid models to improve prediction performance.

3 Artificial Intelligence Models

3.1 Linear Regression Model

Regression describes the relationship between a dependent variable and one or more independent variables. Regression includes many types of problems; among them is Linear Regression. Linear regression [1] was invented around 200 years ago, so it is considered one of the classic regression models. It describes a linear relationship between a dependent variable and one or more independent variables and has the following form:

$$Y = A + BX \tag{1}$$

In Eq. (1), Y is the dependent variable, X is the independent variable, A is the intercept, and B is the slope coefficient. The objective of this study is to use a linear regression model to predict prices from the respective independent variables.

3.2 Support Vector Machines Model

Support Vector Machines (SVM) were proposed by Vapnik et al. in the 1970s and became famous in the 1990s because SVMs perform exceptionally well in high-dimensional spaces. SVM belongs to the class of supervised learning problems. The goal is to find the optimal hyperplane, where the margin is the distance from the nearest point of a class to the separating hyperplane. The idea of SVM is that the margins of the two classes must be equal and as large as possible; this is illustrated in the equation below:
430 N. D. Thuan and N. T. V. Huong

$$\min_{w,\,b,\,\xi,\,\zeta^*} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \zeta_i^*) \tag{2}$$

where:
$$y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i$$
$$b + \langle w, x_i \rangle - y_i \le \varepsilon + \zeta_i^*$$
$$\xi_i,\, \zeta_i^* \ge 0, \quad i = 1, \dots, n$$
Support Vector Regression (SVR) [15] is a regression model that uses the
Support Vector Machine algorithm to predict the value of a continuous variable,
which in this study is the price of the cryptocurrency Bitcoin.
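For intuition, the objective in Eq. (2) can be evaluated directly for a one-dimensional predictor f(x) = w·x + b. The sketch below is an illustration, not the SVR solver used in the experiments, and the function name is ours; points inside the ε-tube contribute no slack, while points outside contribute linearly:

```python
# Direct evaluation of the Eq. (2) objective for a 1-D linear predictor;
# a toy illustration of the epsilon-insensitive loss, not an SVR solver.

def svr_objective(w, b, xs, ys, C=1.0, eps=0.1):
    """Regularisation term plus C-weighted epsilon-insensitive slack."""
    regulariser = 0.5 * w * w
    slack = sum(max(0.0, abs(y - (w * x + b)) - eps)
                for x, y in zip(xs, ys))
    return regulariser + C * slack

xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 2.0]
print(svr_objective(1.0, 0.0, xs, ys))           # perfect fit: only 0.5*w^2 = 0.5
print(round(svr_objective(0.0, 0.0, xs, ys), 6)) # flat line: slack dominates, 2.8
```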

3.3 K-Nearest Neighbors Model


The K-NN regression model [21] is based on the KNN model, one of the simplest methods in machine learning. KNN is called a non-parametric method: it makes no specific assumptions about the function to be learned. Prediction for a new instance is based on its nearest neighbors in the training data. The outstanding feature of this model is that K-NN can learn a complex function in a short time without losing information. KNN regression is based on calculating the mean of the K nearest neighbors, or on the inverse-distance weighted average of the K nearest neighbors. K-NN regression uses the following three distance measures for continuous variables:

Euclidean: $\sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$

Manhattan: $\sum_{i=1}^{k} |x_i - y_i|$

Minkowski: $\left(\sum_{i=1}^{k} |x_i - y_i|^q\right)^{1/q}$
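The three distances above are all instances of the Minkowski distance (q = 1 gives Manhattan, q = 2 gives Euclidean). A minimal K-NN regression sketch using it — illustrative only, with names of our own choosing; the experiments use scikit-learn's KNeighborsRegressor — looks like this:

```python
# K-NN regression via the Minkowski distance; averaging the targets of
# the k nearest training points, as described in Sect. 3.3.

def minkowski(x, y, q=2):
    """Minkowski distance between two equal-length feature vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def knn_predict(train_X, train_y, query, k=3, q=2):
    """Average the targets of the k training points nearest to `query`."""
    order = sorted(range(len(train_X)),
                   key=lambda i: minkowski(train_X[i], query, q))
    return sum(train_y[i] for i in order[:k]) / k

train_X = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
train_y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
print(knn_predict(train_X, train_y, (1.5,)))   # 1.0: all 3 neighbours are low
print(knn_predict(train_X, train_y, (11.0,)))  # 5.0: all 3 neighbours are high
```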

3.4 Decision Tree Model


The Decision Tree Regression model [18] is a regression model based on the Decision Tree model. It can handle both categorical and numerical data. Decision Tree Regression can be viewed as a set of rules in the form "IF-THEN". The idea of this algorithm is to split the data set into smaller subsets. Each inner node represents an attribute to check for incoming data. Each branch/subtree of a node corresponds to an attribute value of that node. Each leaf node represents a value of the dependent variable. Once the tree is learned, a new case is predicted by following the attributes from the root to a leaf.

3.5 Random Forest Model


The Random Forest Regression model [6] is a regression model based on the Random Forest model, a method proposed by Leo Breiman in 2001. This algorithm is easy to use and highly efficient. Random Forest works well on very large problems without overfitting, although its theoretical behavior is quite difficult to understand. The main idea of this algorithm is to predict based on the combination of many decision trees, by averaging all the predictions of the individual trees. Each of these trees is simple but random and grows differently, depending on the choice of training data and attributes.
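The averaging idea can be sketched with one-split "stump" trees, each fit on a bootstrap resample. This is only an illustration of the averaging principle under our own simplifications (real Random Forests grow full CART trees with random feature subsets; the experiments use scikit-learn's RandomForestRegressor):

```python
# Sketch of the Random Forest idea: fit many simple one-split trees on
# bootstrap samples and average their predictions.
import random

def fit_stump(xs, ys):
    """One-split regression 'tree': pick the threshold minimising squared error."""
    best = None
    for t in xs:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    if best is None:                      # degenerate sample: constant predictor
        mean_y = sum(ys) / len(ys)
        return lambda x: mean_y
    _, t, ml, mr = best
    return lambda x, t=t, ml=ml, mr=mr: ml if x <= t else mr

def fit_forest(xs, ys, n_trees=25, seed=7):
    """Average many stumps, each fit on a bootstrap resample of the data."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]   # bootstrap sample
        trees.append(fit_stump([xs[i] for i in idx],
                               [ys[i] for i in idx]))
    return lambda x: sum(tree(x) for tree in trees) / len(trees)

forest = fit_forest([1, 2, 3, 4, 5, 6], [1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
print(forest(2) < forest(5))  # True: the ensemble recovers the step in the data
```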

3.6 Neural Network Model


Neural Networks belong to the group of supervised learning problems. A Neural Network mimics the biological nervous system of the human brain: it is a network structure created from interconnected artificial neurons. The Neural Network Regression model [16] is a regression model based on a Neural Network. It can be thought of as a highly distributed, parallel information-processing structure that can learn, remember, and generalize from the training data.
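The learning mechanism can be shown with a single artificial neuron (identity activation) trained by gradient descent on squared error. This is only the core weight-update loop under our own simplifications, not the multi-layer network used in the experiments:

```python
# Gradient-descent training of one neuron with identity activation;
# on noiseless linear data it converges to the true weights.

def train_neuron(xs, ys, lr=0.01, epochs=2000):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = (w * x + b) - y      # prediction error for this sample
            w -= lr * err * x          # gradient step on the weight
            b -= lr * err              # gradient step on the bias
    return w, b

w, b = train_neuron([0, 1, 2, 3, 4], [2, 5, 8, 11, 14])  # data follows y = 2 + 3x
print(round(w, 3), round(b, 3))  # converges near 3.0 and 2.0
```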

4 Hybrid Methodology
4.1 Hybrid Model Based on LR with SVM, KNN, NN, DT, and RF
G. Peter Zhang [23] proposed a hybrid method between two models, ARIMA and ANN, to predict time series. Such time series data consists of two components: a linear component and a non-linear component. The idea of the hybrid approach is to combine these two components, represented by the following equation:

$$y_t = L_t + N_t \tag{3}$$

In (3), $y_t$ is the time series value, $L_t$ is the linear component, and $N_t$ is the non-linear component. The ARIMA model is first used to predict the linear component of the time series. The ANN model then predicts the non-linear component from the error values of the ARIMA predictions.
The following equation determines the error values from the ARIMA predictions:

$$e_t = y_t - \hat{L}_t \tag{4}$$

In (4), $e_t$ is the error value after using the predictive ARIMA model at time $t$, $y_t$ is the value of the time series at time $t$, and $\hat{L}_t$ is the predicted value of the ARIMA model at time $t$. The ANN model is then used to predict $e_t$, the error value obtained from the ARIMA prediction, as illustrated by the following equation:

$$e_t = f(e_{t-1}, e_{t-2}, \dots, e_{t-n}) + \varepsilon_t \tag{5}$$

In (5), $f$ is the non-linear function defined by the ANN model and $\varepsilon_t$ is a random error term at time $t$. From the two equations above, $\hat{N}_t$ is the predicted value of the non-linear component and $\hat{L}_t$ is the predicted value of the linear component. The forecast value at time $t$, $\hat{y}_t$, is illustrated in the following equation:

$$\hat{y}_t = \hat{L}_t + \hat{N}_t \tag{6}$$
From the above idea, we propose hybridizing the LR model to predict the
linear part and then using SVM, KNN, NN, DT, and RF models to predict
the non-linear part of the remaining data. The test results compare the hybrid
models with single models to find the best model to forecast the Bitcoin price.

4.2 Deployment

The input is the time series to be forecast, which consists of two parts: a linear component and a non-linear component. After the data goes through the preparation step (selecting the necessary attributes and preprocessing), the LR model is used for forecasting; its output is the linear component of the time series. The error values (the differences between the predicted and actual values from the LR model) are then modeled with the SVM, KNN, NN, DT, and RF models, respectively, to predict the non-linear component of the time series. The linear and non-linear results obtained from LR and from the SVM, KNN, NN, DT, and RF models are combined to give the final prediction result (Fig. 1).
The process of the hybrid model goes through the following steps:
Step 1: Prepare and preprocess the data and find the best model for the forecasted time series.
Step 2: Train the LR model on the training dataset and then make predictions on the test dataset. Calculate the error between the predicted and actual results.
Step 3: Use the SVM, KNN, NN, DT, and RF models, respectively, to predict the errors from Step 2.
Step 4: Combine the value predictions from Step 2 with the error predictions from Step 3 to obtain the predictions of the combined model.
Step 5: Evaluate the models using two metrics, RMSE and MAE, to find the model with the best prediction results.
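Steps 2–4 can be sketched compactly. In this illustration a closed-form linear fit stands in for the LR stage and a 1-nearest-neighbour error model stands in for the second-stage learner (SVM/KNN/NN/DT/RF in the experiments); all function names are ours:

```python
# Compact sketch of the hybrid procedure: linear stage, error model,
# then recombination as in Eq. (6): y_hat = L_hat + N_hat.

def fit_linear(xs, ys):
    """Closed-form least-squares intercept and slope (the LR stage)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def hybrid_fit_predict(train_x, train_y, test_x):
    a, b = fit_linear(train_x, train_y)                  # Step 2: linear stage
    errors = [y - (a + b * x) for x, y in zip(train_x, train_y)]  # e_t series

    def predict_error(x):                                # Step 3: error model
        i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
        return errors[i]

    # Step 4: combine the linear and non-linear predictions
    return [(a + b * x) + predict_error(x) for x in test_x]

train_x = [0, 1, 2, 3, 4, 5]
train_y = [0.0, 3.0, 4.0, 7.0, 8.0, 11.0]   # linear trend plus alternating bump
preds = hybrid_fit_predict(train_x, train_y, [2, 3])
print([round(p, 6) for p in preds])  # [4.0, 7.0]
```

At a training point the nearest-neighbour error is the point's own residual, so the combined prediction reproduces the training value exactly — which is the decomposition property the hybrid method relies on.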

5 Experiment

5.1 Bitcoin Dataset

The dataset used in this experiment is the Bitcoin dataset of the cryptocurrency exchange Binance, obtained from the CryptoDataDownload website (https://www.cryptodatadownload.com/). The dataset is one CSV file per currency, with hourly price history from 04:00 on August 17, 2017, to 00:00 on March 15, 2022. The Bitcoin dataset contains five attributes and 40,081 rows. The specific properties are presented in Table 1 below.

Fig. 1. Hybrid LR with SVM, KNN, NN, DT, and RF

5.2 Software Used


In this experiment, we use the Python language, with Google Colab as the supporting tool. We use Python libraries such as Pandas to process data in data frame form; Matplotlib to plot graphs visualizing the data (Fig. 2 below illustrates Bitcoin's opening price data); NumPy for math and matrix operations; and finally, the Scikit-learn library for the machine learning and regression models.

5.3 K-Fold Cross-Validation Method


Dividing a dataset into training and testing sets is fundamental in preparing data for a machine learning model, and testing the model against a validation set is necessary: if a model does not perform well on the validation set, it will perform poorly in practice. K-Fold Cross-Validation is therefore a technique that helps ensure the stability of a machine learning model. Cross-Validation holds out part of the data from a dataset to test the model (the validation set), while the remaining data is used to train the model (the training set).

Table 1. Properties in the Bitcoin dataset

Attribute Describe
Date Cryptocurrency trading day
Open Opening price/initial price of the cryptocurrency at a given time
High The highest price of the opening price
Low The lowest price of the opening price
Close Closing price/last price of a cryptocurrency at a given time

Fig. 2. Graph of opening price of Bitcoin

K-Fold Cross-Validation is a method in which the dataset is divided into k subsets; in each round, k − 1 subsets are used to train the model (the training set) and the remaining subset is used to validate it (the validation set). The model's score per fold is then averaged to gauge the performance of the whole model. This method is illustrated in Fig. 3 below.

Fig. 3. Illustrations for the K-fold cross-validation method
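The fold construction described above amounts to index bookkeeping: every observation serves as validation data exactly once. A toy sketch (the experiments use scikit-learn's KFold; the helper name below is ours):

```python
# Generate (train_indices, val_indices) pairs for k-fold cross-validation.

def k_fold_indices(n, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

folds = list(k_fold_indices(10, 5))
print(len(folds))                 # 5 folds
print(folds[0][1], folds[4][1])   # [0, 1] [8, 9]
```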



5.4 Evaluation Forecasting Models


To evaluate the accuracy of the single and hybrid regression models, we use two metrics, mean absolute error (MAE) and root mean squared error (RMSE). The algorithm with the lowest values of these two metrics has the best performance. The formulas for MAE and RMSE are:

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{7}$$

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{8}$$
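Equations (7) and (8) translate directly into code; a small sanity check on a toy pair of series (illustrative — the experiments use library implementations of these metrics):

```python
# Direct implementations of MAE (Eq. 7) and RMSE (Eq. 8).
import math

def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(y_true, y_pred)) / len(y_true))

y_true = [100.0, 102.0, 101.0, 105.0]
y_pred = [101.0, 102.0, 103.0, 104.0]
print(mae(y_true, y_pred))             # 1.0
print(round(rmse(y_true, y_pred), 4))  # 1.2247
```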

5.5 Predicting the Price of Bitcoin


The open price forecast test results of the hourly Bitcoin price history dataset
from 04:00 on August 17, 2017, to 00:00 on March 15, 2022, will be presented in
Table 2 for the single models and Table 3 for hybrid models.

Table 2. Error of single models

Model | RMSE (k=5) | MAE (k=5) | RMSE (k=10) | MAE (k=10)
SVM   | 11659.529  | 9981.240  | 9107.910    | 8200.751
DT    | 206.610    | 46.298    | 142.754     | 36.700
NN    | 24.327     | 13.286    | 20.218      | 12.990
KNN   | 232.066    | 70.294    | 169.977     | 58.278
RF    | 200.119    | 42.189    | 141.032     | 36.034
LR    | 8.415      | 0.527     | 6.390       | 0.518

Test of Single Models: LR has the best forecast by RMSE and MAE. For k = 5, LR has an RMSE of 8.415 and an MAE of 0.527. At k = 10, LR has an RMSE of 6.390 and an MAE of 0.518.

Table 3. Error of hybrid models

Hybrid model | RMSE (k=5) | MAE (k=5) | RMSE (k=10) | MAE (k=10)
LR + SVM     | 8.410      | 0.506     | 6.381       | 0.480
LR + DT      | 16.109     | 1.171     | 15.618      | 1.191
LR + NN      | 25.216     | 15.406    | 12.081      | 6.566
LR + KNN     | 10.224     | 0.954     | 8.081       | 0.879
LR + RF      | 13.231     | 1.415     | 10.264      | 1.263

Test of Hybrid Models: LR + SVM has the best forecast by RMSE and MAE. For k = 5, LR + SVM has an RMSE of 8.410 and an MAE of 0.506. At k = 10, LR + SVM has an RMSE of 6.381 and an MAE of 0.480.
In both cases, k = 5 and k = 10 give consistent results, and the hybrid models give better results than the single models. The hybrid LR model with SVM, KNN, NN, DT, and RF uses the LR method to predict the linear component of the time series and uses the remaining algorithms to predict the non-linear component. This dramatically improves performance compared to the single models; the results show that the hybrid models give smaller errors than the single models. In particular, LR + SVM gives the best results in this experiment, because LR is the best single-model predictor and SVM predicts well on small, low-volatility data, so it performs well when predicting the error values. Conversely, the worst-performing single algorithm in this experiment is SVM, because its predictive ability on large datasets is poor.
We can see that the hybrid model gives excellent hourly Bitcoin price pre-
dictions. The hybrid models result in improved accuracy and performance over
the single models.
The graph in Fig. 4 below, obtained with k = 5, compares the actual values, the values predicted by LR, and the values predicted by the hybrid LR + SVM model.

Fig. 4. The graph illustrates the actual value, LR model, and hybrid model LR+SVM

6 Conclusion
Cryptocurrencies in general, and Bitcoin in particular, are a potentially lucrative investment area, but they also carry many risks because their fluctuations are unpredictable. Regression and artificial-intelligence algorithms have helped create prediction methods with high accuracy. Experimental results show that the hybrid model has higher accuracy and better performance than the single model. Therefore, the hybrid model has great potential for predicting the future price of Bitcoin and other cryptocurrencies.

This experiment is one research direction for forecasting problems based on single and hybrid models. Bitcoin price prediction is also influenced by external factors, including news, expert opinions, etc.; these factors could be combined with our models to improve Bitcoin price prediction in the future. In addition, well-known time series algorithms such as ARIMA, LSTM, etc. could be compared, as single and hybrid models, to find a high-precision algorithm and increase predictability in this area. In future data mining, we expect more accurate algorithms and prediction methods to make it easier for investors and businesses to make investment decisions and get the best returns.

Acknowledgments. This research is funded by Vietnam National University


HoChiMinh City (VNU-HCM) under grant number DS2022-26-23.

References
1. Ali, M., Swakkhar, S.: A data selection methodology to train linear regression
model to predict bitcoin price. In: 2020 2nd International Conference on Advanced
Information and Communication Technology (ICAICT), pp. 330–335. IEEE (2020)
2. Chen, Z., Li, C., Sun, W.: Bitcoin price prediction using machine learning: an
approach to sample dimension engineering. J. Comput. Appl. Math. 365, 112395
(2020)
3. Fatah, H., et al.: Data mining for cryptocurrencies price prediction. J. Phys. Conf.
Ser. 1641, 012059 (2020)
4. Guo, Q., Lei, S., Ye, Q., Fang, Z., et al.: MRC-LSTM: a hybrid approach of multi-
scale residual CNN and LSTM to predict bitcoin price. In: 2021 International Joint
Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2021)
5. Huang, J.-P., Depari, G.S.: Forecasting bitcoin return: a data mining approach.
Rev. Integr. Bus. Econ. Res. 10(1), 51–68 (2021)
6. Inamdar, A., Aarti, B., Suraj, B., Pooja, M.S.: Predicting cryptocurrency value
using sentiment analysis. In: 2019 International Conference on Intelligent Com-
puting and Control Systems (ICCS), pp. 932–934. IEEE (2019)
7. Iqbal, M., Iqbal, M.S., Jaskani, F.H., Iqbal, K., Hassan, A.: Time-series prediction
of cryptocurrency market using machine learning techniques. EAI Endorsed Trans.
Creative Technol. 8(28), e4–e4 (2021)
8. Ji, S., Kim, J., Im, H.: A comparative study of bitcoin price prediction using deep
learning. Mathematics 7(10), 898 (2019)
9. Karasu, S., Altan, A., Saraç, Z., Hacioğlu, R.: Prediction of bitcoin prices with
machine learning methods using time series data. In: 2018 26th Signal Processing
and Communications Applications Conference (SIU), pp. 1–4. IEEE (2018)
10. Li, Y., Jiang, S.: Hybrid data decomposition-based deep learning for bitcoin pre-
diction and algorithm trading. Available at SSRN 3614428 (2020)
11. Mallqui, D.C.A., Fernandes, R.A.S.: Predicting the direction, maximum, minimum
and closing prices of daily bitcoin exchange rate using machine learning techniques.
Appl. Soft Comput. 75, 596–606 (2019)
12. Nakamoto, S.: Bitcoin: a peer-to-peer electronic cash system. In: Decentralized
Business Review, p. 21260 (2008)

13. Nguyen, D.-T., Le, H.-V.: Predicting the price of bitcoin using hybrid ARIMA
and machine learning. In: Dang, T., Küng, J., Takizawa, M., Bui, S. (eds.) Future
Data and Security Engineering. FDSE 2019. Lecture Notes in Computer Science,
vol. 11814, pp. 696–704. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-35653-8_49
14. Pabuçcu, H., Ongan, S., Ongan, A.: Forecasting the movements of bitcoin prices:
an application of machine learning algorithms. Quant. Finan. Econ. 4(4), 679–692
(2020)
15. Peng, Y., Albuquerque, P.H.M., de Sá, J.M.C., Padula, A.J.A., Montenegro, M.R.:
The best of two worlds: forecasting high frequency volatility for cryptocurrencies
and traditional currencies with support vector regression. Exp. Syst. Appl. 97,
177–192 (2018)
16. Phaladisailoed, T., Numnonda, T.: Machine learning models comparison for bitcoin
price prediction. In: 2018 10th International Conference on Information Technology
and Electrical Engineering (ICITEE), pp. 506–511. IEEE (2018)
17. Poongodi, M.: Prediction of the price of Ethereum blockchain cryptocurrency in
an industrial finance system. Comput. Electr. Eng. 81, 106527 (2020)
18. Rathan, K., Sai, S.V., Manikanta, T.S.: Crypto-currency price prediction using
decision tree and regression techniques. In: 2019 3rd International Conference on
Trends in Electronics and Informatics (ICOEI), pp. 190–194. IEEE (2019)
19. Rizwan, M., Narejo, S., Javed, M.: Bitcoin price prediction using deep learning
algorithm. In: 2019 13th International Conference on Mathematics, Actuarial Sci-
ence, Computer Science and Statistics (MACS), pp. 1–7. IEEE (2019)
20. Saadah, S., Whafa, A.A.A.: Monitoring financial stability based on prediction of
cryptocurrencies price using intelligent algorithm. In: 2020 International Confer-
ence on Data Science and Its Applications (ICoDSA), pp. 1–10. IEEE (2020)
21. Singh, H., Parul, A.: Empirical analysis of bitcoin market volatility using supervised
learning approach. In: 2018 Eleventh International Conference on Contemporary
Computing (IC3), pp. 1–5. IEEE (2018)
22. Wu, Z.: Predictions of cryptocurrency prices based on inherent interrelationships.
In: 2022 7th International Conference on Financial Innovation and Economic Devel-
opment (ICFIED 2022), pp. 1877–1883. Atlantis Press (2022)
23. Zhang, G.P.: Time series forecasting using a hybrid ARIMA and neural network
model. Neurocomputing 50, 159–175 (2003)
24. Zoumpekas, T., Houstis, E., Vavalis, M.: Eth analysis and predictions utilizing
deep learning. Exp. Syst. Appl. 162, 113866 (2020)
Integration of Human-Driven
and Autonomous Vehicle: A Cell
Reservation Intersection Control Strategy

Ekene Frank Ozioko1(B), Kennedy John Offor2, and Akubuwe Tochukwu Churchill3
1
Computer Science Department, Enugu State University of Science and Technology,
Agbani, Nigeria
[email protected], [email protected]
2
Electrical and Electronic Engineering Department, Chukwuemeka Odumegwu
Ojikwu University, Uli, Nigeria
[email protected]
3
Computer Engineering Department, Enugu State Polytechnic, Iwolo, Nigeria
[email protected]

Abstract. With the advent of driverless automobiles, there is a tremendous opportunity to increase the effectiveness of the traffic system and user comfort, and to reduce traffic accidents caused by human error. It seems inevitable that driverless and human-driven vehicles will coexist. The difficulties in building new AV roads will be overcome by the coexistence of traffic, and this method builds on already-existing road infrastructure. Attempts to combine human-driven and autonomous cars have raised a crucial issue in road traffic management: what effects can be expected, in terms of efficiency and safety, when a certain number of autonomous vehicles coexist with human-driven vehicles? Two-dimensional (2D) lateral and longitudinal vehicle behavior defines traffic coexistence. The car-following model idea is adapted and improved upon to effectively depict a mixed traffic system. For secure, smooth, and
tions of this research include a guide to mixed traffic integration pattern,
an extension of the existing 1-dimensional homogeneous car-following
model strategies to a 2-dimensional heterogeneous traffic system, an
improvement in human-driven vehicle performance when autonomous
vehicle inter-vehicle distance is adjusted, and a method for harmonis-
ing speed in mixed traffic. A physics agent traffic simulator is created
and used to test three traffic management strategies, including the traffic
light method, the collision avoidance with safe distance method, and this
proposed method, in order to determine the benefits of the cell reserva-
tion traffic control strategy. Experiments with various vehicle type ratios
were done to validate the model. The collected findings show that the
cell reserve method outperforms the alternatives in terms of performance
gain.

c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 439–476, 2023.
https://doi.org/10.1007/978-3-031-18461-1_30
440 E. F. Ozioko et al.

Keywords: 2-dimensional traffic control · Cell reservation system ·


Mix-traffic cooperation level · Speed harmonisation · Intersection
capacity utilisation

Overview

The format of this paper begins with an introduction to the mix-traffic man-
agement in Sect. 1. Section 2 follows the introduction with an assessment of the
state-of-the-art for a combination of human-driven and autonomous vehicle mod-
elling. In a 4-way intersection with a merging T-junction with a priority lane, the
suggested mix-traffic control approach via cell reservation strategy is explained
in Sect. 3. In this part, the behavior of autonomous cars (AV) with kind driving
characteristics and human-driven vehicles (HV) with aggressive driving charac-
teristics is modeled. A thorough discussion of the underlying nonlinear traffic
flow characteristics relating to the car-following model and safe distance under
study is also included here. Section 5 presents information about the carried out
experiments, a discussion, and an assessment of the outcomes. In Table 2, the
findings that corroborate the research premise are presented. The outcomes of
the three traffic control strategies that were compared to the suggested alter-
native are presented here. Section 8 concludes with some last thoughts and a
summary of our findings.

1 Introduction

To minimize delay and lower the likelihood of accidents at a road intersection, a new traffic control model in a mixed environment at the merging road, utilizing the cell reservation model, is developed. In general, traffic intersections are regarded as a major source of congestion. As a result, controlling
and optimizing mix-traffic flow at road crossings is essential as a foundation
for the integration of autonomous and human-driven vehicles. Additionally, as
autonomous vehicles have grown in popularity, issues with mixed traffic have
drawn academics to create a variety of related technologies to aid in the inte-
gration of these vehicles. Autonomous vehicles have recently been considered
as an alternate solution to several issues with road traffic, from lowering travel
time to providing convenient and safe driving. In order to improve traffic flow,
autonomous cars can exchange information like position and velocity in real-time
with one another or with a centralized controller. While human-driven vehicles
use traffic signals and the corresponding stochastic driver behavior, this fea-
ture enables the prediction of its speeds in managing traffic at the intersection.
The unpredictable nature of human driving behavior contributes to a delay in
decision-making when in motion. As a solution to traffic issues, autonomous
vehicles can exchange real-time car movement parameters and enhance HV per-
formance.
Integration of Human-Driven and Autonomous Vehicle 441

Fig. 1. A 4-way road intersection with double lanes

The suggested 4-way road intersection model is shown in Fig. 1, with vehicle
trajectories denoted by green arrows and the cross collision site, also known as
the reservation node, shown by a red dot. Given that AVs and HVs occupy the
same junction space, AVs are vehicles with wireless communication signs. The
wireless communication sign is outside the intersection, the control unit is the
box outside the circle, and L denotes the lane identification. The green trajectory
lines cross one other at these intersection cross-collision sites from various road
lanes or trajectories before continuing on to their final destination after passing
through the intersection region. The main duty of junction control is to assign
reservation nodes to vehicles in a seamless and sequential manner without causing a collision. Figure 2 presents a 3-way intersection model extracted from the 4-way model, in which the route's merging segment connects to the main road and carries a continuous one-way flow of traffic. The type of intersection, which
is determined by the number of road systems and lanes involved, heavily influ-
ences the traffic intersection management strategy. The inquiry for this study is
centered on two different kinds of intersections: 3-way and 4-way intersections.
The number of vehicle trajectories taken into account while deciding which intersection management tactics to use varies. In this scenario, drivers must choose
a trajectory based on the junction management rule depending on the goal or
destination of the vehicle. A small error in path trajectory judgment at the inter-
section location carries a substantial danger of many accidents. Depending on
the road and junction control strategy model, there are delays for vehicles at the
intersection.

Research Question. Since AVs are developing and HVs are not going away any-
time soon, it seems clear that AVs and HVs will have to coexist for a while. This
442 E. F. Ozioko et al.

study’s investigation and analysis are based solely on simulated traffic data that
was parameterized using the suggested methodology and conducted based on
traffic theories. Research on the integration of human-driven and autonomous
vehicles is still primarily focused on a few concerns. These are a few of these
difficulties:

– Because of the growing buzz surrounding autonomous vehicle integration, it is essential to develop a strategy and set of rules for their coexistence. When
intersection cells are reserved, society will be able to better understand how
practical and safe it will be for autonomous and human-driven vehicles to
coexist.
– How can we approach the inquiry given that mix-driving necessitates several
intricate social interactions with a predictable impact on traffic?

We present the flaw in the microscopic simulation of hybrid traffic, provide a fix, and demonstrate future research directions by addressing the aforementioned
research inquiries. The validation carried out while addressing the study issues
shows that the advanced traffic simulator appears realistic. Various 3-way and
4-way road intersection scenarios have been used in a few comprehensive investi-
gations. Even though there is still much work to be done in this area of research,
this strategy is designed to be competitive and make a useful contribution.

Hypothesis. When the road intersection cells are reserved in order, traffic flows
smoothly. Creating an approach to represent the 2-dimensional lateral and longitudinal driving behavior required for a realistic mixed traffic flow model is the main challenge of the driver behavioural model. The interactions between cars on the road are governed by elements such as lateral vehicle displacement, driver
behavior, and the environmental impact of adjacent vehicles. The idea behind
vehicle collaboration is to use data obtained by using vehicle-to-vehicle com-
munication links to modify the movements of the vehicle, decrease idling time,
and minimize fuel consumption rate at a road intersection. In most cases, it is
presumable that autonomous vehicles approaching the intersection can interface
with the infrastructure and get data from the incoming traffic flow.

Justification for the RN Strategy. Following a thorough analysis of the literature on the effects of autonomous vehicles on traffic and taking into account the
benefits and drawbacks of various techniques used to manage traffic at merging
highways, the following are the reasons we chose this approach:

– Combining the safe-distance model, road-vehicle communications, and intelligent vehicle communications (IVC) results in a reliable solution to the car-following model in mixed traffic.
– Unmanned Vehicle: In order to improve traffic flow, items in their environ-
ment can communicate with one another and visualize or read important
information about other objects, particularly cars. For example, they can see
how far away a car is from a reference car position.

– Only AVs, not HVs, are capable of adjusting the inter-vehicle space, which
can be done to improve traffic flow after passing through the merging zone.
If the vehicle ahead breaks down, the following vehicle must not collide with it. To align the braking pattern of HVs with the merging of AVs, a longer
response delay and greater braking force are needed after the merge. An AV
and an HV both slowed down to 1 unit less than the speed of the car in front
of them in both situations.
Studying the effects of autonomous vehicles (AVs) on human-driven vehicles
(HVs) at a merging single-lane road with a priority lane and car-following model
may be regarded reasonable based on the aforementioned arguments.

2 Review of the State of the Art


Intelligent transportation systems are being used in autonomous vehicles, which
use sensors to perceive their surroundings and make the best decision possible in
real time to prevent collisions and mishaps. According to [1], road capacity can
be raised with a rise in vehicle cooperation level when their behaviors are homo-
geneous, but in this case, we are looking at a heterogeneous traffic behavior. Due
to the fundamental differences in the behavior of the two kinds of vehicles, this
complicates the study of mixed traffic. Moreover, the simulation results from the
study by [10] showed that when the number of motorized vehicles is greater than
70%, the road capacity can be boosted by 2.5 times by combining automated
(AVs) and human-driven (or manually-controlled) vehicles. Also, the works of
[4,9,23] demonstrate how the stability and effectiveness of traffic flow can be
increased by vehicles forming a platoon. The cell reservation-based scheduling
method reserves intersection cells to vehicles in order, using a reservation-based
scheduling methodology. In order to determine the best vehicle entry sequence
into the intersection based on specified priority rules, the scheduling is formu-
lated as an optimization problem. By enabling the vehicles to travel at any pace
up to the real-world city traffic speed restriction of 10 mph, the approach is more
realistic.
A reservation cell assignment and movement prediction goal of the suggested
mix-traffic management system is to assign reservations to cars. Traffic perfor-
mance is enhanced by extending the current road infrastructure with collision
point detection and control unit cell reservations. To keep one car safe from the
other, this tactic adheres to the safe distance model (the safe space is dependent
on car type). Along with this, the vehicles also scan the roadways for nodes
(where road-vehicle communication is used) and measure the distance between
each approaching vehicle and its reference node position. The strategy employs
a first in, first out policy to access the junction. Based on the type of vehicle
and which is closest to a reservation cell, the right of way is assigned. The examination of vehicle evolution and behaviour patterns shows that human-driven vehicles behave more aggressively and react to stimuli later than autonomous ones. A T-junction with merging and priority lanes serves as the basis
for the model in Fig. 2, which methodically and accurately simulates mix-traffic

flow while examining the effects of our approach on various traffic control mea-
sures. It is well known that HVs are composed of radical drivers who frequently
behave aggressively when they interact with AVs. By compelling AVs to stop and yield the right of way rather than waiting for them to pass, HVs are given priority access ahead of AVs. At intersections where a minor
street (non-priority route) meets a major roadway, the inter-vehicle distance is
typically taken into account (priority road). A priority road vehicle may roll
through an intersection if it has just arrived; otherwise, depending on the type
of automobile, it may start the movement from rest. A space between cars in
a conflicting traffic pattern is presented to human drivers who wish to perform
merging manoeuvre.
Depending on the vehicle mix, the pattern of signalised street vehicle arrivals
generates varied time intervals of different values. From [7], the inter-vehicle spac-
ing is often measured in seconds, and it is the distance between the first car’s
back bumper and the next one’s front bumper. The period between vehicles
arriving at a stop line on a non-priority route and the first vehicle pulling up to
the priority road is the “space” being discussed here. The earlier study by [2]
indicates that modelling delays for homogeneous traffic show a linear connection
for the same type of vehicle. A traffic collision results from the coexistence of
mixed traffic and non-uniform car behaviour, which cannot be predicted by such
linear models. Because of homogeneity in vehicle behaviour, this homogeneous traffic situation can result in smaller inter-vehicle spaces being available. Additionally, a discernible rise in the occupation time of low-priority movements has been observed.
Cell reservations at intersection collision spots, on the other hand, ensure
safe and ideal intersection management and may even be a less expensive way to
increase productivity in a mixed-traffic situation. At every time instance (0.1 s), all vehicles in the intersection environment are checked in order to update the intersection status. This kind of traffic management will effectively relieve and manage traffic congestion at road intersections. The underlying presumption is that
autonomous agents solely control how a vehicle navigates. The sophisticated
traffic simulator calculates the various delays that occur when moving vehicles
through an intersection. For the performance evaluation of the method for com-
parison, the intersection performance measures were defined. It is anticipated
that with the addition of autonomous vehicle capabilities like cruise control,
GPS-based route planning, and autonomous steering, it will be easier to govern
multi-agent behavior in mixed traffic, which will boost HV performance.

3 Methodology
The road model Fig. 2 outlines a single lane merging road system with its physical
properties.
How can coexistence occur when there are cars with diverse driving styles
using the same road system without significantly reducing the effectiveness of
traffic flow? In this situation, a merging T-intersection at an angle of 45◦ is

Fig. 2. Merging road model

being considered for cars sharing space to test the hypothesis. As seen in Fig. 2,
this situation involves vehicles coming from a different road merging onto a pri-
ority road at a junction in between them. The likelihood of a collision cannot
always be determined from the distance between the two objects in this sce-
nario. They think about the possibility that they would eventually collide if they
were traveling in the same direction. The main issue will inevitably come from
human-controlled vehicles since they lack the ability to be self-aware of their sur-
roundings and because their behavior is stochastic and more prone to error than
other types of vehicles. The proposed model took into account combining the
two approaches to traffic management, centralization and decentralization. In
the centralized approach, drivers and vehicles interact with a central controller
and the traffic signal to designate the intersection's right-of-way access priority. In contrast, in the decentralized model, drivers and vehicles interact and
bargain for priority right-of-way access. The impact of autonomous vehicles on
intersection traffic flow has been the subject of several research studies [6]. The
author in [11] proposes the optimisation of traffic intersection using connected
and autonomous vehicles. Also, [5] considered the impact of autonomous cars on
traffic with consideration of two-vehicle types, which are distinguished by their
maximum velocities; slow (Vs ) and fast (Vf ), which denotes the fraction of the
slow and fast vehicles respectively.

3.1 Coordination Protocol for Reservation Node

The driving agents can “phone ahead” and book the spaces they require along
their route using the reservation node system. Within the intersection region,
the SIR is divided into an n × n grid of reservation tiles, where n is referred to as the granularity of the reservation node system. At each time step, only one automobile may reserve each RN. The following information is included in a request to utilize an RN:

– Vehicle type
– Vehicle arrival time
– Vehicle current velocity, though in the simulation, we maintain a maximum
velocity of 10 m/s for optimal performance.
– vmax and vmin
– amax and amin
– Vehicle trajectory with details of the requested RNs
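The request fields above can be collected into a simple record; a minimal sketch in Python, where the class name, field names, and example values are illustrative rather than taken from the paper:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RNRequest:
    """One reservation-node (RN) request sent to the control unit (CU)."""
    vehicle_type: str      # "AV" or "HV"
    arrival_time: float    # s, expected arrival at the first RN
    velocity: float        # m/s, capped at 10 m/s in the simulation
    v_min: float           # m/s
    v_max: float           # m/s
    a_min: float           # m/s^2
    a_max: float           # m/s^2
    trajectory: List[int]  # ordered ids of the requested RNs

# Example: an AV requesting RNs 4 and 8 along its route.
request = RNRequest("AV", 3.2, 10.0, 0.0, 10.0, -3.0, 2.0, [4, 8])
```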

A step-by-step summary of how a vehicle schedules a reservation node is as follows:

– A request to schedule the vehicle is issued to the CU after the vehicle has arrived. The CU verifies that the new reservation does not conflict with any of the RN time windows along the vehicle's trajectory on the SIR.
– Let RNk be the k-th RN in the SIR, where 1 ≤ k ≤ n and n is the number of RNs along the vehicle route. Let tak be the requested time of arrival at RNk and tdk the departure time from RNk. At every time instance, the CU checks whether the requested RN is available for reservation.
– If the RN is accessible, the CU will verify the availability of the following RN
along the vehicle’s trajectory at the following time step, and so on until the
vehicle exits the SIR.
– If the road camera detects HVs from the intersection, HVs are given prefer-
ential access to the reservation based on their vehicle type.
– If the request is unsuccessful, or the requested RNk is not available in the time interval [tak, tdk], the CU performs a search for the next available time for the first requested RN. The CU iterates this process for each subsequent RN request.
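The check-then-book logic of the steps above can be sketched as follows; this is a simplified illustration (the dictionary-of-sets bookkeeping and integer time steps are assumptions, not the paper's implementation):

```python
def schedule(reservations, trajectory, arrivals, departures):
    """CU pass over one vehicle's request, per the steps above.
    `reservations` maps rn_id -> set of reserved integer time steps;
    `arrivals`/`departures` give the requested [ta_k, td_k] window for
    each RN on the vehicle's trajectory."""
    # First pass: every requested window must be conflict-free.
    for rn, ta, td in zip(trajectory, arrivals, departures):
        booked = reservations.setdefault(rn, set())
        if any(t in booked for t in range(ta, td + 1)):
            return False   # CU would then search the next free time
    # Second pass: commit all windows.
    for rn, ta, td in zip(trajectory, arrivals, departures):
        reservations[rn].update(range(ta, td + 1))
    return True

reservations = {}
assert schedule(reservations, [4, 8], [2, 5], [3, 6]) is True
assert schedule(reservations, [4], [3], [4]) is False  # step 3 already taken
```

On failure, the CU would retry the first requested RN with the next available start time, as the last step describes.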

The reservation node framework that coordinates traffic flow at the intersec-
tion could be explained using Fig. 3. This scenario only uses one RN to explain
the background of the reservation node process. From Fig. 3, it is assumed that
car A has the minimum time to the reservation node, so it is allowed to keep
moving. Cars B and C, which are further away than car A, are next checked to
see if the distance between them and another automobile is below the minimum
distance; if it is, the brakes are applied; if not, the vehicle is free to continue
travelling. For cars B and C, the method described above is repeated. Car B is located closer to the reservation node than car C, hence in this straightforward example, car C comes last. The procedure is repeated until the last remaining car has passed the reservation node.

Fig. 3. Schematic of the reservation cell

Data Structure for the Reservation Node Approach (Reservation of Future Time Cells): The following is a description of the
structure of the gathered data, their connections, and the storage, actions, or
functions that can be performed on the data in the algorithm: As represented in
Fig. 4, in essence, one can reserve 10 s because the control unit (CU) stores an
array of the upcoming 100 time-steps (100 ms intervals) for each RN.
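A minimal sketch of this storage scheme, assuming one Python list of 100 slots per RN (0 meaning free, any other value being the reserving car's id):

```python
STEPS = 100     # the CU stores the next 100 time-steps per RN
STEP_MS = 100   # each step covers 100 ms, i.e. a 10 s horizon

# One slot array per RN id; 0 = free, otherwise the reserving car's id.
rn_slots = {rn: [0] * STEPS for rn in (4, 8)}

def reserve(rn, start, end, car_id):
    """Book steps [start, end) of an RN for car_id if all are free."""
    if any(rn_slots[rn][start:end]):
        return False              # at least one step already reserved
    for t in range(start, end):
        rn_slots[rn][t] = car_id
    return True

assert reserve(8, 10, 15, 1) is True
assert reserve(8, 12, 13, 2) is False  # overlaps car 1's window
```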
The mixed behaviour creates a highly complex traffic scenario that has a
considerable negative influence on the capacity and effectiveness of intersection
traffic flow. Since each vehicle behaves differently, uses a distinct communication
medium, and abides by a straightforward regulation at intersection zones, vehi-
cles with one behavioural pattern are dealt with a consistent protocol at inter-
sections. However, research reveals that because they are frequently caused by
human error, car accidents near merging highways are among the most prevalent
parts of traffic studies. This work has extensively researched the coexistence of
mixed behaviour, safe distance, and cell reservation within the suggested traffic
model framework. The majority of models created to analyse how traffic behaves
at intersections with mixed vehicle types are built to prevent collisions between
autonomous and human-driven vehicles, but they degrade traffic flow efficiency.
The influence of autonomous automobiles on merging roadways in a mixed envi-
ronment is explored in this article using various car mix proportions in addition
to researching mixed traffic behaviour. The crucial element for this strategy is
the vehicle occupation time, which is the amount of time a vehicle or group of
vehicles takes to cross at the intersection. At these intersections, the occupation
time of the vehicle mix ratio was investigated and analysed. Additionally, the
information on mixed-traffic volume and the percentage of occupation time by
each vehicle as shown in Table 2 were extracted from the simulation.

Fig. 4. Scenario for managing conflicts in RN method

The two road systems taken into consideration are the United Kingdom's drive-on-the-left policy and the convention that drivers keep to the right. The issues manifest themselves uniformly across all of the
model’s scenarios, regardless of the type of vehicle. Due to the merging angle,
this model typically has trouble detecting and treating the car that is following
from the other lane. By making sure that the car is looking at an angle that
covers the merging space, this issue is being solved. Looking at Fig. 2, it shows
merging T-junction whose non-priority road feeds into the priority road at node
8, not at 90◦ but at 45◦ , and this road is at a bend from the lane leading up to
it at node 12. Although there is a traffic light signal at this T-junction, the autonomous vehicles rely on communication with other vehicles and the road infrastructure, while human drivers utilise their vision to detect signals from traffic lights and other vehicles.
According to their communication capability, vehicles are classified as HV or
AV within the junction zone. The vehicle presence detector detects the presence
of vehicles, and the intersection controller classifies and specifies the vehicle type
based on communication capability. There is a control zone at the intersection
where the central controller gathers information from the connected vehicles
(lane id, time, position, velocity, and the number of vehicles in a platoon). The
sample data structure below represents the data from the RN algorithm:
Cars 1 and 2 are initially driven from road nodes 11 and 7, respectively, in
order to go from node 8 to their final destination at node 9:
– Car1: RN data structure = [0, ..., 0] (100 zeros)
– Car2: RN data structure = [0, ..., 0] (100 zeros)
While the HVs employ the traffic light signal control, all the AVs interact with one another and with the central controller as they approach the intersection to be assigned an RN. For efficiency and central control, the two forms of vehicle control media are kept in sync. The traffic schedule is determined by a number of predetermined
choices, including the access protocol, the distance/position of the cars from the
merging point, the kind of cars, and the number of cars in the platoon or queue.
The established mathematical relationship model describes how the mixed behavior of the two vehicle types relates to occupancy time and traffic flow in a merging single-lane road system. At a merging T-junction with a priority
road segment, a mix-traffic model is suggested, in which case we have a vehicle
mix of human-driven and driverless cars entering the intersection at the same
time. The suggested model for a one-way merging T-junction with a priority
road segment and the nodes’ dimensions is shown in Fig. 2. The common goal
or target for the mixed-vehicles on the two road system with start nodes (7 and
11) is node 9. At node 8, the car leaves the portion with two roads and merges
onto the priority road. Orienting vehicle trajectories from start to target is the
intersection’s primary function. Both the dynamic headway and the difference
in driving behavior are taken into consideration. As a result, driving behavior
can be divided into two categories: driving aggressively for humans and driving
gently for autonomous vehicles.
The resulting equations were applied to calculate the inter-vehicle and crit-
ical safe distances for reckless driving (HV). The queuing distance is calculated
using an existing method of clearing behaviour approach according to [7]. It is
also demonstrated that taking into account both reaction time and aggressive
driving behavior at the same time makes the calculation of the safe distance
more realistic.
To optimize the traffic model, the safe distance, car-following, and platooning
rules are applied. The suggested mix traffic model uses the safe distance and car-
following models by taking into account the position and dynamic headway of
the leading and preceding vehicles in this situation, both of which are on the
present lane or will merge into a single node. Based on the foregoing, the central
controller’s protocols for determining the traffic flow schedule under the following
model underlying protocols are as follows:
– The car movement priority is assigned to the road between nodes 7, 8 and 9
in Fig. 2.
– If the merging point (node 8) in Fig. 2 of the road system experiences vehicle
arrivals at the same time, then the priority to cross the merging point is given
to the road with a Human-driven vehicle in front.
– If both roads have the same vehicle types at the front, the priority road takes
precedence.
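The three protocol rules above can be condensed into a small decision function; the names and the encoding of the two roads as "A" (the priority road, nodes 7-8-9) and "B" are illustrative:

```python
def right_of_way(front_a, front_b, a_is_priority=True):
    """Decide which road proceeds when both lead vehicles reach the
    merging point (node 8) at the same time, per the rules above."""
    if front_a == "HV" and front_b != "HV":
        return "A"                        # HV in front gets priority
    if front_b == "HV" and front_a != "HV":
        return "B"
    return "A" if a_is_priority else "B"  # same types: priority road wins

assert right_of_way("AV", "HV") == "B"    # the road led by an HV moves first
assert right_of_way("AV", "AV") == "A"    # tie broken by the priority road
```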

Design of the Road Model


The sum of all route lengths included in the model yields the length of the road.
The following formula is used to determine how long the road routes are:
We assume that if the route is horizontally straight, like 1 − 2 − 4 − 5, then the route length is the difference of the x coordinates of its end nodes:

x5 − x1    (1)

While if the route is vertically straight, the route length is the analogous difference of the y coordinates.
Therefore, the length of the road is calculated thus:

lroad = (1−2−4−5) + (11−12−14−15) + (7−6−8−9) + (17−16−18−19) + (12−8) + (2−18) + (16−4) + (6−14) + (12−16) + (2−6) + (8−4)

lroad = 600 + 600 + 600 + 600 + 49.5 + 106.1 + 49.5 + 106.1 + 106.1 + 49.5 + 106.1

Therefore: lroad ≈ 2972.9 m, lcar = 4.5 m (average), v = 10 m/s
ncars = lroad / (S + lcar )    (2)

where the safe distance S is 5 m for AVs, 7 m for HVs during platooning, and 3 m after merging. The moment after a merge occurs when the vehicles are traveling straight ahead at a constant speed.
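Substituting the numbers above into Eq. 2 gives the road's holding capacity for each safe-gap setting; a quick check in Python:

```python
l_road = 2972.9   # m, total road length computed above
l_car = 4.5       # m, average car length

def n_cars(safe_gap_m):
    """Eq. 2: number of cars the road can hold for a given safe gap S."""
    return int(l_road / (safe_gap_m + l_car))

# Safe gaps from the text: 5 m (AVs), 7 m (HVs platooning), 3 m after merging.
print(n_cars(5), n_cars(7), n_cars(3))  # 312 258 396
```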

Vehicle Model
There are two (2) types of vehicles being considered:

1. Autonomous vehicles (AVs) equipped with intelligent transportation systems (ITSs). These ITSs adopt technologies such as sensors and the Internet of Things.
2. Human-driven vehicles (HVs) use a human driver who uses their senses of
sight and hearing to keep an eye out for traffic signals from traffic signaling
systems, rather than an intelligent transportation system.

Interaction Between HV and AV: The AV is designed to be driven gently, with
the driver leaving all obstacle avoidance and environmental mitigation to the
vehicle. The following functions are included in the AV driving system:

– High-precision maneuvering and obstacle avoidance allow for seamless control of the speed and acceleration.
– The AV has a safe distance of 3 s.

While the HV driving was developed as a standard driving system with the
following features.

– The AV, which is a real-time system, responds to stimuli around 6 s faster than human drivers do.
– The HV has a safe distance of 5 s.
– The HV has a longer stopping distance than the AV; however, both depend on the vehicle's speed at the time.

Due to the inherent disparities in vehicle behavior between AVs and HVs,
it will be difficult to construct a safe distance model that results in collision-
free traffic flow. Human drivers are less accurate, more prone to errors, and have
stochastic behavior that makes them unpredictable. According to [22] given that
autonomous vehicles react almost instantly, whereas human drivers need roughly
6 s to react to unexpected occurrences, the following space should be maintained
between autonomous vehicles:

sr = v · tr (3)
While that for human-driven vehicles will be:

sr = v · (tr + 6)    (4)

where 6 s is the reaction time for human drivers [18].
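A sketch of Eqs. 3 and 4, reading the human case as v · (tr + 6) so that the stated 6 s enters as the human driver's added reaction time (the function name is illustrative):

```python
def reaction_distance(v, t_r, human=False):
    """Distance covered during the reaction phase (Eqs. 3-4).
    Human drivers carry the extra 6 s reaction time from the text."""
    return v * (t_r + 6.0) if human else v * t_r

# At 10 m/s with a 0.3 s base reaction time:
assert abs(reaction_distance(10.0, 0.3) - 3.0) < 1e-9               # AV
assert abs(reaction_distance(10.0, 0.3, human=True) - 63.0) < 1e-9  # HV
```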


To address the longitudinal and lateral mixed-vehicle behaviour process,
the created model incorporated the microscopic and macroscopic vehicle level
of vehicle modelling. The author in [15] noted that while the 2-dimensional
behaviour of heterogeneous cars has an impact on the intersection capacity,
the roadway and traffic have an impact on driving behaviour features. This
circumstance forces drivers to regulate their vehicles’ longitudinal and lateral
manoeuvres at merging places. When compared to the car-following model for
uniform traffic behaviour, this bidirectional behavioural characteristic is sophis-
ticated and results in attentive guiding, filtering, tailgating, and coexistence. In
order to assess traffic behaviour and create an all-inclusive numerical prototype,
a thorough analysis of the traffic parameters at the microscopic level is required.
The two controlling procedures or techniques that make up the mix-traffic
simulation methodologies are as follows:

1. Longitudinal Control for Car Following model (Fig. 5): One of the core tenets
of the car-following model is that for a given speed, “V” (mi/hr), vehicles
follow one another with an average distance, “S,” (m). In order to access the
Car-following model’s throughput, this parameter is important. The average
speed-spacing relation in Eq. 5 proposed by [19] deals with the longitudinal
characteristics of the road and is related to the assessment of the single-lane
road capacity “C” (veh/hr) in the following way:

C = 100 · V / S    (5)
where the number 100 denotes the intersection’s default maximum carrying
capacity.
However, the average spacing relations could be represented as:

S = α + βV + γV 2 (6)

where α = vehicle length L, β = reaction time T, and γ = the inverse of a following vehicle's average maximum deceleration, to allow for a sufficient safety distance.
2. A car-following model’s macroscopic and microscopic behaviours are affected
by the vehicle’s lateral control. A car-following model created to simply affect
its management of the longitudinal pattern is affected laterally by the lateral
control, which results in lateral interference [17]. In this AVHV control model,
the main goal of the lateral behaviour is to address the features of driver

behaviour in a mixed vehicle environment. The AVHV control introduces the


coupling model between lateral and longitudinal vehicle dynamics through
velocity vx control process and the front wheel steering angle λi derived from
the steering angle βv . The relationship between the vehicle velocity v, the
longitudinal velocity components vx , and the vehicle’s side slip angle θ is
represented in Eq. 7.

vx = v · cos θ (7)
In addition, the front-wheel steering angle λi is obtained from the steering-wheel angle βv and the steering ratio iu , as represented in Eq. 8.

λi = βv / iu    (8)
To handle longitudinal and lateral driving behaviour successfully at road
crossings, a combination of these two approaches is essential. The optimal veloc-
ity function was applied by the longitudinal car-following model to relax the equi-
librium value of the distance between vehicles. Additionally, after a vehicle cuts
in front, there are still issues with high acceleration and deceleration; however, the Intelligent Driver Model corrects this issue. The lateral model determines
if lateral vehicle control is possible, necessary, and desirable by maintaining the
safe distance braking procedure. According to [12], the lateral approach model
is targeted on a streamlined decision-making process using acceleration.
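The relations in Eqs. 5–8 can be written out directly; the parameter values below (α, β, γ and the steering ratio) are illustrative placeholders, not values from the paper:

```python
import math

def spacing(v, alpha=4.5, beta=1.0, gamma=0.02):
    """Eq. 6: average speed-spacing relation S = alpha + beta*V + gamma*V^2."""
    return alpha + beta * v + gamma * v * v

def capacity(v):
    """Eq. 5: single-lane capacity C = 100 * V / S."""
    return 100.0 * v / spacing(v)

def longitudinal_velocity(v, slip_angle_rad):
    """Eq. 7: v_x = v * cos(theta)."""
    return v * math.cos(slip_angle_rad)

def front_wheel_angle(steering_wheel_angle, steering_ratio):
    """Eq. 8: lambda_i = beta_v / i_u."""
    return steering_wheel_angle / steering_ratio
```

Under these placeholder coefficients, for example, spacing(10) evaluates to 16.5, so the capacity at V = 10 is roughly 60.6.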

Vehicle Queue
Queue calculates the amount of time a vehicle must wait before receiving the
right of way at a junction from another vehicle. Transportation planning rule
(TPR) may be necessary for operational analysis, depending on the intersection-
specific conditions and at the city’s discretion. TPR is used in queuing assess-
ments for transportation system plans. Stop-and-go traffic, slower speeds, longer
travel times, and increased vehicular queuing are features of traffic congestion.
The quantity of cars waiting at any intersection can be used to measure this.
It is the result of an intersection’s cumulative effects over a period of time. The
time difference between the average vehicle occupancy was used to analyze the
vehicle delay time for the three intersection control systems.

Car Following Model with Safe Distance


The car-following model keeps the leading vehicle's behavior pattern. The model's attributes and analysis pattern in Fig. 5 show how a human reacts in a traffic situation, represented by the driver's longitudinal behaviour of following a leading vehicle and maintaining a safe gap between vehicle groups.
In a car-following model, the driving behavior depends on the immediate
vehicle’s ideal speed in front but not entirely on the leader. Since lane changes
and overtaking involve lateral behavior, this model does not take those into
account. The following three points could be used to characterize the behavior
of the car-following model in detail:

Algorithm 1: Car Behaviour Algorithm (Collision-Free Method)


Data: Default Gentle behaviour of AV, Aggressiveness in human drivers
psychology (quantified by random values )
Result: AVs and HVs Behaviour
1 for Every HV : do
2 Assign aggressiveness with the following attributes;
3 Randomised Reaction time ;
4 Randomised Safe distance (in time);
5 if The Vehicle is AV then
6 Maintain the constant Reaction time;
7 Maintain the constant Safe distance (in time);
end
8 if Due to their identical estimated arrival times (EAT), AV and HV must
compete for a road space (e.g. RN, traffic light or CCP) then
// (apply priority considerations);
9 Assign priority to HVs to move;
10 Decelerate the AV;
11 Then move the next Car (AV);
12 if he predicted arrival times for the two vehicles are different (EAT)
then
13 First shift the car with the shortest EAT;
end
14 At Intersection;
15 Vehicle to Vehicle and Infrastructural Communication Guides AV;
16 HV is guided by the traffic light control;
17 The CU sync the 2 control methods
end
18 if Emergency situation occurs then
The AV uses defensive driving, slowing down or speeding up as needed.;
end
end

Fig. 5. Car following model with safe distance

– Since there is no other vehicle to affect its speed, the leading car can increase
its speed to the desired level.
454 E. F. Ozioko et al.

– Because drivers aim to maintain a fair distance or time between vehicles, the
speed of the leading vehicle mostly determines the status of the following
vehicle.
– To avoid a collision, the braking procedure applies varying amounts of braking
force.

Conditions on Which the Safe Distance Depends

1. The braking manoeuvre is always carried out with continuous deceleration;
the maximum deceleration and the comfortable deceleration are equivalent.
2. There is a constant reaction time tr of 0.3 s for AVs and a randomised
reaction time of 0.3 to 1.7 s for HVs.
3. For safety reasons, all vehicles must maintain a constant gap.
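The reaction-time assignment in conditions 1-3 can be sketched in a few lines. The function and constant names below are illustrative, not from the paper; only the numeric values (0.3 s constant for AVs, 0.3-1.7 s randomised for HVs) come from the text:

```python
import random

AV_REACTION_TIME = 0.3          # s, constant for autonomous vehicles (condition 2)
HV_REACTION_RANGE = (0.3, 1.7)  # s, randomised range for human drivers (condition 2)

def assign_reaction_time(vehicle_type, rng=random):
    """Return the reaction time for a vehicle according to conditions 1-3."""
    if vehicle_type == "AV":
        return AV_REACTION_TIME
    # HVs get a randomised reaction time modelling driver aggressiveness
    return rng.uniform(*HV_REACTION_RANGE)
```

A seeded `random.Random` instance can be passed as `rng` to make simulation runs reproducible.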

In order to better understand the mixed-traffic moving behaviour, in which
vehicle platooning is employed to balance the traffic flow, we offer a novel math-
ematical model with aggressiveness elements and variable inter-vehicle distance.
This model addresses the idea that a motorist recognises a lead vehicle and
follows it at a lower speed. According to [3,8,16,24], when determining what
effect changes to the driving conditions would have on traffic flow, the ability
to observe and estimate a vehicle's response to its predecessor's behaviour in a
traffic stream is crucial. The following two presumptions are necessary for the
follow-the-leader concept to work:

– The collision avoidance strategy mandates that a driver keep a safe distance
from other moving vehicles, as seen in Fig. 6.
– The distance between the cars is directly proportional to their speed.

Let δs_{n+1}^t represent the distance available for the (n+1)th vehicle,
δx_safe represent the safe distance, and
v_{n+1}^t and v_n^t represent the velocities.
Therefore, the gap required for safety is given by

    δs_{n+1}^t = δx_safe + T · v_{n+1}^t    (9)

where T = sensitivity coefficient.
However, Eq. 9 above can be expressed as:

    x_n^t − x_{n+1}^t = δx_safe + T · v_{n+1}^t    (10)

When the above equation is differentiated with respect to time t:

    v_n^t − v_{n+1}^t = T · a_{n+1}^t    (11)

    a_{n+1}^t = (1/T) · [v_n^t − v_{n+1}^t]    (12)
Based on the UK transport authorities, the model prototype's random values
of 0.3 to 1.7 s were selected for the safe driving distance (in time) for people [21].
Using the sensitivity coefficient term produced by successive model generations,
we have

    a_{n+1}^t = [α · (v_n^t)^{s_e} / (x_n^t − x_{n+1}^t)^l] · [v_n^t − v_{n+1}^t]    (13)

where l = headway exponent,
s_e = speed exponent,
α = sensitivity coefficient.
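The sensitivity-based update in Eqs. (12)-(13) can be sketched as follows. Parameter names are illustrative; with `se = 0`, `l = 0` and `alpha = 1/T` the expression reduces to the linear follower law of Eq. (12):

```python
def follower_acceleration(v_n, v_n1, x_n, x_n1, alpha=1.0, se=0.0, l=1.0):
    """Follower acceleration a_{n+1} from Eq. (13):
    a_{n+1} = alpha * v_n^se / (x_n - x_{n+1})^l * (v_n - v_{n+1}).
    v_n, v_n1: leader/follower speeds; x_n, x_n1: leader/follower positions."""
    gap = x_n - x_n1          # inter-vehicle spacing, assumed positive
    return alpha * (v_n ** se) / (gap ** l) * (v_n - v_n1)
```

For example, with a sensitivity coefficient T = 1.5 s, a 15 m/s leader and a 12 m/s follower give an acceleration of (1/1.5) · 3 = 2 m/s², regardless of gap.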
Figure 6 is a background explanation of the recommended UK Highway Code
safe distance for a vehicle. According to the method's baseline, the process of
braking and stopping a human-driven car travelling at 30 mph takes about
23 m; with the autonomous vehicle's 0.1 s thinking time, this is not the case.
The stopping distance s comprises the thinking distance (the distance travelled
between the moment the driver decides to brake and the moment the brakes are
applied) and the braking distance, which begins at the point where the brakes
start to reduce the speed of the vehicle by initiating the deceleration process.
The stopping time (the length of time it takes the car to stop) is also accounted
for when calculating the braking distance.

Fig. 6. Safe distance description for HV

According to [14], many researchers have devoted their time to modelling driv-
ing behaviour, analysing conflict processes, and enhancing traffic safety. The
S.I. units of metres, seconds, and kilograms constitute the foundation for all
values. To aid in the prediction of the car's motion, consideration is centred on
differentiating between conservative driving and an optimistic driving style. In
order to drive cautiously, a vehicle must be able to stop completely when the
vehicle in front of it stops abruptly or completely, as would occur in a crash; in
this scenario, the leading vehicle should maintain a minimum distance difference
of 30 m. In [13], when driving with an optimistic attitude, it is presumed that
the vehicle in front would brake as well, and maintaining a safe distance will
take care of the issue. During the reaction time, the vehicle covers a distance of

    s_r = v · t_r    (14)
The safe separation between vehicles is set to be variable for the HVs and
constant for the AVs, based on the aforementioned hypotheses. The safe distance
figures, which are expressed in seconds, capture the distance corresponding to
the current vehicle speed. The implication of Condition 1 is that the leading
vehicle's required stopping distance is given by

    s = v_1^2 / (2 · a)    (15)

From Condition 2 it follows that, to come to a complete stop, the driver of the
considered vehicle needs not only the braking distance v^2/(2·b), but also an
additional reaction distance v · δt travelled during the reaction time (the time
needed to decode and execute the braking instruction).
Consequently, the stopping distance is given by

    δx = v · δt + v^2 / (2 · b)    (16)

Finally, taking the leader's stopping distance into account, Condition 3 is
satisfied if the gap s is no less than the necessary minimum net distance:

    δx = v · δt + v^2 / (2 · b) − v_1^2 / (2 · b)    (17)

The "safe speed" is determined by the speed v for which the equality holds
(the maximum speed):

    v_safe(s, v_1) = −b · δt + sqrt(b^2 · δt^2 + v_1^2 + 2 · b · (s − s_0))    (18)
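The safe-speed bound of Eq. (18) can be sketched directly; this assumes the Krauss-style form (identical deceleration b for leader and follower, minimum gap s0), and the default parameter values are illustrative rather than taken from the paper:

```python
from math import sqrt

def v_safe(gap, v_leader, b=4.5, dt=1.0, s0=2.0):
    """Maximum safe speed from Eq. (18).
    gap: current gap s [m]; v_leader: leader speed v1 [m/s];
    b: comfortable deceleration [m/s^2]; dt: reaction time [s]; s0: minimum gap [m]."""
    usable_gap = max(gap - s0, 0.0)  # never count the reserved minimum gap
    return -b * dt + sqrt(b * b * dt * dt + v_leader ** 2 + 2.0 * b * usable_gap)
```

At the minimum gap behind a stopped leader the safe speed is zero, and it grows monotonically with both the gap and the leader's speed, as the derivation requires.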

What occurs when the vehicle in front suddenly applies the brakes? In
order to stop and avoid a collision within the available space, the vehicle must
have enough time (response time) to deploy an automated brake. Using the 2 s
rule proposed by [20], a distance of 20 m to begin braking is ideal if v = 40 m/s
on the highway.

Condition for the Minimum Distance y [m] from the Lead Vehicle
The merging AV chooses to enter the intersection if the gap between the lead
vehicle and the following one exceeds the computed value of y.

For AV:
    y = v · t    (19)
where
– t [s] = transit time of the T-junction
– v [km/h] = velocity of the approaching vehicle

y can be related to the intersection capacity estimates by

    c = v · y    (20)
and
    y = l_car + t_reaction · v + a · v^2 · t    (21)

where
– l_car = vehicle length
– t = reaction time
– a = deceleration rate
– v = speed

According to the analysis equations shown before, the following formula can
be used to determine the inter-vehicle distance for the various types of cars:

For HV:
    y = v · (t + 1.8)    (22)
where the value 1.8 s is the inter-vehicle transit time for HVs.
Nevertheless, taking into account the human anxiety caused by AVs, we have
added a stopping distance d to ensure safety. We have

    y = v · t + d    (23)

where d is the safe distance.
In the interest of transparency, the stopping distance, composed of the reaction
and braking components, is itemised as:

    s_s = v_0 · t_l + v_0^2 / (2 · a_F)    (24)
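The merge-gap conditions of Eqs. (19), (22) and (23) can be combined into one decision sketch. The function names are illustrative, and the safety margin d is treated as additive, following the intent of Eq. (23):

```python
def min_merge_gap(v, t_transit, vehicle_type, d=3.0, t_hv_extra=1.8):
    """Minimum lead-vehicle gap y before a merging vehicle enters (Eqs. 19, 22, 23).
    v: approaching-vehicle speed [m/s]; t_transit: T-junction transit time [s];
    d: extra safety margin [m] for HV discomfort around AVs (assumed additive);
    t_hv_extra: extra inter-vehicle transit time for HVs [s]."""
    if vehicle_type == "AV":
        return v * t_transit                      # Eq. (19)
    return v * (t_transit + t_hv_extra) + d       # Eqs. (22)-(23)

def can_merge(gap, v, t_transit, vehicle_type):
    """The merging vehicle enters only if the observed gap exceeds y."""
    return gap > min_merge_gap(v, t_transit, vehicle_type)
```

For a 10 m/s approaching vehicle and a 2 s transit time, an AV needs a 20 m gap while an HV, with the 1.8 s surcharge and the 3 m margin, needs 41 m.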
The principle of "first in, first out" dictates that, in Fig. 8, the right of
way belongs to the human driver of the first vehicle, then to the driver of the
first autonomous vehicle, and so on, in ascending order of time of arrival. Note
also the similarities in the plots: cars 1 and 2 driving aggressively as they
approach a curve show a velocity pattern similar to that of cars 1 and 2 driving
gently as they slow down to maintain a safe distance.

Simulation Parameter Values


In order to simulate real-world traffic patterns and achieve a higher level of
control over the variables affecting experiment outcomes, with the road
dimensions as specified in Fig. 2, the following parameter values were used:
– Vmax = 10 m/s (maximum velocity)
– Amax = 9.9 m/s2 (maximum acceleration)
– Dmax = −9.9 m/s2 (maximum deceleration)
– MCar = 1200 kg (mass of car)
– Fm = 2200 N (moving force)
– Fb = 1200 N (braking force).
– C = 100 cars (intersection capacity)
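The parameter set above can be collected into one structure; the dictionary keys are illustrative names, and Newton's second law links the force and mass entries to the accelerations they imply:

```python
PARAMS = {
    "v_max": 10.0,      # m/s,   maximum velocity
    "a_max": 9.9,       # m/s^2, maximum acceleration
    "d_max": -9.9,      # m/s^2, maximum deceleration
    "m_car": 1200.0,    # kg,    mass of car
    "f_move": 2200.0,   # N,     moving force
    "f_brake": 1200.0,  # N,     braking force
    "capacity": 100,    # cars,  intersection capacity
}

# a = F / m: the steady accelerations implied by the force parameters
accel = PARAMS["f_move"] / PARAMS["m_car"]   # ~1.83 m/s^2
decel = PARAMS["f_brake"] / PARAMS["m_car"]  # 1.0 m/s^2
```

Both implied accelerations sit well below the 9.9 m/s² caps, so the caps only bind in emergency manoeuvres.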

3.2 Traffic Flow Model

The traffic state:

    q = k · v_t    (25)

(where q = volume, v = speed and k = density)

    v_k = v_f − (v_f / k_max) · k = v_f · (1 − k / k_max)    (26)

where v_f = free-flow speed and k_max = maximum traffic density.

Fig. 7. Two cars straight movement model

Fig. 8. Two cars straight movement model with braking

From Eqs. 25 and 26, we have:

    q_k = v_f · (k − k^2 / k_max)    (27)
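Equations (26)-(27) form the classic linear speed-density (Greenshields-type) model, and they are short enough to sketch directly. The default values of v_f and k_max below are illustrative, not from the paper:

```python
def greenshields_speed(k, v_f=10.0, k_max=100.0):
    """Eq. (26): speed falls linearly with density k."""
    return v_f * (1.0 - k / k_max)

def flow(k, v_f=10.0, k_max=100.0):
    """Eq. (27): q = v_f * (k - k^2 / k_max). Parabolic in k,
    zero at k = 0 and k = k_max, maximal at k = k_max / 2."""
    return v_f * (k - k * k / k_max)
```

The parabola makes the capacity point explicit: flow peaks at half the jam density, where each vehicle still moves at half the free-flow speed.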

Traffic Flow Procedure

– Populate the roads with autonomous vehicles and human-driven vehicles;
for the sake of simplicity, suppose that HVs are on road A and AVs are
on road B. Road A is a direct route, so the HVs do not need to negotiate
any turns or bends as they travel along it.
– Road B links to or merges with road A in the middle at node 8, which
is located after a turn and at a crossroads.
– The AVs slow down significantly as they approach the curve in order to
gauge, via the closest RVC server or node, how far away the other vehicle
(HV) may be, based on their proximity to intersection node 8.
– The RVC server determines, and this is of the utmost importance, how far
apart the two vehicles are from one another.
– The RVC then makes use of this information in order to award the RN to a
vehicle. It shows a traffic signal to the human driver in the HV that prompts
them to move, slow down, or stop, and it sends a signal to the autonomous
vehicle that tells it to decelerate, keep driving, or stop.
– As a consequence, other cars following a car that slows down, whether while
communicating with an RVC node, due to traffic, or while arriving at an
intersection, will also slow down to obey the safe-distance model by judging
how far they are from the car in front of them (which is where Inter-Vehicle
Communication applies).
– When they reach this location, two cars coming from opposite roads must
first comply with the rule of the merging algorithm before they can combine
into a platoon.

Vehicle Routing
This refers to a series of procedures carried out in sequential order to guide
cars efficiently from the starting point to the destination. When a vehicle
departs its origin node, it may take any one of a number of alternative routes
to reach its destination. The road node system is used to select traffic
trajectories in the planned road intersection in order to transport vehicles from
the starting node to the destination nodes.
The developed traffic routing system maps the vehicle's path and goal
effectively, beginning from the start node and linking all joining nodes as well
as the destination. Determining the best route depends on the type or design
of the road intersection as well as the prevailing traffic rules. The edges
indicate the vehicle trajectory, and the nodes stand in for lanes of traffic on
the road; the reservation nodes are situated along the edges. The routing
algorithm directs the vehicle path from its origin node all the way to its
destination node. Table 1 represents the routing process dictionary, where the
routing algorithm begins updating the nodes of the intersection configuration
using the road node catalogue in order to compute the node routing
requirement for each vehicle trajectory. A practical routing algorithm is devised
with the help of the UK traffic regulations in order to facilitate the movement of
cars from the starting point to the destination in an actual traffic scenario. The
road node system is used to develop the intersection-based routing protocols
designed for the vehicular communication process of picking a traffic path.

Table 1. Road layout table

RN id Node route RN id Node route


1: [11, 12, 14, 15] 7: [5, 4, 2, 1]
2: [11, 12, 8, 9] 8: [5, 4, 8, 9]
3: [11, 12, 16, 17] 9: [5, 4, 16, 17]
4: [7, 6, 8, 9] 10: [19, 18, 16, 17]
5: [7, 6, 2, 1] 11: [19, 18, 14, 15]
6: [7, 6, 14, 15] 12: [19, 18, 2, 1]
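Table 1 maps naturally onto a dictionary keyed by RN id, which is how the text describes the routing process dictionary being used; the helper function below is an illustrative sketch, not the paper's implementation:

```python
# Road layout dictionary from Table 1: RN id -> ordered node route
ROAD_LAYOUT = {
    1: [11, 12, 14, 15],   7: [5, 4, 2, 1],
    2: [11, 12, 8, 9],     8: [5, 4, 8, 9],
    3: [11, 12, 16, 17],   9: [5, 4, 16, 17],
    4: [7, 6, 8, 9],      10: [19, 18, 16, 17],
    5: [7, 6, 2, 1],      11: [19, 18, 14, 15],
    6: [7, 6, 14, 15],    12: [19, 18, 2, 1],
}

def routes_between(origin, destination):
    """All RN ids whose route starts at `origin` and ends at `destination`."""
    return [rn for rn, route in ROAD_LAYOUT.items()
            if route[0] == origin and route[-1] == destination]
```

Each of the four origin nodes (11, 7, 5, 19) fans out to three destinations, giving the twelve routes of the 4-way layout.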

In a 4-way intersection, the road node layout (Table 1) defines the road edges
that identify all of the possible route connections from the starting node to the
destination node for a given trajectory. In the simulation, the associated
modules, the Layout List and the Road System, are responsible for drawing the
road design by connecting lines between the nodes. The layout list specifies all
of the central nodes that appear in the road diagram, together with their
coordinates. In this scenario, a dictionary is created to record the navigation
pattern depending on the road layout and the regulations that govern traffic.
The system that leads each vehicle from the start node to the target node is
called the routing pattern mechanism, and it is guided by an estimate of the
routing traffic latency. The ICU makes use of the road nodes to compute the
following parameters: car position and the individual and total delays of the
vehicles in each lane; it then determines the routes that vehicles will take. Each
vehicle is responsible for defining its own itinerary, beginning at the starting
node and ending at the destination node.
Intersection state: this is defined as a column vector of all road lane delays.
From the traffic model in Fig. 2 with vertices (road lanes) L1 and L2 and
their corresponding connecting edges 7, 8, 9, 11, and 12, the traffic state at
time t is described by

    I_t = [L1_t, L2_t]^T    (28)

where L_{i−j}(t) is the delay from L_i to L_j as a function of time, representing
the dynamic nature of the traffic flow.

Vehicle Movement Algorithm

Looking ahead to how the cars choose their path to the destination, each
vehicle has a defined route. This route is created by identifying all of the
node-ids along the vehicle's trajectory between the starting node and the
ending node, and then analysing each node within each identified route based
on a metric function value calculated for that route. The metric function may
have parameters associated with each of the road nodes in the system,
including a node-to-node distance parameter, traffic movement regulations,
crossing time, and models for straight and curved movement.

Car Physics for Curved Movement: To simulate car movement at the curve as
described in Fig. 9, one needs some geometry and kinematics, taking forces and
mass into account. The curve movement model is illustrated in Fig. 9, which
describes how vehicles move to their coordinate positions. This experiment
would fail in the absence of the curve movement model, because the vehicles
must keep to their lane track at all times.

– Curve movement to turn left or right
– Length of the circle = 2π · radius
– Update time and degree
– Degree = 360 · car_speed / length of circle
– Car_speed = (vel_x^2 + vel_y^2)^0.5
– Actual degree = (time − initial_time) · degree
– Time after ending curve = (actual_degree − end_degree) / 360 · length of
circle
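The curve update rules above can be sketched as a single position function; the function signature and default centre coordinates are illustrative assumptions, while the circumference and degree-per-second relations follow the list directly:

```python
from math import pi, hypot, cos, sin, radians

def curve_position(radius, speed, t, t0=0.0, cx=0.0, cy=0.0, start_deg=0.0):
    """Position on a circular curve at time t, following the update rules above.
    circumference = 2*pi*radius; deg_per_s = 360 * speed / circumference."""
    circumference = 2.0 * pi * radius
    deg_per_s = 360.0 * speed / circumference
    angle = radians(start_deg + (t - t0) * deg_per_s)  # actual degree
    return cx + radius * cos(angle), cy + radius * sin(angle)

# Speed recovered from velocity components: speed = (vel_x^2 + vel_y^2) ** 0.5
assert abs(hypot(3.0, 4.0) - 5.0) < 1e-12
```

For instance, a car on a 10 m radius curve travelling at one quarter of the circumference per second sweeps 90° in one second, moving from (10, 0) to (0, 10).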

During platooning, the safe distance is maintained at 5 m for AVs and 7 m
for HVs, respectively.

For AVs: n_cars = 2972.9 / (5 + 4.5) = 312.93 (approx.), so n_cars = 312 cars
for AVs.

For HVs: n_cars = 2972.9 / (7 + 4.5) = 258.51 (approx.), so n_cars = 258 cars
for HVs.

Based on the above calculations, the road capacity for the different categories
of cars is as follows:

– capacity of the road for AVs = 396 cars
– capacity of the road for HVs = 312 cars
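The headcount formula underlying these figures, and the merged-traffic figure derived later (396 cars at a 3 m gap), is simply road length divided by gap plus car length, truncated to a whole vehicle. A sketch, with the 4.5 m car length and 2972.9 m route length taken from the calculations above:

```python
def lane_capacity(road_length, gap, car_length=4.5):
    """Number of cars that fit on a lane: road_length / (gap + car_length),
    truncated to whole vehicles."""
    return int(road_length / (gap + car_length))

ROAD = 2972.9  # m, total route length used in the text

# 5 m AV gap -> 312 cars; 7 m HV gap -> 258 cars; merged 3 m gap -> 396 cars
```

Shaving two metres off the platooning gap buys roughly 50-80 extra vehicles on this route, which is the capacity argument the section is making.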

Algorithm 2: Car Movement Algorithm

function: start-to-destination node movement;
Assign vehicle type upon entering the intersection zone;
1 for car movement is true do
    car_speed = car velocity multiplied by the car magnitude;
    car velocity on x-axis = speed · cos θ;
    car velocity on y-axis = speed · sin θ.
2   for car movement is false do
3     Decelerate by initialising the acceleration to zero;
      Stop
    end
    car acceleration on x-axis = 0.0;
    car acceleration on y-axis = 0.0;
    car_speed = car velocity multiplied by the car magnitude;
    car velocity on x-axis = speed · cos θ;
4   car velocity on y-axis = speed · sin θ;
5   for every next node that is a road node do
6     if the node is a valid RoadNode object then
7       check edges and append connected nodes to the destination list;
8       append this node to the destination lists of connected nodes;
9       Decelerate the car by multiplying the acceleration by 0;
        Stop
    end
end

When the front wheels turn at an angle θ while the car maintains a
constant speed, the vehicle traces a circular path. Maintaining a constant speed
for the vehicle while simulating the mechanics of turning at both low and high
speeds yields the best possible performance. It is possible for the wheels of a
car to have a velocity that does not correspond with the orientation of the
wheel: at high speed, a wheel may be pointing one way while the body of the
car is still moving in another direction. This means that a velocity component
is perpendicular to the wheel, which generates frictional forces.

After Merging: This is when all of the vehicles have converged onto a single
roadway, which is now shared by all drivers. From this point on, the minimum
safe distance between the two types of vehicles is kept at 3 m, for the following
reasons: we assume that there will be no overtaking and that the vehicles
maintain the same relative and constant speed.
Therefore, n_cars = 2972.9 / (3 + 4.5) = 396.39 (approx.), so n_cars = 396 for
both AVs and HVs. The route can accommodate a maximum of 396 vehicles at
once, including both AVs and HVs. This technique, on the other hand, is
primarily focused on the safety of intersections, specifically on the question

Fig. 9. Model of curved vehicle movement

of how to avoid a collision involving vehicles exhibiting varying behaviors and


roadways that are merging into one another.

Cell Reservation System Procedure

Any plan for the administration and control of road intersections should have as
its major focus the resolution of vehicle conflicts at those crossroads. The most
difficult issue, however, is how to manage a mix of human-driven and
autonomous vehicles, because the two distinct types of vehicles behave
differently and use different media for their control messages. This established
variance in car behaviour affects the efficiency of intersection management
and, in fact, decreases the intersection's capacity significantly when compared
to managing homogeneous car behaviour. As a result, the benefits of
autonomous cars are negatively impacted, which is a significant setback for the
industry. This variation in car behaviour places severe constraints on the two
car categories, as reflected in Figs. 7 and 8, compared to homogeneously
managing the same type of vehicles at an intersection. Note that the letter 's'
denotes speed and not velocity in this context, while the capital letter 'S' is
used to indicate distance in this article. Because scalar operations and
comparisons are easier than their vector counterparts, speed is used to decide
the outcome of this interaction.

    s_car_behind − a · t = s_car_in_front − 1    (29)

Because of this, the car that follows the one in front of it will have a lower
velocity (and speed) and, as a result, will be less likely to collide with the car
in front of it.

After carrying out an experiment across a number of iterations with varying
ratios of AVs and HVs, it was discovered that reducing the number of AVs
in traffic while simultaneously raising the number of HVs improved safety. On
both the minus and plus sides, each decrement and increment was 5%. In every
instance, the percentage composition of cars on the priority lane was equivalent
to that on the non-priority lane; that is, there were exactly 50% of each type of
vehicle on the priority lanes and on the non-priority lanes. Repeating this
experiment over 21 iterations with decreasing and increasing ratios of AVs and
HVs, respectively, produced the vehicle occupancy/congestion matrix in
Table 2. After a great deal of trial and error, the following were determined
to be the appropriate safe distance and platooning distances:

– s_safe = 3 m (converted to pixels from the screen as the default safe distance)
– Q_AV = s_safe + 3 = 6 m (queuing distance for autonomous vehicles)
– Q_HV = s_safe + 5 = 8 m (queuing distance for human-driven vehicles)

It is important to keep in mind that these values pertain to the pixel
representation displayed on a computer screen. The following are the reaction
thresholds and braking forces used for AVs and HVs:

– S_reaction_AV <= s_safe + 1 m <= 4 m (AV reaction threshold)
– F_reaction_AV = 60000 N (braking force)
– S_reaction_HV <= s_safe + 3 <= 6 m (HV reaction threshold)
– F_reaction_HV = 72000 N (braking force)

Intersection Capacity Assessment

The capacity of a junction and the traffic signal, which together define the
performance of the traffic system, are directly related to the efficiency of the
traffic flow.

capacity = maximum traffic volume:

    q = k · v_t    (30)

density:

    k = 1 / (v · T_h + L)    (31)

where T_h = time gap (temporal distance) and L = length of vehicle.

For HVs:

    C_h = q_max = v / (v · T_h + L)    (32)

Table 2. Vehicle ratio occupancy matrix

S/no | % of AV | % of HV | Occupancy time (s) | Time difference | Mean time difference
 1   |  100    |   0     | 193.1              | –               | 1.2
 2   |   95    |   5     | 194.1              | 1.0             | 1.2
 3   |   90    |  10     | 195.1              | 1.0             | 1.2
 4   |   85    |  15     | 196.1              | 1.0             | 1.2
 5   |   80    |  20     | 197.1              | 1.0             | 1.2
 6   |   75    |  25     | 198.1              | 1.0             | 1.2
 7   |   70    |  30     | 199.1              | 1.0             | 1.2
 8   |   65    |  35     | 200.1              | 1.0             | 1.2
 9   |   60    |  40     | 201.1              | 1.0             | 1.2
10   |   55    |  45     | 202.1              | 1.0             | 1.2
11   |   50    |  50     | 203.7              | 1.6             | 1.2
12   |   45    |  55     | 204.5              | 0.8             | 1.2
13   |   40    |  60     | 206.0              | 1.5             | 1.2
14   |   35    |  65     | 207.5              | 1.5             | 1.2
15   |   30    |  70     | 209.0              | 1.5             | 1.2
16   |   25    |  75     | 210.5              | 1.5             | 1.2
17   |   20    |  80     | 211.0              | 0.5             | 1.2
18   |   15    |  85     | 212.5              | 1.5             | 1.2
19   |   10    |  90     | 214.1              | 1.6             | 1.2
20   |    5    |  95     | 216.6              | 2.5             | 1.2
21   |    0    | 100     | 217.6              | 1.0             | 1.2
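As a quick sanity check of Table 2, differencing the occupancy column reproduces the time-difference column, and the overall mean step of about 1.2 s matches the mean-time-difference column (values transcribed from the table):

```python
# Occupancy times (s) for rows 1-21 of Table 2 (100% AV down to 100% HV)
occupancy = [193.1, 194.1, 195.1, 196.1, 197.1, 198.1, 199.1, 200.1, 201.1,
             202.1, 203.7, 204.5, 206.0, 207.5, 209.0, 210.5, 211.0, 212.5,
             214.1, 216.6, 217.6]

# Consecutive differences, rounded to one decimal as in the table
diffs = [round(b - a, 1) for a, b in zip(occupancy, occupancy[1:])]

# Mean step across the 20 transitions: (217.6 - 193.1) / 20 = 1.225 ~ 1.2 s
mean_diff = (occupancy[-1] - occupancy[0]) / len(diffs)
```

The differences grow from a steady 1.0 s in the AV-heavy rows to 1.6-2.5 s once HVs dominate, which quantifies the congestion cost of the human-driven share.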

For AVs:

    C_a = v / (v · T_a + L)    (33)

When HVs and AVs are coupled, one can estimate the expected influence that
AVs will have on HVs by plotting the model with different parameters:

    C_a / C_h = (v · T_h + L) / (v · T_a + L)    (34)

For a traffic mix, let n represent the ratio of AVs integrated into the road; the
capacity c_m then depends on n:

    c_m = v / (n · v · T_a + (1 − n) · v · T_h + L_pkw)    (35)

Considering maintaining a greater gap between an autonomous vehicle and
a vehicle driven by a human, in order to prevent the annoyance of drivers:

    c_m = v / (n^2 · v · T_aa + n · (1 − n) · v · T_ah + (1 − n) · v · T_hx + L)    (36)
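The mixed-traffic capacity of Eq. (35) can be sketched as a one-line function; the parameter values in the example are illustrative, and the function name is an assumption rather than the paper's code:

```python
def mixed_capacity(v, n, T_a, T_h, L=4.5):
    """Eq. (35): capacity with AV share n in [0, 1], AV time gap T_a [s],
    HV time gap T_h [s], speed v [m/s] and vehicle length L [m]."""
    return v / (n * v * T_a + (1.0 - n) * v * T_h + L)
```

With n = 1 the expression collapses to the pure-AV capacity C_a of Eq. (33), with n = 0 to the pure-HV capacity C_h of Eq. (32), and because T_a < T_h the capacity rises monotonically with the AV share.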

Intersection Capacity. The capacity of a facility is determined by the flow of
traffic within an intersection's speed range:

    C_LSA = q_s · p_f = v · p_f / (v · T_h + L)    (37)

The capacity for AVs can be increased by reducing T_a (the departure time gap).

Other intersection traffic capacity estimation approaches:

1. Shortening the headway between AVs.
2. Raising the speed of the vehicle group: when maintaining a fixed density,
the faster the speed, the greater the volume of traffic.

4 Experiments
This scientific approach explores the various management strategies for mixed
traffic (AVs and HVs) at a road intersection in order to provide an alternative
control strategy that allows traffic to move in a manner that is both safe and
efficient. This inquiry included an evaluation of the existing state-of-the-art
strategy for managing mixed traffic.
The operation of the city's road traffic system is mostly based on the control
strategy, capabilities, and the traffic lights in use at the junctions. The choice
of a traffic management method depends on the drivers and other vehicles on
the road, as well as the control signals. There is a variety of traffic control
media, including traffic light signals, roadside traffic signs, wireless
communication, and road markings. The management of traffic necessitates
clear and effective communication between the vehicles that use the roads and
the infrastructure that supports them. The plan is to apply intersection cell
reservation to mixed traffic management on the prototype simulator and to
analyse its performance in comparison with the most recent developments in
the field. The effectiveness of each control method is judged by the impact its
strategies have on the performance of the relevant traffic parameters. The
prototypical model for a road intersection consists of the crossing of two streets
or roads that are perpendicular to one another. When two or more automobiles
drive up to a four-way stop simultaneously, the road segments maintain the
same crossing angle, with traffic coming from the right having priority by
default to go first. In this paradigm, the control of the traffic light signals is
shared with the control of the wireless communication between vehicles and
between vehicles and infrastructure. The following methods of controlling
traffic at intersections are used during the experiments: the Traffic Light, the
Collision Avoidance System, and the Cell Reservation System.

The following traffic control strategies were evaluated based on their effec-
tiveness and level of safety in relation to the research criteria, which made use
of the traffic control framework and method.

Traffic Lights (TL) Control Method


The conventional method of traffic control is known as the traffic light control
technique. This method was developed to utilize a static cycle timing system
in order to control the flow of traffic through road intersections. The timing
of the signal is planned so that it will rotate at a consistent time among all
of the phases or traffic routes. The traffic light control system is a roadside
indicator that directs vehicles driven by humans across a road section by using
the status of a light color variable that changes periodically. This occurs in
cycles. Despite the fact that the traffic light was developed for vehicles driven by
humans, its operations are now synchronized with the wireless communication
control protocols used by autonomous vehicles. The system of traffic lights uses
a sequential approach to allocate the predetermined amount of time to each
vehicle trajectory. Infrared sensors are incorporated into modern traffic light
systems for the purpose of optimizing their performance. These sensors detect
the optimal density of traffic signal control in response to the ever-evolving road
traffic situation, and they supply the control process with valuable information
regarding the flow of traffic. It is possible to make an accurate forecast of future
traffic flow performance by using TL’s historical traffic statistics and information.

Collision Avoidance with Safe Distance (CAwSD) Control Method


The collision avoidance techniques describe how the interaction between traffic
and the road system is modelled as a series of conflict points in order to reduce
the likelihood of collisions.
In contrast to the traffic light control method, this control method requires
neither a phase assignment nor a specific cycle time. At each point along its
trajectory, traffic approaching the intersection checks whether any other traffic
shares its collision points. In a real-life traffic scenario, the vehicle arrival
parameters of position, speed, and time are used to determine which vehicle is
expected to yield to the other in order to avoid a collision. Conflicting vehicles
that arrive at the intersection at the same time cannot enter the intersection
simultaneously, because they share the same collision point; vehicles may,
however, move concurrently within the intersection as long as they do not
occupy the same collision point at the same time. This method takes an
analytical approach by calculating the probability of traffic arriving at a
conflict point simultaneously and the delay that results, as reflected in Fig. 13.
When vehicles coming from different routes share the same collision point,
there is a chance that they will collide with one another. Since the behaviour
of human-controlled vehicles is unpredictable, and since they are more prone
to errors in prediction, they are the primary source of the problem. The
consideration is based on two kinds of vehicles with different maximum speeds,
slow (V_s) and fast (V_f), which denote the slow and fast vehicle fractions,
respectively.

Reservation Nodes (RN) Control Method


This RN technique, which is described as being proposed, is a reservation-based
algorithm. It works to schedule the entrances of vehicles into the intersection
space by assigning each instance of a collision cell to a specific vehicle and reserv-
ing a collision cell for that vehicle. A request must be made in order to use the
intersection collision point, and reservations must be made in accordance with a
predetermined protocol before any vehicles can move through the intersection.
This time-saving schedule was devised so that the relative speed of the vehicle in
relation to the reservation cell could be computed, and the cell itself was given a
sequence in which the vehicles would arrive. car’s distances to other cars before
it is calculated, and before the search for the shortest distance to the reservation
node begins. After this, the environment’s primary collision avoidance system
sends a signal to the car telling it to brake and slow down, or if this is not the
case, to continue driving forward. At cross collision points, a model that takes
into account safe vehicle distance, reaction time, and relative distance has been
proposed as a way to maximize the delay while simultaneously reducing the
probability of accidents. This decentralized method of traffic management is a
strategy in which drivers and vehicles communicate with one another and nego-
tiate for access to the cross collision point based on their relative proximity to
the intersection and the order in which they are given access to the intersection.

5 Result Discussion and Evaluation

The purpose of this experiment is to test the hypothesis that a road
intersection cell with reserved space results in more effective movement of
vehicles. When the AVs' inter-vehicle distance is changed, the performance of
HVs improves, and the occupancy time grows as the proportion of
human-driven vehicles rises. An analysis of variance of the time measurements
across the different ratio simulation tests is presented in Table 2, which gives
statistics for the variation of occupancy time with vehicle mix ratio. This is
because of the behavioural differences between cars driven by humans and
those driven by autonomous systems.
Integration of Human-Driven and Autonomous Vehicle 469

Fig. 10. 50% capacity

Stability:
In the context of this research, traffic flow stability, as represented in
Fig. 14, is analysed via the number of braking events in response to traffic
volume for the different control methods under the same conditions. At road
intersections, the effectiveness of traffic flow depends, in part, on its
stability, which can be evaluated by counting the number of times a control
method causes vehicles to brake. The consistency of the flow speed is a metric
of traffic stability: a condition in which all vehicles move at the same
optimal speed and maintain the same safe distance from one another. Speed
fluctuation impacts the flow stability of moving vehicles, as shown in
Fig. 14. The various approaches to traffic control are associated with varying
degrees of predictability. Maintaining a safe distance between vehicles
requires both deceleration and acceleration, which disturbs the flow stability
of the entire system.
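As a rough illustration of this stability metric (my own sketch, not the paper's simulator), braking events can be counted as the negative speed changes in each vehicle's recorded speed profile:

```python
def braking_events(speed_profiles, tol=1e-9):
    """Count decelerations: one event per time step in which a vehicle slows down.

    speed_profiles: iterable of per-vehicle speed time series.
    """
    events = 0
    for speeds in speed_profiles:
        events += sum(1 for v0, v1 in zip(speeds, speeds[1:]) if v1 < v0 - tol)
    return events
```

A lower count under the same traffic volume indicates a more stable flow for that control method.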

Fig. 11. 100% capacity


470 E. F. Ozioko et al.

Discussions
The methodology proposed for analysing the impact of combining AVs and HVs
will assist in determining the integration pattern of autonomous vehicles
during the transition period between the two types of vehicles. In addition,
traffic engineers can estimate the capacity of a road intersection in a
mixed-traffic environment by using the models developed in this study
(Fig. 10 and Fig. 11). According to the findings of this investigation,
autonomous vehicles are not only significantly safer but also more
time-efficient, and they contribute to the reduction of road congestion. It is
evident from Fig. 11 that intersection efficiency increases with an increase
in the ratio of autonomous vehicles. This is because AVs combine and interpret
the sensory data they receive from their surroundings in order to identify
appropriate navigation paths, obstacles, and relevant signage.

Fig. 12. Vehicle occupancy matrix

Fig. 13. Travel time delay

The performance metrics relating to throughput and delay described in Fig. 13
are used to measure intersection efficiency using traffic parameters. The
performance of the various traffic control strategies is analysed in
simulation across a range of parameter values to see how the different
parameter values affect the throughput performance of the system.
In each simulation, the vehicle mix ratio was increased in order to establish
the impact of the ratio variation on the integration pattern and to guide it.
Under each of the three different approaches to traffic control, the
performance of a variety of ratio cases is analysed and compared. Because of
this trend, the HV will benefit from the AV's efficiency in a scenario in
which they co-exist.

Fig. 14. The number of braking events that occurred

6 Contributions to Knowledge

The work presented here has produced new knowledge building on what was
already available, as follows:

– A guide to the mixed-traffic integration pattern
– An effective description of 2-D mixed-traffic behaviour
– Improved HV performance when the AV inter-vehicle distance is adjusted
– A speed harmonisation method for mixed traffic
– A mixed driving behaviour model

7 Future Research Direction


The mixed traffic management scheme has the potential to be improved by future
research work in the following four main categories:
Drivers Behaviour Models

– Incorporate the drivers’ decision to accept or reject RN offer


– Investigate the factors that influence the driver’s behaviour

Vehicle Models

– Model varying vehicle lengths to reflect the real city traffic situation

Road Intersection Model

– Extend the strategy to a multi-lane, multi-intersection road network


– Investigate the cooperation level between AV and HV

Traffic Flow Model

– Conduct research into the interplay between safe distance and reaction time
distributions.
– Make use of machine learning in order to manage traffic and provide
realistic vehicle physics.
– Investigate non-compliance in emergency situations

8 Conclusion

The cell reservation method is novel in two ways: first, it addresses a
2-dimensional traffic flow problem in heterogeneous traffic by using an
existing 1-dimensional car-following model to compensate for unexpected
changes in human-driven vehicles; and second, it avoids vehicle collisions by
assigning individual vehicles sequentially to the intersection reservation
cells. In order to effectively smooth out the flow of traffic, the algorithm
controls the bottleneck that is caused by the mix of traffic with variable
speeds.
The proposed model interpolates the behaviour of human-driven and autonomous
vehicles while adjusting the inter-vehicle distance using the acceptance safe
distance model.
The strategy described above has been implemented on the developed model,
which has been calibrated with realistic parameters, vehicle distributions,
and vehicle ratio mixes. Because it centrally synchronises both the AV and HV
parameters at the same time, the cell reservation method has the potential to
be effective. When it comes to traffic management, the ability to predict
vehicle velocities made possible by the AVs' real-time traffic parameter
sharing is invaluable. The integration plan for autonomous vehicles and a
mixed traffic control system is given some scientific
support by this body of work. It will make mixed-traffic more efficient, it will
help alleviate traffic congestion at road intersections, and it will provide technical
support for future research in traffic control systems. The use of hybrid vehicles
that combine human and automated driving is gradually becoming the standard
across the globe. The widespread development and implementation of innovative
technologies in the management of vehicles and traffic will significantly advance
urban traffic control systems and provide support for the implementation of
intelligent transportation on a broad scale.
The cell reservation method was used to investigate the effect that driverless
cars would have on human-driven cars at a road intersection with merging lanes
by measuring the distance between vehicles using the inter-vehicle distance. The
vehicle occupation time was observed at a merging road as reflected in Fig. 12,
and mixed mathematical relations relating to occupation time of different vehicle
types were developed. A vehicle ratio occupancy pattern was developed as a
valuable tool for evaluating the process of integrating autonomous cars onto
public roads as a result of our findings. This pattern will serve as a basis for
future research.
The following are the most important takeaways from this research:

1. When cells at road intersections are reserved, the efficiency of the flow of
traffic is increased.
2. It has been demonstrated that the introduction of autonomous vehicles onto
public roads will have a beneficial effect on the operational effectiveness of
vehicles driven by humans.
3. The length of time that a vehicle is occupied is reliant on the traffic mixed
ratio.

8.1 Summary
The process of integrating autonomous vehicles into existing traffic systems has
been supported by the development of related traffic technologies. This process
is essential for making full use of the advantages offered by autonomous vehicles.
For a merging T-junction, a mathematical model describes how the mixed
behaviour of the two vehicle types relates occupation time to traffic flow.
It has been observed that the ratio of autonomous vehicles to other types of
vehicles in a mixed traffic flow has an effect on the amount of time that a
vehicle spends occupied in that flow. Additionally, increasing the distance that
separates vehicles results in a higher throughput. The methodology that has been
proposed will be useful in determining the integration pattern of an autonomous
vehicle for the transition period involving mixed vehicle types. Additionally, the
models that were developed as a result of this research can be utilised by traffic
engineers in order to estimate the capacity of a merging road intersection in an
environment with mixed traffic.
According to the findings of the investigation, autonomous cars are not only
significantly safer but also more time-efficient and contribute to the reduction of
road congestion.

The work that has been done up to this point represents steps toward a safe
and efficient mixed traffic management scheme that will assist in the
implementation of an environment with mixed traffic integration. The
objectives of this project have been met: autonomous cars are here to stay,
and it is inevitable that they will co-exist with human-driven cars. This is
an essential goal because our reliance on autonomous cars is growing at an
ever-increasing rate. To this end, there is a potentially fruitful method of
managing the traffic mix that is amenable to implementation. The experimental
results hold out hope for a traffic schedule that maintains the state of the
art in the management of mixed traffic environments.
The findings are based on an intersection that can accommodate a total
of one hundred vehicles and has a variable proportion of both driver-less and
human-operated vehicles. Looking at the results in Table 2, the research
hypothesis is supported: the obtained results demonstrate that an increase in
the ratio of autonomous cars leads to a decrease in the simulated occupation
time. As a result, we conclude that the efficiency of the intersection
increases with the ratio of autonomous cars to human-driven cars, which
demonstrates that autonomous cars improve the efficiency of the flow of
traffic. We have investigated the possible repercussions that could result from
allowing driver-less cars and vehicles driven by humans to coexist on the road.
Our evaluation was carried out using parameters that are consistent with the
actual operating environment of the city’s traffic flow system. This ensured that
our findings are as accurate as possible. However, despite their use of real-time
event-driven-based control models, modern traffic lights are built to simulate
a homogeneous traffic system. The AVHV control model, on the other hand,
allows for wireless communications for controlling AVs in addition to support-
ing a traffic schedule that includes a traffic signal light to control HVs. This
control method involves the dynamic representation of a mix-traffic system at
road intersections in order to help plan, design, and operate traffic systems while
they are moving through time. This is done in order to improve the efficiency of
these processes. The utilisation of reservation cells to improve the performance
of the traffic flow was selected as the direction to take the research in. When
compared to other methods, such as using traffic lights or avoiding collisions,
increasing the traffic flow throughput can be accomplished by reserving one of
the twelve intersection reservation cells for a vehicle at every instance. The
findings obtained indicate that the cell reservation strategy yields a
performance margin of approximately 18.2%.

Acknowledgments. This research forms part of my doctoral dissertation, which
was supported by the Nigerian Tertiary Education Trust Fund.
Equivalence Between Classical Epidemic Model
and Quantum Tight-Binding Model

Krzysztof Pomorski1,2(B)
1 Faculty of Computer Science and Telecommunications, Technical University of Cracow,
ul. Warszawska 24, 31-155 Cracow, Poland
[email protected], [email protected]
2 Quantum Hardware Systems, ul. Babickiego 10/195, 94-056 Lodz, Poland

https://www.quantumhardwaresystems.com

Abstract. The equivalence between the classical epidemic model and the
non-dissipative and dissipative quantum tight-binding model is derived. The
classical epidemic model can reproduce the quantum entanglement emerging in
the case of electrostatically coupled qubits described by von Neumann entropy,
in both the non-dissipative and the dissipative case. The obtained results
show that quantum mechanical phenomena might be almost entirely simulated by a
classical statistical model, including quantum-like entanglement and
superposition of states. Therefore coupled epidemic models expressed by
classical systems in terms of classical physics can be the base for possible
incorporation of quantum technologies, and in particular for quantum-like
computation and quantum-like communication. The classical density matrix is
derived and described by an equation of motion in terms of an anticommutator.
The existence of Rabi-like oscillations is pointed out in the classical
epidemic model. Furthermore, the Aharonov-Bohm effect in quantum systems can
also be reproduced by the classical epidemic model. Every quantum system made
from quantum dots and described by a simplistic tight-binding model using
position-based qubits can be effectively described by a classical model with a
very specific structure of the S matrix, whose size is twice that of the
quantum matrix Hamiltonian. The obtained results partly question the
fundamental and unique character of quantum mechanics and place the ontology
of quantum mechanics much within the framework of classical statistical
physics, which can motivate the emergence of other fundamental theories,
suggesting that quantum mechanics is only an effective and phenomenological,
but not fundamental, picture of reality.

Keywords: Epidemic model · Tight-binding model · Stochastic finite state
machine · Position-based qubits

1 Introduction to Classical Epidemic Model

The epidemic model can describe sickness propagation and various phenomena in
sociology, physics and biology. The most basic form of the epidemic model
relies on the co-dependence of the probabilities of occurrence of two states,
1 and 2, which can be identified with the states of
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 477–492, 2023.
https://doi.org/10.1007/978-3-031-18461-1_31
478 K. Pomorski

being healthy and sick, as depicted in Fig. 1. It is expressed compactly in
the following way:

$$\big(s_{11}(t)\,|1\rangle\langle 1| + s_{22}(t)\,|2\rangle\langle 2| + s_{12}(t)\,|1\rangle\langle 2| + s_{21}(t)\,|2\rangle\langle 1|\big)\big(p_1(t)\,|1\rangle + p_2(t)\,|2\rangle\big) = \frac{d}{dt}\big(p_1(t)\,|1\rangle + p_2(t)\,|2\rangle\big),$$
$$\begin{pmatrix} s_{11}(t) & s_{12}(t) \\ s_{21}(t) & s_{22}(t) \end{pmatrix}\begin{pmatrix} p_1(t) \\ p_2(t) \end{pmatrix} = \hat S_t\,|\psi_{\mathrm{classical}}\rangle = \frac{d}{dt}\begin{pmatrix} p_1(t) \\ p_2(t) \end{pmatrix} = \frac{d}{dt}\,|\psi_{\mathrm{classical}}\rangle. \quad (1)$$

Fig. 1. Illustration of the epidemic model as a stochastic finite state
machine: a 2-level system with 2 distinguished states 1 and 2. The 4 possible
transitions are characterized by 4 time-dependent coefficients
s1→1(t) = s11(t), s1→2(t) = s12(t), s2→1(t) = s21(t), s2→2(t) = s22(t).
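To make Eq. (1) concrete, here is a minimal numerical sketch (my own illustration, not from the paper), assuming constant rates chosen so that the columns of S sum to zero, which conserves total probability as in a standard two-state master equation:

```python
import numpy as np

# Rates: state 1 -> 2 at rate a (falling sick), state 2 -> 1 at rate b (recovery).
a, b = 0.3, 0.1
S = np.array([[-a,  b],
              [ a, -b]])        # columns sum to zero, so d(p1 + p2)/dt = 0

p = np.array([1.0, 0.0])        # everybody starts healthy
dt = 0.001
for _ in range(100_000):        # forward-Euler integration of dp/dt = S p
    p = p + dt * (S @ p)

# p has relaxed to the stationary distribution (b, a) / (a + b) = (0.25, 0.75).
```

For time-dependent rates the same loop applies with S evaluated at each step; the closed-form propagator is derived later in Eqs. (13)-(18).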

Quite naturally, such a system evolves in a statistical environment before the
measurement is done. Once the measurement is done, the statistical system
state changes from being undetermined and spanned by two probabilities into
the case of p1 = 1 or p1 = 0, which corresponds to two projections:

$$\hat P_{\to 1} = |1\rangle\langle 1| = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \qquad \hat P_{\to 2} = |2\rangle\langle 2| = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}, \qquad \hat P_{\to 1} + \hat P_{\to 2} = \hat I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad (2)$$

$$\hat P_{\to 1}\,|\psi\rangle_{\mathrm{classical}} = |1\rangle = |\psi_1\rangle_{after}, \qquad \hat P_{\to 2}\,|\psi\rangle_{\mathrm{classical}} = |2\rangle = |\psi_2\rangle_{after}, \quad (3)$$

where the outcomes $|\psi_1\rangle_{after}$ and $|\psi_2\rangle_{after}$ occur
with probabilities $p_1(t_{measurement})$ and $p_2(t_{measurement})$,
respectively.
Equivalence Between Classical Epidemic Model and Quantum Tight-Binding Model 479
 
We notice that the matrix $\hat S = \begin{pmatrix} s_{11}(t) & s_{12}(t) \\ s_{21}(t) & s_{22}(t) \end{pmatrix}$ has 2 eigenvalues

$$E_1(t) = \tfrac{1}{2}\Big[s_{11}(t) + s_{22}(t) - \sqrt{(s_{11}(t)-s_{22}(t))^2 + 4\,s_{12}(t)\,s_{21}(t)}\Big],$$
$$E_2(t) = \tfrac{1}{2}\Big[s_{11}(t) + s_{22}(t) + \sqrt{(s_{11}(t)-s_{22}(t))^2 + 4\,s_{12}(t)\,s_{21}(t)}\Big], \quad (4)$$
and we have the corresponding classical eigenstates

$$|\psi_{E_1}\rangle = \frac{2 s_{21}}{2 s_{21} + \big({-}\sqrt{(s_{11}-s_{22})^2+4 s_{12} s_{21}} + s_{11} - s_{22}\big)}\begin{pmatrix} \dfrac{-\sqrt{(s_{11}-s_{22})^2+4 s_{12} s_{21}} + s_{11} - s_{22}}{2 s_{21}} \\[2mm] 1 \end{pmatrix}, \quad (5)$$

$$|\psi_{E_2}\rangle = \frac{2 s_{21}}{2 s_{21} + \big({+}\sqrt{(s_{11}-s_{22})^2+4 s_{12} s_{21}} + s_{11} - s_{22}\big)}\begin{pmatrix} \dfrac{+\sqrt{(s_{11}-s_{22})^2+4 s_{12} s_{21}} + s_{11} - s_{22}}{2 s_{21}} \\[2mm] 1 \end{pmatrix}. \quad (6)$$

We recognize that the two states $|\psi_{E_1}\rangle$ and $|\psi_{E_2}\rangle$
are orthogonal, so $\langle\psi_{E_1}|\psi_{E_2}\rangle = \langle\psi_{E_2}|\psi_{E_1}\rangle = 0$.
Writing $W = \sqrt{(s_{11}-s_{22})^2 + 4 s_{12} s_{21}}$ for brevity, we also
recognize that

$$\langle\psi_{E_1}|\psi_{E_1}\rangle = \Big[\frac{-W + s_{11} - s_{22}}{2s_{21} + (-W + s_{11} - s_{22})}\Big]^2 + \Big[\frac{2s_{21}}{2s_{21} + (-W + s_{11} - s_{22})}\Big]^2 = 1 - \frac{4 s_{21}\,(-W + s_{11} - s_{22})}{\big(2s_{21} + (-W + s_{11} - s_{22})\big)^2} = n_{E_1}(t), \quad (7)$$

$$\langle\psi_{E_2}|\psi_{E_2}\rangle = \Big[\frac{+W + s_{11} - s_{22}}{2s_{21} + (+W + s_{11} - s_{22})}\Big]^2 + \Big[\frac{2s_{21}}{2s_{21} + (+W + s_{11} - s_{22})}\Big]^2 = 1 - \frac{4 s_{21}\,(+W + s_{11} - s_{22})}{\big(2s_{21} + (+W + s_{11} - s_{22})\big)^2} = n_{E_2}(t). \quad (8)$$
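Eqs. (4)-(6) can be verified numerically. The sketch below is my own, with arbitrary sample rates chosen symmetric (s12 = s21) so that the two eigenstates are also orthogonal as stated in the text:

```python
import numpy as np

s11, s22, s12, s21 = 1.0, 2.0, 0.5, 0.5          # arbitrary symmetric sample rates
W = np.sqrt((s11 - s22)**2 + 4 * s12 * s21)      # discriminant of Eq. (4)
E1 = 0.5 * (s11 + s22 - W)                       # Eq. (4)
E2 = 0.5 * (s11 + s22 + W)

S = np.array([[s11, s12], [s21, s22]])
assert np.allclose(sorted(np.linalg.eigvals(S)), [E1, E2])

# Eigenvector directions from Eqs. (5)-(6), up to overall normalization.
v1 = np.array([(-W + s11 - s22) / (2 * s21), 1.0])
v2 = np.array([(+W + s11 - s22) / (2 * s21), 1.0])
assert np.allclose(S @ v1, E1 * v1)
assert np.allclose(S @ v2, E2 * v2)
assert abs(v1 @ v2) < 1e-12                      # orthogonal because s12 = s21
```

For non-symmetric S (s12 ≠ s21) the eigenvalue and eigenvector formulas still hold whenever the discriminant is positive, but the two eigenvectors are no longer mutually orthogonal.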

It shall be underlined that the necessary condition for identification of the
superposition of two classical eigenstates is expressed by
$\big({-}\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}\big) > 0$ and
$\big({+}\sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}+s_{11}-s_{22}\big) > 0$, which
pre-imposes some constraints on the real-valued functions s11(t), s12(t),
s21(t) and s22(t). The full classical state can be written as the
superposition of two ensembles with probabilities pI and pII, expressed by the
following classical state:

   
Writing $W = \sqrt{(s_{11}-s_{22})^2 + 4 s_{12} s_{21}}$ for brevity,

$$|\psi(t)\rangle_{\mathrm{classical}} = p_I(t)\,|\psi_{E_1}\rangle + p_{II}(t)\,|\psi_{E_2}\rangle = \begin{pmatrix} p_I(t)\,\dfrac{-W + s_{11} - s_{22}}{2s_{21} + (-W + s_{11} - s_{22})} + p_{II}(t)\,\dfrac{+W + s_{11} - s_{22}}{2s_{21} + (+W + s_{11} - s_{22})} \\[3mm] p_I(t)\,\dfrac{2s_{21}}{2s_{21} + (-W + s_{11} - s_{22})} + p_{II}(t)\,\dfrac{2s_{21}}{2s_{21} + (+W + s_{11} - s_{22})} \end{pmatrix} = \begin{pmatrix} p_1(t) \\ p_2(t) \end{pmatrix}. \quad (9)$$
We have a superposition of states with two statistical ensembles occurring
with probabilities pI(t) and pII(t), which are encoded in the directly
observable probabilities p1(t) and p2(t). We can extract pI(t) and pII(t) from
$|\psi(t)\rangle_{\mathrm{classical}}$ in the following way, where
$W = \sqrt{(s_{11}-s_{22})^2 + 4 s_{12} s_{21}}$ and $n_{E_1}(t)$, $n_{E_2}(t)$
are given by Eqs. (7)-(8):

$$p_I(t) = \frac{1}{n_{E_1}(t)}\,\langle\psi_{E_1}|\psi(t)\rangle_{\mathrm{classical}} = \frac{1}{n_{E_1}(t)}\,\frac{(-W + s_{11} - s_{22})\,p_1(t) + 2 s_{21}\,p_2(t)}{2s_{21} + (-W + s_{11} - s_{22})}, \quad (10)$$

$$p_{II}(t) = \frac{1}{n_{E_2}(t)}\,\langle\psi_{E_2}|\psi(t)\rangle_{\mathrm{classical}} = \frac{1}{n_{E_2}(t)}\,\frac{(+W + s_{11} - s_{22})\,p_1(t) + 2 s_{21}\,p_2(t)}{2s_{21} + (+W + s_{11} - s_{22})}. \quad (11)$$

The probabilities pI(t) and pII(t) describe the occupancy of the energy levels
E1 and E2 in the real-time domain of the simplistic epidemic model. We have
the same superposition of two eigenenergies as in the case of the quantum
tight-binding model. The same reasoning can be conducted for the N-state
classical epidemic model, expressed with
$\hat S_t = \sum_{j=1}^{N}\sum_{k=1}^{N} s_{jk}(t)\,|j\rangle\langle k|$ as

$$\frac{d}{dt}\begin{pmatrix} p_1(t) \\ p_2(t) \\ \vdots \\ p_N(t) \end{pmatrix} = \begin{pmatrix} s_{11}(t) & s_{12}(t) & \cdots & s_{1N}(t) \\ s_{21}(t) & s_{22}(t) & \cdots & s_{2N}(t) \\ \vdots & & \ddots & \vdots \\ s_{N1}(t) & s_{N2}(t) & \cdots & s_{NN}(t) \end{pmatrix}\begin{pmatrix} p_1(t) \\ p_2(t) \\ \vdots \\ p_N(t) \end{pmatrix} = \hat S_t\,|\psi_{\mathrm{classical}}\rangle = \frac{d}{dt}\,|\psi_{\mathrm{classical}}\rangle. \quad (12)$$

Analytical Solutions of the Simplistic Classical Epidemic Model

In principle we can also introduce a weak measurement procedure, which will be
partly omitted in this work. In a very real way, if we have a population of N
individuals possibly infected with COVID, we can inspect N1 individuals, where
N1 < N, and introduce the corrections
$p_1^{-}(t_{measurement}) \to \frac{1}{N}\big[(N - N_1)\,p_1^{-}(t_{measurement}) + N_1\,p_1(t_{test})\big] = p_1^{+}(t_{measurement})$
and
$p_2^{-}(t_{measurement}) \to \frac{1}{N}\big[(N - N_1)\,p_2^{-}(t_{measurement}) + N_1\,p_2(t_{test})\big] = p_2^{+}(t_{measurement})$,
where $p_1(t_{test})$, $p_2(t_{test})$ are the probabilities obtained by
testing the N1 individuals, which corresponds to a weak measurement conducted
on the ensemble of N individuals. Let us consider the state of the system
before measurement and its natural evolution. Such a set of equations has two
analytical solutions, for the probabilities p1(t) and p2(t), expressed as
$$\exp\begin{pmatrix} \int_{t_0}^{t} s_{11}(t')\,dt' & \int_{t_0}^{t} s_{12}(t')\,dt' \\ \int_{t_0}^{t} s_{21}(t')\,dt' & \int_{t_0}^{t} s_{22}(t')\,dt' \end{pmatrix}\begin{pmatrix} p_1(t_0) \\ p_2(t_0) \end{pmatrix} = \exp\begin{pmatrix} S_{11}(t,t_0) & S_{12}(t,t_0) \\ S_{21}(t,t_0) & S_{22}(t,t_0) \end{pmatrix}\begin{pmatrix} p_1(t_0) \\ p_2(t_0) \end{pmatrix} = \begin{pmatrix} U_{11}(t,t_0) & U_{12}(t,t_0) \\ U_{21}(t,t_0) & U_{22}(t,t_0) \end{pmatrix}\begin{pmatrix} p_1(t_0) \\ p_2(t_0) \end{pmatrix} = \hat U(t,t_0)\begin{pmatrix} p_1(t_0) \\ p_2(t_0) \end{pmatrix} = \begin{pmatrix} p_1(t) \\ p_2(t) \end{pmatrix} \quad (13)$$

with

$$S_{11}(t,t_0) = \int_{t_0}^{t} s_{11}(t')\,dt', \quad S_{12}(t,t_0) = \int_{t_0}^{t} s_{12}(t')\,dt', \quad S_{21}(t,t_0) = \int_{t_0}^{t} s_{21}(t')\,dt', \quad S_{22}(t,t_0) = \int_{t_0}^{t} s_{22}(t')\,dt', \quad (14)$$

and, abbreviating $R(t,t_0) = \sqrt{(S_{11}(t,t_0) - S_{22}(t,t_0))^2 + 4 S_{12}(t,t_0) S_{21}(t,t_0)}$
(all $S_{jk}$ below evaluated at $(t,t_0)$),

$$U_{1,1}(t,t_0) = e^{\frac{S_{11}+S_{22}}{2}}\Big[\cosh\tfrac{R}{2} + \frac{(S_{11}-S_{22})\,\sinh\tfrac{R}{2}}{R}\Big], \quad (15)$$

$$U_{2,2}(t,t_0) = e^{\frac{S_{11}+S_{22}}{2}}\Big[\cosh\tfrac{R}{2} - \frac{(S_{11}-S_{22})\,\sinh\tfrac{R}{2}}{R}\Big], \quad (16)$$

$$U_{1,2}(t,t_0) = \frac{2 S_{12}\,e^{\frac{S_{11}+S_{22}}{2}}\,\sinh\tfrac{R}{2}}{R}, \quad (17)$$

$$U_{2,1}(t,t_0) = \frac{2 S_{21}\,e^{\frac{S_{11}+S_{22}}{2}}\,\sinh\tfrac{R}{2}}{R}. \quad (18)$$

We obtain explicit formulas for the probabilities, with the same abbreviation
$R = \sqrt{(S_{11}-S_{22})^2 + 4 S_{12} S_{21}}$ and all $S_{jk}$ evaluated at
$(t,t_0)$:

$$p_1(t) = e^{\frac{S_{11}+S_{22}}{2}}\Big\{\Big[\cosh\tfrac{R}{2} + \frac{(S_{11}-S_{22})\,\sinh\tfrac{R}{2}}{R}\Big]\,p_1(t_0) + \frac{2 S_{12}\,\sinh\tfrac{R}{2}}{R}\,p_2(t_0)\Big\}, \quad (19)$$

$$p_2(t) = e^{\frac{S_{11}+S_{22}}{2}}\Big\{\frac{2 S_{21}\,\sinh\tfrac{R}{2}}{R}\,p_1(t_0) + \Big[\cosh\tfrac{R}{2} - \frac{(S_{11}-S_{22})\,\sinh\tfrac{R}{2}}{R}\Big]\,p_2(t_0)\Big\}. \quad (20)$$

It is useful to express the ratio of the probabilities p1(t) and p2(t)
analytically as

$$r_{12}(t) = \frac{p_1(t)}{p_2(t)} = \frac{\big[(S_{11}-S_{22})\,p_1(t_0) + 2 S_{12}\,p_2(t_0)\big]\tanh\tfrac{R}{2} + R\,p_1(t_0)}{\big[2 S_{21}\,p_1(t_0) - (S_{11}-S_{22})\,p_2(t_0)\big]\tanh\tfrac{R}{2} + R\,p_2(t_0)}. \quad (21)$$
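The closed forms (15)-(18) can be checked against a brute-force matrix exponential. The sketch below is my own; the rate values are arbitrary constants, so that $S_{jk}(t,t_0) = s_{jk}\,(t - t_0)$, and a truncated Taylor series stands in for the matrix exponential:

```python
import numpy as np

# Constant rates, so S_jk(t, t0) = s_jk * (t - t0).
s = np.array([[0.2, 0.5],
              [0.3, -0.1]])
t = 1.0
S11, S12, S21, S22 = (s * t).flatten()

R = np.sqrt((S11 - S22)**2 + 4 * S12 * S21)
pref = np.exp((S11 + S22) / 2)
sh, ch = np.sinh(R / 2), np.cosh(R / 2)
U = pref * np.array([[ch + (S11 - S22) * sh / R, 2 * S12 * sh / R],
                     [2 * S21 * sh / R,          ch - (S11 - S22) * sh / R]])

# Brute force: exp(M) = sum_k M^k / k!, truncated.
M, expM, term = s * t, np.eye(2), np.eye(2)
for k in range(1, 30):
    term = term @ M / k
    expM = expM + term

assert np.allclose(U, expM)
```

The comparison is valid here because a constant S trivially commutes with itself at different times, which is the condition under which the exponential of the integrated matrix in Eq. (13) solves the equation of motion.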

2 Equations of Motion for the Classical Epidemic Model in Projector
Representation in the Case of a Time-Independent Matrix S

Let us consider the equations of motion for the case of time-independent s11,
s12, s21 and s22. In Dirac notation they can be written in the following way:

$$\Big(\frac{E_1}{n_{E_1}}\,|\psi_{E_1}\rangle\langle\psi_{E_1}| + \frac{E_2}{n_{E_2}}\,|\psi_{E_2}\rangle\langle\psi_{E_2}|\Big)\big(p_I\,|\psi_{E_1}\rangle + p_{II}\,|\psi_{E_2}\rangle\big) = \frac{d}{dt}\big(p_I\,|\psi_{E_1}\rangle + p_{II}\,|\psi_{E_2}\rangle\big) = \Big(\frac{d}{dt}p_I\Big)|\psi_{E_1}\rangle + \Big(\frac{d}{dt}p_{II}\Big)|\psi_{E_2}\rangle, \quad (22)$$

since E1, E2, $|\psi_{E_1}\rangle$ and $|\psi_{E_2}\rangle$ are time
independent. By applying $\langle\psi_{E_1}|$ and $\langle\psi_{E_2}|$ on the
left side and using the orthogonality relation between $|\psi_{E_1}\rangle$
and $|\psi_{E_2}\rangle$, we obtain the set of equations

$$\frac{E_1}{n_{E_1}}\,p_I = \frac{d}{dt}p_I, \qquad \frac{E_2}{n_{E_2}}\,p_{II} = \frac{d}{dt}p_{II}, \qquad e^{\frac{E_1}{n_{E_1}}(t-t_0)}\,p_I(t_0) = p_I(t), \qquad e^{\frac{E_2}{n_{E_2}}(t-t_0)}\,p_{II}(t_0) = p_{II}(t). \quad (23)$$

The sum of probabilities is not normalized; however, physical significance
lies in the ratio of the level occupancies, expressed as

$$\frac{p_I(t)}{p_{II}(t)} = \frac{p_I(t_0)}{p_{II}(t_0)}\,\exp\Big[\Big(\frac{E_1}{n_{E_1}} - \frac{E_2}{n_{E_2}}\Big)(t - t_0)\Big]. \quad (24)$$

It means that Rabi oscillations, or more precisely the change of occupancy
among levels, are naturally built into the classical epidemic model. Still,
the superposition of two states is maintained, so the analogy of the classical
epidemic model to the quantum tight-binding model is deep.

3 Case of Constant Occupancy of Two Eigenenergy Levels in the Classical
Epidemic Model

We consider the case of $p_I(t) = \mathrm{constant}_I$ and
$p_{II}(t) = \mathrm{constant}_{II}$. We have time-dependent parameters s11,
s22, s12 and s21 and we obtain the following equations of motion:

$$\big(E_1(t)\,|\psi_{E_1}\rangle_t\,{}_t\langle\psi_{E_1}| + E_2(t)\,|\psi_{E_2}\rangle_t\,{}_t\langle\psi_{E_2}|\big)\big(p_I\,|\psi_{E_1}\rangle_t + p_{II}\,|\psi_{E_2}\rangle_t\big) = \frac{d}{dt}\big(p_I\,|\psi_{E_1}\rangle_t + p_{II}\,|\psi_{E_2}\rangle_t\big) = p_I\,\frac{d}{dt}|\psi_{E_1}\rangle_t + p_{II}\,\frac{d}{dt}|\psi_{E_2}\rangle_t. \quad (25)$$

We obtain the set of 2 equations

$$E_1(t)\,p_I = p_I\,\langle\psi_{E_1}|\tfrac{d}{dt}|\psi_{E_1}\rangle + p_{II}\,\langle\psi_{E_1}|\tfrac{d}{dt}|\psi_{E_2}\rangle,$$
$$E_2(t)\,p_{II} = p_I\,\langle\psi_{E_2}|\tfrac{d}{dt}|\psi_{E_1}\rangle + p_{II}\,\langle\psi_{E_2}|\tfrac{d}{dt}|\psi_{E_2}\rangle. \quad (26)$$

Consequently we obtain

$$\frac{p_I}{p_{II}} = \frac{\langle\psi_{E_1}|\tfrac{d}{dt}|\psi_{E_2}\rangle}{E_1(t) - \langle\psi_{E_1}|\tfrac{d}{dt}|\psi_{E_1}\rangle}, \qquad \frac{p_I}{p_{II}} = \frac{E_2(t) - \langle\psi_{E_2}|\tfrac{d}{dt}|\psi_{E_2}\rangle}{\langle\psi_{E_2}|\tfrac{d}{dt}|\psi_{E_1}\rangle}, \quad (27)$$

which implies

$$\frac{\langle\psi_{E_1}|\tfrac{d}{dt}|\psi_{E_2}\rangle}{E_1(t) - \langle\psi_{E_1}|\tfrac{d}{dt}|\psi_{E_1}\rangle} = \frac{E_2(t) - \langle\psi_{E_2}|\tfrac{d}{dt}|\psi_{E_2}\rangle}{\langle\psi_{E_2}|\tfrac{d}{dt}|\psi_{E_1}\rangle}.$$

4 Equations of Motion for Classical Epidemic Model in Projector


Representation (Dirac Notation) and Rabi Oscillations
in Classical Epidemic Model

Let us consider the equations of motion in the following way:

$$\big(E_1(t)\,|\psi_{E1}\rangle\langle\psi_{E1}| + E_2(t)\,|\psi_{E2}\rangle\langle\psi_{E2}| + e_{12}(t)\,|\psi_{E2}\rangle\langle\psi_{E1}| + e_{21}(t)\,|\psi_{E1}\rangle\langle\psi_{E2}|\big)\big(p_I|\psi_{E1}\rangle + p_{II}|\psi_{E2}\rangle\big) = \frac{d}{dt}\big(p_I|\psi_{E1}\rangle + p_{II}|\psi_{E2}\rangle\big). \tag{28}$$
This equation is equivalent to the set of 2 coupled ordinary differential equations given as

$$E_1(t)p_I(t) + e_{21}(t)p_{II}(t) = \langle\psi_{E1}(t)|\frac{d}{dt}\big(p_I(t)|\psi_{E1}(t)\rangle\big) + \langle\psi_{E1}(t)|\frac{d}{dt}\big(p_{II}(t)|\psi_{E2}(t)\rangle\big),$$
$$E_2(t)p_{II}(t) + e_{12}(t)p_I(t) = \langle\psi_{E2}(t)|\frac{d}{dt}\big(p_I(t)|\psi_{E1}(t)\rangle\big) + \langle\psi_{E2}(t)|\frac{d}{dt}\big(p_{II}(t)|\psi_{E2}(t)\rangle\big), \tag{29}$$

and can be rewritten as

$$E_1(t)p_I(t) + e_{21}(t)p_{II}(t) = \frac{d}{dt}p_I(t) + p_I(t)\,\langle\psi_{E1}(t)|\frac{d}{dt}|\psi_{E1}(t)\rangle + p_{II}(t)\,\langle\psi_{E1}(t)|\frac{d}{dt}|\psi_{E2}(t)\rangle,$$
$$e_{12}(t)p_I(t) + E_2(t)p_{II}(t) = \frac{d}{dt}p_{II}(t) + p_{II}(t)\,\langle\psi_{E2}(t)|\frac{d}{dt}|\psi_{E2}(t)\rangle + p_I(t)\,\langle\psi_{E2}(t)|\frac{d}{dt}|\psi_{E1}(t)\rangle. \tag{30}$$

This leads to a further simplification that can be written as

$$\big[E_1(t) - \langle\psi_{E1}(t)|\tfrac{d}{dt}|\psi_{E1}(t)\rangle\big]p_I(t) + \big[e_{21}(t) - \langle\psi_{E1}(t)|\tfrac{d}{dt}|\psi_{E2}(t)\rangle\big]p_{II}(t) = \frac{d}{dt}p_I(t),$$
$$\big[e_{12}(t) - \langle\psi_{E2}(t)|\tfrac{d}{dt}|\psi_{E1}(t)\rangle\big]p_I(t) + \big[E_2(t) - \langle\psi_{E2}(t)|\tfrac{d}{dt}|\psi_{E2}(t)\rangle\big]p_{II}(t) = \frac{d}{dt}p_{II}(t). \tag{31}$$

We can write it in the compact form as

$$\begin{pmatrix} E_1(t) - \langle\psi_{E1}(t)|\frac{d}{dt}|\psi_{E1}(t)\rangle & e_{21}(t) - \langle\psi_{E1}(t)|\frac{d}{dt}|\psi_{E2}(t)\rangle \\ e_{12}(t) - \langle\psi_{E2}(t)|\frac{d}{dt}|\psi_{E1}(t)\rangle & E_2(t) - \langle\psi_{E2}(t)|\frac{d}{dt}|\psi_{E2}(t)\rangle \end{pmatrix}\begin{pmatrix} p_I(t) \\ p_{II}(t)\end{pmatrix} = \frac{d}{dt}\begin{pmatrix} p_I(t) \\ p_{II}(t)\end{pmatrix}. \tag{32}$$
Equivalence Between Classical Epidemic Model and Quantum Tight-Binding Model 485

The solution is given analytically as

$$\exp\begin{pmatrix} \int_{t_0}^{t} dt'\big[E_1(t') - \langle\psi_{E1}(t')|\frac{d}{dt'}|\psi_{E1}(t')\rangle\big] & \int_{t_0}^{t} dt'\big[e_{21}(t') - \langle\psi_{E1}(t')|\frac{d}{dt'}|\psi_{E2}(t')\rangle\big] \\ \int_{t_0}^{t} dt'\big[e_{12}(t') - \langle\psi_{E2}(t')|\frac{d}{dt'}|\psi_{E1}(t')\rangle\big] & \int_{t_0}^{t} dt'\big[E_2(t') - \langle\psi_{E2}(t')|\frac{d}{dt'}|\psi_{E2}(t')\rangle\big]\end{pmatrix}\begin{pmatrix} p_I(t_0)\\ p_{II}(t_0)\end{pmatrix} = \begin{pmatrix} p_I(t)\\ p_{II}(t)\end{pmatrix} \tag{33}$$

and can be written as

$$\hat{G}(t,t_0)\begin{pmatrix}p_I(t_0)\\p_{II}(t_0)\end{pmatrix} = \exp\begin{pmatrix} g_{1,1}(t,t_0) & g_{1,2}(t,t_0)\\ g_{2,1}(t,t_0) & g_{2,2}(t,t_0)\end{pmatrix}\begin{pmatrix}p_I(t_0)\\p_{II}(t_0)\end{pmatrix} = \begin{pmatrix} G_{1,1}(t,t_0) & G_{1,2}(t,t_0)\\ G_{2,1}(t,t_0) & G_{2,2}(t,t_0)\end{pmatrix}\begin{pmatrix}p_I(t_0)\\p_{II}(t_0)\end{pmatrix} = \begin{pmatrix}p_I(t)\\p_{II}(t)\end{pmatrix}, \tag{34}$$

where

$$g_{1,1}(t,t_0) = \int_{t_0}^{t} dt'\big[E_1(t') - \langle\psi_{E1}(t')|\tfrac{d}{dt'}|\psi_{E1}(t')\rangle\big],\qquad g_{1,2}(t,t_0) = \int_{t_0}^{t} dt'\big[e_{21}(t') - \langle\psi_{E1}(t')|\tfrac{d}{dt'}|\psi_{E2}(t')\rangle\big], \tag{35}$$
$$g_{2,1}(t,t_0) = \int_{t_0}^{t} dt'\big[e_{12}(t') - \langle\psi_{E2}(t')|\tfrac{d}{dt'}|\psi_{E1}(t')\rangle\big],\qquad g_{2,2}(t,t_0) = \int_{t_0}^{t} dt'\big[E_2(t') - \langle\psi_{E2}(t')|\tfrac{d}{dt'}|\psi_{E2}(t')\rangle\big], \tag{36}$$

and we have the corresponding classical eigenstates, where for brevity we write $\sqrt{D} \equiv \sqrt{(s_{11}-s_{22})^2+4s_{12}s_{21}}$ with all $s_{ij} = s_{ij}(t')$:

$$\langle\psi_{E1}(t')|\Big(\frac{d}{dt'}|\psi_{E1}(t')\rangle\Big) = \frac{\big(-\sqrt{D}+s_{11}-s_{22},\;2s_{21}\big)}{2s_{21}+\big(-\sqrt{D}+s_{11}-s_{22}\big)}\cdot\frac{d}{dt'}\!\left[\frac{1}{2s_{21}+\big(-\sqrt{D}+s_{11}-s_{22}\big)}\begin{pmatrix}-\sqrt{D}+s_{11}-s_{22}\\2s_{21}\end{pmatrix}\right],$$

$$\langle\psi_{E1}(t')|\Big(\frac{d}{dt'}|\psi_{E2}(t')\rangle\Big) = \frac{\big(-\sqrt{D}+s_{11}-s_{22},\;2s_{21}\big)}{2s_{21}+\big(-\sqrt{D}+s_{11}-s_{22}\big)}\cdot\frac{d}{dt'}\!\left[\frac{1}{2s_{21}+\big(+\sqrt{D}+s_{11}-s_{22}\big)}\begin{pmatrix}+\sqrt{D}+s_{11}-s_{22}\\2s_{21}\end{pmatrix}\right], \tag{37}$$

$$\langle\psi_{E2}(t')|\Big(\frac{d}{dt'}|\psi_{E1}(t')\rangle\Big) = \frac{\big(+\sqrt{D}+s_{11}-s_{22},\;2s_{21}\big)}{2s_{21}+\big(+\sqrt{D}+s_{11}-s_{22}\big)}\cdot\frac{d}{dt'}\!\left[\frac{1}{2s_{21}+\big(-\sqrt{D}+s_{11}-s_{22}\big)}\begin{pmatrix}-\sqrt{D}+s_{11}-s_{22}\\2s_{21}\end{pmatrix}\right],$$

$$\langle\psi_{E2}(t')|\Big(\frac{d}{dt'}|\psi_{E2}(t')\rangle\Big) = \frac{\big(+\sqrt{D}+s_{11}-s_{22},\;2s_{21}\big)}{2s_{21}+\big(+\sqrt{D}+s_{11}-s_{22}\big)}\cdot\frac{d}{dt'}\!\left[\frac{1}{2s_{21}+\big(+\sqrt{D}+s_{11}-s_{22}\big)}\begin{pmatrix}+\sqrt{D}+s_{11}-s_{22}\\2s_{21}\end{pmatrix}\right], \tag{38}$$

and

$$\int_{t_0}^{t} dt'\,E_1(t') = \int_{t_0}^{t} dt'\,\frac{1}{2}\Big(-\sqrt{(s_{11}(t')-s_{22}(t'))^2 + 4s_{12}(t')s_{21}(t')} + s_{11}(t') + s_{22}(t')\Big), \tag{39}$$
$$\int_{t_0}^{t} dt'\,E_2(t') = \int_{t_0}^{t} dt'\,\frac{1}{2}\Big(+\sqrt{(s_{11}(t')-s_{22}(t'))^2 + 4s_{12}(t')s_{21}(t')} + s_{11}(t') + s_{22}(t')\Big), \tag{40}$$

and, writing $g_{ij} \equiv g_{ij}(t,t_0)$ for brevity,

$$G_{1,1}(t,t_0) = e^{\frac{g_{11}+g_{22}}{2}}\left[\frac{(g_{11}-g_{22})\,\sinh\!\big(\frac{1}{2}\sqrt{(g_{11}-g_{22})^2+4g_{12}g_{21}}\big)}{\sqrt{(g_{11}-g_{22})^2+4g_{12}g_{21}}} + \cosh\!\Big(\frac{1}{2}\sqrt{(g_{11}-g_{22})^2+4g_{12}g_{21}}\Big)\right], \tag{41}$$

$$G_{2,2}(t,t_0) = e^{\frac{g_{11}+g_{22}}{2}}\left[-\frac{(g_{11}-g_{22})\,\sinh\!\big(\frac{1}{2}\sqrt{(g_{11}-g_{22})^2+4g_{12}g_{21}}\big)}{\sqrt{(g_{11}-g_{22})^2+4g_{12}g_{21}}} + \cosh\!\Big(\frac{1}{2}\sqrt{(g_{11}-g_{22})^2+4g_{12}g_{21}}\Big)\right], \tag{42}$$

$$G_{1,2}(t,t_0) = \frac{2g_{12}\,e^{\frac{g_{11}+g_{22}}{2}}\,\sinh\!\big(\frac{1}{2}\sqrt{(g_{11}-g_{22})^2+4g_{12}g_{21}}\big)}{\sqrt{(g_{11}-g_{22})^2+4g_{12}g_{21}}}, \tag{43}$$

$$G_{2,1}(t,t_0) = \frac{2g_{21}\,e^{\frac{g_{11}+g_{22}}{2}}\,\sinh\!\big(\frac{1}{2}\sqrt{(g_{11}-g_{22})^2+4g_{12}g_{21}}\big)}{\sqrt{(g_{11}-g_{22})^2+4g_{12}g_{21}}}. \tag{44}$$
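Formulas (41)–(44) are the standard closed form of the exponential of a 2×2 matrix. A quick numerical cross-check against a truncated Taylor series, with assumed sample values of the $g_{ij}$:

```python
import math

# Assumed sample values for the integrated matrix elements g_ij(t, t0).
g11, g12, g21, g22 = 0.4, 0.2, 0.1, -0.3

delta = math.sqrt((g11 - g22) ** 2 + 4 * g12 * g21)
pref = math.exp((g11 + g22) / 2)
sh, ch = math.sinh(delta / 2), math.cosh(delta / 2)

# Closed forms of Eqs. (41)-(44).
G11 = pref * ((g11 - g22) * sh / delta + ch)
G22 = pref * (-(g11 - g22) * sh / delta + ch)
G12 = pref * 2 * g12 * sh / delta
G21 = pref * 2 * g21 * sh / delta

# Cross-check against a truncated Taylor series of exp(g).
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

E = [[1.0, 0.0], [0.0, 1.0]]       # running sum of the series
term = [[1.0, 0.0], [0.0, 1.0]]    # current term g^n / n!
g = [[g11, g12], [g21, g22]]
for n in range(1, 30):
    term = [[v / n for v in row] for row in matmul(term, g)]
    E = [[E[i][j] + term[i][j] for j in range(2)] for i in range(2)]

assert abs(E[0][0] - G11) < 1e-9 and abs(E[1][1] - G22) < 1e-9
assert abs(E[0][1] - G12) < 1e-9 and abs(E[1][0] - G21) < 1e-9
```

For time-dependent $s_{ij}$ the exponent is the matrix of integrals in Eq. (33); the sample above only checks the algebraic identity for one constant exponent.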

5 Analogy of Quantum Entanglement in Classical Epidemic Model

It is now easy to generalize our considerations to two coupled statistical ensembles, corresponding to cities connected by flight traffic. We have

$$\begin{pmatrix} s_{11}(t)_A & s_{12}(t)_A & 0 & s_{1A2B}(t)\\ s_{21}(t)_A & s_{22}(t)_A & s_{2A2B}(t) & 0\\ 0 & s_{2A1B}(t) & s_{11}(t)_B & s_{12}(t)_B\\ s_{2A1B}(t) & 0 & s_{21}(t)_B & s_{22}(t)_B \end{pmatrix}\begin{pmatrix} p_1(t)_A\\ p_2(t)_A\\ p_1(t)_B\\ p_2(t)_B\end{pmatrix} = \hat{S}|\psi_{classical}\rangle = \frac{d}{dt}\begin{pmatrix} p_1(t)_A\\ p_2(t)_A\\ p_1(t)_B\\ p_2(t)_B\end{pmatrix},$$

that is,

$$\hat{S}_t\big(p_{1A}(t)\,|1_A\rangle|1_B\rangle + p_{2A}(t)\,|1_A\rangle|2_B\rangle + p_{1B}(t)\,|2_A\rangle|1_B\rangle + p_{2B}(t)\,|2_A\rangle|2_B\rangle\big) = \frac{d}{dt}|\psi_{classical}\rangle. \tag{45}$$

We make analytic simplifications by assuming two symmetric systems A and B interacting in an asymmetric way as

$$\begin{pmatrix} s_{11}(t) & s_{12}(t) & 0 & s(t)\\ s_{21}(t) & s_{22}(t) & s(t) & 0\\ 0 & s(t) & s_{11}(t) & s_{12}(t)\\ s(t) & 0 & s_{21}(t) & s_{22}(t)\end{pmatrix}\begin{pmatrix} p_1(t)_A\\ p_2(t)_A\\ p_1(t)_B\\ p_2(t)_B\end{pmatrix} = \frac{d}{dt}\begin{pmatrix} p_1(t)_A\\ p_2(t)_A\\ p_1(t)_B\\ p_2(t)_B\end{pmatrix}. \tag{46}$$

We have four eigenstates

$$|V_1(t)\rangle = \begin{pmatrix} -\dfrac{\sqrt{4(S-S_{12})(S-S_{21})+(S_{11}-S_{22})^2}-S_{11}+S_{22}}{2(S-S_{21})}\\ -1\\ \dfrac{\sqrt{4(S-S_{12})(S-S_{21})+(S_{11}-S_{22})^2}-S_{11}+S_{22}}{2(S-S_{21})}\\ 1\end{pmatrix}, \tag{47}$$

$$|V_2(t)\rangle = \begin{pmatrix} \dfrac{\sqrt{4(S-S_{12})(S-S_{21})+(S_{11}-S_{22})^2}+S_{11}-S_{22}}{2(S-S_{21})}\\ -1\\ -\dfrac{\sqrt{4(S-S_{12})(S-S_{21})+(S_{11}-S_{22})^2}+S_{11}-S_{22}}{2(S-S_{21})}\\ 1\end{pmatrix}, \tag{48}$$

$$|V_3(t)\rangle = \begin{pmatrix} \dfrac{-\sqrt{4(S+S_{12})(S+S_{21})+(S_{11}-S_{22})^2}+S_{11}-S_{22}}{2(S+S_{21})}\\ 1\\ \dfrac{-\sqrt{4(S+S_{12})(S+S_{21})+(S_{11}-S_{22})^2}+S_{11}-S_{22}}{2(S+S_{21})}\\ 1\end{pmatrix}, \tag{49}$$

$$|V_4(t)\rangle = \begin{pmatrix} \dfrac{\sqrt{4(S+S_{12})(S+S_{21})+(S_{11}-S_{22})^2}+S_{11}-S_{22}}{2(S+S_{21})}\\ 1\\ \dfrac{\sqrt{4(S+S_{12})(S+S_{21})+(S_{11}-S_{22})^2}+S_{11}-S_{22}}{2(S+S_{21})}\\ 1\end{pmatrix}, \tag{50}$$

and $|V_3(t)\rangle$ and $|V_4(t)\rangle$ are physically justifiable in the framework of the epidemic model and can be used for classical entanglement. We have four projectors corresponding to the measurement of $p_{1A}$, $p_{2A}$, $p_{1B}$ and $p_{2B}$, represented as the matrices

$$\hat{P}_{p1A} = \begin{pmatrix}1&0&0&0\\0&0&0&0\\0&0&1&0\\0&0&0&1\end{pmatrix},\qquad \hat{P}_{p2A} = \begin{pmatrix}0&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&1\end{pmatrix}, \tag{51}$$

$$\hat{P}_{p1B} = \begin{pmatrix}1&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&0\end{pmatrix},\qquad \hat{P}_{p2B} = \begin{pmatrix}1&0&0&0\\0&1&0&0\\0&0&0&0\\0&0&0&1\end{pmatrix}. \tag{52}$$

A measurement conducted on system A also changes the state of system B, owing to the presence of off-diagonal matrix elements in the system evolution equations (the generalized epidemic model). It is therefore analogous to the measurement of a quantum entangled state. After the measurements we obtain the following classical states:

$$\hat{P}_{p1A}|\psi_{classical}\rangle = \begin{pmatrix}1&0&0&0\\0&0&0&0\\0&0&1&0\\0&0&0&1\end{pmatrix}\begin{pmatrix}p_{A1}(t)\\p_{A2}(t)\\p_{B1}(t)\\p_{B2}(t)\end{pmatrix} = \begin{pmatrix}1\\0\\p_{B1}(t)\\p_{B2}(t)\end{pmatrix},\qquad \hat{P}_{p2A}|\psi_{classical}\rangle = \begin{pmatrix}0&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&1\end{pmatrix}\begin{pmatrix}p_{A1}(t)\\p_{A2}(t)\\p_{B1}(t)\\p_{B2}(t)\end{pmatrix} = \begin{pmatrix}0\\1\\p_{B1}(t)\\p_{B2}(t)\end{pmatrix},$$

$$\hat{P}_{p1B}|\psi_{classical}\rangle = \begin{pmatrix}1&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&0\end{pmatrix}\begin{pmatrix}p_{A1}(t)\\p_{A2}(t)\\p_{B1}(t)\\p_{B2}(t)\end{pmatrix} = \begin{pmatrix}p_{A1}(t)\\p_{A2}(t)\\1\\0\end{pmatrix},\qquad \hat{P}_{p2B}|\psi_{classical}\rangle = \begin{pmatrix}1&0&0&0\\0&1&0&0\\0&0&0&0\\0&0&0&1\end{pmatrix}\begin{pmatrix}p_{A1}(t)\\p_{A2}(t)\\p_{B1}(t)\\p_{B2}(t)\end{pmatrix} = \begin{pmatrix}p_{A1}(t)\\p_{A2}(t)\\0\\1\end{pmatrix}. \tag{53}$$
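The measurement rule of Eq. (53) — the measured subsystem collapses to the observed outcome while the partner subsystem's probabilities are carried over — can be sketched as follows (the state values are assumed for illustration):

```python
# Sample (assumed) classical state (pA1, pA2, pB1, pB2).
state = [0.6, 0.4, 0.25, 0.75]

def measure_p1A(s):
    """Observe subsystem A in state 1: A collapses, B is carried over (Eq. 53)."""
    pA1, pA2, pB1, pB2 = s
    return [1.0, 0.0, pB1, pB2]

def measure_p2B(s):
    """Observe subsystem B in state 2: B collapses, A is carried over (Eq. 53)."""
    pA1, pA2, pB1, pB2 = s
    return [pA1, pA2, 0.0, 1.0]

assert measure_p1A(state) == [1.0, 0.0, 0.25, 0.75]
assert measure_p2B(state) == [0.6, 0.4, 0.0, 1.0]
```

Because the off-diagonal couplings of Eq. (46) then mix the collapsed state back across both subsystems during further evolution, the subsequent dynamics of B depends on the measurement performed on A.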

6 Two Classical Statistical Systems Interacting in Quantum Mechanical Way

We are going to reformulate the description of two classical noninteracting systems A and B. We have

$$\hat{H}_0(t) = \begin{pmatrix} s_{11A}(t) & s_{12A}(t)\\ s_{21A}(t) & s_{22A}(t)\end{pmatrix}\otimes\begin{pmatrix}1&0\\0&1\end{pmatrix} + \begin{pmatrix}1&0\\0&1\end{pmatrix}\otimes\begin{pmatrix} s_{11B}(t) & s_{12B}(t)\\ s_{21B}(t) & s_{22B}(t)\end{pmatrix}$$
$$= \big(E_{1A}(t)\,|\psi_{E1A}(t)\rangle\langle\psi_{E1A}(t)| + E_{2A}(t)\,|\psi_{E2A}(t)\rangle\langle\psi_{E2A}(t)|\big)\big(|\psi_{E1B}(t)\rangle\langle\psi_{E1B}(t)| + |\psi_{E2B}(t)\rangle\langle\psi_{E2B}(t)|\big)$$
$$+ \big(|\psi_{E1A}(t)\rangle\langle\psi_{E1A}(t)| + |\psi_{E2A}(t)\rangle\langle\psi_{E2A}(t)|\big)\big(E_{1B}(t)\,|\psi_{E1B}(t)\rangle\langle\psi_{E1B}(t)| + E_{2B}(t)\,|\psi_{E2B}(t)\rangle\langle\psi_{E2B}(t)|\big), \tag{54}$$

and we have the state

$$|\psi(t)\rangle = \gamma_1(t)\,|\psi_{E1A}(t)\rangle|\psi_{E1B}(t)\rangle + \gamma_2(t)\,|\psi_{E1A}(t)\rangle|\psi_{E2B}(t)\rangle + \gamma_3(t)\,|\psi_{E2A}(t)\rangle|\psi_{E1B}(t)\rangle + \gamma_4(t)\,|\psi_{E2A}(t)\rangle|\psi_{E2B}(t)\rangle$$
$$= \big(p_{1A}(t)|x_{1A}\rangle + p_{2A}(t)|x_{2A}\rangle\big)\big(p_{1B}(t)|x_{1B}\rangle + p_{2B}(t)|x_{2B}\rangle\big)$$
$$= p_{1A}(t)p_{1B}(t)\,|x_{1A}\rangle|x_{1B}\rangle + p_{1A}(t)p_{2B}(t)\,|x_{1A}\rangle|x_{2B}\rangle + p_{2A}(t)p_{1B}(t)\,|x_{2A}\rangle|x_{1B}\rangle + p_{2A}(t)p_{2B}(t)\,|x_{2A}\rangle|x_{2B}\rangle, \tag{55}$$

where $|x_{kA}\rangle|x_{lB}\rangle$ are time-independent, $k$ and $l$ are 1 or 2, and $\gamma_1(t)+\gamma_2(t)+\gamma_3(t)+\gamma_4(t) = 1$. We observe that $\hat{H}_0(t)|\psi\rangle_t = \frac{d}{dt}|\psi\rangle_t$ is explicitly given as

$$\frac{d}{dt}\begin{pmatrix} p_{1A}(t)p_{1B}(t)\\ p_{1A}(t)p_{2B}(t)\\ p_{2A}(t)p_{1B}(t)\\ p_{2A}(t)p_{2B}(t)\end{pmatrix} = \begin{pmatrix} s_{11A}(t)+s_{11B}(t) & s_{12B}(t) & s_{12A}(t) & 0\\ s_{21B}(t) & s_{11A}(t)+s_{22B}(t) & 0 & s_{12A}(t)\\ s_{21A}(t) & 0 & s_{22A}(t)+s_{11B}(t) & s_{12B}(t)\\ 0 & s_{21A}(t) & s_{21B}(t) & s_{22B}(t)+s_{22A}(t)\end{pmatrix}\begin{pmatrix} p_{1A}(t)p_{1B}(t)\\ p_{1A}(t)p_{2B}(t)\\ p_{2A}(t)p_{1B}(t)\\ p_{2A}(t)p_{2B}(t)\end{pmatrix}$$

$$= \begin{pmatrix} s_{11A}(t)+s_{11B}(t) & s_{12B}(t) & s_{12A}(t) & 0\\ s_{21B}(t) & s_{11A}(t)+s_{22B}(t) & 0 & s_{12A}(t)\\ s_{21A}(t) & 0 & s_{22A}(t)+s_{11B}(t) & s_{12B}(t)\\ 0 & s_{21A}(t) & s_{21B}(t) & s_{22B}(t)+s_{22A}(t)\end{pmatrix}\begin{pmatrix} p_{IQ}(t)\\ p_{IIQ}(t)\\ p_{IIIQ}(t)\\ p_{IVQ}(t)\end{pmatrix} = \frac{d}{dt}\begin{pmatrix} p_{IQ}(t)\\ p_{IIQ}(t)\\ p_{IIIQ}(t)\\ p_{IVQ}(t)\end{pmatrix}, \tag{56}$$

where $p_{IQ}(t)$, $p_{IIQ}(t)$, $p_{IIIQ}(t)$ and $p_{IVQ}(t)$ (with $p_{IQ}(t)=p_{1A}(t)p_{1B}(t)$, $p_{IIQ}(t)=p_{1A}(t)p_{2B}(t)$, $p_{IIIQ}(t)=p_{2A}(t)p_{1B}(t)$, $p_{IVQ}(t)=p_{2A}(t)p_{2B}(t)$) describe the probabilities of four different states of a stochastic finite state machine. A similar situation occurs in the case of the Schroedinger equation written for two non-interacting systems, but instead of probabilities we have square roots of probabilities times phase factors. In the general case of interaction between systems A and B in the classical epidemic model we have
$$\hat{H}_0(t) + \hat{H}_{A-B}(t)$$
$$= E_I(t)\,|\psi_{E1A}(t)\psi_{E1B}(t)\rangle\langle\psi_{E1A}(t)\psi_{E1B}(t)| + E_{II}(t)\,|\psi_{E1A}(t)\psi_{E2B}(t)\rangle\langle\psi_{E1A}(t)\psi_{E2B}(t)|$$
$$+ E_{III}(t)\,|\psi_{E2A}(t)\psi_{E1B}(t)\rangle\langle\psi_{E2A}(t)\psi_{E1B}(t)| + E_{IV}(t)\,|\psi_{E2A}(t)\psi_{E2B}(t)\rangle\langle\psi_{E2A}(t)\psi_{E2B}(t)|$$
$$+ e_{(1A,1B)\to(1A,2B)}(t)\,|\psi_{E1A}(t)\psi_{E1B}(t)\rangle\langle\psi_{E1A}(t)\psi_{E2B}(t)| + e_{(1A,2B)\to(1A,1B)}(t)\,|\psi_{E1A}(t)\psi_{E2B}(t)\rangle\langle\psi_{E1A}(t)\psi_{E1B}(t)|$$
$$+ e_{(2A,1B)\to(1A,1B)}(t)\,|\psi_{E2A}(t)\psi_{E1B}(t)\rangle\langle\psi_{E1A}(t)\psi_{E1B}(t)| + e_{(1A,1B)\to(2A,1B)}(t)\,|\psi_{E1A}(t)\psi_{E1B}(t)\rangle\langle\psi_{E2A}(t)\psi_{E1B}(t)|$$
$$+ e_{(1A,2B)\to(2A,2B)}(t)\,|\psi_{E1A}(t)\psi_{E2B}(t)\rangle\langle\psi_{E2A}(t)\psi_{E2B}(t)| + e_{(2A,2B)\to(1A,2B)}(t)\,|\psi_{E2A}(t)\psi_{E2B}(t)\rangle\langle\psi_{E1A}(t)\psi_{E2B}(t)|$$
$$+ e_{(2A,1B)\to(2A,2B)}(t)\,|\psi_{E2A}(t)\psi_{E1B}(t)\rangle\langle\psi_{E2A}(t)\psi_{E2B}(t)| + e_{(2A,2B)\to(2A,1B)}(t)\,|\psi_{E2A}(t)\psi_{E2B}(t)\rangle\langle\psi_{E2A}(t)\psi_{E1B}(t)|$$
$$+ e_{(1A,1B)\to(2A,2B)}(t)\,|\psi_{E1A}(t)\psi_{E1B}(t)\rangle\langle\psi_{E2A}(t)\psi_{E2B}(t)| + e_{(2A,2B)\to(1A,1B)}(t)\,|\psi_{E2A}(t)\psi_{E2B}(t)\rangle\langle\psi_{E1A}(t)\psi_{E1B}(t)|. \tag{57}$$

The presented analytical approach can be applied to a system A with four distinct states as well as to a system B with four distinct states, since an isolated system A (or B) with four states is described by a four-by-four evolution matrix that has four analytical eigenvalues and eigenstates. It becomes non-analytical for five or more distinct states, because the roots of polynomials of order higher than four cannot in general be written in closed form and must be computed numerically, with some limited exceptions. We can write the matrix

$$\hat{H}_{E_{IQ},\ldots,E_{IVQ}} = \begin{pmatrix} E_{IQ}(t) & e_{(1A,2B)\to(1A,1B)} & e_{(2A,1B)\to(1A,1B)} & e_{(2A,2B)\to(1A,1B)}\\ e_{(1A,1B)\to(1A,2B)} & E_{IIQ}(t) & e_{(2A,1B)\to(1A,2B)} & e_{(2A,2B)\to(1A,2B)}\\ e_{(1A,1B)\to(2A,1B)} & e_{(1A,2B)\to(2A,1B)} & E_{IIIQ}(t) & e_{(2A,2B)\to(2A,1B)}\\ e_{(1A,1B)\to(2A,2B)} & e_{(1A,2B)\to(2A,2B)} & e_{(2A,1B)\to(2A,2B)} & E_{IVQ}(t)\end{pmatrix}. \tag{58}$$
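For two non-interacting subsystems, the four-by-four generator of Eq. (56) is the Kronecker sum of the two one-subsystem matrices. A sketch with assumed coefficients confirms the diagonal structure shown there:

```python
def kron(a, b):
    """Kronecker product of two square matrices given as nested lists."""
    n, m = len(a), len(b)
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def madd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Assumed one-subsystem evolution matrices for A and B.
SA = [[-0.2, 0.1], [0.2, -0.1]]
SB = [[-0.4, 0.3], [0.4, -0.3]]
I2 = [[1.0, 0.0], [0.0, 1.0]]

# H0 = SA (x) I + I (x) SB, acting on (p1A*p1B, p1A*p2B, p2A*p1B, p2A*p2B).
H0 = madd(kron(SA, I2), kron(I2, SB))

# Diagonal entries reproduce the sums s11A+s11B, s11A+s22B, ... of Eq. (56),
# and the corners vanish as in the matrix written there.
assert H0[0][0] == SA[0][0] + SB[0][0]
assert H0[1][1] == SA[0][0] + SB[1][1]
assert H0[2][2] == SA[1][1] + SB[0][0]
assert H0[3][3] == SA[1][1] + SB[1][1]
assert H0[0][3] == 0.0 and H0[3][0] == 0.0
```

The interaction term $\hat{H}_{A-B}$ of Eq. (57) then adds the off-diagonal $e$-couplings on top of this Kronecker sum.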

7 From Epidemic Model to Tight-Binding Equations

7.1 Case of Two Level Classical Stochastic Finite State Machine

Let us be motivated by work on single-electron devices [1–4, 6, 12, 15–17, 21–24]. Instead of probabilities it will be useful to operate with square roots of probabilities, as they are present in quantum mechanics and in the Schroedinger or Dirac equation. Since $\frac{d}{dt}(\sqrt{p_1}\sqrt{p_1}) = 2\sqrt{p_1(t)}\,\frac{d}{dt}\sqrt{p_1(t)}$ and $\frac{d}{dt}(\sqrt{p_2}\sqrt{p_2}) = 2\sqrt{p_2(t)}\,\frac{d}{dt}\sqrt{p_2(t)}$, we can rewrite the epidemic equation as

$$\begin{pmatrix} \frac{1}{2}s_{11}(t) & \frac{1}{2}\sqrt{\frac{p_2(t)}{p_1(t)}}\,s_{12}(t)\\[4pt] \frac{1}{2}\sqrt{\frac{p_1(t)}{p_2(t)}}\,s_{21}(t) & \frac{1}{2}s_{22}(t)\end{pmatrix}\begin{pmatrix}\sqrt{p_1(t)}\\ \sqrt{p_2(t)}\end{pmatrix} = \frac{d}{dt}\begin{pmatrix}\sqrt{p_1(t)}\\ \sqrt{p_2(t)}\end{pmatrix}. \tag{59}$$
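The square-root substitution of Eq. (59) can be checked by integrating the original epidemic equations and the transformed ones side by side; a sketch with illustrative constant coefficients (chosen so that column sums vanish and total probability is conserved):

```python
# Assumed constant epidemic coefficients (columns sum to zero).
s11, s12, s21, s22 = -0.3, 0.2, 0.3, -0.2
p1, p2 = 0.6, 0.4                       # probabilities
q1, q2 = p1 ** 0.5, p2 ** 0.5           # square roots of probabilities
dt, steps = 2e-5, 50000                 # Euler integration up to t = 1

for _ in range(steps):
    # Original epidemic equations: dp/dt = S p.
    dp1 = (s11 * p1 + s12 * p2) * dt
    dp2 = (s21 * p1 + s22 * p2) * dt
    # Transformed equations of Eq. (59), with sqrt(p2/p1) = q2/q1:
    # dq1/dt = (1/2) s11 q1 + (1/2) s12 q2^2 / q1, and symmetrically for q2.
    dq1 = (0.5 * s11 * q1 + 0.5 * s12 * q2 * q2 / q1) * dt
    dq2 = (0.5 * s21 * q1 * q1 / q2 + 0.5 * s22 * q2) * dt
    p1, p2, q1, q2 = p1 + dp1, p2 + dp2, q1 + dq1, q2 + dq2

# The squares of the transformed variables track the probabilities.
assert abs(q1 * q1 - p1) < 1e-3 and abs(q2 * q2 - p2) < 1e-3
```

The small residual is the Euler discretization error; the two formulations are equivalent in the continuum limit.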

The following notation is introduced:

$$S_{12}\Big[\frac{t_0}{i\hbar},t_1\Big] = \frac{1}{2}\sqrt{\frac{p_1(t)}{p_2(t)}}\,s_{12}\Big(\frac{t_1}{i\hbar}\Big) = S_{12R}\Big[\frac{t_0}{i\hbar},t_1\Big] + iS_{12I}\Big[\frac{t_0}{i\hbar},t_1\Big],$$
$$S_{21}\Big[\frac{t_0}{i\hbar},t_1\Big] = \frac{1}{2}\sqrt{\frac{p_2(t)}{p_1(t)}}\,s_{21}\Big(\frac{t_1}{i\hbar}\Big) = S_{21R}\Big[\frac{t_0}{i\hbar},t_1\Big] + iS_{21I}\Big[\frac{t_0}{i\hbar},t_1\Big], \tag{60}$$

and one obtains

$$\begin{pmatrix} \frac{1}{2}s_{11}\big(\frac{t_1}{i\hbar}\big) - \hbar\frac{d}{dt_1}\Theta_1(t) & \big(S_{12R}\big[\frac{t_0}{i\hbar},t_1\big] + iS_{12I}\big[\frac{t_0}{i\hbar},t_1\big]\big)e^{i(\Theta_1(t)-\Theta_2(t))}\\[4pt] \big(S_{21R}\big[\frac{t_0}{i\hbar},t_1\big] + iS_{21I}\big[\frac{t_0}{i\hbar},t_1\big]\big)e^{i(\Theta_2(t)-\Theta_1(t))} & \frac{1}{2}s_{22}\big(\frac{t_1}{i\hbar}\big) - \hbar\frac{d}{dt_1}\Theta_2(t)\end{pmatrix}\begin{pmatrix}\sqrt{p_1\big(\frac{t_1}{i\hbar}\big)}\,e^{i\Theta_1(t)}\\ \sqrt{p_2\big(\frac{t_1}{i\hbar}\big)}\,e^{i\Theta_2(t)}\end{pmatrix} = i\hbar\frac{d}{dt}\begin{pmatrix}\sqrt{p_1\big(\frac{t_1}{i\hbar}\big)}\,e^{i\Theta_1(t)}\\ \sqrt{p_2\big(\frac{t_1}{i\hbar}\big)}\,e^{i\Theta_2(t)}\end{pmatrix}.$$

Let us start from the quantum mechanical perspective:

$$\begin{pmatrix} E_{p1} & t_{sR} + it_{sI}\\ t_{sR} - it_{sI} & E_{p2}\end{pmatrix}\begin{pmatrix}\sqrt{p_1}\cos(\Theta_1) + i\sqrt{p_1}\sin(\Theta_1)\\ \sqrt{p_2}\cos(\Theta_2) + i\sqrt{p_2}\sin(\Theta_2)\end{pmatrix} = i\hbar\frac{d}{dt_1}\begin{pmatrix}\sqrt{p_1}\cos(\Theta_1) + i\sqrt{p_1}\sin(\Theta_1)\\ \sqrt{p_2}\cos(\Theta_2) + i\sqrt{p_2}\sin(\Theta_2)\end{pmatrix}, \tag{61}$$

which can be written as

$$\hat{A}(t)\begin{pmatrix}\sqrt{p_1(t)}\cos(\Theta_1(t))\\ \sqrt{p_1(t)}\sin(\Theta_1(t))\\ \sqrt{p_2(t)}\cos(\Theta_2(t))\\ \sqrt{p_2(t)}\sin(\Theta_2(t))\end{pmatrix} = \frac{d}{dt}\begin{pmatrix}\sqrt{p_1(t)}\cos(\Theta_1(t))\\ \sqrt{p_1(t)}\sin(\Theta_1(t))\\ \sqrt{p_2(t)}\cos(\Theta_2(t))\\ \sqrt{p_2(t)}\sin(\Theta_2(t))\end{pmatrix}, \tag{62}$$

$$e^{\int_{t_0}^{t}\hat{A}(t')dt'}\begin{pmatrix}\sqrt{p_1(t_0)}\cos(\Theta_1(t_0))\\ \sqrt{p_1(t_0)}\sin(\Theta_1(t_0))\\ \sqrt{p_2(t_0)}\cos(\Theta_2(t_0))\\ \sqrt{p_2(t_0)}\sin(\Theta_2(t_0))\end{pmatrix} = \begin{pmatrix}\sqrt{p_1(t)}\cos(\Theta_1(t))\\ \sqrt{p_1(t)}\sin(\Theta_1(t))\\ \sqrt{p_2(t)}\cos(\Theta_2(t))\\ \sqrt{p_2(t)}\sin(\Theta_2(t))\end{pmatrix}. \tag{63}$$

The last set of equations is equivalent to

$$2\begin{pmatrix}\sqrt{p_1(t)\cos^2\Theta_1(t)} & 0 & 0 & 0\\ 0 & \sqrt{p_1(t)\sin^2\Theta_1(t)} & 0 & 0\\ 0 & 0 & \sqrt{p_2(t)\cos^2\Theta_2(t)} & 0\\ 0 & 0 & 0 & \sqrt{p_2(t)\sin^2\Theta_2(t)}\end{pmatrix}\,e^{\int_{t_0}^{t}\hat{A}(t')dt'}\,\times$$

$$\times\begin{pmatrix}\frac{1}{\sqrt{p_1(t)\cos^2\Theta_1(t)}} & 0 & 0 & 0\\ 0 & \frac{1}{\sqrt{p_1(t)\sin^2\Theta_1(t)}} & 0 & 0\\ 0 & 0 & \frac{1}{\sqrt{p_2(t)\cos^2\Theta_2(t)}} & 0\\ 0 & 0 & 0 & \frac{1}{\sqrt{p_2(t)\sin^2\Theta_2(t)}}\end{pmatrix}\begin{pmatrix}\sqrt{p_1(t_0)\cos^2\Theta_1(t_0)} = p_{1R}(t)\\ \sqrt{p_1(t_0)\sin^2\Theta_1(t_0)} = p_{1I}(t)\\ \sqrt{p_2(t_0)\cos^2\Theta_2(t_0)} = p_{2R}(t)\\ \sqrt{p_2(t_0)\sin^2\Theta_2(t_0)} = p_{2I}(t)\end{pmatrix}$$

$$= \frac{d}{dt}\begin{pmatrix}p_1(t)\cos^2\Theta_1(t)\\ p_1(t)\sin^2\Theta_1(t)\\ p_2(t)\cos^2\Theta_2(t)\\ p_2(t)\sin^2\Theta_2(t)\end{pmatrix}.$$

8 Conclusions
There are various deep analogies between classical statistical mechanics and quantum mechanics, as given by [10, 11]. The obtained results show that quantum mechanical phenomena might be almost entirely simulated by a classical statistical model. This includes quantum-like entanglement [9, 19] and the superposition of states. Therefore coupled epidemic models expressed by classical systems in terms of classical physics can be the basis for a possible incorporation of quantum technologies, and in particular for quantum-like computation and quantum-like communication. In the conducted computations Wolfram software was used [18]. All work presented at [12, 20] can be expressed by the classical epidemic model. It is expected that time crystals can also be described in the given framework [13, 14]. It is an open issue to what extent we can parameterize various condensed matter phenomena [3–5, 7, 8, 22, 25] by a stochastic finite state machine.

References
1. Likharev, K.K.: Single-electron devices and their applications. Proc. IEEE 87, 606–632
(1999)
2. Leipold, D.: Controlled Rabi Oscillations as foundation for entangled quantum aperture logic. Seminar at UC Berkeley Quantum Labs (2018)
3. Fujisawa, T., Hayashi, T., Cheong, H.D., Jeong, Y.H., Hirayama, Y.: Rotation and phase-shift operations for a charge qubit in a double quantum dot. Physica E Low-Dimensional Syst. Nanostruct. 21(2–4), 1046–1052 (2004)
4. Petersson, K.D., Petta, J.R., Lu, H., Gossard, A.C.: Quantum coherence in a one-electron
semiconductor charge qubit. Phys. Rev. Lett. 105, 246804 (2010)
5. Giounanlis, P., Blokhina, E., Pomorski, K., Leipold, D.R., Staszewski, R.B.: Modeling of
semiconductor electrostatic qubits realized through coupled quantum dots. IEEE Access 7,
49262–49278 (2019)
6. Bashir, I., et al.: A mixed-signal control core for a fully integrated semiconductor quantum
computer system-on-chip. In: Proceedings of IEEE European Solid-State Circuits Confer-
ence (ESSCIRC) (2019)
7. Spalek, J.: Wstep do fizyki materii skondensowanej. PWN (2015)

8. Jaynes, E.T., Cummings, F.W.: Comparison of quantum and semiclassical radiation theories
with application to the beam maser. Proc. IEEE 51(1), 89–109 (1963)
9. Angelakis, D.G., Mancini, S., Bose, S.: Steady state entanglement between hybrid light-
matter qubits. arXiv:0711.1830 (2008)
10. Wetterich, C.: Quantum mechanics from classical statistics. arxiv:0906.4919 (2009)
11. Baez, J.C., Pollard, B.S.: Quantropy. http://math.ucr.edu/home/baez/quantropy.pdf
12. Pomorski, K., Staszewski, R.B.: Analytical solutions for N-electron interacting system con-
fined in graph of coupled electrostatic semiconductor and superconducting quantum dots in
tight-binding model with focus on quantum information processing (2019). https://arxiv.org/
abs/1907.03180
13. Wilczek, F.: Quantum time crystals. Phys. Rev. Lett. 109, 160401 (2012)
14. Sacha, K., Zakrzewski, J.: Time crystals: a review. Rep. Prog. Phys. 81(1), 016401 (2017)
15. Pomorski, K., Giounanlis, P., Blokhina, E., Leipold, D., Staszewski, R.B.: Analytic view on
coupled single-electron lines. Semicond. Sci. Technol. 34(12), 125015 (2019)
16. Pomorski, K., Staszewski, R.B.: Towards quantum internet and non-local communication in position-based qubits. AIP Conf. Proc. 2241, 020030 (2020). https://doi.org/10.1063/5.0011369, arXiv:1911.02094
17. Pomorski, K., Giounanlis, P., Blokhina, E., Leipold, D., Peczkowski, P., Staszewski, R.B.:
From two types of electrostatic position-dependent semiconductor qubits to quantum univer-
sal gates and hybrid semiconductor-superconducting quantum computer. In: Proceedings of
SPIE, vol. 11054 (2019)
18. Wolfram Mathematica. http://www.wolfram.com/mathematica/
19. Wikipedia: Bell theorem
20. Pomorski, K.: Seminars on quantum technologies at YouTube channel: quantum hardware
systems (2020). https://www.youtube.com/watch?v=Bhj ZF36APw
21. Pomorski, K.: Analytical view on non-invasive measurement of moving charge by position dependent semiconductor qubit. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) FTC 2020. AISC, vol. 1289, pp. 31–53. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-63089-8_3
22. Pomorski, K.: Analytical view on tunnable electrostatic quantum swap gate in tight-binding
model. arXiv:2001.02513 (2019)
23. Pomorski, K.: Analytic view on N body interaction in electrostatic quantum gates and deco-
herence effects in tight-binding model. Int. J. Quantum Inf. 19(04), 2141001 (2021)
24. Pomorski, K.: Analytical view on N bodies interacting with quantum cavity in tight-binding
model. arXiv:2008.12126 (2020)
25. Pomorski, K., Peczkowski, P., Staszewski, R.: Analytical solutions for N interacting elec-
tron system confined in graph of coupled electrostatic semiconductor and superconducting
quantum dots in tight-binding model. Cryogenics 109, 103117 (2020)
Effects of Various Barricades on Human Crowd
Movement Flow

Andrew J. Park1(B) , Ryan Ficocelli2 , Lee Patterson3 , Frank Dodich3 , Valerie Spicer4 ,
and Herbert H. Tsang1
1 Trinity Western University, Langley, BC V2Y 1Y1, Canada
{a.park,herbert.tsang}@twu.ca
2 Thompson Rivers University, Kamloops, BC V2C 0C8, Canada
[email protected]
3 Justice Institute British Columbia, New Westminster, BC V3L 5T4, Canada
[email protected]
4 Simon Fraser University, Burnaby, BC V5A 1S6, Canada

[email protected]

Abstract. Human crowd movement flow has been studied in various disciplines
such as computing science, physics, engineering, urban planning, etc., for many
decades. Some studies focused on the management of big crowds in public events
whereas others investigated the egress of the crowd in emergency cases. Optimal
flows of a human crowd have been a particular interest among many researchers.
This paper presents how various physical barricades affect human crowd move-
ment flow using a social force model. Simulation experiments of bidirectional
crowd flows were conducted with/without barricades in a straight-line street. The
barricades with various lengths and rotations were tested to discover optimal flows
of a crowd at various crowd densities. The experimental results show that
setting up barricades with a particular length and rotation generates a better flow
of the crowd compared to the situations without either or both of them. This study
can help the management of crowds at public events by setting up physical
barricades strategically to produce optimal flows of a crowd.

Keywords: Crowd flow · Crowd management · Barricades · Agent-modeling and simulation · Social force

1 Introduction
A traditional school of thought held that a crowd is homogeneous and irrational. Contemporary studies, however, support the view that a crowd can be heterogeneous and rational. The flow of human crowd movement differs from that of a fluid, since each member of the crowd can make their own rational decisions, although they may have a tendency to stay coherent with the crowd and become part of the crowd flow. Emotions can influence such crowd flow, which might lead to a disastrous result such as a panic situation in the case of an emergency. Well-planned crowd management strategies can generate an optimal flow of a crowd and mitigate potential harms and dangers at public events.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 493–510, 2023.
https://doi.org/10.1007/978-3-031-18461-1_32
494 A. J. Park et al.

Crowd modeling and simulation has been a popular research topic in various disciplines
including computer graphics and animation, physics, engineering, urban planning, safety
science, transportation science, etc. Some studies were interested in generating realistic
crowd movement while others examined various crowd management strategies. Still
other studies investigated the crowd emergency egress in the case of fire or any disasters.
Although their applications might be different, optimal flows of a human crowd have
been a common interest among these researchers.
Various modeling techniques have been used to model a crowd or pedestrians. Some
common techniques for crowd/pedestrian modeling are agent-based modeling, social
force modeling, cellular automata modeling, and fluid dynamic modeling [13]. This
study has used the techniques of social force modeling.
Although there have been some studies of crowd flows with physical barricades,
it seems that systematic studies of the effects of barricades with various lengths and
rotations on crowd flows are lacking. This paper tries to fill such a gap by systematically
investigating how various settings of physical barricades affect the flow of a crowd. Sim-
ulation experiments were conducted with various lengths of barricades that are rotated
at various degrees in a straight-line street where two groups of a crowd were going to
the other side from each end. The experimental results show that barricade settings with
a specific length and rotation produce a better flow of a crowd than barricades with other
lengths and rotations or no barricade at all.
The paper is organized as follows: the background section surveys the concept and
model of a crowd and reviews various crowd modeling techniques including social force
modeling. Various studies of crowd or pedestrian flows are reviewed. The section on
simulation experiments introduces the experiments conducted with barricades of vari-
ous lengths and rotations and shows the results. The discussion section has comprehen-
sive analyses of the experimental results. The future plan is proposed followed by the
acknowledgement.

2 Background

The definition of a crowd has been debated for decades. Le Bon defined a crowd in
his seminal book, “The Crowd: A Study of the Popular Mind” as “a group of individu-
als united by a common idea, belief, or ideology” [14]. A crowd may show a collective
behavior such as a protest against authorities, which seems emotionally charged and irrational. The understanding of a crowd as a group of homogeneous entities was prevalent due to early sociological writings during the French Revolution in the 1790s [11]. Contemporary studies, however, regard a crowd as a group of heterogeneous entities who are
rational and make their own decisions [3, 12]. It can be argued that both kinds of a crowd
can exist depending on what kind of events they participate in and what kind of situations
they are in. Political or sporting events can create the former kind of a crowd (traditional)
such as protests against governments or hooliganism against rival fans whereas the latter
kind of a crowd (contemporary) is common in peaceful, public events such as events that
celebrate national holidays. This study considers a peaceful crowd at a celebratory event.
When a big crowd gathers at this kind of event, strategies for managing such a crowd
need to be implemented well in advance. In particular, creating optimal flows of a crowd
Effects of Various Barricades on Human Crowd Movement Flow 495

to the event venue and out of the venue are of importance. Our previous paper shows the
crowd swirling at the intersection when they are not managed [17]. A simple strategy of
placing a police line (a barricade) diagonally at the intersection helps a better flow of a
crowd although some members of the crowd are forced to certain directions. This study
is an extension of the former study to investigate the effects of physical barricades on
the flow of a crowd.
Various methods and techniques have been used to study crowd behaviors. Field
observation of a crowd produces ground truth. However, it is time-consuming and labor-
intensive. Recent studies show that crowd or pedestrian movement can be tracked by analyzing video records or by tracking GPS information from each member's phone [1, 24].
Controlled experiments with human participants in artificial, physical settings can be an
alternative to field observation for crowd study in a systematic way [4, 10]. Developing
mathematical or computational models of a crowd has been popular among researchers
who study crowd behaviors. Some common mathematical or computational modeling
techniques are as follows:

• Agent-Based Modeling: Each agent can represent a pedestrian with simple and nec-
essary characteristics. Multiple agents are simulated to generate emergent behaviors
[15, 19, 21].
• Social Force Modeling: Social force modeling is a kind of agent-based modeling. It
is particularly used for human crowd modeling with three forces: a force that accel-
erates towards the desired velocity of motion; a force that keeps a certain distance
between pedestrians and between pedestrians and borders; and a force that attracts
other pedestrians [8].
• Cellular Automata Modeling: Colored cells of a grid can represent pedestrians which
evolve at each discrete time step with a set of rules [18, 22, 23].
• Fluid Dynamic Modeling: A crowd with high density behaves like fluid flows [2, 5,
9].

Some emergent behaviors of a crowd are observed, which include lane formation
(channeling), self-organization, swirling, and bottleneck [16].
There have been studies of crowd (pedestrian) flows with physical barricades (barri-
ers or obstacles). Either unidirectional or bidirectional flows of a crowd (pedestrians) are
observed in a field or a controlled setting or computationally simulated in a straight-line
street (or corridor), a T-junction, or an intersection with barricades with various shapes
[6, 7, 20].
This study uses social force modeling techniques to simulate a crowd with a large
number of pedestrians and generate emergent behaviors. Barricades can be either static
(physical barricades) or dynamic (human cordons). This study focuses on effects of
physical, static barricades of various lengths and rotations in a straight-line street with bidirectional flows of a crowd.

3 Development of Social Force Crowd Model System


The software presented in this paper allows for the modeling of crowd movements
through a social force model. The movement of an individual in this system can be

thought of as a combination of several different forces which are combined together


to create an overall movement. Each of the social forces is given a weight, and this
represents the portion of the overall movement that the force accounts for. For instance,
if one force has weight 1, and a second has weight 2, then the second force accounts for
twice as much of the overall movement as the first force. These values are proportional,
so if in one scenario the first force has weight 1 and a second force has weight 2, then
in a second scenario the first force has weight 2 and the second force has weight 4, then
in both scenarios, the forces account for the same percentage of the overall movement.
There are four social forces in our model: cohesion, avoidance, target seeking, and wall
avoidance.
Cohesion is the social force that models the tendency of individuals that are moving
in the same direction to move as a group. In the software, this is accomplished by taking
the average of the locations of nearby individuals that are moving in the same direction.
The individuals are considered nearby if they are within a 5-unit sphere of the agent
in question, and the agents are considered to be moving in the same direction if their
movements are within 90° of each other. By taking an average of the nearby agents
moving in the same direction, the individual tends to move towards the center of the
group, and thus the agents form a group while moving.
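The cohesion rule described above can be sketched as follows (the helper names are hypothetical; the paper does not list its source code):

```python
import math

COHESION_RADIUS = 5.0  # agents within a 5-unit sphere are "nearby"

def same_direction(v1, v2):
    """Headings within 90 degrees of each other count as the same direction."""
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return dot > 0.0  # angle < 90 degrees <=> positive dot product

def cohesion_force(agent_pos, agent_vel, others):
    """Steer towards the average position of nearby agents moving the same way.

    `others` is a list of (position, velocity) pairs for the other agents.
    """
    group = [pos for pos, vel in others
             if math.dist(agent_pos, pos) <= COHESION_RADIUS
             and same_direction(agent_vel, vel)]
    if not group:
        return (0.0, 0.0)
    cx = sum(p[0] for p in group) / len(group)
    cy = sum(p[1] for p in group) / len(group)
    return (cx - agent_pos[0], cy - agent_pos[1])  # vector towards group centre

# An agent flanked by two companions heading the same way drifts between them.
force = cohesion_force((0.0, 0.0), (1.0, 0.0),
                       [((1.0, 1.0), (1.0, 0.0)), ((1.0, -1.0), (0.9, 0.1))])
assert force == (1.0, 0.0)
```

In the full model this vector would then be scaled by the cohesion weight before being combined with the other forces.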
Avoidance is the social force that models the tendency of individuals to move to avoid
running into other individuals. This is done by reacting to the locations of agents within
a set radius and then moving in the opposite direction. In order to have the agent move
more due to a nearby individual rather than a farther away individual, the movement is
not based solely on the average of the locations of the nearby agents. Instead, the function
produces a vector for each nearby agent. The direction of the vector is from the nearby
agent to the individual that will be moving; hence the moving agent will move away.
The length of the vector is calculated by taking the relative location of the nearby agent
to the individual that is moving and projecting that line onto a sphere that matches the
avoidance radius of the individual. The length of this projection is taken as the length of
the vector. Because the length of the projection is used, the avoidance mechanism reacts
more strongly to nearby agents, which will have a long projection to the sphere, than
to agents that are near the boundary of avoidance, which will have a short projection to
the sphere. Each of these vectors is added together and averaged to create the avoidance
movement. The avoidance radius in our system is 0.75 units, which is smaller than the
5-unit radius of cohesion. This is done so that an individual will move as part of a group
but will move away if others get too close.
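The avoidance rule can be sketched similarly; the strength used here, √(R² − d²), is one plausible reading of the sphere projection described above (long for near agents, short near the rim), and the function names are hypothetical:

```python
import math

AVOID_RADIUS = 0.75  # smaller than the 5-unit cohesion radius

def avoidance_force(agent_pos, others):
    """Move away from agents inside the avoidance radius; closer agents push harder."""
    vectors = []
    for pos in others:
        dx, dy = agent_pos[0] - pos[0], agent_pos[1] - pos[1]  # away from neighbour
        d = math.hypot(dx, dy)
        if d == 0.0 or d >= AVOID_RADIUS:
            continue
        # Assumed projection length onto the avoidance sphere: the half-chord
        # sqrt(R^2 - d^2), so reaction is stronger for nearer agents.
        strength = math.sqrt(AVOID_RADIUS ** 2 - d ** 2)
        vectors.append((dx / d * strength, dy / d * strength))
    if not vectors:
        return (0.0, 0.0)
    # Average the per-neighbour vectors to get the avoidance movement.
    return (sum(v[0] for v in vectors) / len(vectors),
            sum(v[1] for v in vectors) / len(vectors))

near = avoidance_force((0.0, 0.0), [(0.1, 0.0)])
far = avoidance_force((0.0, 0.0), [(0.7, 0.0)])
assert near[0] < 0 and far[0] < 0      # both neighbours push the agent in -x
assert abs(near[0]) > abs(far[0])      # the closer neighbour pushes harder
```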
Target seeking is the social force that drives the individual towards a goal. This simple
force is calculated by taking a vector from the individual to their target and scaling the
length of the vector to match the weight of the target seeking force.
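The target-seeking force is the simplest of the four; as a sketch (hypothetical helper name):

```python
import math

def target_force(agent_pos, target_pos, weight):
    """Unit vector from the agent towards its target, scaled to the force weight."""
    dx, dy = target_pos[0] - agent_pos[0], target_pos[1] - agent_pos[1]
    d = math.hypot(dx, dy)
    if d == 0.0:
        return (0.0, 0.0)  # already at the target
    return (dx / d * weight, dy / d * weight)

fx, fy = target_force((0.0, 0.0), (3.0, 4.0), 2.0)
assert abs(fx - 1.2) < 1e-9 and abs(fy - 1.6) < 1e-9
```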
Obstacle avoidance is the social force that causes the individual to move away from
walls and barricades, along with other non-moving obstacles. This force is implemented
by having 11 rays fired out of the agent in a 180° cone facing the direction of the agent.
If any of the rays collide with an obstacle, the system calculates a force to move away
from the obstacles. This is done by converting each of the rays that did not collide with
an obstacle into a vector and averaging the vectors to get a direction of movement. The
weight of the movement is used to determine the length of the movement vector. Rays

that did not collide with an obstacle but are further away from the center ray than a
ray that did collide with an obstacle are not included in the average calculation. This
allows for the calculated value to be farther from any obstacles, and hence leads to more
movement away from any obstacles.
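The ray-based rule above can be sketched as follows. This is our reading of the description, not the authors' code; in particular, the interpretation of the exclusion rule (dropping collision-free rays that lie farther from the centre ray than a colliding ray) and the fallback when no rays survive are our assumptions.

```python
import numpy as np

def obstacle_avoidance_force(ray_angles, ray_hits, weight=1.0):
    """Sketch of obstacle avoidance from 11 rays in a 180-degree cone.

    ray_angles: ray directions in radians relative to the agent's heading.
    ray_hits:   parallel booleans, True where the ray struck an obstacle.

    Non-colliding rays become unit vectors and are averaged; free rays
    farther from the centre ray than a colliding ray are excluded, and
    the averaged direction is scaled by the movement weight."""
    if not any(ray_hits):
        return np.zeros(2)              # nothing to avoid
    farthest_hit = max(abs(a) for a, h in zip(ray_angles, ray_hits) if h)
    free = [(a, np.array([np.cos(a), np.sin(a)]))
            for a, h in zip(ray_angles, ray_hits) if not h]
    kept = [v for a, v in free if abs(a) <= farthest_hit]
    if not kept:                        # e.g. only the centre ray hit
        kept = [v for _, v in free]     # assumed fallback: use all free rays
    return np.mean(kept, axis=0) * weight
```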

4 Simulation Experiments
The system was used to simulate groups of people moving down a street. In order to
observe the effect of setting up barricades on the flow of people through the street,
immovable barricades were present in some of the simulations. The environment, which
includes a blue barricade, can be seen in Fig. 1. The number of agents moving through the street is counted
each second of the simulation and written into an external file in order to quantify the
movement of the crowd agents through the street (Fig. 2).
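The per-second flow count might be written out as in the following sketch; the file name, CSV layout, and example data are our assumptions, used only to illustrate the logging step.

```python
# Hypothetical sketch of the per-second flow logging: the cumulative or
# per-second number of agents that moved through the street is appended
# to an external file for later analysis.
counts_per_second = [0, 3, 7, 12, 12, 15]  # example data, not from the paper

with open("crowd_flow.csv", "w") as f:
    f.write("second,agents_through\n")
    for second, count in enumerate(counts_per_second):
        f.write(f"{second},{count}\n")
```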

Fig. 1. The simulation street environment, with a 30-unit length barricade.

Fig. 2. The simulation street environment with the spawn and destination possible location
rectangles in red

The crowd agents in the system are represented by colored capsules, which when
viewed from the top-down perspective look like circles. These crowd agents can be seen
in Fig. 3. The dark blue agents on the left side of Fig. 3 began on the left side of the street
and are moving to the right side, while the yellow agents are moving from the right side
to the left side. The coloring of the agents remains constant throughout the simulations,
so the blue agents always move from the left to the right, while the yellow agents always
move from the right to the left.
The effect of the rotation of the barricade on the crowd movement was observed
by running the simulation multiple times with the barricade in different rotations. In
each of the simulations, the barricade had a length of 30 units. This batch of simulations
included scenarios where there was no barricade, where the barricade was straight up
498 A. J. Park et al.

Fig. 3. The crowd agents in the simulation environment

and down the street, where the barricade had been rotated counterclockwise by 5°, and
where the barricade had been rotated counterclockwise by 10°. The environment where
the barricade had been rotated counterclockwise by 10° can be seen in Fig. 4.
The simulation was also run several times in order to observe the effect of the length
of the barricade on the flow of people through the street. The barricade was kept in the
orientation parallel to the direction of the street in all of these experiments. The possible
barricade lengths were: no barricade, 5-unit length barricade, 15-unit length barricade,
and 30-unit length barricade. A 5-unit length barricade can be seen in Fig. 5, a 15-unit
length barricade can be seen in Fig. 6, and a 30-unit length barricade can be seen in
Fig. 1.

Fig. 4. The simulation environment with the barricade rotated counterclockwise by 10°

Fig. 5. The simulation environment with a 5-unit length barricade

In both sets of experiments, the density of the agents was also varied to see if the
density of the crowd agents in the street changed the effects of the barricades. This was
accomplished by changing the number of crowd agents spawned in each spawning cycle.
The number of agents spawned was set to either 10, 75, or 100.

Fig. 6. The simulation environment with a 15-unit length barricade

5 Experimental Results
In the absence of barricades, the groups of crowd agents exhibited different macroscopic
behaviors based on the density of crowd agents in the simulation. In the low-density
scenarios, where only 10 agents were spawned each spawning cycle, the crowds tended
to form straight-line groups as they moved towards each other. As these straight-line
groups meet up with each other, the crowd agents at the front of the group move around
each other, both moving to different sides. These agents continue moving forward, but
avoid moving inside the group moving in the opposite direction. Agents that follow the
head of the group exhibit cohesion towards the leader and avoidance from the agents
moving in the opposite direction, leading to a natural channel forming between the two
groups. This channelization can be seen in Fig. 7.

Fig. 7. Two low-density crowds forming channels to pass each other

In the high-density scenarios, where agents are spawned in groups of 75 or 100, the
groups tend to form more spherical shapes, which can be seen in Fig. 8. This is due to
both the large number of agents present and the cohesion behavior of the agents. As these
large, spherical groups collide with each other, the two groups form one giant sphere as
the crowds attempt to move past each other, which can be seen in Fig. 9. People near
the center of this mass are unable to make meaningful progress through the group, as
they keep colliding and avoiding each other. People that are located near the edges of the
sphere are able to move past each other, as there is more room to move. This leads to the
people around the outside of the group being able to move faster than those in the center,
and as they move, they create space for the more central agents to be able to move. This
in turn means that the large crowds move past each other with the most exterior agents
leaving first, then the more central agents leaving afterwards. This phenomenon is best
seen in Fig. 10, where the blue agents can be seen moving farther to the right (towards
their destination) around the edges, and the yellow agents can be seen moving farther
to the left around the edges. By looking at Fig. 8, Fig. 9, and then Fig. 10 in sequence,
one can see how the two groups move around each other. The fact that the centers of the
two groups must wait for the peripheral crowd agents to clear before being able to move
means that the group as a whole ends up moving slower.

Fig. 8. Two high-density crowds approaching each other.

Fig. 9. Two high-density crowds colliding with each other.

The clustering of the groups can be seen in the low-density scenarios as well, albeit
at a smaller scale. Even though the two groups naturally form channels to avoid each
other, at some point some of the agents will need to cross the opposite direction group
in order to reach their destination. As they move across, they cause collisions with the
other group and cause them to slow down, forming smaller spherical clusters. Although
it does not take as long for these clusters to clear due to their smaller size, they do lead to
a slower overall movement of the group, just like in the high-density cluster case. These
small clusters can be seen in Fig. 11.

Fig. 10. Two high-density crowds moving past each other

Fig. 11. Low-density crowds clustering as they cross each other

The graph of the results of the simulations where the crowd was spawned in groups
of 75 and where there were straight barricades of varying lengths can be seen in Fig. 12.
The simulation where there was no barricade was the worst for the flow of crowd agents,
while the best flow was achieved when there was a 30-unit length barricade present. The
5-unit length barricade scenario led to a better flow than in the no barricade scenario, and
the 15-unit length barricade scenario had a better flow than the 5-unit length barricade.

Fig. 12. The graph of the flow of crowd agents through the street where the crowd was spawned
in groups of 75, and the simulations varied the length of the central barricade.

When looking at the graph of the results from the highest density scenarios with the
straight barricades, as in Fig. 13, it can be seen that the efficacy of the longer barricades
is more pronounced.
The graph of the results of the low-density experiments, where crowd agents were
spawned in groups of 10, where there were straight barricades of varying lengths can be
seen in Fig. 14.
The other set of simulations looked at the effect of the rotation of the barricade on the
flow of the crowd through the street, where the barricade was either not present, straight
down the street, rotated counterclockwise by 5°, or rotated counterclockwise by 10°.
The graph of the results of these simulations, where the crowd was spawned in groups
of 10, can be seen in Fig. 15. In these low-density scenarios, it can be seen in Fig. 15
that the worst-performing group was when there was no barricade, and the flow of the
crowd increased as the rotation of the barricade became greater.
While the low-density scenarios have better flows with higher rotation of the barri-
cade, the opposite is true in the high-density scenarios. The graph of the results of the
simulations where the crowds were spawned in groups of 75 can be seen in Fig. 16, and
the graph of the results of the simulations where the crowds were spawned in groups of
100 can be seen in Fig. 17. In both of the high-density scenarios, the lower the rotation
of the barricade, the better the crowd flow was through the street.

Fig. 13. The graph of the flow of crowd agents through the street where the crowd was spawned
in groups of 100, and the simulations varied the length of the central barricade.

Fig. 14. The graph of the flow of crowd agents through the street where the crowd was spawned
in groups of 10, and the simulations varied the length of the central barricade.

6 Discussion
The graphical results of the simulation experiments where the crowd was spawned in
groups of 75 and the length of the barricade was varied between experiments are shown

Fig. 15. The graph of the flow of crowd agents through the street where the crowd was spawned
in groups of 10, and the simulations varied the rotation of the central barricade.

Fig. 16. The graph of the flow of crowd agents through the street where the crowd was spawned
in groups of 75, and the simulations varied the rotation of the central barricade.

in Fig. 12. Since the 30-unit length barricade scenario had better crowd flow than the
15-unit length barricade scenario, which in turn had a better crowd flow than the 5-unit
length barricade scenario, the conclusion is that longer barricades lead to a better flow
through the street at high crowd densities. This conclusion is also supported by the fact

Fig. 17. The graph of the flow of crowd agents through the street where the crowd was spawned
in groups of 100, and the simulations varied the rotation of the central barricade.

that the scenario with no barricade was the worst-performing scenario out of all of the
scenarios in Fig. 12. While the poor performance with no barricade can be explained by
the clustering of the groups, as seen in Fig. 9, the better performance due to a barricade
can be gleaned from looking at several images of the simulations with the barricades
present. As the two groups collide, they appear to form a cluster that has been bisected
by the barricade, as seen in Fig. 18. This is because the crowd agents tend to move
towards the center of the street due to cohesion, but then spread outwards as the two
groups meet in the middle. While this by itself does not explain the better performance
of when the barricades are present, the reason can be found in Fig. 19. What happens
when the two groups meet is that the furthest forward members of the groups meet, and
eventually one moves away from the barricade. Without the barricade, the other leader
would likely move in the opposite direction, but due to the barricade, they simply move
against the barricade. As the one leader moves away from the barricade, agents that
come afterwards move to that side as well due to cohesion, which allows the barricade-
side agent to progress further. This effectively makes a wedge-like channel through the
group, as the barricade-side group tends to follow each other. In Fig. 19, this wedging
can be seen as the blue agents are forced away from the barricade, and the yellow agents
are allowed to progress further to the right along the barricade. If the barricade is long
enough, this wedge can get through the entire group and allow for easier movement of
the crowd. Conversely, if the barricade is not long enough, the structure of the wedge
does not hold as the crowd leaves the barricade, and thus they must wait for the exterior
crowd agents to clear before moving further. Although the shorter barricades do not see
the agents through the entire cluster, they do allow for some progress to be made, which
leads to a better flow that is proportional to the length of the barricade.

Fig. 18. Two high-density crowds colliding with each other with a barricade present

Fig. 19. The yellow group presses against the barricade, allowing them to wedge past the blue
group and progress to the left.

Similar to the simulations where the crowd was spawned in groups of 75, the graph
showing the results of the simulations where the crowd was spawned in groups of 100
and the effects of varying lengths of barricades, Fig. 13, shows that longer barricades
lead to better crowd flow through the street. This is a consequence of the large cluster of
agents that forms due to the higher density, which the wedging helps to pierce through
to facilitate movement.
Likewise, in the low-density scenarios, the worst-performing group was where there
was no barricade, while the best-performing group was where there was the longest
barricade of 30-unit length. The differences of crowd flows can be seen in Fig. 14. The
effects of the small clusters on the flow of the overall group can be best seen in the no
barricade scenario. As the clusters form, the overall movement of the group slows down,
causing the graph to go more horizontal. Then, as the clusters move past each other, the
graph becomes more vertical. This process oscillates, causing the S-shaped curve in the
graph. While clusters do form in the scenarios where there is a barricade, the wedging
phenomenon helps lessen the effect of the clusters by allowing the crowds to move past
each other. Once the crowds move past the barricades, the small clusters have more of a
slowing effect on the group movement, and thus the shorter barricades perform slightly
worse than the longer barricades, albeit not as noticeable as in the high-density scenarios.
Whether it is a high-density scenario or a low-density scenario, the results of the
experiments lead to the recommendation of the use of long barricades to facilitate group
movements down a street.
The next set of experiments looked at the effects of rotating a 30-unit length barricade
on the crowd flow through the street. The results of the low-density scenarios, where
the crowd was spawned in groups of 10, can be seen in Fig. 15. In the graph, the worst-
performing group was when there was no barricade, and the flow of the crowd increased
as the rotation of the barricade became greater. The rotation of the barricade leads the
crowds to be naturally separated from each other, as the wider side of the street funnels
the crowd to a particular side. This separation can be seen in Fig. 20. This separation
works better than no barricade due to the minimal clustering that occurs. Similarly,
the rotated barricade leads to less clustering than the straight barricade, and while the
straight barricade allows for wedging, that process is slower than if the groups never
meet. The difference between the 10° rotation and the 5° rotation comes from the point
where the crowds have moved past the barricade. As the group nears their destination,
some of them have to move across the newly spawned group to reach their destination.
This cross-flow can be seen in Fig. 21. In the 5° rotation case, this can cause some of
the newly spawned crowd to move to the “wrong side” of the barricade, the side with
the smaller opening. This means that as they move towards their destination, they are
fighting against the flow of the more numerous group, which slows down both groups.
This “wrong side” movement can be seen in Fig. 22. This “wrong side” movement is
much more frequent in the 5° rotation scenario than in the 10° rotation scenario, as there
is a larger opening in the 5° rotation scenario than in the 10° rotation scenario, which
leads to a better flow in the 10° rotation scenario than in the 5° rotation scenario.

Fig. 20. The rotated barricade leads to a separation of the two crowds.

The next two sets of simulations were where the effect of rotating the barrier on
crowd flow was measured for crowds that were spawned in groups of 75 and 100, the
graphical results of which can be seen in Fig. 16 and Fig. 17, respectively. In both high-
density scenarios, the worst flow comes when there is no barricade, which is explained
from the clustering of the groups as they meet in the middle of the street. The best flow

Fig. 21. The crowds cross each other as they emerge from the rotated barricade.

Fig. 22. Some of the crowd agents are forced to the side of the barricade with more oncoming
crowd agents.

in both scenarios came from the case where the barricade had no rotation, and the rate
of flow was lessened as the rotation of the barricade increased. This negative impact of
rotating the barricade is more pronounced in the highest density simulations, where the
crowds were spawned in groups of 100, while it is less pronounced in the simulations
where the crowds were spawned in groups of 75. The reason for this negative impact can
be seen in the bottlenecking that occurs due to the rotation, which is visible in Fig. 23.
As the barricade is rotated more, the high-density crowds end up at a bottleneck as they
try to leave the part of the street with the barricade, which causes the crowd to cluster
tightly. Since the crowd cannot move past one another, they have to slow down and wait
for those before them to leave before they can exit, which causes the whole group to
slow down and the flow of the entire group to slow down.
While the more rotated barricade leads to better group flow in the low-density sce-
nario, the straight barricade performed the best in the high-density scenarios. That leads
to the recommendation of the use of rotated barricades when a low-density group is
expected, and using a straight barricade when a high-density group is expected.

Fig. 23. The high-density crowds are forced into a bottleneck due to the rotated barricade.

7 Conclusion and Future Plan


This paper presented two sets of experiments, one that looked at the effects of the length
of a straight barricade on the movement of crowds through a street, while the other
looked at the effects of rotating a constant-length barricade. When looking at the length
of the barricade and their effects on the flow of the crowd, the longer barricades always
out-performed the shorter barricades, no matter if it was a high-density scenario or a low-
density scenario. Thus, a longer barricade is always recommended when managing crowd
flows through a street. In the case of rotating the barricade, more rotation worked better
in low-density scenarios, while minimal rotation works best in high-density scenarios.
Thus, it is recommended that the rotation of the barricade be based on the density
of the crowd expected. In the future, our research group plans on looking at different
configurations of physical barricades. By combining different shapes throughout the
street, different effects on the crowd flow may occur. Not only this, but we plan on looking
at different configurations of physical barricades in different intersections. We plan to
look at T-intersections and 4-way intersections, and see if there are certain configurations
of physical barricades that can facilitate crowd flow through those intersections. We are
also interested in investigating effects of dynamic barricades (human cordons) on crowd
management.

Acknowledgment. The authors would like to thank the Department of Mathematical Sciences of


Trinity Western University for their generous support. The authors would also like to express their
gratitude for invaluable insights and feedback from their collaborators.

References
1. Blanke, U., Tröster, G., Franke, T., Lukowicz, P.: Capturing crowd dynamics at large scale
events using participatory GPS-localization. In: 2014 IEEE Ninth International Conference
on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), pp. 1–7. IEEE
(2014)
2. Farooq, M.U., Saad, M.N.B., Malik, A.S., Salih Ali, Y., Khan, S.D.: Motion estimation of
high density crowd using fluid dynamics. Imaging Sci. J. 68(3), 141–155 (2020)

3. Granovetter, M.: Threshold models of collective behavior. Am. J. Sociol. 83, 1420–1443
(1978)
4. Guo, N., Hao, Q.Y., Jiang, R., Hu, M.B., Jia, B.: Uni-and bi-directional pedestrian flow in the
view-limited condition: experiments and modeling. Transp. Res. Part C Emerg. Technol. 71,
63–85 (2016)
5. Helbing, D.: A fluid dynamic model for the movement of pedestrians. arXiv preprint cond-
mat/9805213 (1998)
6. Helbing, D., Buzna, L., Johansson, A., Werner, T.: Self-organized pedestrian crowd dynamics:
experiments, simulations, and design solutions. Transp. Sci. 39(1), 1–24 (2005)
7. Helbing, D., Farkas, I.J., Molnar, P., Vicsek, T.: Simulation of pedestrian crowds in normal
and evacuation situations. Pedestr. Evacuation Dyn. 21(2), 21–58 (2002)
8. Helbing, D., Molnar, P.: Social force model for pedestrian dynamics. Phys. Rev. E 51(5),
4282 (1995)
9. Henderson, L.F.: On the fluid mechanics of human crowd motion. Transp. Res. 8(6), 509–515
(1974)
10. Hoogendoorn, S., Daamen, W.: Self-organization in pedestrian flow. In: Hoogendoorn, S.P.,
Luding, S., Bovy, P.H.L., Schreckenberg, M., Wolf, D.E. (eds.) Traffic and Granular Flow’03,
pp. 373–382. Springer, Heidelberg (2005). https://doi.org/10.1007/3-540-28091-X_36
11. Hughes, R.L.: The flow of human crowds. Annu. Rev. Fluid Mech. 35(1), 169–182 (2003)
12. Jager, W., Popping, R., Van de Sande, H.: Clustering and fighting in two-party crowds:
Simulating the approach-avoidance conflict. J. Artif. Soc. Soc. Simul. 4(3), 1–18 (2001)
13. Johansson, A., Kretz, T.: Applied pedestrian modeling. In: Heppenstall, A., Crooks, A., See,
L., Batty, M. (eds.) Agent-Based Models of Geographical Systems, pp. 451–462. Springer,
Dordrecht (2012). https://doi.org/10.1007/978-90-481-8927-4_21
14. Le Bon, G.: The Crowd: A Study of the Popular Mind. Fischer, London (1897)
15. Macal, C.M., North, M.J.: Agent-based modeling and simulation. In: Proceedings of the 2009
Winter Simulation Conference (WSC), pp. 86–98. IEEE (2009)
16. Manocha, D., Lin, M.C.: Interactive large-scale crowd simulation. In: Arisona, S.M.,
Aschwanden, G., Halatsch, J., Wonka, P. (eds.) Digital Urban Modeling and Simulation.
CCIS, vol. 242, pp. 221–235. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-
642-29758-8_12
17. Park, A.J., Ficocelli, R., Patterson, L.D., Spicer, V., Dodich, F., Tsang, H.H.: Modelling
crowd dynamics and crowd management strategies. In: 2021 IEEE 12th Annual Information
Technology, Electronics and Mobile Communication Conference (IEMCON), pp. 0627–0632.
IEEE (2021)
18. Peng, Y.C., Chou, C.I.: Simulation of pedestrian flow through a “t” intersection: a multi-floor
field cellular automata approach. Comput. Phys. Commun. 182(1), 205–208 (2011)
19. Rozo, K.R., Arellana, J., Santander-Mercado, A., Jubiz-Diaz, M.: Modelling building emer-
gency evacuation plans considering the dynamic behaviour of pedestrians using agent-based
simulation. Saf. Sci. 113, 276–284 (2019)
20. Severiukhina, O., Voloshin, D., Lees, M.H., Karbovskii, V.: The study of the influence of
obstacles on crowd dynamics. Procedia Comput. Sci. 108, 215–224 (2017)
21. Wagner, N., Agrawal, V.: An agent-based simulation system for concert venue crowd evac-
uation modeling in the presence of a fire disaster. Expert Syst. Appl. 41(6), 2807–2815
(2014)
22. Wolfram, S.: Statistical mechanics of cellular automata. Rev. Mod. Phys. 55(3), 601 (1983)
23. Yue, H., Guan, H., Zhang, J., Shao, C.: Study on bi-direction pedestrian flow using cellular
automata simulation. Phys. A 389(3), 527–539 (2010)
24. Zacharias, J.: Pedestrian dynamics on narrow pavements in high-density Hong Kong. J. Urban
Manag. 10(4), 409–418 (2021)
The Classical Logic and the Continuous
Logic

Xiaolin Li(B)

Department of Mathematics and Computer Science, Alabama State University,


Montgomery, AL 36104, USA
[email protected]

Abstract. The truth value of a proposition in classical logic is either 0


or 1, where 0 stands for falsity and 1 stands for truth. In the real world,
however, there exist many propositions with variable answers that are
neither false nor true. In this paper, we present the continuous logic
with the truth value of a proposition falling into the continuous range
[0, 1], where 0 stands for complete falsity and 1 for complete truth. To
compare the continuous logic with the classical logic, firstly, we define
three primitive logic operators not, and, or, and several compound logic
operators not-and, not-or, exclusive-or, not-exclusive-or, and implication
from [0, 1]n to [0, 1], where n is an integer and n ≥ 1. Secondly, we
discuss various laws and inference rules in both the classical logic and
the continuous logic. We show that the continuous logic is consistent
with the classic logic, and that the classical logic is simply a special case
of the continuous logic.

Keywords: Classical logic · Continuous logic · Logic operator · Logic


inference

1 Introduction
The classical logic is bivalent. It only permits propositions having a value of truth
or falsity and there is no other value between. Consequently, given a proposition
P , P or not P is always true. In the real world, however, there exist certain
propositions with variable answers, such as asking various people to identify a
color. The notion of truth doesn't fall by the wayside; rather, a means of
representing and reasoning over partial knowledge is afforded by aggregating all
possible outcomes into a dimensional spectrum. This leads to the so-called
many-valued logic. Many-valued logic is a propositional calculus in which there are
more than two truth values [1–5,8,9,20].
The first known classical logician who didn’t fully accept the law of excluded
middle was Aristotle, who is also generally considered to be the first classical
logician and the “father of logic” [5]. However, Aristotle didn’t create a system
of multi-valued logic to explain this isolated remark. In 1920, the Polish logician
and philosopher Jan Lukasiewicz began to create systems of many-valued logic
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 511–525, 2023.
https://doi.org/10.1007/978-3-031-18461-1_33
512 X. Li

[2], where he introduced a “possible” value in addition to false and true values.
In 1967, Kleene introduced a three-valued logic where an “unknown” value is
used in logic inference [8].
In 1965, Lotfi A. Zadeh introduced the fuzzy set theory and fuzzy logic [18],
[19]. Fuzzy logic is a form of many-valued logic. It deals with reasoning that is
approximate rather than fixed and exact. Compared to the classical logic where
variables may take on true or false values, fuzzy logic variables may have a truth
value that ranges in degree between 0 and 1. Fuzzy logic has been extended to
handle the concept of partial truth, where the truth value may range between
complete truth and complete falsity. Since then, fuzzy logic has been studied
by many researchers and applied to many fields, from applied mathematics to
control theory and artificial intelligence [6,7,10–17,21].
In this paper, we present the concepts of continuous logic. Similar to the
fuzzy logic, the truth value of a variable in the continuous logic is within the
closed interval [0, 1], where 0 stands for complete falsity and 1 for complete
truth. We define primitive logic operators and compound logic operators that
map variables or propositions from a domain [0, 1]n to the range [0, 1], where n
is an integer and n ≥ 1. We also present various laws and rules in logic inference.
We show that the classical logic is simply a special case of this continuous logic.
The remainder of this paper is divided into five sections. Section 2 reviews the
classical logic. Section 3 defines the continuous logic. Section 4 presents inference
rules in both the classical and the continuous logic. Section 5 discusses the con-
sistency between the classical logic and the continuous logic. And finally, Sect. 6
concludes this paper.

2 The Classical Logic


Let us consider the classical logic. Denote B = {0, 1} as the truth set, where 0
stands for falsity and 1 stands for truth, we have three primitive operators such
that
Logic not ′ : B −→ B.
Logic and · : B 2 −→ B.
Logic or + : B 2 −→ B.

Table 1. Classical Not, And, Or

x y x′ y′ xy x + y
0 0 1 1 0 0
0 1 1 0 0 1
1 0 0 1 0 1
1 1 0 0 1 1

The truth values of the primitive logic operators are defined by Table 1,
where x, y ∈ B and (x, y) ∈ B 2 . Notice that not is a unary operator. It is also
referred to as negation or complement. On the other hand, operators And and Or


are binary operators. They are also referred to as conjunction and disjunction,
respectively. We call them primitive operators because they cannot be expressed
by each other or expressed by any other operators. For simplicity, we drop the ·
from the notation such that we write x · y as xy without confusion.
Based on the primitive operators above, we can define the following com-
pound logic operators:
Not-and (·)′ : B 2 −→ B.
Not-or (+)′ : B 2 −→ B.
Exclusive-or ⊕ : B 2 −→ B.
Not-exclusive-or (⊕)′ : B 2 −→ B.
Implication → : B 2 −→ B.

Table 2. Compound operators

x y (xy)′ (x + y)′ x ⊕ y (x ⊕ y)′ x → y


0 0 1 1 0 1 1
0 1 1 0 1 0 1
1 0 1 0 1 0 0
1 1 0 0 0 1 1

Table 2 defines the above operators. We say they are compound operators
because they can be expressed by the primitive operators:

(xy)′ = x′ + y′
(x + y)′ = x′ y′
x ⊕ y = xy′ + x′ y
(x ⊕ y)′ = x′ y′ + xy
x → y = x′ + y
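These identities can be checked mechanically by enumerating all of B². The short Python sketch below is our illustration, not part of the paper; NOT, AND, and OR stand in for the paper's ′, ·, and + operators.

```python
# Verify the compound-operator identities over every pair in B^2.
def NOT(x):
    return 1 - x

def AND(x, y):
    return min(x, y)

def OR(x, y):
    return max(x, y)

for x in (0, 1):
    for y in (0, 1):
        assert NOT(AND(x, y)) == OR(NOT(x), NOT(y))   # (xy)' = x' + y'
        assert NOT(OR(x, y)) == AND(NOT(x), NOT(y))   # (x + y)' = x'y'
        xor = OR(AND(x, NOT(y)), AND(NOT(x), y))      # x xor y = xy' + x'y
        assert xor == (1 if x != y else 0)
        implies = OR(NOT(x), y)                       # x -> y = x' + y
        assert implies == (0 if (x, y) == (1, 0) else 1)
```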

Notice that all the above operators are commutative except for the implica-
tion →, since operators · and + are commutative. It is obvious that the domain
of operators ·, +, (·)′ and (+)′ can be extended to B n , where n > 1.
We have the following laws in classical logic:

1. Double negation

x′′ = x (1)

2. Annihilation
x0 = 0
(2)
x+1=1

3. Identity

x+0=x
(3)
x1 = x

4. Idempotence

xx = x
(4)
x+x=x

5. Commutativity

x+y =y+x
(5)
xy = yx

6. Associativity

x + (y + z) = (x + y) + z
(6)
x(yz) = (xy)z

7. Absorption

x(x + y) = x
(7)
x + xy = x

8. Complement

xx′ = 0
(8)
x + x′ = 1

9. Distributivity

x(y + z) = xy + xz
(9)
x + yz = (x + y)(x + z)

10. De Morgan Law

(x + y)′ = x′ y′
(10)
(xy)′ = x′ + y′

The above laws can be easily proved using truth tables.
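As noted, each law is proved by exhausting the truth table. The script below (our illustration) carries out that method for a few representative laws; extending it to the remaining laws is mechanical.

```python
# Truth-table proof of selected classical laws by exhaustive enumeration.
def NOT(x):
    return 1 - x

def AND(x, y):
    return min(x, y)

def OR(x, y):
    return max(x, y)

B = (0, 1)
for x in B:
    assert NOT(NOT(x)) == x                    # double negation
    assert AND(x, NOT(x)) == 0                 # complement
    assert OR(x, NOT(x)) == 1
    for y in B:
        assert AND(x, OR(x, y)) == x           # absorption
        assert OR(x, AND(x, y)) == x
        for z in B:
            # distributivity, both forms
            assert AND(x, OR(y, z)) == OR(AND(x, y), AND(x, z))
            assert OR(x, AND(y, z)) == AND(OR(x, y), OR(x, z))
```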

3 The Continuous Logic

Let us consider the closed continuous truth interval C = [0, 1], where 0 stands
for complete falsity and 1 for complete truth. Similar to the case of the classical
logic, we can define three primitive operators from a domain C n to the range C,
where n ≥ 1:
Definition 1. Logic not
′ : C −→ C, i.e., ∀x ∈ C,

x′ = 1 − x (11)

Definition 2. Logic and


· : C 2 −→ C, i.e., ∀(x, y) ∈ C 2 ,

x · y = min{x, y} (12)

Logic and makes a logic conjunction. We will write x · y as xy for simplicity.

Definition 3. Logic or
+ : C 2 −→ C, i.e., ∀(x, y) ∈ C 2 ,

x + y = max{x, y} (13)

Logic or makes a logic disjunction.


It is obvious that the domain of both logic and and logic or can also be
extended onto C n , where n > 2. In such cases, it follows that ∀(x1 , x2 , · · · , xn ) ∈
C n , we have

x1 x2 · · · xn = min{x1 , x2 , · · · , xn }
(14)
x1 + x2 + · · · + xn = max{x1 , x2 , · · · , xn }
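Definitions 1–3 and the n-ary extension in Eq. (14) translate directly into code; a minimal illustrative sketch (the function names are our own, not from the paper):

```python
def c_not(x):
    return 1.0 - x      # Definition 1: x' = 1 - x

def c_and(*xs):
    return min(xs)      # Eq. (14): x1 x2 ... xn = min{x1, ..., xn}

def c_or(*xs):
    return max(xs)      # Eq. (14): x1 + x2 + ... + xn = max{x1, ..., xn}

print(c_and(0.2, 0.7, 0.5))   # 0.2
print(c_or(0.2, 0.7, 0.5))    # 0.7
print(c_not(0.2))             # logic not of 0.2
```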

Based on the above primitive operators, we can define the following com-
pound operators:

Definition 4. Not-and
(·)' : C^2 −→ C, i.e., ∀(x, y) ∈ C^2,

(xy)' = 1 − min{x, y} (15)

Definition 5. Not-or
(+)' : C^2 −→ C, i.e., ∀(x, y) ∈ C^2,

(x + y)' = 1 − max{x, y} (16)

Definition 6. Exclusive-or
⊕ : C^2 −→ C, i.e., ∀(x, y) ∈ C^2,

x ⊕ y = xy' + x'y
(17)
= max{min(x, 1 − y), min(1 − x, y)}

Definition 7. Not-exclusive-or
(⊕)' : C^2 −→ C, i.e., ∀(x, y) ∈ C^2,

(x ⊕ y)' = (x'y + xy')'
= x'y' + xy (18)
= max{min(1 − x, 1 − y), min(x, y)}

The proof of the above equation is given in Appendix A.

Definition 8. Implication
→ : C^2 −→ C, i.e., ∀(x, y) ∈ C^2,

x → y = x' + y
(19)
= max{1 − x, y}

It is obvious that all the above operators are commutative except for the
implication →. And the operators (·)' and (+)' can also be extended onto the
domain C^n, where n > 2.
The continuous logic keeps all laws in the classical logic:

1. Double negation

x'' = x (20)

2. Annihilation
x0 = 0
(21)
x+1=1

3. Identity

x+0=x
(22)
x1 = x

4. Idempotence

xx = x
(23)
x+x=x

5. Commutativity

x+y =y+x
(24)
xy = yx

6. Associativity

x + (y + z) = (x + y) + z
(25)
x(yz) = (xy)z

7. Absorption

x(x + y) = x
(26)
x + xy = x

8. Complement

xx' = min(x, 1 − x)
(27)
x + x' = max(x, 1 − x)

9. Distributivity

x(y + z) = xy + xz
(28)
x + yz = (x + y)(x + z)

10. De Morgan Law

(x + y)' = x'y'
(29)
(xy)' = x' + y'

Laws 1 to 8 are self-evident. The proofs of the distributivity law and the De Morgan law are given in Appendix B and Appendix C.
It is obvious that the De Morgan law can be extended onto C^n such that
∀(x1, x2, · · · , xn) ∈ C^n,

(x1 + x2 + · · · + xn)' = x1' x2' · · · xn'
(30)
(x1 x2 · · · xn)' = x1' + x2' + · · · + xn'
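Since min and max always return one of their operands, laws such as absorption, distributivity and De Morgan can be spot-checked exactly on random samples from C; an illustrative check (our own sketch, not a proof):

```python
import random

random.seed(1)
for _ in range(1000):
    x, y, z = (random.random() for _ in range(3))
    # Absorption (26): x(x + y) = x and x + xy = x
    assert min(x, max(x, y)) == x and max(x, min(x, y)) == x
    # Distributivity (28): x(y + z) = xy + xz
    assert min(x, max(y, z)) == max(min(x, y), min(x, z))
    # De Morgan (29): (x + y)' = x'y'
    assert 1 - max(x, y) == min(1 - x, 1 - y)
    # Complement (27) no longer vanishes: xx' = min(x, 1 - x) lies in [0, 0.5]
    assert 0 <= min(x, 1 - x) <= 0.5
print("all sampled laws hold")
```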

4 Logic Inference

Let us consider logic inference with the classical logic and the continuous logic.

4.1 Classical Inference

We have the following inference rules in classical logic:

1. Modus Ponens
∀P, Q ∈ B, we have

P(P → Q) = P(P' + Q) (31)

When P = 1, it follows that

P (P → Q) = Q

2. Modus Tollens
∀P, Q ∈ B, we have

P → Q = Q' → P' (32)

This is simply because

P → Q = P' + Q
= Q + P'
= Q'' + P'
= Q' → P'
518 X. Li

3. Disjunctive Syllogism
∀P, Q ∈ B, we have
P'(P + Q) = P'P + P'Q (33)
When P = 0, it follows that
P'(P + Q) = Q
4. Hypothetical Syllogism
∀P, Q, R ∈ B, we have
P(P → Q)(Q → R) = P(P' + Q)(Q' + R)
(34)
= PQR
When P = Q = 1, it follows that
P (P → Q)(Q → R) = R
5. Logic Equivalence
∀P, Q ∈ B, we have
P ↔ Q = (P → Q)(Q → P)
= (P' + Q)(Q' + P) (35)
= P'Q' + PQ
It follows that

P ↔ Q = 0 when P ≠ Q, and P ↔ Q = 1 when P = Q
6. Constructive Dilemma
∀P, Q, R, S ∈ B, we have
(P → Q)(R → S)(P + R) = (P' + Q)(R' + S)(P + R) (36)
When P = R = 1, it follows that
(P → Q)(R → S)(P + R) = QS
7. Destructive Dilemma
∀P, Q, R, S ∈ B, we have
(P → Q)(R → S)(Q' + S') = (P' + Q)(R' + S)(Q' + S') (37)
When P = R = 1, it follows that
(P → Q)(R → S)(Q' + S') = 0
8. Bidirectional Dilemma
∀P, Q, R, S ∈ B, we have
(P → Q)(R → S)(P + S') = (P' + Q)(R' + S)(P + S') (38)
When P = R = 1, it follows that
(P → Q)(R → S)(P + S') = QS
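The classical rules above can be verified exhaustively over B; an illustrative Python sketch (operator names are our own):

```python
from itertools import product

NOT = lambda p: 1 - p
AND = lambda p, q: p & q
OR = lambda p, q: p | q
IMP = lambda p, q: OR(NOT(p), q)   # p → q = p' + q

for P, Q, R, S in product((0, 1), repeat=4):
    # Modus tollens (32) holds for all P, Q: P → Q = Q' → P'
    assert IMP(P, Q) == IMP(NOT(Q), NOT(P))
    if P == 1:
        # Modus ponens: P(P → Q) = Q when P = 1
        assert AND(P, IMP(P, Q)) == Q
    if P == 1 and Q == 1:
        # Hypothetical syllogism: P(P → Q)(Q → R) = R
        assert AND(AND(P, IMP(P, Q)), IMP(Q, R)) == R
    if P == 1 and R == 1:
        # Constructive dilemma: (P → Q)(R → S)(P + R) = QS
        assert AND(AND(IMP(P, Q), IMP(R, S)), OR(P, R)) == AND(Q, S)
print("classical inference rules verified over B")
```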

4.2 Continuous Inference


All classical inference rules can be extended into the continuous inference rules:

1. Modus Ponens
∀P, Q ∈ C, we have

P(P → Q) = P(P' + Q)
= PP' + PQ (39)
= max{min(P, 1 − P), min(P, Q)}

When P = 1, it follows that

P (P → Q) = Q

2. Modus Tollens
∀P, Q ∈ C, we have

P → Q = Q' → P' (40)

because

P → Q = P' + Q
= Q + P'
= Q'' + P'
= Q' → P'

3. Disjunctive Syllogism
∀P, Q ∈ C, we have

P'(P + Q) = P'P + P'Q
(41)
= max{min(1 − P, P ), min(1 − P, Q)}

When P = 0, it follows that

P'(P + Q) = Q

4. Hypothetical Syllogism
∀P, Q, R ∈ C, we have

P(P → Q)(Q → R) = P(P' + Q)(Q' + R)
= PP'Q' + PP'R + PQQ' + PQR
(42)
= max{min(P, 1 − P, 1 − Q), min(P, 1 − P, R),
min(P, Q, 1 − Q), min(P, Q, R)}

When P = Q = 1, it follows that

P (P → Q)(Q → R) = R
520 X. Li

5. Logic Equivalence
∀P, Q ∈ C, we have
P ↔ Q = (P → Q)(Q → P)
= (P' + Q)(Q' + P)
= P'Q' + P'P + QQ' + PQ (43)
= P'Q' + PQ
= max{min(1 − P, 1 − Q), min(P, Q)}
The above equation can be proved in the way similar to the proof of Eq. (18),
see Appendix A for details. It follows that

P ↔ Q = 0 when P = 1, Q = 0 or P = 0, Q = 1, and P ↔ Q = 1 when P = Q = 1 or P = Q = 0.
In all other cases, 0 < P ↔ Q < 1.
6. Constructive Dilemma
∀P, Q, R, S ∈ C, we have
(P → Q)(R → S)(P + R) = (P' + Q)(R' + S)(P + R)
= min{max(1 − P, Q), max(1 − R, S), (44)
max(P, R)}
When P = R = 1, it follows that
(P → Q)(R → S)(P + R) = QS
7. Destructive Dilemma
∀P, Q, R, S ∈ C, we have
(P → Q)(R → S)(Q' + S') = (P' + Q)(R' + S)(Q' + S')
= min{max(1 − P, Q), max(1 − R, S), (45)
max(1 − Q, 1 − S)}
When P = R = 1, it follows that
(P → Q)(R → S)(Q' + S') = min{Q, S, max(1 − Q, 1 − S)}
= QS(Q' + S')
Furthermore, when Q, S ∈ B, i.e., Q = 0 or S = 0 or Q = S = 1, it follows
that
(P → Q)(R → S)(Q' + S') = 0
8. Bidirectional Dilemma
∀P, Q, R, S ∈ C, we have
(P → Q)(R → S)(P + S') = (P' + Q)(R' + S)(P + S')
= min{max(1 − P, Q), max(1 − R, S), (46)
max(P, 1 − S)}
When P = R = 1, it follows that
(P → Q)(R → S)(P + S') = QS
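The continuous rules degrade gracefully for intermediate truth values; a small illustrative sketch of continuous Modus Ponens and Logic Equivalence (function names are our own):

```python
c_imp = lambda p, q: max(1.0 - p, q)   # Eq. (19): P → Q = max{1 - P, Q}

def modus_ponens(p, q):
    # Eq. (39): P(P → Q) = max{min(P, 1 - P), min(P, Q)}
    return min(p, c_imp(p, q))

def equiv(p, q):
    # Eq. (43): P ↔ Q = max{min(1 - P, 1 - Q), min(P, Q)}
    return max(min(1.0 - p, 1.0 - q), min(p, q))

print(modus_ponens(1.0, 0.7))   # 0.7: a fully true premise recovers Q
print(modus_ponens(0.3, 0.7))   # 0.3: a weak premise caps the conclusion
print(equiv(1.0, 0.0))          # 0.0
print(equiv(0.6, 0.6))          # 0.6: strictly between 0 and 1
```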

5 Consistency
Let us check the consistency between the classical logic and the continuous logic.
Firstly, consider the three primitive operators ', ·, and + in the continuous logic.
∀x, y ∈ C, we have
x' = 1 − x
xy = min{x, y}
x + y = max{x, y}
When x, y ∈ B, they become the classical ', ·, and +, and we get Table 1.
Secondly, consider the compound logic operators (·)', (+)', ⊕, (⊕)', and →.
∀x, y ∈ C, we have
(xy)' = 1 − min(x, y)
(x + y)' = 1 − max(x, y)
x ⊕ y = max{min(x, 1 − y), min(1 − x, y)}
(x ⊕ y)' = max{min(1 − x, 1 − y), min(x, y)}
x → y = max(1 − x, y)
When x, y ∈ B, they become the classical logic operators (·)', (+)', ⊕, (⊕)',
and →, and we get Table 2.
Thirdly, consider the laws in both the classical logic and the continuous
logic. All these laws are in the same forms except the complement law. In the
continuous logic, ∀x ∈ C, we have
xx' = min{x, 1 − x}
x + x' = max{x, 1 − x}
When x ∈ B, it leads to the classical complement law:
xx' = 0
x + x' = 1
Now consider the inference rules in both the classical logic and the continuous
logic. When the involved propositions are in B, each of the inference rules in the
continuous logic leads to a corresponding rule in the classical logic. For example,
consider the equivalence rule in the continuous logic:
P ↔ Q = (P → Q)(Q → P)
= (P' + Q)(Q' + P)
= P'Q' + P'P + QQ' + PQ
= P'Q' + PQ
= max{min(1 − P, 1 − Q), min(P, Q)}
When P, Q ∈ B, it leads to the equivalence rule in the classical logic, and we
get Table 3.
From the above discussion, one can see that the continuous logic is consistent
with the classical logic. As a matter of fact, since B ⊂ C, the classical logic is
simply a special case of this continuous logic.

Table 3. Classical equivalence

P Q P' Q' P → Q Q → P P ↔ Q
0 0 1 1 1 1 1
0 1 1 0 1 0 0
1 0 0 1 0 1 0
1 1 0 0 1 1 1

6 Conclusion
In this paper, we have presented the continuous logic. The truth values of variables and propositions in this continuous logic are within the closed interval C = [0, 1], where 0 stands for complete falsity and 1 for complete truth. We have defined three primitive logic operators: 1) Logic not ' : C −→ C, 2) Logic and · : C^2 −→ C, and 3) Logic or + : C^2 −→ C. Based on these primitive operators, we have derived some compound operators: 4) Not-and (·)' : C^2 −→ C, 5) Not-or (+)' : C^2 −→ C, 6) Exclusive-or ⊕ : C^2 −→ C, 7) Not-exclusive-or (⊕)' : C^2 −→ C, and 8) Implication → : C^2 −→ C. Many of the above operators can be extended to operators C^n −→ C, where n is an integer and n > 2.
In addition to the above operators, we have presented some laws and inference rules in this continuous logic. The laws include 1) Double negation, 2) Annihilation, 3) Identity, 4) Idempotence, 5) Commutativity, 6) Associativity, 7) Absorption, 8) Complement, 9) Distributivity, and 10) the De Morgan law. The inference rules include 1) Modus Ponens, 2) Modus Tollens, 3) Disjunctive Syllogism, 4) Hypothetical Syllogism, 5) Logic Equivalence, 6) Constructive Dilemma, 7) Destructive Dilemma, and 8) Bidirectional Dilemma.
Furthermore, we have also checked the consistency between the classical logic and the continuous logic. We have shown that the classical logic is simply a special case of this continuous logic because the truth value set of the classical logic is a subset of the truth value set of the continuous logic, i.e., B = {0, 1} ⊂ C = [0, 1].

Appendix A
Proof of Not-exclusive-or Equation
(x ⊕ y)' = (x'y + xy')'
= (x'y)'(xy')'
= (x + y')(x' + y)
= xx' + xy + y'x' + y'y
1. Case of x ≥ y
In this case, we have
xy = y
x'y' = x'

It follows that
(x ⊕ y)' = xx' + y + x' + y'y
= (x + 1)x' + y(1 + y')
= x' + y
= x'y' + xy
2. Case of x < y
In this case, we have
xy = x
x'y' = y'
It follows that
(x ⊕ y)' = xx' + x + y' + y'y
= x(x' + 1) + y'(1 + y)
= x + y'
= xy + x'y'
Therefore, in all cases, we have
(x ⊕ y)' = x'y' + xy
It is equivalent to say that ∀x, y ∈ C,
max{min(x, 1 − x), min(x, y), min(1 − x, 1 − y), min(y, 1 − y)}
= max{min(x, y), min(1 − x, 1 − y)}

Appendix B
Proof of the Distributivity Law
1. x(y + z) = xy + xz
(a) Case of x ≥ y
if y ≥ z, we have
x(y + z) = xy = y
xy + xz = y + xz = y + z = y
otherwise, we have
x(y + z) = xz
xy + xz = y + xz = xz.

(b) Case of x < y
if y ≥ z, we have
x(y + z) = xy = x
xy + xz = x + xz = x
Otherwise, we have
x(y + z) = xz = x
xy + xz = x + x = x
In all cases, we have x(y + z) = xy + xz.
2. x + yz = (x + y)(x + z)
This equation can be proved in a way similar to the proof of x(y+z) = xy+xz.
524 X. Li

Appendix C
Proof of the De Morgan Law

1. (x + y)' = x'y'
(a) Case of x ≥ y
In this case, we have x' ≤ y', and hence
(x + y)' = x' = x'y'
(b) Case of x < y
In this case, we have x' > y', and hence
(x + y)' = y' = x'y'
In all cases, we have (x + y)' = x'y'.
2. (xy)' = x' + y'
This equation can be proved in a way similar to the proof of (x + y)' = x'y'.

Research on Diverse Feature Fusion Network
Based on Video Action Recognition

Chen Bin(B) and Wang Yonggang

College of Tourism and Culture, Yunnan University, Lijiang 674199, Yunnan Province, China
[email protected]

Abstract. The wide application of video action recognition in various fields makes it a hot research topic. This paper proposes a new three-stream fusion network that integrates diverse features from multiple layers of the spatial and temporal streams with a diverse and compact bilinear fusion module, followed by a channel-spatial attention module after the feature fusion stage. Experiments on the UCF101 and HMDB51 datasets show that, compared with the latest models, the network achieves the best performance.

Keywords: Video action recognition · Multi-feature · Space · Fusion network

1 Introduction
Action recognition is widely used in video surveillance, human-computer interaction, health care and other fields, and is regarded as one of the most attractive topics in the computer vision field. Hand-crafted features such as the local binary pattern [1] and the histogram of oriented gradients extract spatial or temporal information from action video data. They are extensively adopted for videos of simple scenes, but they can hardly be extended to scenes with complex backgrounds due to various noises. As deep network architectures [2] emerged, deep learning features made great strides. It is important to note, however, that unlike static image recognition, video action recognition needs more than spatial information alone.
The dual-stream convolution network [3] not only considers the optical flow, but also provides a fusion method with fixed score weights. The network benefits from pre-training on common still image datasets: the spatial stream extracts spatial features from RGB frames, and the temporal stream extracts temporal features from optical flow images. The final prediction is determined by putting the prediction results of the two single streams into an averaging function or an SVM classifier. The temporal segment network (TSN) [4] improves the original dual-stream network through a new sparse sampling approach. Specifically, it divides the whole video into several clips of equal duration and then randomly selects a short snippet from each clip. The dual-stream convolution network produces a segment-level prediction for each snippet, and the final video-level prediction is the fusion of all segment-level predictions through a segmental consensus function. TSN overcomes the problem that dual-stream networks cannot generate temporal features with long-range dependence.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 526–537, 2023.
https://doi.org/10.1007/978-3-031-18461-1_34
Research on Diverse Feature Fusion Network 527

However, these methods still have some shortcomings: (i) the score-averaging fusion mechanism applied after the global average pooling layer leads to a loss of information; and (ii) the fusion stage uses only top-level features, which contain global information but lack local details.
Accordingly, a three-stream network structure for video action recognition is proposed. Two modules are designed in the fusion stream of the network. The first module uses a diverse and compact bilinear algorithm to integrate diverse features from the multi-layer base networks, and effectively balances feature interaction and computational cost. The feature transformation part of this module alleviates underfitting and further reduces the computational cost, while the second module refines the fused features with channel and spatial attention in parallel.

2 Related Work

2.1 Dual-Stream Network

Feichtenhofer et al. [5] proposed a method to fuse space-time information from the dual-stream network in the convolution layers, and tested several fusion methods including summation fusion, 2D/3D convolution fusion, bilinear fusion, etc. Unfortunately, replacing the temporal stream with a hybrid space-time network negatively affects the extraction of temporal features. TSN [4] makes effective use of the whole video in the time dimension and explores various data modes such as RGB, RGB difference, optical flow, warped optical flow, etc. Liu et al. [6] innovatively introduced an implicit dual-stream convolution network to capture the motion information between adjacent frames. Furthermore, some literature generalizes the dual-stream approach from 2D networks to 3D networks. However, 3D CNNs [7] for the recognition of human behavior do not meet expectations due to their rich parameters and high training data requirements.

2.2 Bilinear Features

Deep action recognition network architectures with stacked convolution layers, for instance Inception, ResNet and DenseNet [8], extract features in a bottom-up manner. The early layers of the network are characterized by local information, while the top-level features contain a large amount of global information. Inception extracts features from multi-scale receptive fields and aggregates them in each Inception module. The small-scale convolution kernels in the Inception module reduce the computing cost and the number of network parameters. ResNet connects input and output features through residual connections, which helps to train deeper network models. However, both only take advantage of top-level features. DenseNet densely connects the features of all preceding modules. DenseNet shows good performance on the recognition of still images, but its benefit is not obvious after merging the early features of the network into the action recognition task [5]. Moreover, the excess features involved in the convolution calculation are inefficient.
528 C. Bin and W. Yonggang

2.3 Attention Mechanism


Key information is not evenly distributed in the spatio-temporal dimensions of video data, which calls for a new method to locate the position and content of the key information. Given a set of value vectors and a query vector, the attention mechanism can compute a weighted sum of the values based on the query vector. Simply put, the attention mechanism can focus the "eyes" of the network on significant areas of the video or model long-range dependencies. In recent years, the attention mechanism has been thoroughly and deeply applied in the field of natural language processing (NLP) [9, 10]. Subakan et al. [11] provided an encoding-decoding method with a self-attention mechanism rather than a recursive or convolutional structure. The attention mechanism has also made some advances in computer vision, such as target detection, image or video classification, scene segmentation, etc.

3 Video Action Recognition-Based Diverse Feature Fusion Network


3.1 Proposal Overview
The scheme consists of three streams: a spatial stream, a temporal stream and a fusion stream. The spatial stream and the temporal stream take InceptionV3 as the backbone network, and InceptionV3 is pre-trained on ImageNet. The spatial stream takes a single RGB frame as input data, while the temporal stream takes the corresponding five optical flow images as input data. The purpose of these two streams is to extract spatial and temporal features respectively; the fusion stream then uses diverse features from multiple layers to obtain compact space-time features. Given a video V, an RGB image Xrgb and optical flow images Xflow are extracted from V. When the data of the two modes are put into the dual-stream network, the spatial stream can be expressed as:
   
(PS, FS1, FS2, FS3, · · · , FSL) = NetS(Xrgb) (1)

The time stream can be expressed as:

(PT, FT1, FT2, FT3, · · · , FTL) = NetT(Xflow) (2)

wherein, PS represents the spatial stream classification prediction, PT represents the temporal stream classification prediction, and FM^N represents a feature, where the superscript N ∈ {1, 2, …, L} indicates which layer the feature comes from and the subscript M ∈ {S, T} indicates the source stream. L represents the maximum number of layers from which features are taken, S represents the spatial stream and T represents the temporal stream, and NetS and NetT represent the spatial stream network and the temporal stream network respectively. The fusion stream makes use of the diverse features from the multi-layer dual-stream network, and the fusion stream classification prediction PST is obtained from the following equation:
 
PST = NetST(FS1, FT1, FS2, FT2, · · · , FSL, FTL) (3)

The three-stream fusion network can be expressed as:

DFFN(Xrgb, Xflow) = SF(PS, PT, PST) (4)

wherein, NetST represents the fusion stream network, and SF(·) represents the fusion function. The overall architecture of the proposed network is shown in Fig. 1. Diverse and compact bilinear fusion (DCBF) modules 1, 2 and 3 fuse pairs of spatial and temporal features into compact space-time features. DCBF module 4 further integrates these space-time features into diverse compact space-time features. The following CSA module refines the diverse compact space-time features with adaptive weights. The purpose of the fusion stream is to obtain information that complements the other two streams. The final classification prediction output is determined by all the prediction outputs of the three streams with a weighted average function.
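The composition in Eqs. (1)–(4) can be sketched at the score level; in the illustrative Python below, the class-score vectors are hypothetical stand-ins for the stream outputs, and the stream weights 0.5/2.0/1.0 are those reported in Sect. 4.2:

```python
def weighted_average(scores, weights):
    """Sketch of SF(.) in Eq. (4): weighted average of per-stream class scores."""
    n_cls = len(scores[0])
    total = sum(weights)
    return [sum(w * s[c] for s, w in zip(scores, weights)) / total
            for c in range(n_cls)]

# Hypothetical class-score vectors standing in for the stream predictions
p_s = [0.2, 0.7, 0.1]    # spatial stream prediction, Eq. (1)
p_t = [0.1, 0.8, 0.1]    # temporal stream prediction, Eq. (2)
p_st = [0.3, 0.6, 0.1]   # fusion stream prediction, Eq. (3)

# Stream weights from Sect. 4.2: space 0.5, time 2.0, fusion 1.0
final = weighted_average([p_s, p_t, p_st], [0.5, 2.0, 1.0])
print(max(range(len(final)), key=final.__getitem__))   # predicted class: 1
```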

3.2 Diverse and Compact Bilinear Fusion

In traditional fusion methods, such as element-wise summation, maximum and concatenation, the elements of the two features rarely interact, so high-dimensional information can hardly be captured. Bilinear fusion enables all elements of the features to interact with each other to extract more additional information. Specifically, with two features represented by X and Y respectively, bilinear fusion can be expressed as:

Z = X ⊗ Y (5)

wherein, ⊗ represents the outer product XY^T of X and Y at each pixel position, and Z represents the output feature of bilinear fusion. However, two problems, computational cost and a large amount of residual noise, challenge the application of bilinear fusion. Assuming that the dimension of both input features is 10^3, the dimension of the output fusion feature will be 10^6. Bilinear fusion of two features thus needs a large computational cost, while compact bilinear fusion takes both feature interaction and computational cost into account. Compact bilinear fusion is applied to fuse features from multiple layers of the dual-stream network; in this work, the module is named the diverse and compact bilinear fusion (DCBF) module. The DCBF module consists of either a single diverse compact bilinear pooling part or a diverse compact bilinear pooling part with an additional feature transformation part. These two types of fusion modules are shown in Fig. 2. Applying the Count Sketch projection function [12] to diverse compact bilinear pooling helps to reduce the data dimension and to find the more frequent items in the original input data. This function projects the high-dimensional input data to low-dimensional projected data.
υ ∈ R^m and ω ∈ R^n denote the input feature and the projected feature, and ω is first initialized to the zero vector. Two vectors s ∈ {−1, +1}^m and h ∈ {1, · · · , n}^m are sampled: s selects −1 or +1 for each element, and h samples each of its elements between 1 and n. Both follow uniform distributions and remain fixed. h maps each index i ∈ {1, · · · , m} of the input υ to an index t = h(i) of the output ω, and ω is updated by adding s(i) · υ(i) to ω(t). The Count Sketch projection function is then expressed as:

ω = ψ(υ, s, h) (6)

Fig. 1. Network architecture

Fig. 2. Diverse and compact bilinear fusion module

The following equation is obtained:

ψ(v1 ⊗ v2 , s, h) = ψ(v1 , s, h) ∗ ψ(v2 , s, h) (7)

wherein, υ1 and υ2 represent two input features. For convenience of calculation, the convolution in the time domain can be transformed into the element-wise product in the frequency domain by the Fourier transform and its inverse. According to the fast Fourier transform (FFT) [13], the output feature of compact bilinear fusion can be expressed as follows:

ψ(υ1 ⊗ υ2, s, h) = FFT^−1(FFT(ω1) ⊙ FFT(ω2)) (8)

wherein, ⊙ represents the element-wise product, and ω1 and ω2 represent the projected features ψ(υ1, s, h) and ψ(υ2, s, h).
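The Count Sketch projection and the convolution identity of Eq. (7) can be illustrated in a few lines of pure Python (dimensions m and n are hypothetical, indices are 0-based, and Eq. (8) merely computes the same circular convolution via the FFT):

```python
import random

def count_sketch(v, s, h, n):
    """psi(v, s, h) of Eq. (6): out[h[i]] += s[i] * v[i]."""
    out = [0.0] * n
    for i, vi in enumerate(v):
        out[h[i]] += s[i] * vi
    return out

def circular_conv(a, b):
    n = len(a)
    return [sum(a[i] * b[(t - i) % n] for i in range(n)) for t in range(n)]

random.seed(0)
m, n = 5, 4                      # hypothetical input/output dimensions
v1 = [random.random() for _ in range(m)]
v2 = [random.random() for _ in range(m)]
s1 = [random.choice([-1, 1]) for _ in range(m)]
s2 = [random.choice([-1, 1]) for _ in range(m)]
h1 = [random.randrange(n) for _ in range(m)]
h2 = [random.randrange(n) for _ in range(m)]

# Sketch of the outer product v1 (x) v2 with the derived sign and hash vectors
outer = [v1[i] * v2[j] for i in range(m) for j in range(m)]
s12 = [s1[i] * s2[j] for i in range(m) for j in range(m)]
h12 = [(h1[i] + h2[j]) % n for i in range(m) for j in range(m)]

lhs = count_sketch(outer, s12, h12, n)
rhs = circular_conv(count_sketch(v1, s1, h1, n), count_sketch(v2, s2, h2, n))
assert all(abs(a - b) < 1e-9 for a, b in zip(lhs, rhs))   # Eq. (7) holds
```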

3.3 Channel Spatial Attention Module

Recently, channel attention and spatial attention have been frequently used in the field of computer vision. The CSA module combines the two in parallel, and the two submodules work at the same time. The intra-channel weights form the weight matrix of channel attention, and the cross-channel weights form the weight matrix of multi-scale spatial attention. These two weights are calculated as:

Ca = δ(FCwt1(FCwt0(fA)) + FCwt1(FCwt0(fM)))

fA = AvgC(fin)

fM = MaxC(fin)

Sa = δ(Conv1×1(f2))

f2 = Cat(Conv3×3(f1); Conv5×5(f1); Conv7×7(f1))

f1 = Cat(AvgS(fin); MaxS(fin)) (9)

wherein, fin ∈ R^(C×H×W), Ca ∈ R^(C×1×1) and Sa ∈ R^(1×H×W) represent the input feature, the intra-channel weights and the cross-channel weights, respectively. δ represents the Sigmoid function. wt0 ∈ R^((C/r)×C) and wt1 ∈ R^(C×(C/r)) represent the weights of the fully connected layers FCwt0 and FCwt1, respectively, and r represents the reduction ratio. f2 and FCwt0 are each followed by a ReLU activation function. AvgC is an average pooling operation, which computes the mean value of a particular channel, and MaxC is the maximum pooling operation, which finds the maximum value in a particular channel. AvgS calculates the average at a specific location over all channels, and MaxS determines the maximum at a specific location over all channels. Conv is a convolution operation whose index denotes the convolution kernel size, and Cat represents the concatenation operation. Then the output feature can be obtained by: fout = fin × Ca × Sa.
In the overall architecture, residual connections are added to the CSA module to speed up the training process. The output feature of the CSA module with a residual connection can then be expressed as: fout = fin × Ca × Sa + fin.
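As an illustration of the channel branch of Eq. (9), the sketch below computes Ca from average- and max-pooled descriptors through a shared two-layer MLP; the spatial branch is omitted and all weights are random, so this is a shape-level sketch rather than the authors' implementation:

```python
import math
import random

def channel_attention(fin, w0, w1):
    """Channel weights Ca: sigmoid(MLP(avg-pooled fin) + MLP(max-pooled fin)).

    fin: C x H x W nested lists; w0: (C//r) x C; w1: C x (C//r).
    """
    C = len(fin)
    f_avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fin]
    f_max = [max(max(row) for row in ch) for ch in fin]

    def mlp(x):  # FCwt0 followed by ReLU, then FCwt1; shared by both descriptors
        hid = [max(0.0, sum(w0[j][c] * x[c] for c in range(C)))
               for j in range(len(w0))]
        return [sum(w1[c][j] * hid[j] for j in range(len(hid))) for c in range(C)]

    a, m = mlp(f_avg), mlp(f_max)
    return [1.0 / (1.0 + math.exp(-(a[c] + m[c]))) for c in range(C)]

random.seed(0)
C, H, W, r = 4, 3, 3, 2   # hypothetical shapes and reduction ratio
fin = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w0 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]
w1 = [[random.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]
ca = channel_attention(fin, w0, w1)
print([round(c, 3) for c in ca])   # one weight in (0, 1) per channel
```

The returned vector would scale fin channel-wise, matching the Ca ∈ R^(C×1×1) shape in the text.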

4 Experimental Analysis
4.1 Public Datasets
UCF101: a dataset containing 13320 short videos from YouTube video clips. It is an extension of the UCF50 dataset that preserves the real actions in the videos. It contains 101 classes and is a typical dataset for evaluating action recognition models.
HMDB51: a dataset that includes 6766 short videos from a variety of sources, such as movies, YouTube videos, and Google Video. It has 51 action classes, each containing at least 101 clips.
The task list of the dataset is set up before the experiment, as shown in Table 1. This work takes the videos of the databases as the research object, but mainly studies four simple human actions: running, walking, jumping and gestures, because their scenes are relatively simple; these four actions are detected and classified.

Table 1. Video set task list

Task category Description


Scenes The environment in which the video action takes place
Target The target is mainly the action of the characters in the video
Action Running, walking, jumping, gestures
Event Video sequence actions of people in video focus

Table 2. Performance comparison

Model UCF101 (%) HMDB51 (%)


TSN [4] 94.00 68.50
Convolutional dual-stream [5] 92.50 65.40
Hidden Dual Stream [6] 93.20 66.80
Proposed (thawed) 95.27 71.33
Proposed (frozen) 94.96 71.09

Table 3. Accuracy of each module

Model Frozen (%) Thawed (%)


Base-line 93.73 93.73
DCBF 94.44 94.75
Channel spatial attention 94.79 95.20

Table 4. Effects of different characteristics

Model Frozen (%) Thawed (%)


Base-line 93.73 93.73
Level-1 94.32 94.62
Level-2 94.52 94.96
Level-3 94.79 95.20

4.2 Experimental Construction

Firstly, the spatial stream network and the temporal stream network are trained respectively. RGB frames are extracted from the original videos by OpenCV, and optical flow images are extracted by the TVL1 optical flow algorithm implemented with CUDA in OpenCV. For dual-stream network training, the learning rate is initialized to 10^-3. For the UCF101 dataset, the learning rate of the spatial stream network is divided by 10 at the 40th, 80th and 120th epochs, and spatial stream training ends at the 150th epoch. For the temporal stream network, the learning rate is divided by 10 at the 170th, 280th and 320th epochs, and training ends at the 350th epoch. Some adjustments were made for the HMDB51 dataset because it is harder to train on. The learning rate changes at epochs 60, 120 and 160, and spatial stream training ends at the 200th epoch. For the temporal stream, the learning rate changes at epochs 170, 280 and 320, and training ends at the 400th epoch.
The fusion stream network training process is divided into two stages: a freezing stage and a thawing stage. In the freezing stage, all parameters of the dual-stream network are frozen and no gradient is back-propagated through it, in order to preserve the integrity of the feature extractor and accelerate fusion stream training. In the thawing stage, the parameters of the two streams are thawed and gradient back-propagation is carried out through the whole three-stream network. The batch size is set to 60 in the freezing stage and 15 in the thawing stage, and the dropout rate in both stages is 0.5. For the UCF101 dataset, the learning rate in the freezing stage is initialized to 10^-3 and divided by 10 at the 50th, 100th and 150th epochs, with a maximum of 200 epochs. The initial learning rate in the thawing stage is 10^-4, the learning rate changes at the 50th and 100th epochs, and the maximum epoch of the thawing stage is the 150th. For HMDB51, the learning rate changes at epochs 80, 160 and 230 and training ends at the 300th epoch in the freezing stage, while the milestones of the thawing stage are the 80th, 160th and 200th epochs. The final predicted score is the weighted average of the scores from the three streams: the spatial stream weight is set to 0.5, the temporal stream to 2.0, and the fusion stream to 1.0. In the test phase, the scores of 25 evenly spaced segments sampled from the whole video are averaged to obtain the final score of each stream.
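The stepwise schedules above amount to a milestone rule (divide the learning rate by 10 at each milestone epoch); an illustrative helper, not the authors' code:

```python
def step_lr(base_lr, milestones, epoch, factor=0.1):
    """Multiply base_lr by factor at each milestone epoch that has passed."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr

# UCF101 spatial stream: 1e-3, divided by 10 at epochs 40, 80 and 120
print(step_lr(1e-3, [40, 80, 120], 10))   # base rate, no decay yet
print(step_lr(1e-3, [40, 80, 120], 90))   # two decays applied
```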

Fig. 3. Performance comparison

4.3 Experimental Results and Analysis

The proposed model is compared with current state-of-the-art models on the UCF101 and HMDB51 public datasets. The number of segments of each model is 3, and the compared methods are divided into a hand-crafted feature group and a deep learning feature group. The comparison results are shown in Fig. 3. It can be seen that most deep learning feature methods are superior to hand-crafted feature methods in terms of accuracy. The proposed algorithm achieves accuracies of 94.96% and 95.27% in the freezing and thawing stages on the UCF101 dataset respectively, which is better than the other algorithms. Correspondingly, on the HMDB51 dataset, the proposed model reaches 71.09% and 71.33% in the two stages, respectively.

4.3.1 Role of the Module


Three models were tested to explore the significance of each module. The accuracy of the
baseline can reach 93.73%. When four DCBF modules are added to the baseline model,
the accuracy can be improved by about 0.70% during the freezing phase and 1.01%
during the thawing phase. With the addition of the CSA module, the whole network can
further achieve accuracy gains of 0.34% and 0.45%. Figure 4 shows the accuracy
as each module is added to the entire network. In the figure, DCBF denotes the
experiment without the CSA module, and channel-spatial attention denotes the
complete network with the CSA module.
Research on Diverse Feature Fusion Network 535

Fig. 4. Accuracy of each module

Fig. 5. Effects of different characteristics

4.3.2 Feature Fusion


To present the effects of features at multiple levels, we use three sets of features;
Fig. 5 shows the details. Level 1 means that the fusion module uses only Inc. Mod. 1
of the underlying dual-stream network, and only DCBF module 1 is enabled.
Level 2 uses the features of Inc. Mod. 1 and Inc. Mod. 2; DCBF modules 1, 2,
and 4 are enabled in the Level 2 experiment. Level 3 represents the final
model of this article. All three models retain the CSA module. The accuracy of Level
1 in the two stages is about 94.32% and 94.64%, respectively. Level 2 is 0.20% higher
than Level 1 in the freezing phase and 0.34% higher in the thawing phase. The
accuracy of Level 3 is 0.27% and 0.24% higher than that of the others.

5 Conclusion

In this paper, a new video-based action recognition network, the multi-feature
fusion network, is proposed. A variety of compact bilinear algorithms are used to
fuse the spatial and temporal characteristics acquired before the global average
pooling layer of the two single streams; the result is called the fusion stream. In
addition, to take local information into account, the network combines the
spatio-temporal characteristics from multiple layers of the two streams, which
are called diversified features. Channel attention and multi-size spatial attention
are connected in parallel to obtain a set of weights that better select the
informative parts of the features. Experiments on real datasets show that the
proposed model performs well.
Video action recognition is an important research topic in the field of computer
vision. This study improves video action recognition to some extent, but much
remains to be done. Future research may focus on: (1) the study of occlusion in
complex video; (2) the analysis of local joint movements; and (3) the fusion of
joint features and hand-crafted appearance features.

Uncertainty-Aware Hierarchical
Reinforcement Learning Robust to Noisy
Observations

Felippe Schmoeller Roza(B)

Fraunhofer IKS, Munich, Germany


[email protected]

Abstract. This work proposes UA-HRL, an uncertainty-aware hierarchical
reinforcement learning framework for mitigating the problems
caused by noisy sensor data. The system is composed of an ensemble
of predictive models that learns the environment’s underlying dynamics
and estimates the uncertainty through their prediction variances and a
two-level Hierarchical Reinforcement Learning agent that integrates the
uncertainty estimates into the decision-making process. It is also shown
how frame-stacking can be combined with the uncertainty estimation for
the agent to make better decisions despite the aleatoric noise present in
the observations. In the end, results obtained in a simulation environ-
ment are presented and discussed.

Keywords: Hierarchical reinforcement learning · Uncertainty · Robustness · Decision making

1 Introduction
Reinforcement learning (RL) has regained attention in recent years with
achievements that range from learning to play video games from scratch [18]
to beating masters in the game of Go [22,27,28]. All of this was possible due to
the integration of Deep Neural Networks (DNNs) that can be efficiently used as
feature extractors and function approximators for high-dimensional problems.
However, despite these impressive accomplishments, RL is still overlooked as a
viable solution for controlling dynamic systems in real-world applications. The
lack of strong safety guarantees, an insufficient level of robustness, and weak
generalization are among the reasons for not choosing RL.
Safety-related properties are not, however, the only requirements for RL systems. The
challenges of learning from complex environments are diverse, ranging from dif-
ficulties in dealing with large state spaces and high-dimensional data encodings
to dealing with temporal abstraction and increasing sample efficiency. Research
in the field of hierarchical reinforcement learning (HRL), alongside evidence
collected from studies in neuroscience and behavioral psychology, suggests that
hierarchical structures can help RL tackle these issues [3,4,21].
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 538–547, 2023.
https://doi.org/10.1007/978-3-031-18461-1_35

Working under uncertainty is another challenging problem for RL agents (and
machine learning in general). Using RL to make decisions based on data that
differs from the training distribution might lead to catastrophic outcomes, as
pointed out in [9]. It is also important to notice that, even though deep learning
is now state-of-the-art for solving many complex tasks (e.g., computer vision),
methods used for training DNNs are also known to be susceptible to failure
under uncertainty, as pointed out by different studies [11,24,25].
Different sources of uncertainty, namely epistemic and aleatoric uncertainties,
affect real-world environments. [26] shows how to integrate uncertainty estima-
tion in RL models to model epistemic uncertainty and use it as a proxy for
detecting domain shifts. Aleatoric uncertainty, on the other hand, cannot be
captured using this same approach.
This paper is focused on presenting an HRL model that combines a bootstrap-
based uncertainty estimator with frame stacking to handle aleatoric uncertainty
caused by noisy data. The paper is structured as follows: Sect. 2 reviews relevant
published work, Sect. 3 contains the RL and HRL problem formulation, Sect. 4
details the proposed method, Sect. 5 shows the results obtained through
simulation, and Sect. 6 presents the concluding remarks, the method’s known
limitations, and future work.

2 Related Work
Hierarchical reinforcement learning is credited as a good representation of the
human biological brain structure and of our reasoning process [5,6,21]. HRL’s
capacity to excel in complex problems is also outlined in a myriad of publications
(e.g. [20,32]). One important HRL formulation is given by the options framework
[29], which still influences (directly or indirectly) most of the state-of-the-art
HRL models [2]. The importance of achieving temporal abstraction with RL
agents is not only highlighted by the options framework but by other works that
have also addressed this topic [13].
Two-level HRL architectures are quite popular due to their simple structure,
which helps in designing and deploying the model. It consists of a high-level
model responsible for breaking down the problem into sub-goals, and a low-level
controller responsible for accomplishing the determined sub-goals. An issue with
this approach comes from the policies working at different temporal abstractions,
making learning a complicated task considering that the environment outputs a
single feedback signal. The authors of [15] show how intrinsic motivation can be
used to train the low-level controller while the high-level system learns directly
through the extrinsic reward given by the environment. The authors of [19] also
present a similar HRL structure.
Uncertainty in deep learning models is now thoroughly studied since there is
a bigger appeal for deploying such systems in real (and potentially safety-critical)
environments. The authors of [1] present an overview of different methods used for
uncertainty quantification in deep learning. The authors of [10] show a comparison
between different methods available for uncertainty estimation. Regarding RL,
uncertainty is often associated with safety [12,14,17].

3 Preliminaries
In this section, the classical RL framework and the HRL variants are formalized.
In RL, we consider an agent that sequentially interacts with an environment
modeled as a Markov Decision Process (MDP) [19]. An MDP is a tuple M :=
(S, A, R, P, μ0 ), where S is the set of states, A is the set of actions, R : S ×
A × S → R is the reward function, P : S × A × S → [0, 1] is the transition
probability function which describes the system dynamics, where P (st+1 |st , at )
is the probability of transitioning to state st+1 , given that the previous state
was st and the agent took action at , and μ0 : S → [0, 1] is the starting state
distribution. At each timestep the agent observes the current state st ∈ S, takes
an action at ∈ A, transitions to the next state st+1 drawn from the distribution
P (st , at ), and receives a reward R(st , at , st+1 ).
For the Hierarchical model formulation, a two-level structure similar to the
one presented by [31] will be used. The top-level policy μhi observes the state
st and sets a high-level action (or goal) gt . The bottom-level policy μlo observes
the state st and the goal gt and outputs a low-level action at , which should
move the agent towards accomplishing the high-level action. The high-level agent
receives the environment reward R(st , at , st+1 ) at every time step, while an intrinsic
reward R(st , gt , at , st+1 ) is given by the high-level agent to the low-level agent
at every time step. Temporal abstraction is provided by the high-level policy only
deriving a new high-level action once the low-level agent has finished the task,
either by completing it or failing at it.
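The interaction pattern described above can be sketched as a rollout loop; the environment interface (`step`, `intrinsic_reward`, `goal_done`) and the callable policies are hypothetical stand-ins, not part of the paper, and training updates are omitted. The key point is that the high-level policy is queried only when the previous goal has been completed or failed.

```python
def run_episode(env, mu_hi, mu_lo, max_steps=200):
    """Two-level HRL rollout.  The high-level policy mu_hi sets a goal
    g_t; the low-level policy mu_lo acts until the goal is completed or
    failed, and only then is a new goal drawn (temporal abstraction)."""
    s = env.reset()
    goal = None
    total_extrinsic = 0.0
    for _ in range(max_steps):
        if goal is None:                       # previous goal finished
            goal = mu_hi(s)                    # high-level action g_t
        a = mu_lo(s, goal)                     # low-level action a_t
        s_next, r_ext, done = env.step(a)
        total_extrinsic += r_ext               # extrinsic reward, to mu_hi
        r_int = env.intrinsic_reward(s, goal, a, s_next)  # to mu_lo
        if env.goal_done(s_next, goal):        # goal completed or failed
            goal = None                        # forces a new high-level query
        s = s_next
        if done:
            break
    return total_extrinsic
```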

4 Uncertainty-Aware Hierarchical Reinforcement Learning
In this section, the framework for learning HRL models under uncertainty is
presented. The system is composed of an uncertainty estimator that works with
frame stacking to estimate aleatoric uncertainty caused by sensor noise, and a
two-level HRL model. The framework is shown in Fig. 1.

4.1 Frame Stacking


Frame stacking is a method used to capture features that only manifest in a
sequence of states. It consists of buffering a sequence of observations and the
actions taken and appending these to the observation. In RL this is particularly
relevant since inconsistencies in the transition function can only be perceived
after the transition takes place. Therefore, using a buffer to keep past observa-
tions will enable tracking such deviations.
Frame stacking can also help to identify if noise is affecting the observation
measurements. As an example, we can think of an agent controlling a car and
observing its speed. Let’s assume the agent observed a speed of 20 m/s at time
t0 . If the agent starts braking the car and at t1 observes a speed of 25 m/s,
a logical guess is that one of the observations is wrong, since it’s expected to

Fig. 1. UA-HRL: uncertainty-aware hierarchical reinforcement learning.

observe the speed to decrease after braking (if reasons like mechanical failures are
disregarded). In other words, if an accurate model for the underlying dynamics
of the environment is available, feeding the model with the history of previous
states and actions and comparing the predicted states to the actually measured
states will give some good hints on both data and model quality.
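A frame-stacking buffer of this kind can be sketched with fixed-length deques; the class name and the flat concatenation format are our own illustrative choices, and the buffer is zero-padded before N pairs have been observed.

```python
from collections import deque
import numpy as np

class FrameStacker:
    """Keeps the last N (observation, action) pairs and exposes them as
    a single flat vector to be appended to the current observation."""

    def __init__(self, n_frames, obs_dim, act_dim):
        self.obs = deque([np.zeros(obs_dim)] * n_frames, maxlen=n_frames)
        self.act = deque([np.zeros(act_dim)] * n_frames, maxlen=n_frames)

    def push(self, observation, action):
        # Appending to a full deque silently drops the oldest entry.
        self.obs.append(np.asarray(observation, dtype=float))
        self.act.append(np.asarray(action, dtype=float))

    def stacked(self):
        # {s_0, ..., s_N, a_0, ..., a_N} as one flat vector
        return np.concatenate(list(self.obs) + list(self.act))
```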

4.2 Uncertainty Estimation

One of the main subsystems that compose the framework is the uncertainty
estimator. The best approach for capturing uncertainty in deep learning would
be to use Bayesian networks, able to inherently represent uncertainty by learning
distributions over the network weights. However, due to the amount of computing
necessary to process such models, approximations are usually used. Different
techniques can be used to approximate Bayesian networks, such as Monte-Carlo
Dropout, Deep Ensembles and deterministic uncertainty quantification [7,8,30].
Considering the good performance achieved by Deep Ensembles [10], this
method was chosen to achieve uncertainty quantification in the proposed frame-
work. The uncertainty is approximated by the variance of predictions given by
the ensemble members for the same input state. The intuition behind it is that,
assuming that training leads to near-optimal approximations, the models should
converge to similar decisions when a sample similar to those experienced during
training is given, but will probably diverge when out-of-the-ordinary inputs are
given; this behavior is enforced by random weight initialization and the use of a
dropout layer during training.
In the proposed method, uncertainty estimation is performed on top of the
frame-stacked observations. The ensemble consists of m models. Each model
predicts the next state observation based on a stacking of N past observations
and actions, i.e., the model input is {s0 , ..., sN , a0 , ..., aN } and it outputs sN +1 .
The uncertainty estimates will be given by the variance over the m ensemble
predictions at every time step.
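The ensemble-variance estimate can be sketched as follows; `models` stands for any list of m trained predictors mapping the stacked input {s0, ..., sN, a0, ..., aN} to a next-state prediction (the callable interface is an assumption of ours).

```python
import numpy as np

def ensemble_predict(models, stacked_input):
    """Return the mean next-state prediction and the per-dimension
    variance across the m ensemble members; the variance is used as
    the uncertainty estimate."""
    preds = np.stack([m(stacked_input) for m in models])  # (m, state_dim)
    return preds.mean(axis=0), preds.var(axis=0)
```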

4.3 Hierarchical Reinforcement Learning


The other subsystem that completes the proposed framework is a two-level hier-
archical reinforcement learning model. The goal of the high-level agent is to
maximize the extrinsic reward, which is a feedback signal that comes from the
environment, while the low-level agent should follow the direction given by the
top-level policy, which is also responsible for generating the intrinsic reward that
guides the training process of the low-level agent. The model follows a structure
similar to the one proposed by [15].
Both high-level and low-level policies are trained using Proximal Policy Optimization (PPO) [23].

5 Experiments
The experiments were conducted in the windy maze environment, proposed by
[16]. This environment consists of a discrete maze, as shown in Fig. 2. The goal
is to cross the maze from the starting position and reach the goal position. In
every time step, a wind randomly blows in one of the four directions (north,
south, east, west) and the agent can choose to be carried by the wind or stay in
its position. The observation consists of the [x,y] position of the agent and the
wind direction. The reward is sparse, with the agent receiving 100 when reaching
the goal and 0 otherwise. The episode is finished when the agent achieves the
goal or after 200 steps.

Fig. 2. Windy maze environment.

In the hierarchical setting, the high-level agent defines the direction the agent
should move next and the low-level decides if it should move or not. By not
moving, a reward of 0 is returned to the low-level agent. By deciding to move,
the low-level will receive +1 if the high-level goal is achieved and −1 otherwise.
This environment was modified to fit the uncertainty setting this paper is
focusing on. To mimic sensor noise reasonably, noise with a magnitude of 1
is randomly injected into the agent position, i.e., ±1 is added to both the x and
y positions with a probability ρ ∈ [0, 1]. If ρ = 0, the environment works as
originally designed and the agent can rely completely on the measurements to
decide whether to move or not. The higher the probability of noise being added
to the measurements, the less the agent should rely on the observation alone for
making decisions.
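The noise-injection step can be sketched as a wrapper around the observed position; the maze logic itself is omitted, and treating the ±1 perturbation of both coordinates as a single event with probability ρ is our reading of the description above.

```python
import random

def noisy_position(x, y, rho, rng=random):
    """Return the observed [x, y] position: with probability rho, +/-1
    noise is added to both coordinates (the paper's sensor-noise model,
    as we read it)."""
    if rng.random() < rho:
        x += rng.choice([-1, 1])
        y += rng.choice([-1, 1])
    return x, y
```

With `rho=0.0` this recovers the original environment, where the agent can rely completely on the measurements.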

5.1 Results

Fig. 3. Results for the Vanilla HRL Model (without uncertainty estimation). The high-
level reward represents extrinsic reward and, therefore, reflects the ability to solve the
task. The low-level reward is the intrinsic reward and the combined is the sum of both.

The first set of experiments was done with an HRL model without any uncer-
tainty estimation. The probabilities of adding noise to the observation were
ρ = {0.0, 0.1, 0.2, 0.3, 0.4, 0.5}, ranging from 0.0 (no noise) to 0.5 (a 50%
probability for each observation to be contaminated with noise). The results are shown
in Fig. 3.
In Fig. 3a we can see how the noise affects the capacity of the agent to solve
the given task. For ρ values above 0.3, the agent is not able to solve the problem
at all. Figure 3c shows that the low-level agents can still learn something in most
cases. This means that the HRL model fails to learn due to the inability of the
high-level agent to correctly set goals that would move the agent towards the
overall goal. This result is expected considering that in this environment
only the positions are affected by noise, while the wind measurement remains
free of noise.

Fig. 4. Results for UA-HRL with uncertainty estimator ensemble based on 5 models.
The high-level reward represents extrinsic reward and, therefore, reflects the ability to
solve the task. The low-level reward is the intrinsic reward and the combined is the
sum of both.

Figure 4 shows the results obtained for the same settings with UA-HRL, the
framework proposed in this paper. For this experiment, the values of ρ were
changed in the same way as in the previous experiment. By analyzing Fig. 4a
it becomes clear how the UA-HRL model is able to learn meaningful policies
regardless of the amount of noisy data experienced during training. It is also
evident that learning is much slower. This can be explained by the fact that, in
this case, the model must not only learn how to solve the task but also how to
interpret and take advantage of the uncertainty estimates.

It is not clear why the system is performing slightly worse for low ρ values.
One hypothesis is that for low-noise data, the uncertainty estimates are not
necessary but variances are still produced and introduced into the model since
realistically the learned models will slightly diverge even for noise-free samples.

6 Conclusion

In this paper, UA-HRL, a framework that adds an uncertainty estimator
to a hierarchical reinforcement learning model, is presented. The main goal is
to mitigate deficiencies caused by noisy observations, an undesired property
present in most real-world problems that can affect not only hierarchical mod-
els but any decision-making agent. Through the results obtained with simula-
tions in the Windy Maze environment, the improvements reached by training
uncertainty-aware systems were evident, especially for data distributions with
more noise. Therefore, the proposed HRL model was improved in terms of robust-
ness through the integration of uncertainty estimates in the decision-making
process.
Some limitations are still present in the proposed method. The system does
not provide any strong guarantee that helps in ensuring safety. Rather, it is
shown through experimentation that the capacity to make decisions is main-
tained even under uncertainty. Also, the whole method is built on top of the
assumption that the ensemble variance is a good proxy for uncertainty. Using
Bayesian networks or better approximators could improve the uncertainty esti-
mation.
Some questions remain open and should be addressed in future work. A
more detailed ablation study would clarify the reason for UA-HRL performing
poorly for data with lower amounts of noise. Extending the experiments to other
environments could also help in evaluating the effectiveness of the proposed
method in increasing robustness when dealing with noisy measurements.

Acknowledgments. This work was funded by the Bavarian Ministry for Economic
Affairs, Regional Development and Energy as part of a project to support the thematic
development of the Institute for Cognitive Systems.

References
1. Abdar, M., et al.: A review of uncertainty quantification in deep learning: tech-
niques, applications and challenges. Inf. Fusion 76, 243–297 (2021)
2. Bacon, P.-L., Harb, J., Precup, D.: The option-critic architecture. In: Proceedings
of the AAAI Conference on Artificial Intelligence, vol. 31 (2017)
3. Badre, D., Hoffman, J., Cooney, J.W., D’esposito, M.: Hierarchical cognitive con-
trol deficits following damage to the human frontal lobe. Nat. Neurosci. 12(4),
515–522 (2009)
4. Botvinick, M., Ritter, S., Wang, J.X., Kurth-Nelson, Z., Blundell, C., Hassabis, D.:
Reinforcement learning, fast and slow. Trends Cogn. Sci. 23(5), 408–422 (2019)

5. Botvinick, M.M., Niv, Y., Barto, A.G.: Hierarchically organized behavior and its
neural foundations: a reinforcement learning perspective. Cognition 113(3), 262–
280 (2009)
6. Botvinick, M.M.: Hierarchical reinforcement learning and decision making. Curr.
Opinion Neurobiol. 22(6), 956–962 (2012)
7. Fort, S., Hu, H., Lakshminarayanan, B.: Deep ensembles: a loss landscape perspec-
tive. arXiv preprint arXiv:1912.02757 (2019)
8. Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: representing
model uncertainty in deep learning. In: International Conference on Machine Learn-
ing, pp. 1050–1059. PMLR (2016)
9. Haider, T., Roza, F.S., Eilers, D., Roscher, K., Günnemann, S.: Domain shifts in
reinforcement learning: identifying disturbances in environments (2021)
10. Henne, M., Schwaiger, A., Roscher, K., Weiss, G.: Benchmarking uncertainty esti-
mation methods for deep learning with safety-related metrics. In: SafeAI@ AAAI,
pp. 83–90 (2020)
11. Henne, M., Schwaiger, A., Weiss, G.: Managing uncertainty of AI-based perception
for autonomous systems. In: AISafety@ IJCAI (2019)
12. Hoel, C.-J., Wolff, K., Laine, L.: Tactical decision-making in autonomous driving
by reinforcement learning with uncertainty estimation. In: 2020 IEEE Intelligent
Vehicles Symposium (IV), pp. 1563–1569. IEEE (2020)
13. Jong, N.K., Hester, T., Stone, P.: The utility of temporal abstraction in reinforce-
ment learning. In: AAMAS (1), pp. 299–306. Citeseer (2008)
14. Kahn, G., Villaflor, A., Pong, V., Abbeel, P., Levine, S.: Uncertainty-aware rein-
forcement learning for collision avoidance. arXiv preprint arXiv:1702.01182 (2017)
15. Kulkarni, T.D., Narasimhan, K., Saeedi, A., Tenenbaum, J.: Hierarchical deep
reinforcement learning: integrating temporal abstraction and intrinsic motivation.
In: Advances in Neural Information Processing Systems, vol. 29 (2016)
16. Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J.E., Stoica, I.: Tune:
a research platform for distributed model selection and training. arXiv preprint
arXiv:1807.05118 (2018)
17. Lütjens, B., Everett, M., How, J.P.: Safe reinforcement learning with model uncer-
tainty estimates. In: 2019 International Conference on Robotics and Automation
(ICRA), pp. 8662–8668. IEEE (2019)
18. Mnih, V., et al.: Playing Atari with deep reinforcement learning. arXiv preprint
arXiv:1312.5602 (2013)
19. Nachum, O., Gu, S.S., Lee, H., Levine, S.: Data-efficient hierarchical reinforcement
learning. In: Advances in Neural Information Processing Systems, vol. 31 (2018)
20. Pertsch, K., Lee, Y., Lim, J.J.: Accelerating reinforcement learning with learned
skill priors. arXiv preprint arXiv:2010.11944 (2020)
21. Ribas-Fernandes, J.J.F., et al.: A neural signature of hierarchical reinforcement
learning. Neuron 71(2), 370–379 (2011)
22. Schrittwieser, J., et al.: Mastering Atari, Go, chess and shogi by planning with a
learned model. Nature 588(7839), 604–609 (2020)
23. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
24. Schwaiger, A., Sinhamahapatra, P., Gansloser, J., Roscher, K.: Is uncertainty quan-
tification in deep learning sufficient for out-of-distribution detection? In: AISafety@
IJCAI (2020)
25. Schwaiger, F., et al.: From black-box to white-box: examining confidence calibra-
tion under different conditions. arXiv preprint arXiv:2101.02971 (2021)

26. Sedlmeier, A., Gabor, T., Phan, T., Belzner, L., Linnhoff-Popien, C.: Uncertainty-
based out-of-distribution detection in deep reinforcement learning. arXiv preprint
arXiv:1901.02219 (2019)
27. Silver, D., et al.: Mastering the game of Go with deep neural networks and tree
search. Nature 529(7587), 484–489 (2016)
28. Silver, D., et al.: Mastering the game of Go without human knowledge. Nature
550(7676), 354–359 (2017)
29. Sutton, R.S., Precup, D., Singh, S.: Between MDPs and semi-MDPs: a framework
for temporal abstraction in reinforcement learning. Artif. Intell. 112(1–2), 181–211
(1999)
30. Van Amersfoort, J., Smith, L., Teh, Y.W., Gal, Y.: Uncertainty estimation using a
single deep deterministic neural network. In: International Conference on Machine
Learning, pp. 9690–9700. PMLR (2020)
31. Vezhnevets, A.S., et al.: Feudal networks for hierarchical reinforcement learning.
In: International Conference on Machine Learning, pp. 3540–3549. PMLR (2017)
32. Yang, Z., Merrick, K., Jin, L., Abbass, H.A.: Hierarchical deep reinforcement learn-
ing for continuous action control. IEEE Trans. Neural Netw. Learn. Syst. 29(11),
5174–5184 (2018)
Resampling-Free Bootstrap Inference
for Quantiles

Mårten Schultzberg(B) and Sebastian Ankargren

Spotify, Stockholm, Sweden


{mschultzberg,sebastiana}@spotify.com

Abstract. Bootstrap inference is a powerful tool for obtaining robust
inference for quantiles and difference-in-quantiles estimators. The
computationally intensive nature of bootstrap inference has made it infeasible
in large-scale experiments. In this paper, the theoretical properties of the
Poisson bootstrap algorithm and quantile estimators are used to derive
alternative resampling-free algorithms for Poisson bootstrap inference
that reduce the computational complexity substantially without addi-
tional assumptions. These findings are connected to existing literature
on analytical confidence intervals for quantiles based on order statis-
tics. The results unlock bootstrap inference for difference-in-quantiles for
almost arbitrarily large samples. At Spotify, we can now easily calculate
bootstrap confidence intervals for quantiles and difference-in-quantiles in
A/B tests with hundreds of millions of observations.

Keywords: Bootstrap · Difference-in-quantiles · Order statistic · Poisson · Resampling-free

1 Introduction
The use of randomized experiments in product development has seen an enor-
mous increase in popularity over the last decade. Modern tech companies now
view experimentation, often called A/B testing, as fundamental and have tightly
integrated practices around it into their product development. The vast major-
ity of A/B testing compares two groups, treatment and control, with respect to
average treatment effects through calculation of difference-in-means. These com-
parisons are operationalized through standard z-tests that are simple to perform.
With the rise of A/B testing, tests that do not compare average effects are also
gaining more and more interest. Difference-in-quantiles, where treatment and
control quantiles are compared, is one such test; reasons for using it might be
that effects are not expected, or are difficult to identify, on average.
For example, a change could be targeting users experiencing the largest amount
of buffering, which means a difference-in-quantiles comparison for, say, the 90th
percentile may be of more interest than the average buffering amount experi-
enced by users. These tests are, however, much more difficult to perform, with
non-standard sampling distributions, that severely complicate implementations.
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 548–562, 2023.
https://doi.org/10.1007/978-3-031-18461-1_36

In randomized experiments, a common technique for doing inference for
estimators with non-standard or intractable sampling distributions is the bootstrap [7].
Bootstrap is a resampling-based method where the sampling distribution of an
estimator is estimated by resampling with replacement from the observed sam-
ple. Bootstrap inference is known to be consistent for quantile estimators and
difference-in-quantiles estimators [8,9] under mild conditions on the outcome
distribution. The computationally intensive nature of the bootstrap has made
small-sample experiments its primary use case. The large-scale online experiments
run by tech companies often involve millions or even hundreds of millions of users.
Recently, two prominent approaches for bootstrapping with big data have
been proposed. Both of these methods are focused on finding computationally
efficient implementations of the bootstrap approach rather than reducing its
complexity. The first approach, known as the Poisson bootstrap, exploits the fact
that bootstrap resampling, i.e., multinomial sampling from the original sample, can
be well approximated by Poisson frequencies [11]. For example, [2] showed that the
Poisson bootstrap can be implemented in the MapReduce framework [6], which
enables powerful parallelization on clusters of computers. [2] implemented a non-
parametric bootstrap in MapReduce for linear estimators [16], like means and
sums or smooth functions thereof. Moreover, [2] showed that semi-parametric
estimators of non-linear estimators such as quantiles can be estimated using sim-
ilar implementations. As an alternative to bootstrapping quantiles when samples
are dependent, [14] developed a method based on asymptotic arguments.
Another recent approach to bootstrapping that enables efficient implemen-
tations for big data is the so-called ‘Bag of little bootstraps’ [13]. This approach
splits the full-sample inference problem into several smaller inference problems,
and then weights the results together in a consistent manner. Bag of little boot-
straps is non-parametric in the sense that it applies to any estimator, while still
allowing efficient parallelization implementations.
Both the Poisson bootstrap and the Bag of little bootstrap enable big data
bootstrap inference for quantiles through parallelization. In other words, the
computational complexity is overcome by efficient and scalable computing. The
general complexity of the Poisson bootstrap algorithm is of the order of O(CB),
where C is the complexity of the estimator calculated in each bootstrap sample
(see, e.g., [4] for an introduction to algorithmic complexity). For quantiles, most
common estimators have complexity of order O(N ). This follows since most
quantile estimators are based on the order statistics for which the complexity is
of order O(N ) [4], leading to O(N B) for quantile bootstrap.
Analytical confidence intervals for quantiles have been studied for a long time.
For the one-sample quantile case, simple exact and distribution-free confidence
intervals for population quantiles can be constructed using only order statistics,
see for example [10, p. 159]. These approaches unlock one-sample confidence
intervals for massive samples, but they are largely absent in experimentation in
the tech industry. A likely reason is that these approaches do not directly extend
to the two-sample difference-in-quantile case.
550 M. Schultzberg and S. Ankargren

The focus of this paper is to reduce the complexity of Poisson confidence


intervals (CIs) for quantiles and difference-in-quantiles type estimators. Specifi-
cally, the properties of the Poisson distribution and quantile estimators based on
order statistics are used to simplify the problem. We note that for one-sample
problems our approach exactly reproduces the exact order-statistic-based confi-
dence intervals such as those described by e.g. [10, p.159]. We then show that
our approach, as opposed to the exact CIs, can be easily extended to simplify the
two-sample confidence interval problem. Specifically it is shown that, for two-
sample difference-in-quantile estimators, it is sufficient to sample order statistics
from the original samples according to a known probability distribution. This
implies that it is sufficient to order the sample once, and then sample order statis-
tics from the ordered sample. This reduces the computational complexity from
O(N B) to O(max(N log(N ), B)). These findings let us perform non-parametric
Poisson bootstrap inference for difference-in-quantiles estimators for almost arbi-
trarily large data sets.
The rest of this paper is structured as follows. Section 2 gives a brief introduction to analytical exact CIs based on order statistics. Section 3 gives a
short overview of traditional Poisson bootstrap for quantiles and difference-in-
quantiles. Section 4 introduces the first contribution of the paper, resampling-free
Poisson bootstrap for quantiles. Section 5 presents the extension to the difference-
in-quantiles CIs together with Monte Carlo evidence of the coverage. Finally,
Sect. 6 concludes the paper.

2 Analytical Confidence Intervals for Quantiles Based


on Order Statistics
In this section we briefly describe analytical confidence intervals (CIs) for quan-
tiles based on order statistics to build intuition for the bootstrap proposed in
this paper, and make the connection to the literature. This type of confidence
interval has been proposed by many (see e.g. [5,10,17]) and there are several
extensions and improvements of the standard solution [12,15]. Here we present
the standard CIs.
Let Yi be a random variable with cumulative distribution function F , and let
y = (y1 , ..., yi , ..., yN ) be a sample of size N obtained by sampling independently
from a continuous distribution F . Let 0 < q < 1 be the population quantile
of interest, and F −1 (q) be the population value at the quantile. The order-
statistic CIs are based on the following simple reasoning. When one observation
is sampled from the population, the probability that the observation is smaller
than F −1 (q) is equal to q, and the probability that the observation is larger than
F −1 (q) is equal to 1 − q. When N observations are sampled from the population
independently, this enables the following distribution-free CI. Select integers r
and s such that 1 ≤ r < s ≤ N and
P(Y_{(r)} < F^{-1}(q) < Y_{(s)}) = \sum_{i=r}^{s-1} \binom{N}{i} q^i (1-q)^{N-i} \ge 1 - \alpha,   (1)

then the interval (Y(r) , Y(s) ) is a 1 − α CI for F −1 (q), where Y(i) refers to the
ith order statistic. There is not a unique pair (r, s) that satisfies the preceding
equation. Additional restrictions can be added to find a unique pair, like assign-
ing equal probability to each tail. See, e.g., [10, p. 158] for details. Throughout
this paper we will refer to the CI given above as analytical order-statistic-based
CIs (AOS-CI).
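For illustration, the pair (r, s) can be found numerically from Bin(N, q) probabilities. The following is a Python sketch with equal tail masses; the function name is ours, not from the paper:

```python
import math
from itertools import accumulate

def aos_ci_indices(n, q, alpha=0.05):
    """Equal-tailed AOS-CI: order-statistic indices (r, s) with
    P(Y_(r) < F^{-1}(q) < Y_(s)) = P(r <= X <= s - 1) >= 1 - alpha, X ~ Bin(n, q)."""
    pmf = [math.comb(n, i) * (q ** i) * ((1 - q) ** (n - i)) for i in range(n + 1)]
    cdf = list(accumulate(pmf))
    # largest r with P(X <= r - 1) <= alpha/2 (lower tail)
    r = max(k for k in range(1, n + 1) if cdf[k - 1] <= alpha / 2)
    # smallest s with P(X >= s) <= alpha/2 (upper tail)
    s = min(k for k in range(1, n + 1) if 1 - cdf[k - 1] <= alpha / 2)
    coverage = sum(pmf[r:s])  # = P(r <= X <= s - 1) >= 1 - alpha by construction
    return r, s, coverage
```

For N = 100 and q = 0.5 this yields the interval (Y_(40), Y_(61)), with coverage slightly above the nominal 95%.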
For the purposes of this paper, the most important aspect of the AOS-CI is
that they use the fact that the distribution of order-statistic indexes is indepen-
dent of the outcome data distribution. The binomial probabilities used above
are valid due to the properties of quantile definition rather than properties of
the outcome data. An important limitation of the AOS-CIs is that they are not
applicable to difference-in-quantiles; the relation between the population quan-
tiles and order statistics does not translate to the difference-in-quantiles and the
difference in order statistics. Nevertheless, in this paper we use similar argu-
ments in a bootstrap context to enable two-sample inference for quantiles. The
following section gives a brief introduction to bootstrap inference.

3 Poisson Bootstrap for Quantiles


The Poisson bootstrap [11] works analogously to a standard non-parametric
multinomial bootstrap, but it lets the number of observations in each bootstrap
vary. Let again Yi be the random variable of interest and yi be an observation
of that variable. Using the Poisson bootstrap, given a sample of size N, we independently generate p_i^{(b)} ∼ Poi(1) for all i = 1, ..., N. In each bootstrap sample, y_i
is included p_i^{(b)} times to form the bootstrap sample y^{(b)}. We repeat the procedure
B ∈ Z^+ times. The size of a given bootstrap sample is \sum_{i=1}^{N} p_i^{(b)},
which is equal to N only in expectation. Let y_{(i)} represent the ith order statistic
in the sample y. In this paper, we are studying the quantile estimator of the
form given in Definition 1.
Definition 1. Define the function g as

g[q, N] = \begin{cases} q(N+1) & \text{if } q(N+1) \bmod 1 = 0 \\ (1-D)\,\lfloor q(N+1) \rfloor + D\,\lceil q(N+1) \rceil & \text{if } q(N+1) \bmod 1 \neq 0, \end{cases}

where D ∼ Ber(q(N + 1) mod 1). Define the sample quantile estimator of quantile q as

τ̂_q = Y_{(g[q,N])}.
The function g in Definition 1 can be thought of as a stochastic rounding
function. If q(N +1) is an integer, the ordered observation q(N +1) is the sample
quantile. If q(N + 1) is not an integer, it is randomly rounded up or down with
probability proportional to the remainder and the corresponding order statis-
tic selected. The estimator is similar to most popular quantile estimators, but
instead of weighting together the two closest observations when the quantile
index is non-integer, it randomly selects one of them. This formulation of the
quantile estimator implies that the estimate is always an observation from the
original sample, which is key for the following results.
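The estimator of Definition 1 amounts to a stochastic rounding of the index q(N + 1). A Python sketch (the helper name is ours):

```python
import math
import random

def sample_quantile(y, q, rng=random):
    """Quantile estimator of Definition 1: always returns an actual
    observation, randomly rounding the index q(N + 1) when it is fractional."""
    ys = sorted(y)
    n = len(ys)
    h = q * (n + 1)
    frac = h % 1.0
    if frac == 0.0:
        k = int(h)
    else:
        # D ~ Ber(frac): round up with probability frac, down otherwise
        k = math.ceil(h) if rng.random() < frac else math.floor(h)
    k = min(max(k, 1), n)  # guard against q(N + 1) falling outside 1..N
    return ys[k - 1]
```

For example, with y = (1, ..., 9) and q = 0.5, the index q(N + 1) = 5 is an integer and the estimator returns y_(5) = 5 deterministically; with N = 10 it returns y_(5) or y_(6) at random.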

3.1 Poisson Bootstrap Inference for Quantiles Through Resampling

In this section, standard resampling-based Poisson bootstrap for quantiles is


presented, to build intuition for the proposed alternative presented in the fol-
lowing section. Even though bootstrap is never needed to construct CIs for the
one-sample quantile (since AOS-CI can be used directly), we focus on the one-
sample case here to build intuition. The two-sample difference-in-quantiles case
is based on the same logic and is achieved by simple extensions presented in
Sect. 5.
A standard Poisson bootstrap CI algorithm for τ̂q is given in Algorithm 1.
Algorithm 1 requires that each bootstrap sample vector y(b) is realized and

Algorithm 1. Algorithm for Poisson bootstrap confidence interval for a one-sample quantile.
1. Generate N Poi(1) random variables p_1^{(b)}, ..., p_i^{(b)}, ..., p_N^{(b)}.
2. Include each y_i observation p_i^{(b)} times and form the bootstrap sample outcome vector y^{(b)}.
3. Calculate the sample estimate τ̂_q^b = y^{(b)}_{(g[q, \sum_{i=1}^{N} p_i^{(b)}])}.
4. Repeat steps 1–3 B times.
5. Return the α/2 and 1 − α/2 quantiles of the distribution of τ̂^b as the two-sided (1 − α)100% confidence interval.

ordered such that the order statistic that is the sample quantile estimate can
be extracted. This is a memory and computationally intensive exercise. In the
following section, we exploit theoretical properties of τ̂q to provide substantial
simplifications to Algorithm 1 that ultimately will enable bootstrap CIs for the
difference-in-quantiles case.
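Algorithm 1 can be sketched directly. The following Python version (the paper's implementation is in Julia; Poisson(1) draws use Knuth's product method) realizes each bootstrap sample explicitly, which is exactly the cost the later sections remove:

```python
import math
import random

def poisson1(rng):
    """Draw from Poisson(1) via Knuth's product method."""
    threshold, k, prod = math.exp(-1.0), 0, rng.random()
    while prod >= threshold:
        k += 1
        prod *= rng.random()
    return k

def poisson_bootstrap_quantile_ci(y, q, B=2000, alpha=0.05, seed=0):
    """Algorithm 1: resampling-based Poisson bootstrap CI for the q-quantile,
    using the stochastic-rounding estimator of Definition 1."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(B):
        sample = [v for v in y for _ in range(poisson1(rng))]  # realize y^(b)
        if not sample:
            continue
        sample.sort()
        n = len(sample)
        h = q * (n + 1)
        frac = h % 1.0
        k = int(h) if frac == 0 else (math.ceil(h) if rng.random() < frac else math.floor(h))
        k = min(max(k, 1), n)
        estimates.append(sample[k - 1])
    estimates.sort()
    m = len(estimates)
    return estimates[int(alpha / 2 * m)], estimates[min(int((1 - alpha / 2) * m), m - 1)]
```

Note that every bootstrap estimate is an observation from the original sample, as Definition 1 guarantees.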

4 Poisson Bootstrap CIs for τ̂q Without Resampling


(b)
The key insight that facilitates our approach is that the estimator τ̂q in Defi-
nition 1 applied to the bth bootstrap sample only uses order statistics from the
(b)
original sample. The process of drawing resampling frequencies pi and realizing
the vector y(b) is only necessary to find what original index maps to the sample
quantile in the bootstrap sample. If we knew the distribution of the random
variable describing which original index is the desired sample quantile in the
bootstrap sample, we could simply generate indexes from that distribution and
extract the corresponding original sample order statistic directly.

We now aim to build intuition for this index distribution. Consider an exam-
ple where we are interested in a confidence interval for the median in a sample
of N = 10 observations. According to the quantile estimator in Definition 1, the
median is y(5) or y(6) with equal probability. It is intuitively apparent that the
middle order statistics y(4) , y(5) , y(6) , y(7) in the original sample are more likely
to be medians in the bootstrap sample. The first and the last order statistics
y(1) , y(10) have little chance of being the medians in the bootstrap sample. For
example, y_{(1)} will be the median in a bootstrap sample only if p_1 ≥ \sum_{i=2}^{N} p_i
is satisfied. If, e.g., p_1 = 2, then the sum of all 9 remaining Poisson random
variables (p_2, ..., p_{10}) must be smaller than or equal to p_1 = 2, which is unlikely
given that \sum_{i=2}^{10} p_i ∼ Poi(9). Following this logic, it is clear that order statistics
that are closer in rank to the original sample quantile have higher probability
of being observed as the sample quantile in a bootstrap sample. It is easy to
simulate the distribution of what index in the original sample is observed as the
desired quantile across bootstrap samples. Figure 1 displays this distribution for
the 10th percentile in 1,000,000 Poisson bootstrap samples in a sample of size
N = 2000. If indexes could be generated directly from the index distribution,


Fig. 1. Distribution of the index of the order statistic from a sample of N = 2000 that became the 10th percentile over 1M Poisson bootstrap samples.

the quantile for a bootstrap sample could be obtained by simply generating an


index and returning the corresponding order statistic from the original sample.
In the following section, this index distribution is characterized mathematically.
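The empirical index distribution behind Fig. 1 can be reproduced with a short simulation. A Python sketch (the paper's code is in Julia; all names here are ours):

```python
import math
import random
from collections import Counter

def simulate_index_distribution(n, q, B=20000, seed=0):
    """Empirical distribution of psi: the original-sample index observed
    as the q-quantile across B Poisson bootstrap samples."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(B):
        # Poisson(1) frequency for each ordered original index (Knuth's method)
        freqs = []
        for _ in range(n):
            t, k, p = math.exp(-1.0), 0, rng.random()
            while p >= t:
                k += 1
                p *= rng.random()
            freqs.append(k)
        total = sum(freqs)
        if total == 0:
            continue
        h = q * (total + 1)
        frac = h % 1.0
        target = int(h) if frac == 0 else (math.ceil(h) if rng.random() < frac else math.floor(h))
        target = min(max(target, 1), total)
        # walk the ordered indices until the cumulative frequency reaches the target
        cum = 0
        for i, f in enumerate(freqs, start=1):
            cum += f
            if cum >= target:
                counts[i] += 1
                break
    return counts
```

Because only order matters, the simulation never needs actual outcome data: frequencies attached to ordered indices fully determine which original index is observed as the quantile.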

4.1 The Probability Distribution Describing What Index


in the Original Sample Is Observed as the Bootstrap Sample
Quantile q Estimate τ̂q

Let Pi ∼ Poi(1) for i = 1, . . . , N be the frequencies used in the Poisson bootstrap.


Denote X_{<i} = \sum_{j=1}^{i-1} P_j, X_{>i} = \sum_{j=i+1}^{N} P_j, and S = X_{<i} + P_i + X_{>i} = \sum_{i=1}^{N} P_i. By construction, X_{<i} ∼ Poi(i − 1) for i > 1, X_{>i} ∼ Poi(N − i) for i < N, and S ∼ Poi(N). The following theorem establishes the distribution of ψ.
Theorem 1. Let ψ ∈ {1, . . . , N } be the random variable that denotes what index
of the original order statistics is observed as the bootstrap sample quantile q
estimate τ̂q (Definition 1) in a Poisson bootstrap sample. The probability mass
function of ψ is

 
\begin{aligned}
p(\psi = i) ={}& \sum_{n=0}^{\infty} p_{X_{<i},P_i,X_{>i}\mid S}\bigl(X_{<i} \le q(n+1)-1,\ P_i > 0,\ X_{>i} \le (1-q)(n+1)-1 \,\big|\, S = n\bigr)\, p_S(S = n)\, I(r = 0) \\
&+ \sum_{n=0}^{\infty} r\, p_{X_{<i},P_i,X_{>i}\mid S}\bigl(X_{<i} \le q(n+1)-r,\ P_i > 0,\ X_{>i} \le (1-q)(n+1)+r-2 \,\big|\, S = n\bigr)\, p_S(S = n)\, I(r \ne 0) \\
&+ \sum_{n=0}^{\infty} (1-r)\, p_{X_{<i},P_i,X_{>i}\mid S}\bigl(X_{<i} \le q(n+1)-r-1,\ P_i > 0,\ X_{>i} \le (1-q)(n+1)+r-1 \,\big|\, S = n\bigr)\, p_S(S = n)\, I(r \ne 0),
\end{aligned}

where r = q(n + 1) mod 1.

Proof. Let n be the realization of S, and q(n + 1) mod 1 = r. If r = 0, then


q(n + 1) is an integer, implying that for index i to be the quantile the following
must be satisfied

X<i ≤ q(n + 1) − 1
Pi > 0
X>i ≤ (1 − q)(n + 1) − 1.

If r ≠ 0, the index selected will be ⌈q(n+1)⌉ with probability r and ⌊q(n+1)⌋
with probability 1 − r. In the former case, the conditions that need to be satisfied

for i to be the index is

X<i ≤ q(n + 1) − r
Pi > 0
X>i ≤ (1 − q)(n + 1) + r − 2.

When q(n + 1) is rounded down, the conditions are instead

X<i ≤ q(n + 1) − r − 1
Pi > 0
X>i ≤ (1 − q)(n + 1) + r − 1.

The expression in the theorem then follows from the law of total probability.

While Theorem 1 presents the distribution of ψ, it is not a tractable dis-


tribution that lends itself to being characterized easily. We will return to the
matter of practical applications in Sect. 4.2. For the one-sample case, Theorem
1 can be used to obtain Poisson bootstrap confidence intervals for the quantile
analytically. We formalize this result in Corollary 1.
Corollary 1. Denote the lower and upper confidence interval bounds that result
from Algorithm 1 as C^L_{ψ,α/2} and C^U_{ψ,1−α/2}. Let ψ be the random variable of the
order-statistic index that becomes the quantile estimate in the bootstrap sample
as defined in Theorem 1. Denote i_L = max_i {i : P(ψ ≤ i) ≤ α/2} and i_U =
min_i {i : P(ψ ≥ i) ≥ 1 − α/2}. Then the confidence interval for the quantile
estimator τ̂_q given by (Y_{(i_L)}, Y_{(i_U)}) has coverage ≤ 1 − α.

Proof. Since

C^L_{ψ,α/2} ≥ Y_{(i_L)} and C^U_{ψ,1−α/2} ≤ Y_{(i_U)},

the result follows directly from Theorem 1.


Interestingly, Corollary 1 gives CIs that are similar in construction to the


AOS-CIs (Sect. 2), although derived based on two distinct approaches. In the
following section we present an approximation of p(ψ = i) that makes it easy
to generate values that can in turn be used to enable difference-in-quantiles CIs.

4.2 Approximating the Index Distribution to Enable Fast,


Resampling-Free Bootstrap Inference for Difference-in-Quantiles
In this section, we propose an approximation of p(ψ = i) to enable resampling-
free difference-in-quantiles bootstrap CIs that are easy to implement. To moti-
vate our approximation, Fig. 2 shows two examples of index distributions with
the probability mass function of the Bin(N + 1, q) distribution overlaid. The
binomial distribution provides an impressive fit and the fit seems to improve


(a) N = 2000, q = 0.1 (b) N = 2000, q = 0.5

Fig. 2. Two examples of index distributions and their respective binomial approxi-
mations. The area and dotted curves show the estimated densities for the binomial
approximation and index distribution, respectively.

with increasing N . We have found this surprisingly simple approximation to


work incredibly well, and its simplicity means it is fast and easy to work with.
We next state the use of this approximation as a conjecture, and then proceed
to demonstrate its merit in Monte Carlo simulations.
Conjecture 1. Let X ∼ Bin(N + 1, q) and GN denote its cdf. Let also HN be the
cdf of the index distribution as defined in Theorem 1. Then

sup_{x ∈ {1,...,N}} |G_N(x) − H_N(x)| → 0 as N → ∞.

Conjecture 1 says that the bootstrap index distribution can be approximated


with increased accuracy by a binomial distribution as the sample size increases,
which is here defined as an asymptotically vanishing Kolmogorov-Smirnov dis-
tance between the distributions.
If the binomial distribution is used to approximate the index distribution,
the confidence interval in Corollary 1 coincides exactly with the AOS-CI using
α/2 in each tail of the index distribution. This implies that the coverage for the
one-sample CI is exactly bounded by construction, but it says nothing about how
similar GN (x) is to HN (x). However, the binomial-approximated index distribu-
tion and the binomial distribution used to derive AOS-CI are two quite distinct
distributions. That is, they here happen to coincide exactly, but they describe
fundamentally different things; the distribution of original-sample indexes of
order statistics observed as quantiles in bootstrap samples versus the probabil-
ity of a certain number of order statistics to be above or below the population
quantile.
The distribution of original-sample indexes of order statistics observed as a
given quantile in bootstrap samples is independent of the data-generating process
as long as the outcome can be ordered. This means that Monte Carlo simulation

can provide strong evidence that generalizes to all such data-generating pro-
cesses. The setup of the simulation is the following. The number of bootstrap
samples is B = 10^6 and the sample size is set to N ∈ {100, 200, 500, 1000, 5000, 10000}. The quantile of interest is q ∈ {0.01, 0.1, 0.25, 0.5}, which, due to symmetry, generalizes also to q ∈ {0.75, 0.9, 0.99}. For each combination of sample size and quantile, 10^6 bootstrap samples are realized, and it is recorded
which index from the ordered original sample is observed as the estimate of the quantile. The empirical distribution function of the indexes across the bootstrap samples is fitted. The Kolmogorov-Smirnov (KS) distance is calculated
comparing the empirical bootstrap distribution to a Bin(N + 1, q) distribution.
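Given recorded index frequencies, the KS distance to Bin(N + 1, q) reduces to a single scan over the support. A Python sketch (names ours):

```python
import math
from itertools import accumulate

def ks_to_binomial(index_counts, n, q):
    """Kolmogorov-Smirnov distance between an empirical distribution of
    bootstrap indices (mapping: index -> frequency) and Bin(n + 1, q)."""
    total = sum(index_counts.values())
    m = n + 1
    pmf = [math.comb(m, k) * (q ** k) * ((1 - q) ** (m - k)) for k in range(m + 1)]
    binom_cdf = list(accumulate(pmf))
    emp, dist = 0.0, 0.0
    for x in range(1, n + 1):
        emp += index_counts.get(x, 0) / total
        dist = max(dist, abs(emp - binom_cdf[x]))
    return dist
```

Feeding it the output of the index-distribution simulation for each (N, q) pair reproduces the quantities plotted in Fig. 3.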
Figure 3 displays the Kolmogorov-Smirnov distance for each combination of
quantile and sample size. In support of Conjecture 1, the KS distance is decreas-
ing in sample size, indicating that the approximation of ψ using the Bin(N +1, q)
distribution is improving as the sample size increases. Perhaps surprisingly, the
approximation is not strictly improving as the quantile comes closer to 0.5. A
likely explanation is that although the skewness and boundedness of the distri-
bution is less heavy the closer the quantile is to 0.5, the variance in the index
distribution also increases.


Fig. 3. The Kolmogorov-Smirnov distance between the empirical index distribution and Bin(N + 1, q), for quantiles 0.01, 0.1, 0.25, and 0.5 over sample sizes between 100 and 10000.

5 Poisson Bootstrap CIs for Difference-in-Quantiles


We now extend our approach to two-sample difference-in-quantiles inference.
Assume that the control and treatment groups are of sizes Nc and Nt,
respectively, such that the total sample size is Nc + Nt . Let the outcome of
the control and treatment groups be denoted yc = (yc,1 , . . . , yc,i , . . . , yc,Nc ) and
y_t = (y_{t,1}, ..., y_{t,j}, ..., y_{t,N_t}), respectively. Define the difference-in-quantile
estimator δ̂ = τ̂_{t,q} − τ̂_{c,q}, where subscripts c and t indicate the control and
treatment groups, respectively. Algorithm 2 defines the standard resampling-based algorithm for a Poisson bootstrap difference-in-quantile CI.

Algorithm 2. Algorithm for Poisson bootstrap confidence interval for a two-sample difference-in-quantile q.
1. Generate N_c + N_t Poi(1) random variables p_{c,i}^{(b)}, i = 1, ..., N_c and p_{t,j}^{(b)}, j = 1, ..., N_t.
2. Include each y_{c,i} observation p_{c,i}^{(b)} times and each y_{t,j} observation p_{t,j}^{(b)} times to form the bootstrap sample outcome vectors y_c^{(b)} and y_t^{(b)}.
3. Calculate the sample estimate δ̂^b = τ̂_{t,q}^b − τ̂_{c,q}^b.
4. Repeat steps 1–3 B times.
5. Return the α/2 and 1 − α/2 quantiles of the distribution of δ̂^b as the two-sided (1 − α)100% confidence interval.

This algorithm requires generating N_c + N_t Poi(1) random variables in each of the B bootstrap iterations, realizing the samples and finding the appropriate order statistic for each bootstrap sample, calculating the difference between treatment and control for each bootstrap sample, and finally finding the quantiles of the distribution of differences. This implies a total complexity of order O(NB) + O(B) + O(B) = O(NB).
Using Theorem 1, it is straightforward to improve the efficiency of Algorithm
2 for difference-in-quantile CIs. Here, we will utilize Conjecture 1 directly to find
practically applicable approximations. As before, exact analytical results can be
obtained by replacing the binomial distribution with p(ψ = i).
It is not possible to find a direct analogue to Corollary 1 for the two-sample
difference-in-quantiles CI. While the within-sample distribution of indexes is
independent of the outcome data, the distribution of the difference between sam-
ples (i.e., the difference-in-quantile estimate) is not. Instead, Theorem 1 together
with Conjecture 1 can be applied to generate B bootstrap quantile estimates
for each sample, i.e., (τ̂_{q,t}^{(1)}, ..., τ̂_{q,t}^{(B)}) and (τ̂_{q,c}^{(1)}, ..., τ̂_{q,c}^{(B)}) for treatment and control, respectively. The bootstrap distribution of the difference-in-quantiles can
be directly obtained by simply taking the difference between these two vectors.
Let a[v] denote extraction of elements from the vector a according to the vector
of indexes v where elements in v are bounded between 1 and the length of a.
Algorithm 3 displays an efficient algorithm for obtaining CIs for the difference-
in-quantiles.

Algorithm 3. Algorithm for poisson bootstrap confidence interval for a two-


sample difference-in-quantile q.
1. Generate B random numbers from Bin(Nc + 1, q) and B random numbers from
Bin(Nt + 1, q) and save them in two vectors Ic and It , respectively.
2. Order the outcome vectors ỹ_c = (y_{c,(1)}, ..., y_{c,(N_c)}) and ỹ_t = (y_{t,(1)}, ..., y_{t,(N_t)}).
3. Calculate the vector of difference-in-quantiles as τ̂ = ỹt [It ] − ỹc [Ic ]
4. Return the α/2 and 1 − α/2 quantiles of τ̂ as the two-sided (1 − α)100% confidence
interval for the difference-in-quantiles.
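Algorithm 3 translates almost line for line into code. Below is a Python sketch (the paper's reference implementation is in Julia; the simple Bernoulli-sum binomial sampler is ours and only for illustration):

```python
import random

def binom_draw(m, p, rng):
    """Bin(m, p) as a sum of Bernoullis; adequate for modest m."""
    return sum(rng.random() < p for _ in range(m))

def diff_in_quantiles_ci(yc, yt, q, B=1000, alpha=0.05, seed=0):
    """Algorithm 3: resampling-free Poisson-bootstrap CI for the difference
    in q-quantiles via the Bin(N + 1, q) index approximation."""
    rng = random.Random(seed)
    sc, st = sorted(yc), sorted(yt)   # step 2: order each sample once
    nc, nt = len(sc), len(st)
    diffs = []
    for _ in range(B):                # step 1: draw one index pair per replicate
        ic = min(max(binom_draw(nc + 1, q, rng), 1), nc)  # clamp into 1..N
        it = min(max(binom_draw(nt + 1, q, rng), 1), nt)
        diffs.append(st[it - 1] - sc[ic - 1])             # step 3
    diffs.sort()                      # step 4: empirical quantiles of the diffs
    lo = diffs[int(alpha / 2 * B)]
    hi = diffs[min(int((1 - alpha / 2) * B), B - 1)]
    return lo, hi
```

In practice, step 1 would use a library binomial sampler (e.g., numpy.random.Generator.binomial) so that generating the 2B indexes costs O(B).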

Algorithm 3 generates 2B binomial random numbers, sorts two vectors of


lengths Nc and Nt , extracts 2B numbers from arrays and calculates the differ-
ence, and finally finds the quantiles of the distribution. This leads to an overall
complexity of order O(2B) + O(N_c log(N_c)) + O(N_t log(N_t)) + O(2B) + O(B) + O(B) = O(max(B, N_c log(N_c), N_t log(N_t))). Since log(N) < B for all relevant pairs
of N and B, Algorithm 3 has lower complexity than Algorithm 2 in all reasonable
applications.

5.1 Monte Carlo Simulations of the CI Coverage for Algorithm 3

In this section, the coverage of the confidence intervals resulting from Algorithm 3 is studied using Monte Carlo simulation. The algorithms are implemented in
Julia version 1.6.3 [1], and the code for the algorithms and the Monte Carlo simulations can be found at https://github.com/MSchultzberg/fast_quantile_bootstrap.
The data-generating process is similar to the previous simulation. The num-
ber of Monte Carlo replications is 10^4. For each replication, two samples of size N_t = N_c = 10^5 are generated from a standard normal distribution. The number of bootstrap samples for each Monte Carlo replication is B = 10^5
and the two-sided 95% confidence interval is returned. The study is repeated
for the quantiles 0.01, 0.1, 0.25, and 0.5. The coverage rate is the proportion
of the CIs that covered the true population difference-in-quantiles, i.e., zero.
To quantify the error due to a finite number of Monte Carlo replications, the
two-sided 95% confidence intervals of the coverage rate (using standard normal
approximation of the proportion) are again presented with the results.
Table 1 displays the results from the Monte Carlo simulation. Again, it is
clear that the coverage is close to the intended 95% for all quantiles, with no
observable systematic deviations.

5.2 Time and Memory Simulation Comparisons

This section presents memory and time consumption comparisons to build intu-
ition for the impact of the reduction in complexity enabled by Theorem 1 and
Conjecture 1. The comparisons are between Algorithm 2 and 3 implemented in
Julia version 1.6.3 [1] and benchmarked using the BenchmarkTools package [3].

Table 1. Empirical coverage rate for the confidence intervals produced by Algorithm 3 for the difference-in-quantiles for quantiles 0.01, 0.1, 0.25, and 0.5, for sample size 10^5 with 10^5 bootstrap samples over 10000 replications.

Empirical 95% CI
q coverage Lower Upper
0.01 0.953 0.949 0.957
0.10 0.949 0.944 0.953
0.25 0.950 0.946 0.955
0.50 0.949 0.945 0.954

The setup for the comparison is the following. Two samples of floats are
generated of size 1000 each. B is set to 10000. The setup is selected to enable
100 evaluations of Algorithm 2 within around 200 s on a local machine. The
results are displayed in Table 2.

Table 2. Time and memory consumption comparison between a standard Poisson bootstrap algorithm (Algorithm 2) for difference-in-quantiles CIs and the corresponding proposed binomial-approximated Poisson bootstrap algorithm (Algorithm 3).

Min time Median time Max time Memory usage


Algorithm 2 1726 ms 1821 ms 1902 ms 2.39 GiB
Algorithm 3 2.055 ms 2.214 ms 3.502 ms 407.08 KiB

Clearly, Algorithm 3 outperforms Algorithm 2 both in terms of memory and


speed already for small samples and moderately small B. The results presented
in Table 2, together with the simulation results in Sect. 5.1, establish the utility
and practical implications that follow from the theoretical results.

6 Discussion and Conclusion

In this paper we exploit the properties of quantile estimators coupled with a


Poisson bootstrap sampling scheme to derive computationally simple bootstrap
inference algorithms to make difference-in-quantiles inference feasible in large-
scale experimentation. It turns out that for the quantile estimator we employ,
no resampling is necessary. Instead, the theoretical distribution of the indexes of
order statistics in the original sample that are observed as the quantile estimate
in the bootstrap sample can be derived and used directly. The traditional algo-
rithm is built around generating Poisson random variables for each observation
in the original sample, realizing the bootstrap sample, and selecting the order
statistic in the bootstrap sample that is closest to the desired quantile.

In this paper we show that it is possible, due to the known properties of


the Poisson bootstrap sampling mechanism, to describe probabilistically which
order statistic in the original sample is observed as the desired quantile in a
bootstrap sample. This effectively bypasses the need for realizing each boot-
strap sample. In addition, we show that the index distribution, which has an
analytically intractable exact distribution, is well approximated by a binomial
distribution that simplifies implementation dramatically. Together, our findings
enable bootstrap inference for quantiles and difference-in-quantiles in large-scale
experiments without the need for intricate parallelization implementations. In
fact, a simple SQL query coupled with a Python or R notebook is sufficient for
even the largest experiments with hundreds of millions of users. We hope
that this will enable fast and robust inference for quantiles for many large-scale
experimenters. We leave for future research to study the properties of p(ψ = i)
in more detail. If the distribution could be exactly or approximately character-
ized in a manner that made generation of random numbers straightforward, that
would open up faster bootstrap algorithms also for smaller sample sizes. This
might also enable proving or narrowing Conjecture 1.

Acknowledgments. The authors gratefully acknowledge help and feedback from


Anton Muratov, Shaobo Jin, Thommy Perlinger and Claire Detilleux.

References
1. Bezanson, J., Edelman, A., Karpinski, S., Shah, V.B.: Julia: A fresh approach to
numerical computing. SIAM review, 59(1), pp. 65–98 (2017)
2. Chamandy, N., Muralidharan, O., Najmi, A., Naidu, S.: Estimating Uncertainty
for Massive Data Streams. Technical report, Google (2012)
3. Chen, J., Revels, J.: Robust benchmarking in noisy environments. arXiv e-prints,
arXiv:1608.04295 (2016)
4. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. The MIT Press (2009)
5. David, H.A., Nagaraja, H.N.: Order statistics. John Wiley & Sons (2004)
6. Dean, J., Ghemawat, S.: Mapreduce: Simplified data processing on large clusters.
Commun. ACM 51(1), 107–113 (2008)
7. Efron, B.: Bootstrap methods: another look at the jackknife. Ann. Stat. 7(1), 1–26
(1979)
8. Falk, M., Reiss, R.-D.: Weak convergence of smoothed and nonsmoothed bootstrap
quantile estimates. Ann. Probab. 17(1), 362–371 (1989)
9. Ghosh, M., Parr, W.C., Singh, K., Babu, G.J.: A Note on Bootstrapping the Sample
Median. The Annals of Stat. 12(3), 1130–1135 (1984)
10. Gibbons, J.D., Chakraborti, S.: Nonparametric statistical inference. CRC press
(2014)
11. Hanley, J.A., MacGibbon, B.: Creating non-parametric bootstrap samples using
poisson frequencies. Comput. Methods Programs Biomed. 83(1), 57–62 (2006)
12. Hutson, A.D.: Calculating nonparametric confidence intervals for quantiles using
fractional order statistics. J. Appl. Stat. 26(3), 343–353 (1999)

13. Kleiner, A., Talwalkar, A., Sarkar, P., Jordan, M.I.: A scalable bootstrap for mas-
sive data. J. Royal Stat. Soc.: Series B (Statistical Methodology) 76(4), 795–816
(2014)
14. Liu, M., Sun, X., Varshney, M., Xu, Y.: Large-Scale Online Experimentation with
Quantile Metrics. arXiv e-prints, arXiv:1903.08762 (2019)
15. Nyblom, J.: Note on interpolated order statistics. Stat. Probab. Lett. 14(2), 129–
131 (1992)
16. Rao, C.R., Statistiker, M.: Linear statistical inference and its applications vol. 2,
Wiley New York (1973)
17. Scheffe, H., Tukey, J.W.: Non-Parametric Estimation. I. Validation of Order Statis-
tics. Ann. Math. Stat. 16(2), 187–192 (1945)
Determinants of User's Acceptance of Mobile
Payment: A Study of Cambodia Context

Sreypich Soun, Bunhov Chov, and Phichhang Ou(B)

Royal University of Phnom Penh, Russian Federation Blvd, Phnom Penh, Cambodia
[email protected]

Abstract. All transactions between buyers and sellers and businesses are done
using payment systems. In the past, people made payments using traditional means
such as cash and checks. However, due to advanced technology, e-commerce, the
internet, and mobile devices, payment systems have transformed from cash-based
to digital-based transactions. Mobile payment refers to modern payment practices
via mobile devices such as cellphones, smartphones, or tablets. Mobile payment
allows consumers to reduce the use of cash and offers efficient and fast performance
as well as the secure transfer of information between consumers when conducting
payments and transactions. Even though mobile payment provides these benefits,
non-cash payment practices are still new among most consumers in Cambodia
due to limited knowledge of digital-based payment. Some companies succeeded,
while some failed due to limited insights into the success factors that predict
users’ intention to use mobile payment. Therefore, the researchers decided to
conduct this study by extending TAM with trust, innovativeness, and functionality,
aiming to explore users' acceptance of mobile payment services. This study surveyed 301 respondents with experience of using mobile payment applications. The collected data were analyzed using SPSS 25 and AMOS 23. Based on the results, perceived usefulness and perceived
ease of use have a positive effect on behavioral intention to use mobile payment
services. Perceived usefulness is positively predicted by perceived ease of use,
innovativeness, and functionality. Trust and innovativeness positively influence
perceived ease of use.

Keywords: Mobile payment service · Technology Acceptance Model (TAM) · Perceived usefulness · Perceived ease of use · Trust · Innovativeness · Functionality

1 Introduction
From a business perspective, all transactions between buyers and sellers or banks and
financial institutions are made using payment systems, and no business activities can
be done without payments and financial transactions. Payments in the past were made
through traditional methods such as cash and checks, which were the basic payment
instruments in the period [1]. With e-commerce and mobile devices, the payment system
has gradually changed from traditional cash-based transactions to cashless transactions

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 563–577, 2023.
https://doi.org/10.1007/978-3-031-18461-1_37
564 S. Soun et al.

[2]. Additionally, mobile commerce has also emerged and gained popularity due to fast-
growing technology, the internet, and mobile devices. As a result, digital, electronic,
and mobile payments were critical in facilitating mobile commerce payment processes.
Mobile commerce refers to the transaction of goods and services via mobile devices [3].
According to the National Bank of Cambodia [4], Cambodia’s payment landscape has
gradually transformed from cash-based to digital-based payment over the last decade due
to advanced technology, economic development, and the demand for fast and efficient
services.
In line with the trend of the digital economy and mobile commerce, there is immense
potential in mobile payment applications and services in Cambodia. As defined by Lerner
[5], mobile payment refers to payment practices conducted via mobile devices such as mobile phones and tablets. Moreover, mobile payments are becoming very popular in the era of e-commerce and the digital economy, enabling consumers to reduce their use of cash while offering efficient and fast performance as well as the secure transfer of information when conducting payments and transactions. Further, with the significant growth
of mobile phone usage and the internet, the mobile payment platforms in Cambodia
have significantly developed over the past 5 years, and the market is currently crowded
with start-ups, international and domestic firms, and various digital companies all try-
ing to benefit from the electronic and mobile payments [6]. Based on the researcher’s
knowledge, there are many key mobile payment services and platforms in operation in
Cambodia, such as ABA Mobile, ACLEDA Mobile, Canadia Mobile Banking, PPCB
Mobile Banking, FTB Mohabot App, Wing, TrueMoney, Lyhour, Pi Pay, SmartLuy,
E-Money, and so forth.
Despite the benefits of adopting mobile payment services to perform payments and
financial transactions, these non-cash payment practices are still new among most con-
sumers in Cambodia. With limited knowledge and awareness of financial technology,
most consumers find it difficult to use mobile payments for their payments and money
transfers. Thus, some payment companies and providers succeeded, and some failed due
to these issues. Simply put, some consumers use mobile payment due to its usefulness
and ease of use, while others may use mobile payment because of the helpful functions
provided and trust in the company. Also, some consumers will use mobile payments
because they are innovative and willing to try sophisticated products such as financial
innovations. This implies that it is crucial to identify the success factors that affect the
consumer’s intention to use mobile payment.
There were several preceding studies that adopted different theories to investigate
the factors affecting consumers’ adoption of digital payment, internet banking, and
mobile devices in the context of Cambodia. Chav and Ou [7] found that attitude is a
critical predictor affecting the intention to use mobile banking, and attitude is influenced
by usefulness, ease of use, trust, and job relevance. In addition, Do et al. [8] noticed
that performance expectancy, effort expectancy, and transaction speed have a positive
impact on behavioral intention to use mobile payment. Consequently, this study adopted the Technology Acceptance Model (TAM) extended with three prominent factors, namely trust, innovativeness, and functionality, to identify the factors affecting the user's intention to accept mobile payment services.

This study aims to explore the determinants of users’ acceptance of mobile payment
services. The determinant factors are represented in Fig. 1. To achieve the aim of this
study, five major objectives were formulated:
O1: To examine the relationships between perceived usefulness, perceived ease of
use, and behavioral intention.
O2: To examine the relationship between perceived ease of use and perceived
usefulness.
O3: To examine the relationships between innovativeness, perceived ease of use, and
perceived usefulness.
O4: To examine the relationship between trust and perceived ease of use.
O5: To examine the relationship between functionality and perceived usefulness.
The findings of this study contribute to major stakeholders such as future researchers,
entrepreneurs, marketers, and developers of mobile payment apps, banks, governments,
and policymakers. The literature, methodologies, and results of this study will be benefi-
cial to students and researchers who are interested in this area of mobile payment systems
and similar topics. Likewise, entrepreneurs, banks, developers, and mobile payment ser-
vice providers can benefit from the insights on the factors affecting users’ acceptance of
mobile payment services, so they can formulate strategies to improve their products or
services in the mobile payment area. Importantly, the findings of this study also benefit
the government and policymakers from relevant departments to capture clear landscapes
of what makes users use mobile payment services, so that they can develop effective
mechanisms to encourage users toward the use of digital and mobile payment, which is
a crucial component of the digital economy.

2 Literature Review and Hypothesis Development


In the context of technology system adoption, there are several theories and models for
investigating users’ intentions to adopt technology, such as Theory of Reasoned Action
(TRA), Theory of Planned Behavior (TPB), Technology Acceptance Model (TAM),
Innovation Diffusion Theory (IDT), Unified Theory of Acceptance and Use of Technol-
ogy (UTAUT), and so on. However, Venkatesh [9] stated that TAM is robust and widely
employed to examine new technology systems’ adoption. Davis [10] developed TAM
under TRA to test users’ attitudes and acceptance of new technology and information
systems. Further, TAM was adapted in several studies, especially mobile payment and
banking [7, 11–15]. For these reasons, the researchers adopted TAM for this study, employing perceived usefulness and perceived ease of use from TAM along with three additional factors expected to affect the user's intention to accept mobile payment: trust, innovativeness, and functionality.

2.1 Perceived Usefulness (PU), Perceived Ease of Use (PEU), and Behavioral
Intention (BI)

According to Davis [10], PU refers to the degree that a person thinks that a particular
system improves his or her work, while PEU is defined as the degree that a person thinks
using a particular system is easy and effortless. Accordingly, the study found that PU

and PEU have a significant impact on BI toward adopting new technology systems, and PU is positively affected by PEU [10]. Further, the effects of PU and PEU on BI were confirmed for social media transactions [16], mobile banking apps [11], mobile payment
[14], and wireless internet service through mobile technology [17]. Saprikis et al. [18]
also noticed that PU has a positive relationship with BI for using mobile shopping.
In addition, a positive effect of PEU on PU was also validated by previous findings
in different contexts of innovation adoption [7, 11, 13, 15–19]. Thus, the following
hypotheses were developed:
H1: Perceived usefulness has a positive effect on behavioral intention.
H2: Perceived ease of use has a positive effect on behavioral intention.
H3: Perceived ease of use has a positive effect on perceived usefulness.

2.2 Trust (TR) and Perceived Ease of Use (PEU)

According to the literature, TR is an important factor discussed in various contexts of


technology system acceptance. Moreover, TR has been noted as a critical criterion when using mobile payment and banking. According to Dahlberg et al.
[20], TR refers to the degree that a user believes that a particular technological system or
application is secure, credible, benevolent, and trustworthy. Hansen et al. [16] identified a
positive effect of TR on the PEU of social media for transactions. In the study conducted
by Muñoz-Leiva et al. [11], the results found that TR is positively associated with PEU
of the mobile banking apps. In addition, Zarmpou et al. [19] found that TR is positively
related to the PEU of mobile services. Besides, Pavlou [21] also confirmed a positive link
between TR and PEU in a study of electronic commerce acceptance. Accordingly, the following hypothesis was proposed:
H4: Trust has a positive effect on perceived ease of use.

2.3 Innovativeness (INN), Perceived Ease of Use (PEU), and Perceived Usefulness
(PU)

INN is one of the most prominent factors that has been studied in several previous studies
of technology acceptance. In the terminology of information technology, INN refers to
a person’s willingness to undertake any modern information technology [22]. Further,
Agarwal and Prasad [23] defined INN as a person’s willingness to try sophisticated
systems and information technology. Simply put, innovative consumers have a higher
likelihood of trying new technology systems than less innovative consumers. INN was
used by several researchers to investigate users’ intention to adopt new information
systems. Previous studies validated that INN has a positive impact on PEU for mobile
shopping [18], NFC payment systems [13], and mobile payment [14]. In the study of
the adoption of wireless internet services via mobile technology, Lu et al. [17] also
confirmed that both PEU and PU were positively predicted by INN. Likewise, Zarmpou
et al. [19] also observed that INN is positively correlated with PEU and PU of mobile
service adoption. Therefore, the researchers proposed the following hypotheses.
H5: Innovativeness has a positive effect on perceived ease of use.
H6: Innovativeness has a positive effect on perceived usefulness.

2.4 Functionality (F) and Perceived Usefulness (PU)


The author in [19] introduced the construct “functionality (F)” by borrowing the con-
cepts of perceived ubiquity, perceived reachability, and technicality from the technology
system. To put it simply, F refers to technological system characteristics such as user-
friendly interface, response time to services, transaction speed, and the technological
infrastructure that provides access to services in terms of time and location. Meanwhile,
a positive effect of F on PU was validated in the study by Zarmpou et al. [19]. Based
on this literature, in the context of a user’s acceptance of mobile payment, functionality
was hypothesized as follows:
H7: Functionality has a positive effect on perceived usefulness.
Hypothesized relationships among variables are depicted in Fig. 1.

Fig. 1. Conceptual model: Trust → Perceived Ease of Use (H4); Innovativeness → Perceived Ease of Use (H5) and Perceived Usefulness (H6); Functionality → Perceived Usefulness (H7); Perceived Ease of Use → Perceived Usefulness (H3) and Behavioral Intention (H2); Perceived Usefulness → Behavioral Intention (H1)
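The hypothesized paths in Fig. 1 can be summarized as a system of structural equations (a sketch added for clarity; the paper states the hypotheses verbally rather than in this form, and the γ coefficients and ε error terms are notational assumptions):

```latex
\begin{aligned}
\text{PEU} &= \gamma_1\,\text{TR} + \gamma_2\,\text{INN} + \varepsilon_1 \\
\text{PU}  &= \gamma_3\,\text{PEU} + \gamma_4\,\text{INN} + \gamma_5\,\text{F} + \varepsilon_2 \\
\text{BI}  &= \gamma_6\,\text{PU} + \gamma_7\,\text{PEU} + \varepsilon_3
\end{aligned}
```

Here H4 and H5 predict positive γ1 and γ2; H3, H6, and H7 predict positive γ3, γ4, and γ5; and H1 and H2 predict positive γ6 and γ7.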

3 Methodology
3.1 Research Site
In this study, Phnom Penh, the capital of Cambodia, was selected as the study site due
to favourable access to targeted respondents to collect data for the study. According to
Southeast Asia Globe Magazine by Retka [24], more than 71% of Cambodia’s population
had access to financial services, with 59% using formal banking systems. In addition,
Phnom Penh has the largest population, with a statistic of 2,281,951, equivalent to 14.7
per cent of Cambodia’s total population as reported in the General Population Census of
Cambodia in 2019 [25]. Noticeably, Phnom Penh is the centre of economic development
and investment. Thus, Phnom Penh is where the use of e-banking, mobile payment and
e-commerce is concentrated.

3.2 Data Collection and Samples

In this study, the researchers administered online questionnaires in Google Form to target
respondents in Phnom Penh who had experiences of using mobile payment services to
collect primary data for the study. On the other hand, secondary data such as information
about the mobile payment services in Cambodia was obtained from reliable websites,
government publications, and previous studies. For unknown populations, Bowerman
et al. [26] suggested that there needs to be at least 196 to produce a reliable result for a
quantitative study. Given its quantitative nature and the unknown population, the research employed convenience sampling, snowball sampling, and purposive sampling to collect data from target respondents.
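The minimum of 196 respondents cited from Bowerman et al. [26] is consistent with Cochran's sample-size formula for an unknown (infinite) population at a 95% confidence level with roughly a 7% margin of error under maximum variability (p = 0.5); the specific parameter choices below are an illustrative assumption, not taken from the paper:

```python
import math

def min_sample_size(z: float = 1.96, p: float = 0.5, margin: float = 0.07) -> int:
    """Cochran's formula for an unknown population: n = z^2 * p * (1 - p) / e^2."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(min_sample_size())             # 196 (95% confidence, ~7% margin of error)
print(min_sample_size(margin=0.05))  # 385 under the stricter 5% margin
```

The 301 usable responses comfortably exceed this threshold.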

3.3 Measurement Items and Data Analysis Tools

In this study, there are a total of 22 questionnaire items adapted from prior studies
measured using a 5-Point Likert Scale. The BI consists of four items adopted from
reference [18]. PU has four items obtained from reference [19], and PEU also has three
items adapted from reference [19]. TR has 4 questionnaire items selected from reference
[18], while INN has 3 items adopted from reference [18]. F has four items adopted from
reference [19].
Data collected from target respondents was analyzed using the Statistical Package for
the Social Sciences version 25 (SPSS 25) and Analysis of Moment Structures version 23
(AMOS 23). While SPSS 25 was employed to analyze personal information, descriptive
statistics, factor analysis and reliability tests, and correlation matrix, AMOS 23 was
used to conduct confirmatory factor analysis (CFA) to check convergent reliability and
validity and structural equation modeling (SEM) to test the hypothesis.

4 Results

At the end of data collection, 301 responses were collected and usable for data analysis.
ABA Mobile accounted for more than 71% of mobile payment app brands, followed
by ACLEDA Unity Toanchet (21%), Canadia Mobile (3%), FTB Mohabot App (2%),
and other mobile payment apps (2%). In terms of the reasons for using these mobile
payment applications, approximately 30% were for convenience; approximately 23%
were for online shopping; approximately 28% were for business; approximately 3%
were for discount; nearly 3% were for booking; and 13% were for other reasons. This
implies that most users among the 301 respondents used the ABA Mobile and ACLEDA
Unity Toanchet for convenience, business purposes, online shopping, and other purposes.
Related to gender, nearly 59 percent were female, and about 41 percent were male. The
majority of respondents (74.8 percent) were 21–25 years old, followed by 26–30 years
(12.6%), less than 20 years (10.6%), 31–35 years (1.3%), and over 35 years (0.7%).
For education, undergraduate and master’s degrees dominate other education levels as
more than 80 percent were undergraduate or bachelor’s degree holders, while nearly 14
percent were pursuing or holding a master’s degree, followed by high school (2.7%) and
Ph.D. (1 percent). Speaking of occupations, most respondents were students and from

the private sector because about 48 percent were students, and around 38 percent were
from the private sector, followed by the public sector (8.3 percent) and other occupations
(5.3 percent). In terms of income, 45.5 percent earned $301–500, followed by less than $300 (35.5%), more than $700 (11.3%), and $501–700 (8%).

Table 1. The results of descriptive statistics (n = 301)

Item code | Item description | Mean | Std. Dev
Behavioral Intention
BI1 | I intend to use mobile payment shortly | 3.91 | 0.99
BI2 | I believe my interest in mobile payment will increase in the future | 3.89 | 1.00
BI3 | I intend to use mobile payment as much as possible | 3.75 | 1.01
BI4 | I recommend others to use mobile payment | 3.72 | 0.98
Perceived Usefulness
PU1 | I think using mobile payment would make it easier for me to conduct transactions | 3.70 | 1.10
PU2 | I think using mobile payment would make it easier for me to follow up on my transactions | 3.78 | 1.07
PU3 | I think using mobile payment would increase my productivity | 3.65 | 1.00
PU4 | I think using mobile payment would increase my effectiveness | 3.59 | 0.99
Perceived Ease of Use
PEU1 | I think learning to use mobile payment would be easy | 3.69 | 1.06
PEU2 | I think finding what I want via mobile payment would be easy | 3.70 | 0.96
PEU3 | I think becoming skillful at using mobile payment would be easy | 3.71 | 1.05
Trust
TR1 | I feel using mobile payment in monetary transactions is safe | 3.56 | 1.11
TR2 | I feel my personal data are in confidence while using mobile payment | 3.53 | 1.01
TR3 | I feel the terms of use are strictly followed while using mobile payment | 3.62 | 1.02
TR4 | I feel using the mobile payment for my transactions is trustworthy | 3.61 | 0.99
Innovativeness
INN1 | I am usually among the first to try a mobile payment | 3.44 | 1.09
INN2 | I am eager to learn about new technologies | 3.71 | 0.97
INN3 | I am eager to try new technologies | 3.82 | 0.95
Functionality
F1 | I think the connection speed is high enough for me to use it | 3.56 | 1.09
F2 | I think the transaction speed is high enough for me to use it | 3.59 | 0.95
F3 | I think the mobile payment interface is comprehensible enough for me to use | 3.64 | 1.00
F4 | I think the anywhere-anytime accessibility infrastructure is high enough for me to use it | 3.69 | 0.99

4.1 Descriptive Statistics

Table 1 presents the means and standard deviations of all research variables in this study, measured on a 5-point Likert scale. The results illustrate that the mean scores of all research variables range from 3.44 to 3.91, leaning toward the "Agree" response and showing a satisfactory level of agreement. Further, the standard deviations range from 0.95 to 1.11, indicating reasonable variability in the data collected from respondents.

4.2 Factor Analysis and Reliability Test

Factor analysis was employed to sort, purify, and detect misfit variables when examining the structure of the research constructs, so as to avoid producing poor results, and a reliability test was adopted to examine the reliability of the research variables [27]. Hair et al. [27] stated the following specifications: for factor analysis, factor loading ≥ 0.6, Kaiser-Meyer-Olkin (KMO) > 0.5, cumulative percentage ≥ 60%, and eigenvalue > 1; for the reliability test, item-total correlation ≥ 0.5 and Cronbach alpha ≥ 0.6. Table 2 shows that the factor loadings, KMO, cumulative percentage, eigenvalues, item-total correlations, and Cronbach alpha (α) scores all meet these rules of thumb.
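The reliability column in Table 2 can be reproduced with a few lines of standard-library Python; this is an illustrative sketch (the authors used SPSS 25, and the response matrix below is made up, not the study data):

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a list of respondent rows (one score per item).

    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(items[0])
    cols = list(zip(*items))                    # one tuple of scores per item
    item_vars = sum(variance(c) for c in cols)  # sum of per-item sample variances
    total_var = variance([sum(row) for row in items])
    return k / (k - 1) * (1 - item_vars / total_var)

# Four perfectly consistent Likert items -> alpha = 1.0
rows = [[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3], [4, 4, 4, 4], [5, 5, 5, 5]]
print(cronbach_alpha(rows))  # 1.0
```

An alpha of at least 0.6, as required above, indicates acceptable internal consistency among the items measuring a construct.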

Table 2. Factor analysis and reliability test of the research constructs

Item | Factor loading | KMO | Cumulative percentage (%) | Eigenvalue | Item-total correlation | Coefficient alpha (α)
Behavioral Intention
BI1 | 0.839 | 0.783 | 66.871 | 2.675 | 0.695 | 0.835
BI3 | 0.831 | | | | 0.683 |
BI4 | 0.804 | | | | 0.645 |
BI2 | 0.797 | | | | 0.636 |
Perceived Usefulness
PU3 | 0.818 | 0.712 | 64.455 | 2.578 | 0.648 | 0.816
PU4 | 0.818 | | | | 0.647 |
PU2 | 0.797 | | | | 0.636 |
PU1 | 0.778 | | | | 0.609 |
Perceived Ease of Use
PEU2 | 0.835 | 0.685 | 66.493 | 1.995 | 0.604 | 0.748
PEU1 | 0.823 | | | | 0.581 |
PEU3 | 0.787 | | | | 0.536 |
Trust
TR2 | 0.834 | 0.775 | 64.288 | 2.572 | 0.683 | 0.814
TR3 | 0.802 | | | | 0.630 |
TR4 | 0.801 | | | | 0.630 |
TR1 | 0.768 | | | | 0.590 |
Innovativeness
INN2 | 0.876 | 0.663 | 69.622 | 2.089 | 0.671 | 0.779
INN3 | 0.870 | | | | 0.659 |
INN1 | 0.752 | | | | 0.511 |
Functionality
F4 | 0.821 | 0.786 | 66.433 | 2.657 | 0.665 | 0.832
F3 | 0.819 | | | | 0.665 |
F1 | 0.813 | | | | 0.656 |
F2 | 0.807 | | | | 0.652 |

4.3 Correlation Matrix

In this study, Pearson’s coefficient (r) was computed to assess the direction and strength
of the relationship between research constructs [28]. The correlation matrix tested the
correlations between perceived usefulness, perceived ease of use, innovativeness, func-
tionality, trust, and behavioral intention by computing the mean score of the research

Table 3. The results of correlation among research constructs (n = 301)

Variables Mean Std. deviation BI PU PEU TR INN F


BI 3.82 0.81 1.00
PU 3.68 0.84 .722** 1.00
PEU 3.70 0.84 .728** .714** 1.00
TR 3.58 0.83 .660** .627** .612** 1.00
INN 3.66 0.83 .636** .665** .694** .613** 1.00
F 3.62 0.82 .670** .673** .663** .664** .650** 1.00
**. Correlation is significant at the 0.01 level (2-tailed)
Pearson Correlation Coefficient
Note: BI: Behavioral Intention, PU: Perceived Usefulness, PEU: Perceived Ease of Use, TR:
Trust, INN: Innovativeness, F: Functionality
572 S. Soun et al.

constructs and testing its relationship. Table 3 summarizes the results of the positive
correlation among research constructs.
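The Pearson coefficients in Table 3 follow the standard formula and can be reproduced from the construct mean scores; the sketch below is illustrative only (the authors computed them in SPSS), and the significance check uses a normal approximation:

```python
from math import sqrt, erfc

def pearson_r(x, y):
    """Pearson's r between two lists of construct mean scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def two_tailed_p(r, n):
    """Normal approximation to the two-tailed significance of r (adequate for n = 301)."""
    t = r * sqrt((n - 2) / (1 - r * r))
    return erfc(abs(t) / sqrt(2))

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0 for a perfect linear relationship
print(two_tailed_p(0.722, 301) < 0.01)        # True: r = .722 is significant at the 0.01 level
```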

4.4 Measurement Models


Measurement models refer to the implicit or explicit models that link the latent variable to
its indicators [29]. Anderson and Gerbing [30] proposed a two-step approach to measurement models: the validity and reliability of the constructs were first assessed by CFA, and SEM was then used to test the hypotheses.
Based on the results of CFA in Table 4, the model fit assessment was produced as
follows: factor loading greater than 0.7, critical ratio (t-value) larger than |1.96|, Average
Variance Extracted (AVE) greater than 0.50, and Composite Reliability (CR) higher than
0.70. Following the requirements suggested by Hair et al. [27] and Fornell and Larcker
[31], the convergent validity of research constructs was confirmed.
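The AVE and CR columns in Table 4 follow from the standardized loadings. The sketch below uses the textbook Fornell and Larcker [31] formulas; note that it reproduces the reported AVE for Functionality (0.562) but yields a CR of about 0.794 rather than the reported 0.843, since CR variants differ (AMOS-based computations may, for example, use model-estimated error variances instead of 1 − λ²), so treat this only as the textbook variant:

```python
def ave_and_cr(loadings):
    """Convergent validity statistics from standardized CFA loadings.

    AVE = mean of squared loadings (>= 0.50 desired)
    CR  = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances),
          with error variance approximated here as 1 - loading^2.
    """
    sq = [l * l for l in loadings]
    ave = sum(sq) / len(sq)
    s = sum(loadings)
    cr = s * s / (s * s + sum(1 - q for q in sq))
    return ave, cr

# Functionality loadings from Table 4 (F2 was dropped)
ave, cr = ave_and_cr([0.731, 0.772, 0.745])
print(round(ave, 3), round(cr, 3))  # 0.562 0.794
```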

Table 4. The results of confirmatory factor analysis (n = 301)

Indicator | Construct | Standardized loading | t-value | AVE | CR | S.E.
F1 ← | Functionality | 0.731 | A | 0.562 | 0.843 | 0.466
F3 ← | | 0.772 | 12.397 | | | 0.404
F4 ← | | 0.745 | 12.003 | | | 0.445
F2 | (deleted, standardized loading < 0.60) | | | | |
TR1 ← | Trust | 0.735 | A | 0.533 | 0.831 | 0.460
TR2 ← | | 0.785 | 12.212 | | | 0.384
TR3 ← | | 0.666 | 10.544 | | | 0.556
TR4 | (deleted, standardized loading < 0.60) | | | | |
INN1 ← | Innovativeness | 0.605 | 10.671 | 0.566 | 0.835 | 0.634
INN2 ← | | 0.797 | 14.875 | | | 0.365
INN3 ← | | 0.834 | A | | | 0.304
PU1 ← | Perceived Usefulness | 0.726 | A | 0.517 | 0.818 | 0.473
PU2 ← | | 0.724 | 11.729 | | | 0.476
PU3 ← | | 0.707 | 11.452 | | | 0.500
PU4 | (deleted, standardized loading < 0.60) | | | | |
PEU1 ← | Perceived Ease of Use | 0.723 | 11.496 | 0.509 | 0.812 | 0.477
PEU2 ← | | 0.701 | 11.589 | | | 0.509
PEU3 ← | | 0.716 | A | | | 0.487
BI1 ← | Behavioral Intention | 0.758 | 13.110 | 0.559 | 0.898 | 0.425
BI2 ← | | 0.748 | 12.932 | | | 0.440
BI3 ← | | 0.755 | A | | | 0.430
BI4 ← | | 0.729 | 12.561 | | | 0.469
Note: S.E. = Standard Error; A = parameter regression weight fixed at 1. Significance level: p-value < 0.05, *** p < 0.001, ** p < 0.01, * p < 0.05

As shown in Fig. 2, the results of SEM indicated that model fit assessment was achieved (χ2/d.f. = 1.604, GFI = 0.931, AGFI = 0.900, NFI = 0.933, CFI = 0.973,
RMSEA = 0.045, P = 0.000), indicating an acceptable model fit. In addition, Hair et al. [27] recommended the following criteria for hypothesis testing: a critical ratio for the two-tailed test (t-value) greater than |1.96| and a significance level (p-value) below 0.05.
According to Table 5, the results showed that PU has a positive effect on BI (β = 0.46, p
= 0.017), so H1 was supported. Further, the results confirmed that there was a positive
impact of PEU on BI (β = 0.45, p = 0.018) and PU (β = 0.95, p < 0.001). Therefore,
H2 and H3 were supported. The results also identified a positive influence of TR on
PEU (β = 0.28, p < 0.001) accepting H4. Further, the results confirmed that there is a
significant and positive influence of INN on PEU (β = 0.69, p < 0.001) and PU (β =
0.44, p = 0.049). Thus, H5 and H6 were accepted. Finally, the results found that F has
a positive effect on PU (β = 0.34, p = 0.03) supporting H7.
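The decision rule above (|t| > 1.96, p < 0.05) can be checked directly: for a sample of this size the critical ratios are approximately standard normal, so the reported p-values follow from the t-values. A small standard-library sketch (normal approximation assumed):

```python
import math

def two_tailed_p(t: float) -> float:
    """Two-tailed p-value for a critical ratio under the standard normal
    approximation (reasonable for n = 301)."""
    return math.erfc(abs(t) / math.sqrt(2))

print(round(two_tailed_p(2.396), 3))  # 0.017, matching the reported p for H1
print(round(two_tailed_p(2.165), 3))  # 0.030, matching the reported p for H7
print(two_tailed_p(3.407) < 0.001)    # True, consistent with the *** flag for H4
```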

Table 5. The results of hypotheses testing (n = 301)

Path relationship | Standardized coefficient (β) | t-value | p-value | Hypothesis testing
H1: PU → BI | 0.46 | 2.396 | 0.017 | Accepted
H2: PEU → BI | 0.45 | 2.374 | 0.018 | Accepted
H3: PEU → PU | 0.95 | 4.330 | *** | Accepted
H4: TR → PEU | 0.28 | 3.407 | *** | Accepted
H5: INN → PEU | 0.69 | 7.401 | *** | Accepted
H6: INN → PU | 0.44 | 1.971 | 0.049 | Accepted
H7: F → PU | 0.34 | 2.165 | 0.030 | Accepted
Note: t-value > |1.96|; significance level: p-value < 0.05, *** p < 0.001, ** p < 0.01, * p < 0.05

5 Discussion and Conclusion


According to the results, PU and PEU were confirmed to have a positive effect on BI,
and this finding validates previous studies of acceptance of mobile banking apps [11],

mobile payment [14], and wireless internet service [17]. This finding is also consistent
with the study of mobile shopping acceptance by Saprikis et al. [18] that confirmed a
positive effect of PU on BI. This suggests that users will intend to use mobile payment because it is convenient, productive, and effective for dealing with payments and transactions. Meanwhile, simple processes and the ease of use of mobile
payment will also contribute to the user’s intention to accept mobile payment. Besides,
the results also identified a strong and positive association between PEU and PU, which
is in line with existing literature [7, 11, 13, 15–19].
The results further illustrated that TR has a positive influence on the PEU of mobile
payment. This finding is consistent with the findings of Muñoz-Leiva et al. [11], who
discovered a positive relationship between TR and PEU in mobile banking applications.
Besides, in the study of consumers’ use of social media for transactions, TR was proved
to have a positive impact on PEU [16]. Interestingly, the finding of this study is also
consistent with Zarmpou et al. [19], who noticed that TR has a very strong connection
with PEU of mobile services, as well as Pavlou [21], who confirmed a positive connection
between TR and PEU. If users have high confidence and trust in mobile payment, they
are likely to find mobile payment easy to use.
In addition, INN was noted to have a positive impact on PEU of mobile payment
services, which supports prior findings on mobile shopping adoption [18], NFC payment system adoption [13], mobile payment acceptance [14], and usage intention of
wireless internet services [17]. Users who are willing to try new things are more likely
than less innovative users to find mobile payment easy to use when dealing with pay-
ments and transactions. The results also revealed that INN also has a positive influence
on PU, which validates earlier studies [17, 19]. Thus, innovative consumers are likely
to perceive mobile payments as useful in their financial transactions.
Further, the results confirmed that F positively predicted the PU of mobile payment
services. This finding is consistent with the previous finding by Zarmpou et al. [19] that
identified F as a positive driver of PU for mobile services. This infers that usefulness is
influenced by the functionality of mobile payment systems, such as convenient interfaces,
quick response times, fast connection and transactions, and mobile infrastructure that
allows users to use mobile payment systems anywhere and anytime.
In conclusion, users will accept the use of mobile payment services due to two main
factors, which are usefulness and ease of use. Simultaneously, usefulness is affected by
the ease of use, innovativeness, and functionalities of mobile payment, such as trans-
action speed, response time, mobile infrastructure, and user-friendly interface. On the
other hand, ease of use is predicted by two important indicators, which are the user’s
innovativeness to accept sophisticated systems, including mobile payment, as well as
the user’s trust in mobile payment services.

5.1 Managerial Implications


The findings of this study will contribute to key stakeholders such as developers of mobile payment applications, banks, governments, and policymakers. Based on the
results, mobile payment application producers or developers are recommended to better
understand users’ intention predictors and develop useful functions of mobile payment
such as convenience design, fast connection speeds and response time, and availability
of use regarding time and place.

Fig. 2. Structural equation modeling (n = 301)

Developers also need to maintain system security so that users feel safe and confident and can trust mobile payment. Furthermore, the
results suggested that banks should consider these findings because of their reflection
on successful factors affecting mobile payment use to formulate effective marketing
strategies to encourage users to accept mobile payment. Besides, the government and
policymakers are also encouraged to employ the results of this study to support the banks
and developers in attracting citizens to use mobile payment services so that all stake-
holders can benefit from the digital economy. Likewise, policymakers should put great
effort into building knowledge and awareness of financial technology and its benefits,
and encourage citizens to try new technologies and mobile payment applications.

5.2 Limitations and Future Research


This study has some limitations, which provide opportunities for future researchers to
conduct better studies. First of all, this study was conducted in Phnom Penh, Cambo-
dia, during the COVID-19 pandemic, which made it difficult to reach target respondents. Accordingly, the researchers used non-random sampling techniques, namely convenience, snowball, and purposive sampling, to distribute online questionnaires to
respondents to collect data. The drawback of these sampling techniques is that they may introduce bias, so the sample might not represent the whole population or allow the findings to be generalized.
Thus, future studies are recommended to employ random sampling techniques such as
systematic or cluster sampling to collect generalized data for the study. In addition, this
study examined a limited set of factors, namely perceived usefulness, perceived ease of use, trust, innovativeness, and functionality, to explore the factors influencing users' adoption of mobile payment services, so the insights of this study may be limited to these fundamental factors. Hence, future researchers are encouraged to expand their knowledge in this
area of study with more crucial factors that potentially affect the user’s intention to use
mobile payment services.

References
1. Evolution of digital payment industry, https://financebuddha.com/blog/evolution-digital-payment-industry/. Accessed 04 Oct 2021
2. Bezhovski, Z.: The future of the mobile payment as electronic payment system. Eur. J. Bus.
Manage. 8(8), 127–132 (2016)
3. Tiwari, R., Buse, S.: The Mobile Commerce Prospects: A Strategic Analysis of Opportunities
in the Banking Sector. Hamburg University Press, Hamburg (2007)
4. National Bank of Cambodia: Project Bakong the next generation payment system. National
Bank of Cambodia, Phnom Penh (2020)
5. Lerner, T.: Mobile payment. 1st ed. Springer Vieweg, Mainz (2013)
6. The Top Mobile Payment Systems in Cambodia, https://cryptoasia.co/news/top-mobile-payment-systems-cambodia/. Accessed 29 Sep 2021
7. Chav, T., Ou, P.: The factors influencing consumer intention to use internet banking and apps:
a case of banks in Cambodia. Int. J. Soc. Bus. Sci. 15(1), 92–98 (2021)
8. Do, N.H., Tham, J., Khatibi, A.A., Azam, S.M.F.: An empirical analysis of Cambodian
behavioral intention towards mobile payment. Manage. Sci. Lett. 9(12), 1941–1954 (2019)
9. Venkatesh, V.: Determinants of perceived ease of use: integrating control, intrinsic, motivation,
and emotion into the technology acceptance model. Inf. Syst. Res. 11(4), 342–365 (2000)
10. Davis, F.D.: Perceived usefulness, perceived ease of use, and user acceptance of information
technology. MIS Q. 13(3), 319–340 (1989)
11. Muñoz-Leiva, F., Climent-Climent, S., Liébana-Cabanillasa, F.: Determinants of intention to
use the mobile banking apps: an extension of the classic TAM model. Spanish J. Marketing –
ESIC 21(1), 25–38 (2017)
12. Oliveira, T., Thomas, M., Baptista, G., Campos, F.: Mobile payment: understanding the deter-
minants of customer adoption and intention to recommend the technology. Comput. Hum.
Behav. 61, 404–414 (2016)
13. Ramos-de-Luna, I., Montoro-Ríos, F., Liébana-Cabanillas, F.: Determinants of the intention
to use NFC technology as a payment system: an acceptance model approach. Inf. Syst. E-Bus.
Manage. 14(2), 293–314 (2015). https://doi.org/10.1007/s10257-015-0284-5
Determinants of USER’S Acceptance of Mobile Payment 577

14. Kim, C., Mirusmonov, M., Lee, I.: An empirical examination of factors influencing the
intention to use mobile payment. Comput. Hum. Behav. 26(3), 310–322 (2010)
15. Schierz, P.G., Schilke, O., Wirtz, B.W.: Understanding consumer acceptance of mobile
payment services: an empirical analysis. Electron. Commer. Res. Appl. 9(3), 209–216 (2010)
16. Hansen, J.M., Saridakis, G., Benson, V.: Risk, trust, and the interaction of perceived ease
of use and behavioral control in predicting consumers’ use of social media for transactions.
Comput. Hum. Behav. 80, 197–206 (2018)
17. Lu, J., Yao, J.E., Yu, C.-S.: Personal innovativeness, social influences and adoption of wireless
internet services via mobile technology. J. Strateg. Inf. Syst. 14(3), 245–268 (2005)
18. Saprikis, V., Markos, A., Zarmpou, T., Vlachopoulou, M.: Mobile shopping consumers’
behavior: an exploratory study and review. J. Theor. Appl. Electron. Commer. Res. 13(1),
71–90 (2018)
19. Zarmpou, T., Saprikis, V., Markos, A., Vlachopoulou, M.: Modeling users’ acceptance of
mobile services. Electron. Commer Res. 12(2), 225–248 (2012)
20. Dahlberg, T., Mallat, N., Ondrus, J., Zmijewska, A.: Past, present and future of mobile
payments research: a literature review. Electron. Commer. Res. Appl. 7(2), 165–181 (2008)
21. Pavlou, P.A.: Consumer acceptance of electronic commerce: integrating trust and risk with
the technology acceptance model. Int. J. Electron. Commer. 7(3), 101–134 (2003)
22. Midgley, D.F., Dowling, G.R.: Innovativeness: the concept and its measurement. J. Consumer
Res. 4(4), 229–242 (1978)
23. Agarwal, R., Prasad, J.: A conceptual and operational definition of personal innovativeness
in the domain of information technology. Inf. Syst. Res. 9(2), 204–215 (1998)
24. How Cambodia can capitalise on strides in financial inclusion, https://southeastasiaglobe.com/how-cambodia-can-capitalise-on-strides-in-financial-inclusion/. Accessed 02 Oct 2021
25. National Institute of Statistics: General population census of the Kingdom of Cambodia 2019.
Ministry of Planning, Phnom Penh (2020)
26. Bowerman, B.L., O’Connell, R.T., Murphree, E.S.: Business Statistics in Practice, 6th edn.
McGraw-Hill/Irwin, New York (2011)
27. Hair, J.F., Black, W.C., Babin, B.J., Anderson, R.E.: Multivariate Data Analysis. 8th ed.
Cengage Learning EMEA, Hampshire (2019)
28. Boslaugh, S., Watters, P.A.: Statistics in a Nutshell. 1st ed. O’Reilly Media (2008)
29. Bollen, K.A.: Indicator: methodology. Int. Encyclopedia Soc. Behav. Sci. 7282–7287 (2001)
30. Anderson, J.C., Gerbing, D.W.: Structural equation modeling in practice: a review and
recommended two-step approach. Psychol. Bull. 103(3), 411–423 (1988)
31. Fornell, C., Larcker, D.F.: Evaluating structural equation models with unobservable variables
and measurement error. J. Mark. Res. 18(1), 39–50 (1981)
A Proposed Framework for Enhancing
the Transportation Systems Based on Physical
Internet and Data Science Techniques

Ashrakat Osama1,3(B), Aya Elgarhy1,3, and Ahmed Elseddawy2,3


1 College of International Transport and Logistics, Cairo Governorate, Egypt
[email protected]
2 College of Management and Technology, Cairo Governorate, Egypt
3 Arab Academy for Science Technology and Maritime Transport, Cairo Governorate, Egypt

Abstract. Logistics and supply chain processes nowadays are not sustainable and
cause many problems. A traditional freight transportation system that moves the
commodities between nodes of the supply chain takes a lot of time and cost. It
accounts for large quantities of carbon dioxide emissions from fuel consumption.
The physical internet aims to facilitate the flow of goods through modular units and shared resources to reduce time, effort, and cost. It also changes how goods are moved across the participants of the supply chain, making the way physical goods are moved, handled, stored, and supplied across the world more economical, environmentally friendly, socially efficient, and sustainable.
As the participants of the supply chain will have access to share central hubs and
transportation means, this will help them to move the commodities from one place
to another more efficiently. This research takes Egypt as a case study to investigate the main transportation problems and how the physical internet can be applied to solve them.
This paper aims to propose a framework to apply the physical internet with its tools
and data science techniques to overcome transportation problems, as it will help
to reduce transportation costs, reduce the harmful impact on the environment and
enhance the transportation system to be more efficient, through building a pool of
sharable resources and standardizing the goods to enhance collaboration between
the participants of the supply chain. In addition, the use of neural networks benefits the supply chain in many areas, supporting decision making, forecasting, and the choice of the optimal path for transportation.

Keywords: Freight transportation problems · Physical Internet · Data science · Artificial intelligence · Neural network

1 Introduction
Logistics has become an integral part of our way of life, allowing people to consume
products from all over the world all year long at reasonable costs. It has evolved into
the backbone of global trade, owing to the efficiency of container shipping and handling
across continents [1]. Within the current structure of supply chains, logistics performance

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 578–595, 2023.
https://doi.org/10.1007/978-3-031-18461-1_38
A Proposed Framework for Enhancing the Transportation Systems 579

is limited in achieving two opposing aims. The first goal is to achieve small, frequent
shipments in a just-in-time manner, while the second goal is to achieve superior environ-
mental performance by making the best use of transportation modes, particularly heavier
but cleaner modes. Increasing supply chain collaboration in logistics networks is one
approach to take advantage of synergies and, as a result, collaboratively improve logis-
tics performance, particularly in transportation while dealing with independent logistics
organizations, enabling a performance that is equivalent to or greater than that produced
by pooling [2]. From an economic, environmental, and social standpoint, transportation,
storage, and product handling in today’s world do not correspond with the policy of
sustainable development. From empty transportation to underutilized distribution sites,
inefficiency can be seen at every level [3].
Upcoming freight transportation will be very different from what it is now. It is
no longer a question of choice, but rather a necessity. Freight transportation is a major
contributor to global warming, accounting for 7–8% of carbon emissions. While other
industries have reduced global greenhouse gas emissions over time, transportation is
the only one where emissions are increasing. It is considered one of the most difficult
economic sectors to decarbonize, partially because freight transportation demand is
anticipated to increase dramatically over the next several decades, but also because it
is largely reliant on fossil fuels [4]. In response to this problem, a framework has been
created in this research for usage in a network of open and interconnected networks,
called the Physical Internet (PI) that will change the way of movement and handling,
storing, and supplying the commodities across the world through the application of its
tools; in addition to using data science techniques through neural networks as it can assist
in choosing the optimal routing during the transportation. This framework will aid in
the reduction of fuel consumption and transportation expenses, as well as the reduction
of harmful environmental effects and the enhancement of the transportation system to
make it more flexible and efficient, in order to enhance the supply chain in economic,
social, and environmental aspects.
The structure of the paper will be as follows: next section will discuss the main
problems of freight transportation system and it will focus more on road transport, then
the physical internet concept will be discussed with its main principles and tools and
previous studies of papers that applied the concept. After that the framework model is
provided with its main stages and the benefits of applying the tools of physical internet
on the supply chain and how it can enhance the transportation process within the supply
chain and the challenges that may face the application of physical internet. Finally, an
analysis of the impact of the physical internet in solving the transportation problems is
discussed and a conclusion of the research paper and future work is provided.

2 Freight Transportation Problems in Egypt


Transportation is the cornerstone of a country’s economic growth due to its direct impact on the economy and on the development of society, and it is a critical component in achieving economic and social development goals [5]. As stated by [6], road transport, especially trucking, is the most frequent mode of transportation for moving freight along corridors
in most places. Almost all commercial freight is transported by road at some point, and
580 A. Osama et al.

road transport accounts for more than 80% of overland commerce activity. The proper
supply of road transport services is critical for the unrestricted flow of freight and people
along corridors.
In Egypt, as well as the Middle East in general, road safety is a major concern.
Egypt has a high incidence of traffic accidents and road deaths. The majority of traffic
deaths are of young and middle-aged adults, which have a significant impact on Egypt’s
expanding economy. As a result, one of the major concerns of Egypt’s scholars, society,
and the government is to reduce or avoid road accidents [7]. The trucking industry is significant, as it is responsible for 66.7% of road freight transport, with vehicles ranging from light to heavy trucks [8]. It is therefore important to investigate the main problems that lead to
truck accidents.

Fig. 1. Road accidents rate in Egypt, 2013–2019 [8]

Truck accidents can be caused by a variety of factors, including general inattention,


misdirected attention, falling asleep, and distraction in some circumstances [9]. Sleep
deprivation, lengthy shifts, constant driving hours, long travels, and tight schedules may
all contribute to driver weariness, which can lead to truck accidents [7]. Truck drivers
spend most of their lives on roads as they travel for long journeys to deliver the freight
from one region to another, they drive for a very long time it may be days or weeks,
which requires them to be awake and concentrated during their journeys, especially at
night. Some drivers take drugs to stay awake and others do not, but in both cases the work is very exhausting and dangerous, and drivers are often the main cause of accidents. In 2019, a study conducted by the Central Agency for Public Mobilization
and Statistics [10] declared that around 9992 accidents were recorded which caused
3484 deaths due to car accidents. In 2019, the population’s average rate of automobile
accidents was 1.0 accidents per 10,000 people, while the rate of car accidents was 0.9
accidents per 1000 cars, and the mortality rate was 3.6 deaths per 100,000 people, 30.3
deaths per 100,000 vehicles. The most significant part is that the human factor was the
leading cause of automobile accidents as it accounted for 79.7% of total causes of road
accidents, followed by vehicle technical faults accounting for 13.5% of total causes of

road accidents. Figure 1 shows the number of vehicles accidents on roads from 2013 to
2019.
In addition, heavy vehicles may damage roads, which requires huge repair costs. Moreover, toll gate fees for trucks that move between governorates are another burden on the transportation system. Fuel, toll gate, maintenance, and other journey-related expenses weigh on transportation companies, especially over long distances and on empty journeys, which in turn reduces their profit. The social life of drivers can also be harmed, as they are constantly traveling to transport goods, and extreme exhaustion from driving long distances can cause accidents. Finally, truck capacity is not always utilized effectively due to the limited standard storage spaces that exist now.

Fig. 2. CO2 emissions from liquid fuel consumption (kt) - Egypt, Arab Rep [11].

Another important aspect that should be considered is the fuel consumed during the
journey of delivering the products and the percentage of CO2 emissions that can harm
the environment. Figure 2 shows the emissions of CO2 from liquid fuel consumption in
Egypt till 2016. Most trucks in Egypt are diesel-fueled vehicles. Trucks handle more than
97% of freight transport in Egypt, with 204,377,200 ton-kilometers per day [8], which means a large number of long-distance trips. Moreover, in some cases the return journey of the truck is empty, which wastes time, effort, and cost and may also harm the environment with more CO2 emissions. So, is there any way to avoid empty journeys, without affecting the delivery of customer orders when needed, so that emissions can be reduced?
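To make the scale concrete, the ton-kilometer figure above can be turned into a rough daily CO2 estimate. This is an illustrative back-of-the-envelope sketch, not a result from the paper: the emission factor and the empty-running share below are assumed values.

```python
# Rough illustration: estimating daily CO2 emissions from road freight and
# the extra emissions attributable to empty return journeys.
# The ton-km figure is the one cited in the text [8]; the emission factor
# and the empty-running share are illustrative assumptions, not measured data.

TON_KM_PER_DAY = 204_377_200   # daily road freight activity in Egypt [8]
GRAMS_CO2_PER_TON_KM = 62      # assumed factor for diesel trucks (g CO2 / ton-km)
EMPTY_RUNNING_SHARE = 0.20     # assumed proportional overhead from empty runs

def daily_co2_tonnes(ton_km, g_per_ton_km):
    """Convert freight activity into tonnes of CO2 per day."""
    return ton_km * g_per_ton_km / 1e6

loaded = daily_co2_tonnes(TON_KM_PER_DAY, GRAMS_CO2_PER_TON_KM)
# Empty kilometers add emissions without moving any freight; model them here
# as a proportional overhead on top of the loaded-journey emissions.
with_empty = loaded * (1 + EMPTY_RUNNING_SHARE)

print(f"loaded journeys:  {loaded:,.0f} t CO2/day")
print(f"incl. empty runs: {with_empty:,.0f} t CO2/day")
```

Under these assumptions, eliminating empty runs would remove the entire overhead term, which is the kind of saving the physical internet targets through resource sharing.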
To summarize, the freight transportation system in Egypt suffers from many problems whose effects span three aspects: economic, social, and environmental. From the economic perspective, long journeys incur high costs, including extra expenses for the movement of goods such as toll gate fees, fuel, and vehicle maintenance. From the social perspective, the problems concern the drivers: long journeys leave them exhausted and deprive them of a social life, working long hours and days causes health issues such as muscle strain, and some drivers resort to taking medications and drugs to stay awake. From the environmental perspective, vehicles depend mainly on diesel as their fuel, which makes transportation one of the most significant sources of carbon dioxide emissions. Finally, there are problems specific to transportation within the supply chain, such as inefficient capacity utilization: container sizes are limited, mostly to 20- and 40-foot containers, which forces some vendors to consolidate goods until full capacity is reached in order to save costs, and, most importantly, return journeys are sometimes empty, which incurs many costs. These problems make the transportation system inefficient, increasing costs and delivery times, harming the environment, and creating problems for drivers. Therefore, this research aims to propose a framework for implementing the physical internet to solve these transportation problems.

3 Physical Internet

The Physical Internet (PI, π) was proposed to address the existing global logistics’ lack
of economic, environmental, and social sustainability, and is based on the rapid evolution
of the digital world, owing to different standardizations that have helped reshape digital
communications in networks [12]. Physical internet has many benefits that can solve
problems that face the transportation system, and it will enhance the whole supply chain
performance. Physical Internet will improve the efficiency and sustainability of logistics
in its broadest meaning by an order of magnitude. The concept of the universal inter-
connectedness of logistics networks and services is exploited by the Physical Internet. It
proposes encapsulating goods and products in globally standardized, green, modular, net-
worked, and smart containers that can be moved and distributed over rapid, dependable,
and environmentally friendly multimodal transportation and logistics systems [12].
The physical internet concept, together with its standardized tools and basic guidelines, was first presented by [12, 13]. These works proposed the basis and core principles of the physical internet and discussed how the Internet metaphor relates to logistics.

3.1 Logistics Web

The Logistic Web is a global network of physical, digital, human, organizational, and
social actors and networks that serve the dynamic and evolving logistics demands of the
world. The Physical Internet intends to enable the Logistics Web to be more open and
global while being dependable, resilient, and adaptable in the pursuit of efficiency and
sustainability.

Fig. 3. Components of logistics web [12]

As shown in Fig. 3, the Mobility Web, the Distribution Web, the Realization Web,
and the Supply Web are four interconnected webs that make up the Logistics Web. The
Mobility Web is concerned with the movement of physical things across a global network
of open unimodal and multimodal hubs, transits, ports, highways, and paths. The Distribution
Web is concerned with the distribution of things throughout a global network of open
warehouses, distribution hubs, and storage places. Making, assembling, personalizing,
and retrofitting products as best fits inside the worldwide interconnected set of open
factories of all types is what the Realization Web is all about. The Supply Web is a global
interconnected network of open suppliers and contractors for delivering, receiving, and
supplying objects. Each Web makes use of the other Webs to improve its performance
[12, 14].

3.2 Physical Internet Components

From a conceptual standpoint, the system includes three main elements, as shown in Fig. 4: PI-containers, PI-nodes, and PI-movers.
The first component is PI-container, Physical Internet encapsulates physical items
in physical packets or containers, which will be called PI-containers to distinguish them
from current containers. Each PI-container has a unique global identification from an
informational standpoint. It helps to provide container identity, integrity, routing, con-
ditioning, monitoring, traceability, and security via the Physical Internet. Radio Fre-
quency Identification (RFID) and/or Global Positioning System GPS technologies are
now thought to be suitable for equipping PI-Container tags [15].
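The identification and traceability requirements above can be sketched as a minimal container record. This is an illustrative simplification (the class and field names are ours, not part of any PI standard); the unique identifier and position log stand in for what RFID/GPS tags would provide.

```python
# Illustrative sketch: a minimal PI-container record carrying the unique
# global identifier and the tracking fields that RFID/GPS tags would populate.
from dataclasses import dataclass, field
import uuid

@dataclass
class PIContainer:
    contents: str
    destination_hub: str
    # Unique global identifier, as required for routing and traceability.
    uid: str = field(default_factory=lambda: uuid.uuid4().hex)
    # (latitude, longitude) fixes reported by a GPS tag along the route.
    position_log: list = field(default_factory=list)

    def report_position(self, lat: float, lon: float) -> None:
        """Append a GPS fix, building a traceable movement history."""
        self.position_log.append((lat, lon))

box = PIContainer(contents="spare parts", destination_hub="Cairo-PI-Hub")
box.report_position(30.04, 31.24)  # e.g. scanned while leaving a depot
print(box.uid, box.destination_hub, len(box.position_log))
```

Each instance gets its own identifier and history, so two containers with identical contents remain distinguishable throughout the network.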

Fig. 4. Main components of the physical internet, according to [12]: PI-containers; PI-nodes (PI-sites, PI-facilities, and PI-systems, including PI-transits, PI-switches, PI-bridges, PI-sorters, PI-composers, PI-gateways, PI-hubs, and PI-distributors); and PI-movers (PI-vehicles such as PI-boats, PI-locomotives, PI-planes, PI-robots, and PI-trucks; PI-carriers such as PI-trailers, PI-tugs, and PI-wagons; PI-conveyors; and PI-handlers)

3.3 Encapsulation with PI-Containers

As stated by [16], all the different packages, cases, totes, and pallets now used in the
encapsulation levels one to four are proposed to be replaced by standard and modular
PI-containers. However, they must be available in a variety of structural grades to meet
the wide range of planned applications.

Fig. 5. Proposed physical internet encapsulation characterization – Source (Montreuil et al., 2014)

Transport physical internet containers (TPI), handling physical internet containers


(HPI), and packing physical internet containers (PPI) are the three categories of PI-
containers that are developed, constructed, and used. As shown in Fig. 5 transport con-
tainers are a step forward from the present shipping containers used in encapsulation layer
4. In encapsulation levels 2 and 3, basic handling unit loads, and pallets are replaced
with handling containers. Handling containers are functionally equivalent to existing
basic handling unit loads such as cases, boxes, and totes, but with improved generic
PI-container requirements. HPI-containers are smaller and designed to modularly fit
into TPI-containers. Packaging containers, also known as PPI-containers or PI-packs,
are like contemporary product packaging in that they contain unit items for sale and are
exhibited in retail establishments all over the world. The packaging containers replace the existing encapsulation layer 1 packages. They can be easily collapsed to support vehicle trailers that would normally be returned empty to the manufacturer, allowing those trailers to be used for supplying parts to the facility [16].
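The modularity idea — smaller containers sized so that an integer number of them exactly tile a larger one — can be sketched as a simple dimension check. The dimensions used below are assumptions for illustration, not the standardized PI sizes.

```python
# Hedged sketch of PI-container modularity: HPI-containers are dimensioned so
# an integer number of them tile a TPI-container exactly on every axis.
# The dimensions are illustrative assumptions, not standardized PI sizes.

def fits_modularly(outer, inner):
    """True if the inner dimensions divide the outer ones exactly on every axis."""
    return all(o % i == 0 for o, i in zip(outer, inner))

def capacity(outer, inner):
    """How many inner containers tile the outer one (axis-aligned packing)."""
    count = 1
    for o, i in zip(outer, inner):
        count *= o // i
    return count

tpi = (1200, 240, 240)  # assumed transport container dimensions, cm
hpi = (120, 120, 120)   # assumed handling container dimensions, cm

if fits_modularly(tpi, hpi):
    print(f"{capacity(tpi, hpi)} HPI-containers fill the TPI-container exactly")
```

With these assumed sizes the check passes and 40 handling containers fill the transport container with no wasted space, which is the vehicle-utilization benefit the encapsulation scheme aims at.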
Second component is PI-node. PI-nodes are sites that are linked to the logistical pro-
cesses dedicated to working with PI-containers. There is a range of PI-nodes that provide
services ranging from basic carrier transfer between PI-vehicles to extensive multimodal
multiplexing of PI-containers. PI-nodes theoretically encompass PI-sites, PI-facilities,
and PI-systems, which are sites, facilities, and systems meant to function as physical
nodes of the Physical Internet. As they are places that are linked to logistics, changes
such as switching from one means of transportation to another may be influenced by the
activity at PI-node. They might also lead to contract modifications for the PI- containers.
PI-transits are PI-nodes with the task of facilitating and completing the transition from
a transportation mode to another as for example, in PI-carriers, it is responsible to move
PI-containers from their inbound to outgoing PI vehicles; road-based transportation can
be very simple with a PI-site near the intersection of the two highways. PI-trailers may be
transferred from PI-trucks to PI-trains or PI-boats, and vice versa. PI-switches are nodes
that enable and achieve the unimodal transfer of PI-containers from an arriving PI-mover
to a leaving PI-mover. PI-sorters are crucial components of the transfer points, as they
receive containers from different areas and sort them to be moved to their destination
[12, 13].
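As an illustration of the sorting role, the following sketch (our own simplification, not taken from [12, 13]) groups arriving containers by destination so that each batch can be forwarded to the matching outgoing mover.

```python
# Illustrative PI-sorter logic: incoming containers are grouped by destination
# hub so each batch can be routed to the right outgoing PI-mover.
from collections import defaultdict

def sort_containers(containers):
    """Group (container_id, destination) pairs into per-destination lanes."""
    lanes = defaultdict(list)
    for cid, dest in containers:
        lanes[dest].append(cid)
    return dict(lanes)

arrivals = [("c1", "Alexandria"), ("c2", "Aswan"), ("c3", "Alexandria")]
print(sort_containers(arrivals))
# → {'Alexandria': ['c1', 'c3'], 'Aswan': ['c2']}
```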
PI-hubs are nodes that allow π-containers to be transferred from incoming movers
to outgoing movers. As PI-containers are quickly emptied, sorted, and transferred with PI-conveyors, a switch to PI-hubs would result in a faster flow and shorter waits. PI-Hubs
will be at the heart of rapid, efficient, and reliable multimodal transportation.

Fig. 6. The normal process vs pi-hub of the cross-dock [17]



Package flows are directed in the network by mixing/distribution centers and carried by transportation modes such as road or rail. Intermodal
terminals allow cargo to transfer between transportation modes. PI services are provided
by a variety of Logistics Service Providers (LSPs), who ensure that all types of physical
products are delivered promptly [18]. Figure 6 compares cross-docking in the normal process, where categorizing products, storing them, and then assigning them to each area takes time, with the physical internet, where the process is automated: conveyors and algorithms sort the products and assign them to each truck more efficiently.
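The automated assignment step can be sketched as follows. This is an assumed, minimal policy (fill one truck per destination and open a new one when it is full), not the algorithm used in the cited works.

```python
# Minimal cross-dock assignment sketch (assumed policy, for illustration only):
# each container goes onto the outgoing truck serving its destination, and a
# new truck is opened once the current one reaches capacity.

def assign_to_trucks(containers, truck_capacity):
    """containers: list of (container_id, destination) pairs.
    Returns {destination: [truckload, truckload, ...]}."""
    plans = {}
    for cid, dest in containers:
        loads = plans.setdefault(dest, [[]])
        if len(loads[-1]) >= truck_capacity:  # current truck is full: open a new one
            loads.append([])
        loads[-1].append(cid)
    return plans

shipments = [("c1", "Giza"), ("c2", "Giza"), ("c3", "Luxor"), ("c4", "Giza")]
print(assign_to_trucks(shipments, truck_capacity=2))
# → {'Giza': [['c1', 'c2'], ['c4']], 'Luxor': [['c3']]}
```

A real PI-hub would also weigh departure times and consolidation opportunities, but even this greedy sketch shows how sorting and loading can be decided programmatically rather than manually.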
The third component of the physical internet is PI-movers. PI-containers are gener-
ically moved around by π-movers in the Physical Internet. Transporting, conveying,
managing, lifting, and manipulating are all verbs that can be translated as moving. Trans-
porters, conveyors, and handlers are the three basic types of π-movers. The notion of
π-transporters encompasses π -vehicles and π -carriers, which are specifically designed
to make transferring π-containers easy, secure, and efficient. The fact that π-vehicles
are self-propelled distinguishes them from π-carriers, which must be pushed or dragged
by π-vehicles or by handlers. PI-vehicles include π-trucks, π-locomotives, π-boats, π-
planes, π-lifts, and π-robots, to name a few. Similarly, the π-carriers category comprises
π-trailers, π-carts, π-barges, and π-wagons [13].

3.4 Previous Studies


There are several research papers that discuss the physical internet and its relationship
with the supply chain, and many studies try to implement tools of physical internet to
different areas in supply chain and transportation. The paper [19] presented an Open
Logistics Interconnection (OLI) paradigm for the Physical Internet, like the Open Sys-
tems Interconnection (OSI) model for the Digital Internet. Then papers were published
that describe the structure of PI-hubs for road, railroad, and road-based transit centers
and the requirements for intermodal [20]. Some works tried to assess the impact of
physical internet on logistics and evaluate the performance, as simulation models and
other analytical models were conducted. In [21] the first simulator was presented for
the physical internet environment. It is being used to aid in the exploration and analysis
of the effects of transitioning from the present tight transportation system to an open
logistics network in France. The simulator is a multi-agent program capable of simu-
lating large-scale virtual mobility webs with thousands of actors and agents interacting
to move cargoes and containers throughout a network of locations, including unimodal
and multimodal hubs. A simulation model is built by [21] to evaluate the efficiency of
a connected logistics network through the physical internet using parameters like load
delivery time, carbon emission, and driver trip time. The performance measurements
from the simulation for the Physical Internet are compared against standard industry key
performance indicators in a case study involving fast-moving products in the consumer
goods sector in France. This paper concludes that Physical Internet does not affect the
logistics system’s operational efficiency while considerably lowering carbon emissions
and logistical expenses. The author in [22] proposed an evolutionary strategy to overcome the difficulty of designing an open PI-hub network at massive scale; the model provides decentralized and distributed routing solutions for the physical internet, rather than a traditional network architecture, for flow assignment problems. The study
in [23] proposed a freight transportation model based on the Physical Internet, in which
freight is transported from hub to hub utilizing various tractors assigned to each hub.
The concept of combining two trailers into a road train is being considered. The data
shows that both consolidation and waiting to enhance the likelihood of a return hauling
opportunity have a beneficial influence on overall cost, fill rate, the average amount of
the night spent at home, and GHG emissions. A mathematical program and a break-
down algorithm for solving the problem of optimal space utilization by determining the
size and number of modular containers to pack a collection of items as the paper [24]
demonstrate how the deployment of standardized containers results in greater vehicle
space utilization through a case study. Protocols for Physical Internet transportation were
introduced [2, 25]. As an aggregate of optimum point-to-point dispatch models between
pairs of cities, a systems model of conventional and Physical Internet networks is built.
This is then utilized to define the behavior of conventional and Physical Internet logistics
systems for a variety of key performance metrics in logistics systems. The Physical Inter-
net’s advantages include lower inventory costs and lower total logistics system costs.
They simulate asynchronous container shipment and creation inside a linked network of
services, as well as the optimum path routing for each container to save transportation
costs. The study in [26] suggested a multi-agent simulation model for freight transporta-
tion resilience on the Physical Internet, taking disturbances at hubs into account. To deal
with various forms of disturbances, they presented two dynamic transit protocols. The
study [17] presented a novel routing technique based on the physical Internet-Border
Gateway Protocol (PI-BGP), the Physical Internet's counterpart of the Internet's BGP. It created a new
protocol to provide a fresh approach to the problem of PI-container routing on the Phys-
ical Internet. They can ensure a rapid flow inside and between PI-hubs by focusing on
the exclusive routing of PI-Containers following the PI-BGP, eliminating delays and
solving stocking difficulties. The study [27] uses a basic model to capture the core of
the problem to investigate the resilience of a network delivery system focusing more on
the time and the total cost of travel to measure performance. The findings imply that
networks with redundancy may adapt well to fluctuations in demand, but hub-and-spoke
networks without redundancy cannot make use of the Physical Internet's benefits.
Given these benefits, a framework is provided in the next sections that links the tools of the physical internet to the supply chain in order to enhance the transportation process, with data science techniques integrated into the framework to facilitate that process.
588 A. Osama et al.

4 Data Science
Data science is an interdisciplinary approach that aims to extract value from data. A massive amount of data can be collected from websites, sensors, customers, smartphones, and other sources, and data science processes these data to obtain useful insights. Data science helps make the supply chain more efficient, supporting forecasting processes across the supply chain for demand and supply, as well as networking, transportation, and inventory optimization. In transportation, data science combined with machine learning can predict optimal transport routes and congestion areas, and support the tracking and scheduling of shipments.
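As a hedged illustration of route prediction on a transport network, the sketch below finds the cheapest route between two points with Dijkstra's algorithm, a standard building block in such systems. All hub names and costs are hypothetical, and the edge weights are assumed to already combine distance and predicted congestion:

```python
import heapq

def best_route(graph, source, target):
    """Return (total_cost, path) for the cheapest route between two hubs.

    `graph` maps each node to {neighbour: edge_cost}; each edge cost is
    assumed to combine distance, predicted congestion, and fuel use.
    """
    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, edge_cost in graph.get(node, {}).items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + edge_cost, neighbour, path + [neighbour]))
    return float("inf"), []

# Hypothetical road network between a factory, two hubs, and a port.
network = {
    "factory": {"hub_A": 4.0, "hub_B": 2.5},
    "hub_A": {"port": 3.0},
    "hub_B": {"hub_A": 1.0, "port": 5.0},
}
cost, path = best_route(network, "factory", "port")
# cost == 6.5 along factory -> hub_B -> hub_A -> port
```

In practice the edge weights themselves would come from a trained model fed with live congestion and shipment data.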

5 Artificial Intelligence
Artificial intelligence (AI) does not imply creating a super-intelligent machine capable of solving any problem in a flash, but rather creating a machine capable of human-like behavior. It refers to machines that can perform one or more of the following tasks: comprehending human language, performing mechanical tasks requiring complex maneuvering, solving computer-based complex problems involving large amounts of data in a short amount of time, providing human-like responses, and so on. The term machine learning refers to computer software that can learn to perform actions that were not expressly designed by the program's inventor; instead, it can discover and carry out behavior of which the author is entirely unaware. Despite their origins in fundamentally distinct fields, machine learning has brought together a large array of algorithms that are essentially derived from the concepts of pure mathematics and statistics. Beyond these roots, they share one additional element: the use of computers to automate difficult computations. These computations eventually lead to the solution of problems that appear so difficult that they seem to require an intelligent creature, or Artificial Intelligence [28]. Artificial intelligence, through the application of machine learning, can benefit the supply chain in many ways, from demand prediction to warehouse management, delivery with drones, and route optimization, as it can analyze routing data to find the optimal route that delivers without delays and reduces costs.

6 Neural Network
Neural networks are an emerging artificial intelligence technique built on recent advances in research on the biology of human brain tissue. The principle is to simulate the structure and functioning of the human brain. Neural network technology has made a breakthrough in overcoming some of artificial intelligence's (AI) limitations and has been effectively applied in a variety of disciplines, demonstrating an efficiency and accuracy that other AI techniques cannot match. With real-time processing capabilities, a neural network has great adaptability and can quickly assess and handle emergent restrictions [29]. As shown in Fig. 7, a neural network consists of three kinds of layers: input, hidden, and output layers. The data is transmitted from one layer to the next, beginning with the input layer and progressing via the hidden layers to the output layer, where the result is produced. Each
A Proposed Framework for Enhancing the Transportation Systems 589

layer gets an input value and produces an output value; a layer's input value is the preceding layer's output value. Neural networks have the ability to learn from their environment. This is accomplished by adjusting the weights until the artificial neural network can generalize its outputs. After the learning process is complete, the neural network may be used with fresh inputs to produce new predictions or to characterize existing data patterns [30]. There are two broad types of such networks: the shallow network with a single hidden layer and the deep multilayer perceptron with several.

Fig. 7. Neural networks mechanism [29]

Neural network technology has been used in many facets of supply chain management. In the supply chain, neural networks can be applied in three areas: optimization, forecasting, and decision support. The neural network is the most widely used computing technique for solving optimization problems and has a significant impact on supply chain management; its application to supply chain optimization problems such as shop scheduling, warehouse management, and transportation route selection is currently being researched [29]. In most cases, there is a single input layer, one output layer, and a number of hidden layers. The number of nodes in the input and output layers equals the number of input and output variables, respectively. Starting with the input layer, the nodes in each layer are connected to nodes in the following layer by weights that the neural network adjusts in each training period. The process is repeated until the output has the smallest loss in comparison to the observation [31]. Both single-layer and multilayer neural networks can be used for supply chain issues, but the multilayer network will be more accurate, since data pass through many hidden layers, yielding more accurate results on which to base decisions. A multilayer network can be used in transportation forecasting to find the optimal route within the supply chain, saving fuel and the time needed to deliver cargo from one node to another.
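The forward pass and weight adjustment described above can be sketched in a few lines. The following toy multilayer perceptron is a simplified illustration only, not the model used by any cited study; the XOR task and all hyperparameters are arbitrary choices. It passes data from the input layer through one hidden layer to the output, then repeatedly adjusts the weights to shrink the squared loss:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyMLP:
    """One hidden layer: each layer's output becomes the next layer's input."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        self.b2 = rng.uniform(-1, 1)

    def forward(self, x):
        self.h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return sigmoid(sum(w * h for w, h in zip(self.w2, self.h)) + self.b2)

    def train_step(self, x, target, lr=0.5):
        y = self.forward(x)
        delta_out = (y - target) * y * (1 - y)  # gradient through the output sigmoid
        for j, hj in enumerate(self.h):
            delta_hidden = delta_out * self.w2[j] * hj * (1 - hj)
            self.w2[j] -= lr * delta_out * hj
            self.b1[j] -= lr * delta_hidden
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * delta_hidden * xi
        self.b2 -= lr * delta_out
        return (y - target) ** 2

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # toy XOR task
net = TinyMLP(n_in=2, n_hidden=4)
initial_loss = sum(net.train_step(x, t) for x, t in data)
for _ in range(4000):  # adjust the weights until the loss stops shrinking
    final_loss = sum(net.train_step(x, t) for x, t in data)
```

A route-forecasting network would follow the same loop, with features such as congestion and fuel use as inputs and travel time as the target.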

7 Proposed Framework
Based on the literature discussed above, a framework is proposed to enhance the transportation system and overcome the main problems facing it.
This section illustrates the framework, with some tools of the Physical Internet applied in stages throughout the supply chain to enhance it. The proposed framework, shown in Fig. 8, illustrates how the Physical Internet and neural networks can benefit the supply chain.

Fig. 8. Proposed framework model- authors’ own

The first stage mainly concerns the gathering process: goods in the perception layer that need to be shipped are received and collected into PI-containers of different sizes. As mentioned before, encapsulation is done through only four tiers: packaging containers, TPI-containers, HPI-containers, and PPI-containers. Goods are produced and packaged in PPI-containers, which are then inserted into HPI-containers without any pallets and with 100% capacity utilization, since the standardized container measures enable full utilization. The containers can then be consolidated for shipment in TPI-containers, or the HPI-containers can be handled on their own. These containers send their data using Internet of Things (IoT) technology, as each container carries RFID and GPS devices that allow it to be tracked. The goods are sent to the transmission stage through the network stage. The infrastructure layer of the network stage consists of routes over which PI-containers can be moved by PI-vehicles. These roads should support heavy loads and should have the infrastructure needed to track containers through RFID. Artificial intelligence and neural networks can be applied in this layer to route the means of transport to their specified hubs. The neural network will analyze many factors, such as congestion
areas, fuel consumption, and available roads, to identify the optimal route for the trucks. In the transmission stage, PI-movers take over: PI-handlers such as PI-forklifts move each container, or group of small consolidated containers, from one place to another, and PI-vehicles are structured to fit the new container sizes. PI-vehicles move each PI-container from hub to hub until it reaches its destination, and each PI-vehicle returns to its hub loaded with other PI-containers. In the processing stage, cross-docking takes place: goods are unloaded from the trucks and pass through PI-sorters and PI-conveyors, which move the containers onto racks and then from the racks to the other side of the hub for reshipment toward their destination. Finally, in the application stage, after the containers have moved through the different hubs, the goods are delivered to customers at their destination, and the PI-containers are reused to contain other products.
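As a hedged sketch of the encapsulation tiers described above (the class and tag names are illustrative only, and the real PI protocols are considerably richer), the containment and tracking behavior can be modelled as a recursive data structure: moving an outer TPI-container implicitly moves every HPI- and PPI-container consolidated inside it, while each tier keeps its own RFID identity for IoT tracking:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PIContainer:
    """Illustrative model of one encapsulation tier (packaging, PPI, HPI, or TPI)."""
    tier: str       # "packaging", "PPI", "HPI", or "TPI"
    rfid_tag: str   # identity broadcast at each hub for tracking
    contents: List["PIContainer"] = field(default_factory=list)
    location: str = "origin"

    def consolidate(self, inner: "PIContainer") -> None:
        self.contents.append(inner)

    def move_to(self, hub: str) -> None:
        # Moving the outer container implicitly moves everything inside it.
        self.location = hub
        for inner in self.contents:
            inner.move_to(hub)

ppi = PIContainer("PPI", rfid_tag="PPI-001")
hpi = PIContainer("HPI", rfid_tag="HPI-100")
tpi = PIContainer("TPI", rfid_tag="TPI-7")
hpi.consolidate(ppi)   # packaged goods go into the handling container
tpi.consolidate(hpi)   # handling containers consolidate into a transport container
tpi.move_to("hub_A")   # every nested container is now trackable at hub_A
```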

8 Benefits of Adapting the Physical Internet Framework


In the conventional encapsulation process, products move through five or more tiers: one or more packages for the product itself, consolidation of the products into cartons or boxes, loading onto pallets, shipment in containers, and finally loading of these containers onto their carriers. The containers moved in this way may not be fully utilized, and current container models come in limited sizes of 20 and 40 ft, whereas PI-containers come in many sizes, ranging from small to large, and can be consolidated into one unit to utilize the space. PI-containers are also connected to the internet through IoT tools, which enable tracking of the products. Moreover, with the Physical Internet and its centralized PI-hubs, trucks make short journeys from hub to hub with high space utilization, since vehicles do not return empty; this reduces fuel and toll-gate costs, and drivers' social lives improve because they are no longer obliged to take stimulants to stay awake on long journeys. This leads to reduced fuel consumption, reduced transportation costs, and a reduction in the number of accidents.

9 Challenges from Adapting Physical Internet Framework


As discussed previously in this paper, the Physical Internet is a concept that aims to connect logistics operations and link participants throughout the supply chain, with environmental, economic, and social benefits. However, there are many constraints on applying the new paradigm and its tools. In the packaging process, factories may resist changing their packs to PI-containers, since the new encapsulation differs from the current packaging process. Not only manufacturing companies but all participants in the supply chain must cooperate to transfer their products in the new container sizes and route them through different hubs until they reach their destination. Resources will be shared among supply chain participants around the world, and different infrastructure will be needed. As PI-containers change the packaging tools in the factory, hubs and warehouses may need to design different racking systems to be able
to handle the new containers; material handling equipment must be adapted, and vehicles and other movers will need to be modified to fit the new container sizes. The required infrastructure includes building hubs at suitable points, building containers in the new sizes, and equipping hubs with sorters that automatically sort and ship containers to their desired destinations; warehouses in ports and rail stations should likewise have sorters and conveyors. Finally, the most important challenge is that some countries may not be able to apply the Physical Internet and its tools, which may interrupt the global supply chain.

10 Effect of Applying the Physical Internet on Freight Transportation

As discussed in previous sections, the application of the Physical Internet with its protocols and tools can enhance the freight transportation process in the supply chain. As the paper noted at the beginning, transportation problems can be divided into three aspects: economic, social, and environmental. Implementing the concept will help solve problems in all three. From the economic aspect, applying the Physical Internet tools avoids the problems of the long journey and the empty return journey, since PI-containers are moved from hub to hub until they reach their destination. Each vehicle moves from one hub to the next within a specific region and returns loaded, which shortens the transportation journey and enhances transportation efficiency: the costs of long journeys are cut because less fuel is needed, and vehicle repair costs are reduced thanks to the shorter journeys. From the social perspective, the driver's social life will be enhanced, since he will no longer need to drive long distances; working hours are shortened, drivers can return home the same day, and they will not need stimulants to keep working, which may reduce the rate of accidents caused by trucks. The environmental aspects will also be improved: transportation journeys will be shortened, reducing fuel consumption; there will be no empty return journeys, so fuel is used efficiently to move products without wastage, as the truck is always loaded; and the number of vehicles may be reduced through the consolidation of Physical Internet containers. It can therefore be concluded that the Physical Internet tools will help enhance the efficiency of the transportation process.

11 Conclusion

To conclude, this research focused on applying the Physical Internet tools to solve transportation problems, given their importance in the supply chain. The Physical Internet is a new concept that aims to change the way goods are moved across the participants of the supply chain. Its mechanism is similar to that of the digital internet: goods are packaged in PI-containers through different encapsulation layers, with data collected along the way, and the containers are then moved through connected routes and hubs until they reach their final destination. In the beginning, the main problems facing the freight
transportation system in Egypt, especially in the trucking industry, were discussed, along with their impact on environmental, social, and economic aspects. Then the literature on the Physical Internet concept, its components, and its protocols was discussed, and previous studies implementing Physical Internet components in different areas of the supply chain were reviewed. After that, a framework was proposed that considers the stages of the supply chain, applies tools from the Physical Internet in each stage, and incorporates artificial intelligence. The benefits of applying the proposed five-layer framework with the Physical Internet tools were then discussed, including the use of a neural network to determine the optimal route for trucks. The application of the new paradigm will make the transportation process more efficient and will also bring social, economic, and environmental benefits. An analysis of applying the framework in the transportation system shows that it would enhance the efficiency of the system and solve many problems from the financial, social, and environmental perspectives. For further research, it is recommended to study the infrastructure needed to implement the Physical Internet and the effects of the concept in other areas of the supply chain.

References
1. Montreuil, B., Russell, D.M., Eric, B.: Physical internet foundations. In: Service Orientation in Holonic and Multi-Agent, Studies in Computational Intelligence, vol. 472, pp. 151–166. Springer-Verlag, Bucharest (2013)
2. Sarraj, R., Ballot, E., Pan, S., Montreuil, B.: Analogies between Internet network and logistics
service networks: challenges involved in the interconnection. J. Intell. Manuf. 25(6), 1207–
1219 (2012). https://doi.org/10.1007/s10845-012-0697-7
3. Matusiewicz, M.: Logistics of the future-physical internet and its practicality. Transp. J. 59(3),
200–214 (2020)
4. McKinnon, A.: Decarbonizing Logistics: Distributing Goods in a Low Carbon World. 1st ed.
Kogan Page Limited (2018)
5. Demir, E., Bektaş, T., Laporte, G.: A review of recent research on green road freight
transportation. Eur. J. Oper. Res. 237(3), 775–793 (2014)
6. World Bank Group: Measuring Regulatory Quality and Efficiency. A World Bank Group Flagship Report comparing business regulation for domestic firms in 189 economies. Washington (2016). www.worldbank.org. https://doi.org/10.1596/978-1-4648-0667-4. Accessed 17 Mar 2022
7. Ismail, A.M., Ahmed, H.Y., Owais, M.A.: Analysis and modeling of traffic accidents causes
for main rural roads in Egypt. J. Eng. Sci. 38(4), 895–909 (2010)
8. Japan International Cooperation Agency, Oriental Consultants Co., Ltd., Katahira & Engineering International: The comprehensive study on the master plan for nationwide transport system in the Arab Republic of Egypt. Transport Planning Authority, Ministry of Transport (2012). https://openjicareport.jica.go.jp/pdf/12057592_01.pdf. Accessed 26 Dec 2021
9. Elshamly, A.F., El-Hakim, R.A., Afify, H.A.: Factors affecting accident risks among truck drivers in Egypt. In: MATEC Web of Conferences, vol. 124, Taiwan (2017)
10. Central Agency for public mobilization and statistics. Arab Republic of Egypt - Annual
bulletin of Vehicles and Trains accidents year 2019. CAPMAS (2019). https://censusinfo.
capmas.gov.eg/Metadata-en-v4.2/index.php/catalog/407/related_materials. Accessed 26 Dec
2021
11. The World Bank: CO2 emissions from liquid fuel consumption (kt) - Egypt, Arab Rep. The World Bank (2016). https://data.worldbank.org/indicator/EN.ATM.CO2E.LF.KT?end=2016&locations=EG&start=1960&view=chart. Accessed 5 July 2021
12. Montreuil, B.: Toward a physical internet: meeting the global logistics sustainability grand
challenge. Logist. Res. 3(2–3), 71–87 (2011). https://doi.org/10.1007/s12159-011-0045-x
13. Montreuil, B., Meller, R., Ballot, E.: Towards a physical internet: the impact on logistics facilities and material handling systems design and innovation. In: 11th IMHRC Proceedings, Wisconsin, USA, Progress in Material Handling Research, pp. 1–23 (2010)
14. Montreuil, B.: Manifesto for a Physical Internet Version 1.4. Canada Research Chair in Enterprise Engineering, Interuniversity Research Center on Enterprise Networks, Logistics and Transportation (CIRRELT), Canada, p. 499 (2012)
15. IEEE: IEEE guidelines for 64-bit global identifier (EUI-64) registration authority (2017). https://standards.ieee.org/content/dam/ieee-standards/standards/web/documents/tutorials/eui.pdf. Accessed 5 Jan 2022
16. Montreuil, B., Ballot, E., Tremblay, W.: Modular Design of Physical Internet Transport,
Handling and Packaging Containers, Progress in Material Handling Research, vol. 13 (2014)
17. Gontara, S., Boufaied, A., Korbaa, O.: Routing the Pi-Containers in the Physical Internet using
the PI-BGP Protocol. In: Proceedings of IEEE/ACS International Conference on Computer
Systems and Applications, AICCSA (2018)
18. Dong, C., Franklin, R.: From the digital internet to the physical internet: a conceptual
framework with a stylized network model. J. Bus. Logist. 42(1), 108–119 (2021)
19. Montreuil, B., Ballot, E., Fontane, F.: An open logistics interconnection model for the physical
internet. In: IFAC Proceedings Volumes (IFAC-PapersOnline), vol. 45, no. 6, pp. 327–332
(2012)
20. Montreuil, B., Meller, R.D., Thivierge, C., Montreuil, Z.: Functional design of physical inter-
net facilities: a road-based crossdocking hub. In: Progress in Material Handling Research,
Charlotte, NC. MHIA, USA (2012)
21. Hakimi, D., Montreuil, B., Sarraj, R., Ballot, E., Pan, S.: Simulating a physical internet
enabled mobility web: the case of mass distribution in France. In: Proceeding 9th International
Conference of Modeling, Optimization and Simulation, Bordeaux, MOSIM 2012, pp. 1–7
(2012)
22. Ballot, E., Gobet, O., Montreuil, B.: Physical internet enabled open hub network design for distributed networked operations. In: Service Orientation in Holonic and Multi-Agent Manufacturing Control, vol. 402, pp. 279–292. Springer (2012)
23. Furtado, P., Biard, P., Frayret, J.-M., Fakhfakh, R.: Simulation of a physical internet-based transportation network. In: International Conference on Industrial Engineering and Systems Management, Rabat, Morocco, vol. 5, pp. 1–8 (2013)
24. Lin, Y.-H., Meller, R.D., Ellis, K.P., Thomas, L.M., Lombardi, B.J.: A decomposition-based approach for the selection of standardized modular containers. Int. J. Prod. Res. 52(15), 4660–4672 (2014)
25. Venkatadri, U., Krishna, K.S., Ülkü, M.A.: On Physical internet logistics: modeling the impact
of consolidation on transportation and inventory costs. IEEE Trans. Autom. Sci. Eng. 13(4),
1517–1527 (2016)
26. Yang, Y., Pan, S., Ballot, E.: Freight transportation resilience enabled by physical internet.
IFAC-PapersOnLine 50(1), 2278–2283 (2017)
27. Ezaki, T., Imura, N., Nishinari, K.: Network topology and robustness of Physical Internet (2021). https://arxiv.org/pdf/2109.02290v2.pdf. Accessed 01 May 2022
28. Joshi, A.V.: Essential Concepts in Artificial Intelligence and Machine Learning. Springer,
Switzerland, pp. 9–20 (2020)
29. Liu, H.: Forecasting model of supply chain management based on neural network. In: Pro-
ceedings of the 2015 International Conference on Automation, Mechanical Control and
Computational Engineering, China (2015)
30. Nunes da Silva, I., Hernane Spatti, D., Andrade Flauzino, R., Helena Bartocci Liboni, L., Franco dos Reis Alves, S.: Artificial Neural Networks: A Practical Course, 1st edn. Springer, Cham (2017)
31. Sohrabi, H., Klibi, W., Montreuil, B.: Modeling scenario-based distribution network design
in a Physical Internet-enabled open Logistics Web. In: 4th International Conference on
Information Systems, Logistics and Supply Chain, Quebec (2012)
A Systematic Review of Machine Learning
and Explainable Artificial Intelligence (XAI)
in Credit Risk Modelling

Yi Sheng Heng(B) and Preethi Subramanian

School of Computing, Asia Pacific University, Technology Park Malaysia,


57000 Kuala Lumpur, Malaysia
[email protected], [email protected]

Abstract. The emergence of machine learning and artificial intelligence has cre-
ated new opportunities for data-intensive science within the financial industry.
The implementation of machine learning algorithms still faces doubt and distrust,
mainly in the credit risk domain due to the lack of transparency in terms of deci-
sion making. This paper presents a comprehensive review of research dedicated to
the application of machine learning in credit risk modelling and how Explainable
Artificial Intelligence (XAI) could increase the robustness of a predictive model.
In addition to that, some fully developed credit risk software available in the mar-
ket is also reviewed. It is evident that adopting complex machine learning models
produced high performance but had limited interpretability. Thus, the review also studies some XAI techniques that help to overcome this problem while breaking away from the ‘black-box’ nature of such models. XAI models mitigate bias
and establish trust and compliance with the regulators to ensure fairness in loan
lending in the financial industry.

Keywords: Credit risk · Explainable Artificial Intelligence (XAI) · LIME ·


Machine learning · SHAP

1 Introduction
According to The Malaysian Reserve, statistics published by the Malaysian Department of Insolvency show that more than 95,000 people defaulted on their loans between 2014 and 2018, with the defaults arising from personal loans (27.76%), hire purchase loans (24.73%), housing loans (14.09%) and credit cards (9.91%) [1]. Loan defaults not only damage the individual's credit score but also introduce monetary losses to banks. This is also witnessed in a publication released by Bank Negara Malaysia, which states that the cumulative amount of impaired loans had reached RM31 billion as of July 2021 [2]. This is a huge loss for the banking sector, and it could pose a significant risk to Malaysia's economy. Thus, financial institutions are encouraged to employ a reliable credit risk model to minimize default risk.
Credit risk is known as the risk of the lender where the lender might not receive
the principal and interest from the borrower [3]. Moreover, credit risk assessment plays

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 596–614, 2023.
https://doi.org/10.1007/978-3-031-18461-1_39
Review of Machine Learning and XAI in Credit Risk Modelling 597

an important role in financial industries in evaluating the capability of a borrower to


repay a loan. Credit scoring has always been a challenge for financial institutions due to
the inherent uncertainty of future events. With the emergence of machine learning technologies, credit risk modelling has gained attention, especially in the field of data science. In this paper, research has been carried out using a systematic
review of literature such as journals, conference proceedings, academic publications, and
books to understand existing investigation and debates relevant to credit risk modelling.
The study also narrows down and takes a closer look at the explainability of machine
learning models for decision making in the financial industry.

2 Domain Research
2.1 Credit Risk in Financial Industry

There are many types of risks faced by the banking industry, as seen in Fig. 1, which
includes credit risk, market risk, liquidity risk, exchange rate risk, interest rate risk and
operational risk. Among the types of risk mentioned above, credit risk is one of the main risks that most banks face nowadays [4].

Fig. 1. Various types of risks faced by banks [4]

Credit risk refers to the risk of loss imposed on creditors caused by borrowers due to
their inability to meet their obligations [4–7]. This type of risk causes ambiguity in terms
of the net income and the market value of the shares. Kolapo et al. [8] indicated that a bank is likely to experience a financial crisis if it is highly vulnerable to credit risk, so the performance of a bank can be judged by the approach it uses to handle credit risk. This is further supported by Chen and Pan [9], who state that credit risk is the most significant risk faced by banks and that different banks have different approaches to credit risk management, which allows them to adapt to changing environments. In
598 Y. S. Heng and P. Subramanian

the opinion of Rehman et al. [7], ignorance about credit risk by bank personnel will
negatively affect the bank’s development and customers’ interest. Thus, credit risk can
be considered an essential field of study. This is because if some borrowers default on the loans issued, it can eventually have a negative impact on banks and the entire banking system, whereby a banking crisis might occur [10]. This means that banks with high credit risk will face substantial losses, mainly because of borrowers defaulting on their loan repayments, which might potentially lead to bankruptcy and insolvency.
Credit risk can occur due to several factors. For instance, poor management, poor
loan underwriting, poor lending procedures, interference by the government bodies,
inappropriate credit policies, unstable interest rate, direct lending, low reserves, liquidity
level, huge licensing of banks, limited institutional capacity, insufficient supervision by
the central bank, lack of strictness in credit assessment and inappropriate laws [11].
Therefore, banks are recommended to minimize this risk by, for example, improving lending procedures, maintaining well-documented information about borrowers, and stabilizing interest rates, in order to reduce the number of loan defaults and non-performing loans.
Effective credit risk management can enhance the reputation of the bank and the confidence of its depositors. Moreover, the financial health of a bank is highly dependent on the possession of good credit risk management; hence, a good credit risk policy plays an essential role in boosting the bank's performance and its capital adequacy protection [11]. Pradhan and Shah [12] examined the relationship between credit management practices, credit risk mitigation measures, and obstacles to loan repayment in Nepal using survey-based primary data and performed a correlation analysis. The results revealed that credit risk management practices and credit risk mitigation measures have a positive relationship with loan repayment, whilst obstacles faced by borrowers have no significant impact on loan repayment. This indicates that credit risk management practices and the mitigation actions taken by the bank can help to reduce credit risk, whereby borrowers repay their loans on time, improving loan repayment behavior.
The Basel Accords were developed with the aim of establishing an international governing framework for controlling market risk and credit risk, to make sure banks hold enough capital to protect themselves from financial crises. The new Basel Capital Accord (Basel II) states that banks should implement their own internal credit risk models to assess default risk [13]. The effectiveness of credit risk management not only helps to maintain the profitability of the bank's business but also helps to sustain the stability of the economy [14]. Moreover, Basel II relies on three pillars for its functioning: minimum capital requirements, the supervisory review process, and market discipline.
According to Basel Committee on Banking Supervision [13], the risk parameters
of Basel II are probability of default (PD), exposure at default (EAD) and loss given
default (LGD). With these three risk parameters, the expected loss (EL) of the bank can
be computed with the formula below:
EL = PD ∗ EAD ∗ LGD (1)
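Formula (1) translates directly into code. The sketch below computes the expected loss for a single exposure; the parameter values are purely illustrative:

```python
def expected_loss(pd, ead, lgd):
    """Basel II expected loss: EL = PD * EAD * LGD.

    pd  -- probability of default, in [0, 1]
    ead -- exposure at default, in currency units
    lgd -- loss given default, as a fraction of EAD in [0, 1]
    """
    if not (0.0 <= pd <= 1.0 and 0.0 <= lgd <= 1.0 and ead >= 0.0):
        raise ValueError("risk parameter out of range")
    return pd * ead * lgd

# A 2% default probability on a 100,000 exposure with 45% loss severity:
el = expected_loss(pd=0.02, ead=100_000, lgd=0.45)  # about 900.0
```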
In general, the banking industry plays an essential role in supporting the financial
stability within a country. Thus, it is crucial for financial institutions to fully understand
and ensure that data-driven decisions are reached by computing the Expected Loss as
outlined by Basel II in order to avoid the adverse impact of credit risk. With data
analytics, several machine learning techniques are used to predict credit risk; these
are reviewed in the following section.
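The expected-loss formula in Eq. (1) is straightforward to compute. The sketch below is illustrative only: the PD, EAD and LGD figures are hypothetical, and in practice they would come from the bank's internal rating models.

```python
# Expected Loss under Basel II: EL = PD * EAD * LGD.
def expected_loss(pd_: float, ead: float, lgd: float) -> float:
    """Return the expected loss for a single exposure."""
    return pd_ * ead * lgd

# A borrower with a 2% probability of default on a 100,000 exposure,
# where 40% of the exposure is expected to be lost on default:
el = expected_loss(pd_=0.02, ead=100_000, lgd=0.40)
print(el)  # 800.0
```

At portfolio level, the same formula is simply summed over all exposures to obtain the total expected loss.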

2.2 Machine Learning in Credit Risk Modelling and Scorecard Creation


The advancement of machine learning techniques has provided several reliable
alternatives to manual processing for loan default classification and prediction in
credit risk assessment [15]. With the rapid growth of big data in the industry, machine
learning and deep learning are crucial in credit risk modelling to assist commercial banks
in solving financial decision-making problems with the help of financial data [16]. There
are many different artificial intelligence and machine learning methods that have been
adopted for financial decision making to manage large loan portfolios. Examples of
machine learning techniques used for financial decision making are artificial neural
networks, decision trees and support vector machines [17]. These models will be able
to predict loan applicants as either good credit (accepted) or bad credit (rejected) based
on historical data of demographic characteristics such as marital status, age
and income [18].
Comparison of Traditional Methods and Machine Learning Models. Credit risk
assessment is performed using the traditional methods or machine learning. The tra-
ditional methods of credit scoring make decisions based on either subjective scoring
or statistical scoring [19]. Vidal and Barbon [20] mentioned that in subjective scoring,
the decision is mainly based on qualitative judgement whereby the input from the loan
officer and the organization will be used to evaluate the potential borrowers.
In contrast, statistical scoring relies on quantified characteristics of the potential
borrowers and predicts their likelihood of defaulting based on a set of rules and statistical
techniques. There are a broad variety of statistical credit scoring models used to predict
the probability of default, such as Markov chain analysis, decision trees, probit analysis,
logistic regression and linear discriminant analysis [21, 22]. After careful review, it has
been found that logistic regression is widely used in the banking industry to minimize
their credit risk as it is easy to execute and explain. In a study conducted by
Memic [23], the author employed traditional statistical methods such as logistic
regression and multiple discriminant analysis (MDA) for predicting credit default of
companies within the banking market in Bosnia and Herzegovina and its legal entities.
The results indicated that both models have produced excellent predictive accuracy where
logistic regression was found to have slightly better performance. The models have also
identified variables that are significant in predicting credit default. For example, the return on
assets (ROA) variable was found to be statistically significant in logistic regression, with a
higher influence on predicting credit default than the other variables. Obare
et al. [24] applied logistic regression to investigate individual loan defaults in Kenya
with a sample of 1000 loan applicants. Cross validation was then used to evaluate the
prediction results whereby the model achieved an accuracy of 77.27% with the train data
and 73.33% with the test data. The authors also disclosed that increasing the sample size
improves the performance of the logistic regression model, which performed
best with a sample size of 700. Another paper by Foo et al. [25] discussed a credit
600 Y. S. Heng and P. Subramanian

scoring model to predict housing loan defaults in Malaysia. The authors employed
several variations of logistic regression using data acquired from the Malaysian Central
Credit Reference Information System (CCRIS). The variations involved balanced
and unbalanced classes, with and without variable selection. The authors
suggested that all four models yield favorable
results, but logistic regression based on a balanced dataset with variable selection has
obtained a high percentage of correctly classified data and the best sensitivity assuming
a 0.5 cut-off value.
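As a hedged illustration of this statistical baseline, the sketch below fits a logistic regression on synthetic, balanced "loan" data and classifies at a 0.5 cut-off, loosely echoing the setup reported by Foo et al. [25]; the data, features and split are assumptions, not the authors' actual pipeline.

```python
# Minimal logistic-regression scoring sketch on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic "loan" data: 1,000 applicants, 6 numeric features, balanced classes.
X, y = make_classification(n_samples=1000, n_features=6, weights=[0.5, 0.5],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of default for each test applicant, classified at a 0.5 cut-off.
proba = model.predict_proba(X_te)[:, 1]
pred = (proba >= 0.5).astype(int)
accuracy = (pred == y_te).mean()
print(f"test accuracy: {accuracy:.3f}")
```

The coefficients of the fitted model are directly inspectable, which is exactly the interpretability advantage the reviewed studies attribute to logistic regression.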
However, some of the machine learning techniques are reported to generate better
results as compared to statistical techniques. Tsai and Wu [17] stated that machine
learning is superior to the traditional statistical models. This can be supported by
Bellotti and Crook [22], where the authors compared support vector machine (SVM)
against traditional methods such as logistic regression and linear discriminant analysis
to predict the risk of default. The results indicated that SVM with a linear and Gaussian
radial basis function (RBF) kernel produces the best result with an AUC of 0.783 for
both algorithms. Nevertheless, the difference in performance between SVM and the
traditional methods is not significant, but SVM was shown to be useful for feature
selection, identifying important variables in predicting the probability of default. Lee
[26] has also implemented support vector machine (SVM) with RBF kernel in corporate
credit rating problem and utilized 5-fold cross-validation with grid-search technique to
search for the best parameters. The author also compared the SVM's results against
multiple discriminant analysis (MDA), case-based reasoning (CBR) and a three-layer fully
connected back-propagation neural network (BPN); the results show that SVM
outperformed the other methods without overfitting.
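The grid-search procedure described by Lee [26] — an RBF-kernel SVM tuned by 5-fold cross-validation — can be sketched with scikit-learn as follows; the dataset and grid values are illustrative assumptions, not the study's actual corporate-rating data.

```python
# Sketch of 5-fold grid search over an RBF-kernel SVM.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative grid over the penalty C and kernel width gamma.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 0.01]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print("best parameters:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

Cross-validated selection of C and gamma is what guards against the overfitting the author highlights.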
Byanjankar et al. [27] used an artificial neural network to predict the default
probability of peer-to-peer (P2P) loan applicants. Moreover, comparisons have been conducted
between neural network and logistic regression. The result shows that neural network is
effective in identifying default borrowers whereas logistic regression is better in identi-
fying non-default borrowers. Even so, neural network’s result is deemed promising as
it is crucial to forecast default loans in advance to prevent the creditors from investing
in bad applicants. In another P2P credit risk study conducted by Bae et al. [28], online
P2P lending default prediction models were developed using stepwise logistic
regression, classification tree algorithms (CART and C5.0) and a multilayer perceptron (MLP)
to predict loan default. After evaluating the performance of the models with 5-fold
cross-validation, the results reveal that MLP has the highest validation average accu-
racy, 81.78%, whereas logistic regression has the lowest validation average accuracy,
61.63%.
Moreover, Chandra Blessie and Rekha [29] have proposed a loan default prediction
based on Logistic Regression, Decision Tree, Support Vector Machine and Naïve Bayes.
The result indicated that the Naïve Bayes classifier is highly efficient and gave a superior
result to the other classifiers. Aside from that, data cleaning, feature engineering and
exploratory data analysis (EDA) were conducted before training the model. Features
studied during EDA include applicant income, co-applicant income, loan amount,
credit history, loan status, gender, relationship status, education status and property
area. In yet another study, Mafas developed a predictive model for loan default prediction
in peer-to-peer lending communities using Logistic Regression, Random Forest,
and linear SVM with a selected feature set, where Random Forest outperformed the others and
achieved an accuracy of 92%. The fittest feature subset was obtained using a
Genetic Algorithm and evaluated using a Logistic Regression model [30].
After careful review, it is clear that machine learning models can easily work
with large datasets and generate highly accurate predictions, whereas statistical
techniques are simpler and more user friendly, which keeps them popular
in the financial industry. Care must also be taken during model fitting to avoid
overfitting, which would defeat the purpose of the study. This section discussed the performance of
individual statistical and machine learning models; newer research also experiments
with ensemble models, sometimes using a stacking approach.
Ensemble Model vs Individual Model. Aside from individual models, some
researchers have reported that using ensemble models can yield better accuracy compared
to individual models. Yao [31] experimented with a single Decision Tree and two
ensemble learning algorithms, AdaBoost and Bagging (Bootstrap Aggregation),
with Decision Tree as the base algorithm, to predict the creditworthiness of applicants
on the Australian credit dataset. The result indicates that the ensemble learner,
AdaBoost CART with 14 features, produced better results than a single Decision Tree
without much added complexity. Likewise, Xu et al. [32] adopted an ensemble model
with a different approach: an ensemble technique of support vector machines (SVM)
for credit risk assessment on the Australian and German datasets. The authors
experimented with voting ensembles based on single SVMs and four SVM-based ensemble
models with four different kernel functions (polynomial, linear, RBF and sigmoid)
against individual SVM models. Principal Component Analysis (PCA) was applied
before training the model to select credit features, and five-fold cross-validation
was used for model validation. The results show that the ensemble model of SVM performed better
than the individual SVM classifiers. Furthermore, the authors also suggested that the
use of ensemble models for credit risk assessment is promising for improving prediction
performance.
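The kind of voting ensemble of SVMs with different kernel functions that Xu et al. [32] investigated can be sketched as follows; the synthetic data and default hyper-parameters are assumptions, not the authors' exact configuration.

```python
# Hard-voting ensemble of SVMs with four different kernels.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=1)

ensemble = VotingClassifier(
    estimators=[
        ("poly", SVC(kernel="poly")),
        ("linear", SVC(kernel="linear")),
        ("rbf", SVC(kernel="rbf")),
        ("sigmoid", SVC(kernel="sigmoid")),
    ],
    voting="hard",  # majority vote across the four kernel SVMs
)

scores = cross_val_score(ensemble, X, y, cv=5)
print(f"five-fold CV accuracy: {scores.mean():.3f}")
```

The majority vote tends to damp the errors of any single weak kernel, which is the intuition behind the ensemble's reported edge over individual SVM classifiers.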
Madaan et al. [33] proposed using Random Forest and Decision Tree to assess indi-
vidual loans based on their attributes. The authors had also conducted exploratory data
analysis to get acquainted with the dataset and performed data pre-processing. The data
are then split into training (70%) and testing (30%) sets, on which the selected algorithms
are trained. The results of the classification report show that Random
Forest outperforms Decision Tree with an accuracy score of 80% and 73% respectively.
Zhu et al. [34] also proposed Random Forest classification, but in a
different scenario: predicting loan default on a P2P online lending platform,
comparing it against other machine learning methods such as Decision Tree, Support
Vector Machine (SVM) and Logistic Regression. The results indicated that Random Forest
classification performs significantly better in identifying loan defaults. The authors
overcame the challenge of imbalanced classes in the dataset by applying the SMOTE
(Synthetic Minority Oversampling Technique) method, which can generate new samples for
the minority class. Furthermore, the authors also suggested that using larger datasets and
fine-tuning the models could potentially improve model accuracy in future research.
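The SMOTE idea applied by Zhu et al. [34] — synthesizing minority-class samples by interpolating between a minority sample and one of its minority-class nearest neighbours — can be illustrated with a minimal NumPy sketch. The imbalanced-learn library provides a production implementation; everything below is a simplified assumption.

```python
# Minimal SMOTE-style oversampling: interpolate between minority samples.
import numpy as np

rng = np.random.default_rng(0)

def smote_like(X_min: np.ndarray, n_new: int, k: int = 3) -> np.ndarray:
    """Generate n_new synthetic minority samples by linear interpolation."""
    new_rows = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # k nearest minority neighbours of sample i (excluding itself).
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        new_rows.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new_rows)

X_min = rng.normal(size=(20, 4))        # 20 minority samples, 4 features
X_synth = smote_like(X_min, n_new=80)   # e.g. to balance 100 majority rows
print(X_synth.shape)  # (80, 4)
```

Because the synthetic rows lie on segments between real minority samples, the classifier sees a denser minority region rather than exact duplicates, which is what distinguishes SMOTE from naive oversampling.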
Another P2P loan default prediction was conducted by Li et al. [35] based on XGBoost,
Logistic Regression and Decision Tree. The result indicated that the predictive accuracy
of the XGBoost technique (97.705%) outperforms the other models under five-fold cross-validation.
Other performance measures were also compared, such as AUC value, classification
error rate, model robustness and model run time. The results show that although
XGBoost has the best robustness and least error rate, the run time of the XGBoost is the
slowest compared to the other models. However, the authors mention that XGBoost is
drastically better than traditional models in nearly all aspects. Moreover, the authors also
visualized the top ten features that have the most significant influence on loan default
rates based on the XGBoost classifier.
Zhao et al. [36] suggested using an ensemble learning classification model, adaptive
boosting (AdaBoost) with decision trees, for the credit scoring problem. Ten-fold cross-validation
was performed to assess and compare the performance of AdaBoost-DT,
Decision Tree, and Random Forest. The results show that the AdaBoost-DT model yields
the highest accuracy. Moreover, the authors also recommended experimenting with
parameter optimization methods in future research. Udaya Bhanu and Narayana [37]
proposed using random forest, logistic regression, decision tree, K-nearest neighbor,
and Support Vector Machine for customer loan prediction. The authors also preprocessed
the data and applied feature engineering techniques to enhance the performance of
the machine learning algorithms. The comparative study shows that Random Forest achieves
the best accuracy (82%) in classifying loan candidates, with an excellent F1-score.
In addition to the above models, LightGBM is a recently popular machine learning
algorithm that uses a histogram-based algorithm and a leaf-wise growth strategy with depth
limitation. A LightGBM model has been used to predict the financing risk profiles of 186
enterprises, where the researchers conducted comparison experiments with the k-nearest-neighbors
algorithm, decision tree algorithm, and random forest algorithm on the same
data set. The experiments show that LightGBM has better prediction results than the
other three algorithms for several metrics in corporate financing risk prediction [38].
The reviewed literature has shown that ensemble models perform better compared
to individual models. However, little attention has been given to the voting ensemble
model, a technique that combines the classifiers of different machine learning
algorithms and is worth further investigation. A general consensus for machine
learning models, whether individual or ensemble, is to address data quality issues,
handle imbalanced classes and tune hyperparameters in order to improve the performance
of the model.

Explainable Artificial Intelligence (XAI). The implementation of machine learning
algorithms for model building within the credit risk industry faces doubt and distrust
mainly due to the lack of transparency in terms of output predictions. According to
Dong et al. [39], models such as support vector machines and neural networks lack
interpretability and are often portrayed as 'black box' models. This is primarily because
the output results are not clearly explainable to general audiences, leaving banks
unable to give clear reasons for rejecting a loan. This issue is also raised in recent
studies. Hadji Misheva et al. [40] stipulated that complex machine learning has proven to
have high predictive accuracy in assessing customer credit risk. Still, these innovative and
advanced machine learning algorithms lack transparency that is essential to comprehend
the reason behind the rejection and approval of an individual’s loan application. The
author also added that it is tough to trace back to the steps that an algorithm took
to arrive at its decision as these models are developed directly from the data by an
algorithm. The lack of credibility, trust and explainability are the major challenges faced
by many researchers when introducing machine learning based models to companies in
the credit scoring field [41]. Thus, ‘black box’ models are deemed to be less suitable in
financial services due to the lack of interpretation. Even though machine learning
models improve over time and generate excellent predictive results, many financial
institutions are still reluctant to fully trust the predictive model.

One of the potential solutions is to incorporate transparent models, i.e. statistical
models such as linear models or decision trees. Despite their high
interpretability, such models can yield low predictive accuracy. Conversely, complex
machine learning models like neural networks give high predictive accuracy but limited
interpretability [42]. To overcome this problem while retaining the freedom to adopt
complex machine learning algorithms, explainable artificial intelligence (XAI) should be
incorporated to interpret the predictions made by the machine learning model and break
out from the nature of the ‘black-box’ concept. This method not only allows humans
to understand the output decision of the model, but it can also allow humans to trust
the results of complex machine learning models and eliminate any doubts. Some of the
popular XAI techniques commonly used are LIME and SHAP.
The explanation models can be classified into global methods and local methods.
Global methods aim to provide a general explanation of a black-box model’s behavior
by using the overall knowledge of the model, training, and the associated data. For
instance, feature importance will determine the top features that contribute the most in
predicting the outcome. On the other hand, local methods are responsible for explaining
a single outcome or instance of the black-box model. The single prediction performed
by the model can be explained by creating local surrogate models that are interpretable
and thereby exposing how a black-box model works [42, 43]. Hadji Misheva et al. [40]
mention that LIME is used to obtain local explanations, whereas SHAP can be used to
obtain both local and global explanations.
LIME, which stands for Local Interpretable Model-agnostic Explanations, is a
post-hoc model-agnostic explanation method that seeks to approximate any black-box
machine learning model with an interpretable model to explain a single prediction. The
author has also mentioned that LIME is a novel approach that explains the prediction of
any classifier regardless of the algorithm. LIME will describe the model using a linearly
weighted combination of the input features to provide the explanations. Conversely,
SHAP, known as SHapley Additive exPlanations, interprets predictions based on coali-
tional game theory. It will return Shapley values that indicate how to fairly distribute the
‘payout’ (i.e. The prediction) among the features. Moreover, SHAP can provide a robust
and insightful measure of feature importance of a model in a summary plot whereby
Shapley value will represent the impact of the features on model output [40, 41]. Some
of the recent works have adopted LIME and SHAP in credit risk problems to explain
the decision made by the machine learning model.
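LIME's core mechanism — perturb an instance, weight the perturbations by proximity, and fit an interpretable surrogate to the black-box outputs — can be sketched as follows. The "black box" here is a toy quadratic model and the proximity kernel is a simple Gaussian; the lime library implements the full method, so treat this only as an illustration of the idea.

```python
# LIME-style local surrogate: weighted linear fit around one instance.
import numpy as np

rng = np.random.default_rng(42)

def black_box(X: np.ndarray) -> np.ndarray:
    """Stand-in for an opaque credit model (we only query its predictions)."""
    return X[:, 0] ** 2 + 3 * X[:, 1]

x0 = np.array([1.0, 2.0])                      # the instance to explain
Z = x0 + rng.normal(scale=0.1, size=(500, 2))  # perturbations around x0
w = np.exp(-np.sum((Z - x0) ** 2, axis=1))     # proximity kernel weights

# Weighted least squares: fit y ~ b0 + b1*z1 + b2*z2 locally around x0.
A = np.column_stack([np.ones(len(Z)), Z]) * np.sqrt(w)[:, None]
b = black_box(Z) * np.sqrt(w)
coef, *_ = np.linalg.lstsq(A, b, rcond=None)

# The surrogate's weights are the local explanation: near x0 = (1, 2) the
# black box behaves approximately like 2*z1 + 3*z2.
print(coef[1], coef[2])
```

The recovered coefficients approximate the black box's local gradient, which is exactly the "linearly weighted combination of the input features" described above.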
Provenzano et al. [44] implemented SHAP and LIME techniques to explain the
predictions of a high-performing LightGBM classifier that obtains 95% accuracy in
default classification. The authors stated that adopting SHAP and LIME helped in
understanding the important features determining an individual result, thereby
increasing confidence in the model. Another study, by Visani et al. [45],
compared a statistical model, Logistic Regression, against a machine learning model,
Gradient Boosting Trees, on credit risk data, where LIME was tested on the machine
learning model to check its stability. It is reported that the Gradient Boosting model
outperformed Logistic Regression and that LIME is a stable and reliable technique
when applied to the machine learning model.
Hadji Misheva et al. [40] also adopted both XAI techniques, LIME and SHAP,
in machine learning based credit scoring models on the Lending Club dataset. The models
the authors trained include logistic regression, XGBoost, Random Forest, SVM and
neural networks. The authors implemented LIME, as shown in Fig. 2, to explain
local instances on SVM and tree-based models (XGBoost and Random Forest) whereas
SHAP, as shown in Fig. 3, was used to obtain global explanations. The results of the
study imply that both LIME and SHAP offer reliable explanation in line with financial
reasoning. The authors also mention that SHAP is a powerful and effective technique
for highlighting feature importance, but that it can take a very long time to generate the
results. This is supported by Phaure and Robin [46] in their study of model explainability
in credit risk management, where the authors indicated that the computational time of
the SHAP method is proportional to the number of features and observations and to the
complexity of the model.
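The Shapley values underlying SHAP can be computed exactly on a toy model, which also illustrates why the computation time grows with the number of features: the exact formula enumerates every feature coalition (2^n of them). The feature names, toy value function and interaction term below are illustrative assumptions, not a real credit model.

```python
# Exact Shapley values for a tiny 3-feature toy model.
from itertools import combinations
from math import factorial

FEATURES = ["income", "age", "loan_amount"]

def value(coalition: frozenset) -> float:
    """Toy model output when only the features in `coalition` are 'present'."""
    v = 0.0
    if "income" in coalition:
        v += 30.0
    if "age" in coalition:
        v += 10.0
    if "loan_amount" in coalition:
        v -= 20.0
    if {"income", "loan_amount"} <= coalition:
        v += 5.0  # interaction term
    return v

def shapley(feature: str) -> float:
    """Weighted average marginal contribution of `feature` over all coalitions."""
    others = [f for f in FEATURES if f != feature]
    n = len(FEATURES)
    total = 0.0
    for k in range(len(others) + 1):
        for coal in combinations(others, k):
            s = frozenset(coal)
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            total += weight * (value(s | {feature}) - value(s))
    return total

phi = {f: shapley(f) for f in FEATURES}
# Efficiency property: the Shapley values sum to the full prediction
# minus the empty-coalition baseline (here 25.0 - 0.0).
assert abs(sum(phi.values()) - (value(frozenset(FEATURES)) - value(frozenset()))) < 1e-9
print(phi)
```

This "fair distribution of the payout" is the property the text describes; SHAP's practical contribution is computing good approximations of these values without enumerating all coalitions.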

Fig. 2. XGBoost model with LIME explanation on a customer that classified as a ‘default’ loan
type [40]
Fig. 3. Summary plot - XGBoost model with SHAP tree explainer [40]

In short, introducing XAI techniques can help improve the explainability and trans-
parency of the black-box model rather than relying solely on machine learning output for
decision making. XAI will not only help eliminate bias, but can also assist in establishing
trust and regulatory compliance in financial institutions to ensure fairness
in loan lending. Therefore, XAI techniques, specifically LIME, should be adopted to
explain the credit decision of the black-box model.

Credit Scorecards. The banking industry uses credit scorecards as a tool for risk man-
agement. Credit scorecards consist of a group of features that are widely used to predict
the default probabilities such as classifying good and bad credit risk. There are vari-
ous techniques used in the development of scorecards such as support vector machine,
genetic programming, artificial neural networks, multiple classifier systems, hybrid mod-
els, logistic regression, classification tree, linear regression and linear programming [39,
47]. Moreover, Dong et al. [39] stipulated that generating credit scorecards will poten-
tially contribute to effective credit risk management. The authors added that the quality
of a credit scorecard can be measured, for example using the Percentage Correctly Classified
(PCC), to assess the accuracy of its predictions.
Fig. 4. Example of credit scorecard [48]

Figure 4 shows an example of a credit scorecard used to evaluate the creditworthiness
of a loan applicant. For instance, features such as age, cards, ec_card, income, and
status each will be assigned points based on statistical analysis. The sum of the points
accumulated will be the final score of the loan applicant. Therefore, the banks can easily
decide which loan should be accepted or rejected. For example, the bank can choose to
reject the loan application or charge them a higher interest rate if the applicant scores
below a certain range as they possess a greater risk. Hence, a credit scorecard will
facilitate a better decision-making process for the financial institution.
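A points-based scorecard of the kind shown in Fig. 4 can be sketched as a simple lookup-and-sum. The point bands and cut-off below are hypothetical, not taken from the figure.

```python
# Hypothetical points-based scorecard: sum the points earned per feature,
# then compare the total with a cut-off.
SCORECARD = {
    "age":     [(0, 25, 10), (25, 40, 25), (40, 120, 35)],    # (low, high, points)
    "income":  [(0, 2000, 5), (2000, 5000, 20), (5000, 1e9, 40)],
    "n_cards": [(0, 2, 15), (2, 100, 5)],
}
CUTOFF = 60  # reject applications scoring below this threshold

def score(applicant: dict) -> int:
    """Sum the points each feature value earns under the scorecard."""
    total = 0
    for feature, bands in SCORECARD.items():
        x = applicant[feature]
        for low, high, points in bands:
            if low <= x < high:
                total += points
                break
    return total

applicant = {"age": 34, "income": 3200, "n_cards": 1}
total = score(applicant)
print(total, "accept" if total >= CUTOFF else "reject")  # 60 accept
```

In practice the points per band are derived statistically (e.g. from logistic-regression coefficients via weight-of-evidence binning) rather than set by hand as here.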

3 Related Works
This section will compare and analyze different credit risk models and software that are
fully developed and currently available in the market. Most of the credit risk models
developed are marketed towards medium and large sized companies such as banks and
enterprise creditors. Their goal is to assist companies that purchase their system in
determining the creditworthiness of potential borrowers and minimizing loan defaults.
With timelier and more accurate predictions, lenders can use the results generated to negotiate
with the borrowers. As part of the research, comparisons will be conducted between three
different commercial systems to understand their structures and functionalities. The three
systems selected in this study are GiniMachine, ABLE Scoring and ZAML.

3.1 GiniMachine
GiniMachine is an AI-driven credit scoring software that can help lenders make reliable
credit decisions within a short amount of time and the logo of GiniMachine can be seen
Fig. 5. GiniMachine logo [49]

in Fig. 5 [49]. This system employs machine learning for automated decision-making
and is effective even for thin-file borrowers. Thus, banks and fintech companies
can identify bad loans and avoid unwanted risk without relying on traditional credit scoring
or manual work, which has many shortcomings. For instance, GiniMachine, being
based on AI technologies, can analyze parameters that traditional methods tend to ignore.
Furthermore, GiniMachine can easily adapt to changing environments, fitting
into specific businesses and risk assessment rules. For example, if the company
releases a new loan product, the system can process the information of the new loan
product and adjust to the needs of the lenders. The system will also generate
detailed reports, as shown in Fig. 6, that consist of statistical calculations regarding
the decisions made by the model. Moreover, the system is easy to use, being designed
specifically for non-technical individuals, so no specific training
is required to operate it.

Fig. 6. GiniMachine’s scoring details [49]

3.2 ABLE Scoring

ABLE Scoring is another powerful credit scoring software that will assist in making
credit decisions to prevent bad loans and the logo of ABLE Scoring can be seen in Fig. 7.
Fig. 7. ABLE scoring’s company logo [50]

Scorecards along with credit decisions can be easily generated via the scorecard builder,
as shown in Fig. 8. Moreover, ABLE Scoring allows lenders to score potential borrowers
in batches which will save a lot of time. Different machine learning models can be built
including the classical logistic regression model. Each of the models can then be
compared and evaluated in terms of performance and stability. Furthermore, the
result of the credit decision will be explained in the scorecards generated, as shown in
Fig. 9, which will help the lenders to better understand the output decision made by
the machine learning model to eliminate any doubts. It will also check for data formats,
consistency, and missing values to ensure the data is in high quality. The software is easy
to use without any specific training required. The users will just need to upload an XLS
file format to generate a scorecard report. ABLE Scoring promotes fast and smart credit
decisions based on AI models and it ensures a stable and high-quality lending process.
This software is trusted by banks and fintech companies such as Eurasian Bank,
OTP Bank and Alfa Bank [50].

Fig. 8. ABLE scoring’s scorecard builder [50]


Fig. 9. ABLE scoring’s scorecard generation [50]

Fig. 10. Zest AI logo [51]

3.3 Zest AI
Zest AI is yet another robust machine learning software that assists lenders and under-
writers to make better, more timely and transparent credit decisions. The logo of Zest
AI is shown in Fig. 10. Zest AI also aims to address the problems of traditional credit
scoring tools, such as gaps, errors or structural inequities that lead to the rejection of
good applicants [52]. With Zest AI, lenders can easily identify good borrowers and safely
increase loan approvals while minimizing the risk and losses. Besides, Zest AI provides
a bigger picture of every borrower with full interpretability to comply with the strictest
regulators and satisfy doubters [51]. For example, the custom-built logistic regression
scorecards in Zest AI will be used to assess the creditworthiness of the borrowers to help
lenders in their decision making. Figure 11 shows a sample of the scorecards generated
with Zest AI:

Fig. 11. Zest AI scorecard generation [46]


Most importantly, it is a stable software that offers rapid analysis to help lenders
make quick business decisions and ensure fairness in lending operations. Thus, this
will potentially improve customer experience and make a positive impact on lending
businesses. Furthermore, the software owners can also rest assured as Zest AI offers
smooth transition and adoption from traditional credit scoring tools with professional
support. In addition, the software is also user-friendly whereby it can be operated by
non-technical staff without prior machine learning background. Zest AI is also being
recognized by one of the largest banks in Turkey, Akbank. Akbank has found Zest AI
software extremely effective in identifying good borrowers with minimal risks. Akbank
managed to reduce non-performing loans by 45% and cut the time needed to retrain and
rebuild its models with Zest AI, a process that initially took seven months [53]. Besides, Zest
AI can adapt to changing requirements which further increases the confidence of their
client. Thus, the adoption of Zest AI can promote sustainable growth among banks and
other financial institutions in their lending businesses.

3.4 Evaluation of Related Works

The comparisons between the related work are essential to understand the attributes
of the fully developed credit scoring systems. Moreover, new ideas and opportunities
can be triggered by analyzing the existing systems, which will benefit future research.
Table 1 shows the comparisons between different credit risk systems that are currently
available in the market:

Table 1. Comparison of related work

Attributes  | GiniMachine                                        | ABLE Scoring                                                                  | Zest AI
Features    | Automated credit scoring empowered with AI and ML  | One-button solution to build scorecards for credit decisions with AI models and score customers in batches | Employs AI models to make smart lending decisions
Purpose     | Avoid bad and non-performing loans                 | Ensure a continuous and high-quality lending process                          | Faster loan decisions and fairness in lending
Benefits    | Easy to use, saves time, adaptable to changing environments | Easy to use, customizable, stable, transparent (explainable) results    | Saves time and resources, easy to operate, complies with regulators
Target user | Non-technical credit analysts/lenders              | Banks and fintech companies                                                   | Banks and lending companies
Cost        | Paid                                               | Paid                                                                          | Paid
Demo        | Free demo available                                | Demo provided upon request                                                    | Demo provided upon request
Based on the analysis conducted, all the systems are built to ensure faster, fair, and
high-quality loan lending. This is because their target users are mostly banks and other
financial institutions whereby the primary goal is to mitigate credit risk and avoid bad
loans. The systems are also user-friendly, especially for non-technical staff to operate the
system without much training needed. Moreover, it is also important for the output result
to be transparent to comply with the regulators. However, it is noted that all three systems
solely focus on predicting the output and have no dashboard to visualize the trends of
loan customers. This presents an opportunity for the developer to include a
dashboard in the web application that visualizes the trends of loan customers.

4 Conclusion and Future Direction


Intensive research has been conducted via Google Scholar and the APU E-Database to
better understand the credit risk field and the machine learning techniques employed to
solve the underlying credit risk problem. The findings show that machine learning tech-
niques, especially ensemble models, perform extremely well in identifying loan defaults
which can potentially minimize future credit default risk. It is also noticeable that recent
research is centered around ensemble learning such as Random Forest, XGBoost and
AdaBoost. There are many papers focusing on applying machine learning algorithms to
solve credit risk problems, such as predicting the likelihood of loan defaults. Some of the
papers have compared the performance between statistical methods and artificial intelli-
gence methods. The findings have indicated that artificial intelligence methods produce
better classification accuracy as compared to statistical methods. However, in terms of
interpretability and simplicity, statistical methods are a better choice as compared to arti-
ficial intelligence methods. Furthermore, the research has moved towards a new era of
machine learning, explainable artificial intelligence (XAI), that can uncover and explain
a black-box machine learning model. Most importantly, implementing XAI in credit
risk models will allow humans to better understand the predictions made by the machine
learning models whilst establishing trust and compliance with regulatory requirements
within the financial institution. The adoption of XAI, such as LIME and SHAP, helps
improves the transparency of loan lending while speeding up the loan lending process,
which is a more robust approach than the traditional lending procedures.
Future research could study credit risk in commercial banks and build machine
learning models to be used by credit analysts to identify and predict loan defaults,
with the intention of assisting them in better decision making and in evaluating the
profiles of potential borrowers, whilst minimizing future credit default risk and
preventing a recurrence of the global financial crisis. This could also be extended
to gathering more diverse and non-conventional data to enhance banks' approaches to
assessing credit risk. Furthermore, future research should also explore other available
XAI techniques, such as Shapash or Dalex, that are likewise compatible with many
machine learning frameworks used in credit risk prediction. More focus on the comparison
of XAI models that support both local and global explanations will bring additional
value to the credit risk industry.
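The local-surrogate idea behind LIME, mentioned above, can be illustrated with a minimal, self-contained sketch: perturb one applicant's features, query the black-box scorer, and fit a distance-weighted linear model whose coefficients serve as the local explanation. This is an illustration only, not code from any of the systems reviewed here; the `credit_model` scorer and its two features (debt ratio, income) are hypothetical stand-ins.

```python
import math
import random

def credit_model(debt_ratio, income):
    # Hypothetical black-box scorer: probability of default.
    z = 4.0 * debt_ratio - 0.00005 * income - 1.0
    return 1.0 / (1.0 + math.exp(-z))

def solve(a, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def lime_explain(instance, scales, n_samples=500, kernel_width=1.0, seed=0):
    # Sample perturbations around the instance, query the black box, and fit
    # a locally weighted linear surrogate (intercept plus one coefficient per
    # feature) by weighted least squares via the normal equations.
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        pert = [v + rng.gauss(0.0, s) for v, s in zip(instance, scales)]
        dist = math.sqrt(sum(((p - v) / s) ** 2
                             for p, v, s in zip(pert, instance, scales)))
        weights.append(math.exp(-(dist ** 2) / kernel_width ** 2))
        rows.append([1.0] + pert)
        targets.append(credit_model(*pert))
    k = len(instance) + 1
    ata = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
            for j in range(k)] for i in range(k)]
    atb = [sum(w * r[i] * t for r, t, w in zip(rows, targets, weights))
           for i in range(k)]
    return solve(ata, atb)  # [intercept, coef_debt_ratio, coef_income]

coefs = lime_explain([0.6, 30000.0], scales=[0.05, 5000.0])
```

For this hypothetical applicant a positive debt-ratio coefficient and a negative income coefficient indicate, locally, that higher indebtedness raises and higher income lowers the predicted default probability — the kind of feature attribution a credit analyst would inspect under an XAI workflow.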
612 Y. S. Heng and P. Subramanian

References
1. Hani, A.: Credit cards, personal loans landing Malaysians in debt trap (2019). https://themalaysianreserve.com/2019/08/08/credit-cards-personal-loans-landing-malaysians-in-debt-trap/
2. Bank Negara Malaysia: Monthly Highlights and Statistics in July 2021 (2021). https://www.bnm.gov.my/-/monthly-highlights-and-statistics-in-july-2021
3. Brock: Credit Risk (2021). https://www.investopedia.com/terms/c/creditrisk.asp
4. Goyal, K.A., Agrawal, S.: Risk management in Indian banks: some emerging issues. Int. J. Econ. Res. 1(1), 102–109 (2010)
5. Chenghua, S., Kui, Z.: Study on commercial bank credit risk based on information asymmetry. In: 2009 International Conference on Business Intelligence and Financial Engineering, BIFE 2009, pp. 758–761 (2009). https://doi.org/10.1109/BIFE.2009.175
6. Li, H., Pang, S.: The study of credit risk evaluation based on DEA method. In: Proceedings of the 2010 International Conference on Computational Intelligence and Security, CIS 2010, pp. 81–85 (2010). https://doi.org/10.1109/CIS.2010.25
7. Rehman, Z.U., Muhammad, N., Sarwar, B., Raz, M.A.: Impact of risk management strategies on the credit risk faced by commercial banks of Balochistan. Financ. Innov. 5(1) (2019). https://doi.org/10.1186/s40854-019-0159-8
8. Kolapo, T.F., Ayeni, R.K., Oke, M.O.: Credit risk and commercial banks' performance in Nigeria: a panel model approach. Aust. J. Bus. Manag. Res. 2(02), 31–38 (2012)
9. Chen, K.-C., Pan, C.-Y.: An empirical study of credit risk efficiency of banking industry in Taiwan. Web J. Chin. Manag. Rev. 15(1), 1–17 (2012). http://cmr.ba.ouhk.edu.hk
10. Waemustafa, W., Sukri, S.: Bank specific and macroeconomics dynamic determinants of credit risk in Islamic banks and conventional banks. Int. J. Econ. Financ. Issues 5(2), 476–481 (2015). https://doi.org/10.6084/m9.figshare.4042992
11. Bhattarai, Y.R.: The effect of credit risk on Nepalese commercial banks. NRB Econ. Rev. 28(1), 41–64 (2016). https://nrb.org.np/ecorev/articles/vol28-1_art3.pdf
12. Pradhan, S., Shah, A.K.: Credit risk management of commercial banks in Nepal. J. Bus. Soc. Sci. Res. 4(1), 27–37 (2019). https://doi.org/10.3126/jbssr.v4i1.28996
13. Basel Committee on Banking Supervision: An Explanatory Note on the Basel II IRB Risk Weight Functions (2005). www.bis.org/bcbs/irbriskweight.pdf
14. Psillaki, M., Tsolas, I.E., Margaritis, D.: Evaluation of credit risk based on firm performance. Eur. J. Oper. Res. 201(3), 873–881 (2010). https://doi.org/10.1016/j.ejor.2009.03.032
15. Lai, L.: Loan default prediction with machine learning techniques. In: Proceedings of the 2020 International Conference on Computer Communication and Network Security, CCNS 2020, pp. 5–9 (2020). https://doi.org/10.1109/CCNS50731.2020.00009
16. Addo, P.M., Guegan, D., Hassani, B.: Credit risk analysis using machine and deep learning models. SSRN Electron. J. (2018). https://doi.org/10.2139/ssrn.3155047
17. Tsai, C.F., Wu, J.W.: Using neural network ensembles for bankruptcy prediction and credit scoring. Expert Syst. Appl. 34(4), 2639–2649 (2008). https://doi.org/10.1016/j.eswa.2007.05.019
18. Chen, M., Huang, S.: Credit scoring and rejected instances reassigning through evolutionary computation techniques. Expert Syst. Appl. 24(4), 433–441 (2003). https://doi.org/10.1016/S0957-4174(02)00191-4
19. Schreiner, M.: Scoring: the next breakthrough in microcredit. In: CGAP, no. 7, pp. 1–64 (2003)
20. Vidal, M.F., Barbon, F.: Credit Scoring in Financial Inclusion. CGAP, July 2019
21. Eddy, Y.L., Engku Abu Bakar, E.M.N.: Credit scoring models: techniques and issues. J. Adv. Res. Bus. Manag. Stud. 7(2), 29–41 (2017). https://www.akademiabaru.com/submit/index.php/arbms/article/view/1240
Review of Machine Learning and XAI in Credit Risk Modelling 613

22. Bellotti, T., Crook, J.: Support vector machines for credit scoring and discovery of significant features. Expert Syst. Appl. 36(2), 3302–3308 (2009). https://doi.org/10.1016/j.eswa.2008.01.005
23. Memic, D.: Assessing credit default using logistic regression and multiple discriminant analysis: empirical evidence from Bosnia and Herzegovina. Interdiscip. Descr. Complex Syst. 13(1), 128–153 (2015). https://doi.org/10.7906/indecs.13.1.13
24. Obare, D.M., Njoroge, G.G., Muraya, M.M.: Analysis of individual loan defaults using logit under supervised machine learning approach. Asian J. Probab. Stat. 3(4), 1–12 (2019). https://doi.org/10.9734/ajpas/2019/v3i430100
25. Foo, L.K., Chua, S.L., Chin, D., Firdaus, M.K.: Logistic regression models for Malaysian housing loan default prediction (2017). https://www.bnm.gov.my/documents/20124/826852/WP11+-+Logistic+Regression.pdf/d22ef5a2-4bdb-4d39-28f3-c19253d2814e?t=1585030599211
26. Lee, Y.C.: Application of support vector machines to corporate credit rating prediction. Expert Syst. Appl. 33(1), 67–74 (2007). https://doi.org/10.1016/j.eswa.2006.04.018
27. Byanjankar, A., Heikkila, M., Mezei, J.: Predicting credit risk in peer-to-peer lending: a neural network approach. In: Proceedings of the 2015 IEEE Symposium Series on Computational Intelligence, SSCI 2015, pp. 719–725 (2015). https://doi.org/10.1109/SSCI.2015.109
28. Bae, J.K., Lee, S.Y., Seo, H.J.: Predicting online peer-to-peer (P2P) lending default using data mining techniques. J. Soc. E-bus. Stud. 23(3), 1–6 (2018)
29. Chandra Blessie, E., Rekha, R.: Exploring the machine learning algorithm for prediction the loan sanctioning process. Int. J. Innov. Technol. Explor. Eng. 9(1), 2714–2719 (2019). https://doi.org/10.35940/ijitee.A4881.119119
30. Victor, L., Raheem, M.: Loan default prediction using genetic algorithm: a study within peer-to-peer lending communities. Int. J. Innov. Sci. Res. Technol. 6(3) (2021). ISSN 2456-2165
31. Yao, P.: Credit scoring using ensemble machine learning. In: Proceedings of the 2009 9th International Conference on Hybrid Intelligent Systems, HIS 2009, vol. 3, pp. 244–246 (2009). https://doi.org/10.1109/HIS.2009.264
32. Xu, W., Zhou, S., Duan, D., Chen, Y.: A support vector machine based method for credit risk assessment. In: Proceedings of the IEEE International Conference on e-Business Engineering, ICEBE 2010, pp. 50–55 (2010). https://doi.org/10.1109/ICEBE.2010.44
33. Madaan, M., Kumar, A., Keshri, C., Jain, R., Nagrath, P.: Loan default prediction using decision trees and random forest: a comparative study. IOP Conf. Ser. Mater. Sci. Eng. 1022(1), 1–12 (2021). https://doi.org/10.1088/1757-899X/1022/1/012042
34. Zhu, L., Qiu, D., Ergu, D., Ying, C., Liu, K.: A study on predicting loan default based on the random forest algorithm. Procedia Comput. Sci. 162, 503–513 (2019). https://doi.org/10.1016/j.procs.2019.12.017
35. Li, Z., Li, S., Li, Z., Hu, Y., Gao, H.: Application of XGBoost in P2P default prediction. J. Phys. Conf. Ser. 1871(1) (2021). https://doi.org/10.1088/1742-6596/1871/1/012115
36. Zhao, J., Wu, Z., Wu, B.: An AdaBoost-DT model for credit scoring. In: WHICEB 2021 Proceedings, vol. 15 (2021)
37. Udaya Bhanu, L., Narayana, D.S.: Customer loan prediction using supervised learning technique. Int. J. Sci. Res. Publ. 11(6), 403–407 (2021). https://doi.org/10.29322/ijsrp.11.06.2021.p11453
38. Wang, D.N., Li, L., Zhao, D.: Corporate finance risk prediction based on LightGBM. Inf. Sci. 602, 259–268 (2022). https://doi.org/10.1016/j.ins.2022.04.058
39. Dong, G., Kin, K.L., Yen, J.: Credit scorecard based on logistic regression with random coefficients. Procedia Comput. Sci. 1(1), 2463–2468 (2010). https://doi.org/10.1016/j.procs.2010.04.278

40. Hadji Misheva, B., Hirsa, A., Osterrieder, J., Kulkarni, O., Fung Lin, S.: Explainable AI in credit risk management. SSRN Electron. J., 1–16 (2021). https://doi.org/10.2139/ssrn.3795322
41. El Qadi, A., Diaz-Rodriguez, N., Trocan, M., Frossard, T.: Explaining credit risk scoring through feature contribution alignment with expert risk analysts, pp. 1–12 (2021). http://arxiv.org/abs/2103.08359
42. Wijnands, M.: Explaining black box decision-making. University of Twente (2021)
43. Confalonieri, R., Coba, L., Wagner, B., Besold, T.R.: A historical perspective of explainable artificial intelligence. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 11(1), 1–21 (2021). https://doi.org/10.1002/widm.1391
44. Provenzano, A.R., et al.: Machine learning approach for credit scoring (2020). http://arxiv.org/abs/2008.01687
45. Visani, G., Bagli, E., Chesani, F., Poluzzi, A., Capuzzo, D.: Statistical stability indices for LIME: obtaining reliable explanations for machine learning models. J. Oper. Res. Soc., 1–18 (2020). https://doi.org/10.1080/01605682.2020.1865846
46. Phaure, H., Robin, E.: Explain artificial intelligence for credit risk management. Deloitte, April 2020
47. Bequé, A., Coussement, K., Gayler, R., Lessmann, S.: Approaches for credit scorecard calibration: an empirical analysis. Knowl. Based Syst. 134, 213–227 (2017). https://doi.org/10.1016/j.knosys.2017.07.034
48. Siddiqi, N.: Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. Wiley, Hoboken (2006)
49. GiniMachine: Credit Scoring Software (2021). https://ginimachine.com/risk-management/credit-scoring/. Accessed 13 Oct 2021
50. RND Point: Credit Scoring Software (2021). https://rndpoint.com/solutions/able-scoring/. Accessed 25 Oct 2021
51. Zest AI: Zest AI (2021). https://www.zest.ai/. Accessed 25 Oct 2021
52. Upbin, B.: ZAML Fair - Our New AI To Reduce Bias in Lending (2019). https://www.zest.ai/insights/zaml-fair-our-new-ai-to-reduce-bias-in-lending. Accessed 25 Oct 2021
53. Zest AI: Model Management System (2021). https://www.zest.ai/product. Accessed 25 Oct 2021
On the Application of Multidimensional LSTM
Networks to Forecast Quarterly Reports
Financial Statements

Adam Gałuszka1 , Aleksander Nawrat1 , Eryka Probierz1 , Karol J˛edrasiak2(B) ,


Tomasz Wiśniewski3 , and Katarzyna Klimczak4
1 Silesian University of Technology, Akademicka 2A, 44-100 Gliwice, Poland
2 WSB University, Cieplaka 1C, 41-300 D˛abrowa Górnicza, Poland
[email protected]
3 Warsaw Stock Exchange, Ksi˛aż˛eca 4, 00-498 Warsaw, Poland
4 Warsaw School of Economics, Niepodległosci 162, 02-554 Warsaw, Poland

Abstract. Automatically analyzing financial data is the subject of much ongoing


research. The purpose of this study was to research the possibility of using deep
learning methods to predict and forecast the value of selected financial data in
financial quarterly reports: cash flow from operating activities, cash flow from
investing activities and cash flow from financing activities. The study examined
the quarterly financial reports of selected companies listed on the Warsaw Stock
Exchange (WSE), from September 2008 to December 2019, where each report
consists of about 250 indicators. Based on the principles of financial analysis and
the interdependency between financial indicators, a set of interdependent indicators
was established and a multidimensional long short-term memory network (M-LSTM)
was trained to predict future index values based on historical data. A reinforcement
learning technique was used to check whether it would improve prediction performance
relative to the classical deep learning technique. The results show that prediction of
the indicator values performs significantly better up to a one-year horizon, i.e. up to
four upcoming quarterly reports, when the financial data are coupled rather than
uncoupled. It is also shown how updating the network state with observed values
(reinforcement learning) affects the prediction result.

Keywords: Time series integration analysis · Prediction · AI-aided simulation ·


Cash flow from operating activities · Cash flow from investing activities · Cash
flow from financing activities · Financial report processing analysis · Automated
reasoning · M-LSTM network

1 Introduction
Financial forecasting is an emerging tool in managing corporate finances. Since fore-
casting allows reasonable and scientific prediction of future events, it makes it possible
to make decisions (e.g. on investments) with the potential for precise evaluation of their
effect on the financial situation of the enterprise. This is crucial in economic realities in
which business practice increasingly uses data from financial reporting and financial
analysis, making decisions based on such data. The functional importance of forecast-
ing is pointed out, among others, by Hedayati Moghaddam, Esfandyari and others [1–4].
Preparing a financial forecast means tolerating the uncertainty it brings with it. However,
although drafting an accurate forecast is exceptionally complex (due to the presence of
random factors that cannot be taken into account at the preparation stage), it enables
a detailed analysis of the company's future financial situation and the inclusion of
elements that are not considered during typical financial analysis. Thus, it allows a
more specific determination of threats and opportunities in the financial domain of the
company's operation. It seems, therefore, that financial forecasting will be used more
and more in business practice, especially by entities operating in international or global
markets, as a complement to traditional methods of financial analysis. Indeed, the
fundamental methods of financial analysis need to be enhanced. In sum, there is a
pressing need to develop practical methods for relatively simple projection of a
company's financial situation at least one financial statement ahead or more, which is
not straightforward due to the complexity of forecasting methods and the presence of
contingency factors that are unpredictable when building a forecast.
In summary, the following market gap has been identified: the lack of a tool for auto-
matic analysis and forecasting of financial report data. Among the most widely used
deep learning methods in financial time series forecasting are methods based on the long
short-term memory neural network structure [5–9]. This research concerns the application
of such a multi-dimensional deep learning structure to the problem of forecasting the
value of three indicators included in quarterly financial reports: cash flow from operating
activities, cash flow from investing activities and cash flow from financing activities. A
simple LSTM network structure (single-input single-output) for predicting other financial
statement items has been studied earlier in our works [19]. This idea is not new; e.g., in
[3] deep learning in a multi-agent stock trading system is analyzed.

2 Contribution
The purpose of the study is to prepare and analyze forecasts of coupled financial
reporting items: sales revenue with operating result, for a selected group of companies
with different characteristics traded on the Warsaw Stock Exchange. To reach this
objective, the following research hypothesis was stated: coupled financial forecasting
improves on the classical forecasting of the financial state of a company, and is therefore
effective support for the investment decisions of individual investors on the Stock
Exchange. The study used the authors' choice of companies, divided into industrial,
financial and service-oriented ones. It has been proposed to categorize companies into
four types, which we name Cows, Stars, Phoenixes and Zombies. The survey period
covered the years 2008 to 2018, with forecasts and factors in 2019.
The analysis used Matlab's Deep Learning Toolbox [10] for time series forecasting
using deep learning. Specifically, long short-term memory (LSTM) networks were used
under two structures: a single-input single-output network and a multi-input multi-output
network.
On the Application of Multidimensional LSTM Networks 617

3 Data Selection, Characteristics and Division of Companies


The companies listed on stock markets are analyzed in many ways, and there were over
400 companies listed on the WSE in 2021. The selection of companies for hypothesis
verification was based on an original division proposed by WSE experts, which is an
extension and modification of the classic BCG model. The BCG model is the oldest, best
known and simplest, yet still very useful, method of portfolio analysis and instrument of
strategic controlling. The method's name derives from the American consulting firm
Boston Consulting Group, which was the first to use this tool, in 1969. The method allows
the assessment of business development opportunities and defines the company's strategic
position. By using it, a company can determine which goods (domains) should be
withdrawn from the range and which should bring more profit in the future (see e.g. BCG
Matrix 2021) [11]. Our extended WSE-companies-type model (WSE-CTM) takes the
form of a three-dimensional matrix, with the third dimension represented by color. The
first (row) dimension represents the type of the company, similarly to the BCG model:
Cows, Stars, Phoenixes and Zombies, where:

• Cows – companies with a stable exchange rate (low volatility in the long term) and
regular payment of dividends;
• Stars – companies that grow from a small firm into a global leader;
• Phoenixes – companies that went through a period of success, then collapse, but then
returned to growth;
• Zombies – companies that eventually go bankrupt after a good period of results.

The second (column) dimension indicates whether the company belongs to the industry,
finance or services branch. The third (color) dimension indicates the size of the listed
company, where:

• Blue: blue chips, WIG20;


• Green: midcap, mWIG40;
• Red: small cap, sWIG80.

There were nine companies selected for analysis, each one representing a different
group in the WSE-CTM matrix: Amica, Sniezka, PZU, CD Projekt, TSGames, Quercus,
LiveChat, PBG, GetBack. In the analysis the companies were anonymized, and the
WSE-CTM matrix is presented in Table 1.

Table 1. WSE-CTM matrix for companies selected for analysis

Company type Industry Finance Services


Cows Comp 1, Comp 2 Comp 3
Stars Comp 4, Comp 5
Phoenix Comp 6 Comp 7
Zombies Comp 8 Comp 9

The cash flow from operating activities, cash flow from investing activities and cash
flow from financing activities data have been extracted from quarterly financial reports.
These are statements no. 10, 11 and 12 in the standardized reporting format on the WSE.

4 Methodology
An LSTM network is developed to forecast future index values from historical data.
LSTM networks are a variant of the recurrent neural network (RNN) concept designed to
avoid the problem of long-term dependencies, in which each neuron carries a memory
cell that can store the prior knowledge used by the RNN or forget it if needed [12]. They
are currently widely and successfully applied in time series prediction problems [13–16].
The LSTM-RNN is designed to have a memory cell that stores long-term interrelation-
ships. In addition to the memory cell, the LSTM cell contains an input gate, an output
gate and a forget gate. Each gate in the cell takes the current input, the hidden state from
the previous time step, and the state of the cell's internal memory, performs certain
operations, and determines whether to activate the activation function. To forecast the
values of upcoming time steps, the responses are the training sequences with values
shifted by one time step; in other words, at each step of the input sequence, the LSTM
network is taught to predict the value of the next time step. To forecast multiple future
time steps, the predictAndUpdateState function of the toolbox was deployed, which
forecasts the time steps one by one and updates the state of the network at each forecast.
In the M-LSTM structure (Fig. 1), the dimension of the input and output signals (vectors)
in our case is three, so the additional information affects both the plain predictions and
the predictions with updates.
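The two forecasting regimes used below — closed-loop forecasting, and "forecast with updates" where the network state is advanced on observed values — can be sketched in a language-agnostic way. The following Python sketch is an illustration of the predictAndUpdateState workflow, not the Matlab toolbox code; `smooth_step` is a toy exponential-smoothing stand-in for one LSTM step:

```python
def smooth_step(state, x, alpha=0.5):
    # Toy stand-in for one LSTM step: exponential smoothing.
    # Returns (new_state, prediction_of_next_value).
    s = x if state is None else alpha * x + (1 - alpha) * state
    return s, s

def forecast(model_step, history, horizon, observed=None):
    """Recursive multi-step forecasting in the predictAndUpdateState style.

    In closed-loop mode (observed=None) each prediction is fed back as the
    next input; in the "forecast with updates" mode the state is advanced
    on the actually observed values instead of the predicted ones.
    """
    state, last = None, None
    for x in history:                      # warm up the state on history
        state, last = model_step(state, x)
    preds = []
    for t in range(horizon):
        preds.append(last)                 # prediction for forecast step t
        nxt = last if observed is None else observed[t]
        state, last = model_step(state, nxt)
    return preds

closed = forecast(smooth_step, [1, 2, 3, 4], horizon=3)
updated = forecast(smooth_step, [1, 2, 3, 4], horizon=3, observed=[5, 6, 7])
```

With the toy model the closed-loop forecast freezes at the smoothed level, while updating on observations pulls each subsequent prediction toward the actual data — the same qualitative difference reported between the left and right panels of the figures below.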

Fig. 1. M-LSTM structure (Authors’ drawing).

To benchmark the effectiveness of the forecasts, the root mean square error (RMSE)
was calculated from the standardized data:

RMSE = \sqrt{\frac{\sum_{i=1}^{n} (X_{test,i} - X_{pred,i})^2}{n}},   (1)

where X_{test,i} are the test values and X_{pred,i} the predicted values at time step i. In
order to compare forecast quality, the RMSE has been normalized by the difference
between the maximum and minimum of the test data:

NRMSE1 = RMSE / (X_{test,max} - X_{test,min}).   (2)
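Equations (1) and (2) translate directly into code. The following sketch computes both metrics from paired test and prediction series:

```python
import math

def rmse(test, pred):
    # Root mean square error over paired test/prediction values, Eq. (1).
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(test, pred)) / len(test))

def nrmse1(test, pred):
    # RMSE normalized by the range of the test data, Eq. (2).
    return rmse(test, pred) / (max(test) - min(test))
```

Normalizing by the test-data range makes errors comparable across financial statement items whose magnitudes differ by orders of magnitude, which is why the tables below report NRMSE rather than raw RMSE.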

5 Results
5.1 Example of Forecasts for Industry Companies for M-LSTM (Three
Input-Three Output M-LSTM Structure)

Example 1. The illustrative data file contains a single time series with time steps rep-
resenting quarters and values corresponding to a financial report item, here cash flow
from investing activities. The resulting data is an array of cells in which each element
is a single time step. In Fig. 2, the cash flow from investing activities data and the data
with a forecast are plotted.

Fig. 2. Cash flow from investing activities data (left) and data with forecast (right).

Fig. 3. Cash flow from investing activities forecast with NRMSE (left) and forecast with updates
with NRMSE (right).

Figure 3 (left) illustrates the forecasts against the actually observed data values, accompa-
nied by the prediction error. If the actually recorded indicator values for the forecast
period are already known, the state of the network can be updated with the observed
values instead of the predicted values, thereby confronting the learning results with the
actual observations (the so-called reinforcement learning technique). The results are
shown in Fig. 3 (right). In this case, updating on the observations significantly (though
not in the statistical sense) improves the forecasting result.
Example 2. The exemplary data file contains a single time series with time steps corre-
sponding to quarters and values representing a financial statement item, here cash flow
from financing activities. The output is an array of cells in which each element is a single
time step. In Fig. 4, the cash flow from financing activities data and the data with a
forecast are shown.

Fig. 4. Cash flow from financing activities data (left) and data with forecast (right).

Figure 5 (left) demonstrates the forecasts against the observed data values along with
the forecast error. If the actually recorded values for the forecast period are available, the
state of the network can be updated with the observed values instead of the forecasted
ones, thus confronting the learning results with the actual observations. The findings
are shown in Fig. 5 (right). In this case, the observation update significantly (though not
in the statistical sense) worsens the forecasting result.

Fig. 5. Cash flow from financing activities forecast with NRMSE (left) and forecast with updates
with NRMSE (right).

Example 3. The sample data file contains a single time series with time steps corre-
sponding to quarters and values corresponding to a financial report item, here cash flow
from operating activities. The output is an array of cells in which each element is a
single time step. Figure 6 shows the operating cash flow data and the forecast data.

Fig. 6. Cash flow from operating activities data (left) and data with forecast (right).

Figure 7 (left) demonstrates the forecasts against the observed data values along with
the forecast error. If the actually recorded values for the forecast period are available, the
state of the network can be updated with the observed values instead of the forecasted
ones, thus confronting the learning results with the actual observations. The findings
are shown in Fig. 7 (right). In this case, the observation update significantly (though not
in the statistical sense) worsens the forecasting result.

Fig. 7. Cash flow from operating activities forecast with NRMSE (left) and forecast with updates
with NRMSE (right).

5.2 Accuracy of Financial Statements Forecast in One-Year Horizon


Table 2 shows the NRMSE (2) for each analyzed financial report item and for each of
the companies under consideration, for conventional LSTM forecasts. The NRMSE (2)
for each analyzed financial statement item and each company under consideration, for
M-LSTM predictions, is presented in Table 3.
In the cases studied here, an improvement of the forecast has been noted for the cash
flow from investing activities item (the error is lower both for the forecast and the forecast
with updates). In the case of the cash flow from financing activities item the result is
worse, i.e. the additional information coming from the other statements is not useful and
blurs the picture. For the last analyzed item, cash flow from operating activities, the result
is ambiguous: the additional information improves the simple forecast but worsens the
forecast with updates.

Table 2. NRMSE for classical LSTM predictions

                                        Forecast               Forecast with updates
Financial statement item                Min     Max     Avg    Min     Max     Avg
Cash flow from investing activities     0.124   8.419   2.961  0.130   4.867   1.738
Cash flow from financing activities     0.047   2.297   0.809  0.048   1.727   0.623
Cash flow from operating activities     0.490   33.780  9.082  0.416   21.830  6.137

Table 3. NRMSE for M-LSTM predictions

                                        Forecast               Forecast with updates
Financial statement item                Min     Max     Avg    Min     Max     Avg
Cash flow from investing activities     0.323   1.035   0.652  0.348   1.256   0.755
Cash flow from financing activities     0.379   2.429   0.964  0.367   2.062   0.864
Cash flow from operating activities     0.032   18.058  6.153  0.033   25.928  8.734

6 Discussion and Conclusion


The achieved results do not unequivocally indicate an improvement in the quality of the
forecasts when additional information coming from the set of accompanying financial
statements is assumed.
The difficulty is that the future values of financial report items depend not only on
past quantities (with the exception of fully deterministic systems with well-known
excitations and without non-deterministic noise), but also on other external inputs that
may appear sporadically and take the form of data that are not time series (an unforeseen
market situation). The forecast result in the case of the analyzed items, i.e. cash flow from
investing activities and cash flow from financing activities, should be considered rather
as an assisting tool for automatically reporting possible unusual fluctuations in reported
financial data than as an efficient prognostic instrument [17, 18].

Acknowledgments. We would like to thank the stock exchange experts for their critical com-
ments. The work has been funded by GPW Data Grant No. POIR.01.01.01-00-0162/19 in 2021.
The work of Adam Gałuszka was supported in part by the Silesian University of Technology (SUT)
through the subsidy for maintaining and developing the research potential grant in 2022. The work
of Eryka Probierz was supported in part by the European Union through the European Social Fund
as a scholarship under Grant POWR.03.02.00-00-I029, and in part by the Silesian University of
Technology (SUT) through the subsidy for maintaining and developing the research potential grant
in 2022 for young researchers in analysis. This work was supported by Upper Silesian Centre for
Computational Science and Engineering (GeCONiI) through The National Centre for Research
and Development (NCBiR) under Grant POIG.02.03.01-24-099/13. The work of Karol J˛edrasiak
and Aleksander Nawrat has been supported by National Centre for Research and Development as
a project ID: DOB-BIO10/19/02/2020 “Development of a modern patient management model in
a life-threatening condition based on self-learning algorithmization of decision-making processes
and analysis of data from therapeutic processes”.

References
1. Hedayati Moghaddama, A., Hedayati Moghaddamb, M., Esfandyari, M.: Stock market index
prediction using artificial neural network. J. Econ. Finance Adm. Sci. 21, 89–93 (2016)
2. Kyoung-jae, K.: Financial time series forecasting using support vector machines. Neurocom-
puting 55(1–2), 307–319 (2003)
3. Korczak, J., Hemes, M.: Deep learning for financial time series forecasting in a-trader system.
In: 2017 Federated Conference on Computer Science and Information Systems (FedCSIS),
Prague, pp. 905–912 (2017)
4. Franc-D˛abrowska, J., Zbrowska, M.: Prognozowanie finansowe dla spółki X – spółka
logistyczna. Zeszyty Naukowe SGGW w Warszawie. Ekonomika i Organizacja Gospodarki
Żywnościowej 64, 251–270 (2008). (in Polish)
5. Chen, K., Zhou, Y., Dai, F.: A LSTM-based method for stock returns prediction: a case study
of China stock market. In: 2015 IEEE International Conference on Big Data (Big Data),
pp. 2823–2824 (2015). https://doi.org/10.1109/BigData.2015.7364089
6. Zhao, Z., Rao, R., Tu, S., Shi, J.: Time-weighted LSTM model with redefined labeling for
stock trend prediction. In: 2017 IEEE 29th International Conference on Tools with Artificial
Intelligence (ICTAI), pp. 1210–1217 (2017). https://doi.org/10.1109/ICTAI.2017.0018
7. Roondiwala, M., Patel, H., Varma, S.: Predicting stock prices using LSTM. Int. J. Sci. Res.
(IJSR) 6 (2017). https://doi.org/10.21275/ART20172755
8. Qiu, J., Wang, B., Zhou, C.: Forecasting stock prices with long-short term memory neural
network based on attention mechanism. PLoS ONE 15(1) (2020). https://doi.org/10.1371/jou
rnal.pone.0227222
9. Fischer, T., Krauss, C.: Deep learning with long short-term memory networks for financial
market predictions. Eur. J. Oper. Res. 270(2), 654–669 (2018). https://EconPapers.repec.org/
RePEc:eee:ejores:v:270:y:2018:i:2:p:654-669
10. www.mathworks.com
11. BCG Matrix (2021). http://www.netmba.com/strategy/matrix/bcg/

12. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780
(1997)
13. Elsaraiti, M., Merabet, A.: Application of long-short-term-memory recurrent neural networks
to forecast wind speed. Appl. Sci. 11, 2387 (2021). https://doi.org/10.3390/app11052387
14. Shumway, R.H., Stoffer, D.S.: Time Series Analysis and its Applications: With R Examples,
4th edn. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-52452-8
15. Huang, J., Chai, J., Cho, S.L.: Deep learning in finance and banking: a literature review and
classification. Front. Bus. Res. China 14, 13 (2020). https://doi.org/10.1186/s11782-020-000
82-6
16. Gałuszka, A., Pacholczyk, M., Bereska, D., Skrzypczyk, K.: Planning as artificial intelligence
problem - short introduction and overview. In: Nawrat, A., Simek, K., Świerniak, A. (eds.)
Advanced Technologies for Intelligent Systems of National Border Security. SCI, vol. 440,
pp. 95–103. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-31665-4_8
17. Roondiwala, M., Patel, H., Varma, S.: Financial time series forecasting with deep learning: a
systematic literature review: 2005–2019 (2020)
18. Sezer, O.B., Gudelek, M.U., Ozbayoglu, A.M.: Appl. Soft Comput. J. 90, Article no. 106181
(2020)
19. Gałuszka, A., Probierz, E., Olczyk, A., Kocerka, J., Klimczak, K., Wisniewski, T:. The appli-
cation of SISO LSTM networks to forecast selected items in financial quarterly reports - case
study. In: Gervasi, O., Murgante, B., Misra, S., Rocha, A.M.A.C., Garau, C. (eds.) Compu-
tational Science and Its Applications - ICCSA 2022 Workshops, Malaga, Spain, July 4–7,
Proceedings pt 5, pp. 605–616 (2022)
Utilizing Machine Learning to Predict Breast
Cancer: One Step Closer to Bridging the Gap
Between the Nature Versus Nurture Debate

Junhong Park(B) and Miso Kim

Seoul Scholars International, 982-5, Daechi-dong, Gangnam-gu, Seoul, Republic of Korea


[email protected]

Abstract. In the fight against breast cancer, scientists have been trying to find the most
effective solutions and treatments, and studies on genes through machine learning have
been conducted. By identifying the factor that influences breast cancer the most, this
knowledge can be used to prevent the disease or to treat patients with breast cancer
appropriately. Furthermore, the result of this experiment extends to the debate of nature
versus nurture. If the result concludes that Only Gene or Only Mutation has a stronger
effect on tumors, it weighs the debate toward nature; likewise, if Only Others is the
dominant factor, it emphasizes nurture. The gathered data was processed and passed
through eight different machine learning algorithms to predict tumor size and stage.
“Others” was concluded to be the most influential factor for the tumor. Among the
“Others” features, the type of breast surgery and the amount of chemotherapy received
showed the highest correlation with tumor size and stage. In conclusion, this solidifies
nurture’s stance in the debate. Data on external effects and the use of a further developed
machine learning model could improve the experiment by increasing the accuracy of
the result.

Keywords: Breast cancer · Data analysis · Nurture versus nature · Machine learning · RNA

1 Introduction

1.1 Background

Cancer is an abnormal growth of cells due to a mutation caused by DNA alteration and/or
exterior factors [1]. It was “the leading cause of death worldwide” in 2020, and there were
more than 10 million confirmed cases of cancer [2]. Breast cancer is one of the most
widespread and deadliest cancers [2]. There are different types of breast cancer: Ductal
Carcinoma In Situ (DCIS), Invasive Ductal Carcinoma (IDC), Lobular Carcinoma In
Situ (LCIS), Invasive Lobular Cancer (ILC), Triple Negative Breast Cancer, Inflamma-
tory Breast Cancer (IBC), Metastatic Breast Cancer, and other less frequent types [3].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 625–643, 2023.
https://doi.org/10.1007/978-3-031-18461-1_41

Possible symptoms of breast cancer are a “new lump in the breast or underarm (armpit)
[, t]hickening or swelling of part of the breast[, and i]rritation or dimpling of the breast
skin” [4]. Up to 110 genes are related to breast cancer, and mutations in BRCA1 and
BRCA2 are known to have significant effects on the risk of getting breast cancer [5].
Machine learning is a technique that makes the machines train and learn about
information in ways similar to those of humans, which is making predictions through
the data [6]. Recently, machine learning has been used in genomic predictions, and it
can “adapt to complicated associations between data and the output [and] adapt to very
complex patterns” [7]. Also, it can “help us identify underlying genetic factors for certain
diseases by looking for genetic patterns amongst people with similar medical issues”
[8]. Machine learning has identified 165 new cancer genes in a recent study [9].

Fig. 1. Estimated number of new cases in 2020, worldwide, females, all ages

Figure 1 identifies breast cancer as the most frequently detected cancer worldwide for
women in 2020 [10]. Moreover, Fig. 2 indicates an increased probability for an ordinary
US woman to be diagnosed with invasive breast cancer. This shows the importance of
analyzing the key factors of breast cancer: if those factors were found, they would help
many women prevent the disease.

Fig. 2. Increased percentage of a woman’s lifetime risk of being diagnosed with invasive breast
cancer in the United States

1.2 Objective
While there have been significant advancements in developing treatments and cures, the
majority of experts suggest that it is still important to detect the tumor at the earliest
stage possible. This study aims to determine which factor among the number of genes,
the presence of mutations, or the other external factors influences tumor size and stage
the most. After the most influential factor is concluded, this knowledge can be used for
prevention. If genes or their mutations were concluded to be the most influential factor,
this data would be crucial for prevention: some tumors might be too small at the moment
to be noticed, as, for example, solid tumors are only detected through imaging when
“approximately 10^9 cells [are] growing as a single mass” [11], so people need to wait
until the tumor grows to that point. However, if doctors see abnormal activity in that
factor, they can stay alert and take precautions. Moreover, early detection of breast
cancer is crucial because it can lead to an “increased number of available treatment
options, increased survival, and improved quality of life” [12].
If “Others”, the exterior factors, is concluded as the most influential factor among the
three, then the right treatments can be given properly. For example, if radiotherapy, one
factor within “Others”, has the highest inverse correlation with tumor size and stage,
then radiotherapy can be actively utilized to treat patients with breast cancer.

Existing research did not consider exterior factors, and its accuracy was based on only
one factor such as a particular gene. Our experiment, on the other hand, focused not
only on normal genes but also on mutated genes, which increases the accuracy of the
genetic datasets. Moreover, the proposed study included exterior factors such as whether
the patients had received chemotherapy. Furthermore, accuracy was tested with multiple
factor combinations; for instance, the mutation and exterior-factor data were mixed and
tested for accuracy. This is crucial because it lets us determine in more detail which
factors influenced the result the most. In addition, this research predicted tumor size and
stage, which is more detailed than other research that only examined the presence of a
relationship between a gene and breast cancer. Last but not least, nature versus nurture
is debated in this article, and the gap between the two sides of the debate has been
bridged via machine learning algorithms.

2 Literature Review
Urda et al. used three free-public RNA-Seq expression datasets from The Cancer Genome
Atlas website. The datasets are linked to BRCA, COAD, and KIPAN genes. Those
databases are analyzed using a standard Least Absolute Shrinkage and Selection Operator
(LASSO) as a baseline model. DeepNet(i) and DeepNet(ii) are used for the application
of the deep neural net model for analysis. The results suggest that the straightforward
applications of deep nets described in this work are not enough to outperform simpler
models like LASSO. They found out that deep learning processes took more time to create
models than LASSO. They conclude that using a simple feature selection procedure to
reduce the genes and later fitting a deep learning model will take much more processing
time to achieve similar predictive performances [13].
Castillo et al. obtained the data from different cancer breast datasets in the National
Center for Biotechnology Information Gene Expression Omnibus (NCBI GEO) web plat-
form. The data were obtained from three datasets: RNA-seq, Microarray, and integrated.
First, they used Train-Test split to obtain the level of gene expression in RNA-seq and
Microarray. To test the accuracy of the data obtained through RNA-seq and Microar-
ray, they used Support Vector Machine (SVM), Random Forest (RF), and K-Nearest
Neighbors (k-NN). Moreover, they used minimum-Redundancy Maximum-Relevance
(mRMR) to apply feature selection. At last, they found that SFRP1, GSTM3, SULT1E1,
MB, TRIM29, and VSTM2L genes are the most relevant and frequent in the datasets.
The researchers concluded the study with high accuracy of the six genes found through
different techniques and confirmed once more by informing that five of the final six
genes were previously noted as genes related to breast cancer [14].
Liñares Blanco et al. used RNA-seq expression on the data from The Cancer Genome
Atlas. They used a standard statistical approach and two algorithms, including Random
Forest and Generalized Linear Models. The results differ between the conventional
statistical approach and the created algorithms, as 99% of the genes generated by an
algorithm are represented differently from the standard statistical approach.
There are some similarities in the method of identifying genes that have the potential
to structure tumors. For instance, filtering methods helped to identify tumors on the
unknown genes linked with the cell cycle [15].

Wang et al. obtained data from The Cancer Genome Atlas (TCGA). They used
Random Forest, eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting
Machine (LightGBM). LightGBM achieved the highest performance and accuracy, and
all three methods indicated that the hsa-mir-139 gene showed the highest correlation
with breast cancer. Moreover, there were other genes such as hsa-mir-21 and hsa-mir-
183 that yielded high correlations as well. By the result, the researchers could conclude
two ideas. First, LightGBM can be primarily used to detect cancer, as it showed higher
accuracy and efficiency than the other machine learning algorithms. Second, the
genes that exhibited correlation with breast cancer could work as biomarkers in cancer
diagnoses [16].
Johnson et al. used 3 RNA-seq data sets from Rat Body Map, National Center for
Biotechnology Information, and the Cancer Genome Atlas. Machine Learning meth-
ods such as support vector machines, random forest, decision table, J48 decision tree,
logistic regression, and naïve Bayes with three normalization techniques and two RNA-
seq analysis pipelines called the standard Tuxedo suite and RNA-Seq by Expectation-
Maximization (RSEM) were used. Generally, random forest proved to have the highest
accuracy, between 71.3% and 99.8%. RNA-seq-based classifiers should utilize transcript-
based expression data, feature-selection preprocessing, and the Random Forest
classification method, but not normalization [17].

3 Methods and Materials


3.1 Data Description
The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC)
database, downloaded from the cBioPortal, contains 1980 breast cancer samples. The
dataset was accumulated by Professor Carlos Caldas from Cambridge Research Insti-
tute and Professor Sam Aparicio from the British Columbia Cancer Centre in Canada
and published in Nature Communications (Pereira et al., 2016). In this experiment, the
dataset includes three factors called “Only Mutation,” “Only Gene,” and “Others.” Only
Mutation refers to whether each gene has a mutation in them or not, and Only Gene
refers to the Z-score of the gene expression compared to the average amount of gene
expression in the data set; Fig. 3 is shown below as a sample of the Z-score data. Others
refer to exterior factors related to cancer [18].
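As an illustration, a Z-score of this kind can be computed directly with pandas. The sketch below uses made-up expression values; the gene name and numbers are hypothetical, not taken from METABRIC:

```python
import pandas as pd

# Hypothetical expression values for one gene across five samples.
expr = pd.DataFrame({"BRCA1": [2.1, 1.8, 3.5, 2.0, 2.6]})

# Z-score: how many standard deviations each sample's expression
# lies from the cohort mean for that gene.
z = (expr - expr.mean()) / expr.std()
print(z.round(2))
```

After this transformation every gene column has mean 0 and standard deviation 1, which is what makes expression levels comparable across genes.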

Fig. 3. Example of the Z-score data for gene dataset



3.2 Correlation Matrix

Figure 4 and 5 show the correlation matrix between tumor size or stage and five genes
with the highest correlation values. Generally, they show very low correlations between
genes and target features, which are tumor size and tumor stage.

Fig. 4. Correlation matrix of tumor stage and five different Genes’ Z-Scores

Fig. 5. Correlation matrix of tumor size and five different genes’ Z-Scores

Figures 6 and 7 show the correlation matrix between tumor size or stage and five
mutations with the highest correlation values. Similarly, they show very low correlations.

Fig. 6. Correlation matrix of tumor stage and data for five different mutations

Fig. 7. Correlation matrix of tumor size and data for five different mutations

Figures 8 and 9 represent the correlations between tumor specifics and the different
factors in “Others”; at around 0.2–0.3, these are the highest correlations among the
datasets. For tumor size, chemotherapy and type of breast surgery yielded correlations
of 0.21 and 0.25, respectively. For tumor stage, chemotherapy showed a correlation of
0.33 and type of breast surgery 0.25. Conclusively, it was found that treatment processes
such as surgery or chemotherapy had some correlated effect on the size or stage of the
tumor.
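Correlation matrices like these can be produced with pandas' `corr()`. The following is a minimal sketch on synthetic data; the column names and effect sizes are illustrative assumptions, not the METABRIC values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
chemo = rng.integers(0, 2, n)    # 1 = received chemotherapy (synthetic)
surgery = rng.integers(0, 2, n)  # 1 = mastectomy, 0 = breast-conserving (synthetic)

# Synthetic tumor size loosely tied to the two treatment flags.
tumor_size = 20 + 5 * chemo + 4 * surgery + rng.normal(0, 5, n)

df = pd.DataFrame({"chemotherapy": chemo,
                   "type_of_breast_surgery": surgery,
                   "tumor_size": tumor_size})

# Pearson correlation of every feature against tumor size.
corr = df.corr()["tumor_size"].drop("tumor_size")
print(corr.round(2))
```

Because the synthetic target was built from the two flags, both correlations come out positive, mirroring the qualitative pattern reported for the real data.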

Fig. 8. Correlation matrix of tumor stage and factors in “Others”

Fig. 9. Correlation matrix of tumor size and factors in “Others” (chemotherapy: 0.206; type of
surgery: 0.25)

3.3 Random Forest


Random Forest (RF) recruits variables by randomly extracting them from the given data.
As depicted in Fig. 10, it is a method of creating multiple decision trees and deriving an
optimal model from the majority of their results; it randomly selects certain independent
variables, rather than all variables, to resample and extract multiple sample datasets.
Since variables are selected randomly regardless of their influence, the correlation
between the individual models built on the sample datasets is reduced, which lowers the
average volatility and makes the ensemble stable. Each model is generated by applying
the CART algorithm to a sample dataset, and the optimal final model is determined
through a simple majority vote. It has good generalization performance and high
accuracy [19].
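A minimal sketch of fitting and evaluating such a forest with scikit-learn, on a synthetic regression problem standing in for the tumor-size target (the settings are illustrative, not the study's):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the tumor-size target.
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Each of the 100 trees sees a bootstrap sample and a random subset of
# features at every split; for regression, predictions are averaged.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_tr, y_tr)
print(f"R^2 on held-out data: {rf.score(X_te, y_te):.2f}")
```

The averaging across decorrelated trees is what gives the ensemble its reduced variance relative to a single decision tree.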

Fig. 10. Diagram of how random forest works

3.4 Light Gradient Boosting Machine


LightGBM (LGBM) is an efficient technique that develops the extreme gradient boosting
(XGB) algorithm further by excluding data instances with small gradients, a method
called gradient-based one-side sampling (GOSS). GOSS enables much faster learning
than existing XGB while preserving accuracy. Exclusive feature bundling (EFB) bundles
mutually exclusive features in the dataset to reduce its dimension. Because the
computation time of the algorithm grows with the dimension of the dataset, EFB gives
LGBM a faster computing speed. It is especially useful for datasets containing multiple
string columns, as the algorithm must convert them into one-hot vectors, which increases
the dimension of the dataset [20]. A diagram illustrating the mechanism of LGBM is
shown in Fig. 11.

Fig. 11. Diagram of how light gradient boosting machine works

3.5 Nature Versus Nurture

Nature versus nurture theory first arose in the field of psychology; it questions whether
our behaviors originate from our genes or our environments. Subsequently, other fields
engaged this theory in their own debates. The theory applies to biology as well, and it
has been debated for decades: are diseases caused by genetic predispositions or by
environmental factors? Some diseases, such as sickle cell anemia and Huntington’s
chorea, are affected solely by genetics. However, this is not the case for most diseases,
as both factors are influential in their development. The debate continues over which
factor is more influential [21].

3.6 Experiment Workflow

First of all, the data was obtained from the Breast Cancer Gene Expression dataset
(METABRIC) and divided into seven parts to investigate which sectors affect tumor
stage and size the most. The seven parts include Only Gene, Only Others, Only Muta-
tion, Mutation + Others, Gene + Others, Mutation + Gene, and All. These data were
then preprocessed through label encoding, null value checking, feature selection, and
train-test splitting. In the feature selection process, variables that are directly related to
tumor size or stage are removed from the “Others” dataset. After that, eight machine
learning algorithms obtained accuracy scores, root mean squared error (RMSE), and
mean absolute error (MAE) values, and the relationship between the data and the tumor
size or stage could be confirmed. The overview of the experiment workflow is shown
below in Fig. 12.
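The preprocessing and evaluation pipeline described above might be sketched as follows; the columns are hypothetical stand-ins for the METABRIC fields, and only two of the eight algorithms are shown:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(1)
chemo = rng.integers(0, 2, 300)
df = pd.DataFrame({
    "surgery": rng.choice(["mastectomy", "breast conserving"], 300),
    "chemotherapy": chemo,
    "tumor_size": 20 + 6 * chemo + rng.normal(0, 5, 300),
})

# 1. Label-encode string columns; 2. check for nulls;
# 3. split features from target; 4. train-test split.
df["surgery"] = LabelEncoder().fit_transform(df["surgery"])
assert df.isnull().sum().sum() == 0  # null value check
X, y = df.drop(columns="tumor_size"), df["tumor_size"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# 5. Fit each model and report RMSE / MAE on the held-out split.
for model in (LinearRegression(), RandomForestRegressor(random_state=1)):
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    mae = mean_absolute_error(y_te, pred)
    print(f"{type(model).__name__}: RMSE={rmse:.2f}, MAE={mae:.2f}")
```

The actual study ran this loop over eight regressors/classifiers and seven data subsets, collecting accuracy, RMSE, and MAE for each combination.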

Fig. 12. Diagram of how the experiment is conducted

4 Result

The graph below presents the root mean squared error (RMSE) values for the different
types of data; several regressors were used to obtain them. Through the XGB Regressor,
the value for Only Mutation was 15.08, but when the “Others” dataset was added, the
result dropped to 12.26. “Only Gene” received 15.04, and the addition of “Others”
likewise yielded a decrease, to 12.09. When the “Others” factor was added to “Only
Mutation” and “Only Gene” in Random Forest, the RMSE values decreased by 2.71 and
3.10, respectively. Last, the addition of “Others” in the Extra Trees Regressor decreased
the RMSE values by 1.62 for “Only Mutation” and 2.42 for “Only Gene”.
Figure 13 shows that XGB Regressor, Random Forest Regressor, and Extra Trees
Regressor resulted in the lowest RMSE values.

Fig. 13. Root Mean Square Error (RMSE) values for seven different groups

Table 1. Root Mean Square Error (RMSE) values for seven different groups

Data type | Algorithm used for the lowest RMSE value | Lowest RMSE value | Algorithm used for the highest RMSE value | Highest RMSE value
Only Mutation | XGB Regressor | 15.08 | LGBM Regressor | 16.21
Only Gene | Extra Trees Regressor | 14.77 | Decision Tree Regressor | 18.19
Only Others | Linear Regression | 11.76 | Decision Tree Regressor | 15.67
Mutation + Others | XGB Regressor | 12.26 | Decision Tree Regressor | 16.42
Gene + Others | Random Forest Regressor | 12.01 | Linear Regression | 17.42
Mutation + Gene | Extra Trees Regressor | 15.00 | Linear Regression | 21.28
All | Random Forest Regressor | 11.98 | Linear Regression | 20.19

Regressors were divided into two separate charts due to the large difference between
the mean absolute error (MAE) values. Figure 14 reveals the MAE values for the
different types of data using the Decision Tree Regressor, Random Forest Regressor,
and Linear Regression. With the Random Forest Regressor, adding “Others” decreased
the MAE values of the Only Mutation and Only Gene datasets: Only Mutation saw a
decrease of 1.55, and Only Gene a decrease of 1.41.

Fig. 14. Mean Absolute Error (MAE) values for seven different groups

Figure 14 displays Random Forest Regressor as the algorithm with the lowest MAE
values from this group of algorithms. The lowest and the highest MAE_1 values and the
algorithms used for them are represented in Table 2.

Table 2. Mean Absolute Error (MAE) values for seven different groups

Data type | Algorithm used for the lowest MAE_1 value | Lowest MAE_1 value | Algorithm used for the highest MAE_1 value | Highest MAE_1 value
Only Mutation | Decision Tree Regressor | 10.29 | Linear Regression | 10.82
Only Gene | Extra Trees Regressor | 10.21 | Linear Regression | 13.06
Only Others | Linear Regression | 11.76 | Decision Tree Regressor | 15.67
Mutation + Others | Random Forest Regressor | 9.22 | Linear Regression | 10.01
Gene + Others | Random Forest Regressor | 9.80 | Linear Regression | 13.37
Mutation + Gene | Random Forest Regressor | 10.17 | Linear Regression | 15.55
All | Random Forest Regressor | 8.78 | Linear Regression | 15.55

Figure 15 indicates the MAE values for the different types of data using the XGB
Regressor, LGBM Regressor, and Linear Regression. The second set of MAE values
was evaluated through the XGB Regressor, which resulted in 227.32 for the “Only
Mutation” dataset; when the “Others” dataset was applied, the value decreased to 150.22.
For the “Only Gene” dataset, the XGB Regressor yielded 226.10; in contrast, it resulted
in 146.13 after “Others” was added. Furthermore, Table 3 presents the MAE values for
the seven groups of the experiment.

Fig. 15. The Second Set of Mean Absolute Error (MAE) values for seven different groups

Table 3. Second Set of Mean Absolute Error (MAE) values for seven different groups

Data type | Algorithm used for the lowest MAE_2 value | Lowest MAE_2 value | Algorithm used for the highest MAE_2 value | Highest MAE_2 value
Only Mutation | XGB Regressor | 227.32 | LGBM Regressor | 262.84
Only Gene | Extra Trees Regressor | 218.26 | LGBM Regressor | 228.05
Only Others | XGB Regressor | 154.76 | Extra Trees Regressor | 205.72
Mutation + Others | XGB Regressor | 150.22 | Extra Trees Regressor | 212.50
Gene + Others | XGB Regressor | 146.13 | Extra Trees Regressor | 152.64
(continued)
Figure 16 and Table 4 show the accuracy of the different machine learning algorithms
on the different sections of the data. With the aid of “Others”, “Only Mutation” and
“Only Gene” gained 5.3% and 8.03%, respectively, in the LGBM Classifier.

Table 3. (continued)

Data type | Algorithm used for the lowest MAE_2 value | Lowest MAE_2 value | Algorithm used for the highest MAE_2 value | Highest MAE_2 value
Mutation + Gene | Extra Trees Regressor | 224.85 | LGBM Regressor | 229.76
All | XGB Regressor | 145.30 | Extra Trees Regressor | 158.58

Fig. 16. Accuracy for seven different groups

Table 4. Accuracy for seven different groups

Data type | Algorithm used for the lowest accuracy | Lowest accuracy score | Algorithm used for the highest accuracy | Highest accuracy score
Only Mutation | KNeighbors Classifier | 45.13 | XGB Classifier | 50.12
Only Gene | Logistic Regression | 49.88 | Random Forest Classifier | 55.58
Only Others | KNeighbors Classifier | 54.99 | XGB Classifier | 66.42
Mutation + Others | KNeighbors Classifier | 52.55 | LGBM Classifier | 67.4
(continued)

Table 4. (continued)

Data type | Algorithm used for the lowest accuracy | Lowest accuracy score | Algorithm used for the highest accuracy | Highest accuracy score
Gene + Others | Logistic Regression | 57.91 | XGB Classifier | 66.67
Mutation + Gene | KNeighbors Classifier | 45.61 | Random Forest Classifier | 55.34
All | KNeighbors Classifier | 52.55 | XGB Classifier | 66.67

5 Discussion
To sum up, the decrease in both RMSE and MAE values across the different machine
learning algorithms when “Others” was added shows how “Others” lowered the RMSE
and MAE scores, indicating that the factor combinations including “Others” are better
at predicting tumor size. Moreover, the accuracy of the data using the LGBM classifier
expressed a similar effect: the addition of “Others” increased the accuracy score,
resulting in better predictions of the tumor stage. The debate of nature versus nurture
has continued with a consensus that both factors are influential to some extent in
diseases. However, according to the analysis above, “Others”, the exterior factors, was
concluded to be the most influential of the three in predicting tumor size and stage. In
particular, Figs. 17 and 18 exhibit the type of breast surgery and chemotherapy as the
most important of the “Others” factors for determining tumor size and stage,
respectively. This shows how exterior factors such as the age at diagnosis are more
important than the genes or mutated genes themselves, thus concluding that nurture
takes the bigger part in deciding tumor size and stage.
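Feature-importance scores of the kind plotted in the two figures can be read from a fitted tree ensemble via `feature_importances_`. A minimal sketch on synthetic columns (the names and effect sizes are illustrative assumptions, not the study's data):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
n = 400
X = pd.DataFrame({
    "type_of_breast_surgery": rng.integers(0, 2, n),
    "chemotherapy": rng.integers(0, 5, n),
    "age_at_diagnosis": rng.normal(60, 10, n),
})
# By construction, the synthetic target depends mostly on the first two columns.
y = 10 * X["type_of_breast_surgery"] + 4 * X["chemotherapy"] + rng.normal(0, 2, n)

rf = RandomForestRegressor(n_estimators=200, random_state=7).fit(X, y)
for name, score in sorted(zip(X.columns, rf.feature_importances_),
                          key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```

The importances sum to 1 and reflect each feature's total contribution to impurity reduction across the forest, which is how ranking plots like Figs. 17 and 18 are typically produced.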

Fig. 17. Feature importance scores of different factors in “Others” for tumor size

Fig. 18. Feature importance scores of different factors in “Others” for tumor stage

5.1 Limitations
The most striking limitation of this research is its low accuracy. Although data fused
with the “Others” dataset has a relatively higher accuracy, of 50–60%, than the other
datasets, the highest value is 67.4%, obtained when the LGBM Classifier is used on the
“Mutation + Others” dataset. In future research, when designing a model for this
problem, the focus should be on improving the accuracy of the model. Another potential
weakness of our study is that we do not have records of the patients’ exposure to
carcinogens, which are often consumed through drinking or smoking. Since carcinogens
damage the DNA in our cells, their consumption can affect the size and stage of the
tumor [22]. Due to this weakness, we were not able to evaluate the full extent of the
exterior factors.

6 Conclusion
The given data, METABRIC, was divided into seven datasets and preprocessed through
four methods. The preprocessed datasets then went through eight different machine
learning algorithms. The addition of “Others” resulted in lower RMSE and MAE values
and higher accuracy, meaning that “Others” helped predict tumor size and stage better.
According to the correlation matrix and feature importance graphs, the type of breast
surgery and chemotherapy were identified as the most influential factors in determining
tumor size and stage. The nature versus nurture debate is about whether the observable
effects of biological diseases are due to genetics or the environment. While it is
important to recognize that both factors are influential, this experiment strengthens
nurture’s side of the debate by showing that “Others”, the exterior factors, improved the
accuracy and predictability of the dataset. Further research must investigate external
effects related to carcinogens, such as smoking and drinking alcohol, and quantify such
data into datasets. Deeper research based on this richer external data is needed to
improve the accuracy of the model so that higher correlations can be found.

References
1. National Cancer Institute: What is cancer? 5 May 2021 https://www.cancer.gov/about-can
cer/understanding/what-is-cancer
2. WHO|World Health Organization: Cancer, 3 March 2021. https://www.who.int/news-room/
fact-sheets/detail/cancer
3. National Breast Cancer Foundation: Other Types, 19 September 2019. https://www.nationalb
reastcancer.org/other-types-of-breast-cancer. Accessed 13 Dec 2022
4. Centers for Disease Control and Prevention: What are the symptoms of breast cancer? 14
September 2020. https://www.cdc.gov/cancer/breast/basic_info/symptoms.htm
5. Breastcancer.org: Researchers identify 110 genes associated with breast cancer, 20 December
2018 https://www.breastcancer.org/research-news/110-genes-associated-with-breast-cancer
6. IBM Cloud Education: What is machine learning? IBM - United States, 15 July 2020. https://
www.ibm.com/cloud/learn/machine-learning
7. Montesinos-López, O.A., et al.: A review of deep learning applications for genomic selection.
BMC Genomics 22(1) (2021). https://doi.org/10.1186/s12864-020-07319-x
8. Jethanandani, M.: Machine learning and genetics. 23andMe Education Program, 8 August
2018. https://education.23andme.com/machine-learning-and-genetics/
9. Max-Planck-Gesellschaft: 165 new cancer genes identified with the help of machine learning.
ScienceDaily, 12 April 2021. www.sciencedaily.com/releases/2021/04/210412142730.htm
10. Sung, H., Ferlay, J., Siegel, R.L., Laversanne, M., Soerjomataram, I., Jemal, A., Bray, F.:
Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide
for 36 cancers in 185 countries. CA: A Cancer J. Clinicians 71(3), 209–249 (2021). https://
doi.org/10.3322/caac.21660
11. Frangioni, J.V.: New technologies for human cancer imaging. PubMed Central (PMC), 20
August 2008. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2654310/
12. MyVMC: Early detection of breast cancer information, 31 October 2018. https://www.
myvmc.com/investigations/early-detection-of-breast-cancer/
13. Urda, D., Montes-Torres, J., Moreno, F., Franco, L., Jerez, J.M.: Deep learning to analyze
RNA-seq gene expression data. In: Rojas, I., Joya, G., Catala, A. (eds.) IWANN 2017. LNCS,
vol. 10306, pp. 50–59. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59147-6_5
14. Castillo, D., Gálvez, J.M., Herrera, L.J., Román, B.S., Rojas, F., Rojas, I.: Integration of RNA-
SEQ data with heterogeneous microarray data for breast cancer profiling. BMC Bioinform.
18(1) (2017). https://doi.org/10.1186/s12859-017-1925-0
15. Liñares Blanco, J., Gestal, M., Dorado, J., Fernandez-Lozano, C.: Differential gene expression
analysis of RNA-seq data using machine learning for cancer research. In: Tsihrintzis, G.A.,
Virvou, M., Sakkopoulos, E., Jain, L.C. (eds.) Machine Learning Paradigms. LAIS, vol. 1,
pp. 27–65. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-15628-2_3
16. Wang, D., Zhang, Y., Zhao, Y.: LightGBM: an effective miRNA classification method in
breast cancer patients. In: Proceedings of the 2017 International Conference on Computational
Biology and Bioinformatics - ICCBB 2017 (2017). https://doi.org/10.1145/3155077.3155079
17. Johnson, N.T., Dhroso, A., Hughes, K.J., Korkin, D.: Biological classification with RNA-SEQ
data: can alternatively spliced transcript expression enhance machine learning classifiers?
RNA 24(9), 1119–1132 (2018). https://doi.org/10.1261/rna.062802.117
18. Alharbi, R.: Breast cancer gene expression profiles (METABRIC). Kaggle: Your Machine
Learning and Data Science Community, 27 May 2020. https://www.kaggle.com/raghadalh
arbi/breast-cancer-gene-expression-profiles-metabric
19. Breiman, L.: Random forests. Mach. Learn. 45(3), 5–32 (2001). https://doi.org/10.1023/a:
1017934522171
Utilizing Machine Learning to Predict Breast Cancer 643

20. Ke, G., et al.: Lightgbm: a highly efficient gradient boosting decision tree. Adv. Neural. Inf.
Process. Syst. 30, 3146–3154 (2017)
21. Medicinenet.com. (2022). https://www.medicinenet.com/nature_vs_nurture_theory_genes_
or_environment/article.htm. Accessed 3 Jan 2022
22. How These Common Carcinogens May Be Increasing Your Risk for Cancers. Verywell
Health: (2022). https://www.verywellhealth.com/carcinogens-in-cigarettes-how-they-cause-
cancer-514412#:~:text=A%20carcinogen%20is%20any%20substance,cancer%20helps%
20in%20prevention%20efforts. Accessed 3 Jan 2022
Recognizing Mental States when Diagnosing
Psychiatric Patients via BCI and Machine
Learning

Ayeon Jung(B)

Busan Foreign School, 45 Daecheon-ro 67 beon-gil, Haeundae-gu, Busan, South Korea
[email protected]

Abstract. A psychiatric disorder is any disorder that interferes with a person's
thoughts, emotions, or behavior, including anxiety disorders and depressive
disorders. In order to diagnose these disorders, patients are often required to go
through a series of diagnostic tests that demand concentration for accurate results.
This paper aims to lower the rate of misdiagnoses that result from a lack of
concentration in patients by developing an algorithm that recognizes mental states
using 989 columns of EEG brain wave data. These data columns were passed through
a train-test split function and analyzed using different classification models:
Decision Tree, Logistic Regression, Random Forest, Gradient Boosting, Adaptive
Boosting, K-neighbors, LGBM, and XGB. The control experiment analyzed the raw
dataset, the second and third experiments applied the feature extraction algorithms
PCA and ICA respectively, and the fourth experiment used correlation matrix
analysis to produce accuracy scores. The highest accuracy score of 98.19% was
produced by the LGBM Classifier in the control experiment, and the most efficient
feature selection method was PCA. The best result on PCA-processed data was an
accuracy score of 87.7% using the random forest classification model, while
correlation matrix analysis yielded an accuracy score of 80.04% using the LGBM
classifier. These findings will allow psychologists and psychiatrists to provide
methods that help patients answer all questions at a sustained level of sufficient
concentration, ultimately allowing a higher percentage of proper diagnoses to be
made.

Keywords: Concentration · Mental state · Diagnosis · EEG · Machine learning

1 Introduction
1.1 Background

A psychiatric disorder is any disorder that interferes with a person’s thoughts, emotions,
or behavior, including anxiety disorders, and depressive disorders. Physical exams and
lab tests are conducted [1] to diagnose these disorders. When people come to the psychi-
atrist for the first time, they fill out a few simple questionnaires, such as how long they’ve
had their symptoms, whether they have a personal or family history of mental health

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 644–655, 2023.
https://doi.org/10.1007/978-3-031-18461-1_42
concerns, and whether they’ve received any psychiatric therapy. These questionnaires
help psychiatrists understand the basic information about the patient’s condition [2].
These psychiatric evaluations usually take about 30 to 90 min; therefore, concentrating
during the entire evaluation is vital for accurate results [3].
A Brain-Computer Interface (BCI) is a computer-based system that allows direct
communication between a brain and an external device. It “acquires brain signals, ana-
lyzes them, and translates them into commands that are relayed to an output device to
carry out a desired action” [4] and is “often aimed at assisting, augmenting or repairing
human cognitive or sensory-motor functions” [5]. Research on BCIs has helped physi-
cally disabled patients by allowing for advancements in complex control over cursors,
prosthetics, wheelchairs, and other devices. It also opened up new possible methods of
neurorehabilitation for patients who suffer from strokes or other nervous system disor-
ders [6]. The field of BCI is rapidly growing: BCI market revenue in the United
States, by application, rose from 127.9 million to 354.3 million U.S. dollars over
the 10 years from 2012 to 2022, as shown in Fig. 1 [7].
Specifically, interest in electroencephalographic (EEG) based BCI approaches has
increased as "recent technological advances such as wireless recording, machine
learning analysis, and real-time temporal resolution" have developed [8].
Electroencephalography is "a medical imaging technique that reads scalp electrical
activity generated by brain structures" [5]. The EEG displays signals that are divided
into several bands; "The most commonly studied waveforms include delta (0.5 to 4 Hz);
theta (4 to 7 Hz); alpha (8 to 12 Hz); sigma (12 to 16 Hz) and beta (13 to 30 Hz)" [9].
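The band boundaries above can be applied with a standard Butterworth band-pass filter. The sketch below is an illustration only, not part of any cited study's pipeline; the 250 Hz sampling rate and the synthetic two-component signal are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(signal, low_hz, high_hz, fs, order=4):
    """Zero-phase Butterworth band-pass filter for a single EEG channel."""
    nyquist = fs / 2.0
    b, a = butter(order, [low_hz / nyquist, high_hz / nyquist], btype="band")
    return filtfilt(b, a, signal)

fs = 250                      # assumed sampling rate in Hz
t = np.arange(0, 2, 1 / fs)   # two seconds of samples
# Synthetic channel: a 10 Hz alpha component mixed with a 2 Hz delta component
raw = np.sin(2 * np.pi * 10 * t) + np.sin(2 * np.pi * 2 * t)
alpha = bandpass(raw, 8, 12, fs)  # keeps the alpha band, suppresses delta
```

After filtering, `alpha` retains the 10 Hz component while the 2 Hz delta component is strongly attenuated.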
Machine learning and deep learning are branches of artificial intelligence (AI) that
focus on learning automatically through experience. They are state-of-the-art
algorithms and have been applied to various fields including science, healthcare,
manufacturing, education, financial modeling, policing, and marketing [10].

Fig. 1. Brain computer interface (BCI) market revenue in the United States from 2012 to 2022,
by application
1.2 Purpose
This study aims to find ways to lower the prevalence of misdiagnoses that occur due to
patients’ experiencing poor concentration during their psychiatric evaluation. By analyz-
ing different mental states presented in the form of EEG brainwave signals, it becomes
possible to determine who is verifiably concentrated as EEG signals are an objective
indicator. Recognizing the mental state of patients using EEG brain wave signals will
allow psychiatrists and psychologists to receive accurate questionnaire answers and make
proper diagnoses instead of relying on self-reported claims of patients’ concentration
levels.
An algorithm-based model was developed in order to diagnose mental states using
EEG data. A total of four experiments were conducted with one control experiment
and three test experiments. The control experiment ran the raw data through machine
learning models to classify the mental states of participants. The first and second experi-
ments applied either the Principal Component Analysis (PCA) technique or Independent
Component Analysis (ICA) technique to the data. Both modified data sets were then run
through the machine learning classification models. The third experiment used a
correlation matrix to determine the six columns of data with the strongest
correlations and applied them to the classification models.
The findings of this study provide a method for improving the rate of proper
diagnoses by allowing psychiatrists and psychologists to recommend a break or a
concentration-enhancing exercise to patients who are unable to concentrate during
the diagnostic tests.
The rest of this paper proceeds as follows: the literature review section covers
related research on this topic; the methods and materials section introduces the
various algorithms and the workflow; the result section covers the results of each
experiment; the discussion states the principal findings; and the conclusion
summarizes the research.

2 Literature Review
Kaczorowska et al. investigated the algorithms for removing EEG signal artifacts by
utilizing ICA and PCA in their experiment. The artifacts included eye blinks, speaking,
and hyperventilation. The experiment's subjects consisted of twenty people of similar
ages, and each of them entered the silent testing place with artificial lighting. After
entering the testing room, subjects were asked to carry out standard actions with the aim
of collecting resting-state data, cognitive activity action data, and noise data. Mitsar EEG
201 was utilized to record the EEG data from the subjects with a frequency of 500 Hz.
PCA and ICA are both factor analysis algorithms, based on rotation of the coordinate
system and on linear representation of non-Gaussian data, respectively. Both were
applied to the gathered dataset, and the experiment found that PCA surpassed ICA:
the computation speed of PCA was faster and it was less demanding to use, which
could lead to greater use of PCA for removing artifacts in upcoming research [11].
Wang et al. examined the relationship between emotional state and EEG data via
machine learning algorithms. They utilized some movie clips obtained from Oscar films
as a dataset to trigger the subjects' emotions. Subjects consisted of six right-handed
people, who were 18–25 years old, half male and half female. For the
preprocessing stage, extracted EEG signals were down-sampled to 200 Hz, EMG and
electro-oculogram were deleted, and then split into same-length epochs. Furthermore, the
power spectrum feature, wavelet feature, and nonlinear dynamical feature were investi-
gated for feature extraction. PCA, LDA, and correlation-based feature selector were then
applied to the extracted features from the previous stage. They attempted to figure out
the best feature selection and feature extraction methods through various trials.
Finally, the SVM classifier yielded an accuracy score of 91.77% after the feature
dimension was reduced to 30 via the LDA algorithm [12].
Mahajan et al. carried out research to distinguish normal and abnormal subjects, i.e.,
petit mal epileptic patients, via EEG signals. Preprocessed EEG signals were utilized
as input variables for a neural network classifier, and the results were then compared
with an SVM classifier. DWT sub-bands of EEG signals were used to decompose the EEG
into time-frequency representations, and coefficients of the dataset were calculated
as statistical features. Two different feature extraction methods, PCA and ICA, were
applied. The ANN and SVM classifiers then yielded accuracy, sensitivity, and
specificity scores. Though both models showed similar accuracy and specificity
scores, it was concluded that SVM as a classifier outperformed the ANN, as the
PCA-ANN combination showed a sensitivity score of 62.93% while the PCA-SVM
combination achieved a sensitivity score of 96.15% [13].
Cheon et al. conducted research on developing an objective diagnostic kit for
olfactory impairment. EEGLAB software, which is based on MATLAB, was used for data
preprocessing. EEG data were first downsampled to 256 Hz for each channel, and then
alpha, beta, theta, and gamma waves were extracted. Lastly, the ICA algorithm was
applied to remove eye blink artifacts and movements. Various machine learning and
deep learning algorithms were utilized with the aim of diagnosing the disease, and
the catboost classifier yielded the highest accuracy score of 87.56%. Furthermore,
a feature importance plot was produced to identify the most important features for
diagnosis, and it was concluded that the Cz-gamma and Pz-gamma waves were the most
significant ones. Their finding was meaningful in that they developed objective
diagnosis software for olfactory impairment, though a limitation was that the number
of patients in the experiment was quite small, so more patient data would be needed
for accurate analysis [14].
By analyzing the related research, we identified several methods that are used
effectively in the fields of BCI and machine learning. However, some of these
studies, especially [12], did not propose a real-life application of their models.
Therefore, our paper concentrates on both the performance of the models and the
real-life application of the proposed algorithms.

3 Methods and Materials


3.1 Data Description
Brainwave data was collected from the sample population using a Muse EEG headband
on the TP9, AF7, AF8, and TP10 EEG placements via dry electrodes. Each participant
was recorded for 60 s per state (relaxed, concentrating, and neutral) after being
exposed to three different stimuli. The dataset was then processed with statistical
feature extraction, creating 2479 rows and 989 columns of data; Fig. 2 shows
visualizations of the EEG data [15].

Fig. 2. Visualization of the EEG dataset based on the labels: concentrating, relaxed, neutral

3.2 Light Gradient Boosting Machine


A light gradient boosting machine (LGBM) is based on gradient boosting, an ensemble
algorithm used for classification and regression. As gradient boosting needs labels
in order to perform its task, the algorithm belongs to supervised learning within
the field of machine learning. The original gradient boosting method, however, had
limitations in computation speed and memory usage. LGBM was invented with the aim of
overcoming those shortcomings, introducing two novel techniques: gradient-based
one-side sampling (GOSS) and exclusive feature bundling (EFB). GOSS keeps the data
instances with high gradients and randomly samples those with low gradients, which
preserves high information gain while reducing the amount of data. EFB is highly
efficient especially with One-Hot encoding. String datatypes must be converted to
integer datatypes in order to train the LGBM, so the two representative algorithms,
Label Encoding and One-Hot encoding, are utilized. One-Hot encoding creates a sparse
matrix consisting of only 1s and 0s, which consumes a great deal of RAM as the
dimension of the dataset grows. EFB bundles the sparse features derived from One-Hot
encoding, giving the dataset a lower dimension than before. Through these two novel
approaches, LGBM reduces memory usage and increases computation speed while
maintaining the performance of the gradient boosting algorithm [16].

3.3 Random Forest


The random forest algorithm consists of decision trees and belongs to the ensemble
algorithms of machine learning. It utilizes the bagging method, which draws bootstrap
samples from the entire dataset, trains a tree on each sample, and aggregates their
outputs to yield the final result; the bootstrapping step samples with replacement,
allowing redundancy. Random forest can be easily used through the Scikit-learn
library, and several hyperparameters can be optimized. N_estimators, max_features,
max_depth, max_leaf_nodes, min_samples_split, and min_samples_leaf are the
representative hyperparameters, referring respectively to the number of decision
trees in the model, the number of features considered at each split, the maximum
depth of each tree, the maximum number of leaves per tree, the minimum number of
samples required to split a node, and the minimum number of samples required to form
a leaf [17]. The overall structure of the random forest can be found in Fig. 3.

Fig. 3. The conceptual structure of the random forest
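The hyperparameters listed above map directly onto scikit-learn's `RandomForestClassifier` constructor. The sketch below uses a synthetic dataset and illustrative, untuned values for each parameter.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the EEG feature matrix
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# The hyperparameters named in the text, with illustrative (untuned) values
rf = RandomForestClassifier(
    n_estimators=200,      # number of decision trees in the forest
    max_features="sqrt",   # features considered at each split
    max_depth=10,          # maximum depth of each tree
    max_leaf_nodes=50,     # maximum number of leaves per tree
    min_samples_split=2,   # minimum samples required to split a node
    min_samples_leaf=1,    # minimum samples required at a leaf
    random_state=0,
).fit(X_tr, y_tr)
acc = rf.score(X_te, y_te)
```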

3.4 Workflow of the Experiment

For the control experiment, all 989 columns of EEG signal data were passed through
the train-test split function and analyzed through eight classification models
(Decision Tree, Logistic Regression, Random Forest, Gradient Boosting, Adaptive
Boosting, K-neighbors, LGBM, XGB) without further processing to produce accuracy
scores. The first test experiment condensed the data into five columns using the
PCA feature extraction algorithm; the extracted data were then processed through the
same classification models as in the control experiment to produce accuracy scores.
The second test experiment condensed the data into five columns using another
feature extraction algorithm, ICA. The third experiment extracted six column
features from a correlation matrix analysis; these data were analyzed through the
same classification models. Figure 4 shows the overall process of the experiment.
Fig. 4. The description of the overall experimental workflow
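The split-then-compare loop described above can be sketched with scikit-learn. The dataset here is a synthetic placeholder for the real 2479 × 989 EEG feature matrix, and the LGBM and XGB classifiers are omitted because they require the `lightgbm` and `xgboost` packages.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)

# Placeholder for the real EEG feature matrix with three mental-state classes
X, y = make_classification(n_samples=500, n_features=50, n_informative=10,
                           n_classes=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# LGBM and XGB are omitted; they need the lightgbm / xgboost packages
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
    "Adaptive Boosting": AdaBoostClassifier(random_state=1),
    "K-neighbors": KNeighborsClassifier(),
}
scores = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
```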

4 Result

4.1 Without Feature Extraction


In the control experiment, the Decision Tree Classifier, Logistic Regression, Random
Forest, Gradient Boosting Classifier, Adaptive Boosting Classifier, K-neighbors
Classifier, LGBM Classifier, and XGB Classifier yielded accuracy scores of 89.31%,
76.21%, 96.98%, 97.58%, 77.42%, 90.12%, 98.19%, and 97.18% respectively, as shown
in Fig. 5.

Fig. 5. Bar graph showing the comparison of the accuracy score from machine learning models
4.2 Feature Extraction: PCA

A visualization of the data before and after running it through PCA for the second
experiment is shown in Fig. 6. The 989 columns of data were reduced to five columns,
and the graph shows that the reconstructed data was an accurate representation of
the original data.

Fig. 6. Visualization of the reconstructed data via PCA
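The reduce-then-reconstruct step can be sketched with scikit-learn's `PCA`; the feature matrix below is a synthetic stand-in for the 989-column EEG data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40))  # stand-in for the 989-column EEG feature matrix

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)                    # shape (100, 5)
X_reconstructed = pca.inverse_transform(X_reduced)  # back to shape (100, 40)
explained = pca.explained_variance_ratio_.sum()     # variance kept by 5 components
```

Comparing `X_reconstructed` against `X`, as in Fig. 6, shows how much of the original signal the five components retain.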

In the first test experiment, the Decision Tree Classifier, Logistic Regression,
Random Forest, Gradient Boosting Classifier, Adaptive Boosting Classifier,
K-neighbors Classifier, LGBM Classifier, and XGB Classifier yielded accuracy scores
of 79.84%, 82.06%, 87.7%, 86.09%, 68.35%, 86.09%, 87.1%, and 85.08% respectively,
as shown in Fig. 7.

Fig. 7. Bar graph showing the comparison of the accuracy score from PCA + machine learning
Models
4.3 Feature Extraction: ICA

A visualization of the data before and after running it through ICA for the third
experiment is shown in Fig. 8. The 989 columns of data were reduced to five columns,
and the graph shows the reconstructed data. The ICA-extracted data is a less
accurate representation than the PCA-extracted data.

Fig. 8. Visualization of the reconstructed data via ICA
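The ICA counterpart can be sketched with scikit-learn's `FastICA`. The data below is synthetic: five non-Gaussian sources mixed into 40 observed columns, standing in for the EEG feature matrix.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
# Mix five non-Gaussian sources into 40 observed columns, mimicking
# multi-channel features driven by a few underlying components
S = rng.uniform(-1, 1, size=(100, 5))
A = rng.normal(size=(5, 40))
X = S @ A

ica = FastICA(n_components=5, random_state=0, max_iter=1000)
X_independent = ica.fit_transform(X)           # estimated components, (100, 5)
X_back = ica.inverse_transform(X_independent)  # mapped back to (100, 40)
```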

In the second test experiment, the Decision Tree Classifier, Logistic Regression,
Random Forest, Gradient Boosting Classifier, Adaptive Boosting Classifier,
K-neighbors Classifier, LGBM Classifier, and XGB Classifier yielded accuracy scores
of 81.85%, 74.4%, 88.1%, 87.3%, 56.05%, 86.29%, 87.9%, and 86.09% respectively, as
depicted in Fig. 9.

Fig. 9. Bar graph showing the comparison of the accuracy score from ICA + machine learning
models
4.4 Feature Extraction: Correlation


From the 989 columns of data, those with a correlation greater than 0.5 were first
filtered, and the six data columns with the highest correlation were extracted.
These columns are lag1_std_2, lag1_logcovM_2_2, lag1_freq_649_2, std_2,
logcovM_2_2, and freq_649_2, and their correlations are visualized in the heat map
in Fig. 10.

Fig. 10. Visualizing the correlation matrix from the dataset
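The filter-then-rank selection can be sketched with pandas. The frame below is a toy stand-in (the column names and the numeric label are assumptions, not the study's actual features).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy frame standing in for the EEG feature columns plus a numeric state label
df = pd.DataFrame(rng.normal(size=(200, 8)),
                  columns=[f"feat_{i}" for i in range(8)])
df["label"] = (df["feat_0"] + 0.5 * df["feat_1"] > 0).astype(int)

# Absolute correlation of every feature with the label
corr = df.corr()["label"].drop("label").abs()
strong = corr[corr > 0.5]                    # first filter: |correlation| > 0.5
top6 = corr.sort_values(ascending=False).head(6).index.tolist()  # six strongest
```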

In the third test experiment, the Decision Tree Classifier, Logistic Regression,
Random Forest, Gradient Boosting Classifier, Adaptive Boosting Classifier,
K-neighbors Classifier, LGBM Classifier, and XGB Classifier yielded accuracy scores
of 71.77%, 66.53%, 78.02%, 76.21%, 57.86%, 71.17%, 80.04%, and 75.81% respectively,
as shown in Fig. 11.

Fig. 11. Bar graph showing the comparison of the accuracy score from correlation + machine
learning models
5 Conclusion
5.1 Discussion

This study uses BCI and AI approaches to recognize insufficient concentration levels
while patients are being diagnosed. The algorithm produced very high overall accuracy
scores, the highest being 98.19% using the LGBM Classifier on the raw dataset in the
control experiment. Of the three feature extraction methods, PCA was the most
efficient, though similar in accuracy to ICA. The best result on PCA-processed data
was an accuracy score of 87.7% using the random forest classification model, while
the correlation matrix analysis yielded an accuracy score of 80.04% using the LGBM
classifier. These feature selection methods will be able to produce an accurate
and efficient diagnosis of mental states in future studies with larger datasets.
Unlike previous research [12], the findings of this study can be used to analyze the
brain waves of psychiatric patients while they are completing diagnostic tests.
Doctors will be able to recognize lowered concentration levels and advise their
patients to take a break or do an activity that enhances concentration. Patients
will then answer all questions at a sustained level of sufficient concentration,
ultimately allowing more proper diagnoses to be made.

5.2 Summary/Restatement

This research paper’s purpose was to lower misdiagnosis rates that result from patients
experiencing deficient concentration levels during their psychiatric evaluation. Being
able to recognize faltered concentration using brain wave data will enable psychiatrists
and psychologists to provide methods for patients to regain concentration. The study
input 989 columns of EEG signal data into a train-test split function and analyzed
them using the classification models Decision Tree, Logistic Regression, Random
Forest, Gradient Boosting, Adaptive Boosting, K-neighbors, LGBM, and XGB. The first and
second test experiments condensed the data into five columns each using the feature
extraction algorithm PCA analysis or ICA analysis. The third test experiment extracted
six features using correlation matrix analysis. The highest accuracy score, 98.19%,
was produced by the LGBM Classifier in the control experiment, satisfying the
objective of this paper. The most efficient feature selection method was PCA, though
ICA yielded similar accuracy scores, a finding that could inform future research on
reducing the features of EEG datasets. The proposed model could help patients answer
questions at a sustained level of high concentration, enabling a more precise
diagnosis of psychiatric disorders.

References
1. MAYO CLINIC. https://www.mayoclinic.org/diseases-conditions/mental-illness/diagnosis-
treatment/drc-20374974. Accessed 30 Dec 2021
2. WebMD. https://www.webmd.com/mental-health/mental-health-making-diagnosis.
Accessed 30 Dec 2021
3. J. Flowers Health Institute. https://jflowershealth.com/psychiatric-evaluations/. Accessed 25 Dec 2021
4. Shih, J.J., Krusienski, D.J., Wolpaw, J.R.: Brain-computer interfaces in medicine. Mayo Clin.
Proc. 87(3), 268–279 (2012)
5. Papadelis, C., Braun, C., Pantazis, D., Soekadar, S.R., Bamidis, P.: Using brain waves to
control computers and machines. Adv. Hum.-Comput. Interact. 2013, 1–2 (2013)
6. McFarland, D.J., Daly, J., Boulay, C., Parvaz, M.A.: Therapeutic applications of BCI
technologies. Brain-Comput. Interfaces 4(1–2), 37–52 (2017)
7. Statista. https://www.statista.com/statistics/1015164/worldwide-brain-computer-interface-
market-revenue-by-application/. Accessed 2 Jul 2020
8. Abiri, R., Borhani, S., Sellers, E.W., Jiang, Y., Zhao, X.: A comprehensive review of EEG-
based brain–computer interface paradigms. J. Neural Eng. 16(1), 011001 (2019)
9. Nayak, C.S., Anilkumar, A.C.: EEG Normal Waveforms. StatPearls Publishing, Tampa (2021)
10. Zhang, X., et al.: The combination of brain-computer interfaces and artificial intelligence:
applications and challenges. Ann. Transl. Med. 8(11), 712 (2020)
11. Kaczorowska, M., Plechawska-Wojcik, M., Tokovarov, M., Dmytruk, R.: Comparison of
the ICA and PCA methods in correction of EEG signal artefacts. In: 2017 10th International
Symposium on Advanced Topics in Electrical Engineering (ATEE) (2017)
12. Wang, X.W., Nie, D., Lu, B.L.: Emotional state classification from EEG data using machine
learning approach. Neurocomputing 129, 94–106 (2014)
13. Mahajan, K., Vargantwar, M.R., Rajput, S.M.: Classification of EEG using PCA, ICA and
neural network. Int. J. Eng. Adv. Technol. 1(1), 80–83 (2011)
14. Cheon, M.J., Lee, O.: Detecting olfactory impairment through objective diagnosis: catboost
classifier on EEG data. J. Theor. Appl. Inf. Technol. 99(14), 3596–3604 (2021)
15. Kaggle. https://www.kaggle.com/birdy654/eeg-brainwave-dataset-mental-state. Accessed 23
Dec 2021
16. Ke, G., et al.: Lightgbm: a highly efficient gradient boosting decision tree. Adv. Neural Inf.
Process. Syst. 30, 3146–3154 (2017)
17. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Diagnosis of Hepatitis C Patients via Machine
Learning Approach: XGBoost and Isolation
Forest

Ting Sun(B)

Korea International School, #27 385, Daewangpangyo-ro, Bundang-gu, Seongnam-si, Gyeonggi-do, South Korea
[email protected]

Abstract. Although transmission of Hepatitis C through blood transfusion is
becoming less and less prevalent with the use of anti-HCV tests for blood donors,
the availability and practice of screening remain low in developing countries. This
results in Hepatitis C patients who are unaware of their condition until it worsens
into chronic liver disease, which is diagnosed through more costly or invasive
methods: liver biopsy and radiology scans. Due to these limitations of the current
methods of diagnosis, this study seeks to develop a machine learning model to
diagnose patients with different stages of liver disease: hepatitis C, liver
fibrosis, and cirrhosis. In this research, machine learning algorithms were applied
to a dataset containing HCV patient information, and the algorithms were evaluated
for their accuracy and performance in classifying the patients with the proper
diagnosis. Findings from the study indicated that XGBoost classifies patients most
accurately, with an accuracy score of 95.48%, but the other algorithms used had
high accuracy scores as well: the algorithm with the lowest score, Decision Tree,
still reached 92.66%. The second experiment also showed that the Isolation Forest
algorithm could detect and isolate the suspect blood donors in the data with a
relatively high accuracy of 93.22%. As both experiments of the study yielded
machine learning models of high accuracy, the algorithms used can be implemented
into a diagnostic kit for liver disease to be used in developing countries where
accessibility to current diagnosis tools is limited.

Keywords: HCV · Isolation forest · Diagnosis · Machine learning · XG boosting

1 Introduction
1.1 Background
Hepatitis C is a liver infection caused by the hepatitis C virus (HCV). It causes
liver inflammation and, if not treated, may become a chronic liver disease. In fact,
80% of infected individuals become chronic carriers and may develop fibrosis or
cirrhosis, more severe forms of liver damage [1]. Fibrosis is caused by liver
scarring, and high levels of it can lead to cirrhosis, which requires aggressive
treatment [2]. HCV is spread through
contact with the blood from an infected person, including the sharing of equipment used

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 656–668, 2023.
https://doi.org/10.1007/978-3-031-18461-1_43
for drug injection as well as blood transfusions and organ transplants. However, with
the introduction of anti-HCV tests for blood donors, the rate of transmission of HCV by
blood transfusion has significantly decreased overall [3].
Although this decrease is evident in developed countries, there are still developing
countries where only a limited number of screening programs are available for proper
diagnosis, and blood transfusion remains a significant route of HCV transmission
[4]. Mostly in Asia and Africa, it was found that 31 of 142 "developing" countries
do not undertake any anti-HCV screening, and another 37 screen less than 100% of
donated blood [5]. Furthermore, while HCV screening only requires a blood sample,
later stages of liver disease such as fibrosis and cirrhosis traditionally require
liver biopsy for proper diagnosis. A liver biopsy involves the removal of a small piece
of the liver tissue to be observed under a microscope for damage. Although it has been
the gold standard for diagnosing liver fibrosis, it is being recognized that liver biopsy
has several drawbacks. It has been found that 33.1% of the samples taken for biopsy are
misclassified by at least one grade of the fibrosis stage. Furthermore, despite its diagnos-
tic inaccuracy, the cost is also a major issue for the implementation of liver biopsy [1].
Due to the invasiveness of liver biopsies, noninvasive methods of liver disease diagnosis
like radiology are also available.
However, like liver biopsy, diagnostic imaging such as magnetic resonance imaging
(MRI) or computerized tomography (CT) scans is costly for both the patient and the
hospital, often being limited in availability in developing countries [6]. As a result, a
continuing increase in the rate of acute hepatitis C infections, and the evident healthcare
disparity regarding accessibility to HCV diagnosis and liver disease diagnosis demand
the need for a new, noninvasive, and affordable diagnosis tool. One way this demand can
be fulfilled is the use of machine learning (ML) to detect patterns of liver diseases within
existing HCV, fibrosis, and cirrhosis patient data. Then, the ML model can compare
blood donor data to the detected patterns to make predictions about the condition of the
blood donor as well as to further classify them with Hepatitis C, Fibrosis, or Cirrhosis.
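The classify-into-stages idea can be sketched as a multi-class model. XGBoost itself requires the `xgboost` package, so scikit-learn's `GradientBoostingClassifier` is used here as a stand-in; the features and four-class target are synthetic placeholders for real donor lab values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# GradientBoostingClassifier stands in for XGBoost so the sketch
# runs without the xgboost package
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for donor lab values; classes 0-3 play the roles of
# blood donor, hepatitis C, fibrosis, and cirrhosis
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=4, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

clf = GradientBoostingClassifier(random_state=7).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```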

1.2 Objective
The objective of this research is to develop a machine learning model that can accurately
classify Hepatitis C, liver fibrosis, and cirrhosis patients using preexisting
medical records. There are currently several approaches to the diagnosis of Hepatitis C and
chronic liver diseases. For Hepatitis C, the most common system of diagnosis is utilizing
the anti-HCV test and the PCR test [7]. However, Hepatitis C often goes undetected
due to its asymptomatic nature; it is frequently only when the condition has
worsened into a chronic liver disease that patients seek medical treatment. At that
point, further screening is required to diagnose them with the proper condition
(e.g. liver fibrosis, cirrhosis). This further screening includes radiology scans and liver
biopsies which are more costly and, in the case of liver biopsies, more invasive. With
this, it’s even more difficult for developing countries to have people test for liver fibrosis
and cirrhosis.
However, machine learning can be an alternative to liver disease diagnosis since
it only requires patient data from simple blood tests. In this study, multiple machine
learning models are applied to preexisting data of patient records from past blood tests
or medical checkups. Having a machine learning model to predict the state of the patient
reduces the need for costly or invasive methods of diagnosis. Furthermore, the suspect
blood donor information of the dataset can be analyzed to create a more accurate model
for disease prediction, and this approach could be useful when the given dataset is not
accurate enough. The rest of this paper is organized as follows: The prior work on
diagnosing hepatitis C is discussed in the Literature review section. The Materials and
Methodologies, including the proposed model, and datasets for diagnosing hepatitis C,
are stated in Sect. 3. In the Result section, the results of the proposed model are
explained. Lastly, in the Conclusion section, the principal findings and a summary of
this paper are presented.

2 Literature Review

Akella et al. aimed to use machine learning algorithms to predict the extent of fibrosis in
patients with Hepatitis C as an alternative to the current invasive methods of diagnosis.
The study used a dataset from the machine learning repository of the University of
California Irvine, which contained patient data from Egyptian hospitals. After selecting
six relevant features from the dataset, the authors trained different machine learning
algorithms on it for disease detection. The evaluation of the algorithms was done
with four parameters: accuracy, sensitivity, specificity, and the area under the receiver-
operating curve (AUROC). Six of the nine ML algorithms had evaluation parameters all
in the range of 0.60 to 0.96 for Experiments A and B and 0.34 to 0.64 for Experiment
C. Among the nine algorithms they used, XGBoosting had the best performance overall
with an accuracy of 0.81, AUROC of 0.84, a sensitivity of 0.95, and a specificity of 0.73
[8].
The purpose of Abd El-Salam et al.’s research was to develop a more efficient
technique for disease diagnosis through classification analysis, using machine learn-
ing techniques. The study focused on the diagnosis of Esophageal Varices, a common
side-effect of liver cirrhosis. The dataset used in the study was obtained from fifteen
different centers in Egypt between 2006 and 2017 and included twenty-four individual
clinical laboratory variables. After the dataset was cleaned for missing values, different
machine learning algorithms were applied to the data in order to predict esophageal
varices. To evaluate the performance of the classification algorithms, the sensitivity,
specificity, precision, Area under ROC (AUC) analysis, and the accuracy of the models
were calculated. Through the evaluation, the Bayesian Net algorithm was found to per-
form more efficiently and effectively than the other algorithms: 74.8% for the area under
the ROC curve and 68.9% for the accuracy. The paper concluded that the studied ML
models could be used as alternatives to gastrointestinal screening, the current method of
esophageal varices testing, for cirrhotic patients [9].
Chicco and Jurman analyzed and processed electronic health records (EHRs) of patients
using machine learning classifiers. The EHRs of 540 healthy controls and 75
patients diagnosed with hepatitis C (total of 615 subjects) collected at Hannover Medical
School were considered to be the discovery cohort while another independent dataset
containing EHRs of 123 hepatitis C patients from Kanazawa University in Japan was
considered to be the validation cohort. Both the discovery cohort and validation cohort
Diagnosis of Hepatitis C Patients via Machine Learning Approach 659

had missing data that were replaced through Predictive Mean Matching (PMM). The
study performed binary classification analysis (for the discovery cohort) and regression
analysis (for the validation cohort) using Linear Regression, Decision Trees, and Random
Forests. The feature importance was also analyzed to investigate which clinical features
of the discovery cohort dataset were the most predictive of the status of the subject.
As a result, Random Forests achieved the top results for both binary classification and
regression: R2 of +0.765 and MCC of +0.858 [10].
Ahammed et al. aimed to classify the state of a patient’s liver condition by using
machine learning algorithms. The dataset analyzed was collected from the University of
California, Irvine machine learning repository. It contained related data of almost 1385
HCV-infected patients, all of them classified by a stage of liver fibrosis. Using the
Synthetic Minority Oversampling Technique (SMOTE), the data was preprocessed and balanced.
Feature selection was also applied to the data to identify the most relevant features of the
dataset to improve the quality of the model’s performance. Then, the preprocessed data
was applied to various classifiers to determine which model is the best for liver condition
classification. As a result, KNN showed better outcomes than other classifiers with an
accuracy of 94.40%. Although there were limitations to their study in terms of the data
analysis, it was concluded that KNN could be a potential machine learning model for
HCV patient classification [11].

3 Methods and Materials

3.1 Data Description

The dataset was collected from the University of California Irvine Machine Learn-
ing Repository, as shown in Fig. 1. It contained 14 attributes and 615 instances. All
attributes except Category and Sex were numerical, with ten attributes being the labo-
ratory data: Albumin (ALB), Alkaline Phosphatase (ALP), Alanine Amino-Transferase
(ALT), Aspartate Amino-Transferase (AST), Bilirubin (BIL), Choline Esterase (CHE),
Cholesterol (CHOL), Creatinine (CREA), Gamma Glutamyl-Transferase (GGT), and
Protein (PROT). The Category attribute contained categorical data: Blood Donor, sus-
pect Blood Donor, and the progress of Hepatitis C (Hepatitis C, Fibrosis, Cirrhosis), and
this could be found in Fig. 2. Likewise, the Sex attribute also contained categorical data:
m (male) and f (female) [12].

Fig. 1. Overall contents of the dataset from the UCI repository

Fig. 2. Pie chart showing the values from the “Category” column

3.2 XG Boosting
XG Boosting is an abbreviation for eXtreme Gradient Boosting, an ensemble machine
learning algorithm used for both classification and regression. Since the original
gradient boosting algorithm is vulnerable to overfitting and slow to train, XG Boosting
was introduced to overcome those shortcomings. As XG Boosting is developed from gradient
boosting, the main structure of the algorithm is quite similar: it is composed of
decision trees based on the Classification And Regression Trees (CART) algorithm,
introduced by Breiman in 1984. To counter overfitting, two methods were proposed:
shrinkage and column subsampling. Shrinkage means that after each stage of tree
boosting, the newly added weights are scaled down by a factor. Subsampling refers to a
technique for reducing data size by training on only a portion of the original data,
and is also used in the random forest algorithm. Furthermore, the subsampling technique
accelerates parallel computation [13]. Figure 3 shows the overall process of the XG
Boosting algorithm.

Fig. 3. The overall process of the XG boosting algorithm
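The shrinkage and subsampling ideas described above can be sketched in code. This is not the paper's implementation: as a dependency-light stand-in for the XGBoost library, the sketch uses scikit-learn's GradientBoostingClassifier, whose `learning_rate` plays the role of shrinkage, `subsample` performs row subsampling, and `max_features` performs column subsampling; the synthetic data is purely illustrative.

```python
# Sketch (not the paper's code): shrinkage and subsampling in gradient boosting,
# shown with scikit-learn as a stand-in for XGBoost, on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,   # shrinkage: scale down each new tree's contribution
    subsample=0.8,       # row subsampling of the training data per tree
    max_features=0.8,    # column subsampling to reduce overfitting
    random_state=0,
)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))   # test-set accuracy
```

Lower `learning_rate` values typically need more trees but generalize better, which is exactly the overfitting/speed trade-off the section describes.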

3.3 Isolation Forest


Isolation forest is a machine learning algorithm used for anomaly detection and is based
on the decision tree algorithm. It is an unsupervised learning method, applicable when
the dataset has no target or label. Conventional anomaly detection algorithms primarily
use the training data to estimate the distance or density between data points in order
to characterize the normal class [14]. The isolation forest instead starts from two
assumptions: 1. anomalies are few, so they require fewer partitions to isolate and end
up on shorter paths in a tree structure, and 2. instances with distinctive
attribute-values are more likely to be separated in early partitioning. Because the
isolation forest does not use density or distance measures to detect anomalies, it
avoids a significant computational cost; it runs in linear time with a small constant
and low memory usage. Furthermore, it can handle very large, high-dimensional datasets
with many irrelevant features. These advantages allow the isolation forest to overcome
the shortcomings of conventional anomaly detection algorithms [15]. Figure 4 displays
the overall procedure of the Isolation Forest algorithm.

3.4 Experimental Design


Firstly, the Hepatitis C dataset needed to be preprocessed before any classification could
be done. As the features Category and Sex are categorical values and not numerical

Fig. 4. The overall process of the isolation forest algorithm

values, Label Encoder was used to encode the categorical values into values between 0
and n − 1 (n being the number of classes in the category). Additionally, the data was
checked for any null values, and those found were removed from the data. With the
appropriate format of data, the data was split into train and test subsets. The training
subset is used to fit the machine learning model while the test subset is used to evaluate
the machine learning model used with the training set. Furthermore, it was noticed that
the values of the dataset differ greatly in range; for example, the CHOL values are in
single digits while some CREA values exceed one hundred. The problem with this is that
some features might affect the model more than others. Therefore, a standard scaler was
applied to the dataset to standardize the feature values so that the relative size of
the values does not interfere with the model [16].
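The preprocessing steps just described might look like the following sketch. The DataFrame is a made-up stand-in whose column names only echo the HCV dataset; the values are not real patient data.

```python
# Sketch of the described preprocessing: null removal, label encoding,
# train/test split, and standard scaling (stand-in data, not the real dataset).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.DataFrame({
    "Category": ["Blood Donor", "Hepatitis", "Cirrhosis",
                 "Blood Donor", "Fibrosis", "Blood Donor"],
    "Sex": ["m", "f", "m", "f", "m", "f"],
    "CHOL": [3.2, 4.4, 4.1, 5.0, None, 3.9],
    "CREA": [106.0, 74.0, 98.0, 80.0, 91.0, 77.0],
})
df = df.dropna()                                 # remove rows with null values
for col in ["Category", "Sex"]:                  # encode categoricals to 0..n-1
    df[col] = LabelEncoder().fit_transform(df[col])

X, y = df.drop(columns="Category"), df["Category"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)

scaler = StandardScaler().fit(X_tr)              # fit on the training split only
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)
print(X_tr_s.mean(axis=0))                       # ~0 per feature after scaling
```

Fitting the scaler on the training split alone keeps test-set statistics from leaking into the model.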
After the data has been preprocessed, different classification algorithms were fitted
onto the train set. The algorithms used were Logistic Regression, Decision Tree, LGBM,
XG Boosting, Gradient Boosting, and Random Forest. To evaluate the performance of
these ML models, the test set was used to find the accuracy scores of the models.
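A minimal sketch of this model-comparison step is shown below on synthetic data; it is not the paper's code, and LightGBM and XGBoost are omitted here to keep the example dependency-free.

```python
# Sketch: fit several classifiers on one split and compare test accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=12, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```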
Aside from classification, the anomalies in the data were also detected to improve the
performance of the models. This anomaly detection was done through the Isolation
Forest algorithm where the blood donors and the suspect blood donors were detected
through data isolation. To analyze the performance of the anomaly detection algorithm,
its probability of detecting the blood donors and its probability of detecting the suspect
blood donors were found for both the train and test set. The overall process of the
proposed experiment is explained in Fig. 5.

4 Result
4.1 Experiment for Classification of Patients
Upon trying several ML algorithms to determine which ML algorithm best processes the
HCV dataset, the accuracy scores of each algorithm were found. The lowest among the
six algorithms used was Decision Tree with an accuracy score of 92.66. Random Forest
also had a relatively low accuracy score of 93.79. Both LGBM and Gradient Boosting
algorithms had accuracy scores of 94.35. The algorithm with the highest accuracy score

Fig. 5. Experimental design of this research: two stages

among the six was XG Boosting with an accuracy score of 95.48. Figure 6 depicts the
accuracy scores of multiple algorithms.

Fig. 6. Bar graph showing the accuracy score from various machine learning classifiers

The feature importance of each attribute in the dataset was determined to visualize
which attributes were affecting the ML algorithm’s classification the most. In general,
the laboratory data attributes seemed to have high feature importance scores with the
exception of Age having a slightly higher score than Cholesterol (CHOL). The feature
with the highest importance score was Aspartate Amino-Transferase (AST), indicating
the level of this enzyme in a person greatly affected the algorithm’s choice of classi-
fication. Alanine Amino-Transferase (ALT) had the second highest feature importance
score while Albumin (ALB) had the third highest feature score. On the other hand, the
gender of the person had very little effect on the ML algorithm as indicated by the low

importance score of the attribute Sex. Figure 7 shows the feature importance score from
the variables.

Fig. 7. Bar graph showing the feature importance score from the variables
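Importance scores such as those in Fig. 7 can be read directly from a fitted tree ensemble. The sketch below uses a random forest on synthetic data; the feature labels are illustrative only and do not correspond to the real measurements.

```python
# Sketch: extracting per-feature importance scores from a fitted tree ensemble
# (synthetic data; the labels only mimic the dataset's attribute names).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=6,
                           n_informative=3, random_state=2)
names = ["AST", "ALT", "ALB", "CHOL", "Age", "Sex"]  # illustrative labels

clf = RandomForestClassifier(random_state=2).fit(X, y)
for name, score in sorted(zip(names, clf.feature_importances_),
                          key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")   # scores sum to 1 across features
```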

4.2 Experiment for Detecting Specific Case

To visualize the data, which consists of inlier (blood donor) and outlier (suspect
blood donor) instances, the dimensionality of the data had to be reduced due to the
large number of features. This dimensionality reduction was done through the Principal
Component Analysis (PCA) technique, a statistical technique widely used in machine
learning for this purpose. Once the data had been reduced to three dimensions, the
inliers and outliers could be clearly visualized (Fig. 8 below). The purpose of this
data visualization was to detect the outliers present in the data.
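The reduction step can be sketched as follows; the 12-feature input is an assumed stand-in for the scaled laboratory data.

```python
# Sketch: project high-dimensional data to three principal components,
# the coordinates used for a 3-D scatter plot (stand-in data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(3)
X = rng.normal(size=(200, 12))          # stand-in for the scaled features

pca = PCA(n_components=3)
X3 = pca.fit_transform(X)               # 3-D coordinates for plotting
print(X3.shape, pca.explained_variance_ratio_.round(3))
```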

Fig. 8. 3D plot showing the inliers (blood donor) and outliers (suspect blood donor)

Fig. 9. Confusion matrix plot showing the result from the isolation forest (training set)

Using the Isolation Forest algorithm on the training set, the probability of detecting
the blood donors in the dataset was 1.0, while the probability of detecting the suspect
blood donors was 0.9459; the accuracy of this unsupervised anomaly detection model on
the training set was thus 94.66%. On the test set, the probability of detecting blood
donors was also 1.0, while the probability of detecting suspect blood donors was 0.9314,
and the accuracy of the unsupervised anomaly detection model was 93.22%.
The confusion matrix of the results could be found in Fig. 9 and Fig. 10.

Fig. 10. Confusion matrix plot showing the result from the isolation forest (test set)

5 Conclusion

5.1 Discussion

In the first experiment of this study, it was found that XGBoosting allowed for the
most accurate model for the HCV data. Specifically, the accuracy score of the model
was 95.48, the highest among all classifiers. As the model’s accuracy was very high,
it can be a practical diagnostic tool for classifying actual patients with the three stages
of liver disease: Hepatitis C, liver fibrosis, and cirrhosis. Furthermore, this machine
learning model can be developed into a noninvasive and cost-effective diagnosis kit in
developing countries. Unlike MRI and CT scans that are often only available to highly
populated regions of a country (due to high cost and low availability), machine learning
models can be more easily distributed across the country at a lower cost. Additionally,
while patients have to pay high prices for radiological scans, the ML model only requires
the electronic records of patients, which can simply be obtained from the liver function
test, or a blood test that shows enzyme or protein levels.
In the second experiment, outliers such as suspect blood donors were detected in the
dataset, which makes the model even more applicable to real-world settings. Real
datasets, especially in developing countries, may contain unreliable data that interfere
with classification; through anomaly detection, the model can detect and isolate such
unreliable data to improve its performance.
For further research, aside from the features in the dataset used in the study, the race
of the patient can also be added as a feature. Although race or ethnicity is not proven
to be directly related to the liver disease itself, there are significant differences in the
prevalence of Hepatitis C among different races and ethnicities [17]. Therefore, more
research can be done with a similar method but with a new dataset that contains the
race/ethnicity information of the patients. Through this further research, the ML model
for diagnosis can be improved for better accuracy, and the system could also be deployed
on a Raspberry Pi.

5.2 Summary/Restatement

This study aimed to develop a machine learning model that can diagnose a patient
with Hepatitis C, liver fibrosis, or cirrhosis solely by analyzing the medical records of
patients. In order to develop the model, several ML classification algorithms were applied
to the HCV dataset. The accuracy score of each algorithm was found to determine the
performance of the models; as a result, XGBoosting had the best performance with an
accuracy score of 95.48. Because the dataset contained data on suspect blood donors,
the Isolation Forest algorithm was used to detect outliers. The probability of detecting
suspect blood donors was 93.14% with an accuracy of 93.22%. With both XGBoosting
and Isolation Forest having high accuracy scores, it shows that patients can be accurately
classified with the proper stages of liver disease through machine learning. To improve
the model for even more accuracy, data on the race and ethnicity of HCV patients can
be analyzed to account for those attributes.

References
1. Sebastiani, G.: Chronic hepatitis C and liver fibrosis. World J. Gastroenterol. 20(32), 11033–
11053 (2014)
2. Healthline. https://www.healthline.com/health/hepatitis-c-fibrosis-score#fibrosis-score.
Accessed 15 Jan 2022
3. NHS website. https://www.nhs.uk/conditions/hepatitis-c/diagnosis/. Accessed 15 Jan 2022
4. Selvarajah, S., Busch, M.P.: Transfusion transmission of HCV, a long but successful road map
to safety. Antivir. Ther. 17(7 Pt B), 1423–1429 (2012)
5. Prati, D.: Transmission of hepatitis C virus by blood transfusions and other medical
procedures: a global review. J. Hepatol. 45(4), 607–616 (2006)
6. Frija, G., et al.: How to improve access to medical imaging in low- and middle-income
countries? EClinicalMedicine 38, 101034 (2021)
7. Bajpai, M., Gupta, E., Choudhary, A.: Hepatitis C virus: screening, diagnosis, and interpre-
tation of laboratory assays. Asian J. Transf. Sci. 8(1), 19 (2014)
8. Akella, A., Akella, S.: Applying machine learning to evaluate for fibrosis in chronic hepatitis
C. MedRxiv (2020)
9. Abd El-Salam, S.M., et al.: Performance of machine learning approaches on prediction of
esophageal varices for Egyptian chronic hepatitis C patients. Inform. Med. Unlocked 17,
100267 (2019)
10. Chicco, D., Jurman, G.: An ensemble learning approach for enhanced classification of patients
with hepatitis and cirrhosis. IEEE Access 9, 24485–24498 (2021)
11. Ahammed, K., Satu, M.S., Khan, M.I., Whaiduzzaman, M.: Predicting Infectious state of
hepatitis C virus affected patient’s applying machine learning methods. In: 2020 IEEE Region
10 Symposium (TENSYMP) (2020)
12. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/HCV+data.
Accessed 17 Jan 2022
13. Chen, T., Guestrin, C.: XGBoost. In: Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (2016)

14. Cheon, M.J., Lee, D.H., Joo, H.S., Lee, O.: Deep learning based hybrid approach of detecting
fraudulent transactions. J. Theor. Appl. Inf. Technol. 99(16), 4044–4054 (2021)
15. Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation forest. In: 2008 Eighth IEEE International
Conference on Data Mining (2008)
16. Ferreira, P., Le, D. C., Zincir-Heywood, N.: Exploring feature normalization and temporal
information for machine learning based insider threat detection. In: 2019 15th International
Conference on Network and Service Management (CNSM) (2019)
17. CDC. https://www.cdc.gov/hepatitis/statistics/2019surveillance/Figure3.6.htm. Accessed 17
Jan 2022
Data Analytics, Viability Modeling
and Investment Plan Optimization of EDA
Companies in Case of Disruptive Technological
Event

Galia Marinova1 , Aida Bitri1 , and Vassil Guliashki2(B)


1 Technical University of Sofia, Kliment Ohridski Boulevard 8, 1000 Sofia, Bulgaria
[email protected]
2 Institute of Information and Communication Technologies – Bulgarian Academy of Sciences,
Sofia, Bulgaria
[email protected]

Abstract. The existence of electronic design automation (EDA) companies
strongly depends on the impact of technological development factors. Disruptive
technological developments (innovations) often lead to mergers and acquisitions
(M&A), the emergence of new start-ups, and the disappearance of those who are
unable to generate innovation and follow the technology improvement within a
tolerable delay. This paper presents two mathematical models. The first one aims
to optimize the investments of EDA companies by maximizing their profit, and the
second one aims to minimize the reaction delay of the companies to an innovation
event. Both models are formulated to improve (extend) the EDA companies' viability
in order to avoid a monopoly state over the market. Based on public data, a
simulation task was formulated for 20 EDA companies. The optimization was
performed by means of MATLAB solvers. The obtained results show that the proposed
models can be useful, and the presented approach could be applied to tasks with
real data.

Keywords: Data analytics · Modeling · Optimization · Investment plan · Viability ·
EDA companies

1 Introduction
The question of the survival and viability of companies has excited many researchers.
There are two main approaches to predicting bankruptcy or optimizing the company’s
investment policy in order to improve the viability of the company: 1) the first is based on
the techniques of discriminant analysis (see for example [16–18]). An important work
in this connection is that by Altman [1]. In this approach, statistical data are used as a
representative sample of two groups of companies: survivors and those that disappeared. Then
a separating surface (in the simplest case a plane) is built to separate the two groups.
The positioning of the separating surface is optimized so that the number of incorrectly
classified companies is minimal. Using the optimal parameters of the separating surface,

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 669–680, 2023.
https://doi.org/10.1007/978-3-031-18461-1_44
670 G. Marinova et al.

predictions can be made with a certain accuracy about which group a new company, not
included in the sample, belongs to, and advice can be given on which parameters the
company needs to improve in order to fall into the group of survivors. 2) The second
approach is the direct one. It is based on the optimization of one or more criteria (objective
functions) and improves the values of factors, which are crucial for the better viability
of the companies.
Important related works applying the direct approach are connected with a devel-
oped theory on viable system modeling. The model of viable systems [2] is developed
to evaluate whether a company or organization is able to survive or not in a rapidly
changing environment/ market. The initial idea was to develop a cybernetic theoretical
model of the brain [2], corresponding to the management of a company. This model is
applied to a real example in the production of steel bars [3]. The author S. Beer seeks to
answer the question: How are systems “capable of self-existence”? The viable system
model (VSM) considers the individual organization as a whole system that exists in
constant balance with its rapidly changing (market) environment. Based on the theory
of cybernetics, Beer came to the conclusion that the model that determines the viability
of any system has five necessary and sufficient subsystems, interactively included in
each organism or organization, which are able to maintain their identity. Because the
theoretical model proved to be too complex [5], a simplified version of the model was
subsequently created and published in the book “Brain of the Firm: A Development
in Management Cybernetics” (see [4]). The simplified model uses neuro-physiological
terminology instead of mathematics. Later, a new version of VSM was developed, called
“The Heart of the Enterprise” [6]. Finally, looking at the application of the VSM, Beer
published a third book, entitled “Diagnostics of the Organization System” [7]. In the
VSM model, five subsystems have been developed, each with its own role, but working
closely with each other. The first three systems refer to the current activities of the orga-
nization, where the first subsystem consists of elements producing the organization. The
fourth system focuses on the future effects of external changes and requirements affect-
ing the organization. The fifth system maintains a balance between current activities and
future external changes and ensures the viability of the organization. All systems interact
with each other and are connected with the dynamically changing environment/market.
With the help of VSM model, it is possible to study the internal and external balance
of the organization/company and to make improvements necessary for its survival. Other
recently published works on this topic are [10, 15].
Dynamic Financial Analysis [12] is used to reveal economic factors and to create
different scenarios in the business modeling of insurance companies [8, 9, 19]. This
approach can be useful to understand the dependencies between the business models
and the external impacts on the company’s profitability. Unfortunately, the real available
economic risks can make the created dynamic models unreliable.
In the specific area of electronic design automation (EDA) the innovations events
have a huge impact on the survival of the companies and the emergence of new start-ups.
There is a gap in the modeling of such a specific environment. The purpose of this paper
is to fill this gap by offering new models in this area. Comparisons with other similar
models are not possible, as the authors do not have data on such developments in the EDA
area. This study is focused on the second-mentioned (direct) approach in the field of EDA

companies. As noted in [14], the first available data for this company type is from the year
1961, i.e. this business sector is 60 years old. Moreover, the existence of an EDA company
is particularly dynamic and depends heavily on many factors related to technological
development. Very important are also the factors influencing the market. A major factor
influencing the viability of EDA companies is the emergence of technological disruptive
events (innovations). The factors influencing the electronic design automation software
market are analyzed in [11].

2 Factors Influencing the EDA-Companies’ Viability


The factors influencing the viability of EDA companies are considered in this section.
Applying data analytics, Marinova and Bitri [13] present an analysis and formalization
of the key factors determining the viability of EDA companies. Very important in this
respect are the investments in research and developments (R&D) and their specific fields
as a precondition for generation of innovations, as well as the inclusion of “talented” peo-
ple in the R&D work and the company’s connections with academia. Resulting factors are
the response time/tolerable delay Dmax for the application of an innovation/technology
improvement by the companies, as well as the reduction or increase of profits of the
companies. Based on statistical data on the disappearance of companies after innovative
events, the authors of [14] have obtained that the tolerable delay Dmax until 1986 is
2 years, and after 1986 - up to now is one year. In case a company reacts to an innovation
event with a delay, greater than Dmax , the correspondent company will disappear. In
addition, in [14] is pointed out that the reaction to an innovation event and Dmax are
a result of the sensing and learning processes in the EDA company, which is related
to attracting and using “talented” people in the company (connections with academia).
Finally, a conclusion is drawn (see [14]) that when the reaction delay to an innovation
event increases, the profit of the EDA company as well as its viability decreases.
Based on data analytics, key factors determining the viability of EDA companies
are selected and two mathematical models for improving the viability of EDA compa-
nies are formulated and tested in this paper. The first model maximizes the profit of
EDA companies, and the second model minimizes the delay Dmax of EDA companies.
The formulated mathematical models are presented in Sect. 3. An illustrative example
(simulation optimization problem) is presented in Sect. 4. In Sect. 5 the results of the per-
formed optimization by means of MATLAB solvers are considered. Finally, conclusions
are drawn in Sect. 6.

3 The Formulated Mathematical Models


The mathematical models formulated in this section aim to improve the viability of EDA
companies. The formulated optimization criteria have a similar effect on viability and
hence do not contradict each other; for this reason they are not combined into a
bi-criteria optimization model. Nor can they be unified into a single criterion, because
money and time are incommensurable.
In this study, a period of 1 year before an innovation event (at the moment te ) is
considered. It is assumed that this innovation event will have a great impact on the

viability of the EDA companies. For this reason, also the period of 1 year after this event
is considered (see Fig. 1) because it is equal to the tolerable delay Dmax .

Fig. 1. The time period considered.

Let N denote the number of EDA companies considered, and let Mi denote the total
investment of company i for the considered period, i = 1, 2, …, N.
Based on the work [14], five fields of possible investments for the EDA companies are
considered here (see Table 1). For each investment type, a return coefficient is
introduced. For example, if the invested amount of type j is 1 and the return coefficient
rj = 1.2, the expected return of this investment after one year is evaluated to be 1.2.
The variables in our model are the proportions k1i, …, k5i of each company i (see
Table 1). In the table below, SW&HW denotes Software and Hardware.

Table 1. Investment types and model parameters

№  Type of investments                                    Proportion of total investment Mi  Return coefficient
1  Development of interfaces                              k1i                                r1 = 1.35
2  Mathematical methods and solvers                       k2i                                r2 = 1.13
3  SW&HW description languages                            k3i                                r3 = 1.15
4  Processing power                                       k4i                                r4 = 1.25
5  Development of models and search for application area  k5i                                r5 = 1.12

Let Boolean variables y1, …, y5 indicate whether investments of a given type have been
made, where yj = 0 means "no investments of type j" and yj = 1 means "investments of
type j made". For example, y1 = 1 means "investments made in development of interfaces",
and y3 = 0 means "no investments in SW&HW description languages". The following
thresholds of investments necessary to achieve a technological breakthrough (innovation)
are then introduced (see Table 2):
By making an investment equal to or greater than the set threshold, the corresponding
company is able to implement the necessary novel innovation in its production. Based

Table 2. Investment thresholds.

Type of investments                                              Investment threshold
y1 = 1, y2 = 1, y5 = 1 (attraction and use of talented people)   P1
y3 = 1                                                           P2
y4 = 1                                                           P3

on the above notation we can express the reaction delay di of the i-th company to a given
innovation event. Let, for example, the investment threshold P3 = 40 units, and let the
i-th company invest 30 units per year in this field (Processing power) at the time-moment
te − 1. Assuming that the investments remain constant after the innovation event,
this company will be able to follow the innovation in the corresponding field after
P3/(k4^i Mi) = 40/30 years, i.e., 1 year and 4 months. Hence the delay of the i-th company
in this field will be 4 months after the innovation event. The maximal reaction
delay of the i-th company to an innovation event can be expressed in the form:
delay of i-th company to an innovation event can be expressed in the form:
P1 P2 P3
di = max { − 1; − 1; − 1} (1)
(k1 + k2 + k5)Mi k3Mi k4Mi
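The reaction delay in Eq. (1) can be computed directly from a company's investment proportions. A minimal Python sketch (the function name and argument layout are our own, not from the paper):

```python
def reaction_delay(k, M, P1, P2, P3):
    """Eq. (1): maximal reaction delay of one company.

    k -- proportions k1..k5 of the company's total investment
    M -- the company's total yearly investment
    P1, P2, P3 -- investment thresholds for the innovation fields
    """
    return max(P1 / ((k[0] + k[1] + k[4]) * M) - 1,
               P2 / (k[2] * M) - 1,
               P3 / (k[3] * M) - 1)
```

With the worked example above (P3 = 40 units and 30 units invested per year in processing power), the dominant term is 40/30 − 1 = 1/3 year, i.e., the 4-month delay.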
A possible optimization criterion based on the reaction delays is:

min { max { d1, d2, …, di, …, dN } }    (2)

A better criterion considers only the reaction delays greater than one year.
The profit Ri of the i-th company for a period of 1 year can be expressed as:

Ri = Σ_{j=1..5} rj·kj^i·Mi − Mi    (3)
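Equation (3) is straightforward to evaluate; a small sketch using the return coefficients from Table 1 (function and constant names are hypothetical):

```python
RETURN_COEFF = [1.35, 1.13, 1.15, 1.25, 1.12]  # r1..r5 from Table 1

def yearly_profit(k, M, r=RETURN_COEFF):
    # Eq. (3): expected 1-year profit of one company with
    # investment proportions k and total investment M
    return sum(rj * kj * M for rj, kj in zip(r, k)) - M
```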

Regarding the obligatory constraints, it should be mentioned that the sum of the
proportions of the total investment equals 1. Other constraints are connected with the
investment types corresponding to P1, because they involve the attraction and use of
talented people in the company. It can be expected that the available talented people
are not enough to cover the needs of all companies, and that not every company
has connections with academic circles in the mentioned areas. From Table 1 it follows
that the investments with r1, r2 and r5 (Development of Interfaces, Mathematical
Methods and Solvers, and Development of Models and Search for Application Area) bring
60% of the total profit. Hence it can be concluded that companies investing a smaller
share in these fields have wrong investment plans and do not attract and use enough
"talented" people. This circumstance can be expressed by a constraint on the sum
(k1^i + k2^i + k5^i) for a concrete company i. Competitive interactions may impact
the resource ("talented" people). For simplicity it is assumed that, once hired,
talented people are loyal to their employer and do not change the company in which they
work.
Based on these considerations the following two mathematical models are proposed
to improve the viability of EDA companies:
674 G. Marinova et al.

MODEL I:

max Σ_{i=1..N} ( Σ_{j=1..5} rj·kj^i·Mi − Mi ) = −min [ −Σ_{i=1..N} ( Σ_{j=1..5} rj·kj^i·Mi − Mi ) ]    (4)

subject to:

Σ_{j=1..5} kj^i = 1;  i = 1, 2, …, N;    (5)

(and, for some specific companies il*, where each il* ∈ {1, 2, …, N}):

k1^{i1*} + k2^{i1*} + k5^{i1*} ≤ a1;    (6)

k1^{i2*} + k2^{i2*} + k5^{i2*} ≤ a2;    (7)

k1^{i3*} + k2^{i3*} + k5^{i3*} ≤ a3;    (8)

k1^{i4*} + k2^{i4*} + k5^{i4*} ≤ a4;    (9)

lower and upper bounds:

kj^i ∈ [0.01, 1],  j = 1, …, 5;  i = 1, 2, …, N;    (10)

where a1, a2, a3 and a4 are specific constants reflecting the wrong investment policy
of the companies with indices il*.
MODEL II:

min Σ di²,  where di > 1 and i ∈ {1, 2, …, N}    (11)

subject to constraints (5)–(10).
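Criterion (11) sums the squared delays only over companies whose delay exceeds the one-year tolerance; a minimal sketch (our own helper, not code from the paper):

```python
def model2_criterion(delays):
    # Eq. (11): sum of squared reaction delays greater than one year
    return sum(d * d for d in delays if d > 1)
```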


To test the performance of Model I and Model II, a simulated illustrative example is
generated, based on public data.

4 An Illustrative Example

This illustrative example includes 20 companies, i.e., N = 20. The total investments
of the considered companies and the investment thresholds for generating an innovation
are given in advance. The variables in the optimization process are the proportions of
total investment for each company. Five types of investments are considered; hence,
for 20 companies the example includes 100 variables.
The example is a simulated one, because the investment units are not real but virtual
units proportional to the data of real companies from the database published at
https://sec.report/

Table 3. Total investments of 20 EDA companies

Company №   Investment Mi     Company №   Investment Mi

1           130               11          180
2           70                12          100
3           420               13          40
4           50                14          210
5           145               15          80
6           310               16          140
7           90                17          190
8           170               18          360
9           205               19          65
10          60                20          120

This is an American database that stores financial data for publicly listed companies
(stock markets).
The total investments of all 20 EDA companies for a period of 1 year are presented
in Table 3.
It is assumed that the values of investment thresholds necessary to achieve an
innovation event are:
P1 = 120; P2 = 30; P3 = 50;
Regarding the constraints (6)–(9), it is assumed that the indices il* are {6, 11, 16, 20}.
The corresponding constraints look as follows:
k1^6 + k2^6 + k5^6 ≤ 0.30;    (12)

k1^11 + k2^11 + k5^11 ≤ 0.36;    (13)

k1^16 + k2^16 + k5^16 ≤ 0.42;    (14)

k1^20 + k2^20 + k5^20 ≤ 0.48;    (15)

5 Optimization Results
The above illustrative example is solved by means of the MATLAB solvers fmincon and
patternsearch.
The following tasks are formulated and solved:

1. Task 1: criterion (4) subject to constraints (5)–(10) corresponding to Model I.


2. Task 2: criterion (11) subject to constraints (5)–(10) corresponding to Model II.
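The paper solves these tasks with MATLAB's fmincon and patternsearch solvers. As an illustration only, the structure of Task 1 can be reproduced with SciPy's SLSQP method (a rough analogue of fmincon); the data come from Table 3 and constraints (5), (10), (12)–(15), while the starting point below is our own feasible guess, not the initial point used in the paper:

```python
import numpy as np
from scipy.optimize import minimize

N = 20
M = np.array([130, 70, 420, 50, 145, 310, 90, 170, 205, 60,
              180, 100, 40, 210, 80, 140, 190, 360, 65, 120], float)
r = np.array([1.35, 1.13, 1.15, 1.25, 1.12])

def neg_total_profit(k_flat):
    k = k_flat.reshape(N, 5)          # k[i, j]: proportion of type j+1 at company i+1
    return -np.sum((k @ r) * M - M)   # criterion (4): maximize profit = minimize its negative

# constraint (5): each company's proportions sum to 1
cons = [{"type": "eq", "fun": lambda x, i=i: x.reshape(N, 5)[i].sum() - 1.0}
        for i in range(N)]
# constraints (12)-(15): caps on k1 + k2 + k5 for companies 6, 11, 16, 20
for i, cap in zip([5, 10, 15, 19], [0.30, 0.36, 0.42, 0.48]):
    cons.append({"type": "ineq",
                 "fun": lambda x, i=i, cap=cap: cap - x.reshape(N, 5)[i, [0, 1, 4]].sum()})

# a feasible starting point (satisfies (5), the bounds (10), and all caps)
x0 = np.tile([0.05, 0.05, 0.4, 0.4, 0.1], N)
res = minimize(neg_total_profit, x0, bounds=[(0.01, 1.0)] * (N * 5),
               constraints=cons, method="SLSQP")
```

Because the virtual investment units and the starting point differ from the paper's, the resulting profit values will not match F0 and F* below; the sketch only illustrates the structure of Task 1.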

5.1 Optimization Results for Task 1

Starting the fmincon solver, the result shown in Fig. 2 was obtained after eight iterations.
The initial profit value is F0 = 145.0301826571.
The optimal profit for all companies is F* = 155.7600020019.
The obtained improvement is 6.897%.

Fig. 2. The profit maximization by means of fmincon solver



For comparison, starting from the same initial point, the patternsearch solver found the
best solution after four iterations, with profit value F2* = 154.290515625. The obtained
improvement is slightly smaller: 6.002%. The corresponding result is shown in Fig. 3.

Fig. 3. The profit maximization by means of patternsearch solver

The optimal solution with profit value F* corresponds to different reaction delays of
the companies. Only company 3 and company 18 have a negative delay, with d3 ≈
−0.4 and d18 ≈ −0.3; this means that both companies are able to (and will) realize
innovations in the fields corresponding to investment threshold P1. The constraints (12)–
(15) lead to the wrong investment plan for companies №№ 6, 11, 16, and 20. In this
case only company № 6, with the greatest total investment, survives, while companies
№№ 11, 16, and 20 obtain reaction delays di > 1 and will be merged with other companies
or will disappear.

5.2 Optimization Results for Task 2


Starting the fmincon solver, the result shown in Fig. 4 was obtained after 95 iterations.
Compared to the simpler model used in Task 1, the number of iterations and the solution
time are greater, but this is not essential, because the consumed time is relatively
small. The important result here is the quality of the obtained solution.
The initial value of criterion (11) is D0 = 200.317862.
The optimal value is D* = 45.808838712163507.
The obtained improvement is 77.132%. Obviously, this improvement is greater than
the improvement achieved by means of Model I in Task 1.
Fig. 4. The reaction delay minimization by means of the fmincon solver

For comparison, starting from the same initial point, the patternsearch solver found the
best solution after six iterations, with value D2* = 44.207690063484968; for this
solution the obtained improvement is greater: 77.931%. The corresponding result is shown
in Fig. 5.
The optimal solution with value D2* corresponds to different reaction delays
of the companies. Again, company 3 and company 18 have a negative delay, with d3 ≈
−0.29 and d18 ≈ −0.17; both companies are able to realize innovations in the fields
corresponding to investment threshold P1. The constraints (12)–(15) lead to the wrong
investment plan for companies №№ 6, 11, 16, and 20. In this case only companies
№ 6 and № 11, with the greatest total investments, survive, while companies № 16 and № 20
obtain reaction delays di > 1 and will be merged with other companies or will disappear.
In general, the obtained reaction delays of the EDA companies are smaller compared to the
optimal solution obtained for Task 1.
The obtained results show that applying Model II achieves a better improvement of EDA
companies' viability than Model I.

Fig. 5. The reaction delay minimization by means of patternsearch solver

6 Conclusions
Based on data analytics, key factors determining the viability of EDA companies are
selected. These factors are used to formulate two single-objective optimization models
aimed at improving the EDA companies' viability by maximizing the total profit
of all considered EDA companies or by minimizing the sum of the squared delays of all
considered EDA companies after an innovation event. The criterion in the first model is
linear and maximizes the profit of the companies, which contributes to their longer
viability. The criterion in the second model is nonlinear and minimizes the reaction
delays of the EDA companies when they are greater than the tolerable delay Dmax with a
length of 1 year. The results show that the second model leads to better solutions and is
more effective than the first model. Nevertheless, both models have shown good
performance during the simulation tests. The obtained results are encouraging, and it can
be concluded that these models can be successfully used to solve tasks with real data in
the mentioned area.
Further investigations of the proposed models should be performed on real data in the
same area. The generated solutions could contribute to improving the real investment
plans of EDA companies. It could also be concluded which production parameters a
given company has neglected and must improve in order to fall into the survivors' group.
A direct approach for improving the viability of companies is rarely used, and very
few optimization methods have been developed in this area. This is an open field for
future research.

Acknowledgment. This work is supported by the Bulgarian National Science Fund under the
project "Mathematical models, methods and algorithms for solving hard optimization
problems to achieve high security in communications and better economic sustainability",
Grant No. KP-06-N52/7.

References
1. Altman, E.I.: Financial ratios, discriminant analysis and the prediction of corporate
bankruptcy. J. Financ. 23(4), 589–609 (1968)
2. Beer, S.: Cybernetics and Management. English Universities Press, London (1959)
3. Beer, S.: Towards the cybernetic factory. In: Principles of Self-Organization (symposium).
Pergamon Press, Oxford (1960)
4. Beer, S.: Brain of the Firm: A Development in Management Cybernetics. Herder and Herder,
New York (1972)
5. Beer, S.: The viable system model: its provenance, development, methodology and pathology.
J. Oper. Res. Soc. 35(1), 7–25 (1984)
6. Beer, S.: The Heart of Enterprise. Wiley, Chichester (1979)
7. Beer, S.: Diagnosing the System for Organizations. Wiley, Chichester (1985)
8. Blum, P., Dacorogna, M.: Dynamic Financial Analysis - Understanding Risk and Value
Creation in Insurance (2003). https://www.researchgate.net/publication/23749485_Dynamic_Financial_Analysis_-_Understanding_Risk_and_Value_Creation_in_Insurance.
Accessed 10 Mar 2022
9. D'Arcy, S.P., Gorvett, R.W., Hettinger, T.E., Walling III, R.J.: Using the Public Access DFA
Model (1998). http://www.casact.org/pubs/forum/98sforum/
10. Espejo, R., Reyes, A.: Organizational Systems: Managing Complexity with the Viable System
Model. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-19109-1
11. Gaul, V.: Electronic Design Automation Software Market, 267 pages, December 2020.
https://www.alliedmarketresearch.com/electronic-design-automation-software-market
12. Kaufmann, R., Gadmer, A., Klett, R.: Introduction to Dynamic Financial Analysis (2004).
http://www.casact.org/library/astin/vol31no1
13. Marinova, G.I., Bitri, A.: Assessment and forecast of EDA company viability in case of
disruptive technological events. In: MATHMOD 2022 Discussion Contribution Volume, 10th
Vienna International Conference on Mathematical Modelling, Vienna, Austria, 27–29 July
2022, pp. 33–34 (2022). ARGESIM Report 17 (ISBN 978-3-901608-95-7).
https://doi.org/10.11128/arep.17.a17084
14. Marinova, G.I., Bitri, A.: Review on formalization of business model evaluation for
technological companies with focus on the electronic design automation industry. In: IFAC
Conference TECIS 2021, Moscow, Russia, pp. 630–634, September 2021
15. Mulder, P.: Viable System Model (VSM) (2018). ToolsHero.
https://www.toolshero.com/management/viable-system-model/. Accessed 05 Feb 2022
16. Rubin, P.A.: Solving mixed integer classification problems by decomposition. Ann. Oper.
Res. 74, 51–64 (1997)
17. Soltysik, R.C., Yarnold, P.R.: The Warmack-Gonzalez algorithm for linear two-group multi-
variable optimal discriminant analysis. Comput. Oper. Res. 21, 735–745 (1994)
18. Warmack, R.E., Gonzalez, R.C.: An algorithm for the optimal solution of linear inequalities
and its application to pattern recognition. IEEE Trans. Comput. C-22, 1065–1075 (1973)
19. Wiesner, E.R., Emma, C.C.: A Dynamic Financial Analysis Application Linked to Corporate
Strategy (2000). http://www.casact.org/pubs/forum/00sforum/
Using Genetic Algorithm to Create
an Ensemble Machine Learning Models
to Predict Tennis

Arisoa S. Randrianasolo1(B) and Larry D. Pyeatt2


1 Abilene Christian University, Abilene, TX, USA
[email protected]
2 South Dakota School of Mines and Technology, Rapid City, SD, USA
[email protected]

Abstract. In this paper, we illustrate our study of using genetic algorithms
and machine learning to create an ensemble technique, which is
used to predict tennis games using limited amounts of data. The genetic
algorithm was used to improve the game representations, derived from
the differences of the players' statistics, to be utilized by the machine
learning algorithms. The use of genetic algorithms also reduced the dependence
on human expertise in creating the game representations. The majority
of the ensemble models we generated performed as well as or better than
predictions based solely on the players' official rankings.

Keywords: Sports predictions · Genetic algorithm · Machine
learning · Ensemble technique

1 Introduction
The prevalence of data and statistics about past games and players has helped
researchers create predictive models for head-to-head games. From 2009
onward, there has been a growing level of interest in applying machine learning to
sport [5]. Bunker and Susnjak showed that the application of machine learning
to tennis is far less popular than its application to soccer and basketball.
They also noticed that ensemble techniques are not as frequently used to predict
sport as stand-alone models, such as artificial neural networks and
decision trees.
In this paper, we report a study on using data about tennis players to create
an ensemble technique for predicting the outcome of tennis games.
Our study consisted of two major parts. First, utilizing a genetic algorithm, we
focused on finding good data representations to improve the accuracy of machine
learning algorithms. Second, we utilized the good data representations from the
first part of the study to create an ensemble technique used to predict
the outcome of future tennis games. We tested our ensemble technique on the
women's singles at the 2020 Australian Open, the 2021 French Open, and the
2021 US Open. The predictions of the ensemble models were compared to
predictions based solely on the player rankings.
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 681–695, 2023.
https://doi.org/10.1007/978-3-031-18461-1_45

1.1 Motivation

This paper is another attempt to create a general approach to predict head-to-


head games [4]. We are working on producing a general approach that can be
utilized to predict games in well known tournaments as well as in smaller tour-
naments. Smaller tournaments lack historical data. However, there are usually
some forms of data available that can be exploited to predict game outcomes.
Given this motivation to accurately generate predictions, using small data sets,
we constrained ourselves to use the 2019 data to predict the outcome of the
women’s singles competition at the 2020 Australian open. We limited ourselves
to employ the data from January 2021 to May 2021 to predict the outcome of
the women’s singles competition at the 2021 French open. Similarly, we limited
ourselves to only utilize the data from January 2021 to August 2021 to predict
the outcome of the women’s singles competition at the 2021 US open.
In our previous work [1], explained in the early observation below, we found
that using the differences between the players' statistics, represented with the
values −1, 0, and 1, appeared to work well for predicting tennis with machine
learning. Some of the differences were treated as not significant, which resulted
in the value 0 in the representations. The notion of significance was based on
tolerance values derived either from intuition or from our own knowledge. Such a
process required considerable human or expert involvement in crafting the data
representation. We wished to reduce such involvement, so we opted to utilize a
genetic algorithm to find the tolerance values, reducing the human involvement in
finding good game representations.
This paper is organized as follows. We start by looking at other work in
sports and tennis prediction. Then, we explain an early observation in tennis
prediction from one of our previous works, and we introduce the proposed
ensemble approach. The last part of the paper covers the testing, results,
conclusion, and future work.

2 Related Work

Machine learning is used widely in sport predictions. In most cases, a considerable
amount of historical data is needed to train the model. For example, Mei-Ling
and Yun-Zhi employed a one-dimensional convolutional neural network,
an artificial neural network, and a support vector machine to predict Major
League Baseball games [2]. Shuaib and Kirubanand compared a support vector machine
and extreme gradient boosting, an ensemble learner, for predicting football games
[6]. Pretorius and Parry used a random forest to predict the 2015 Rugby World
Cup [9]. Brooks, Kerr, and Guttag used an SVM to predict the possibility of
shots in soccer [10].
For predictions related to tennis, Newton and Keller predicted the winner
of a tennis tournament by using a player's probability of winning a point on
serve. From this probability, each player's probability of winning a game, a set, a
match, and a tournament was calculated. This approach identified the winners
of the 2002 US Open and Wimbledon tournaments [14].
Gu and Saaty combined data and judgements to predict the winners of the
tennis matches at the 2015 US Open [7]. The prediction accuracy was 85.1%.
They utilized data on tennis matches from 1990 to 2015 for the men and from
2003 to 2015 for the women. The predictors used by the approach can be
categorized as basic tournament information, player description,
and performance metrics. In addition to these data, the authors added other
human judgments such as tactics, state and performance, psychology, brainpower,
and experience. In total, they ended up with 7 clusters containing 24
predictors. The predictors associated with the two players set to play
a game were fed to an Analytic Network Process to predict the winner.
Knottenbelt, Spanias, and Madurska developed a model to predict tennis
games using a transitive property [11]. The idea behind this transitive approach
was that if player a is better than player c, and c is better than b, then it can
be inferred that player a is better than b. To predict the outcome of a game
between two players, a and b, the approach looked at historical data to
find previous common opponents of a and b. Given a common opponent, the
proportions of serve points won and of return points won by a and b were
calculated. The calculated values were used to compute the measure of
advantage or disadvantage of a over b given the same common opponent. This
value, in its turn, was used in an O'Malley equation to produce the probability
of a beating b via the same opponent. The final step consisted of averaging the
probability of a beating b over all the possible common opponents, usually limited
to 50, from the historical data. The approach's best performance was a 77.53%
accuracy in predicting the 2011 US Open, with a 9.01% return on investment.
Barnett and Clarke developed a model to show that it was possible to predict
the outcome of a tennis match before the game and while the game was
progressing. This approach was tested on predicting the longest men's match,
between Roddick and El Aynaoui at the 2003 Australian Open [15]. The model
used the players' winning percentages on both serving and receiving. For the
two players in the game to be predicted, the following statistics were recorded:
the percentage of first serves in, the percentage of points won on first
serve, the percentage of points won on second serve, the percentage of points
won on return of first serve, and the percentage of points won on return of second
serve. For the longest game that this approach was tested on, the statistics
came from the average of 70 games. These statistics were used to calculate the
players' winning percentages on serving and receiving, which in turn were used in
formulas for the combined percentage of points won on serve for player i against
j and the combined percentage of points won on return for player j against i.
Finally, these combined percentages were used in a Markov chain model to perform
the prediction. The prediction produced each player's chance of winning and the
likely length of the game.
McHale and Morton utilized a Bradley-Terry model to predict the outcomes of
tennis games. The probability for player i to win a game is calculated by
dividing i's ability by the combined ability of i and j (the opponent) [12]. This
basic Bradley-Terry probability was updated by the authors to include decayed
weights of previous games. The approach also gave more weight to previous
games played on the same surface as the game to be predicted. The players'
abilities were rankings derived from historical data from 2000 to 2008 and
consisted of the number of wins, the number of wins combined with date, match
score, match score combined with date, or match score combined with date and
playing surface. Each prediction made by the model required the previous three
years of data. This approach did better than the prediction based on the players'
official rankings for games from 2001 to 2008. When utilizing match score and
date, this approach was 66.0% accurate. When utilizing match score, date, and
the type of surface, it was 66.90% accurate.
Klaassen and Magnus created a model capable of performing predictions
before the match and during the match. The model produced the probability
for a player to win the match against a given opponent. This probability
was calculated using the probabilities of winning a point on service for the
two opposing players. The before-match probability was extracted from Wimbledon
singles matches from 1992 to 1995 using a logit regression model. This
starting probability was later updated as the match progressed. This approach
was applied to the Sampras-Becker (1995) and Graf-Novotna (1993) Wimbledon
finals [16].
Candila and Palazzo used 26880 male tennis matches from 2005 to 2018 to
train and validate an artificial neural network (ANN) [3]. They used 32 predictors,
consisting of players' statistics, players' biometrics, and synthetically generated
data such as fatigue and betting odds from bookmakers. The goal of the ANN was
to produce the probability for a player i to win a match j. This approach was
validated on predicting matches from 2013 to 2018, and it outperformed the logit
regression [16], the probit regression [13], the Bradley-Terry type [12], and the
point-based approach [15] in terms of return on investment.
Wilkens utilized the men's and women's tennis matches from 2010 to 2019 to show
that an ensemble technique is the best approach to use when betting on tennis
games [8]. This approach used the differences of players' statistics, locations,
tournament information, and four different odds from bookmakers as predictors.
The ensemble consisted of logistic regression, a neural network, a
random forest, gradient boosting, and a support vector machine. The accuracy of
each model, when not in an ensemble, was about 70%. When the ensemble was
used on the betting market, returns of 10% and more were observed.
The machine learning techniques in the surveyed literature either used very
extensive historical data or used complex data engineering that required the
involvement of human experts. In our proposed approach, we limit the amount
of historical data used to at most one year back. We also reduce the human
involvement by using a genetic algorithm to perform the data engineering for us.

3 Early Observation
In our previous work [1], we used statistics from 2019, available from
wtatennis.com, to predict the women's singles at the 2020 Australian Open. In all,
15 predictors were initially collected: the ranking, aces per game, number of
matches, double faults per game, first serve percentage, second serve points
percentage, first serve points percentage, serve points won percentage, service
games won percentage, break points percentage, first return points percentage,
second return points percentage, break points converted percentage, return games
won percentage, and return points won percentage.

3.1 The Game Representation


A game was represented by the differences of the corresponding predictors from
the two players playing in the game. The difference between the two players’
corresponding statistics was converted into −1, 0, or 1. As follow:


⎨0 if | rankp1 − rankp2 |< 20
difference in rank = −1 if rankp1 − rankp2 ≤ −20


1 otherwise.


⎨0 if | dfp1 − dfp2 |< 1
difference in double faults = −1 if dfp1 − dfp2 ≤ −1


1 otherwise.
For all other predictors:


⎨0 if | statip1 − statip2 |< 5
difference in stati = −1 if statip1 − statip2 ≤ −5


1 otherwise.
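The ternary encoding above can be sketched as a small helper; the function name and the ordering of the statistics are our own, not from the paper:

```python
def encode_game(stats_p1, stats_p2, tolerances):
    # One value in {-1, 0, 1} per predictor, using a per-predictor tolerance
    # (20 for rank, 1 for double faults, 5 for the remaining statistics)
    diffs = []
    for a, b, tol in zip(stats_p1, stats_p2, tolerances):
        d = a - b
        if abs(d) < tol:
            diffs.append(0)       # difference treated as not significant
        elif d <= -tol:
            diffs.append(-1)
        else:
            diffs.append(1)
    return diffs
```

For example, a rank difference of −25 with tolerance 20 encodes to −1, while a double-fault difference of 0.5 with tolerance 1 encodes to 0.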

3.2 Early Testing and Result


We generated models by training machine learning algorithms on the first
round of the 2020 Australian Open. The models predicted the second round up
to the final. The first round consisted of 64 games; games involving players with
missing statistics were omitted. Each game in the training set was presented
twice, in two different forms: for a game between players p1 and p2, assuming the
winner was p1, two representations were created, <p1, p2> with outcome 1
and <p2, p1> with outcome 0. The training set contained 122 game representations.
The training data and the games to be predicted were standardized using
the z-score normalization, z = (x − μ)/σ,
where μ is the mean and σ is the standard deviation. For the various tuning
parameters involved with each machine learning algorithm, the training set was
presented to the algorithm 100 times using an 80%–20% random split. The best
parameters, measured by average accuracy, were used to create the final models
that predicted the games after the first round. The accuracy of the predictions
is captured in Fig. 1. We used the predictions based on the players' rank, not
involving machine learning, as the benchmark against which each model was compared.
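The z-score standardization used above can be sketched per predictor column as follows; the paper does not say whether the sample or the population standard deviation was used, so the population variant below is an assumption:

```python
import statistics

def z_normalize(column):
    # z = (x - mu) / sigma for every value in one predictor column
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)  # population std dev (assumption)
    return [(x - mu) / sigma for x in column]
```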
The final predictors used for each model were the ranking, number of
matches, aces per game, double faults per game, and the percentages of
first serves, first serve points, serve points won, break points, service games won,
first return points, second return points, return games won, and return points
won.

Fig. 1. Ternary representation



4 The Ensemble Approach


In our early observation, we used the tolerance values of 20, 1, and 5 to zero
out the differences between the players' statistics. These tolerance values were
based either on intuition or on our basic understanding of the tennis game. We
wished to reduce the human involvement in choosing the tolerance values. In the
proposed ensemble approach, we used a genetic algorithm to derive the tolerance
values for each predictor. The genetic algorithm searched for the best tolerance
values in the range 1–20 for each predictor and produced a tolerance vector.
Each value in the vector corresponds to the tolerance to be used on the
corresponding predictor. This vector was then used to create game representations,
which, in turn, became the inputs to the machine learning algorithms. The
difference between the two players' corresponding statistics was converted into
−1, 0, or 1, in a general way, as follows:

difference in stat_i = 0 if |stat_i,A − stat_i,B| < tol_i;  −1 if stat_i,A − stat_i,B ≤ −tol_i;  1 otherwise.

Since the genetic algorithm is not guaranteed to output exactly the same
vector in every run, due to the randomness involved in the search, we repeated
the search process 101 times. We ended up with 101 tolerance vectors that were
used to create 101 models for each machine learning algorithm we considered.
These 101 models formed our ensemble. Each model in the ensemble was used to
predict a game, and a majority rule between the models was used to consolidate
the prediction of the ensemble. Figure 2 summarizes this approach.
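The majority rule over the 101 models can be sketched as follows (each model is assumed to be a callable returning a win/loss label; the names are ours):

```python
from collections import Counter

def ensemble_predict(models, game_representation):
    # Majority vote: the label predicted by most of the 101 models wins
    votes = Counter(model(game_representation) for model in models)
    return votes.most_common(1)[0][0]
```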

4.1 Short Introduction to Genetic Algorithm

A genetic algorithm, as its name implies, is a search algorithm that mimics the
process of genetic information transmission in humans or animals. It is
based on survival of the fittest. In a genetic algorithm, a potential solution,
expressed as a string of characters, is called an individual. The set of individuals
forms the population. Each individual is associated with a fitness value, which
indicates the quality of the solution that the individual represents. As
in biology, individuals in the population are permitted to reproduce to generate
new solutions.

Fig. 2. Approach's flow chart (genetic algorithm evaluates candidate tolerance vectors via crossover, mutation, and machine-learning fitness on games with known outcomes; the 101 best tolerance vectors yield 101 sets of game representations, which train the ensemble of 101 models that produces the predictions)

The reproduction part of the algorithm is called a crossover. During a
crossover, two individuals exchange characters to form a new solution. The fittest
individuals have better probabilities of being selected to participate in crossovers.
The eventual exchange of characters is governed by a crossover probability, which
determines whether the exchange is allowed to happen or not.

As in biology, individuals in the population are also permitted to mutate.
Mutation is governed by the mutation probability and is usually achieved by
altering one or more characters of the string that represents an individual.
A genetic algorithm is an iterative process: in each iteration, crossovers are
performed to generate new individuals representing new solutions.

4.2 Our Genetic Algorithm Setup

The individuals in the population were randomly generated candidate tolerance vectors. The population size was fixed at 100 for our experiments, and the probability of crossover was set to 60%. A roulette wheel selection approach was
used to select the parents for the crossover. The crossover was performed at a
fixed point which was always at the middle of the candidate tolerance vectors.
The probability of mutation was 0.1%. The mutation was performed by either adding 1 or subtracting 1, each with a probability of 50%, to each value of a candidate tolerance vector. Each candidate tolerance vector was used to generate the training dataset using the games from the tournament's first round, for which the observed outcomes are available. The training
dataset was presented to a machine learning algorithm 100 times with a random
80%-20% split each time. The fitness of each candidate tolerance vector was calculated as the average of the 100 accuracies on the first-round games. The
genetic algorithm was allowed to generate 500 new individuals before it stopped.
Then, survival of the fittest was used to place a new individual in the population.
Our genetic algorithm approach was modeled after the approach described by
Goldberg [17]. The best candidate tolerance vector obtained when the algorithm
stopped was saved into the list of vectors to be used to generate the models in
the ensemble.
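The setup above can be sketched in Python as follows. This is an illustrative reading of the text, not the authors' code: the worst-replacement rule and the clamping of mutated values to the 1–20 range are our assumptions, and `fitness` stands in for the averaged machine learning accuracy over 100 random 80%-20% splits of the first-round games.

```python
import random

VECTOR_LEN = 11      # one tolerance value per predictor (see Sect. 5)
LOW, HIGH = 1, 20    # tolerance values range from 1 to 20
POP_SIZE = 100
P_CROSSOVER = 0.60
P_MUTATION = 0.001   # 0.1%

def roulette_select(population, fitnesses):
    # Fitter individuals get a proportionally larger slice of the wheel;
    # assumes non-negative fitness values (accuracies are).
    return random.choices(population, weights=fitnesses, k=2)

def crossover(parent_a, parent_b):
    # Fixed single crossover point, always at the middle of the vector.
    mid = VECTOR_LEN // 2
    return parent_a[:mid] + parent_b[mid:]

def mutate(vector):
    # Each value is nudged by +1 or -1 (each 50%) with probability P_MUTATION;
    # clamping back into [LOW, HIGH] is our assumption, not stated in the paper.
    out = []
    for v in vector:
        if random.random() < P_MUTATION:
            v += random.choice((-1, 1))
        out.append(min(HIGH, max(LOW, v)))
    return out

def run_ga(fitness, n_new=500):
    population = [[random.randint(LOW, HIGH) for _ in range(VECTOR_LEN)]
                  for _ in range(POP_SIZE)]
    for _ in range(n_new):  # stop after generating 500 new individuals
        fits = [fitness(ind) for ind in population]
        parent_a, parent_b = roulette_select(population, fits)
        child = crossover(parent_a, parent_b) if random.random() < P_CROSSOVER else parent_a[:]
        child = mutate(child)
        # Survival of the fittest places the new individual in the population:
        # here it displaces the current worst if it is better (our reading).
        worst = min(range(POP_SIZE), key=fits.__getitem__)
        if fitness(child) > fits[worst]:
            population[worst] = child
    fits = [fitness(ind) for ind in population]
    return max(zip(fits, population))[1]  # best tolerance vector found
```

With `fitness` set to the cross-validated accuracy of, say, an SVM trained on the first-round games, the returned vector is the one saved into the list used to generate the ensemble's models.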

4.3 The Machine Learning Algorithms Chosen

We picked the machine learning algorithms, listed below, that performed well in Fig. 1. In each algorithm, x represents a multi-dimensional vector that describes a game, y is the desired output (win or loss), and w represents the weights that the algorithm is searching for. The chosen algorithms are:

– Lasso regression: The objective function to be minimized is

\sum_{i=1}^{n} \Big( y_i - \sum_{j} x_{ij} w_j \Big)^2 + \alpha \sum_{j=1}^{P} |w_j|.

– Random Forest: Trees are created by utilizing a random subset of the top k
predictors at each split in the tree. A tree is a multistage decision system,
in which classes are sequentially rejected, until a finally accepted class is
reached. Each of the m trees in the ensemble is then used to generate a prediction for a new sample. The majority among these m predictions is the forest's prediction.
– Support Vector Machine: Provided two classes that are linearly separable, find the hyperplane

g(x) = w^T x + w_0 = 0

that leaves the maximum margin from both classes. In the case of classes that are not linearly separable, a kernel function K can be utilized, and the decision function changes to

g(x) = \sum_{i=1}^{l} y_i \alpha_i K(x, x_i).

The kernel function we used was:
• Linear: K(x, u) = x^T u
– Feedforward neural network with one hidden layer: neural networks are nonlinear regression and classification methods inspired by how the brain works. Multiple units mimicking neurons are connected to perform the regression or the classification. A unit contains weights to be multiplied by the
inputs and an activation function that determines the unit’s output given the
weighted sum of the inputs. The network is composed of the input layer, the
hidden layers, and the output layer. For this research, the number of units in
the input layer was equal to the size of the vector x. Figure 1 lists the best
performing number of units in the hidden layer out of the range 2–200. The
output layer consisted of one unit with a logistic function. The activation
functions we explored were:
• rectified linear:
f (x) = max(0, x),
• identity:
f (x) = x.
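For concreteness, the voting mechanics of the ensemble can be sketched as below. This is a hypothetical illustration, not the paper's code: each model is trained on its own tolerance-vector-derived representation of the same games, and the ensemble's prediction is the majority vote; the linear-kernel SVM mirrors one of the algorithms above, while the function and variable names are ours.

```python
import numpy as np
from sklearn.svm import SVC

def train_ensemble(datasets, labels_list):
    """Train one linear-kernel SVM per game representation
    (the paper uses 101 such models; any number works here)."""
    return [SVC(kernel="linear").fit(X, y) for X, y in zip(datasets, labels_list)]

def ensemble_predict(models, games_list):
    """games_list[m] holds model m's representation of the SAME games;
    the ensemble predicts each game by majority vote over the models."""
    votes = np.stack([m.predict(X) for m, X in zip(models, games_list)])
    return (votes.mean(axis=0) > 0.5).astype(int)  # 1 = predicted win
```

With 101 models the binary vote can never tie, which is presumably why an odd ensemble size was chosen.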

5 Testing and Results

We revisited the 2020 women's Australian Open by using the players' statistics from 2019 to create the game representations. We also predicted the 2021 women's French Open (Roland Garros) and the 2021 women's US Open. For the French Open, players' statistics from January 2021 to May 2021 were used to create

the game representation. For the US Open, players' statistics from January 2021
to August 2021 were used to create the game representation. In all of these
predictions, the genetic algorithm used the games from the first round of the
tournament to calculate the fitness of each candidate tolerance vector. Models
in the ensemble were also created using the games from the first round of the
tournament. The models were then used to predict the games from the second
round up to the final. We kept the parameter setting for the machine learning
algorithms used in the ensemble the same as what is shown in Fig. 1.
The results of our testing are summarized in Figs. 3, 4 and 5. As in our earlier observations, we compared the accuracy of the ensemble's predictions to the accuracy of predictions based only on the
players’ rank. In these figures, each mentioned machine learning algorithm name
refers to 101 models created utilizing that same algorithm with the parame-
ter(s) being listed. The models were generated using the same machine learning

Fig. 3. Australian open prediction



algorithms with the same parameters; however, the game representations from which the machine learning algorithm generated the models were not necessarily the same. Each game representation for each model depended on the tolerance
vector from the genetic algorithm. Note also that the fitness evaluation in the
genetic algorithm used the same machine learning algorithm, with exactly the
same parameters, that later generated the models.
The predictors used in testing were the player's rank, number of matches,
and the percentages of the first serve, first serve points, second serve points, serve
points won, break points, first return points, second return points, return games
won and break points converted. Each candidate tolerance vector consisted of
11 values, each ranging from 1 to 20.

Fig. 4. French open prediction



Fig. 5. US open prediction

6 Conclusion and Future Work


We started this research with the goal of accurately predicting tennis games using a limited amount of data. The results we obtained showed that such a goal is achievable using the ensemble technique described in this paper.
Figures 3, 4 and 5 show that some of the ensembles we created predict tennis games more accurately than predictions based solely on rankings. The highest
accuracy on the women’s 2020 Australian open was 77%, the highest accuracy
on the women’s 2021 French open was 60% and the highest accuracy on the
women's 2021 US open was 65%. We have also significantly reduced the human involvement in generating the game representations by employing a genetic
algorithm. This is a good signal toward automating the approach and avoiding
the reliance on human expertise.
The predictions for the Australian open benefited from a full year of statistics.
This could explain why the prediction accuracies are higher, since the statistics

used are considered stable at that point and better reflect the players’ ability.
However, more studies need to be conducted to confirm if the Australian open is
easier to predict using the player’s statistics compared to the other tennis grand
slam tournaments. We hoped to see better results for the predictions for the US
open. We assumed that since it is the last grand slam tournament of the year,
the players' statistics should be close to accurate by that time. However, injuries, fatigue, and other non-performance factors that accumulate through the season may render the prediction very hard.
Adding predictors such as fatigue, physical strength and mental strength,
similar to the approach of Gu and Saaty [7], can possibly increase the accuracy
of the predictions. However, it is rather complex to extract these predictors without considerable help from human expertise, and such processes would hinder automation.
There are still other genetic algorithm setups that this paper has not yet
explored. Among those setups are using random crossover points, selection strategies other than the roulette wheel selection we used, and different stopping criteria. In the ensemble setup, using models from different
machine learning algorithms, similar to [8], instead of models that come from
the same algorithm can also be explored.
In the future, we plan to run a comparative analysis on the tolerance vectors
produced for each tennis tournament to see if there is a possibility of general-
ization that can be obtained. At the moment, we suspect that these tolerance
vectors may be tournament-specific and may also vary with time. More testing on various tennis tournaments over multiple years will be needed to conduct such a study. Applying this ensemble approach to other sports is also part of our future work.

References
1. Randrianasolo, A.S., Pyeatt, L.D.: Comparing different data representations and machine learning models to predict tennis. In: Arai, K. (ed.) Advances in Information and Communication. FICC 2022. Lecture Notes in Networks and Systems, vol. 439, pp. 488–500. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-98015-3_34
2. Huang, M.-L., Li, Y.-Z.: Use of machine learning and deep learning to predict the outcomes of major league baseball matches. Appl. Sci. 11(10), 4499 (2021)
3. Candila, V., Palazzo, L.: Neural networks and betting strategies for tennis. Risks 8(3) (2020)
4. Randrianasolo, A.S., Pyeatt, L.D.: Predicting head-to-head games with a similarity metric and genetic algorithm. In: Arai, K., Bhatia, R., Kapoor, S. (eds.) FTC 2018. AISC, vol. 880, pp. 705–720. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-02686-8_53
5. Bunker, R.P., Susnjak, T.: The application of machine learning techniques for predicting results in team sport: a review. CoRR abs/1912.11762 (2019)
6. Khan, S., Kirubanand, V.B.: Comparing machine learning and ensemble learning in the field of football. Int. J. Electr. Comput. Eng. (IJECE) 9(5), 4321 (2019)
7. Gu, W., Saaty, T.: Predicting the outcome of a tennis tournament: based on both data and judgments. J. Syst. Sci. Syst. Eng. 28, 317–343 (2019)
8. Wilkens, S.: Sports prediction and betting models in the machine learning age: the case of tennis. SSRN Electr. J. (2019)
9. Pretorius, A., Parry, D.A.: Human decision making and artificial intelligence: a comparison in the domain of sports prediction. In: Proceedings of the Annual Conference of the South African Institute of Computer Scientists and Information Technologists (SAICSIT '16), pp. 32:1–32:10. ACM, New York (2016)
10. Brooks, J., Kerr, M., Guttag, J.: Developing a data-driven player ranking in soccer using predictive model weights. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), pp. 49–55. ACM, New York (2016)
11. Knottenbelt, W.J., Spanias, D., Madurska, A.M.: A common-opponent stochastic model for predicting the outcome of professional tennis matches. Comput. Math. Appl. 64(12), 3820–3827 (2012)
12. McHale, I., Morton, A.: A Bradley-Terry type model for forecasting tennis match results. Int. J. Forecast. 27(2), 619–630 (2011)
13. del Corral, J., Prieto-Rodríguez, J.: Are differences in ranks good predictors for grand slam tennis matches? Int. J. Forecast. 26(3), 551–563 (2010)
14. Newton, P.K., Keller, J.B.: Probability of winning at tennis I. Theory and data. Stud. Appl. Math. 114(3), 241–269 (2005)
15. Barnett, T., Clarke, S.R.: Combining player statistics to predict outcomes of tennis matches. IMA J. Manag. Math. 16(2), 113–120 (2005)
16. Klaassen, K.J., Magnus, J.R.: Forecasting the winner of a tennis match. Europ. J. Operat. Res. 148(2), 257–267 (2003)
17. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning, 1st edn. Addison-Wesley Longman, Boston (1989)
Towards Profitability: A Profit-Sensitive
Multinomial Logistic Regression
for Credit Scoring in Peer-to-Peer
Lending

Yan Wang1(B) , Xuelei Sherry Ni1 , and Xiao Huang2


1 School of Data Science and Analytics, Kennesaw State University, Kennesaw, GA 30144, USA
[email protected], [email protected]
2 Department of Economics, Finance and Quantitative Analysis, Kennesaw State University, Kennesaw, GA 30144, USA
[email protected]

Abstract. This paper proposes a profit-sensitive learning method for loan evaluation in the peer-to-peer (P2P) lending market that could provide better investment suggestions for the lenders. Currently, the most
widely utilized loan evaluation method is credit scoring, which focuses
on evaluating the loans’ defaulting risk and formulates a binary classi-
fication problem. It screens out the non-default loans from the default
ones and thus defines the best loans as those with a low probability of
default (PD). However, the conventional credit scoring totally ignores the
profit information while solely focusing on the risk. To address the above
issue, we propose a profit-sensitive multinomial logistic regression model
that incorporates the profit information into the credit scoring approach.
More specifically, we first transform the binary classification problem in
traditional credit scoring to a multi-level classification task by further
dividing the default loans into two sub-classes: “default and profitable”
and “default and not profitable”. Then we design a multinomial logistic
regression model with a novel loss function to solve the above-defined
multi-level classification task. The loss function weights loans differently
according to their varying profits as well as their occurrence frequencies
in the real-world practices. The effectiveness of the proposed method
is examined on real-world P2P data from Lending Club. Results indicate our approach outperforms the existing credit-scoring-only approach in terms of identifying more profitable loans while ensuring low risk. Therefore, the proposed profit-sensitive learning method serves
as an innovative reference when making investment suggestions in P2P
lending or similar markets.

Keywords: Loan evaluation · Peer-to-peer lending · Profit-sensitive logistic regression

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 696–718, 2023.
https://doi.org/10.1007/978-3-031-18461-1_46
Profit-Sensitive Credit Scoring in P2P Lending 697

1 Introduction
1.1 Background

Peer-to-peer (P2P) lending is an electronic platform where individuals borrow and lend money from each other [1]. Compared to traditional banking finance, P2P lending has the advantages of being more convenient and faster, with lower borrowing costs and potentially higher returns.
lower borrowing costs, and potentially higher returns. However, the opportunity
of getting a higher return is accompanied by a financial risk. Lenders would bear
the full risk of losing part or even all of their principals if borrowers default on
the loans. It is crucial to help lenders evaluate the loans and determine the level
of risk associated with each loan, especially in the P2P market, where lenders
are often individual investors with less professional experience and support.
Several machine learning methods have been effectively utilized to analyze
the big P2P data, especially in distinguishing the good loans (i.e., those expected
to be fully repaid) from the bad ones (i.e., those expected to default before the due time), and thus supporting the investment decisions. In these approaches, loan
evaluation is typically formulated as a binary classification problem (default
vs. non-default) which focuses on developing a binary classifier to predict the
borrower’s probability of default (PD) [17]. Loans with a lower PD are considered
as good loans and vice versa. The above-mentioned approach is conventionally
known as the credit scoring method since it scores the loans at the credit risk
level. The credit scoring approach has been intensively explored by many studies
in the P2P industry [6,9,16].
Although the credit scoring approach has shown promising results in lowering
the financial risk of the investors, it cannot fully address all the objectives of the
lenders in the P2P market as lenders do care about the profit they could generate
from the investment. The idea of loan evaluation from the profit perspective was
first proposed in [13], where the authors evaluated the P2P loans using a well-
known financial formula, the internal rate of return (IRR). IRR denotes the ratio
between the total repayment and the principal. In [18], instead of using IRR, the
annualized rate of return (ARR) was used to evaluate the loan profit. Equation 1
shows the ARR formula, where Total Repayment denotes the total amount of
money the lender receives when the loan is mature, Principal denotes the loan
amount, and t denotes the repayment duration measured in years. It can be
seen that IRR does not take the varying duration of the repayment process into
account when making comparisons across different loans, while ARR considers
the real repayment duration of a loan. No matter what profit measures are used,
loans with a higher predicted profit are considered the higher-quality ones from the profit scoring perspective, and vice versa. Unlike credit scoring, which has been the focus of many studies, research on profit scoring is very limited so far.
ARR = \left( \frac{\text{Total Repayment}}{\text{Principal}} \right)^{1/t} \qquad (1)
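As a quick illustration, Eq. 1 translates directly into code (a helper we add for exposition; the names are ours, not the paper's):

```python
def arr(total_repayment: float, principal: float, t_years: float) -> float:
    """Annualized rate of return (Eq. 1): the per-year growth factor that
    turns the principal into the total repayment over t years."""
    return (total_repayment / principal) ** (1.0 / t_years)
```

For example, a loan of 1000 repaid with 1210 over 2 years has ARR = 1.21^(1/2) = 1.1, i.e. a 10% annualized return, whereas the IRR-style ratio alone (1.21) ignores the 2-year duration.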
698 Y. Wang et al.

1.2 Motivations

Although both credit scoring and profit scoring can help investors make deci-
sions, they provide the evaluations of loans from totally different perspectives.
Credit scoring pursues lower risk while profit scoring values higher profit. There-
fore, the loans recommended by one approach may not be considered as the
high-quality loans by the other. However, in many cases, the higher-risk loans are those with greater profit potential due to the higher interest rate. Even the defaulted loans could have generated profit for the lender before they turned into delinquency. Hence, we would like to propose a recommendation system that
could better evaluate the loans for lenders by integrating the profit information
into credit scoring. We hope the top loans ranked by the integrated model would
be those in general safe while more profitable.
The second motivation of our research is to design a new strategy that
can handle the imbalanced P2P data. It is not hard to imagine that typically
most loads were fully repaid in time. The default loans make a smaller cate-
gory. Among those default loans, the defaulted but profitable loans contain an
even smaller proportion. Traditionally, cost-sensitive learning is an approach to
deal with the imbalance issue during the classification tasks [7]. However, such
method does not consider the profit information of the loans. Inspired by the
logic of cost-sensitive learning that deals with the imbalance issue by adjusting
according to the distribution of the target variable, we propose an innovative
loss function for credit scoring that involves not only the adjustments according
to the frequency characteristic, but also the adjustments based on profitability.
It is expected that the proposed profit-sensitive credit scoring model can deal
with the imbalanced P2P data well and can bring better investment suggestions
for P2P lenders.

1.3 Contributions

We will incorporate the profit information into credit scoring through two approaches: 1) re-defining the target variable in credit scoring to include some profit information, and 2) proposing a new loss function for credit scoring that involves profitability.
Our first approach to combine the risk information and the profit information
together is updating the target variable. Recall that the traditional credit scoring
approach is formulated as a binary classification problem that typically utilizes
the target variable with two classes: “default” or “not default”. Consequently, in
terms of profitability, the credit scoring approach has a critical inherent drawback
that has long been ignored: the class of “default” is not purely non-profitable.
Although not a usual case in real-world practice, the scenario indeed exists
that some defaulted loans, even though not being fully repaid, generate some
profit because of the high interest rate. If we simply ignore the heterogeneity of the
default loans, the model may not provide the optimal recommendation for the
investors.

To handle the issue of heterogeneity in the default loans that appear as one
category in the conventional credit scoring approach, in our study, we define the
target variable with three different classes: “default and no profit”, “default but
with profit”, and “not default and with profit”. As a result, the conventional
binary classification problem for credit scoring is transferred into a multi-level
classification task.
As discussed earlier, these three target classes are not evenly distributed.
We then bring in the idea of cost-sensitive learning to deal with the imbalanced
data. In addition to adjusting based on the target distribution which has already
included some general profitability information, we also weight each observation
according to its own profitability. So the second approach to incorporating profit information into credit scoring is to design a new loss function that weighs loans differently according to their varying profits as well as their occurrence frequencies in real-world practice.
We name the proposed method a profit-sensitive methodology. It is
expected that the proposed methodology can bring the model close to the real
cases in the P2P market and thus better guide lenders in making investment
decisions.
Theoretically, the proposed profit-sensitive loss could be applied to many
machine learning methodologies, including classification trees, logistic regression,
neural networks, etc. Considering that logistic regression is the benchmark model
for credit scoring, we will test the efficiency of the profit-sensitive learning in
this study by using logistic regression. To be specific, we use the binary logistic
regression as the conventional credit scoring approach to solve the traditional
binary classification problem for the P2P market and to produce the baseline
result. Then, we design a novel loss function based on the multinomial logistic
regression model to solve the multi-level classification problem.
In summary, our study makes contributions from three perspectives:

1. We expand the loan evaluation from the binary classification problem to a three-level classification problem with the addition of the profit information.
2. We design a novel profit-sensitive loss function to solve the pre-defined multi-
level classification.
3. The promising results in our empirical study, which will be discussed in Sect. 4, show its effectiveness when the base model is logistic regression.
It can be generalized to other multi-level classification models such as neural
networks and support vector machines, and is applicable to other scenarios
with imbalanced data. Therefore, the logic of the profit-sensitive model has
a broader effect on the P2P market or other similar markets.

The rest of the article is organized as follows. Section 2 summarizes the exist-
ing research on P2P lending in the context of credit scoring and profit scoring.
Section 3 briefly discusses the theory of the designed profit-sensitive multino-
mial logistic regression model. Section 4 empirically examines the effectiveness
of the proposed method using the Lending Club data. Section 5 concludes with
a summary.

2 Literature Review
In P2P lending, credit scoring is conventionally formulated as a binary classifica-
tion problem, which classifies the loans into either (1) the default category if the
predicted probability of default (PD) exceeds a certain pre-defined threshold, or
(2) the non-default category otherwise. Different classifiers have been used in the
credit scoring area, including binary logistic regression [14], random forest-based
classification approach [9], LightGBM and XGBoost methods [8], etc. Regardless
of the various machine learning models proposed in the credit scoring area, all of
them focused on reducing the default risk while totally ignoring the profitability.
Therefore, from the credit scoring perspective, the models suggest the lenders
to invest in the loans with a low chance going default because of the low default
risk.
Over the past few years, many studies have changed their focus from minimiz-
ing the default risk (i.e. the credit scoring approach) to maximizing the potential
profit (i.e. the profit scoring approach), since gaining profit is the final goal of
the P2P investors. As a result, the profit scoring approach was first proposed
as an alternative to credit scoring for P2P lending in [13], wherein the authors
used IRR as the measure of the profitability of loans. They built multiple lin-
ear regression and decision tree models, indicating that the lenders can obtain a
higher IRR using profit scoring models rather than a credit scoring model. In [18],
Xia et al. pointed out that ARR is a more appropriate measure of profitability
considering the varying repayment duration of the P2P loans. They proposed a
cost-sensitive boosted tree for loan evaluation, which incorporated cost-sensitive
learning in extreme gradient boosting to enhance the capability of identifying
the potential default borrowers. Regardless of the different profit measures used,
profit scoring focuses on maximizing the profit while totally ignoring the default
risk. From the profit scoring perspective, lenders should invest in the loans with
a high predicted profit because of the high return they may bring.
Both credit scoring and profit scoring can be used as the decision tools for
evaluating loans and making investment suggestions to the lenders. However, the
two approaches work from different perspectives. The high-quality loans selected
by the credit scoring approach may not be those could achieve a high profit due
to the associated low interest rate. And reversely, the high profit loans predicted
by the profit scoring approach are not always the loans going default. There are
loans paid in full but was assigned a high interest rate in the beginning. Thus,
we assume that by evaluating loans from the credit scoring and profit scoring perspectives together, we could achieve a better and more comprehensive evaluation. Our assumption was confirmed by a recently published article [2], which
was an integration of credit scoring and profit scoring. To be specific, a two-stage
scoring approach was proposed to recommend the loans to lenders. In stage 1,
the credit scoring approach was used to identify the non-default loans and these
loans were further examined in terms of IRR in stage 2. Their numerical studies
indicated that the two-stage approach outperformed the existing profit scoring
approaches with respect to IRR. To the best of our knowledge, this was the only
study that combined credit scoring and profit scoring together to evaluate P2P

loans. In spite of the improvement in predicting profitability by [2], IRR is not the optimal measure for profit since it does not consider the repayment duration
in reality. Therefore, to address the shortcomings of the reviewed credit scor-
ing and profit scoring approaches, we use ARR as the measure of profitability
and propose a profit-sensitive multinomial logistic regression model to integrate
profit and credit information together in evaluating loans. Different from [2],
where credit scoring and profit scoring were used independently in each step
in the two-stage modeling, we first formulate a multi-level classification task
by incorporating the profit information into credit scoring, then define a novel
loss function to solve the multi-level classification problem. Since the proposed
profit-sensitive model is modified from logistic regression, we will introduce the
relevant theory in Sect. 3.

3 The Profit-Sensitive Multinomial Logistic Regression


Logistic regression is widely used in many binary classification problems. For clarity, we first list the notation used throughout the rest of this article.

– D = {d_1, d_2, ..., d_N}: a dataset with N observations and p features;
– i: the index of observations, where i = 1, 2, ..., N;
– j: the index of features (independent or explanatory variables), where j = 0, 1, 2, ..., p;
– d_i: the data vector for the ith observation; d_i can also be expressed as (x_i^T, y_i), where x_i^T = [x_{i1}, x_{i2}, ..., x_{ip}] denotes its feature values and y_i is the value of its dependent (target, or outcome) variable;
– K: the number of categories/levels of the dependent variable;
– k: the index of the kth category, where k = 1, 2, ..., K;
– p(y_i = k): the probability that the ith observation belongs to the kth category;
– β_k^T = [β_{1k}, β_{2k}, ..., β_{pk}]: the coefficient vector, to be estimated by the model, used to predict p(y_i = k), where the model assumes p(y_i = k) = f(β_k^T x_i);
– β: the collection of all the coefficient vectors, where β = [β_1, β_2, ..., β_K];
– loss_i: the loss for the ith observation;
– L: the likelihood function;
– LL: the log-likelihood function;
– \mathcal{L}: the loss function;
– I(·): the indicator function.

Binary logistic regression formulates the binary classification problem by modeling the relationship between the features and the dependent variable via a sigmoid function [10]. For the ith observation, given its input values x_i, the probability that it belongs to one of the two categories is estimated by Eq. 2.

p(y_i = k \mid x_i, \beta_1) =
\begin{cases}
\pi(\beta_1^T x_i) = \dfrac{\exp(\beta_1^T x_i)}{1 + \exp(\beta_1^T x_i)} & \text{if } k = 1 \\
1 - \pi(\beta_1^T x_i) = \dfrac{1}{1 + \exp(\beta_1^T x_i)} & \text{if } k = 0
\end{cases} \qquad (2)

Assuming that all the observations are independent, the loss function (or cost function) \mathcal{L} denotes the negative of the log-likelihood LL, which is the log transformation of the likelihood function L. The goal of model training is to seek the model coefficients β that minimize \mathcal{L} given in Eq. 3.

\mathcal{L} = -LL = -\log(L) = -\log\Big( \prod_{i=1}^{N} p(y_i \mid x_i, \beta_1) \Big) = -\sum_{i=1}^{N} \log\{p(y_i \mid x_i, \beta_1)\} \qquad (3)

The single loss for the ith observation can be further defined using Eq. 4.

loss_i = -\log\{p(y_i = k \mid x_i, \beta_1)\} =
\begin{cases}
-\log\{\pi(\beta_1^T x_i)\} = -\log \dfrac{\exp(\beta_1^T x_i)}{1 + \exp(\beta_1^T x_i)} & \text{if } k = 1 \\
-\log\{1 - \pi(\beta_1^T x_i)\} = -\log \dfrac{1}{1 + \exp(\beta_1^T x_i)} & \text{if } k = 0
\end{cases} \qquad (4)

Multinomial logistic regression, which is an extension of binary logistic regression, estimates a target or outcome variable that has more than two categories [3]. The definition of the loss function for multinomial logistic regression is very similar to that of binary logistic regression. In a dataset with K different categories, loss_i, which denotes the loss for the ith observation belonging to the kth category, is defined in Eq. 5. Similarly, the loss function \mathcal{L} for the multinomial logit model is the summation of the losses of all the observations in the data, as shown in Eq. 6. The goal of the model training is to find the coefficient vectors β = (β_1, β_2, ..., β_K) that minimize \mathcal{L} given in Eq. 6.


loss_i = -\sum_{k=1}^{K} I(y_i = k) \log\{p(y_i = k \mid x_i, \beta_k)\}
= -\sum_{k=1}^{K} I(y_i = k) \log\{\pi(\beta_k^T x_i)\}
= -\sum_{k=1}^{K} I(y_i = k) \log \frac{\exp(\beta_k^T x_i)}{\sum_{k'=1}^{K} \exp(\beta_{k'}^T x_i)} \qquad (5)

\mathcal{L} = -LL = -\sum_{i=1}^{N} \sum_{k=1}^{K} I(y_i = k) \log\{p(y_i = k \mid x_i, \beta_k)\}
= -\sum_{i=1}^{N} \sum_{k=1}^{K} I(y_i = k) \log \frac{\exp(\beta_k^T x_i)}{\sum_{k'=1}^{K} \exp(\beta_{k'}^T x_i)} \qquad (6)
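Numerically, Eqs. 5 and 6 are the familiar softmax cross-entropy: the indicator I(y_i = k) zeroes out every term except the true class, so each observation contributes −log of its predicted class probability. A minimal NumPy check (illustrative code, not the authors'):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())         # subtract max for numerical stability
    return e / e.sum()

def multinomial_loss(X, y, betas):
    """Negative log-likelihood of Eq. 6; betas is a (p, K) matrix whose
    k-th column is the coefficient vector beta_k."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = softmax(betas.T @ x_i)  # p[k] = softmax probability of class k
        total -= np.log(p[y_i])     # the indicator keeps only the true class
    return total
```

With all coefficients zero every class gets probability 1/K, so the loss is N·log K; coefficients that favor the true classes reduce it.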

Motivated by the logic of cost-sensitive learning discussed in Sect. 1, in our study, we design a profit-sensitive multinomial logistic regression model. To be specific, each loan is weighted/emphasized differently during the model training process by considering both its profitability and its occurrence frequency. This leads to our definition of loss_i in the profit-sensitive multinomial logistic regression model shown in Eq. 7, where w_{i1} and w_{i2} denote the weights adjusted for the ith
Profit-Sensitive Credit Scoring in P2P Lending 703

observation according to its profitability and its frequency of occurrence, respectively. By defining the loss of each loan via Eq. 7, each loan contributes differently to model training, so the loans of greatest interest to us are emphasized the most.


$$
loss_i = -w_{i1}\, w_{i2} \sum_{k=1}^{K} I(y_i = k)\log\{p(y_i = k \mid x_i, \beta_k)\}
\tag{7}
$$

According to this definition of the loss for each loan, the loss function of the entire training set is given in Eq. 8. We hope that, by incorporating weights based on both the profit information and the frequency information into the loss function, the proposed method can identify more "profitable" loans while ensuring the "safeness" of the investment, compared with the conventional credit scoring method.


$$
L = \sum_{i=1}^{N} loss_i = -\sum_{i=1}^{N} w_{i1}\, w_{i2} \sum_{k=1}^{K} I(y_i = k)\log\{p(y_i = k \mid x_i, \beta_k)\}
\tag{8}
$$
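A minimal sketch of the weighted loss of Eq. 8, reusing the softmax form of Eq. 6 (the data and names are ours; setting all weights to 1 recovers the unweighted loss of Eq. 6):

```python
import numpy as np

def weighted_multinomial_loss(B, X, y, w1, w2):
    """Eq. 8: per-loan weights w_i1 (profitability) and w_i2 (frequency)."""
    Z = X @ B
    Z = Z - Z.max(axis=1, keepdims=True)            # numerical stability
    P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
    return float(-np.sum(w1 * w2 * np.log(P[np.arange(len(y)), y])))

X = np.array([[1.0, 0.3], [1.0, -0.7]])
y = np.array([0, 2])
base = weighted_multinomial_loss(np.zeros((2, 3)), X, y, np.ones(2), np.ones(2))
# doubling every w_i1 exactly doubles the loss, since the weights enter linearly
print(weighted_multinomial_loss(np.zeros((2, 3)), X, y, 2 * np.ones(2), np.ones(2)) / base)  # 2.0
```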

It is worth mentioning that we name the proposed model "profit-sensitive" following the naming convention for cost-sensitive methods. Although the logic comes from cost-sensitive learning, the design of using profitability and occurrence frequency together to weight loans during model training is novel. In the empirical study discussed in Sect. 4, we first mathematically prove the validity of the loss function defined for the profit-sensitive learning method and then examine its effectiveness using real-world P2P data.

4 Empirical Study

In this section, the proposed method is applied to the Lending Club data to test its effectiveness. The task is to use the proposed method to classify the Lending Club loans and thereby recommend the high-quality loans to investors. Compared with the conventional credit scoring method, which considers high-quality loans to be those with low PD regardless of their profitability, we hope the proposed method will target loans that are both safe and profitable.

4.1 Data Description

The dataset utilized in the empirical study originates from Lending Club, one of the largest P2P platforms in the US, which provides a publicly available dataset on its official website. We analyzed 1,123,895 loans originated before August 2016, of which 219,809 (19.56%) are default loans and 904,086 (80.44%) are non-default loans. The feature set for each loan covers three perspectives: (1) the loan-related information, such as loan

purpose, term of the loan, etc.; (2) the credit information of the borrower, such as the FICO score and the debt-to-income (dti) ratio; and (3) other information about the borrower, such as whether or not they own their living place. One feature worth mentioning is grade: Lending Club rates all loans into seven grades, labeled from A to G, with the nominal interest rate increasing from A to G.
The grade could act as a direct decision tool to assist lenders in making rational investment decisions; however, it is not a reliable tool, since default loans still exist even in the safest grade. The variable loan status denotes the status of a loan after it expires, with 1 denoting that the loan defaulted and 0 denoting that it was fully paid. Loan status is the target variable of the traditional credit scoring approach.
Based on the raw Lending Club data, we define a target variable ARR using Eq. 1 to measure the profitability of the loans in our study. It is worth noting that the ARR calculated by Eq. 1 is the actual ARR realized in practice. It may differ from the theoretical ARR expected when the loan was originated, because of possible early repayments or delinquencies. Loans with ARR greater than 1 earn a profit and vice versa. The mean, median, and SD of ARR are 0.99, 1.07, and 0.25, respectively. It is surprising that, on average, investing in the P2P market is not profitable, as indicated by the mean ARR. Therefore, a data-driven recommendation that performs better than randomly choosing loans to invest in is essential for lenders.

4.2 Target Transformation

Our study starts by redefining the target variable. Instead of simply classifying the loans into two categories while ignoring the profit information, we incorporate the profitability of the default loans into the target variable and thus transform the binary classification problem into a multi-level classification problem.
Specifically, a new target variable named "Group" is created as in Eq. 9, where Group = NoDefProf means the loan was fully paid and earned a profit, Group = DefNoProf means the loan defaulted and caused a loss to the investor, and Group = DefProf means the loan defaulted but still generated some profit. There is no scenario of Group = NoDefNoProf (i.e., a non-default loan without any profit), since all the principal plus some interest has been paid back if the loan does not default.


$$
Group =
\begin{cases}
\text{NoDefProf} & \text{if } loan\ status = 0 \ \&\ ARR > 1 \\
\text{DefNoProf} & \text{if } loan\ status = 1 \ \&\ ARR \le 1 \\
\text{DefProf}   & \text{if } loan\ status = 1 \ \&\ ARR > 1
\end{cases}
\tag{9}
$$
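The mapping of Eq. 9 is straightforward to express in code (a sketch with made-up example values; loan_status and ARR follow the paper's definitions):

```python
def to_group(loan_status, arr):
    """Map (loan status, ARR) to the three-level target of Eq. 9."""
    if loan_status == 0 and arr > 1:
        return "NoDefProf"   # fully paid and profitable
    if loan_status == 1 and arr <= 1:
        return "DefNoProf"   # defaulted and caused a loss
    if loan_status == 1 and arr > 1:
        return "DefProf"     # defaulted but still profitable
    raise ValueError("a fully paid loan always has ARR > 1")

print([to_group(0, 1.08), to_group(1, 0.61), to_group(1, 1.03)])
# ['NoDefProf', 'DefNoProf', 'DefProf']
```

The `ValueError` branch encodes the paper's observation that the fourth combination (fully paid, ARR ≤ 1) cannot occur.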
The distribution of the newly created outcome "Group" in the training set (70% of the entire dataset) is given in Table 1. 80.42% of the loans are non-default, with a mean ARR of 1.09. As expected, the default category is heterogeneous: most of the default loans earn no profit, while 13,267 of them (1.69% of the training set) are profitable.

Table 1. Distribution of the new outcome Group in the training set.

Group      Frequency  Proportion  ARR Mean  ARR Median  ARR St. dev
NoDefProf  632,728    80.42%      1.09      1.08        0.045
DefNoProf  140,730    17.89%      0.55      0.61        0.320
DefProf    13,267     1.69%       1.04      1.03        0.047

By creating the new outcome "Group", the traditional credit scoring problem has been transformed into a multi-level classification problem. Next, we use the proposed profit-sensitive multinomial logistic regression method to solve this multi-level classification problem, with "Group" as the target variable. We then check whether the proposed method can identify more profitable loans than the traditional binary classification approach.

4.3 Define the Loss Function

By applying Eq. 5, we define the loss for the ith loan in the transformed 3-category classification problem in Eq. 10. Note that p_dp, p_ndp, and p_dnp denote the probabilities that the ith loan belongs to the categories DefProf, NoDefProf, and DefNoProf, respectively, and β_dp, β_ndp, and β_dnp are the corresponding coefficient vectors. The loss function L for the multinomial logit model in our scenario, which is the sum of the losses over all observations in the data, is then given in Eq. 11.
As shown in Table 1, "Group" has an extremely imbalanced distribution: the proportion of the DefProf category is much lower than the other two. Our initial experiments showed that, when using the multinomial logistic regression loss defined in Eq. 11, the model did not classify any loan into the DefProf category, which confirms that the traditional multinomial method is not appropriate for the extremely imbalanced P2P data. One may suggest simply excluding the minority category DefProf from the recommendations because of its low frequency of occurrence. However, as shown in Table 1, the median ARR of the DefProf category is 1.03, which indicates that DefProf is also a class of interest when making investment suggestions.

$$
loss_i = -I(y_i = k)\log\{p(y_i \mid x_i, \beta_k)\} =
\begin{cases}
-\log(p_{dp}) = -\log \dfrac{\exp(\beta_{dp}^{T} x_i)}{\exp(\beta_{dp}^{T} x_i) + \exp(\beta_{ndp}^{T} x_i) + \exp(\beta_{dnp}^{T} x_i)} & \text{if } y_i = \text{DefProf} \\[10pt]
-\log(p_{ndp}) = -\log \dfrac{\exp(\beta_{ndp}^{T} x_i)}{\exp(\beta_{dp}^{T} x_i) + \exp(\beta_{ndp}^{T} x_i) + \exp(\beta_{dnp}^{T} x_i)} & \text{if } y_i = \text{NoDefProf} \\[10pt]
-\log(p_{dnp}) = -\log \dfrac{\exp(\beta_{dnp}^{T} x_i)}{\exp(\beta_{dp}^{T} x_i) + \exp(\beta_{ndp}^{T} x_i) + \exp(\beta_{dnp}^{T} x_i)} & \text{if } y_i = \text{DefNoProf}
\end{cases}
\tag{10}
$$


$$
\begin{aligned}
L = \sum_{i=1}^{N} loss_i
  = &-\sum_{i=1}^{N} I(y_i = \text{DefProf})\log(p_{dp})
     -\sum_{i=1}^{N} I(y_i = \text{NoDefProf})\log(p_{ndp}) \\
    &-\sum_{i=1}^{N} I(y_i = \text{DefNoProf})\log(p_{dnp})
\end{aligned}
\tag{11}
$$
As discussed in Sect. 3, we propose a profit-sensitive multinomial logistic method by defining the new loss function in Eq. 8, which includes two weight terms. To make the proposed profit-sensitive model more accurate, we further adjust the weights in Eq. 8 in two respects. First, the value w_{i1} in Eq. 8 is defined as the ARR of the ith loan, so that each loan is weighted differently based on its own profitability. Moreover, w_{i1} is used only for the profitable loans: we apply w_{i1} to loans from the DefProf and NoDefProf categories, while loans from the DefNoProf category do not carry the w_{i1} term. In this way, the real profit information of the loans is taken into account: more profitable loans are emphasized more during modeling, and the non-profitable loans are all treated equally. Second, instead of adjusting w_{i2} for each of the three categories of "Group", we only re-weight the loans belonging to the DefProf category, because DefProf has the lowest frequency yet is one of the target categories of interest to investors. By adding w_{i2}, we can correct the bias caused by the extremely low frequency of the DefProf category. The loss for each loan i is therefore further modified in Eq. 12. Note that w_{i1} and w_{i2} are used only in the training stage of the model, not the prediction stage, so it is acceptable to use "posterior" information such as ARR here.
To summarize, for the categories that can generate profit, namely DefProf and NoDefProf, we use the realized profit as the weight term (denoted w_i in Eq. 12) in order to put different emphasis on the loans based on their varying profitability.

For the minority category DefProf, we add an additional weight term w_freq to compensate for its extremely low frequency. w_freq is a hyper-parameter that is tuned during model training.

$$
loss_i =
\begin{cases}
-w_i\, w_{freq} \log(p_{dp}) = -w_i\, w_{freq} \log \dfrac{\exp(\beta_{dp}^{T} x_i)}{\exp(\beta_{dp}^{T} x_i) + \exp(\beta_{ndp}^{T} x_i) + \exp(\beta_{dnp}^{T} x_i)} & \text{if } y_i = \text{DefProf} \\[10pt]
-w_i \log(p_{ndp}) = -w_i \log \dfrac{\exp(\beta_{ndp}^{T} x_i)}{\exp(\beta_{dp}^{T} x_i) + \exp(\beta_{ndp}^{T} x_i) + \exp(\beta_{dnp}^{T} x_i)} & \text{if } y_i = \text{NoDefProf} \\[10pt]
-\log(p_{dnp}) = -\log \dfrac{\exp(\beta_{dnp}^{T} x_i)}{\exp(\beta_{dp}^{T} x_i) + \exp(\beta_{ndp}^{T} x_i) + \exp(\beta_{dnp}^{T} x_i)} & \text{if } y_i = \text{DefNoProf}
\end{cases}
\tag{12}
$$

Based on Eq. 12, the loss function L of the proposed profit-sensitive multinomial logit regression is finally defined in Eq. 13.


$$
\begin{aligned}
L = \sum_{i=1}^{N} loss_i
= &-\sum_{i:\, y_i = \text{DefProf}} w_i\, w_{freq} \log \frac{\exp(\beta_{dp}^{T} x_i)}{\exp(\beta_{dp}^{T} x_i) + \exp(\beta_{ndp}^{T} x_i) + \exp(\beta_{dnp}^{T} x_i)} \\
  &-\sum_{i:\, y_i = \text{NoDefProf}} w_i \log \frac{\exp(\beta_{ndp}^{T} x_i)}{\exp(\beta_{dp}^{T} x_i) + \exp(\beta_{ndp}^{T} x_i) + \exp(\beta_{dnp}^{T} x_i)} \\
  &-\sum_{i:\, y_i = \text{DefNoProf}} \log \frac{\exp(\beta_{dnp}^{T} x_i)}{\exp(\beta_{dp}^{T} x_i) + \exp(\beta_{ndp}^{T} x_i) + \exp(\beta_{dnp}^{T} x_i)}
\end{aligned}
\tag{13}
$$
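The per-loan weights entering Eq. 13 can be assembled as in the sketch below (our names; ARR plays the role of w_i for the profitable categories, and w_freq = 20 is the tuned value reported in Sect. 4.6):

```python
import numpy as np

def loan_weights(groups, arr, w_freq=20.0):
    """Per-loan weights of Eqs. 12-13: ARR for profitable loans, an extra
    w_freq factor for the rare DefProf class, and 1 for DefNoProf."""
    w = np.ones(len(groups))
    for i, g in enumerate(groups):
        if g == "DefProf":
            w[i] = arr[i] * w_freq     # profitable AND rare: both weights apply
        elif g == "NoDefProf":
            w[i] = arr[i]              # profitable: weight by realized ARR
        # DefNoProf keeps weight 1
    return w

groups = ["NoDefProf", "DefNoProf", "DefProf"]
arr = np.array([1.08, 0.61, 1.03])
print(loan_weights(groups, arr).tolist())  # roughly [1.08, 1.0, 20.6]
```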

The model is trained by minimizing the above L. To confirm that a globally optimal solution exists, we first mathematically prove the convexity of the proposed loss function, as detailed below.

4.4 Convexity of the Proposed Loss Function

After designing the loss function, it is critical to mathematically prove that the algorithm minimizing the loss function L in Eq. 13 converges during training when an appropriate learning rate is used. Otherwise, we cannot guarantee a reliable and optimal solution that minimizes the loss function. We apply gradient descent, a widely used optimization algorithm, to minimize our loss function and solve for β. According to Theorem 1, shown in [15], the problem of proving convergence can be reduced to the problem of proving the convexity of the loss function.

Theorem 1. Suppose the function f : R^n → R is convex and differentiable, and that its gradient is Lipschitz continuous with constant L > 0, i.e., ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂ for any x, y. Then if we run gradient descent for k iterations with a fixed step size t ≤ 1/L, it will yield a solution x^(k) which satisfies Eq. 14, where f(x*) is the optimal value.

$$
f(x^{(k)}) - f(x^{*}) \le \frac{\|x^{(0)} - x^{*}\|_2^{2}}{2tk}
\tag{14}
$$

In other words, gradient descent is guaranteed to converge, and it converges with rate O(1/k) for a convex and differentiable function.

Convexity is articulated in Definition 1. Although convexity has been defined in many different ways in previous research, Definition 1 is the most straightforward one [12]. As this definition shows, the problem of proving convexity can be further transformed into the problem of proving that the Hessian matrix of the given function is positive semi-definite.

Definition 1. A twice differentiable function f : R^n → R is convex if and only if inequality 15 holds:

$$
z^{T}\Big[\frac{\partial^{2} f(x)}{(\partial x)^{2}}\Big] z \ge 0, \quad \forall z
\tag{15}
$$

In other words, f is convex if and only if the Hessian matrix ∂²f(x)/(∂x)² is positive semi-definite for all x ∈ R^n.

Lemma 1 states an important property of convex functions that we will use later in our proof [5].

Lemma 1. Let f(x), g(x) be two convex functions. Then for λ₁, λ₂ ≥ 0, λ₁f(x) + λ₂g(x) is also convex. In other words, a non-negative linear combination of convex functions is also convex.

We now give the proof of the convexity of the loss function L in Eq. 13.

Proof. The loss function L given in Eq. 13 can be expressed as a linear combination of functions 16, 17, and 18. According to Lemma 1, to prove the convexity of L it suffices to prove the convexity of these three functions. According to Definition 1, to prove the convexity of Eqs. 16, 17, and 18, we need to show that their Hessian matrices are all positive semi-definite. Without loss of generality, we prove the convexity of function 16 only; the convexity of functions 17 and 18 can be obtained similarly.
$$
-w_i\, w_{freq} \log \frac{\exp(\beta_{dp}^{T} x_i)}{\exp(\beta_{dp}^{T} x_i) + \exp(\beta_{ndp}^{T} x_i) + \exp(\beta_{dnp}^{T} x_i)}
\tag{16}
$$

$$
-w_i \log \frac{\exp(\beta_{ndp}^{T} x_i)}{\exp(\beta_{dp}^{T} x_i) + \exp(\beta_{ndp}^{T} x_i) + \exp(\beta_{dnp}^{T} x_i)}
\tag{17}
$$

$$
-\log \frac{\exp(\beta_{dnp}^{T} x_i)}{\exp(\beta_{dp}^{T} x_i) + \exp(\beta_{ndp}^{T} x_i) + \exp(\beta_{dnp}^{T} x_i)}
\tag{18}
$$

The first derivative of function 16 with respect to β_dp is derived in Eq. 19, and its Hessian matrix is given in Eq. 20. Eq. 21 then checks whether the Hessian matrix is positive semi-definite.

$$
\begin{aligned}
&\frac{\partial}{\partial \beta_{dp}}\Big[-w_i\, w_{freq} \log \frac{\exp(\beta_{dp}^{T} x_i)}{\exp(\beta_{dp}^{T} x_i) + \exp(\beta_{ndp}^{T} x_i) + \exp(\beta_{dnp}^{T} x_i)}\Big] \\
&= -w_i\, w_{freq}\Big[1 - \frac{\exp(\beta_{dp}^{T} x_i)}{\exp(\beta_{dp}^{T} x_i) + \exp(\beta_{ndp}^{T} x_i) + \exp(\beta_{dnp}^{T} x_i)}\Big] x_i \\
&:= -w_i\, w_{freq}\big[1 - \pi(\beta_{dp}^{T} x_i)\big] x_i
 = w_i\, w_{freq}\big[\pi(\beta_{dp}^{T} x_i) - 1\big] x_i
\end{aligned}
\tag{19}
$$

$$
\begin{aligned}
\mathrm{Hessian}(f(\beta_{dp}))
&= \Big[\frac{\partial^{2} f(\beta_{dp})}{\partial \beta_{dp}\, \partial \beta_{dp}^{T}}\Big]
 = \frac{\partial^{2}}{\partial \beta_{dp}\, \partial \beta_{dp}^{T}}\Big[-w_i\, w_{freq} \log \frac{\exp(\beta_{dp}^{T} x_i)}{\exp(\beta_{dp}^{T} x_i) + \exp(\beta_{ndp}^{T} x_i) + \exp(\beta_{dnp}^{T} x_i)}\Big] \\
&= \frac{\partial}{\partial \beta_{dp}^{T}}\Big\{ w_i\, w_{freq}\big[\pi(\beta_{dp}^{T} x_i) - 1\big] x_i \Big\}
 = w_i\, w_{freq}\, \pi(\beta_{dp}^{T} x_i)\big[1 - \pi(\beta_{dp}^{T} x_i)\big] x_i x_i^{T}
\end{aligned}
\tag{20}
$$

Then, for all z ∈ R^p,

$$
z^{T}\Big[\frac{\partial^{2} f(\beta_{dp})}{\partial \beta_{dp}\, \partial \beta_{dp}^{T}}\Big] z
= z^{T}\Big\{ w_i\, w_{freq}\, \pi(\beta_{dp}^{T} x_i)\big[1 - \pi(\beta_{dp}^{T} x_i)\big] x_i x_i^{T} \Big\} z
= w_i\, w_{freq}\, \pi(\beta_{dp}^{T} x_i)\big[1 - \pi(\beta_{dp}^{T} x_i)\big] (x_i^{T} z)^{2}
\tag{21}
$$

In Eq. 21, we have w_i > 0 and w_freq > 0, since both denote weights on the loans in our definition of L. We also have π(β_dp^T x_i) ≥ 0 and 1 − π(β_dp^T x_i) ≥ 0 because of the range of the softmax function. Finally, (x_i^T z)² ≥ 0 always holds because it is the square of a scalar. Therefore, z^T [∂²f(β_dp)/(∂β_dp ∂β_dp^T)] z ≥ 0 for all z ∈ R^p, and the Hessian matrix ∂²f(β_dp)/(∂β_dp ∂β_dp^T) is positive semi-definite.

According to Definition 1, function 16 is convex with respect to β_dp. Similarly, function 17 is convex with respect to β_ndp, and function 18 is convex with respect to β_dnp. Since L in Eq. 13 is a non-negative linear combination of Eqs. 16, 17, and 18, we conclude that L is convex according to Lemma 1. Finally, according to Theorem 1, we conclude that minimizing L has an optimal solution and that the gradient descent algorithm is guaranteed to converge.
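The positive semi-definiteness argument of Eqs. 20-21 is also easy to check numerically; in the sketch below the point x_i, the probability π, and the weights are arbitrary test values, not data from the study:

```python
import numpy as np

w_i, w_freq = 1.05, 20.0              # arbitrary positive weights
x = np.array([1.0, 0.4, -2.3])        # arbitrary feature vector x_i
pi = 0.37                             # any softmax probability in (0, 1)
H = w_i * w_freq * pi * (1 - pi) * np.outer(x, x)   # Hessian of Eq. 20

# the quadratic form of Eq. 21 equals w_i * w_freq * pi * (1 - pi) * (x^T z)^2 >= 0
z = np.random.default_rng(0).normal(size=3)
print(z @ H @ z >= -1e-12)            # True for every z, up to rounding
```

The rank-one matrix x_i x_iᵀ scaled by a non-negative factor has no negative eigenvalues, which is exactly the conclusion of the proof.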

4.5 Learning Algorithm of the Proposed Methodology

After proving the convexity of the proposed loss function L, we now articulate the algorithm for learning the coefficients during model training. Considering the large size of the training set, the mini-batch stochastic gradient descent algorithm is used to learn the proposed multinomial logit model [4]. Algorithm 1 gives the details of the training procedure, and Eqs. 22, 23, and 24 give the corresponding partial derivatives.

$$
\frac{\partial L}{\partial \beta_{dp}} = -w\Big[I(y_i = \text{DefProf}) - \frac{\exp(\beta_{dp}^{T} x_i)}{\exp(\beta_{dp}^{T} x_i) + \exp(\beta_{ndp}^{T} x_i) + \exp(\beta_{dnp}^{T} x_i)}\Big] x_i
\tag{22}
$$

$$
\frac{\partial L}{\partial \beta_{ndp}} = -w\Big[I(y_i = \text{NoDefProf}) - \frac{\exp(\beta_{ndp}^{T} x_i)}{\exp(\beta_{dp}^{T} x_i) + \exp(\beta_{ndp}^{T} x_i) + \exp(\beta_{dnp}^{T} x_i)}\Big] x_i
\tag{23}
$$

$$
\frac{\partial L}{\partial \beta_{dnp}} = -w\Big[I(y_i = \text{DefNoProf}) - \frac{\exp(\beta_{dnp}^{T} x_i)}{\exp(\beta_{dp}^{T} x_i) + \exp(\beta_{ndp}^{T} x_i) + \exp(\beta_{dnp}^{T} x_i)}\Big] x_i
\tag{24}
$$

where

$$
w =
\begin{cases}
w_i\, w_{freq} & \text{if } y_i = \text{DefProf} \\
w_i & \text{if } y_i = \text{NoDefProf} \\
1 & \text{if } y_i = \text{DefNoProf}
\end{cases}
\tag{25}
$$

Algorithm 1. Learning the Multinomial Model via Mini-Batch Stochastic Gradient Descent
1: Input: data D = {d_1, d_2, ..., d_n}, loss function L, number of epochs T, learning rate η, number of mini-batches m.
2: Split D into mini-batches B_0, ..., B_{m−1}.
3: Initialize β_dp^0, β_ndp^0, β_dnp^0.
4: k ← 0
5: for s = 1 to T do                                        ▷ iterate over the T epochs
6:     for b = 0 to m − 1 do                                ▷ iterate over the m mini-batches
7:         β_dp^{k+1} ← β_dp^k − η (1/|B_b|) Σ_{i∈B_b} ∂L/∂β_dp (d_i = (x_i, y_i))
8:         β_ndp^{k+1} ← β_ndp^k − η (1/|B_b|) Σ_{i∈B_b} ∂L/∂β_ndp (d_i = (x_i, y_i))
9:         β_dnp^{k+1} ← β_dnp^k − η (1/|B_b|) Σ_{i∈B_b} ∂L/∂β_dnp (d_i = (x_i, y_i))
10:        k ← k + 1
11:    end for
12:    if converged then break
13: end for
14: Output: profit-sensitive multinomial model β_dp^k, β_ndp^k, β_dnp^k.
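Algorithm 1 with the gradients of Eqs. 22-24 can be sketched compactly in matrix form; the toy data and hyper-parameter values below are illustrative, not the tuned values of Sect. 4.6:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def train_profit_sensitive(X, y, w, K, T=200, eta=0.1, m=2, seed=0):
    """Mini-batch SGD of Algorithm 1; w is the per-loan weight of Eq. 25."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    B = np.zeros((p, K))                  # columns play the roles of beta_dp, beta_ndp, beta_dnp
    batches = np.array_split(rng.permutation(n), m)
    for _ in range(T):                    # epochs
        for b in batches:                 # mini-batches
            P = softmax(X[b] @ B)         # predicted class probabilities
            Y = np.eye(K)[y[b]]           # one-hot indicators I(y_i = k)
            # Eqs. 22-24: gradient column k is -sum_i w_i [I(y_i = k) - P_ik] x_i
            G = -(X[b] * w[b][:, None]).T @ (Y - P) / len(b)
            B -= eta * G
    return B

# toy, linearly separable two-class data with unit weights
X = np.array([[1.0, 2.0], [1.0, 1.8], [1.0, -2.0], [1.0, -1.7]])
y = np.array([0, 0, 1, 1])
B = train_profit_sensitive(X, y, np.ones(4), K=2)
print(softmax(X @ B).argmax(axis=1))      # recovers the labels [0 0 1 1]
```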

4.6 Implementation of the Proposed Model


To test the effectiveness of the proposed methodology, we train the profit-sensitive multinomial logit model by minimizing the loss function L defined in Sect. 4.3 using the algorithm given in Sect. 4.5. During training, the hyper-parameters mentioned in Algorithm 1 were pre-determined by considering the trade-off between model performance and training time. Specifically, the number of epochs T, the number of mini-batches m, the learning rate η, and w_freq were carefully tuned by trial and error to maximize the cross-validated ARR, with search domains of (100, 2000), (1000, 50000), (0.00001, 0.1), and (1, 40), respectively. The final settings of these hyper-parameters are 1000, 10000, 0.001, and 20, respectively. The proposed model was first fitted on the training set (70% of the entire dataset) to estimate the coefficients, and the fitted model was then applied to the test set (the remaining 30%) to validate its effectiveness in loan evaluation. The entire experiment was conducted in Python (version 3.5) on a laptop with a 3.3 GHz Intel Core i7 CPU, 16 GB RAM, and macOS.

4.7 Loan Recommendation


The main purpose of this article is to evaluate whether incorporating profit information into credit modeling is beneficial in detecting more "profitable" loans. To validate the superiority of the proposed model over existing similar approaches, we use our model to identify good loans in the test dataset and then recommend them to investors. The first step is to set the rule used to recommend loans based on the model results. In this study, we use p(y_i = DefNoProf), the probability that the ith loan belongs to the defaulted-and-no-profit (DefNoProf) category, as the ranking metric for recommendation. The reasons are as follows. Since both NoDefProf and DefProf reflect a "good" characteristic of a loan from the profitability perspective, it would be unfair to use one of them while discarding the other in loan evaluation. In contrast, DefNoProf reflects the "bad" characteristics of a loan from both the risk and the profitability perspectives. Thus, we recommend the loans with the lowest p(y_i = DefNoProf) to investors.
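The recommendation rule amounts to sorting loans by predicted p(y_i = DefNoProf) in ascending order; the probabilities below are made up purely for illustration:

```python
import numpy as np

# hypothetical predicted p(y_i = DefNoProf) for five test loans
p_defnoprof = np.array([0.31, 0.05, 0.18, 0.02, 0.44])

def recommend(p_bad, top_k):
    """Return the indices of the top_k loans with the lowest p(DefNoProf)."""
    return np.argsort(p_bad)[:top_k]

print(recommend(p_defnoprof, 2).tolist())  # [3, 1]: the two safest picks
```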

4.8 Performance Evaluation and Comparison

To confirm that incorporating profit information into credit modeling is beneficial in detecting more "profitable" loans, it is crucial to compare its performance with the conventional credit scoring approach (i.e., one that does not use profit information). Considering that the profit-sensitive multinomial logistic model is a modified variant of logistic regression, it is reasonable to use logistic regression as the benchmark model.
In addition, as discussed in Sect. 1.3, the main contributions of this study are twofold. The first is transforming the binary classification problem of traditional credit scoring into a three-level classification task, which partially achieves the goal of incorporating the profit information of the loans into credit scoring. The second goes beyond the existing cost-insensitive and cost-sensitive multi-level classification methods by proposing a novel loss function that weights the loans by both their profitability and their frequency. For a comprehensive analysis that highlights these contributions, we compare the proposed model (labeled Model 6) with several cost-insensitive and cost-sensitive logistic regression models (labeled Model 1 to Model 5). Models 1, 2, and 3 address binary classification problems, so they all use "loan status" as the target variable. Models 4, 5, and 6 address multi-level classification problems, so they use "Group" as the target variable. The details of the six models are given below:

– Model 1: A conventional credit scoring model based on binary logistic regression.
– Model 2: A cost-sensitive binary logistic regression, where the weight for each class is defined according to the Heuristic method, the best practice for addressing the imbalanced-data issue in the scikit-learn library in Python [11]. The Heuristic method uses the inverse of the class distribution to weight the observations in the training set, defining the weights according to Eq. 26, where n_samples is the number of observations in the dataset, n_classes is the number of different classes, and n_samples_with_class is the number of observations in the particular class. For our training data, the weights for class 0 (i.e., loan status = 0) and class 1 (i.e., loan status = 1) in Model 2 are 786726/(2 × 632728) = 0.62 and 786726/(2 × 153998) = 2.55, respectively.

$$
weight = \frac{n\_samples}{n\_classes \times n\_samples\_with\_class}
\tag{26}
$$
– Model 3: A profit-sensitive binary logistic regression. The implementation of Model 3 is similar to that of the proposed model described in Sect. 4.6, except that the weights used in Model 3 are determined based on the binary outcome "loan status". Similar to Eq. 12, we use the individual ARR to weight each loan from the profitable class and use an additional term w_freq to adjust the minority class (i.e., loan status = 1). w_freq is a hyper-parameter tuned to maximize the cross-validated ARR; its final setting in Model 3 is 10.
– Model 4: A conventional multinomial logistic regression.
– Model 5: A cost-sensitive multinomial logistic regression, where the weight for each class is again defined according to the Heuristic method in Eq. 26. Specifically, the weights for the classes DefProf, NoDefProf, and DefNoProf are 786726/(3 × 13268) = 19.76, 786726/(3 × 632728) = 0.41, and 786726/(3 × 140730) = 1.86, respectively.
– Model 6: The proposed profit-sensitive multinomial logistic regression model, with the details in Sect. 4.6.
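The heuristic weights of Eq. 26 are easy to reproduce from the class counts quoted above (scikit-learn's `class_weight="balanced"` option implements the same inverse-frequency rule):

```python
def heuristic_weight(n_samples, n_classes, n_samples_with_class):
    """Eq. 26: inverse-frequency class weight."""
    return n_samples / (n_classes * n_samples_with_class)

n = 786726                                        # training-set size from the paper
print(round(heuristic_weight(n, 2, 632728), 2))   # 0.62 for loan status = 0
print(round(heuristic_weight(n, 2, 153998), 2))   # 2.55 for loan status = 1
print(round(heuristic_weight(n, 3, 632728), 2))   # 0.41 for NoDefProf
```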

For model evaluation and comparison, Models 1, 2, and 3 are compared by the predicted PD for each loan. Since a higher PD corresponds to a "bad" characteristic, loans with a lower PD are recommended to lenders. Models 4, 5, and 6 output three probabilities, as discussed in Sect. 4.6: p(y_i = NoDefProf), p(y_i = DefNoProf), and p(y_i = DefProf). We recommend the loans with the lowest p(y_i = DefNoProf) to investors, following the rule set in Sect. 4.7.
Unlike previous research, which commonly uses accuracy to compare classification models, we define our own comparison rule in this article because of the special design and purpose of the study. Considering that the main goal is to detect and recommend "higher profit" loans, we use the average profitability of the top loans recommended by the six models for model comparison.

4.9 Results and Discussion


Given the outputs of the six models, each model recommends some high-quality loans to investors. We consider a scenario in which a lender chooses several top loans according to the investment suggestions of each of the six models. The profitability of the loans selected by the six models, measured as the average ARR, is reported in Fig. 1, where the x-axis denotes the number of top loans identified by each model, varying from 1 to 18, and the y-axis shows the average ARR. For example, at the value 18 on the x-axis, all six models recommend their top 18 high-quality loans to the investor: Models 1-5 recommend loan portfolios with average ARR below 1.03, while Model 6 recommends a portfolio with average ARR around 1.04.

Fig. 1. Average ARR from the selected loans identified by the six models.

Figure 1 indicates that the profitability of our proposed model (Model 6) is consistently superior to (or on par with) the other five models in most cases; the two exceptions occur when the top 1 or top 13 loans are selected. This verifies that incorporating profit information into credit scoring, and further weighting the loans by their varying profitability during model training, helps the model spot the more profitable loans. Pairwise comparison of the results displayed in Fig. 1 yields further observations, which highlight the contributions of the proposed methodology:

– Model 1 vs Model 2: Cost-sensitive learning (Model 2), which addresses the imbalanced P2P data in binary classification, does not show superiority over cost-insensitive learning (Model 1). In other words, conventional cost-sensitive learning is not the optimal option in the P2P study, possibly because the imbalance issue is not severe in the P2P market.
– Model 2 vs Model 3: Profit-sensitive learning (Model 3) performs better than cost-sensitive learning (Model 2) in a few cases but not always; the proposed profit-sensitive learning approach thus shows only weak superiority in the binary classification case. This verifies that incorporating profit information into the target is a useful and important step in our proposed model.
– Model 1 vs Model 4: The similar performance of these two models indicates that solely transforming the binary classification problem (Model 1) into the multi-level classification problem (Model 4) does not help identify more profitable loans.
– Model 4 vs Model 5: Cost-insensitive learning (Model 4) even outperforms cost-sensitive learning (Model 5) in many cases, such as when selecting the top 9 or top 10 loans. Although the Heuristic method is the best practice for addressing the imbalance issue, it is not the optimal solution for finding more profitable loans in the P2P market when the problem is structured as a multi-level classification task. This may be because we evaluate the models differently here, focusing only on the top loans identified; under the accuracy rate, we might observe different results.
– Model 5 vs Model 6: Profit-sensitive learning (Model 6) performs much better than cost-sensitive learning (Model 5) in most cases.
– Model 3 vs Model 6: Both are profit-sensitive learning approaches; the only difference is that Model 3 solves the binary classification problem while Model 6 solves the multi-level classification task. However, Model 6 performs much better than Model 3. This is consistent with our expectation: by transforming the binary classification problem into the multi-level classification problem, the heterogeneity of the default loans is further reduced. Accordingly, the model predictions are closer to the real interest of the investors, and better investment suggestions are provided.

In summary, Fig. 1 confirms our belief that the proposed profit-sensitive multinomial learning method can help lenders select better loans compared with the traditional credit scoring approach. Furthermore, the results in Fig. 1 highlight our contributions: transforming the traditional binary classification problem into a multi-level task, and using profit-sensitive learning to solve the multi-level classification problem. Both are essential for identifying the best loans.
To further examine the performance differences among the six models, we consider the scenario in which a lender chooses the top 18 loans according to each of the six models and check the details of the selected loans. Table 2 presents the composition of the top 18 loans selected by the six models. As the results show, Models 1, 3, and 5 select the same numbers of loans from grades A, B, C, D, and E, while Models 2 and 4 select the same numbers of loans across grades. The distribution of loans across grades differs in Model 6, however. Among the 18 loans recommended to lenders, grade A loans are always the most numerous. This is consistent with the expectation that the credit scoring approach (with or without profit information) tends to identify "safer" loans. However, the average ARR of the selected grade A loans differs across the six models, indicating that the loans identified by each model are not exactly the same.
Table 3 summarizes the overall average ARR and the average default rate of the top 18 loans identified by the six models. It is surprising that, despite achieving the highest average ARR, Model 6 also has the lowest default rate among all the models. This verifies that, without sacrificing "safety", incorporating profit information into credit scoring can identify loans with higher profitability than the credit-scoring-alone approach.

Table 2. Constituent of the top 18 loans selected by Models 1-6. The "Sum" column contains the total number of loans, with the number of defaulted loans in parentheses. ARR denotes the average ARR in each grade segment.

Grade  Model 1        Model 2        Model 3        Model 4        Model 5        Model 6
       Sum   ARR      Sum   ARR      Sum   ARR      Sum   ARR      Sum   ARR      Sum   ARR
A      7(0)  1.048    8(0)  1.047    7(0)  1.048    8(0)  1.047    7(0)  1.048    8(0)  1.047
B      3(0)  1.060    2(0)  1.043    3(0)  1.060    2(0)  1.043    3(0)  1.062    3(0)  1.062
C      6(1)  1.051    6(1)  1.051    6(1)  1.051    6(1)  1.051    6(1)  1.051    5(0)  1.105
D      1(1)  0.800    1(1)  0.800    1(1)  0.800    1(1)  0.800    1(1)  0.800    1(1)  0.800
E      1(1)  0.881    1(1)  0.881    1(1)  0.881    1(1)  0.881    1(1)  0.881    1(1)  0.881

Table 3. Average ARR and average default rate of the top 18 loans selected by the
six models. ARR denotes the overall average ARR.

Metric        Model 1  Model 2  Model 3  Model 4  Model 5  Model 6
ARR           1.028    1.025    1.028    1.025    1.028    1.043
Default rate  0.167    0.167    0.167    0.167    0.167    0.111

In closing, the effectiveness of the proposed methodology is validated using the real-world data from Lending Club, which is one of the largest P2P platforms in the US. In order to have a comprehensive analysis, we compared the
proposed method with a series of benchmarks or existing similar approaches
that we could find in the previous literature, including binary logistic regression,
cost-sensitive binary logistic regression, profit-sensitive binary logistic regression,
multinomial logistic regression, and cost-sensitive multinomial logistic regression. Results have shown that the proposed profit-sensitive multinomial logistic regression achieves the highest profitability while maintaining a low default rate compared to the benchmarks. Therefore, it is confirmed that integrating the
profit information into credit scoring and using the profit to adjust the emphasis
on different loans can better meet the “low risk” + “high profit” objectives in
P2P lending.

5 Conclusion and Future Work


With the goal of identifying the loans that are both “low risk” and “high profit”
for P2P lenders, we propose a profit-sensitive learning approach by integrat-
ing the profit information into the credit scoring approach for loan evaluations.
Profit-Sensitive Credit Scoring in P2P Lending 717

We first formulate a multi-level classification task and then define a novel loss
function for multinomial logistic regression to solve the pre-defined multi-level
classification problem. The proposed loss function aims to put different weights
on loans according to their varying profits as well as their occurrence frequen-
cies. As a result, the loans with higher profits (regardless of whether they are
the usual cases or the rare cases in the real-world practice) are given higher
weights during the model training process and they have a higher chance to be
recommended to the investors.
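The general shape of such a profit-weighted loss can be sketched as follows. The exact weighting scheme used in the paper is not reproduced here, so the per-loan weights (derived from each loan's profit and its occurrence frequency) are simply taken as an input:

```python
import math

def softmax(z):
    # Numerically stable softmax over one logit vector.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def profit_sensitive_nll(logits, labels, weights):
    """Weighted negative log-likelihood for multinomial logistic regression.
    weights[i] is a per-loan weight reflecting its profit and the frequency
    of its profit level (assumed given; the paper's exact scheme may differ)."""
    total = 0.0
    for z, y, w in zip(logits, labels, weights):
        p = softmax(z)
        total -= w * math.log(p[y])
    return total / sum(weights)
```

Loans with larger weights contribute more to the gradient during training, which is how higher-profit loans obtain a higher chance of being recommended.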
The effectiveness of the proposed methodology is validated using the real-
world Lending Club data. Results show that the proposed profit-sensitive learn-
ing approach can not only identify the “higher profit” loans but also maintain the
risk control. To the best of our knowledge, our study is the first that integrates
the profit information into the traditional credit scoring approach by formulat-
ing a multi-level classification problem along with a profit-sensitive loss function.
This approach can also be applied to model any scenario that has two outcomes – one nominal and one numerical – where there exists some trade-off between the two outcomes.
Our work also has some limitations, one of which is that the effectiveness of
the proposed methodology is validated only on the offline data. In our future
work, we plan to implement the proposed method on real-time data to provide
instant loan evaluations. We also want to compare the performance of the proposed multinomial logistic regression on financial data from before, during, and after the COVID-19 pandemic, to gain insights into the impacts of such global or national crises. In addition, we plan to extend the logic of profit-sensitive learning to other machine learning algorithms for binary and multinomial classification problems, including but not limited to neural networks and random forests.

References
1. Bachmann, A., et al.: Online peer-to-peer lending-a literature review. J. Internet
Bank. Commer. 16(2), 1–18 (2011)
2. Bastani, K., Asgari, E., Namavari, H.: Wide and deep learning for peer-to-peer
lending. Exp. Syst. Appl. 134, 209–224 (2019)
3. Böhning, D.: Multinomial logistic regression algorithm. Ann. Inst. Stat. Math.
44(1), 197–200 (1992)
4. Bottou, L.: Stochastic gradient descent tricks. In: Montavon, G., Orr, G.B., Müller,
K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 421–436.
Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35289-8 25
5. Boyd, S., Lieven, V.: Convex Optimization. Cambridge University Press, Cam-
bridge (2004)
6. Kim, J.Y., Cho, S.B.: Predicting repayment of borrows in peer-to-peer social lend-
ing with deep dense convolutional network. Exp. Syst. 36(4), e12403 (2019)
7. Ling, C.X., Sheng, V.S.: Cost-sensitive learning. In: Encyclopedia of Machine Learning, pp. 231–235 (2010)

8. Ma, X., Sha, J., Wang, D., Yu, Y., Yang, Q., Niu, X.: Study on a prediction of P2P network loan default based on the machine learning LightGBM and XGBoost algorithms according to different high dimensional data cleaning. Electron. Commer. Res. Appl. 31, 24–39 (2018)
9. Malekipirbazari, M., Aksakalli, V.: Risk assessment in social lending via random
forests. Exp. Syst. Appl. 42(10), 4621–4631 (2015)
10. Minka, T.P.: Algorithms for maximum-likelihood logistic regression (2012)
11. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
12. Roberts, A.W.: Convex Functions. In: Handbook of Convex Geometry, pp. 1081–
1104. Elsevier (1993)
13. Serrano-Cinca, C., Gutiérrez-Nieto, B.: The use of profit scoring as an alternative
to credit scoring systems in peer-to-peer (p2p) lending. Decis. Support Syst. 89,
113–122 (2016)
14. Serrano-Cinca, C., Gutiérrez-Nieto, B., López-Palacios, L.: Determinants of default in P2P lending. PLoS ONE 10(10), e0139427 (2015)
15. Shapiro, A., Wardi, Y.: Convergence analysis of gradient descent stochastic algo-
rithms. J. Optim. Theor. Appl. 91(2), 439–454 (1996)
16. Wang, Y., Ni, X.S.: A XGBoost risk model via feature selection and Bayesian hyper-parameter optimization. arXiv preprint arXiv:1901.08433 (2019)
17. Wang, Y., Ni, X.S.: Improving investment suggestions for peer-to-peer lending via integrating credit scoring into profit scoring. In: Proceedings of the 2020 ACM Southeast Conference, pp. 141–148 (2020)
18. Xia, Y., Liu, C., Liu, N.: Cost-sensitive boosted tree for loan evaluation in peer-
to-peer lending. Electron. Commer. Res. Appl. 24, 30–49 (2017)
Distending Function-based Data-Driven
Type2 Fuzzy Inference System

József Dombi and Abrar Hussain(B)

Institute of Informatics, University of Szeged, 6720 Szeged, Hungary


[email protected]

Abstract. Some challenges arise when applying the existing fuzzy type2 modeling techniques. A large number of rules are required to completely cover the whole input space. A large number of parameters associated with type2 membership functions have to be determined. The identified fuzzy model is usually difficult to interpret due to the large number of rules. Designing a fuzzy type2 controller using these models is a computationally expensive task. To overcome these limitations, a procedure is proposed here
to identify the fuzzy type2 model directly from the data. This model is
called the Distending Function-based Fuzzy Inference System (DFIS).
The proposed procedure is used to model the altitude controller of a
quadcopter. The DFIS model performance is compared with various
fuzzy models. The performance of this controller is compared with type1
and type2 fuzzy controllers.

Keywords: Fuzzy Type2 modeling · Parrot mini-drone Mambo · Type2 distending function

1 Introduction

Fuzzy theory has found numerous practical applications in the fields of engi-
neering, operational research and statistics [16,23,24]. In most cases, expert
knowledge is not available or it is poorly described. So the exact description
of fuzzy rules is not an easy task. However, if the working data of the process
is available then a data-driven based design is an attractive option [1,14,21].
Fuzzy modeling involves the identification of fuzzy rules and parameter values
from the data. The data-based identification of a fuzzy model can be divided
into two parts, namely qualitative and quantitative identification. Qualitative
identification focuses on the number and description of fuzzy rules, while quan-
titative identification is concerned with the identification of parameter values.
These parameters belong to membership functions and fuzzy operators. In one of the latest papers, Duţu et al. [7] investigated qualitative identification in detail for Mamdani-like fuzzy systems. A parametrized rule-learning technique called Selection-Reduction was introduced. The number of rules was optimized by dropping some rules based on a rule redundancy index. This technique is

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 719–730, 2023.
https://doi.org/10.1007/978-3-031-18461-1_47

called the Precise and Fast Fuzzy Modeling (PFFM) approach. The unique fea-
tures of this approach include the high accuracy of the trained model, minimum
time for rules generation and better interpretability provided by the compact
rule set.
Because of uncertainties, the membership functions are no longer certain, i.e., the grade of a membership function cannot be a crisp value. To overcome this problem, type-2 membership functions (T2MF) were introduced. T2MF
contains the footprint of uncertainty (FOU) between the upper membership
function (UMF) and lower membership function (LMF). Interval T2FSs have been developed to reduce the computational complexity [15]. T2FSs have superior properties such as: 1) Better handling of uncertainties [17]; 2) Smooth controller response [22]; 3) Adaptivity [22]; 4) Reduction in the number of fuzzy rules [9]. T2FSs have been successfully used in control system design [3], data mining [19] and time series prediction [8]. The design of an interval T2FS involves a type-reduction step, in which type2 fuzzy sets are converted to type1 fuzzy sets. The type-reduction step is performed using the so-called Karnik-Mendel (KM) iterative algorithm [18].
This approach has some drawbacks, such as: 1) The choice of T2MFs; 2) The
computational complexity of the type reduction step; 3) Difficulties in the opti-
mization process; 4) Controller design complexity. Quite recently, several techniques have been proposed to tackle these problems [2,11,26]. Tien-Loc Le presented a self-evolving functional-link type2 fuzzy neural network (SEFIT2FNN)
[12]. It uses the particle swarm optimization method to adjust the learning rate
of the adaptive law. The adaptive law tunes the parameters of the type2 fuzzy
neural network. SEFIT2FNN has been shown to successfully control the antilock
braking system under various road conditions.
However, these existing approaches have some limitations. Qualitative identification suffers from the so-called flat-structure (curse of dimensionality) problem of the rule base [19], i.e., if the number of input variables increases, then an exponentially large number of rules is required to accurately model the system. The
computational complexity of the quantitative part of the identified fuzzy model
also increases with the number of rules. As the number of rules increases, the
number of parameters of the T2MF and operators also grows exponentially. The
choice of T2MF and its systematic connection with the type of uncertainty is not clear. Different type-1 membership functions can be combined to generate T2MFs. In most cases, the interpretability of the identified fuzzy rule base is
not clear. If the number of rules grows exponentially, then for a given set of
input values, it is not possible to predict the response of the model and analyze
its performance. Although type-2 fuzzy logic systems require fewer rules com-
pared to type-1 fuzzy systems, the number of parameters is comparatively large.
So optimizing a large number of parameter values is not an easy task. Most of
the fuzzy type2 control design techniques use the type reduction step [20]. The
type reduction step is based on the KM algorithm, which is computationally
expensive.

Here, we propose solutions to remove some of these limitations. We present a novel technique for fuzzy type2 modeling and control. The proposed fuzzy model
(DFIS) consists of rules and type2 membership functions. The rules are based
on the Dombi conjunctive operator. A procedure for designing a fuzzy type2
controller using the rules is also presented. The controller can handle various
types of uncertainties (e.g., sensor noise). The rest of the paper is organized as
follows. In Sect. 2, we briefly introduce the interval T2DF and its properties.
In Sect. 3, we explain the proposed type2 fuzzy modeling approach and the rule reduction algorithm. In Sect. 4, we describe the benchmark system and the simulation results, and discuss our findings. In Sect. 5, we give a brief conclusion.

2 Interval Type-2 Distending Function


Zadeh proposed various membership functions [25], and based on one of these, we defined a more general parametric function called the Distending Function (DF). The DF has four parameters, namely ν, ε, λ and c. It has two forms: 1) Symmetric; 2) Asymmetric. The Symmetric DF (shown in Fig. 1) is symmetric around c and it is defined as [6]:
$$\delta^{(\lambda)}_{\varepsilon,\nu}(x-c) = \frac{1}{1 + \frac{1-\nu}{\nu}\left|\frac{x-c}{\varepsilon}\right|^{\lambda}}, \qquad (1)$$

Fig. 1. Various shapes of symmetric distending functions (here c = 0)

where $\nu \in (0,1)$, $\varepsilon > 0$, $\lambda \in (1, +\infty)$ and $c \in \mathbb{R}$. $\delta^{(\lambda)}_{\varepsilon,\nu}(x-c)$ is denoted by $\delta_s(x)$.
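To make the definition concrete, Eq. (1) can be implemented in a few lines. This is only a sketch; the parameter defaults are illustrative, not values used in the paper:

```python
def distending(x, c=0.0, nu=0.5, eps=1.0, lam=2.0):
    """Symmetric Distending Function of Eq. (1):
    1 / (1 + ((1 - nu)/nu) * |(x - c)/eps|**lam)."""
    return 1.0 / (1.0 + ((1.0 - nu) / nu) * abs((x - c) / eps) ** lam)
```

Two properties follow directly from the formula: the peak value at x = c is 1, and the grade at distance ε from the peak equals ν.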
The values of the DF parameters (ν, ε, λ, c) may be uncertain. As a result,
these parameters can take various values around their nominal values, within
the uncertainty bound (Δ). By varying the parameter values within Δ, various
722 J. Dombi and A. Hussain

Fig. 2. Uncertain peak values T2DF with the footprint of uncertainty (FOU)

DFs are obtained. The DF with the highest grade values is called the upper
membership function (UMF) and that with the lowest values is called the lower
membership function (LMF). The UMF, LMF and various DFs in between can
be combined to form an interval T2DF [5]. If the peak value of the DF becomes
uncertain, then it can be represented using the interval T2DF with an uncertain
’c’ value, as shown in Fig. 2.
Various T2DFs belonging to the same fuzzy variable can be combined to
form a single T2DF. The support of the resultant T2DF will be approximately
the same as the combined support of the individual T2DFs. The UMF of the
T2DF consists of the LHS and RHS (the same is true for the LMF). The LHS
and RHS are given by [5]:
$$\bar{\delta}^{2}_{L}(x-c) = \frac{1}{1 + \frac{1-\nu}{\nu}\left|\frac{x-c}{\varepsilon}\right|^{\lambda}\cdot\frac{1}{1+e^{\lambda^{*}(x-c)}}}, \qquad (2)$$

$$\bar{\delta}^{2}_{R}(x-c) = \frac{1}{1 + \frac{1-\nu}{\nu}\left|\frac{x-c}{\varepsilon}\right|^{\lambda}\cdot\frac{1}{1+e^{-\lambda^{*}(x-c)}}}. \qquad (3)$$

The LHS and RHS of the UMF and LMF can be combined using the Dombi conjunctive operator to get a single T2DF. Consider two T2DFs δ₁² and δ₂². The LHS of δ₁² and the RHS of δ₂² can be combined using the Dombi conjunctive operator [4]. This produces a resultant T2DF δ_result², as shown in Fig. 3. Combining various T2DFs helps to reduce the number of fuzzy rules. This leads to a decrease in the computational complexity of the identified fuzzy model.
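The Dombi conjunctive operator referenced here can be sketched as follows, in the standard form of the Dombi t-norm with operator parameter α; applying it grade-by-grade to combine functions is an assumption of this illustration:

```python
def dombi_conjunction(grades, alpha=1.0):
    """Dombi conjunctive operator over membership grades in (0, 1):
    1 / (1 + (sum(((1 - g)/g)**alpha))**(1/alpha))."""
    s = sum(((1.0 - g) / g) ** alpha for g in grades)
    return 1.0 / (1.0 + s ** (1.0 / alpha))
```

For a single argument the operator is the identity, and the result never exceeds the smallest input grade, as expected of a conjunction.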
The proposed design approach is explained in the next section.

Fig. 3. Combining two T2DFs (δ₁² and δ₂²) to get a single T2DF (δ_result²)

3 Data-Driven Type2 Fuzzy System


Let us now assume that the input and output databases have the following form:
$$U = \begin{bmatrix} a^{1}_{1} & a^{1}_{2} & \cdots & a^{1}_{n} \\ a^{2}_{1} & a^{2}_{2} & \cdots & a^{2}_{n} \\ \vdots & \vdots & \ddots & \vdots \\ a^{l}_{1} & a^{l}_{2} & \cdots & a^{l}_{n} \end{bmatrix}, \qquad V = \begin{bmatrix} b^{1} \\ b^{2} \\ \vdots \\ b^{l} \end{bmatrix}, \qquad (4)$$

where U and V contain the l data points of each input and output variable. Here, a1, a2, . . . , an are the data points belonging to the input fuzzy subsets U1, U2, . . . , Un, respectively, and b1 is included in the output fuzzy subset V. Each column of the U matrix corresponds to a unique feature (input variable) of the process. Therefore, the U matrix forms an n-dimensional input feature
space. Each column of the training matrix U is normalized by transforming it
to the [0, 1] interval.
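The column-wise normalization to the [0, 1] interval can be sketched as below. The paper does not name the exact scaling method, so min-max scaling is assumed here:

```python
def normalize_columns(U):
    """Min-max scale each column of the training matrix U to [0, 1]."""
    cols = list(zip(*U))
    scaled = []
    for col in cols:
        lo, hi = min(col), max(col)
        span = (hi - lo) if hi > lo else 1.0  # guard against constant columns
        scaled.append([(v - lo) / span for v in col])
    return [list(row) for row in zip(*scaled)]
```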
The fuzzy rule consists of an antecedent and a consequent part. Here, the
antecedent part contains a row of U and the consequent part is an element of
V . Therefore, a few rows from the data base matrix U are selected. These rows
and the corresponding elements in the V matrix are used to construct the rule
base. It is called the boundary-value rule base (Rb ) because it mostly contains
those values of the inputs that lie on the boundary of the input space.
In our procedure, two different surfaces are constructed. These are called the
estimated and the fuzzy surfaces. The estimated surface is constructed directly
from the database (Eq. (4)). Each selected row from the database matrix U
corresponds to a single rule. It is a row vector and it consists of unique values of
all the input variables (features). We will construct T2DFs for all input variables.
The input variables are usually measured using the feedback sensors. The Δ value
of each sensor depends on the tolerance intervals of the corresponding sensor. All
the Δ values are transformed into the [0, 1] interval to make these compatible

with the values of the input variables. T2DFs have a long tail; consequently, each T2DF influences the other existing T2DFs. The ν value of each T2DF will be calculated based on the principle of minimum influence on all the other T2DFs. This influence can never be zero, but it can be decreased by a factor k. For less influence, a large value of k should be chosen. However, from a practical point of view, a value of 10 is sufficient. The required value of ν can be calculated using
$$\nu = \frac{1}{1 + \frac{k-1}{d}}, \qquad (5)$$

where $d = \left|\frac{x_{i1} - x_{j1}}{\varepsilon}\right|^{\lambda} + \cdots + \left|\frac{x_{in} - x_{jn}}{\varepsilon}\right|^{\lambda}$. Each rule is evaluated using the
Dombi operator. By applying the Dombi conjunctive/disjunctive operator over
the n input T2DFs, we get a single T2DF. This is called the output T2DF. All
these output T2DFs are superimposed in the input space to generate a fuzzy
surface G∗ .
An error surface E is defined as the difference between the estimated surface
G and the fuzzy surface G∗ . That is,
E(x1 , . . . , xn ) = G(x1 , . . . , xn ) − G∗ (x1 , . . . , xn ). (6)
We shall decrease the magnitude of E below a chosen threshold τE ( |E| <
τE ). This is achieved by an iterative procedure of adding new rules to Rb . To
add a new fuzzy rule, the coordinates of the maximum value on E are located.
The corresponding row in the database containing these coordinates is selected.
This row is then added to Rb as a new rule. This rule is evaluated to generate
an output T2DF. The ν value of this output T2DF is then calculated using Eq.
(5). This T2DF is superimposed in G∗ . It should be added that extracting the
type2 fuzzy model from the data is based on the DF. Therefore, we call this type2
model the DF-based fuzzy inference system (DFIS). Here, we describe a heuristic approach used to decrease the number of rules in Rb. Rule reduction leads to a lower computational cost and better interpretability. Various output T2DFs
which are close to each other in the input space can be combined to get a single
T2DF (as shown in Fig. 3). The output T2DFs are segregated into different
groups. If the Euclidean distance between the peak value coordinates of various
output T2DFs is less than a predefined distance D, then these T2DFs are placed
in the same group:
$$D = \frac{\text{Sum of Euclidean distances between peak-value T2DFs}}{\text{Total number of T2DFs in the same half}}. \qquad (7)$$
Each output T2DF is obtained by applying a unique rule in Rb . The output
T2DFs in the same group are combined together to produce a single T2DF.
Consequently the rules associated with all these output T2DFs are eliminated
and replaced by a single new rule. Therefore the number of rules in Rb decreases.
Now it is called a reduced rule base Rr . Using Rr , a new fuzzy surface is con-
structed and it is denoted by G∗r . Then a reduced error surface (Er ) is obtained
using
Er (x1 , . . . , xn ) = G(x1 , . . . , xn ) − G∗r (x1 , . . . , xn ). (8)

This procedure is performed in an iterative way as long as Er(x1, . . . , xn) remains within a chosen threshold τR. Finally, the extracted DFIS model (Rr plus T2DFs) can be used to design an arithmetic-based interval type-2 fuzzy controller [6].
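The iterative rule-adding loop of this section can be illustrated with a deliberately simplified one-dimensional sketch. Here a type-1 symmetric DF stands in for the output T2DFs, the fuzzy surface G* is taken as the pointwise maximum of the scaled DFs, and rules are added at the point of maximum error until |E| < τE; these simplifications and all constants are ours, not the paper's:

```python
def df(x, c, nu=0.5, eps=0.3, lam=2.0):
    # Type-1 symmetric DF of Eq. (1), used as a stand-in for output T2DFs.
    return 1.0 / (1.0 + ((1.0 - nu) / nu) * abs((x - c) / eps) ** lam)

def fit_rules(xs, g, tau_e=0.15, max_rules=50):
    rules = []  # each rule: (peak position, target value)

    def g_star(x):  # simplified fuzzy surface: max over scaled DFs
        return max((v * df(x, c) for c, v in rules), default=0.0)

    while len(rules) < max_rules:
        errors = [abs(gv - g_star(x)) for x, gv in zip(xs, g)]
        worst = max(range(len(xs)), key=errors.__getitem__)
        if errors[worst] < tau_e:
            break  # |E| < tau_e everywhere on the grid
        rules.append((xs[worst], g[worst]))
    return rules, g_star

xs = [i / 20 for i in range(21)]
g = [x * x for x in xs]          # toy "estimated surface"
rules, g_star = fit_rules(xs, g)
```

After the loop terminates, the residual error on the grid is below the threshold by construction.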

4 Benchmark System, Simulations and Results


A Parrot mini-drone Mambo was used in this study. The Matlab Simulink
Aerospace block set provides the simulation model of this quadcopter [10]. The
simulation consists of the air-frame model, sensors model, environment model
and flight controller. It consists of axis parameters (rotational (φ, θ, ψ) and trans-
lational (x, y, z)), mass, torques, and rotors. The environment model describes
the effects of external factors on the quadcopter. It consists of atmosphere and
gravity models. The sensor model includes three sensors, namely 1) Sonar for
altitude measurement; 2) A camera for optical flow estimation; 3) IMUs to mea-
sure the linear and rotational motions. The flight control system contains the
roll (φ), pitch (θ), yaw (ψ) and altitude (z) controllers. The mathematical model of
the system is given by

Ẋ = F(X, u) + N. (9)

Here, X is the state vector consisting of translational and rotational components, N contains the external disturbances affecting the system states and u
represents the model inputs. Let Ω1 , Ω2 , Ω3 , Ω4 be the angular speeds of the
four rotors of the quadcopter. Then
$$\begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \end{bmatrix} = \begin{bmatrix} b\,(\Omega_1^2 + \Omega_2^2 + \Omega_3^2 + \Omega_4^2) \\ b\,(-\Omega_2^2 + \Omega_4^2) \\ b\,(\Omega_1^2 - \Omega_3^2) \\ d\,(-\Omega_1^2 + \Omega_2^2 - \Omega_3^2 + \Omega_4^2) \end{bmatrix}, \qquad \Omega_r = -\Omega_1 + \Omega_2 - \Omega_3 + \Omega_4.$$

Here, u2 , u3 , u4 control the roll, pitch and yaw angles. u1 is the total thrust
input and it controls the altitude z of the quadcopter. b is the thrust coefficient,
d is the drag coefficient and Ωr is the residual angular speed.
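This standard rotor-speed-to-control-input mapping can be written out directly; the coefficient values b and d used below are placeholders, not the Mambo's identified parameters:

```python
def control_inputs(omega, b=1.0, d=1.0):
    """Map the four rotor speeds to the control inputs u1..u4 and the
    residual angular speed. b (thrust) and d (drag) are placeholders."""
    o1, o2, o3, o4 = omega
    u1 = b * (o1**2 + o2**2 + o3**2 + o4**2)   # total thrust -> altitude z
    u2 = b * (-o2**2 + o4**2)                  # roll
    u3 = b * (o1**2 - o3**2)                   # pitch
    u4 = d * (-o1**2 + o2**2 - o3**2 + o4**2)  # yaw
    omega_r = -o1 + o2 - o3 + o4               # residual angular speed
    return u1, u2, u3, u4, omega_r
```

With all four rotors at equal speed, the roll, pitch and yaw inputs vanish and only the total thrust remains.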
It should be noted that this quadcopter model is used only to generate the
data.
Here in these simulations, we seek to model the altitude controller of the
quadcopter. A training dataset which contains the samples of input and output
of the altitude controller is a requirement for applying the proposed procedure.
The requirement was satisfied by controlling the altitude of the quadcopter using a PD controller in Matlab. A dataset containing the inputs and output of the PD controller was created. Later, the proposed procedure was used to generate the DFIS model of the altitude controller using this input and output dataset. The threshold τE was set to 0.15. The proposed procedure extracted 26 rules from the dataset, and these formed the rule base Rb.

The number of rules in Rb was reduced to 17 by merging a few rules. This is called the reduced rule base Rr. A few T2DFs (3 out of 17) of each input are shown in Fig. 5. Rr and all the T2DFs in it collectively form a reduced (simplified) DFIS model. The surface of the DFIS model is shown in Fig. 4.
Table 1 summarizes the key differences between the DFIS model and various other fuzzy models.

Fig. 4. Surface plot of the DFIS model. The upper surface is in blue and the lower
surface is in red. (Color figure online)

Table 1. Performance comparison of the proposed DFIS model with previously pro-
posed models

S. No  Model                Number of rules  Number of tunable parameters  Membership functions used
1      IT2FNN [13]          72               336                           T2GMF
2      MIT2FC [13]          36               168                           T2GMF
3      Proposed DFIS model  17               34                            T2DF

The objective is to control the altitude z by generating an appropriate total thrust u1. The thrust u1 depends on the height (sonar) measurements and rate
of change of the height of the quadcopter. An arithmetic-based controller was
designed using Rr. Figure 6 shows the surface plot of the arithmetic-based controller. The three controllers (arithmetic-based, ANFIS type1, ANFIS type2)
were used to regulate the altitude of the quadcopter in MATLAB Simulink.
White noise was added to the altitude measurements by the sonar sensor. The
quadcopter was programmed to takeoff and reach an altitude of 0.7 m, then rise
to an altitude of 1 m and finally descend to 0.7 m. Figure 7 shows the altitude
response of the controlled quadcopter during this simulation study.

Fig. 5. Three T2DFs of each normalized input (DFIS model)

Fig. 6. Control surface of the arithmetic-based controller



Fig. 7. Simulated altitude response of the quadcopter with various controllers (in Matlab Simulink). The measurements obtained from the altitude sensor (sonar) were corrupted with white noise.

5 Conclusion
In this study, we presented solutions to some of the limitations associated
with the existing fuzzy type2 modeling and control techniques. A procedure was
proposed to identify the type2 model directly from the data, which we called
the DFIS model. This model consists of rules and Type2 Distending Functions (T2DFs). The whole input space is covered using a few rules. T2DFs can model various types of uncertainties through their parameters. A rule reduction procedure is also proposed; it combines T2DFs in close vicinity and significantly reduces the number of rules. Because of its low computational complexity and design simplicity, the controller is suitable for real-time control applications. Future work includes refinement of the procedure, more comparisons and a real-time implementation.

Acknowledgment. The research was supported by the Ministry of Innovation and Technology NRDI Office (Project no. TKP2021-NVA-09) within the framework of the Artificial Intelligence National Laboratory Program (RRF-2.3.1-21-2022-00004).

References
1. Angelov, P.P., Filev, D.P.: An approach to online identification of Takagi-Sugeno fuzzy models. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 34(1), 484–498 (2004)
2. Bhattacharyya, S., Basu, D., Konar, A., Tibarewala, D.N.: Interval type-2 fuzzy
logic based multiclass ANFIS algorithm for real-time EEG based movement control
of a robot arm. Robot. Autonom. Syst. 68, 104–115 (2015)

3. Castillo, O., Melin, P.: A review on interval type-2 fuzzy logic applications in
intelligent control. Inf. Sci. 279, 615–631 (2014)
4. Dombi, J.: A general class of fuzzy operators, the DeMorgan class of fuzzy operators and fuzziness measures induced by fuzzy operators. Fuzzy Sets Syst. 8(2), 149–163 (1982)
5. Dombi, J., Hussain, A.: Interval type-2 fuzzy control using distending function. In: Fuzzy Systems and Data Mining V: Proceedings of FSDM 2019, pp. 705–714. IOS Press (2019)
6. Dombi, J., Hussain, A.: A new approach to fuzzy control using the distending
function. J. Process Control 86, 16–29 (2020)
7. Duţu, L.-C., Mauris, G., Bolon, P.: A fast and accurate rule-base generation method for Mamdani fuzzy systems. IEEE Trans. Fuzzy Syst. 26(2), 715–733 (2017)
8. Gaxiola, F., Melin, P., Valdez, F., Castillo, O.: Interval type-2 fuzzy weight adjust-
ment for backpropagation neural networks with application in time series predic-
tion. Inf. Sci. 260, 1–14 (2014)
9. Hagras, H.: Type-2 FLCs: a new generation of fuzzy controllers. IEEE Comput. Intell. Mag. 2(1), 30–43 (2007)
10. Mathworks Matlab hardware team. Parrot Drone Support from MATLAB. https://
www.mathworks.com/hardware-support/parrot-drone-matlab.html. Accessed 11
Mar 2020
11. Hassani, H., Zarei, J.: Interval type-2 fuzzy logic controller design for the speed control of DC motors. Syst. Sci. Control Eng. 3(1), 266–273 (2015)
12. Le, T.-L.: Intelligent fuzzy controller design for antilock braking systems. J. Intell.
Fuzzy Syst. 36(4), 3303–3315 (2019)
13. Le, T.L., Quynh, N.V., Long, N.K., Hong, S.K.: Multilayer interval type-2 fuzzy controller design for quadcopter unmanned aerial vehicles using Jaya algorithm. IEEE Access 8, 181246–181257 (2020)
14. Li, C., Zhou, J., Chang, L., Huang, Z., Zhang, Y.: T-S fuzzy model identification based on a novel hyperplane-shaped membership function. IEEE Trans. Fuzzy Syst. 25(5), 1364–1370 (2017)
15. Liang, Q., Mendel, J.M.: Interval type-2 fuzzy logic systems: theory and design.
IEEE Trans. Fuzzy Syst. 8(5), 535–550 (2000)
16. Mahfouf, M., Abbod, M.F., Linkens, D.A.: A survey of fuzzy logic monitoring and
control utilisation in medicine. Artif. Intell. Med. 21(1–3), 27–42 (2001)
17. Mendel, J.M.: Computing with words: Zadeh, Turing, Popper and Occam. IEEE Comput. Intell. Mag. 2(4), 10–17 (2007)
18. Mendel, J.M.: Uncertain Rule-Based Fuzzy Logic Systems: Introduction and New Directions. Prentice Hall, USA, pp. 25–200 (2000)
19. Niewiadomski, A.: A type-2 fuzzy approach to linguistic summarization of data.
IEEE Trans. Fuzzy Syst. 16(1), 198–212 (2008)
20. Tai, K., El-Sayed, A.R., Biglarbegian, M., Gonzalez, C.I., Castillo, O., Mahmud, S.:
Review of recent type-2 fuzzy controller applications. Algorithms 9(2), 39 (2016)
21. Tsai, S.-H., Chen, Y.-W.: A novel identification method for Takagi-Sugeno fuzzy model. Fuzzy Sets Syst. 338, 117–135 (2018)
22. Wu, D., Tan, W.W.: Genetic learning and performance evaluation of interval type-2
fuzzy logic controllers. Eng. Appl. Artif. Intell. 19(8), 829–841 (2006)
23. Yager, R.R., Zadeh, L.A.: An Introduction to Fuzzy Logic Applications in Intelli-
gent Systems, vol. 165. Springer, New York (2012)
24. Yu, L., Zhang, Y.-Q.: Evolutionary fuzzy neural networks for hybrid financial prediction. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 35(2), 244–249 (2005)

25. Zadeh, L.A.: Outline of a new approach to the analysis of complex systems and
decision processes. IEEE Trans. Syst. Man Cybern. SMC-3(1), 28–44 (1973)
26. Zheng, J., Wenli, D., Nascu, I., Zhu, Y., Zhong, W.: An interval type-2 fuzzy
controller based on data-driven parameters extraction for cement calciner process.
IEEE Access 8, 61775–61789 (2020)
Vsimgen: A Proposal for an Interactive
Visualization Tool for Simulation
of Production Planning and Control
Strategies

Shailesh Tripathi1(B), Andreas Riegler2, Christoph Anthes2, and Herbert Jodlbauer1
1 Production and Operations Management, University of Applied Sciences Upper Austria, Steyr, Austria
[email protected]
2 University of Applied Sciences Upper Austria, Hagenberg, Austria

Abstract. We propose the development of an interactive visualization and analysis tool, Vsimgen, for production planning and control (PPC)
strategies to be analyzed with simulation generator software (simgen).
This generic and scalable discrete simulation model is commonly used to
deal with optimization problems in PPC, such as MRP II (manufactur-
ing resource planning). The concept is to provide an easy to use visual
interface that hides complex details and can execute multiple steps of
discrete simulations for PPC using various user interactive and visualiza-
tion options for data selection and preprocessing, parameterization, and
experimental design. We also emphasize collaboration by users from var-
ious domains of industrial production. With collaboration, effective PPC
strategies can be executed that consider various production details pro-
vided by domain experts, managing different production-related tasks,
and yielding better insight into the various production-related problems.

Keywords: Production planning and control · Discrete simulation · Network representation and visualization · Network analysis · Interactive visualization · Immersive visualization

1 Introduction
Market scenarios are changeable due to product complexity, demand variation,
and competitiveness, and manufacturing companies require improved logistic
performance that optimizes the balance between cost and customer service.
High-quality decisions relating to production planning and control (PPC) strate-
gies, parameterization of selected PPC attributes, and capacity investment are
essential to improve the company’s logistic performance. PPC’s key function-
alities include planning material requirements, demand management, capacity
planning, and sequencing and scheduling jobs to meet manufacturing compa-
nies’ production-related challenges. Appropriate PPC strategies are dependent
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 731–752, 2023.
https://doi.org/10.1007/978-3-031-18461-1_48
732 S. Tripathi et al.

on the company’s demand, products, production environment, and manufac-
turing characteristics [36]. In a competitive and unpredictable market scenario,
demand levels can be uncertain and unsteady; production planning models that
ignore uncertainty can lead to poor planning decisions [30]. Similarly, product
complexity and product variety affect manufacturing companies’ logistic perfor-
mance and are associated with poor outcomes for operational measures such as
cost, time, quality, and delivery performance [9,42].
To improve logistic performance, we must address PPC challenges effectively.
The most significant PPC challenges in customized production are to reduce
work in progress (WIP), minimize shop floor throughput times (SFTT) and lead
times, lower stockholding costs, improve responsiveness to changes in demand, and
improve delivery date (DD) adherence [36,45]. Various analytical and simulation
models are utilized to optimize PPC outcomes. [20] proposes an analytical app-
roach to find the relationship between capacity investment and inventory require-
ments in order to minimize the costs of capacity and inventory. The author in [22]
evaluates drum-buffer-rope (DBR) and CONWIP (Constant Work In Process)
systems for production planning using a continuous-time Markov process model.
However, such analytical models are not implementable for practical use. Other
traditional discrete event simulation models are used for planning and design in
the study of system performance [40], for example, studies of flexibility of design
for routing policies and equipment [15,41]. Such models ignore the implementa-
tion and operational phases and are called “throwaway” models because after the
design phase is completed these models are not further used.
Discrete event simulation models such as simgen, a generic and scalable sim-
ulation model, are commonly used to deal with optimization problems in PPC
[1,17,26]. Simgen can be applied for any production structure and uses a hier-
archical production planning concept divided into the following three levels:

– Long-term. Capacity investment decisions, resource planning, and aggregated
production planning.
– Medium-term. Includes shift model, overtime, PPC methods such as material
requirement planning (MRP) or CONWIP, production system structure such
as flow shop, and job shop.
– Short-term. Includes various day-to-day operational planning details such as
dispatching rules.

The advantage of the simgen model is the practical applicability of the mod-
els due to the required input parameters, which are selected from the Enterprise
Resource Planning (ERP) system’s data and are processed and stored in its
database. These sets of parameters are known as master data parameters. The
master data define the simulation model by creating a production system struc-
ture.
The input parameters defined from master data for discrete event simula-
tions are the bill of materials (BOM), routing sequence of materials, qualifica-
tion matrix, production planning parameters for each item, shift calendars, skill
groups, total available employees, production program, the expected forecasts
of the final items and the customer’s demand in terms of order size, and the
customer’s expected lead time. The combined BOM and material routing table,
named WS Master, contains three attributes:

– Parent. Can have one or more child items. End products are always parents,
and product sub-assemblies can be a parent or child.
– Child. Material variants and product sub-assemblies that are required to build
a parent item.
– Machine/workstation group. The machines (assigned machine IDs) or group
of employees working at the workstations (workstation IDs) assembling parent
items or producing the items.

The transaction data define the second set of parameters. They are used
to characterize probability distributions and the respective parameter estima-
tions. The estimated distributions’ parameters are used to randomly initialize
processing, setup, sales data variables, repair time, delivery time, and produc-
tion planning variables. The parameter selections and experimental design are
then applied to discrete simulations for various PPC scenarios. The discrete
simulation results are validated and compared with previous years’ real-world
business outcome data, for example, previous years’ real-world inventory, work
in progress, and service level data for a manufacturing company. Further, the
results are analyzed by business experts for managerial insights related to various
production scenarios.
The PPC simulation parameterization follows three steps:

– Creation of production structure from master data.
– Generation of various random variables for processing time, set-up-time,
meantime to repair, mean time between failures, customer demand, sales
details, and lead time (customer required, replenishment, delivery), performed
by characterizing probability distributions and their parameters based on
transaction data.
– Experimental design by varying different combinations of parameters to
match the optimization problem’s objectives.
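As an illustration of the second step, the sketch below fits a lognormal distribution to observed processing times and then draws random processing-time variables from it. The lognormal choice and the sample values are illustrative assumptions, not the distributions simgen prescribes.

```python
import math
import random
import statistics

def fit_lognormal(samples):
    """Estimate (mu, sigma) of a lognormal distribution from observed
    transaction-data times via moments on the log scale."""
    logs = [math.log(s) for s in samples]
    return statistics.mean(logs), statistics.stdev(logs)

def draw_processing_time(mu, sigma, rng=random):
    """Randomly initialize one processing-time variable for the simulation."""
    return rng.lognormvariate(mu, sigma)

# Illustrative transaction data: observed processing times in hours.
observed = [0.20, 0.22, 0.25, 0.21, 0.24, 0.23, 0.26, 0.19]
mu, sigma = fit_lognormal(observed)
sample = draw_processing_time(mu, sigma)
```

The same pattern applies to setup time, repair time, and demand variables; only the fitted distribution family changes.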

One of the bottlenecks in discrete simulation using simgen is the computa-
tional time complexity, which can be high due to the large number of variants
modeled in combination with the heuristic optimization methods used.
In realistic production scenarios where variant diversity is high, a simulation
model takes a long time to optimize product variants, and such variants may
add inappropriate model detail. Therefore, we must reduce the numbers of
materials and resources to a reasonable number of groups (representative
materials), which is done by finding representative materials and resources for
various similar or redundant routing sequences from the routing data obtained
from the database [37]. Sometimes workstations also have similar functions but
are named and located differently; in such cases, the workstations and materials
should be identified manually by experts to reduce the number of materials and
resources.
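A minimal sketch of the grouping idea behind representative material selection is shown below: materials with identical routing sequences collapse into one group, and one material stands in for each group. The actual algorithm of [37] also handles similar (not only identical) routings; the data here are illustrative.

```python
from collections import defaultdict

def representative_materials(routings):
    """Group materials whose routing sequences (ordered workstation lists)
    are identical, and pick one representative per group."""
    groups = defaultdict(list)
    for material, sequence in routings.items():
        groups[tuple(sequence)].append(material)
    # The alphabetically first material stands in for each routing group.
    return {seq: min(mats) for seq, mats in groups.items()}

# Illustrative routing data: material ID -> workstation sequence.
routings = {
    "M00001": ["W1", "W2", "W3"],
    "M00007": ["W1", "W2", "W3"],  # redundant routing, same as M00001
    "M00002": ["W1", "W4"],
}
reps = representative_materials(routings)
```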

The preparatory steps before discrete simulation are not trivial tasks; expert
insight is required for data preprocessing and selecting BOM data, representa-
tive materials, other relevant parameters, and experimental designs. The initial
preparation requires the collaborative effort of business experts from different
industrial-production domains within a company. Therefore, a platform with
visual and interactive functionality can help experts from other domains to share,
exchange, and collaborate in an understandable and straightforward manner.
We propose the development of an interactive platform with visual assistance
that allows business experts to prepare, select, and validate the real-world pro-
duction data and parameters, and various production activities of a company
to produce a simplified version of the data and parameters that is suitable for
the requirements of a discrete simulation intended to provide optimal solutions
to various PPC challenges. The interactive platform would allow network-based
approaches to visualize and analyze data systematically. Further, the platform
should facilitate multiple users for interactive analysis of the results of discrete
simulation by the collaborating team members.
The structure of this paper is as follows. The second section briefly outlines
objectives, tasks, and network-based approaches for data exploration and tech-
nical implementation. The third section discusses a case study of ERP data used
for the sheet metal processing industry, including preprocessing, network con-
struction methods, and step-by-step network exploration methods for preparing,
selecting, and analyzing production data. Finally, Sect. 4 provides concluding
remarks about our analysis and future work approach.

2 Methods
In this section, we highlight the main objectives for the interactive visualization
platform and point out some of the tasks necessary to achieve these objectives.
We also provide a brief description of the key aspects of its network visualization
and exploration functionality, and basic details of technical implementation.

2.1 Objectives

Our proposal is for an interactive platform for visual modeling of discrete simula-
tion for various PPC objectives related to optimizing logistic performance in indus-
trial manufacturing. Visual modeling of discrete simulation for PPC will facilitate
a higher level of abstraction so that domain experts can develop various discrete
simulation scenarios without the requirement for any technical understanding.
This interactive platform would have data visualization features based on immer-
sive visualization technology that would allow different users, acting collabora-
tively, to execute various discrete simulation steps to optimize PPC issues.
The platform would allow users to perform various parameter selection tasks,
experimental design, validation of simulation outcomes with state-of-the-art data
visualization techniques, data analytics, and immersive and collaborative visual-
ization techniques. A schematic diagram of the platform is shown in Fig. 1. Our
proposed platform has the following objectives:
Fig. 1. Schematic diagram of the visual platform for discrete data simulation and analysis

– Connectivity application programming interface to download required data
from the database to the visual platform.
– Basic visualization tools for data visualization to test and check data quality,
including the required data preprocessing tools to structure and clean the
data.
– Network visualization of BOM data and workstation connectivity for pri-
mary selection materials and workstations with interactive visualization for
selecting, deleting, and editing workstations (delete or merge) and BOM data
manually.
– Providing clustering methods to cluster different products or assembly pro-
cesses based on the similarity of their features and structures. Additional
functionality for users to make partitions of materials utilizing the BOM and
workstations network based on their domain expertise.
– Implementing a representative material selection algorithm based on the
directionality of routing sequences (a sequence of workstations) that can
group different materials to reduce product variants based on routing sim-
ilarity to the optimized level.
– Providing statistical tools in the visual platform to assist experts in charac-
terizing distributions and estimating parameters for the random sampling of
processing, setup, repair time, and production planning variables.
– Providing users with various interactive options whereby they can generate
different production scenarios as the experimental design, interacting with
other users to make collective decisions for various production scenarios to
meet their PPC objectives.
– The visualization, exploration, selection, and experimental design scenarios
should be saved and stored in a user-understandable and generic format that
is easily readable and can be loaded into various analysis tools and platforms
for further analysis and discrete simulation.
– Allowing evaluation metrics and useful means of visualization to help business
experts evaluate, compare, and understand the simulation results in terms of
the outcomes in practical scenarios.
– A user-friendly interface with various interactive options both for initial data
preparation and parameters that connects with the discrete simulation mod-
ule for execution, and for systematic presentation of the results of the simu-
lation.

2.2 Tasks
To meet the challenges of developing the interactive visual platform, we divide
the necessary tasks into the following steps:

– Obtaining relevant data, including BOM, routing data, and other required
information, to construct a practical example from a company’s ERP data.
– Defining common standards for preprocessing, cleaning, and integrating data
for the approach to representative material selection.
– Developing construction and visualization algorithms for various types of
networks, such as workstation networks, BOM networks, bipartite networks
(between materials and workstation), and multilevel networks (which are a
combination of workstation and BOM networks).
– Implementing a representative material selection algorithm by using cluster-
ing and community detection algorithms on the data selected by users for
identifying different groups of materials based on material routing data.
– Implementing an immersive visualization platform with interactive features
such as options for preprocessing data, prior selection of materials, and work-
stations for clustering materials using a representative material selection algo-
rithm. Related tasks are implementing necessary visual options for parame-
terizing the discrete simulation model and results evaluation functions for
output results.
– Describing case studies for the interactive visualization of various user activ-
ities related to preprocessing, selecting, editing data for discrete simulation,
and validation outlines.
– User evaluation and testing of the visual platform.

2.3 Network Exploration


The basic purpose of the platform is to systematically explore and analyze ERP
data for the preparation of discrete simulations and other production-related
challenges using network-based approaches utilizing interactive and immersive
visualization techniques. We implement various functionalities to construct and
explore networks in an easy, understandable, and collaborative manner. In the
constructed networks, nodes and edges function as active components of the
immersive visualization platform [2,12,19,24,35]. The active components would
respond based on the user’s query, performed interactively, and include various
functionalities such as collapsing of nodes and edges, adding new nodes and
edges, selection of subgraphs, expansion of abstract views of graphs, separate
visualization of modules, and multilayer visualization by combining BOM net-
works and workstation routing networks [43,46]. The materials and BOM details
of individual products and machines, with historical data, would be added to
the interactive visualization platform. We would also provide various structural
properties relevant to the BOM networks and workstation networks, such as clus-
tering coefficients, page rank, degree distribution, and node and edge entropies,
that correlate with various emerging characteristics of the different types of net-
works that influence PPC objectives [10,13,47]. Various graph-related distance
measures would allow users to compare the constructed networks of BOM, work-
stations, and multilayer networks for different simulation scenarios.
The initial step of network exploration is to address the redundant and old
information of ERP data that is not useful for PPC optimization of current pro-
duction orders. For example, if routing sequence(s) of a particular product or
group of products is required to reassign or some products are not required for
the new simulation scenarios. Therefore, in these cases, the routing sequences
and products’ information (BOM structure) are no longer important for new
scenarios. The other examples are if new products are customized and require
allocation of new routing sequences for production or the routing sequences at
various steps of materials’ assembly are needed to change as per the products’
complexity. In these cases, experts should utilize their domain understanding
to remove unwanted materials and their allocated routing sequences. Experts
should add new routing sequences and allocate the expected time at each work-
station for materials used for new customized products. These activities are per-
formed by visualizing and exploring the bipartite graph, workstation network, or
the multilevel BOM network and workstation networks. When the initial step is
completed, the next step is selecting representative materials using a clustering
algorithm based on different products’ features. The clustering solution provides
multiple groups of materials and workstation networks (routing). These groups
obtained by clustering are enriched into various product types and workstation
categories of production processes depending on the complexity, delivery time,
available resources, and other product and workstation network features. The
users and domain experts need to work collaboratively to select representative
materials from different groups and reorganize the materials’ routing sequences
by interacting with the networks’ active components (nodes, edges). The interac-
tive operations select and merge workstations (node merging) and delete nodes
(if workstations are not functional or required). The users and domain experts
can perform various tasks collaboratively, such as:

– Adding new routes (if the product's routing needs to be extended).
– Modifying routing sequences of a product or a group of products (if worksta-
tions have a long processing time or high load).
– Prioritizing production of a group of products over others by adding new
routing sequences or giving higher priority to some materials in the queue of
a workstation in the workstation network.

In addition, interval networks of workstations can be used to visualize and
explore the production processes of numerous products occurring at varying
points in time. These are useful for analyzing workstation load, expected
completion time, and material routing, and such visualization and exploration
of networks helps users understand the complexity of workstation routing and
of the products themselves. The interval graphs also allow routing to be
analyzed at different time points for flexible scheduling of various products.
Using multilevel network visualization, experts can optimize the routing of
production processes based on BOM complexity and expected delivery time,
generating simulation scenarios that fulfill the PPC objectives. The interactive functionalities to
interact with nodes and edges of several networks in the visualization platform
allow business experts to collaboratively incorporate their domain knowledge to
design simulation scenarios to optimize PPC challenges and understand produc-
tion processes’ complexity. The main advantages of network-based interactive
visualization and exploration with immersive technology are that the users are
not required to know the complex and technical details of discrete simulation’s
preparatory steps. Network-based approaches and interactive visualization using
immersive technology provide an abstraction and systematic representation to
deal with complex problems, thus helping business experts and supervisors on
production sites understand and manage production and manufacturing systems
efficiently. Second, it provides flexibility to create various discrete simulation
scenarios intuitively by domain experts to address PPC challenges. Third, it is
useful for efficient collaboration between domain experts to address PPC chal-
lenges given the complexity of the products’ manufacturing process and BOM
structure.

2.4 Technical Implementation


Data visualization and analysis using immersive technologies along the reality-
virtuality (RV) continuum [29] have increasingly become popular in medical/bi-
ological research, production and engineering, robotics, and education. More
recently, there has been a trend to include traditional workstation-based visual
analysis on 2D screens, as well as augmented reality (AR), augmented virtu-
ality (AV), and virtual reality (VR) defined as cross-virtuality (XV). In order
to achieve greater insight into highly complex and large data sets, we suggest
using cross-virtuality (XV) to concentrate the initial phases of the data analysis
process on the immersive VR-side of the RV continuum. Combining technologies
along the RV continuum with Cross-virtuality analytics (XVA) [14] will enable
greater insight into these complex and large data sets. Cross-virtuality analyt-
ics (XVA) enables seamless integration and transition between conventional 2D
visualization, augmented reality, and virtual reality to provide users with opti-
mal visual and algorithmic support with maximum cognitive and perceptual
suitability [32].
While research on visualization systems across the RV continuum started sev-
eral decades ago [21,39], it gained recent interest because of the wide availability
of respective hardware such as head-mounted displays (HMDs), tablets, among
other devices used for XVA [33]. However, new hardware advancements also
add opportunities and challenges regarding visualization and interaction tech-
niques, especially considering analyses and interactions using different devices
along the RV continuum. Particularly for XVA, collaborative features between
multiple users, potentially from different domains with different levels of knowl-
edge or backgrounds, and their use cases and scenarios, must be researched.
More recently, collaborative systems were introduced which integrate cross-
device functionality, such as the Dataspace system by Cavallo et al. [5], where
large screens are combined with augmented reality for a shared analysis experi-
ence of multivariate data.
For a practical implementation of Vsimgen using XVA, we will explore graph
and network analysis methods which are extensively used in the production and
supply chain domain [6,28]. The graph and network analysis is concerned with
the visualization of and interaction with complex networks for analysis purposes,
where nodes represent entities and edges the relationships between them. The
first step for the analysis of complex networks is typically visualization to dis-
cover patterns, generate new knowledge, and provide an interpretation of various
higher-level emergent properties of the system [38]. Research shows that immer-
sive visualization technologies can foster an interactive and efficient visualization
of networks [23,27,34]. Immersive technology, such as AR and VR HMDs, can
help to interactively visualize and modify graphs, which, until recently, have
been under-utilized due to technical and complexity reasons. Further, an exten-
sion of graph visualization and interaction to collaboration capabilities has yet to
be researched more thoroughly. For example, Cordeil et al. [8] utilize collabora-
tive visualization of graphs which aim to find patterns, structures, and complex
characteristics in a collaborative way. We believe XVA can be applied for the
following techniques in graph and network analysis:

– Selection/highlighting of nodes and edges.
– Zooming in/out of the graph to reveal sub-graphs.
– Visualizing paths between two given nodes or sets of nodes.
– Adding/removing nodes and edges.
– Collaboratively exploring graphs and sub-graphs.

For the software part, we propose using Unity, an established 3D visualization
and interaction engine that can be used for web, desktop, mobile, and AR/VR
applications, among others. For hardware, current consumer-grade HMDs pos-
sess a high display resolution to visualize many data points in an immersive
manner. Furthermore, HMDs are often coupled with extra controllers that can
be utilized to interact with the graph content, such as zooming or selecting nodes.
Example HMD devices are the HTC Vive Pro and the Oculus Quest. For a non-
immersive experience for visualization and interaction with graphs, the use of
monitors or tablets is also possible. However, drawbacks are the limited depth
effect compared to an AR/VR implementation, which may make complex graphs
harder to navigate and interact with from a user’s perspective. For a concrete
use case, a particular file containing graph data (i.e., nodes and edges) would be
loaded into the AR/VR application deployed on an HMD, which would parse
and visualize the file’s content in a three-dimensional immersive manner. Further
interactions would then be accomplished using hand/finger gestures or coupled
controllers containing buttons and sliders to control the graph environment.

3 Case Study

This section provides examples of data visualization for groups of materials,
workstation network visualization, and representative material selection analysis.
The analysis aims to combine various network-based approaches for processing,
selecting, and editing various BOM and workstation details, using the knowl-
edge of collaborating domain experts to provide brief forms of input parameters.
These will be used for discrete simulation analysis addressing various issues of
optimization in PPC. We adopt a network approach to systematically visualize,
explore, and select relevant information with interactive visualization for dis-
crete simulation. The edges and nodes are described as active components and
retain relevant information about the BOM, workstations, and materials. This
information can be interactively accessed and utilized by the business experts
for selecting the required information for the simulation. Before describing the
various aspects of visualization, we briefly describe the bipartite graph, work-
station network, and heterogeneous network of combined BOM structure and
workstations.
Bipartite Graph: Let G = (U, V, E) be a bipartite graph that consists of two
disjoint vertex sets, U and V. E denotes the set of edges representing
connections between vertices in U (workstations) and V (materials): an edge
exists if a material v_i ∈ V is processed at a workstation u_i ∈ U.
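The definition above can be sketched with plain Python sets (a hypothetical minimal encoding; a graph library would serve equally well). The rows mirror the small example of Fig. 2a.

```python
def build_bipartite(master_rows):
    """Build G = (U, V, E) from (workstation, material) pairs of the master
    table: an edge (u, v) exists if material v is processed at workstation u."""
    U, V, E = set(), set(), set()
    for workstation, material in master_rows:
        U.add(workstation)
        V.add(material)
        E.add((workstation, material))
    return U, V, E

# Materials M01, M02 and the workstations that process them (as in Fig. 2a).
rows = [("W1", "M01"), ("W2", "M01"), ("W3", "M01"), ("W1", "M02"), ("W4", "M02")]
U, V, E = build_bipartite(rows)
```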
Workstation Network: For the workstation network, we first construct a
directed path graph, G_p = (V_p, E_p), of the routing sequence of each
material. The vertex set of each material m_i is a routing sequence of
workstations, V_p = {v_1, ..., v_n}, with edges E_p = {e_1, ..., e_{n-1}},
where e_k = (v_k, v_{k+1}) for every 1 ≤ k ≤ n − 1. The final network is
constructed as G_ws = G_1 ∪ G_2 ∪ · · · ∪ G_P.
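The union of the per-material path graphs can be sketched as follows, using the routing sequences of Fig. 2c as illustrative input; set union automatically merges workstations shared between routings.

```python
def workstation_network(routings):
    """Union of the directed path graphs of all material routing sequences."""
    nodes, edges = set(), set()
    for sequence in routings.values():
        nodes.update(sequence)
        # Consecutive workstations (v_k, v_{k+1}) form the directed edges.
        edges.update(zip(sequence, sequence[1:]))
    return nodes, edges

# Routing sequences as in Fig. 2c.
routings = {"M01": ["W1", "W2", "W3"], "M02": ["W1", "W4"]}
nodes, edges = workstation_network(routings)
```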
BOM Network: In the hierarchical structure of BOM, the root node is the
finished product. The finished product is made of several materials by assembling
different materials and sub-assemblies represented as a BOM tree. A BOM tree
is a directed tree structure in which the root node is the end product. The parent
nodes connect to the child nodes by directed edges. A complex BOM network,
a combined representation of the BOM trees of a set of products
P = {P_1, P_2, ..., P_n}, is constructed in three steps [7]: first, construct a
tree T_i for each P_i; second, reduce the bill of materials by merging duplicate
vertices into one; and finally, construct the network
G_BOM = T_1 ∪ T_2 ∪ · · · ∪ T_|P|.
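The three construction steps can be sketched as a set union over per-product edge lists; representing each tree as a set of (parent, child) edges makes the duplicate-merging step implicit. The product structures below are illustrative.

```python
def bom_network(bom_trees):
    """G_BOM = T_1 ∪ ... ∪ T_|P|: union of per-product BOM trees given as
    (parent, child) edge lists; set union merges duplicate vertices/edges."""
    edges = set()
    for tree in bom_trees:
        edges.update(tree)
    nodes = {n for e in edges for n in e}
    return nodes, edges

# Two illustrative products sharing the sub-assembly SA1 and material M01.
T1 = [("A", "SA1"), ("SA1", "M01"), ("SA1", "M02")]
T2 = [("B", "SA1"), ("SA1", "M01"), ("B", "M03")]
bom_nodes, bom_edges = bom_network([T1, T2])
```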
Multilevel Network: Let G = (V, E) represent a multilevel network, where
V = ∪_{a=1}^{d} V^a and E is a set of edges. The vertex subsets V^a and V^b
are disjoint for a ≠ b. Each subgraph (V^a, E^a) is a level, where
E^a = {ij ∈ E : i, j ∈ V^a}, and the relationship between levels is the
bipartite subgraph (V^a, V^b, E^{ab}), where
E^{ab} = {ij ∈ E : i ∈ V^a, j ∈ V^b} [11,18]. In our case, we construct a
multilevel network, G = (V, E), by considering levels of BOM and workstation
networks, i.e., V = V^BOM ∪ V^ws, and the edges represent three sets of
connections, E = E^ws ∪ E^BOM ∪ E^{ws,BOM}.
Interval Network (Workstation): The temporal network of workstation routing
allows users to visualize the active workstations (nodes) and the routing
(edges) in a specified time interval. We divide the whole production process
into a set of intervals T = {(t_1, t_1′), (t_2, t_2′), ..., (t_n, t_n′)}. Let
V be the set of workstations, of which V_t ⊂ V are active, with E_t routing
edges, in an interval (t_a, t_a′). The graph G_t = (V_t, E_t) is an interval
network in which an edge e_t^{ij} ∈ E_t is given by
e_t^{ij} = (i, j, t_a, t_a′), where i, j ∈ V_t, t_a is the start time, and
t_a′ is the end time [16].
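A sketch of restricting the temporal workstation network to one query interval is given below; the timed-edge encoding and the overlap test are assumptions for illustration.

```python
def interval_network(timed_edges, t_start, t_end):
    """Restrict a temporal workstation network to the edges whose activity
    interval (a, b) overlaps the query interval (t_start, t_end)."""
    E_t = {(i, j, a, b) for (i, j, a, b) in timed_edges
           if a < t_end and b > t_start}
    # Active workstations are the endpoints of the surviving edges.
    V_t = {n for (i, j, _, _) in E_t for n in (i, j)}
    return V_t, E_t

# Edges e_t^{ij} = (i, j, start, end); times are illustrative shop hours.
timed_edges = {("W1", "W2", 0, 4), ("W2", "W3", 4, 8), ("W1", "W4", 9, 12)}
V_t, E_t = interval_network(timed_edges, 3, 7)
```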

3.1 Data

We use real world manufacturing data for sheet metal processing. In the initial
phase, the data are exported from the ERP system relating to BOM, routing
data with processing time at each workstation, and other production planning
parameters required for the discrete simulation. The BOM data contain material
IDs (unique), sub-assembly IDs, and the end products and lot size policy for each
material. Lot size policies can be fixed order period (FOP), fixed order quantity
(FOQ), or consumption based (CB).
The routing data contain material IDs, workstation ID (unique), expected
time spent at the corresponding workstation, and operation sequence numbers
defined by integer values. There are multiple rows in routing data for individual
material IDs with different sequence numbers, representing the complete routing
sequence of the material.
The BOM data and routing sequence data are integrated by joining both
tables using material ID as the primary key. The joined table is called a master
table. An example master table is shown in Table 1.

Table 1. Example master data table joining BOM data and routing sequence data

End item | Sub-assembly ID | Material ID | Workstation | Process ID | Lot size policy | Standard time | Cumulative time
A        | SA1             | M00001      | W1          | 1          | FOP1            | 0.20          | 0.20
A        | SA1             | M00001      | W2          | 2          | FOP1            | 0.25          | 0.45
A        | SA1             | M00001      | W3          | 3          | FOP1            | 0.25          | 0.70
A        | SA1             | M00002      | W1          | 1          | FOP2            | 0.15          | 0.15
A        | SA1             | M00002      | W4          | 2          | FOP2            | 0.20          | 0.35
...
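The join that produces the master table can be sketched as follows. The dictionary keys (material_id, process_id, and so on) are hypothetical column names, and the cumulative time is accumulated along each material's operation sequence as in Table 1.

```python
from collections import defaultdict

def build_master(bom_rows, routing_rows):
    """Join BOM rows and routing rows on the primary key 'material_id' and
    derive the cumulative time along each material's operation sequence."""
    bom = {r["material_id"]: r for r in bom_rows}
    by_material = defaultdict(list)
    for r in routing_rows:
        by_material[r["material_id"]].append(r)
    master = []
    for mat, ops in by_material.items():
        cumulative = 0.0
        for op in sorted(ops, key=lambda r: r["process_id"]):
            cumulative += op["standard_time"]
            master.append({**bom[mat], **op, "cumulative_time": round(cumulative, 2)})
    return master

# Illustrative rows matching the first material of Table 1.
bom_rows = [{"material_id": "M00001", "end_item": "A", "lot_size_policy": "FOP1"}]
routing_rows = [
    {"material_id": "M00001", "workstation": "W1", "process_id": 1, "standard_time": 0.20},
    {"material_id": "M00001", "workstation": "W2", "process_id": 2, "standard_time": 0.25},
    {"material_id": "M00001", "workstation": "W3", "process_id": 3, "standard_time": 0.25},
]
master = build_master(bom_rows, routing_rows)
```

In practice the export from the ERP system would supply both tables; only the join key (material ID) is fixed by the text.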

In the master data set the bipartite network, Gbipartite , is constructed using
data from the Material ID column and Workstation column. The BOM network,
GBOM , is constructed by combining the End item, Sub assembly ID, and Mate-
rial ID columns. The workstation network is constructed using the Workstation
column. Examples of the networks constructed from the master data are shown
in Fig. 2.

Fig. 2. Different types of networks created from the master data: (a) bipartite network, (b) BOM network, (c) workstation network.

3.2 Initial Data Visualization

In the first step, we load the required ERP data, perform preprocessing, and
create a master table; the complete data comprise 400,000 rows. The initial data
visualization is performed by creating a bipartite network containing details of
124 workstations and 28,500 materials. However, the bipartite network ignores
the routing details of the materials in this first step. Here, we only consider the
relationship between a workstation and a material if the material is processed at
that workstation. Users can select material- and workstation-related network
construction options. For example, the user can choose to construct only those
relationships for materials processed through at least three workstations. An
example of a bipartite graph is shown in Fig. 3. We applied a multilevel
community-detection algorithm [3] and replaced each module with a weighted
node and its respective connections with a single weighted edge.
In Fig. 3, the module-detection algorithm estimates 8 modules of the bipartite
graph, G = (U, V, E), with V = V1 ∪ ... ∪ V8 materials and U = U1 ∪ ... ∪ U8
workstations. An abstract bipartite graph, Gabs = (Uabs , Vabs , Eabs ), is drawn
by replacing each Vi and Ui with a single node; an edge is drawn if there is at
least one edge between nodes i ∈ Vi and j ∈ Uj , where ij ∈ E. The edge widths
show the total number of materials connected with workstations in the respective
modules. The vertex weights correlate with the total numbers of materials and
workstations.
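The contraction into an abstract graph can be sketched as follows, assuming a node-to-module assignment has already been produced by a community-detection step such as the multilevel algorithm [3] (the toy graph and module names below are illustrative):

```python
from collections import Counter

def abstract_graph(edges, module_of):
    """Contract a graph given a node -> module mapping.

    Returns weighted abstract edges (module pair -> number of original edges)
    and vertex weights (module -> number of original nodes).
    """
    edge_w = Counter((module_of[u], module_of[v]) for u, v in edges)
    vertex_w = Counter(module_of.values())
    return dict(edge_w), dict(vertex_w)

# Toy bipartite graph with two material modules and two workstation modules;
# the module assignment stands in for the output of the community detection.
edges = [("m1", "w1"), ("m2", "w1"), ("m3", "w2"), ("m1", "w2")]
module_of = {"m1": "M1", "m2": "M1", "m3": "M2", "w1": "W1", "w2": "W2"}
ew, vw = abstract_graph(edges, module_of)
```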
Network Visualization and Analysis 743

Fig. 3. Abstract visualization of the bipartite graph of workstation and material
nodes, which are the groups of vertices Vi ⊂ V (top: materials) and Ui ⊂ U (bottom:
workstations), respectively. Edge widths highlight the proportion of edges that exist
between Ui ⊂ U and Vj ⊂ V . Edge width can also represent the proportion of the
marginal sums of the weights of edges in a module.

The vertices and edges are used as active components in interactive visualization
and should provide various options to explore details via event-driven functions
represented by a node or by connections between nodes. The abstract visualization
fulfills two initial tasks: first, it reduces complexity; second, it is used
to select specific modules for further analysis and master-data preparation for
discrete simulation. The next step is to explore individual modules and their
connections, which can be shown by selecting a workstation or material module
for further exploration of the data. We provide examples of selecting a workstation
module in Fig. 4a and selecting a material module in Fig. 4b. Each pair of
material and workstation modules, connected by an edge, can be further explored
for the details of individual materials and workstations. An example is shown in
Fig. 5. In this example we visualize the workstation (Fig. 5a) and the material
(Fig. 5b) at the center, with the connected materials and workstations arranged
in circular layouts.

3.3 Visualizing Workstation Network

The workstation network Gws is constructed by combining different directed
routing path graphs, as shown in Fig. 6a. The network is presented in a grid
layout where the edges of nodes are depicted by rectangular paths due to the
complexity of the connections. Vertex sizes represent the total number of materials
going into and out of each workstation. Figure 6b highlights the routing
path of a single material. The starting node is shown in blue, and the end node
and in-between nodes are shown in red and green, respectively. A user can merge
two nodes, add new nodes, delete nodes, and change workstation routing; the
back-end data are automatically updated for materials and workstations accordingly.
These are

Fig. 4. Visualizing individual modules: (a) workstation module W1 with connections
to material modules; (b) material module M1 with connections to workstation modules.

Fig. 5. Visualizing individual modules: (a) workstation module W1 with modules of
materials; (b) material module M1 with modules of workstations.

examples to show the exploration of routing-related details for various materials.
Further, the processing-time complexity of each workstation and various other
dynamic details can be explored through interactive visualization.

Fig. 6. (a) Workstation network aggregating all routing sequences of different materi-
als; (b) Showing the routing of a single material in the network.

Another important visualization is the interval-network visualization of the
workstation network, which explores the active workstations and the load in the
workstation network within a time interval. Users can generate several networks
at different time intervals to visualize active workstations and material routing.
In our example, we randomly select a subset of materials, Ma ⊂ M , where
|Ma | = 1026. These selected materials are each processed at n ≥ 5 workstations.
We consider the initial time for all materials tinit = 0, and the standard
processing time of a material mk ∈ Ma at a workstation wi ∈ W is t(mk , wi ).
We first calculate the cumulative sum of the standard processing time (cumulative
time) of each material mk at each workstation wi : if mk is sequentially
processed at workstations w1 , w2 , . . . , wp , the cumulative processing time of mk
at wi is t (mk , wi ) = Σs=1..i t(mk , ws ). We then select the maximum time,
tmax = max(t (mk , wi ) : i = 1, 2, . . . , |W |, k = 1, . . . , |Ma |), which is
tmax = 8.601 h. We split T = (tinit , tmax ) into 5 intervals, i.e.,
T = {(0, 0.302), (0.303, 0.493), (0.494, 0.739), (0.741, 1.263), (1.268, 8.601)}. We
then construct workstation networks for each interval for the m = 1026 materials.
An example visualization is presented in Fig. 7. The nodes in red are the active
nodes in the given time interval, Ti (in hours), and the edges in green show the
routing of material in the interval Ti . Node size represents the number of materials
processed at a particular workstation in interval Ti . In this example visualization,
we ignore the time at which a material arrives for processing; a similar
visualization can be produced when a material's arrival time at a workstation is
given. Such visualization is useful for assessing the workstation network's overall
status, i.e., the total time the workstations are busy and the load they carry. It
also assists domain experts in drawing new routes or re-optimizing the standard
routing by selecting networks within a time interval. However, we still need to
add various functionalities for exploring, navigating, and rerouting collaboratively
to prove the efficiency of such visualization.
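A sketch of the interval construction, under the simplifying assumption stated above that each routing edge is timestamped by the cumulative processing time at which the material reaches the downstream workstation (the routes and interval bounds below are illustrative, not the real 1026-material data):

```python
from bisect import bisect_left

def interval_networks(routes, breakpoints):
    """Assign each routing edge to a time interval by cumulative processing time.

    routes: {material: [(workstation, standard_time), ...]} in sequence order.
    breakpoints: sorted interval upper bounds; the last one must cover t_max.
    Returns one edge set per interval.
    """
    nets = [set() for _ in breakpoints]
    for steps in routes.values():
        cum, prev_ws = 0.0, None
        for ws, t in steps:
            cum += t
            if prev_ws is not None:
                # The edge becomes active in the interval containing its cumulative time.
                nets[bisect_left(breakpoints, cum)].add((prev_ws, ws))
            prev_ws = ws
    return nets

routes = {"m1": [("W1", 0.2), ("W2", 0.25), ("W3", 0.4)],
          "m2": [("W1", 0.1), ("W4", 0.15)]}
nets = interval_networks(routes, [0.302, 0.493, 8.601])
```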

(Panels (a) to (e) show the intervals T1 = (0, 0.302), T2 = (0.303, 0.493),
T3 = (0.494, 0.739), T4 = (0.741, 1.263), and T5 = (1.268, 8.601).)

Fig. 7. Interval networks of workstations remain active in different intervals.

3.4 BOM Network Exploration

The BOM network, GBOM , is composed of various BOM structures. The combined
network allows users to design products, analyze inventory details, and
perform PPC analysis. A user can create and explore new products by interactively
visualizing the BOM network, selecting representative materials for
discrete simulation. They can examine BOM structure similarity by comparing
graphs to find clusters of BOMs that have common sub-assemblies and
materials. BOM trees can be compared based on various graph similarity and
distance measures, such as graph edit distance [4], DeltaCon [25], and
vertex-edge overlap [31], which would be provided in the visualization platform.
An example visualization of a BOM hierarchy and its aggregation is presented
in Fig. 8. Figures 8a, 8b, and 8c are the individual BOM structures for
products P 1, P 2, and P 3, respectively, and these are aggregated in Fig. 8d.
The interactive visualization would allow users to visualize individual BOMs
or aggregated BOMs and explore various information utilizing the networks'
topological properties. The visualization would also provide the two-level network
G = (V ws ∪ V BOM , E ws ∪ E BOM ∪ E ws,BOM ) for selecting and using BOM and
routing data for discrete simulation.
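As one example of the measures mentioned above, a common formulation of the vertex-edge overlap [31] between two graphs reduces to a few set operations; a minimal sketch with toy BOM trees (node and edge names are illustrative):

```python
def vertex_edge_overlap(v1, e1, v2, e2):
    """Vertex-edge overlap similarity between two graphs (1.0 = identical)."""
    shared = len(v1 & v2) + len(e1 & e2)
    return 2.0 * shared / (len(v1) + len(v2) + len(e1) + len(e2))

# Two toy BOM trees that share a sub-assembly and a material.
v1, e1 = {"P1", "SA1", "M1"}, {("P1", "SA1"), ("SA1", "M1")}
v2, e2 = {"P2", "SA1", "M1"}, {("P2", "SA1"), ("SA1", "M1")}
sim = vertex_edge_overlap(v1, e1, v2, e2)
```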


Fig. 8. (a), (b), and (c) are individual BOM structures; (d) is an aggregated BOM
network visualized in a layered layout.

3.5 Representative Material Selection

The idea of representative material selection is to select representative materials
from a large number of variants that follow similar production features (routing)
and reflect an overall routing of the workstation network that is similar
to the complete workstation network (Gws ). In order to obtain the representative
materials [44], we measure the overlap in routing sequence between materials
and perform clustering of materials with a high degree of overlap. We choose an
overlap threshold of α = 0.90 to separate different clusters of materials. From
28,500 different materials, this yields 1,301 clusters. Users can apply network-
based approaches through visual interaction, such as optimizing the number
of materials based on the graph edit distance (GED) [4] between the workstation
network Gws and a subnetwork Gwssub (M ) constructed from the materials
M = {m1 , m2 , . . . , mn }, i.e., arg min ged(Gws , Gwssub (M )) ∼ 0. We applied
the GED only for insertions and deletions of vertices and edges between the
workstation network (Gws ) and the workstation subnetwork (Gwssub ) of the |M |
representative materials. As an example, we provide a graph constructed from
the missing edges of Gwssub (M ), obtained by selecting |M | = 800 materials. The
GED measures the similarity between Gws and Gwssub (M ), and its value here
is 91. A visualization of the missing edges is shown in Fig. 9a. We further provide
a comparison of representative materials by comparing with Gws . We randomly
select p groups

and then randomly pick a material from each group and construct Gmsub (M )
for |M | ∈ {5, 25, 45, . . . , 1305}, repeating 20 times for each |M |. The GED is
shown in Fig. 9b. As we increase the number of clusters from which materials
are randomly selected, the GED approaches 0, i.e., GED ∼ 0. This result validates
the clustering solution grouped based on the overlap of the routing sequences
of materials. The representative materials in each group can be selected based
on domain knowledge and users' prior understanding. A user can also define
several scenarios for discrete simulation by selecting different sets of representative
materials, considering the complexity of BOM structures and workstation
routing. The network visualization and analysis functionalities of graphs and
the interactive features in the visualization tool would provide users with an
efficient way to deal with large numbers of materials and the high complexity
of workstation routing.
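Restricted to insertions and deletions of vertices and edges, as used above, the GED between two graphs reduces to the size of the symmetric differences of their vertex and edge sets; a minimal sketch (the toy networks are illustrative):

```python
def ged_insert_delete(v1, e1, v2, e2):
    """GED counting only insertions/deletions of vertices and edges."""
    return len(v1 ^ v2) + len(e1 ^ e2)

# Full workstation network vs. a subnetwork missing one node and one edge.
v_full, e_full = {"W1", "W2", "W3"}, {("W1", "W2"), ("W2", "W3")}
v_sub, e_sub = {"W1", "W2"}, {("W1", "W2")}
d = ged_insert_delete(v_full, e_full, v_sub, e_sub)
```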

[Plot (b): graph edit distance (y-axis) against the number of materials (x-axis).]

Fig. 9. (a) Missing edges in the workstation network, obtained by comparing
Gmsub (M ) with Gws ; (b) GED comparing Gws with Gmsub (M ) by randomly selecting
materials, M, from different groups.

4 Summary
In this paper, we discussed the importance of optimized PPC strategies for
improved logistic performance in customized production. We also discussed some
of the challenges in the discrete simulation approach to analyzing PPC decisions,
which mainly relate to data preparation, representative material selection,
and model parameterization. Further, we presented example visualizations for
interactive visualization technology for BOM networks, workstation networks,
and bipartite networks proposed for the systematic exploration of the master
data. The initial setup for a discrete event simulation requires a collaborative
effort between domain experts managing various stages of PPC strategies. To
enable efficient collaboration between experts for optimizing PPC, we have
proposed an interactive platform with state-of-the-art visualization technology
and network-based approaches for the systematic exploration and analysis of
data. These features allow users to collaborate and select various options for experimental
designs using representative materials selection and other required parameters,
allowing flexibility and abstraction by hiding complex technical details of data
extraction, preprocessing, and discrete simulation. The platform would also allow
users to implement discrete simulation models and visually compare results for
different simulation scenarios. The visual platform would also streamline com-
plex back-end operations of discrete simulation by means of abstraction, pro-
vide faster and simpler working methods to achieve efficient PPC strategies via
discrete simulation, and enable deeper business understanding for efficient pro-
duction planning decisions. Our future plan is to implement a working model
of Vsimgen, featuring various interactive user options in an immersive visual-
ization platform. We will develop functional requirements for rendering, layout
generation, personalized and multiple views, network metrics, and processing
and analyzing data. We will also design case studies of interactive and
collaborative analysis that prove its efficacy over the cumbersome, non-interactive
standard approach.

Acknowledgments. This paper is a part of the X-pro project. The project is financed
by research subsidies granted by the government of Upper Austria.

References
1. Altendorfer, K., Felberbauer, T., Jodlbauer, H.: Effects of forecast errors on optimal
utilisation in aggregate production planning with stochastic customer demand. Int.
J. Prod. Res. 54(12), 3718–3735 (2016)
2. Bach, B., Dachselt, R., Carpendale, S., Dwyer, T., Collins, C., Lee, B.: Immersive
analytics: exploring future interaction and visualization technologies for data ana-
lytics. In: Proceedings of the 2016 ACM International Conference on Interactive
Surfaces and Spaces, pp. 529–533 (2016)
3. Blondel, V.D., Guillaume, J.-L., Lambiotte, R., Lefebvre, E.: Fast unfolding of
communities in large networks. J. Stat. Mech. Theory Exp. 2008(10), P10008
(2008)
4. Bunke, H., Dickinson, P.J., Kraetzl, M., Wallis, W.D.: A graph-theoretic approach
to enterprise network dynamics, vol. 24. Springer Science & Business Media (2007).
https://doi.org/10.1007/978-0-8176-4519-9
5. Cavallo, M., Dolakia, M., Havlena, M., Ocheltree, K., Podlaseck, M.: Immersive
insights: a hybrid analytics system for collaborative exploratory data analysis. In:
Symposium on Virtual Reality Software and Technology (VRST), pp. 1–12. ACM
(2019)
6. Cheng, Y., Tao, F., Xu, L., Zhao, D.: Advanced manufacturing systems: supply–
demand matching of manufacturing resource based on complex networks and inter-
net of things. Enterprise Inf. Syst. 12(7), 780–797 (2018)
7. Cinelli, M., Ferraro, G., Iovanella, A., Lucci, G., Schiraldi, M.M.: A network per-
spective on the visualization and analysis of bill of materials. Int. J. Eng. Bus.
Manage. 9, 1847979017732638 (2017)

8. Cordeil, M., Dwyer, T., Klein, K., Laha, B., Marriott, K., Thomas, B.H.: Immersive
collaborative analysis of network connectivity: cave-style or head-mounted display?
IEEE Trans. Visual Comput. Graph. 23(1), 441–450 (2017)
9. de Groote, X., Yücesan, E.: The impact of product variety on logistics performance.
In: Proceedings of the 2011 Winter Simulation Conference (WSC), pp. 2245–2254.
IEEE (2011)
10. Dehmer, M., Emmert-Streib, F., Jodlbauer, H.: Entrepreneurial Complexity:
Methods and Applications. CRC Press (2019)
11. Dimitrova, T., Petrovski, K., Kocarev, L.: Graphlets in multiplex networks. Sci.
Rep. 10(1), 1–13 (2020)
12. Elmqvist, N., Moere, A.V., Jetter, H.C., Cernea, D., Reiterer, H., Jankun-Kelly,
T.J.: Fluid interaction for information visualization. Inf. Visual. 10(4), 327–340
(2011)
13. Emmert-Streib, F., et al.: Computational analysis of the structural properties of
economic and financial networks. arXiv:1710.04455 (2017)
14. Fröhler, B., et al.: A survey on cross-virtuality analytics. In: Computer Graphics
Forum, vol. 41, pp. 465–494. Wiley Online Library (2022)
15. Garg, S., Vrat, P., Kanda, A.: Equipment flexibility vs. inventory: a simulation
study of manufacturing systems. Int. J. Prod. Econ. 70(2), 125–143 (2001)
16. Holme, P., Saramäki, J.: Temporal networks. Phys. Rep. 519(3), 97–125 (2012)
17. Hübl, A., Altendorfer, K., Jodlbauer, H., Gansterer, M., Hartl, R.F.: Flexible model
for analyzing production systems with discrete event simulation. In: Proceedings
of the 2011 Winter Simulation Conference (WSC), pp. 1554–1565. IEEE (2011)
18. Interdonato, R., Magnani, M., Perna, D., Tagarelli, A., Vega, D.: Multilayer net-
work simplification: approaches, models and methods. Comput. Sci. Rev. 36,
100246 (2020)
19. Jetter, H.C., Gerken, J., Zöllner, M., Reiterer, H., Milic-Frayling, N.: Materializing
the query with facet-streams: a hybrid surface for collaborative search on table-
tops. In: Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems, pp. 3013–3022 (2011)
20. Jodlbauer, H., Altendorfer, K.: Trade-off between capacity invested and inventory
needed. Eur. J. Oper. Res. 203(1), 118–133 (2010)
21. Kiyokawa, K., Takemura, H., Yokoya, N.: A collaboration support technique by
integrating a shared virtual reality and a shared augmented reality. In: Interna-
tional Conference on Systems, Man, and Cybernetics (SMC), vol. 6, pp. 48–53.
IEEE (1999)
22. Koh, S.-G., Bulfin, R.L.: Comparison of DBR with CONWIP in an unbalanced
production line with three stations. Int. J. Prod. Res. 42(2), 391–404 (2004)
23. Kotlarek, J., et al.: A study of mental maps in immersive network visualization.
In: IEEE Pacific Visualization Symposium (PacificVis), pp. 1–10 (2020)
24. Kotlarek, J., et al.: A study of mental maps in immersive network visualization
(2020)
25. Koutra, D., Vogelstein, J.T., Faloutsos, C.: DELTACON: a principled massive-
graph similarity function. In: Proceedings of the 2013 SIAM International Confer-
ence on Data Mining, pp. 162–170. SIAM (2013)
26. Kronberger, G., Weidenhiller, A., Kerschbaumer, B., Jodlbauer, H.: Automated
simulation model generation for scheduler-benchmarking in manufacturing. In:
Proceedings of the International Mediterranean Modelling Multiconference (I3M
2006), pp. 45–50 (2006)

27. Kwon, O.H., Muelder, C., Lee, K., Ma, K.L.: A study of layout, rendering, and
interaction methods for immersive graph visualization. IEEE Trans. Visual Com-
put. Graph. 22(7), 1802–1815 (2016)
28. Li, Y., Tao, F., Cheng, Y., Zhang, X., Nee, A.Y.C.: Complex networks in advanced
manufacturing systems. J. Manuf. Syst. 43, 409–421 (2017)
29. Milgram, P., Takemura, H., Utsumi, A., Kishino, F.: Augmented reality: a class
of displays on the reality-virtuality continuum. In: Das, H. (eds.) Photonics for
Industrial Applications, pp. 282–292 (1995)
30. Mula, J., Poler, R., Garcı́a-Sabater, J.P., Lario, F.C.: Models for production plan-
ning under uncertainty: a review. Int. J. Prod. Econ. 103(1), 271–285 (2006)
31. Papadimitriou, P., Dasdan, A., Garcia-Molina, H.: Web graph similarity for
anomaly detection. J. Internet Serv. Appl. 1(1), 19–30 (2010). https://doi.org/
10.1007/s13174-010-0003-x
32. Riegler, A., et al.: Cross-virtuality visualization, interaction and collaboration. In:
XR@ ISS (2020)
33. Sereno, M., Besançon, L., Isenberg, T.: Supporting volumetric data visualization
and analysis by combining augmented reality visuals with multi-touch input. In:
EG/VGTC Conference on Visualization (EuroVis) - Posters (2019)
34. Sorger, J., Waldner, M., Knecht, W., Arleo, A.: Immersive analytics of large
dynamic networks via overview and detail navigation. In: International Confer-
ence on Artificial Intelligence and Virtual Reality (AIVR), pp. 144–1447. IEEE
(2019)
35. Sorger, J., Waldner, M., Knecht, W., Arleo, A: Immersive analytics of large
dynamic networks via overview and detail navigation (2019)
36. Stevenson, M., Hendry, L.C., Kingsman, B.G.: A review of production planning
and control: the applicability of key concepts to the make-to-order industry. Int.
J. Prod. Res. 43(5), 869–898 (2005)
37. Strasser, S., Peirleitner, A.: Reducing variant diversity by clustering. In: Proceed-
ings of the 6th International Conference on Data Science, Technology and Appli-
cations, pp. 141–148. SCITEPRESS-Science and Technology Publications, LDA
(2017)
38. Strogatz, S.H.: Exploring complex networks. Nature 410(6825), 268–276 (2001)
39. Szalavári, Z., Schmalstieg, D., Fuhrmann, A., Gervautz, M.: “studierstube”: an
environment for collaboration in augmented reality. Virt. Real. 3(1), 37–48 (1998)
40. Thompson, M.B.: Expanding simulation beyond planning and design-in addition
to the increase in traditional uses, simulation is expanding into new and even more
valuable areas. Ind. Eng.-Norcross 26(10), 64–67 (1994)
41. Tiger, A.A., Simpson, P.: Using discrete-event simulation to create flexibility in
APAC supply chain management. Global J. Flexible Syst. Manage. 4(4), 15–22
(2003)
42. Trattner, A., Hvam, L., Forza, C., Herbert-Hansen, Z.N.L.: Product complexity
and operational performance: a systematic literature review. CIRP J. Manuf. Sci.
Technol. 25, 69–83 (2019)
43. Tripathi, S., Dehmer, M., Emmert-Streib, F.: NetBioV: an R package for visualizing
large network data in biology and medicine. Bioinformatics 30(19), 2834–2836
(2014)
44. Tripathi, S., Strasser, S., Jodlbauer, H.: A network based approach for reducing
variant diversity in production planning and control (2021)
45. Tseng, M.M., Radke, A.M.: Production planning and control for mass
customization–a review of enabling technologies. In: Mass Customization, pp. 195–
218. Springer (2011)

46. Wang, C., Tao, J.: Graphs in scientific visualization: a survey. In: Computer Graph-
ics Forum, vol. 36, pp. 263–287. Wiley Online Library (2017)
47. Yu, G., Dehmer, M., Emmert-Streib, F., Jodlbauer, H.: Hermitian normalized
Laplacian matrix for directed networks. Inf. Sci. 495, 175–184 (2019)
An Annotated Caribbean Hot Pepper Image Dataset

Jason Mungal1(B) , Azel Daniel1 , Asad Mohammed2 , and Phaedra Mohammed1

1 Department of Computing and Information Technology,
The University of the West Indies, St. Augustine, Trinidad and Tobago
[email protected], {Azel.Daniel,Phaedra.Mohammed}@sta.uwi.edu
2 Department of Mathematics and Statistics, The University of the West Indies,
St. Augustine, Trinidad and Tobago
[email protected]

Abstract. The Caribbean region is home to, and widely known for,
its many “hot” peppers. These peppers are now heavily researched to
bolster the development of the regional pepper industry. However, accu-
rately identifying the different landraces of peppers in the Caribbean
has remained an arduous, manual task that involves the physical
inspection and classification of individual peppers. An automated app-
roach that uses machine-learning techniques can help with this task; how-
ever, machine learning approaches require vast amounts of data to work
well. This paper presents a new multi-label annotated image dataset
of Capsicum Chinense peppers from Trinidad and Tobago. The paper
also presents a benchmark for pepper image classification and identification.
It serves as a starting point for future work, which can include the
compilation of larger datasets of regional peppers covering more
morphological features. It additionally serves as a starting point for
a Caribbean-based hot-pepper ontology.

Keywords: Caribbean hot peppers · Capsicum chinense · Hot pepper
descriptors · Machine learning · Image classification · Pepper image
dataset

1 Introduction
The Caribbean is well known for its hot peppers. Most of the region’s commer-
cially produced peppers are of the Capsicum Chinense Jacq. species. The species
is recognized for its high capsaicin content which gives Caribbean hot peppers
their characteristic heat, pungent aromatic smell and flavor. These characteris-
tics have made them an important export for countries in the region [29]. Another
trend that favors their export is global interest in the so-called “super-hot” cate-
gory of peppers. Trinidad and Tobago is known in particular for its “super-hot”
landraces and pure line varieties such as the “Seven-Pot” and “Scorpion” pep-
pers. The Scorpion pepper is one of the hottest peppers in the world and is
sought after for its high capsaicin content [5,6].
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 753–769, 2023.
https://doi.org/10.1007/978-3-031-18461-1_49
754 J. Mungal et al.

Hot pepper production is therefore an important area for research in
the region. The Caribbean Agricultural Research and Development Institute
(CARDI) has conducted research and started programs to encourage develop-
ment of the regional hot pepper industry. CARDI looks for ways to improve areas
of the hot pepper production value chain. An area of interest is investigating the
use of machine learning and computer vision to enhance processes such as the
grading and sorting of fruit for export. Grading is currently a manual process
that involves physical inspection of fruits for defects and undesirable character-
istics [1]. This is a part of CARDI’s value chain that can possibly benefit from
automation by applying computer vision techniques.
Another area of interest is being able to identify different varieties sold in
markets. This is a particularly challenging task as Caribbean farmers typically
grow landraces which have not been purified or characterized [4]. Landraces
maintained by farmers are known to have varying characteristics. A Trinidad hot
pepper field survey conducted by Bharath [4], found that many seedling suppliers
and farmers were not using certified planting material (seedlings). Bharath [4]
further discusses how such farming practices lead to increased variation in the
characteristics of the fruit. These variations can result in loss of the commercially
desirable characteristics and affect the production quantity or yield.
The main bottleneck for such applications is the lack of image datasets fea-
turing local Capsicum Chinense fruit. The aim of this work is to put forward an
image dataset of various local Capsicum Chinense varieties, annotated with
information about their visual attributes and morphological features. The
morphological features used are a subset of those described in the IPGRI 1995
Descriptors for Capsicum, with additional features observed in local fruit in the
work done by [3]. To our knowledge no other image dataset exists that features Capsicum
Chinense fruit from Trinidad with the same amount of visual attributes.
The contributions of this paper are as follows: (i) an image dataset of various
Capsicum Chinense fruits from Trinidad, (ii) a set of annotations of select visual
attributes and morphological features visible in images.
This work can be a starting point for machine learning and computer vision
application in hot pepper production and research in the region. It may also be
a starting point for building a local Capsicum Chinense ontology.
The remaining sections of this paper are as follows: Sect. 2 discusses related
work. We describe similar datasets, compare image collection methods, and com-
pare the additional attribute annotations. Section 3 describes our image and
annotation collection process. Section 4 describes the dataset with detailed statis-
tics relating to the classes and annotations. Section 5 presents an experiment to
show how the dataset can be used for image classification. Finally, Sect. 6 con-
cludes the paper and suggests future work related to the dataset.

2 Related Work
2.1 Related Image Datasets
Since our image dataset is intended to be used for deep learning applications
it is important to examine other datasets designed for similar use. Deep neural
Caribbean Hot Pepper Image Dataset 755

networks such as Convolutional Neural Networks (CNNs) require a large dataset
to produce good results. Collecting, labeling, and annotating images is a
time-intensive process. In order to save time when developing classification algorithms
it is common practice to use large, well-known datasets. This allows for easier
comparison and bench-marking against other models. However, the availability
of large annotated datasets is often a problem as noted by Barth et al. [2] in
their work on building a synthetic Capsicum Annuum (sweet pepper) image
dataset. Minervini et al. [23] noted the lack of computer vision applications in
agriculture and plant phenotyping (visual traits of an individual resulting from
the interaction of its genotype with the environment). These are tasks which
depend on availability of benchmark datasets.
Other agriculture image datasets have also been produced to address this
problem. They vary in content, application, and data collection methods.
Fruits360 [24] is a fruit dataset made for image classification with deep learn-
ing. It contains images of several classes of fruit on a white background. VegFru
[15] is a much larger dataset of fruits and vegetables with a more diverse set of
images. This dataset is geared towards fine-grained classification. DeepFruits [28]
and CropDeep [36] are datasets developed for fruit detection with convolutional
neural networks. They include bounding box annotations for object detection.
Barth et al. [2] created a synthetic Capsicum Annuum dataset with semantic
segmentation annotations for fruit/object detection. Zhang et al. [35] created a set of “clean” fruit
images for their fruit classification neural network. The aforementioned datasets
do not include local Capsicum Chinense fruit nor do they offer text annotations
of visual attributes.
The datasets created by Minervini et al. [22] presented annotations includ-
ing segmentation, bounding boxes, and metadata about the plants. While there
may not be an immediate need for such annotations, they can prove useful for
future research. With this in mind, our dataset has a unique combination of an
under-represented class of fruit and annotated descriptions of the morphological
features that the other datasets do not provide.

2.2 Datasets with Attribute Annotations


Attribute annotations typically provide a description of an object or scene in an
image. The right attribute annotations add value to the dataset. Farhadi et al.
[11] place emphasis on using object attributes for recognition and classification
tasks. Attributes were said to be semantic or discriminative and could be used in
combination to recognize and describe objects. The authors developed classifiers
that could recognize known object classes and also describe new classes. The
semantic annotations collected in [11] are related to things such as object parts,
shape, color and texture.
Other datasets with attribute annotations have been built in a similar manner
to those in [11]. A subset of ImageNet [10] was used to collect attribute annota-
tions by Russakovsky and Fei-Fei [27] who aimed to discover visual relationships
between classes with respect to color, shape, pattern and texture. The SUN
Attribute Dataset [25] contains annotations of scene attributes. [26] is another
756 J. Mungal et al.

notable mention which focuses on object description. The Visual Genome dataset
[17] features extensive annotations of objects, object attributes and object rela-
tions. This dataset was built with the aim of developing systems that can perform
reasoning tasks rather than recognition/classification tasks.
The initial task of determining what attributes to annotate was approached
differently for each study mentioned. Attributes need to be general enough so
that they can be applied to different types of objects in order to create rela-
tionships. The authors of [27] mined ImageNet synset definitions for descriptive
attributes. In [25] and [26] a human intelligence task was involved in identify-
ing common attributes. In both [27] and [25] this task was done using Amazon
Mechanical Turk (AMT). The top five most commonly selected attributes were
chosen in [25]. The authors of [26] reduced the number of attributes to those
that best describe visual properties.
Object specific datasets with attribute annotations reduce the difficulty of
choosing which attributes to include. Caltech-UCSD Birds-200-2011 [30] is a
dataset of bird images containing part location and attribute annotations.
It is an extension of the dataset in [31]. Attributes were chosen from a bird field
guide as there is an already defined or commonly understood taxonomy for
describing the object instances. CelebFaces Attributes Dataset [21] is a large
scale dataset of celebrity faces. The attributes describe physical features of the
faces as well as clothing and hair. However, Liu et al. [21] do not describe how
the attributes were chosen. Animals with Attributes Dataset [18] used a group of
high-level semantic attributes to describe images of animals in their dataset. The
authors describe these attributes as ones that generalize across class boundaries
e.g. color. DeepFashion [20] is a fashion dataset put forward for the purpose of
clothing classification and attribute prediction. Attributes for DeepFashion [20]
were chosen by mining meta-data of clothing images from Google Images and
online clothing retailers. With our Capsicum Chinense image dataset, we used
the International Plant Genetic Resources Institute (IPGRI) Descriptors for
Capsicum Spp. [16], a well-known set of descriptors for morphological traits of the
Capsicum species of plants. This descriptor set is well-defined and widely used.
Our attributes are therefore a subset of the descriptors defined in this document.
This is a more domain-specific approach to attribute annotation. The
vocabulary used for attributes in our dataset is consistent with that of studies
involving fruit from accessions of the Capsicum Chinense which we believe is
more beneficial for applications in this area.

2.3 Image Collection Methods


The diverse, large-scale datasets mentioned (e.g. ImageNet, MS-COCO) sourced
images from the internet using image search engine results or image hosting web-
sites. Even some of the smaller, domain specific datasets such as Caltech-UCSD
Birds-200-2011 [30] used online image search to collect samples for their dataset.
The question of how best to construct a dataset depends on the application or
intended use. Barth et al. [2] captured images using the same sensor and viewing
angle that would be used by their harvesting robot. Sa et al. [28] collected images
for infield fruit identification by taking images of fruit trees in the field and
using similar images found online. Zhang et al. [35] created a set of “clean” fruit
images by applying pre-processing to remove background from their image data.
Datasets like ImageNet [10] were constructed with general image classification
in mind and therefore aim to be as diverse as possible. Deng et al. [10], the cre-
ators of ImageNet, included average image calculations against classes of other
datasets to show that ImageNet is more diverse. This type of problem is more
difficult as the context of real-world images cannot be controlled. The image set
must be as diverse as possible to not only improve accuracy but avoid problems
such as domain shift. Domain shift refers to training and evaluating a model on
sets of data drawn from different distributions. An example of domain shift
is a classifier trained on images of objects on a white background per-
forming worse on similar images with a different background. The model trained
on data from one domain does not generalize well enough to be used with data
from another domain.
It may not be necessary to have the most diverse set of images with respect to
variables such as backgrounds and occlusion. For classifying images under lab
conditions, where the lighting, sensor, background, and fruit orientation can
be controlled, a more uniform dataset may be sufficient. Zhang et al. [34] constructed a dataset for
fine-grained classification of banana ripeness. This dataset included images of
bananas in predetermined orientations on white background. The same camera,
camera settings, distance and lighting were used when capturing images. While
their experiment achieved good results, it should not be expected to perform
well on images outside of this domain e.g. images from a different camera on a
different background. Zhang et al. [35] also conclude that their convolutional
neural network was not able to perform well on imperfect images and images with
complex backgrounds due to the network being trained on their clean dataset.
The same expectations would apply to our Capsicum Chinense image dataset
which uses a similar method of data collection to Zhang et al. [34].

2.4 Physical Descriptors for Capsicum Chinense


The physical features of plants and fruit such as size, shape or color are known
as their morphological traits. These are the features typically used by humans to
identify different species of plants as they are easy to observe and learn. Different
variations of a plant or fruit are sometimes easy to distinguish just by examining
their morphological traits. The International Plant Genetic Resources Institute
(IPGRI), currently known as Bioversity International, set out a list of descrip-
tors for plants and fruit of genus Capsicum Spp. [16]. These descriptors include
features such as plant height, leaf shape, fruit shape, fruit color, pedicel attach-
ment etc. Despite this extensive list of morphological traits, Bharath, Cilas and
Umaharan [3] observed four additional traits found in accessions (i.e. distinct,
uniquely identified samples) of Caribbean Capsicum Chinense Jacq fruit. The
four additional traits observed are: fruit gloss, surface pebbling, pericarp fold
and tail at distal end of the fruit. Surface pebbling and the distal tail are traits
observed in popular varieties such as the Trinidad Scorpion pepper. The yellow
Scotch Bonnet pepper is known for its yellow color when it matures, campanu-
late (bell-like) shape, and its folded pericarp. These morphological traits form
the basis for our attribute annotations.
Since varieties share similar traits, there is some inter-class similarity.
For example, the Scorpion pepper and Seven-Pot pepper have a very similar
appearance with small differences such as the Scorpion pepper’s pronounced
tail at the distal end of the fruit as seen in Fig. 1. Intra-class variation is also
a problem as the images of a class may have different orientations, scale or
occluded features. There is natural variation in morphological traits observed
for a specific variety.

Fig. 1. Image showing inter-class similarities between Scorpion and Seven-Pot Pepper

3 Methodology
3.1 Image Collection and Dataset Annotation

In order to start the data collection process, genuine Scorpion pepper fruit had
to be sourced directly from farmers. Images of other Capsicum Chinense such
as Trinidad Pimentos, Seven-Pot-brown, red and yellow Habanero varieties were
collected from local markets. The images were captured using two cameras in
fixed positions. The peppers were rotated to capture different sides of the fruit in
different orientations. In all, 10 images of each pepper were taken. A diagram of
this setup is shown in Fig. 2. The images were captured on a white background
with two fluorescent lamps for lighting. Both cameras had their white balances
set to fluorescent to account for the artificial lighting used. They also had their
ISO set to 400, with the image quality setting on “Fine”. However, due to differences in
cameras and their lenses, a few settings had to be configured differently. The first
camera produced images with a 4:3 aspect ratio at a 3264 × 2448 resolution,
while the second produced images with a 3:2 ratio at a 5184 × 3456 resolution.
Both cameras used fixed lenses; however, due to differences in brands and camera
technology, the first camera used an exposure compensation of +1 2/3 EV, while the
second used a 1/125 shutter speed and an f/3.2 aperture. Configuring these settings
differently allowed the two cameras to produce consistent images. For the purpose of having uniformly sized images, the
larger images were center cropped to match the aspect ratio and resolution of the
first set of images in the final dataset. The result of this process was a collection
of just over 4,000 images.
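The center cropping of the second camera’s frames can be sketched with simple box arithmetic. The function below is illustrative (the paper does not provide its cropping code): it computes the largest centered 4:3 crop box for a 3:2 frame, after which the crop would be downscaled to the first camera’s resolution.

```python
def center_crop_box(width, height, target_ratio=4 / 3):
    """Return (left, top, right, bottom) for the largest centered crop
    with the given aspect ratio."""
    if width / height > target_ratio:
        # frame too wide for the target ratio: trim the sides
        new_w = round(height * target_ratio)
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    # frame too tall for the target ratio: trim top and bottom
    new_h = round(width / target_ratio)
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)

# Second camera: 3:2 frames center-cropped to the first camera's 4:3 ratio
print(center_crop_box(5184, 3456))  # -> (288, 0, 4896, 3456)
```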

Fig. 2. Figure showing the camera setup used in the image collection process

Like a number of other image annotation collections [14,19,37], our anno-
tations were sourced through the Amazon Mechanical Turk (AMT) platform.
For each image collected, the MTurkers (annotators on the AMT platform) were
asked five questions. An example annotation screen is shown in Fig. 3. They
were asked to look carefully at the image and select the depicted pepper’s color,
its shape, how wrinkly its skin was, how bumpy its skin was, and whether it
had a pericarp fold. They were offered an option to view a larger version of the
image if necessary. They were also presented with detailed instructions on what
each feature of the pepper would look like to aid the annotation process. As is
common in image-annotation collection work, the MTurkers were restricted to
those from the United States or Great Britain who had at least
50% of their previous annotation work on the platform approved [14,33]. To
ensure a level of consensus, each image was annotated by five MTurkers. The
most common values for each attribute for each image were taken as the ground
truth for that image. This was done to eliminate the potential biases that a single
annotator may possess. The result of this process was a dataset with annotations
for color, shape, smooth surface, surface pebbling and folded pericarp for each
of the images. A total of 4184 images were annotated. Additional images are
presented in the final dataset without annotations.
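The majority-vote aggregation of the five annotations per image can be sketched as follows. The attribute names and vote values are hypothetical, and the paper does not specify a tie-breaking rule, so this sketch falls back to first-seen order on ties.

```python
from collections import Counter

def majority_label(votes):
    """Return the most common value among annotator votes.
    Tie-breaking (not specified in the paper) falls back to first-seen order."""
    return Counter(votes).most_common(1)[0][0]

def aggregate(annotations):
    """annotations: one dict per annotator, mapping attribute -> chosen value."""
    attributes = annotations[0].keys()
    return {a: majority_label([ann[a] for ann in annotations]) for a in attributes}

# Five hypothetical annotations for one pepper image
votes = [
    {"color": "red", "shape": "triangular"},
    {"color": "dark red", "shape": "triangular"},
    {"color": "red", "shape": "campanulate"},
    {"color": "red", "shape": "triangular"},
    {"color": "dark red", "shape": "triangular"},
]
print(aggregate(votes))  # -> {'color': 'red', 'shape': 'triangular'}
```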
Fig. 3. Figure showing the AMT user interface that was shown to the annotators.

3.2 Classification Experiment

The dataset was used to perform a classification experiment. The goal was to
determine whether an image in the dataset was of a scorpion pepper or not.
The experiment made use of a convolutional neural network for classification
and employed a technique known as transfer learning. A two-phase fine-tuning
process was used: in the first phase, only a new classification head of the network
is trained using our dataset; in the second phase, the head and some of the higher
layers of the convolutional base are retrained. VGG16 was
chosen as the classification network. For the first phase, K-Fold Cross Validation
was used to determine which split of the data provided the best representation
of the model performance. From this, the second phase of the fine-tuning process
was carried out. Other common evaluation metrics were also used, including
Precision, Recall, F1 score, and Accuracy.
CNN Setup and Configuration. The experiment was set up using an Ana-
conda environment with Python 3.6. The model was trained on a machine with
an Nvidia 2060 Super GPU with 8 GB of VRAM. The VGG16 network, pre-
trained on ImageNet, was loaded from the Keras API. A two-step fine-tuning
approach was used to adapt the network to our dataset.
The VGG16 classification layers are replaced with a new classifier, the structure
of which is given in Fig. 4. The new classifier uses smaller dense layers and
replaces the soft-max layer with a single output neuron using a sigmoid activation
function. Training is done in two phases:

1. Freeze the convolutional base of the network. Train the new fully connected
head.
2. Unfreeze block 5 of the VGG16 convolutional base. Retrain block 5 and the
fully connected head.
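The two phases above can be sketched with Keras roughly as follows. This is a hedged sketch, not the authors’ code: the dense-layer sizes of the new head (Fig. 4) are not reproduced here, so the 256-unit layer is a placeholder, while the loss, optimizer, and learning rate follow the training settings reported in the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(weights="imagenet"):
    # Phase 1: frozen VGG16 convolutional base plus a new, trainable head.
    base = keras.applications.VGG16(include_top=False, weights=weights,
                                    input_shape=(224, 224, 3))
    base.trainable = False
    model = keras.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(256, activation="relu"),   # head size assumed, not from Fig. 4
        layers.Dense(1, activation="sigmoid"),  # binary output replaces soft-max
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

def unfreeze_block5(model):
    # Phase 2: retrain block 5 of the base together with the head.
    base = model.layers[0]
    base.trainable = True
    for layer in base.layers:
        layer.trainable = layer.name.startswith("block5")
    # re-compile so the trainability change takes effect
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss="binary_crossentropy", metrics=["accuracy"])
```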

Fig. 4. Image showing new classification layers added to the standard VGG16 base
network

3.3 Data Preparation

The images were center cropped using a 900 × 900 box and then resized to
224 × 224 to match the default input size for VGG16 pre-trained on ImageNet.
Since there were ten images of each pepper, the images were grouped by pepper
to avoid the same fruit appearing in training, validation and test sets. Exactly 4
peppers, 40 images, were removed from the set of peppers that are not Scorpion
pepper. This was done to make the split even. The entire set was then randomly
split using an 80:20 ratio for training and evaluation sets. A breakdown of the
training and test set splits is shown in Table 1. The training set was then split
into k subsets for K-fold cross validation, where k = 10, with the same grouping
enforced so as to prevent images of the same fruit appearing in both training and
validation sets. A breakdown of images per fold is shown in Table 2.

Training. Binary cross-entropy was used as the loss function since it is a binary
classification problem. Adam was selected as the optimizer for backpropagation
with a learning rate of 0.0001. The training and validation data generators were
set to use a batch size of 20 images.
Table 1. Table showing breakdown of training and test splits of the dataset.

Class Test set Training set


Scorpion 420 1020
Not Scorpion 420 2340
Total 840 3360

Table 2. Table showing breakdown of folds for classification experiment

Fold Scorpion-Pepper Non-Scorpion-Pepper Total


0 100 230 330
1 100 230 330
2 100 230 330
3 100 230 330
4 100 230 330
5 100 230 330
6 100 230 330
7 100 230 330
8 100 230 330
9 120 270 390
Total 1020 2340 3360

1. Training steps per epoch = number of images in training set / batch size
2. Validation steps per epoch = number of images in validation set / batch size

The same settings were used for phase 1 and phase 2 of training. The model
is trained for 15 epochs in Phase 1 and 10 epochs in Phase 2.
A further pre-processing step is done using the ImageDataGenerator. The
pixel values of the images are re-scaled from the range of 0–255 to the range of
0–1. This is a common data normalization step for images. While this method
does not match the pre-processing function used by the pre-trained model, it
seemed to help reduce over-fitting in our testing. The pre-processing and fine-
tuning procedure is based on the example by Chollet [8], the creator of the Keras
library, in his book Deep Learning with Python.
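The steps-per-epoch formulas and the pixel rescaling can be illustrated with a small sketch. Rounding the final partial batch up is an assumption on our part; the paper states only the division formulas.

```python
import math

def steps_per_epoch(n_images, batch_size=20):
    # one pass over the data; the final partial batch is rounded up (assumed)
    return math.ceil(n_images / batch_size)

def rescale(pixels):
    """Map 8-bit pixel values from [0, 255] to [0.0, 1.0],
    mirroring a rescale factor of 1/255."""
    return [p / 255 for p in pixels]

print(steps_per_epoch(3360))  # full training set -> 168
print(rescale([0, 51, 255]))  # -> [0.0, 0.2, 1.0]
```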

4 Results
4.1 Classification Experiment Results
Table 3 shows the evaluation accuracy results after performing k-fold cross val-
idation. The average evaluation accuracy across all folds was roughly 98%.
Exactly 7 of the 10 folds were within 1 standard deviation of the mean. The
performance on the evaluation set suggested that the model was generalizing
Table 3. Table showing accuracy of the model for K-fold validation for classification
experiment. * means values within 1 standard deviation of the mean

Fold Train Acc Valid Acc Eval Acc


0 0.9884 0.9939 0.9893 *
1 0.9855 1.0000 0.9905
2 0.9858 0.9879 0.9810 *
3 0.9861 0.9848 0.9798 *
4 0.9891 0.9909 0.9845 *
5 0.9875 0.9879 0.9881 *
6 0.9908 0.9939 0.9738
7 0.9914 0.9606 0.9738
8 0.9878 0.9727 0.9881 *
9 0.9919 0.9744 0.9893 *
mean 0.9884 0.9847 0.9838
std 0.0022 0.0114 0.0061

well to unseen samples. The data in Table 4 shows the evaluation metrics of the
model. The model showed good results for classification of both classes. The
precision score for the “not scorpion” class was higher than that of the scorpion
pepper class. The opposite was true of the recall scores, and the F1 scores were
roughly the same. The model made more mistakes classifying the “not scorpion”
images as opposed to the images of scorpion peppers.
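These per-class metrics follow directly from a confusion matrix. The counts below are inferred to be consistent with Table 4 and are not reported in the paper: 416 of 420 Scorpion images and 409 of 420 “not scorpion” images classified correctly.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts inferred to be consistent with Table 4 (not reported in the paper)
p, r, f1 = prf(tp=416, fp=11, fn=4)            # "Scorpion" as the positive class
print(round(p, 4), round(r, 4), round(f1, 4))  # -> 0.9742 0.9905 0.9823
```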

Table 4. Table showing evaluation metrics for the classification experiment.

Class (P) precision (R) recall (F1) f1-score (S) support


Not Scorpion 0.9903 0.9738 0.9820 420
Scorpion 0.9742 0.9905 0.9823 420
macro average 0.9823 0.9821 0.9821 840
weighted average 0.9823 0.9821 0.9821 840

4.2 AMT Annotation Results


Most responses from the MTurkers were valid; however, approximately 2% of all
responses were not. The invalid responses contained either incomplete or com-
pletely infeasible annotations. They were rejected and redone by other MTurkers.
The MTurkers also took varying times to complete the task, but on average they
took around 50 s to annotate a single image. The annotation collection process
took three days to procure all of the required annotations. Each MTurker was
compensated $0.02 USD for a single pepper annotation, and they could complete
as many annotations as they desired. Considering that five annotations were col-
lected per image, the total cost of the annotation process was approximately
$627.60 USD. The most popular values reported by the five annotators for each
pepper’s attributes were selected as the final attribute values for that pepper.
Additional information on the MTurkers themselves was also collected. They
were asked about their birth country, residence country, native language and
other languages they speak. If they lived in a country that was not their birth
country, they were asked how long they lived in their new country. This was done
to capture any cultural differences that may have existed between MTurkers and
the way that they may annotate the peppers. A summary of the annotations is
shown in Table 5.

Table 5. Table showing results from annotation process broken down by attribute type
and pepper type. The numbers represent the count of peppers belonging to a landrace
for a specific attribute.

Attribute Scorpion Pimento Seven pot Yellow habanero Red habanero Moruga red Local red
Color
Red 455 133 198 0 102 307 75
Dark red 671 27 107 0 113 168 155
Orange 179 59 57 73 9 12 0
Dark green 0 0 6 7 24 236 0
Light green 37 200 4 1 0 8 0
Green 3 17 10 33 12 109 0
Yellow 2 11 57 75 0 1 0
Dark brown 0 0 138 0 0 0 0
Dark orange 22 14 33 1 10 6 0
Dark yellow 3 12 27 22 0 1 0
Light orange 7 8 6 18 0 0 0
Dark purple 0 0 24 0 0 1 0
Brown 0 0 23 0 0 0 0
Black 0 0 16 0 0 0 0
Light red 5 7 3 0 0 1 0
Light yellow 0 12 1 0 0 0 0
Light brown 0 0 6 0 0 0 0
Light purple 0 0 2 0 0 0 0
Purple 0 0 1 0 0 0 0
White 0 0 1 0 0 0 0
Shape
Almost round 62 2 233 153 230 744 132
Campanulate 566 117 397 64 36 91 64
Triangular 613 56 87 13 4 15 31
Elongate 143 325 3 0 0 0 3
Fruit surface
Semi wrinkled 791 243 271 97 88 375 143
Smooth 50 242 45 111 164 418 85
Wrinkled 543 15 404 22 18 57 2
Surface pebbling
Visible 1283 37 640 25 26 123 57
Not Visible 101 463 80 205 244 727 173
Folded pericarp
Not visible 648 359 262 151 208 693 193
Visible 736 141 458 79 62 157 37
Measuring Agreement Among Annotators. Fleiss’ Kappa and Gwet’s AC1
statistics were found for each attribute and for each type of pepper by using the
‘irrCAC’ package in R. These measures are used to assess the extent of agreement
between raters on nominally scaled data.
A 95% confidence interval and a p-value were found for each measure. The p-
value tests the hypothesis that the measure (Fleiss’ Kappa or Gwet’s AC1) is zero.

Table 6. Fleiss’ Kappa calculations for agreement among annotators

Attribute Type Fleiss’ Kappa Standard error 95% Confidence interval p-value
Color All 0.357 0.005 (0.348, 0.366) 0.00
Color Local red 0.140 0.024 (0.093, 0.188) 0.00
Color Moruga red 0.411 0.011 (0.39, 0.432) 0.00
Color Pimento 0.283 0.010 (0.262, 0.303) 0.00
Color Red habanero 0.290 0.020 (0.251, 0.33) 0.00
Color Scorpion 0.271 0.010 (0.251, 0.29) 0.00
Color Seven pot 0.274 0.009 (0.257, 0.291) 0.00
Color Yellow habanero 0.264 0.021 (0.223, 0.305) 0.00
Attribute Type Fleiss’ Kappa Standard error 95% Confidence interval p-value
Pebbling All 0.337 0.007 (0.322,0.351) 0.00
Pebbling Local red 0.058 0.025 (0.009, 0.107) 0.02
Pebbling Moruga red 0.055 0.013 (0.029, 0.081) 0.00
Pebbling Pimento 0.030 0.015 (−0.001, 0.06) 0.05
Pebbling Red habanero 0.026 0.022 (−0.018, 0.069) 0.25
Pebbling Scorpion 0.050 0.010 (0.03, 0.071) 0.00
Pebbling Seven pot 0.087 0.016 (0.055,0.119) 0.00
Pebbling Yellow Habanero 0.049 0.023 (0.003, 0.095) 0.04
Attribute Type Fleiss’ Kappa Standard error 95% Confidence interval p-value
Pericarp Fold All 0.096 0.006 (0.084, 0.108) 0.00
Pericarp Fold Local red 0.060 0.023 (0.016, 0.105) 0.01
Pericarp Fold Moruga red 0.029 0.013 (0.005, 0.054) 0.02
Pericarp Fold Pimento 0.048 0.017 (0.016,0.081) 0.00
Pericarp Fold Red habanero 0.035 0.023 (−0.01, 0.08) 0.12
Pericarp Fold Scorpion 0.047 0.010 (0.029, 0.066) 0.00
Pericarp Fold Seven pot 0.098 0.015 (0.069, 0.126) 0.00
Pericarp Fold Yellow habanero 0.000 0.021 (−0.042, 0.041) 1.00
Attribute Type Fleiss’ Kappa Standard error 95% Confidence interval p-value
Shape All 0.322 0.005 (0.312, 0.333) 0.00
Shape Local red 0.261 0.022 (0.218, 0.304) 0.00
Shape Moruga red 0.069 0.013 (0.044,0.094) 0.00
Shape Pimento 0.184 0.016 (0.153,0.214) 0.00
Shape Red Habanero 0.024 0.017 (−0.009,0.057) 0.15
Shape Scorpion 0.177 0.009 (0.16,0.195) 0.00
Shape Seven pot 0.167 0.013 (0.141,0.192) 0.00
Shape Yellow habanero 0.258 0.026 (0.207,0.309) 0.00
Attribute Type Fleiss’ Kappa Standard error 95% Confidence interval p-value
Fruit Surface All 0.141 0.005 (0.132,0.151) 0.00
Fruit Surface Local red 0.080 0.021 (0.039,0.122) 0.00
Fruit Surface Moruga red 0.056 0.009 (0.038,0.074) 0.00
Fruit Surface Pimento 0.084 0.014 (0.057,0.111) 0.00
Fruit Surface Red habanero 0.013 0.015 (−0.017,0.043) 0.40
Fruit Surface Scorpion 0.048 0.008 (0.033,0.063) 0.00
Fruit Surface Seven pot 0.065 0.011 (0.043,0.087) 0.00
Fruit Surface Yellow habanero 0.109 0.020 (0.07,0.148) 0.00
Fleiss’ Kappa sometimes yields low values when the ratings suggest high
levels of agreement. This is known as the kappa paradox, identified by [12],
as well as [9]. Gwet [13] proposed an AC1 statistic as a more paradox-resistant
alternative to kappa measures. [32] concluded that Gwet’s AC1 statistic provides
a more stable inter-rater reliability coefficient than other kappa measures.

Table 7. Gwet’s AC1 calculations for agreement among annotators

Attribute Type Gwet’s AC1 Standard error 95% Confidence interval p-value
Color All 0.448 0.004 (0.44,0.456) 0.00
Color Local red 0.549 0.016 (0.517,0.58) 0.00
Color Moruga red 0.520 0.009 (0.501,0.538) 0.00
Color Pimento 0.374 0.012 (0.351,0.396) 0.00
Color Red habanero 0.481 0.016 (0.449,0.512) 0.00
Color Scorpion 0.489 0.007 (0.475,0.503) 0.00
Color Seven pot 0.328 0.009 (0.31,0.345) 0.00
Color Yellow habanero 0.364 0.018 (0.328,0.401) 0.00
Attribute Type Gwet’s AC1 Standard error 95% Confidence interval p-value
Pebbling All 0.339 0.008 (0.324,0.354) 0.00
Pebbling Local red 0.236 0.036 (0.166,0.306) 0.00
Pebbling Moruga red 0.409 0.019 (0.372,0.446) 0.00
Pebbling Pimento 0.527 0.023 (0.482,0.571) 0.00
Pebbling Red habanero 0.478 0.032 (0.415,0.54) 0.00
Pebbling Scorpion 0.558 0.013 (0.531,0.584) 0.00
Pebbling Seven pot 0.503 0.020 (0.464,0.542) 0.00
Pebbling Yellow habanero 0.462 0.036 (0.391,0.533) 0.00
Attribute Type Gwet’s AC1 Standard error 95% Confidence interval p-value
Pericarp Fold All 0.408 0.004 (0.4,0.416) 0.00
Pericarp Fold Local red 0.398 0.038 (0.324,0.472) 0.00
Pericarp Fold Moruga red 0.266 0.019 (0.23,0.303) 0.00
Pericarp Fold Pimento 0.152 0.021 (0.11,0.194) 0.00
Pericarp Fold Red habanero 0.201 0.031 (0.14,0.262) 0.00
Pericarp Fold Scorpion 0.366 0.006 (0.353,0.379) 0.00
Pericarp Fold Seven pot 0.151 0.017 (0.118,0.185) 0.00
Pericarp Fold Yellow habanero 0.061 0.026 (0.009,0.113) 0.02
Attribute Type Gwet’s AC1 Standard error 95% Confidence interval p-value
Shape All 0.401 0.005 (0.391,0.41) 0.00
Shape Local red 0.395 0.024 (0.349,0.442) 0.00
Shape Moruga red 0.498 0.012 (0.476,0.521) 0.00
Shape Pimento 0.405 0.018 (0.371,0.44) 0.00
Shape Red habanero 0.473 0.019 (0.436,0.511) 0.00
Shape Scorpion 0.307 0.008 (0.29,0.323) 0.00
Shape Seven pot 0.383 0.010 (0.363,0.403) 0.00
Shape Yellow habanero 0.463 0.024 (0.416,0.51) 0.00
Attribute Type Gwet’s AC1 Standard error 95% Confidence interval p-value
Fruit Surface All 0.171 0.005 (0.162,0.18) 0.00
Fruit Surface Local red 0.264 0.021 (0.222,0.306) 0.00
Fruit Surface Moruga red 0.156 0.010 (0.136,0.177) 0.00
Fruit Surface Pimento 0.225 0.014 (0.196,0.253) 0.00
Fruit Surface Red habanero 0.152 0.019 (0.114,0.19) 0.00
Fruit Surface Scorpion 0.216 0.008 (0.2,0.232) 0.00
Fruit Surface Seven pot 0.230 0.013 (0.204,0.256) 0.00
Fruit Surface Yellow habanero 0.183 0.022 (0.14,0.227) 0.00
In general, the agreement was shown to range between slight and moderate.
The results can be seen in Tables 6 and 7. Some attributes are easier to identify
than others, so we expect to see lower agreement for more subjective attributes such as fruit
surface and the presence of a folded pericarp. Illustrations were provided as
examples during annotation; however, these attributes still showed the lowest agreement.
This is a known problem in data collection tasks that involve human subjects as
human generated annotations can vary significantly [7,14].
Attributes such as surface pebbling are easier to identify, especially for vari-
eties such as the Pimento which rarely exhibits this trait. The results for agree-
ment for pebbling on the Pimento images show the kappa paradox. Gwet’s AC1
gives a better measure of agreement. This is also observed for other attributes.
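The paper computes these statistics with the ‘irrCAC’ package in R; a minimal pure-Python version for the unweighted nominal case is sketched below. The toy data illustrate the kappa paradox discussed above: near-total agreement on a rare attribute yields a slightly negative Fleiss’ Kappa, while Gwet’s AC1 stays high.

```python
def agreement_stats(counts, q):
    """counts: per-subject dicts {category: n_raters}; q: number of categories.
    Returns (fleiss_kappa, gwet_ac1) for unweighted nominal ratings."""
    n = len(counts)
    r = sum(counts[0].values())  # raters per subject (assumed constant)
    # observed agreement: average pairwise agreement per subject
    po = sum((sum(c * c for c in row.values()) - r) / (r * (r - 1))
             for row in counts) / n
    # category prevalences across all ratings
    totals = {}
    for row in counts:
        for cat, c in row.items():
            totals[cat] = totals.get(cat, 0) + c
    p = [t / (n * r) for t in totals.values()]
    pe_fleiss = sum(pj * pj for pj in p)
    pe_gwet = sum(pj * (1 - pj) for pj in p) / (q - 1)
    kappa = (po - pe_fleiss) / (1 - pe_fleiss)
    ac1 = (po - pe_gwet) / (1 - pe_gwet)
    return kappa, ac1

# Skewed toy data: 19 unanimous "not visible" items plus one 4-vs-1 split.
# Observed agreement is 98%, yet Fleiss' Kappa goes negative (the paradox).
rows = [{"not visible": 5}] * 19 + [{"not visible": 4, "visible": 1}]
kappa, ac1 = agreement_stats(rows, q=2)
print(round(kappa, 3), round(ac1, 3))  # -> -0.01 0.98
```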

5 Conclusion

In this paper we presented an image dataset of Caribbean Capsicum Chinense
fruit, annotated with information about the attributes that are visible in each
image. Annotations were collected using Amazon Mechanical Turk (AMT). We
determined the value of each attribute based on the most chosen value by the
annotators. We determined the agreement between annotators to be slight to
moderate using Fleiss’ Kappa and Gwet’s AC1, with attributes such as fruit
surface and presence of a folded pericarp showing the lowest agreement. This
dataset can be used as a starting point for machine learning and computer
vision applications in the regional hot pepper industry. Additionally, we give an
example of a classification experiment using only the images in the dataset. The
experiment showed that we could classify images of Scorpion peppers with 98%
accuracy in a binary classification task.
Some limitations include the fact that visual identification of specific landrace
varieties is difficult and there is some subjectivity in determining the value
of some attributes, e.g. color and shape, as seen in the agreement between the
AMT annotators. This method of data collection was not ideal as we wanted
to see stronger agreement. The dataset is also limited in size, as deep learning
applications require larger datasets. Determining the variety of the fruit was
difficult in some cases as suppliers did not know the exact names/varieties they
were growing. The binary classification example is a much simpler task than
multi-class classification. An experiment with more image data showing a multi-
class classification example would provide more value to this area of research.
Future work would include building a larger dataset and appending addi-
tional attribute annotations relating to other morphological features of the fruit.
A comparison between the annotators’ work and that of people with expertise
in identifying morphological traits would give us a better understanding of the
quality of the annotations. Additional experiments with fine-grained classifi-
cation and Generative Adversarial Networks are other areas worth exploring.
Finally, we see this as a starting point for building a Caribbean hot pepper
ontology.
References
1. Adams, H., Umaharan, P., Brathwaite, R., Mohammed, K.: Hot pepper production
manual for Trinidad and Tobago (2011)
2. Barth, R., IJsselmuiden, J., Hemming, J., Van Henten, E.J.: Data synthesis meth-
ods for semantic segmentation in agriculture: a Capsicum annuum dataset. Comput.
Electron. Agric. 144, 284–296 (2018)
3. Bharath, S.M., Cilas, C., Umaharan, P.: Fruit trait variation in a Caribbean
germplasm collection of aromatic hot peppers (Capsicum chinense Jacq.).
HortScience 48(5), 531–538 (2013)
4. Bharath, S.M.: Morphological characterisation of a Caribbean germplasm collec-
tion of Capsicum chinense Jacq. Master’s thesis, The University of the West Indies
(2012)
5. Bosland, P.W., Coon, D., Reeves, G.: Trinidad Moruga Scorpion pepper is the
world’s hottest measured chile pepper at more than two million Scoville heat
units. HortTechnology 22(4), 534–538 (2012)
6. CARDI: Genuine Caribbean hot pepper seed produced and sold by CARDI (2014)
7. Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L.:
Microsoft COCO captions: data collection and evaluation server (2015)
8. Chollet, F.: Deep Learning with Python. Manning Publications Company (2017)
9. Cicchetti, D.V., Feinstein, A.R.: High agreement but low kappa: II. Resolving the
paradoxes. J. Clin. Epidemiol. 43(6), 551–558 (1990)
10. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-
scale hierarchical image database. In 2009 IEEE Conference on Computer Vision
and Pattern Recognition, pp. 248–255 (2009)
11. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their
attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition,
pp. 1778–1785 (2009)
12. Feinstein, A.R., Cicchetti, D.V.: High agreement but low kappa: I. The problems
of two paradoxes. J. Clin. Epidemiol. 43(6), 543–549 (1990)
13. Gwet, K.L.: Computing inter-rater reliability and its variance in the presence of
high agreement. British J. Math. Stat. Psychol. 61(1), 29–48 (2008)
14. Hodosh, M., Young, P., Hockenmaier, J.: Framing image description as a ranking
task: Data, models and evaluation metrics. J. Artif. Int. Res. 47(1), 853–899 (2013)
15. Hou, S., Feng, Y., Wang, Z.: VegFru: A domain-specific dataset for fine-grained
visual categorization. In 2017 IEEE International Conference on Computer Vision
(ICCV), pp. 541–549 (2017)
16. International Plant Genetic Resources Institute IPGRI. Descriptors for Capsicum
(Capsicum Spp.) =: Descriptores Para Capsicum (Capsicum Spp.). IPGRI, Rome
(1995)
17. Krishna, R., et al.: Visual Genome: connecting language and vision using crowdsourced dense image annotations (2016)
18. Lampert, C.H., Nickisch, H., Harmeling, S.: Learning to detect unseen object
classes by between-class attribute transfer. In 2009 IEEE Conference on Computer
Vision and Pattern Recognition, pp. 951–958 (2009)
19. Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D.,
Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp.
740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1 48
20. Liu, Z., Luo, P., Qiu, S., Wang, X., Tang, X.: Deepfashion: Powering robust clothes
recognition and retrieval with rich annotations. In Proceedings of IEEE Conference
on Computer Vision and Pattern Recognition (CVPR) (2016)
Caribbean Hot Pepper Image Dataset 769

21. Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In
Proceedings of International Conference on Computer Vision (ICCV) (2015)
22. Minervini, M., Fischbach, A., Scharr, H., Tsaftaris, S.A.: Finely-grained anno-
tated datasets for image-based plant phenotyping. Pattern Recogn. Lett. 81, 80–89
(2016)
23. Minervini, M., Scharr, H., Tsaftaris, S.A.: Image analysis: the new bottleneck in
plant phenotyping [applications corner]. IEEE Signal Process. Mag. 32(4), 126–131
(2015)
24. Mureşan, H., Oltean, M.: Fruit recognition from images using deep learning. Acta
Universitatis Sapientiae, Informatica 10(1), 26–42 (2018)
25. Patterson, G., Hays, J.: Sun attribute database: Discovering, annotating, and rec-
ognizing scene attributes. In 2012 IEEE Conference on Computer Vision and Pat-
tern Recognition, pp. 2751–2758 (2012)
26. Patterson, G., Hays, J.: COCO Attributes: attributes for people, animals, and
objects. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS,
vol. 9910, pp. 85–100. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-
46466-4 6
27. Russakovsky, O., Fei-Fei, L.: Attribute learning in large-scale datasets. In: Kutu-
lakos, K.N. (ed.) ECCV 2010. LNCS, vol. 6553, pp. 1–14. Springer, Heidelberg
(2012). https://doi.org/10.1007/978-3-642-35749-7 1
28. Sa, I., Ge, Z., Dayoub, F., Upcroft, B., Perez, T., McCool, C.: Deepfruits: A fruit
detection system using deep neural networks. Sensors (Basel, Switzerland), vol.
16(8) (2016)
29. Sinha, A., Petersen, J.: Caribbean hot pepper production and post harvest manual
(2011)
30. Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The caltech-ucsd
birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute
of Technology (2011)
31. Welinder, P., et al.: Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology (2010)
32. Wongpakaran, N., Wongpakaran, T., Wedding, D., Gwet, K.L.: A comparison of
cohen’s kappa and gwet’s ac1 when calculating inter-rater reliability coefficients:
a study conducted with personality disorder samples. BMC Med. Res. Methodol.
13(1), 61 (2013)
33. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual
denotations: New similarity metrics for semantic inference over event descriptions.
Trans. Assoc. Comput. Linguist. 2, 67–78 (2014)
34. Zhang, Y., Lian, J., Fan, M., Zheng, Y.: Deep indicator for fine-grained classifica-
tion of banana’s ripening stages. EURASIP J. Image Video Process. 2018(1), 46
(2018)
35. Zhang, Y.-D., Dong, Z., Chen, X., Jia, W., Du, S., Muhammad, K., Wang, S.-
H.: Image based fruit category classification by 13-layer deep convolutional neural
network and data augmentation. Multimedia Tools Appl. 78(3), 3613–3632 (2017).
https://doi.org/10.1007/s11042-017-5243-3
36. Zheng, Y.-Y., Kong, J.-L., Jin, X.-B., Wang, X.-Y., Ting-Li, S., Zuo, M.: Cropdeep:
the crop vision dataset for deep-learning-based classification and detection in pre-
cision agriculture. Sensors 19(5), 1058 (2019)
37. Zitnick, C.L., Parikh, D.: Bringing semantics into focus using visual abstraction.
In 2013 IEEE Conference on Computer Vision and Pattern Recognition. IEEE
(2013)
A Prediction Model for Student Academic
Performance Using Machine Learning-Based
Analytics

Harjinder Kaur(B) and Tarandeep Kaur

School of Computer Applications, Lovely Professional University, Phagwara, Punjab, India


[email protected]

Abstract. The adoption of digitization in the education sector has led to transfor-
mational changes. The academic sector has become more digital, more extensive,
and more comprehensive but more complex as well. The topical advancements
include the rise of technology-driven learning, the use of digital learning plat-
forms, management systems, and technologies by students; the implementation
of artificial intelligence and machine learning approaches for improving student learning. In recent times, the application of machine learning to academics has spurred growth in the education sector, fostering novel areas such as Academic Data Mining (ADM) or Education Data Mining (EDM).
ADM, based on machine learning techniques, helps in the prediction of students’ academic performance and is of interest to many academic institutions for classifying their students according to their learning capabilities.
Moreover, the enormous amount of data about student academics can be handled,
pre-processed, analyzed, and transformed into meaningful results and interest-
ing patterns. The resulting patterns help in analyzing the academic performance
of students and further lead to the identification of students who require special
counseling. This paper proposes a model that predicts the performance of students
based on academic details that helps in the classification of different learners.

Keywords: Academic data mining · Decision tree · Naïve Bayes · Performance prediction

1 Introduction
In recent times, Machine Learning (ML) techniques have been used for decision making in many prominent areas, among which education is of utmost importance. ML supports academic institutions by predicting the academic performance of their students. Furthermore, it enables instructors to differentiate between good and poor performers based on their predicted academic performance [1–3].
ML systems comprise several components: the input features, the representation of the data, the type of output, and the feedback used during learning. Variation arises in the type of data available, the features extracted from the data, the output needed, and the algorithm used to obtain the learning model [4].
Figure 1 represents the components of machine learning.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 770–775, 2023.
https://doi.org/10.1007/978-3-031-18461-1_50

[Figure: the components of machine learning — data, output, feedback, and training experience]

Fig. 1. Machine learning components

Learning Analytics is grounded in the massive amount of academic data collected from students’ past academic performance, which is used to predict their future performance [14]. At present, numerous machine learning algorithms are available for predicting student educational performance; the algorithm we propose in our model is an ensemble of Decision Tree and Naïve Bayes. The proposed model identifies the probable concerns that are responsible for degrading students’ performance. Figure 2 illustrates the steps to be followed to assess and analyze the academic performance of the student.
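The paper names an ensemble of Decision Tree and Naïve Bayes but gives no implementation details. As a minimal illustrative sketch of the Naïve Bayes component only, the categorical features below (attendance, internal marks, assignment submission) and the pass/at-risk labels are hypothetical:

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Fit a categorical Naive Bayes model: class priors plus per-feature
    value counts, kept per (feature index, class) pair."""
    priors = Counter(labels)
    counts = defaultdict(Counter)   # counts[(i, y)][v] = #rows of class y with feature i == v
    values = defaultdict(set)       # distinct values observed per feature
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            counts[(i, y)][v] += 1
            values[i].add(v)
    return priors, counts, values

def predict(model, row):
    """Return the class maximizing prior * product of smoothed likelihoods."""
    priors, counts, values = model
    n = sum(priors.values())
    scores = {}
    for y, c in priors.items():
        p = c / n
        for i, v in enumerate(row):
            # Laplace smoothing avoids zero probabilities for unseen values
            p *= (counts[(i, y)][v] + 1) / (c + len(values[i]))
        scores[y] = p
    return max(scores, key=scores.get)

# Hypothetical student records: (attendance, internal marks, assignments submitted)
rows = [("high", "good", "yes"), ("high", "avg", "yes"), ("low", "poor", "no"),
        ("low", "avg", "no"), ("high", "good", "yes"), ("low", "poor", "yes")]
labels = ["pass", "pass", "at-risk", "at-risk", "pass", "at-risk"]

model = train(rows, labels)
print(predict(model, ("low", "poor", "no")))   # at-risk
```

A decision tree trained on the same rows could be combined with this classifier by majority vote to mirror the proposed ensemble.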
Exclusively, the proposed model not only predicts the students’ academic performance but also has a recommender component that assists students or learners in the selection of courses so as to improve their academic performance. The recommender component is unique in its intelligence in considering both the student’s course choices and the academic performance predictions. These choice-oriented and performance-oriented incorporations make the proposed model the best alternative among existing models for performance prediction and recommendation.

Fig. 2. Strategic flow diagram for assessing student performance analytics for getting the result

2 Role of Learning Analytics in Academic Data


In the field of education, Learning Analytics (LA) plays a very prominent role. LA aims to inform academic institutions about their students’ academic patterns and to assist them in adopting strategies and pedagogies that promote academic performance [5, 6]. Both learners and tutors benefit considerably from LA: instructors can identify weak learners as well as the counterproductive concerns that have affected a student’s performance, while students or learners can carry out their own self-assessment.
LA uses machine learning techniques to identify courses hampering the academic performance of students. Prior to the prediction process, students’ academic data is collected and serves as input to the proposed model. The predicted results then help students to self-assess their progress and focus on their weak areas [3]. In addition, the predicted performance assists the instructor in guiding students in their weak areas, which further leads to performance improvement [7].
Machine learning is extensively used in LA. Machine learning algorithms make LA intelligent and proactive in determining the prospective academic performance of students, because they can be trained from past experience [10]. If a pattern of behavior exists in the past, then machine learning algorithms can predict whether the same pattern or behavior will occur again; if there is no historic data, no predictions can be made. Machine learning algorithms run iteratively on large datasets to evaluate the different patterns in the data and let the machine respond to situations for which it has not been explicitly prepared. To deliver consistent results, the machines learn from the historic data [11].
The results generated by the learning analytics process are used both by instructors and by students, as represented in Fig. 3. Instructors use intelligent learning analytics in their teaching to improve the learning experience of students. The fundamental idea of intelligent learning analytics is to identify students at risk and provide them with timely intervention based on their academic results. Early detection of at-risk students helps higher education institutions reduce dropout rates and increase retention rates [8, 9].

Fig. 3. Machine learning-based learning analytics process

[Figure: Data Collection from different institutions → Student Academic Analytics → Learning Analytics on collected data → Predictive Analytics → Action Observation → Result Analysis]

Fig. 4. Strategic flow diagram for assessing student performance analytics

The very first step in performance prediction analytics is the availability of students’ academic data. Figure 4 shows that academic data can be collected from different sources. Learning analytics is then applied to the gathered data, which helps in the analysis of students and their academic records. Predictive analytics is subsequently applied to predict the academic performance of the students. The prediction results are used by both learners and tutors so that appropriate actions can be taken to improve the performance of weak students. The last step is result analysis, which helps in the validation of the generated results.

3 Assessment Measures of an ML-Based Learning Analytics Process
In machine learning, different parameters are used to measure the performance of classification models: accuracy, precision, recall, and F-measure. Accuracy is defined as the number of correctly classified examples over the total number of examples. Precision is the fraction of instances predicted as positive that are actually positive. Recall is the fraction of actual positive instances that are recognized correctly. These measures can be expressed as [12, 13]:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
A confusion matrix is used to quantify the accuracy of machine learning algorithms used for prediction; it is depicted in Table 1 [15].

Table 1. Confusion matrix

                 Predicted positive      Predicted negative
Actual positive  True positive (TP)     False negative (FN)
Actual negative  False positive (FP)    True negative (TN)
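The three measures can be read directly off the counts in Table 1; the short sketch below uses made-up counts purely for illustration:

```python
def metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F-measure (harmonic mean of precision and recall), also mentioned in the text
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Example: 40 true positives, 45 true negatives, 5 false positives, 10 false negatives
acc, prec, rec, f1 = metrics(40, 45, 5, 10)
print(round(acc, 2), round(prec, 2), round(rec, 2))  # 0.85 0.89 0.8
```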

4 Conclusion and Future Directions


Recently, learning analytics supported by machine learning approaches has been extensively implemented in the education sector. It involves pre-processing and analyzing collected student academic data and classifying students into slow and fast learners using learning analytical models. This paper proposes a learning analytical model that intends to improve the academic performance of learners at risk. The model works as an alarm system for weak performers and helps tutors make quick decisions to improve their academic performance. Consequently, it significantly impacts the performance of slow learners identified by the model, whose status is communicated to their teachers. The teachers can then provide valuable suggestions and inputs to improve such students’ academic performance. Moreover, identifying slow learners at an early stage benefits institutions by increasing their retention rate through timely corrective actions. Institutions can leverage the capabilities of the proposed analytical model to bridge the gap between students’ learning capabilities and behavior and the teaching potential of instructors. Overall, the proposed model is beneficial to institutions, instructors, and students alike. In the future, the model can be augmented with additional course recommendation abilities that will help students select appropriate courses according to their performance.

References
1. Enughwure, A.A., Ogbise, M.E.: Application of machine learning methods to predict student
performance: a systematic literature review. Int. Res. J. Eng. Technol. 7(05), 3405–3415
(2020)
2. Albreiki, B., Zaki, N., Alashwal, H.: A systematic literature review of students’ performance
prediction using machine learning techniques. Educ. Sci. 11(9), 552 (2021)
3. Bhutto, E.S., Siddiqui, I.F., Arain, Q.A., Anwar, M.: Predicting students’ academic per-
formance through supervised machine learning. In: 2020 International Conference on
Information Science and Communication Technology (ICISCT), pp. 1–6. IEEE, February
2020
4. King, D.E.: Dlib-ml: a machine learning toolkit. J. Mach. Learn. Res. 10, 1755–1758 (2009)
5. Lemay, D.J., Baek, C., Doleck, T.: Comparison of learning analytics and educational data
mining: a topic modeling approach. Comput. Educ. Artif. Intell. 2, 100016 (2021)
6. Namoun, A., Alshanqiti, A.: Predicting student performance using data mining and learning
analytics techniques: a systematic literature review. Appl. Sci. 11(1), 237 (2020)
7. Guo, B., Zhang, R., Xu, G., Shi, C., Yang, L.: Predicting students performance in educational
data mining. In: 2015 International Symposium on Educational Technology (ISET), pp. 125–
128. IEEE, July 2015
8. Akçapınar, G., Altun, A., Aşkar, P.: Using learning analytics to develop early-warning system
for at-risk students. Int. J. Educ. Technol. High. Educ. 16(1), 1–20 (2019). https://doi.org/10.
1186/s41239-019-0172-z
9. Miguéis, V.L., Freitas, A., Garcia, P.J., Silva, A.: Early segmentation of students according
to their academic performance: a predictive modelling approach. Decis. Support Syst. 115,
36–51 (2018)
10. Aldowah, H., Al-Samarraie, H., Fauzy, W.M.: Educational data mining and learning analytics
for 21st century higher education: a review and synthesis. Telematics Inform. 37, 13–49 (2019)
11. Chuan, Y.Y., Husain, W., Shahiri, A.M.: An exploratory study on students’ performance
classification using hybrid of decision tree and Naïve Bayes approaches. In: Akagi, M.,
Nguyen, T.T., Vu, D.T., Phung, T.N., Huynh, V.N. (eds.) ICTA 2016. AISC, vol. 538, pp. 142–
152. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-49073-1_17
12. Al Breiki, B., Zaki, N., Mohamed, E.A.: Using educational data mining techniques to pre-
dict student performance. In 2019 International Conference on Electrical and Computing
Technologies and Applications (ICECTA), pp. 1–5. IEEE, November 2019
13. Baldi, P., Brunak, S., Chauvin, Y., Andersen, C.A., Nielsen, H.: Assessing the accuracy of
prediction algorithms for classification: an overview. Bioinformatics 16(5), 412–424 (2000)
14. Papamitsiou, Z., Economides, A.A.: Learning analytics and educational data mining in prac-
tice: a systematic literature review of empirical evidence. J. Educ. Technol. Soc. 17(4), 49–64
(2014)
15. Mueen, A., Zafar, B., Manzoor, U.: Modeling and predicting students’ academic performance using data mining techniques. Int. J. Mod. Educ. Comput. Sci. 8(11), 36–42 (2016)
Parameterized-NL Completeness
of Combinatorial Problems by Short
Logarithmic-Space Reductions
and Immediate Consequences
of the Linear Space Hypothesis

Tomoyuki Yamakami(B)

Faculty of Engineering, University of Fukui, 3-9-1 Bunkyo, Fukui 910-8507, Japan


[email protected]

Abstract. The concept of space-bounded computability has become
significantly important in handling vast data sets on memory-limited
computing devices. To replenish the existing short list of NL-complete
problems whose instance sizes are dictated by log-space size parameters,
we propose new additions obtained directly from natural parameteri-
zations of three typical NP-complete problems—the vertex cover prob-
lem, the exact cover by 3-sets problem, and the 3-dimensional matching
problem. With appropriate restrictions imposed on their instances, the
proposed decision problems parameterized by appropriate size parame-
ters are proven to be equivalent in computational complexity to either
the parameterized 3-bounded 2CNF Boolean formula satisfiability prob-
lem or the parameterized degree-3 directed s-t connectivity problem by
“short” logarithmic-space reductions. Under the assumption of the lin-
ear space hypothesis, furthermore, none of the proposed problems can be
solved in polynomial time if the memory usage is limited to sub-linear
space.

Keywords: Parameterized decision problem · Linear space hypothesis · NL-complete problem · Sub-linear space · 2SAT · Vertex cover · Exact cover · Perfect matching

1 Background and New Challenges


1.1 Combinatorial Problems and NL-Completeness
Given a combinatorial problem, it is desirable, for practical reasons, to seek good algorithms that consume fewer computational resources to solve the problem, and it is therefore of great importance to identify the smallest amount of computational resources required to execute such algorithms. Of the various resources, we focus in this exposition on the smallest “memory space” used by an algorithm that runs within a certain reasonable “time span”.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 776–795, 2023.
https://doi.org/10.1007/978-3-031-18461-1_51

The study on the minimal memory space has attracted significant attention in
real-life circumstances at which we need to manage vast data sets for most net-
work users who operate memory-limited computing devices. It is therefore useful
in general to concentrate on the study of space-bounded computability within
reasonable execution time. In the past literature, special attention has been paid
to polynomial-time algorithms using logarithmic memory space and two corre-
sponding space-bounded complexity classes: L (deterministic logarithmic space)
and NL (nondeterministic logarithmic space).
In association with L and NL, various combinatorial problems have been
discussed by, for instance, Jones, Lien, and Laaser [8], Cook and McKenzie [4],
Àlvarez and Greenlaw [1], and Jenner [7]. Many graph properties, in particular,
can be algorithmically checked using small memory space. Using only logarithmic
space1 (cf. [1,10]), for instance, we can easily solve the problems of determining whether or not a given graph is a bipartite graph, a comparability graph, a chordal graph, an interval graph, or a split graph. On the contrary, the directed
s-t connectivity problem (DSTCON) and the 2CNF Boolean formula satisfiability
problem (2SAT) are known to be NL-complete [8] (together with the result of
[6,12]) and seem to be unsolvable using logarithmic space. To understand the
nature of NL better, it is greatly beneficial to study more interesting problems
that fall into the category of NL-complete problems.
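To make 2SAT concrete: besides the log-space nondeterministic algorithm, satisfiability of a 2CNF formula can be decided in linear time via its implication graph, in which each clause (a ∨ b) contributes the edges ¬a → b and ¬b → a, and the formula is unsatisfiable exactly when some variable shares a strongly connected component with its negation (the classic Aspvall–Plass–Tarjan criterion). The sketch below (Kosaraju-style, and deliberately not space-efficient) illustrates this criterion:

```python
def solve_2sat(n, clauses):
    """Decide satisfiability of a 2CNF formula over variables 1..n.
    Literals are signed ints (+i, -i). Clause (a or b) adds implication
    edges !a -> b and !b -> a; the formula is unsatisfiable iff some
    variable and its negation lie in the same strongly connected component."""
    def idx(lit):
        # node id in [0, 2n): 2(i-1) for x_i, 2(i-1)+1 for !x_i
        return 2 * (abs(lit) - 1) + (1 if lit < 0 else 0)

    adj = [[] for _ in range(2 * n)]
    radj = [[] for _ in range(2 * n)]
    for a, b in clauses:
        adj[idx(-a)].append(idx(b)); radj[idx(b)].append(idx(-a))
        adj[idx(-b)].append(idx(a)); radj[idx(a)].append(idx(-b))

    # Kosaraju pass 1: iterative DFS postorder on the implication graph
    order, seen = [], [False] * (2 * n)
    for s in range(2 * n):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, 0)]
        while stack:
            v, i = stack.pop()
            if i < len(adj[v]):
                stack.append((v, i + 1))
                w = adj[v][i]
                if not seen[w]:
                    seen[w] = True
                    stack.append((w, 0))
            else:
                order.append(v)

    # Pass 2: label SCCs on the reversed graph in reverse postorder
    comp, c = [-1] * (2 * n), 0
    for s in reversed(order):
        if comp[s] != -1:
            continue
        comp[s] = c
        stack = [s]
        while stack:
            v = stack.pop()
            for w in radj[v]:
                if comp[w] == -1:
                    comp[w] = c
                    stack.append(w)
        c += 1

    return all(comp[2 * i] != comp[2 * i + 1] for i in range(n))

# (x1 v x2)(!x1 v x2)(x1 v !x2) is satisfiable by x1 = x2 = True
print(solve_2sat(2, [(1, 2), (-1, 2), (1, -2)]))   # True
# (x1)(!x1), written as duplicated-literal clauses, is unsatisfiable
print(solve_2sat(1, [(1, 1), (-1, -1)]))           # False
```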

1.2 Parameterization of Problems and the Linear Space Hypothesis

Given a target combinatorial problem, it is quite useful from a practical viewpoint to “parameterize” the problem by introducing an adequate “size parame-
ter” as a unit basis of measuring the total amount of computational resources,
such as runtime and memory space needed to solve the problem. As quick exam-
ples of size parameters, given a graph instance G, mver (G) and medg (G) respec-
tively denote the total number of the vertices of G and the total number of the
edges in G. For a CNF Boolean formula φ, in addition, mvbl (φ) and mcls (φ)
respectively express the total number of distinct variables in φ and the total
number of clauses in φ. Decision problems coupled with appropriately chosen
size parameters are generally referred to as parameterized decision problems. In
this exposition, we are particularly interested in NL problems L parameterized
by log-space (computable) size parameters m(x) of input x. Those parameterized
decision problems are succinctly denoted (L, m), in particular, to emphasize the
size parameter m(x). Their precise definition will be given in Sect. 2.2.
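For illustration, the size parameters mvbl and mcls are trivial to compute when a CNF formula is encoded as a list of clauses of signed integer literals — an encoding assumed here purely for the sketch:

```python
def m_vbl(phi):
    """Number of distinct variables in a CNF formula given as a list of
    clauses, each clause a tuple of nonzero signed integer literals."""
    return len({abs(lit) for clause in phi for lit in clause})

def m_cls(phi):
    """Number of clauses in the formula."""
    return len(phi)

# (x1 v !x2)(x2 v x3)(!x1 v x3): 3 variables, 3 clauses
phi = [(1, -2), (2, 3), (-1, 3)]
print(m_vbl(phi), m_cls(phi))  # 3 3
```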
Among all parameterized decision problems with log-space size parameters
m, we are focused on combinatorial problems L that can be solvable by appropri-
ately designed polynomial-time algorithms using only sub-linear space, where the
informal term “sub-linear” means O(m(x)1−ε polylog(|x|)) for an appropriately
chosen constant ε ∈ (0, 1] independent of x. All such parameterized decision
1
Those problems were proven to be in co-SL by Reif [10] and SL-complete by Àlvarez
and Greenlaw [1], where SL is the symmetric version of NL. Since SL = co-SL = L
by Reingold [11], nonetheless, all the problems are in L.

problems (L, m) form the complexity class PsubLIN [14] (see Sect. 2.2 for more
details). It is natural to ask if all NL problems parameterized by log-space size
parameters (or briefly, parameterized-NL problems) are solvable in polynomial
time using sub-linear space. To tackle this important question, we zero in on the
most difficult (or “complete”) parameterized-NL problems. As a typical exam-
ple, let us consider the 3-bounded 2SAT, denoted 2SAT3 , for which every variable
in a 2CNF Boolean formula φ appears at most 3 times in the form of literals,
parameterized by mvbl (φ) (succinctly, (2SAT3 , mvbl )). It was proven in [13] that
(2SAT3 , mvbl ) is complete for the class of all parameterized-NL problems.
Lately, a practical working hypothesis, known as the linear space hypoth-
esis [14], was proposed in connection to the computational hardness of
parameterized-NL problems. This linear space hypothesis (LSH) asserts that
(2SAT3 , mvbl ) cannot be solved by any polynomial-time algorithm using sub-
linear space. From the NL-completeness of 2SAT3 , LSH immediately derives
long-awaited complexity-class separations, including L ≠ NL and LOGDCFL ≠ LOGCFL, where LOGDCFL and LOGCFL are respectively the log-space many-
one closure of DCFL (deterministic context-free language class) and CFL
(context-free language class) [14]. Moreover, under the assumption of LSH, it
follows that 2-way non-deterministic finite automata are simulated by “narrow”
alternating finite automata [15].
Notice that the completeness notion requires “reductions” between two prob-
lems. The standard NL-completeness notion uses logarithmic-space (or log-
space) reductions. Those standard reductions, however, seem to be too pow-
erful to use in many real-life circumstances. Furthermore, PsubLIN is not even
known to be closed under the standard log-space reductions. Therefore, much
weaker reductions may be more suitable to discuss the computational hardness
of various real-life problems. A weaker notion, called “short” log-space reduc-
tions, was in fact invented and studied intensively in [13,14]. The importance of
such reductions is exemplified by the fact that PsubLIN is indeed closed under
short log-space reductions.

1.3 New Challenges in This Exposition

The key question of whether LSH is true may hinge on the intensive study of parameterized-NL “complete” problems. It is, however, unfortunate that only a few parameterized decision problems have been proven to be equivalent in com-
putational complexity to (2SAT3 , mvbl ) by short log-space reductions, and this
fact drives us to seek out new parameterized decision problems in this exposi-
tion in hope that we will eventually come to the point of proving the validity
of LSH. This exposition is therefore devoted to proposing a new set of problems
and proving their equivalence to (2SAT3 , mvbl ) by appropriate short log-space
reductions.
To replenish the existing short list of parameterized-NL complete problems,
after reviewing fundamental notions and notation in Sect. 2, we will propose
three new decision problems in NL, which are obtained by placing “natural”
restrictions on instances of the following three typical NP-complete combinatorial problems: the vertex cover problem (VC), the exact cover by 3-sets problem
(3XC), and the 3-dimensional matching problem (3DM) (refer to, e.g., [5] for
their properties). We will then set up their corresponding natural log-space size
parameters to form the desired parameterized decision problems.
In Sects. 3–5, we will prove that those new parameterized decision problems
are equivalent in computational complexity to (2SAT3 , mvbl ) by constructing
various types of short log-space reductions. Since (2SAT3 , mvbl ) is parameterized-
NL complete, so are all the three new problems. This completeness result imme-
diately implies that under the assumption of LSH, those problems cannot be
solved in polynomial time using only sub-linear space.

2 Fundamental Notions and Notation

We briefly describe basic notions and notation necessary to read through the
rest of this exposition.

2.1 Numbers, Sets, Graphs, Languages, and Machines

We denote by N the set of all natural numbers including 0, and denote by Z the
set of all integers. Let N+ = N − {0}. Given two numbers m, n ∈ Z with m ≤ n,
the notation [m, n]Z expresses the integer interval {m, m + 1, . . . , n}. We further
abbreviate [1, n]Z as [n] whenever n ≥ 1. All polynomials are assumed to take
non-negative coefficients and all logarithms are taken to the base 2. The informal
notion polylog(n) refers to an arbitrary polynomial in log n. Given a (column)
vector v = (a1 , a2 , . . . , ak )T (where “T ” is a transpose operator) and a number
i ∈ [k], the notation v(i) indicates the ith entry ai of v. For two (column) vectors
u and v of dimension n, the notation u ≥ v means that the inequality u(i) ≥ v(i)
holds for every index i ∈ [n]. A k-set refers to a set that consists of exactly k
distinct elements.
An alphabet is a finite nonempty set of “symbols” or “letters”. Given an
alphabet Σ, a string over Σ is a finite sequence of symbols in Σ. The length of
a string x, denoted |x|, is the total number of symbols in x. The notation Σ ∗
denotes the set of all strings over Σ. A language over Σ is a subset of Σ ∗ .
In this exposition, we will consider directed and undirected graphs and each
graph is expressed as (V, E) with a set V of vertices and a set E of edges. An edge
between two vertices u and v in a directed graph is denoted by (u, v), whereas
the same edge in an undirected graph is denoted by {u, v}. Two vertices are
called adjacent if there is an edge between them. When there is a path from
vertex u to vertex v, we succinctly write u  v. An edge of G is said to be a
grip if both of its endpoints have degree at most 2. Given a graph G = (V, E), we
set mver (G) = |V | and medg (G) = |E|. The following property is immediate.

Lemma 1. For any connected graph G whose degree is at most k, it follows that
mver (G) ≤ 2medg (G) and medg (G) ≤ kmver (G)/2.
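As a quick numerical sanity check of Lemma 1 (not a proof), both inequalities can be verified on a concrete connected graph of bounded degree:

```python
from collections import Counter

def lemma1_holds(edges, k):
    """Check mver(G) <= 2*medg(G) and medg(G) <= k*mver(G)/2 for a connected
    graph G given as an edge list, with k bounding the maximum degree."""
    deg, verts = Counter(), set()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
        verts.update((u, v))
    assert max(deg.values()) <= k, "degree bound violated"
    mver, medg = len(verts), len(edges)
    return mver <= 2 * medg and medg <= k * mver / 2

# A 4-cycle has degree exactly 2; here medg = k * mver / 2 holds with equality.
print(lemma1_holds([(1, 2), (2, 3), (3, 4), (4, 1)], k=2))  # True
```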

If a Boolean formula φ in the conjunctive normal form (CNF) contains n variables and m clauses, then we set mvbl (φ) = n and mcls (φ) = m as two
natural size parameters. For later convenience, we call a literal x in φ removable
if no clause in φ contains x (i.e. the negation of x). We say that φ is in a clean
shape if each clause of φ consists of literals whose variables are all different. An
exact 2CNF Boolean formula has exactly two literals in each clause.
As a model of computation, we use the notions of multi-tape deterministic
and nondeterministic Turing machines (or DTMs and NTMs, respectively), each
of which is equipped with one read-only input tape, multiple rewritable work
tapes, and (possibly) a write-once2 output tape such that, initially, each input
is written on the input tape surrounded by two endmarkers, c| (left endmarker)
and $ (right endmarker), and all the tape heads are stationed on the designated
“start cells”. Given two alphabets Σ1 and Σ2 , a function f from Σ1∗ to Σ2∗ (resp.,
from Σ1∗ to N) is computable in time t(n) using s(n) space if there exists a DTM
M such that, on each input x, M produces f (x) (resp., 1f (x) ) on the output
tape before it halts within t(|x|) steps with accesses to at most s(|x|) work-tape
cells (not counting the input-tape cells as well as the output-tape cells). We
freely identify a decision problem with its corresponding language. A decision
problem is defined to be computable within time t(n) using at most s(n) space
in a similar manner.

2.2 Parameterized Decision Problems and Short Reductions

Throughout this exposition, we target decision problems (equivalently, languages) that are parameterized by size parameters, which specify “sizes” (i.e.,
positive integers) of instances given to target problems and those sizes are used
as a basis to measuring computational complexities (such as execution time and
memory usage) of the problems. More formally, for any input alphabet Σ, a size
parameter is a map from Σ ∗ to N+ . The information on the instance size is
frequently used in solving problems, and thus it is natural to assume the easy
“computability” of the size. A size parameter m : Σ ∗ → N is said to be log-space
computable if it is computable using O(log |x|) space, where x is a symbolic
input. A parameterized decision problem is a pair (A, m) with a language A over
a certain alphabet Σ and a size parameter m mapping Σ ∗ to N+ .
For any parameterized decision problem (A, m), we say that (A, m) is com-
putable in polynomial time using sub-linear space if there exists a DTM that
solves A in time polynomial in |x|m(x) using O(m(x)1−ε polylog(|x|)) space,
where ε is a certain fixed constant in the real interval (0, 1]. A parameterized
decision problem (A, m) with log-space size parameter m is in PsubLIN if (A, m)
is computable in polynomial time using sub-linear space.
To discuss sub-linear-space computability, however, the standard log-space
many-one reductions (or L-m-reductions, for short) are no longer useful. For
instance, it is not known whether all NL-complete problems parameterized by natural
² A tape is write-once if its tape head never moves to the left and it must move to
the right whenever it writes a non-blank symbol.

log-space size parameters are equally difficult to solve in polynomial time using
sub-linear space. This is because PsubLIN is not yet known to be closed under
standard L-m-reductions. Fortunately, PsubLIN is proven to be closed under
slightly weaker reductions, called “short” reductions [13,14].
The short L-m-reducibility between two parameterized decision problems
(P1, m1) and (P2, m2) is given as follows: (P1, m1) is short L-m-reducible to
(P2, m2), denoted by (P1, m1) ≤^sL_m (P2, m2), if there is a polynomial-time,
logarithmic-space computable function f (called a reduction function)
and two constants k1, k2 > 0 such that, for any input string x, (i) x ∈ P1 iff
f(x) ∈ P2 and (ii) m2(f(x)) ≤ k1·m1(x) + k2. Instead of using f, if we use a
polynomial-time logarithmic-space oracle Turing machine M to reduce (P1, m1)
to (P2, m2) with the extra requirement of m2(z) ≤ k1·m1(x) + k2 for any query
word z made by M on input x for oracle P2, then (P1, m1) is said to be short
L-T-reducible to (P2, m2), denoted by (P1, m1) ≤^sL_T (P2, m2).
For any reduction ≤ in {≤^sL_m, ≤^sL_T}, we say that two parameterized decision
problems (P1, m1) and (P2, m2) are inter-reducible (to each other) by ≤-reductions
if both (P1, m1) ≤ (P2, m2) and (P2, m2) ≤ (P1, m1) hold; in this
case, we briefly write (P1, m1) ≡ (P2, m2).

Lemma 2. [14] Let (L1, m1) and (L2, m2) be two arbitrary parameterized decision
problems. (1) If (L1, m1) ≤^sL_m (L2, m2), then (L1, m1) ≤^sL_T (L2, m2). (2) If
(L1, m1) ≤^sL_T (L2, m2) and (L2, m2) ∈ PsubLIN, then (L1, m1) ∈ PsubLIN.

2.3 The Linear Space Hypothesis or LSH

One of the first problems that were proven to be NP-complete in the past
literature is the 3CNF Boolean formula satisfiability problem (3SAT), which asks
whether or not a given 3CNF Boolean formula φ is satisfiable [3]. In sharp
contrast, its natural variant, called the 2CNF Boolean formula satisfiability
problem (2SAT), was proven to be NL-complete [8] (together with the results
of [6,12]). Let us further consider its natural restriction introduced in [14]. Let
k ≥ 2.
k-Bounded 2CNF Boolean Formula Satisfiability Problem
(2SATk ):
◦ Instance: a 2CNF Boolean formula φ whose variables occur at most k times
each in the form of literals.
◦ Question: is φ satisfiable?

As natural log-space size parameters for the decision problem 2SATk , we use
the aforementioned size parameters mvbl (φ) and mcls (φ).
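To make the satisfiability question concrete, the following sketch decides 2SAT by the standard implication-graph method (build the implications ¬a → b and ¬b → a for each clause a ∨ b, then check that no variable shares a strongly connected component with its negation). This is a polynomial-time, linear-space illustration — well above the sub-linear-space budget studied in this paper — and the signed-integer literal encoding is our own convention, not the paper's.

```python
def two_sat(n, clauses):
    """Decide satisfiability of a 2CNF formula with n variables.

    A literal +i means variable x_{i-1}; -i means its negation.
    Each clause (a, b) contributes implications (not a -> b), (not b -> a).
    """
    def node(lit):  # x_i -> vertex 2i, its negation -> vertex 2i+1
        return 2 * (abs(lit) - 1) + (1 if lit < 0 else 0)

    adj = [[] for _ in range(2 * n)]
    for a, b in clauses:
        adj[node(-a)].append(node(b))
        adj[node(-b)].append(node(a))

    # Kosaraju: record DFS finish order, then sweep the reverse graph.
    order, seen = [], [False] * (2 * n)
    for s in range(2 * n):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, 0)]
        while stack:
            v, i = stack.pop()
            if i < len(adj[v]):
                stack.append((v, i + 1))
                w = adj[v][i]
                if not seen[w]:
                    seen[w] = True
                    stack.append((w, 0))
            else:
                order.append(v)  # v is finished

    radj = [[] for _ in range(2 * n)]
    for v in range(2 * n):
        for w in adj[v]:
            radj[w].append(v)
    comp = [-1] * (2 * n)
    c = 0
    for s in reversed(order):
        if comp[s] != -1:
            continue
        comp[s] = c
        stack = [s]
        while stack:
            v = stack.pop()
            for w in radj[v]:
                if comp[w] == -1:
                    comp[w] = c
                    stack.append(w)
        c += 1
    # Unsatisfiable iff some variable lies in the same SCC as its negation.
    return all(comp[2 * v] != comp[2 * v + 1] for v in range(n))
```

Restricting each variable to at most k literal occurrences (as in 2SATk) does not change this algorithm; it only bounds the degree of the implication graph.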
Unfortunately, not all NL-complete problems are proven to be inter-reducible
to one another by short log-space reductions. An example of NL-complete prob-
lems that are known to be inter-reducible to (2SAT3 , mvbl ) is a variant of the
directed s-t-connectivity problem whose instance graphs have only vertices of
degree at most k (kDSTCON) for any number k ≥ 3.

Degree-k Directed s-t Connectivity Problem (kDSTCON):


◦ Instance: a directed graph G = (V, E) of degree at most k and two vertices
s, t ∈ V
◦ Question: is there any simple path in G from s to t?
Given a graph G with n vertices and m edges, we set mver(⟨G, s, t⟩) = n and
medg(⟨G, s, t⟩) = m as natural log-space size parameters.
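kDSTCON is plainly in NL, and Savitch's theorem turns the nondeterministic log-space algorithm into a deterministic O(log² n)-space one via midpoint recursion. A minimal Python sketch of that idea (our own encoding: `adj` maps each vertex to its successor set; Python's bookkeeping overheads are ignored, and the price of the small space bound is quasi-polynomial running time):

```python
def reachable(adj, s, t):
    """Savitch-style test: is there a directed path from s to t?

    reach(u, v, steps) asks for a path of length at most `steps` by
    guessing every possible midpoint, halving the budget each time.
    Recursion depth is O(log n), each frame stores O(log n) bits.
    """
    n = len(adj)

    def reach(u, v, steps):
        if u == v:
            return True
        if steps == 0:
            return False
        if steps == 1:
            return v in adj[u]
        half = steps // 2
        # Re-enumerate all midpoints: trades time for space.
        return any(reach(u, w, half) and reach(w, v, steps - half)
                   for w in range(n))

    return reach(s, t, n)
```

The degree bound k of kDSTCON plays no role in this sketch; it matters only for the short reductions discussed next.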
Lemma 3. [14] Let k ≥ 3 be any integer. (1) (2SATk, mvbl) is inter-reducible
to (2SATk, mcls) and also to (2SAT3, mvbl) by short L-m-reductions.
(2) (kDSTCON, mver) is inter-reducible to (kDSTCON, medg) and further to
(3DSTCON, mver) by short L-m-reductions. (3) (3DSTCON, mver) is inter-reducible
to (2SAT3, mvbl) by short L-T-reductions.
Notice that we do not know whether we can replace the short L-T-reductions
in Lemma 3(3) by short L-m-reductions. This exemplifies a subtle difference
between ≤^sL_T and ≤^sL_m.

Definition 1. The linear space hypothesis (LSH) asserts, as noted in Sect. 1.2,
the insolvability of the specific parameterized decision problem (2SAT3 , mvbl )
within polynomial time using only sub-linear space.
In other words, LSH asserts that (2SAT3, mvbl) ∉ PsubLIN. Note that, since
PsubLIN is closed under short L-T-reductions by Lemma 2(2), if a parameterized
decision problem (A, m) satisfies (A, m) ≡^sL_T (2SAT3, mvbl), we can freely replace
(2SAT3, mvbl) in the definition of LSH by (A, m). The use of short L-T-reductions
here can be relaxed to the much weaker notion of short SLRF-T-reductions [13,14].

2.4 Linear Programming and Linear Equations


As a basis of later NL-completeness proofs, we recall a combinatorial problem
of Jones, Lien, and Laaser [8], who studied a problem of determining whether or
not there exists a {0, 1}-solution to a given set of linear programs, provided that
each linear program (i.e., a linear inequality) has at most 2 nonzero coefficients.
When each variable further has at most k nonzero coefficients in the entire
linear programs, the corresponding problem is called the (2, k)-entry {0, 1}-linear
programming problem [14], which is formally described as below.
(2, k)-Entry {0, 1}-Linear Programming Problem (LP2,k ):
◦ Instance: a rational m × n matrix A and a rational (column) vector b of
dimension m, where each row of A has at most 2 nonzero entries and each
column has at most k nonzero entries.
◦ Question: is there any {0, 1}-vector x for which Ax ≥ b?
For practicality, all entries in A are assumed to be expressed in binary using
O(log n) bits. For any instance x of the form ⟨A, b⟩ given to LP2,k, we use the two
log-space size parameters defined as mrow(x) = m and mcol(x) = n.
It is known that, for any index k ≥ 3, the parameterized decision problem
(LP2,k , mrow ) is inter-reducible to (LP2,k , mcol ) and further to (2SAT3 , mvbl ) by
short L-m-reductions [14].

Lemma 4. [14] The following parameterized decision problems are all inter-
reducible to one another by short L-m-reductions: (LP2,k , mrow ), (LP2,k , mcol ),
and (2SAT3 , mvbl ) for every index k ≥ 3.

We can strengthen the requirement of the form Ax ≥ b in LP2,k as follows.
Consider another variant of LP2,k, in which we ask whether or not b1 ≥ Ax ≥ b2
holds for a certain {0, 1}-vector x for any given matrix A and two (column)
vectors b1 and b2.
Bidirectional (2, k)-Entry {0, 1}-Linear Programming Problem
(2LP2,k):
◦ Instance: a rational m × n matrix A and two rational vectors b1 and b2 of
dimension m, where each row of A has at most 2 nonzero entries and each
column has at most k nonzero entries.
◦ Question: is there any {0, 1}-vector x for which b1 ≥ Ax ≥ b2?

Proposition 1. For any index k ≥ 3, (2LP2,k, mcol) ≡^sL_m (LP2,k, mcol).

Proof. The reduction (LP2,k, mcol) ≤^sL_m (2LP2,k, mcol) is easy to verify by setting
b2 = b and b1 = (b′i)i with b′i = max{|aij1| + |aij2| : j1, j2 ∈ [n], j1 < j2} for
any instance pair A = (aij)ij and b = (bi)i given to LP2,k. Since the description
size of (A, b1, b2) is proportional to that of (A, b), the reduction is indeed "short".
To verify the opposite reducibility (2LP2,k, mcol) ≤^sL_m (LP2,k, mcol), it
suffices to prove that (2LP2,k, mcol) ≤^sL_m (LP2,k+2, mcol) since (LP2,l, mcol) ≡^sL_m
(LP2,3, mcol) for any l ≥ 3 by Lemma 4. Take an arbitrary instance (A, b, b′)
given to 2LP2,k, which asks whether b′ ≥ Ax ≥ b for some {0, 1}-vector x, and
assume that A = (aij)ij is an m × n matrix and b = (bi)i and b′ = (b′i)i are
two (column) vectors of dimension m. We wish to reduce (A, b, b′) to an
appropriate instance (D, c) for LP2,k+2, where D = (dij)ij is a (2m + 2n) × 2n
matrix and c = (ci)i is a (2m + 2n)-dimensional vector. For all index pairs
i ∈ [m] and j ∈ [n], let dij = aij, dm+i,n+j = −aij, and di,n+j = dm+i,j = 0.
For all indices i ∈ [m], let ci = bi and cm+i = −b′i. Moreover, for any index
j ∈ [n], we set d2m+j,j = 1, d2m+j,n+j = −1, and c2m+j = 0. In addition, we set
d2m+n+j,j = −1, d2m+n+j,n+j = 1, and c2m+n+j = 0; these last 2n rows force
yj = yn+j for every {0, 1}-solution y. Notice that each column of D has at most
k + 2 nonzero entries and each row of D has at most 2 nonzero entries.
Let x = (xj)j denote a {0, 1}-vector of dimension n for A and let y = (yj)j
denote a {0, 1}-vector of dimension 2n for D satisfying yj = xj and yn+j = xj
for any j ∈ [n]. It then follows that the inequality Σ_{j=1}^{2n} dij yj ≥ ci is
equivalent to Σ_{j=1}^{n} aij xj ≥ bi. Furthermore, Σ_{j=1}^{2n} dm+i,j yj ≥ cm+i
is equivalent to −Σ_{j=1}^{n} aij xj ≥ −b′i, which is the same as Σ_{j=1}^{n} aij xj ≤ b′i.
Therefore, we conclude that b′ ≥ Ax ≥ b iff Dy ≥ c. In other words, it follows that
(A, b, b′) ∈ 2LP2,k iff (D, c) ∈ LP2,k+2. □
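The matrix construction in this proof can be sanity-checked by brute force on tiny instances. The sketch below follows our reconstruction of D above — 2m rows for the two one-sided systems plus 2n rows forcing y_{n+j} = y_j — and all helper names are ours:

```python
from itertools import product

def to_one_sided(A, b_low, b_up):
    """Build (D, c) so that some y in {0,1}^{2n} satisfies Dy >= c
    iff some x in {0,1}^n satisfies b_low <= Ax <= b_up."""
    m, n = len(A), len(A[0])
    D = [[0] * (2 * n) for _ in range(2 * m + 2 * n)]
    c = [0] * (2 * m + 2 * n)
    for i in range(m):
        for j in range(n):
            D[i][j] = A[i][j]           # rows 0..m-1:    A x >= b_low
            D[m + i][n + j] = -A[i][j]  # rows m..2m-1:  -A x >= -b_up
        c[i], c[m + i] = b_low[i], -b_up[i]
    for j in range(n):                  # rows forcing y_j = y_{n+j}
        D[2 * m + j][j], D[2 * m + j][n + j] = 1, -1
        D[2 * m + n + j][j], D[2 * m + n + j][n + j] = -1, 1
    return D, c

def solvable_geq(D, c):
    cols = len(D[0])
    return any(all(sum(r[j] * y[j] for j in range(cols)) >= ci
                   for r, ci in zip(D, c))
               for y in product((0, 1), repeat=cols))

def solvable_between(A, b_low, b_up):
    n = len(A[0])
    return any(all(bl <= sum(r[j] * x[j] for j in range(n)) <= bu
                   for r, bl, bu in zip(A, b_low, b_up))
               for x in product((0, 1), repeat=n))
```

On any small instance, `solvable_between(A, b, b')` and `solvable_geq(*to_one_sided(A, b, b'))` should agree, mirroring the equivalence established in the proof.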

As a special case of 2LP2,k obtained by restricting its instances to the form
(A, b1, b2) with b1 = b2, it is possible to consider the decision problem of asking
whether or not Ax = b holds for an appropriately chosen {0, 1}-vector x. We call
this new problem the (2, k)-entry {0, 1}-linear equation problem (LE2,k). As shown

in Lemma 5, LE2,k falls into L, and thus this fact signifies how narrow the gap
between NL and L is. For the proof of the lemma, we define the exclusive-or
clause (or the ⊕-clause) of two literals x and y to be the formula x ⊕ y. The
problem ⊕2SAT asks whether, for a given collection C of ⊕-clauses, there exists
a truth assignment σ that forces all ⊕-clauses in C to be true. It is known that
⊕2SAT is in L [8].

Lemma 5. For any index k ≥ 3, LE2,k belongs to L.

Proof. Consider any instance (A, b) given to LE2,k. Since ⊕2SAT ∈ L, it suffices
to reduce LE2,k to ⊕2SAT by standard L-m-reductions. Note that the equation
Ax = b is equivalent to the system aij1 xj1 + aij2 xj2 = bi for all i ∈ [m], where aij1
and aij2 are the nonzero entries of row i of A with j1, j2 ∈ [n]. Fix an index
i ∈ [m] and consider the first case where j1 = j2. In this case, we translate
aij1 xj1 = bi into the ⊕-clause xj1 ⊕ 0 if the equation forces xj1 = 1, and xj1 ⊕ 1
otherwise. In the other case of j1 ≠ j2, on the contrary, we translate
aij1 xj1 + aij2 xj2 = bi into the two ⊕-clauses {xj1 ⊕ 0, xj2 ⊕ 1} if
(xj1, xj2) = (1, 0) is the unique solution, and the other values of (xj1, xj2) are
treated similarly. Finally, we define C to be the collection of all ⊕-clauses obtained
by the aforementioned translations. It then follows that Ax = b has a {0, 1}-solution
iff C is satisfiable. This implies that (A, b) ∈ LE2,k iff C ∈ ⊕2SAT. □
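A collection of ⊕-clauses like the one produced above can be decided with a union-find structure that tracks the parity between each variable and its class representative. This is a simple polynomial-time illustration, not the log-space procedure behind the membership ⊕2SAT ∈ L; the literal encoding, with one extra variable pinned to the constant 0, is our own:

```python
def xor2sat(n, clauses):
    """Decide a system of 2-literal XOR constraints.

    Variables are 1..n; variable n+1 is a designated constant read as 0.
    A literal is +v or -v; a clause (a, b) demands val(a) XOR val(b) = 1.
    E.g. the ⊕-clause "x ⊕ 1" becomes (x, -(n+1)), since -(n+1) has value 1.
    """
    size = n + 1
    parent = list(range(size))
    parity = [0] * size  # parity of each node relative to its parent

    def find(v):
        if parent[v] == v:
            return v, 0
        root, p = find(parent[v])
        parent[v] = root
        parity[v] ^= p       # now parity[v] is relative to the root
        return root, parity[v]

    for a, b in clauses:
        u, nu = abs(a) - 1, int(a < 0)
        v, nv = abs(b) - 1, int(b < 0)
        need = 1 ^ nu ^ nv   # required parity x_u XOR x_v
        ru, pu = find(u)
        rv, pv = find(v)
        if ru == rv:
            if pu ^ pv != need:   # contradictory ⊕-clauses
                return False
        else:
            parent[ru] = rv
            parity[ru] = pu ^ pv ^ need
    return True
```

Composed with the clause translation of Lemma 5, this gives a quick practical decision procedure for LE2,k instances.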

3 2-Checkered Vertex Covers

The vertex cover problem (VC) is a typical NP-complete problem, which has
been used as a basis of the completeness proofs of many other NP problems,
including the clique problem and the independent set problem (see, e.g., [5,9]).
For a given undirected graph G = (V, E), a vertex cover for G is a subset V′
of V such that, for each edge {u, v} ∈ E, at least one of the endpoints u and v
belongs to V′.
The problem VC remains NP-complete even if its instances are limited to
planar graphs. Similarly, the vertex cover problem restricted to graphs of degree
at least 3 is also NP-complete; however, the same problem falls into L if we
require graphs to have degree at most 2. We wish to seek out a reasonable
setting situated between those two special cases. For this purpose, we intend
to partition all edges into two categories: grips and non-grips (where "grips"
are defined in Sect. 2.1). Since grips have a simpler structure than non-grips,
the grips need to be treated slightly differently from the others. In particular, we
request an additional condition, called 2-checkeredness, which is described as
follows. A subset V′ of V is called 2-checkered exactly when, for any edge e ∈ E,
if both endpoints of e are in V′, then e must be a grip. The 2-checkered vertex
cover problem is introduced in the following way.
2-Checkered Vertex Cover Problem (2CVC):
◦ Instance: an undirected graph G = (V, E).
◦ Question: is there a 2-checkered vertex cover for G?
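The 2CVC question can be checked directly from the definition by exhaustive search. Since "grip" is defined back in Sect. 2.1 and not reproduced here, the sketch below takes the grip predicate as a parameter; the helper `degree_grip` encodes our working guess, consistent with the proof of Theorem 1 (an edge is a grip when both endpoints have degree at most 2), which is an assumption of ours rather than a definition from this section:

```python
from itertools import combinations
from collections import Counter

def has_2checkered_cover(vertices, edges, is_grip):
    """Brute-force 2CVC: look for a vertex cover V' such that any edge
    with BOTH endpoints in V' is a grip (the 2-checkeredness condition)."""
    for r in range(len(vertices) + 1):
        for cand in combinations(vertices, r):
            vp = set(cand)
            covers = all(u in vp or v in vp for u, v in edges)
            checkered = all(is_grip(u, v)
                            for u, v in edges if u in vp and v in vp)
            if covers and checkered:
                return True
    return False

def degree_grip(edges):
    """Hypothetical grip predicate: both endpoints have degree <= 2."""
    deg = Counter(v for e in edges for v in e)
    return lambda u, v: deg[u] <= 2 and deg[v] <= 2
```

Under this grip guess, a triangle admits a 2-checkered vertex cover (every edge is a grip), whereas K4 does not: no grips exist there, yet every vertex cover of K4 must contain two adjacent vertices.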

Associated with the decision problem 2CVC, we set up the log-space size
parameters mver(G) and medg(G), which respectively express the total number
of the vertices of G and that of the edges of G.
Given an instance graph G = (V, E) to 2CVC, if we further demand that
every vertex in V should have degree at most k for a fixed constant k ≥ 3,
then we denote by 2CVCk (Degree-k 2CVC) the problem obtained from 2CVC.
There exists a close connection between the parameterizations of 2CVC3 and
2SAT3.
Theorem 1. (2CVC3, mver) ≡^sL_m (2CVC3, medg) ≡^sL_m (2SAT3, mvbl).

Proof. Firstly, it is not difficult to show that (2CVC3, mver) ≡^sL_m (2CVC3, medg)
by Lemma 1.
Next, we intend to prove that (2SAT3, mvbl) ≤^sL_m (2CVC3, mver). Let φ be
any instance to 2SAT3 made up of a set U = {u1, u2, . . . , un} of variables and
a set C = {c1, c2, . . . , cm} of 2CNF Boolean clauses. For convenience, we write
Ū for the set {ū1, ū2, . . . , ūn} of negated variables and define Û = U ∪ Ū. In the
case where a clause contains any removable literal x, it is possible to delete all
clauses that contain x, because we can freely assign T (true) to x. Without loss of
generality, we assume that there is no removable literal in φ. We further assume
that φ is an exact 2CNF formula in a clean shape (explained in Sect. 2.1). Since
every clause has exactly two literals, each clause cj is assumed to have the form
cj[1] ∨ cj[2] for any index j ∈ [m], where cj[1] and cj[2] are treated as "labels"
that represent the two literals in the clause cj.
Let us construct an undirected graph G = (V, E) as follows. We define
V = {ui^(1), ui^(2), cj[1], cj[2] | i ∈ [n], j ∈ [m]} and we set Ũ to be {ui^(1), ui^(2) |
i ∈ [n]}, writing ui^(1) for ui and ui^(2) for ūi. We further set E as the union
of {{ui^(1), ui^(2)}, {cj[1], cj[2]} | i ∈ [n], j ∈ [m]} and {{z, cj[l]} | z ∈ Ũ, l ∈
[2], and cj[l] represents z}. Since each clause contains exactly two literals, it follows
that deg(cj[1]) = deg(cj[2]) = 2. Thus, the edge {cj[1], cj[2]} for each index
j is a grip. Moreover, since each variable ui appears at most 3 times in the
form of literals (because of the condition of 2SAT3), deg(ui^(1)) + deg(ui^(2)) ≤ 5.
Since no removable literal exists in φ, we obtain max{deg(ui^(1)), deg(ui^(2))} ≤ 3.
It follows by the definition that mver(G) = 2(|U| + |C|) ≤ 8|U| = 8mvbl(φ) since
|C| ≤ 3|U|.
Here, we want to verify that φ ∈ 2SAT3 iff G ∈ 2CVC3. Assume that φ ∈
2SAT3. Let σ : U → {T, F} be any truth assignment that makes φ satisfiable.
We naturally extend σ to a map from Û to {T, F} by setting σ(ū) to be the
opposite of σ(u). Its corresponding vertex cover Cσ is defined in two steps.
Initially, Cσ contains all elements z ∈ Û satisfying σ(z) = F. For each index
j ∈ [m], let Aj = {i ∈ [2] | ∃z ∈ Û [cj[i] represents z and σ(z) = T]}. Notice that
Aj ⊆ {1, 2}. If Aj = {i} for a certain index i ∈ [2], then we append to Cσ the
vertex cj[i]; however, if Aj = {1, 2}, then we append to Cσ the two vertices cj[1]
and cj[2] instead.
To illustrate our construction, let us consider a simple example: φ ≡ c1 ∧
c2 ∧ c3 ∧ c4 with c1 ≡ u1 ∨ ū2, c2 ≡ u2 ∨ u1, c3 ≡ ū1 ∨ u3, and c4 ≡ u2 ∨ ū3.


Fig. 1. The graph G obtained from φ ≡ c1 ∧ c2 ∧ c3 ∧ c4 with c1 ≡ u1 ∨ ū2, c2 ≡ u2 ∨ u1,
c3 ≡ ū1 ∨ u3, and c4 ≡ u2 ∨ ū3. For the truth assignment σ satisfying σ(u1) = σ(u2) =
σ(u3) = T, the 2-checkered vertex cover Cσ consists of all vertices marked by dotted
circles.

The corresponding graph G is drawn in Fig. 1. Take the truth assignment σ that
satisfies σ(u1) = σ(u2) = σ(u3) = T. We then obtain A1 = {1}, A2 = {1, 2},
A3 = {2}, and A4 = {1}. Therefore, the resulting 2-checkered vertex cover Cσ is
the set {u1^(2), u2^(2), u3^(2), c1[1], c2[1], c2[2], c3[2], c4[1]}.
By the definition of Cσ, we conclude that G belongs to 2CVC3. Conversely,
we assume that φ ∉ 2SAT3. Consider any truth assignment σ for φ and construct
Cσ as before. By the construction of Cσ, if Cσ were a 2-checkered vertex
cover, then σ would force φ to be true, a contradiction. Hence, G ∉ 2CVC3
follows. Overall, it follows that φ ∈ 2SAT3 iff G ∈ 2CVC3. Therefore, we obtain
(2SAT3, mvbl) ≤^sL_m (2CVC3, mver).
Conversely, we need to prove that (2CVC3, mver) ≤^sL_m (2SAT3, mvbl). Given
an undirected graph G = (V, E), we want to define a 2CNF Boolean formula
φ to which G reduces. Let V = {v1, v2, . . . , vn} and E = {e1, e2, . . . , em} for
certain numbers m, n ∈ N+.
Hereafter, we use the following abbreviations: u → v for ū ∨ v, u ↔ v for
(u → v) ∧ (v → u), and u ↮ v for (u ∨ v) ∧ (ū ∨ v̄). Notice that, as the notation
↮ itself suggests, u ↮ v is logically equivalent to the negation of u ↔ v.
We first define the set U of variables to be V. For each edge e = {u, v} ∈ E, we
define Ce as follows. If one of u and v has degree more than 2, then we set Ce
to be u ↮ v; otherwise, we set Ce to be u ∨ v. Finally, we define C to be the set
{Ce | e ∈ E}. Let φ denote the 2CNF Boolean formula made up of all clauses in
C.
Next, we intend to verify that G has a 2-checkered vertex cover iff φ is
satisfiable. Assume that G has a 2-checkered vertex cover, say, V′. Consider C
obtained from G. We define a truth assignment σ by setting σ(v) = T iff v ∈ V′.
Take any edge e = {u, v}. If one of u and v has degree more than 2, then either
(u ∈ V′ and v ∉ V′) or (u ∉ V′ and v ∈ V′) holds, and thus σ forces u ↮ v to
be true. Otherwise, since either u ∈ V′ or v ∈ V′, σ forces u ∨ v to be true. This
concludes that φ is satisfiable. On the contrary, we assume that φ is satisfiable
by a certain truth assignment, say, σ; that is, for any edge e ∈ E, σ forces Ce to
be true. We define a subset V′ of V as V′ = {v ∈ V | σ(v) = T}. Let e = {u, v}
be any edge. If Ce has the form u ∨ v for u, v ∈ V, then either u or v should
belong to V′. If σ forces u ↮ v in C to be true, then either (u ∈ V′ and v ∉ V′)
or (u ∉ V′ and v ∈ V′) holds. Hence, V′ is a 2-checkered vertex cover. □

The NL-completeness of 2CVC3 follows from Theorem 1 since 2SAT (and
also 2SAT3) is NL-complete under standard L-m-reductions [8] (based on the fact
that NL = co-NL [6,12]).
As an immediate corollary of Theorem 1, we obtain the following hardness
result regarding the computational complexity of (2CVC, mver) under the
assumption of LSH.

Corollary 1. Under LSH, letting ε be any constant in (0, 1], there is no
polynomial-time algorithm that solves (2CVC, mver) using O(mver(x)^{1−ε}) space,
where x is a symbolic input.

Proof. Assume that LSH is true. If (2CVC, mver) is solvable in polynomial time
using O(mver(x)^{1−ε}) space for a certain constant ε ∈ (0, 1], since 2CVC3 is a
"natural" subproblem of 2CVC, Theorem 1 implies the existence of a polynomial-time
algorithm that solves (2SAT3, mvbl) using O(mvbl(x)^{1−ε}) space as well.
This implies that LSH is false, a contradiction. □

4 Exact Covers with Exemption

The exact cover by 3-sets problem (3XC) was shown to be NP-complete [9].
Fixing a universe X, let us choose a collection C of subsets of X. We say that
C is a set cover for X if every element in X is contained in a certain set in C.
Furthermore, given a subset R ⊆ X, C is said to be an exact cover for X exempt
from R if (i) every element in X − R is contained in a unique member of C and
(ii) every element in R appears in at most one member of C. When R = ∅, we
say that C is an exact cover for X. Notice that any exact cover with exemption
is a special case of a set cover.
To obtain a decision problem in NL, we need one more restriction. Given a
collection C ⊆ P(X), we introduce a measure, called “overlapping cost,” of an
element of any set in C as follows. For any element u ∈ X, the overlapping cost
of u with respect to (w.r.t.) C is the cardinality |{A ∈ C | u ∈ A}|. With the
use of this special measure, we define the notion of k-overlappingness for any
k ≥ 2 as follows. We say that C is k-overlapping if the overlapping cost of every
element u in X w.r.t. C is at most k.
2-Overlapping Exact Cover by k-Sets with Exemption Problem
(kXCE2 ):
◦ Instance: a finite set X, a subset R of X, and a 2-overlapping collection C
of subsets of X such that each set in C has at most k elements.
◦ Question: does C contain an exact cover for X exempt from R?
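The kXCE2 question can be checked directly from the definition by exhaustive search over subcollections of C (exponential-time, for illustration only; the function name is ours):

```python
from itertools import combinations

def has_exact_cover_exempt(X, R, C):
    """Search C for a subcollection covering each element of X - R exactly
    once and each element of R at most once (the kXCE2 question)."""
    for r in range(len(C) + 1):
        for sub in combinations(C, r):
            count = {u: 0 for u in X}
            for s in sub:
                for u in s:
                    count[u] += 1
            if all(count[u] == 1 for u in X - R) and \
               all(count[u] <= 1 for u in R):
                return True
    return False
```

Setting R = ∅ recovers the plain exact-cover question, and the k-set and 2-overlapping restrictions only constrain which inputs (X, R, C) are legal, not the search itself.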
The use of an exemption set R in the above definition is crucial. If we are
given a 2-overlapping family C of subsets of X as an input and then ask for the
existence of an exact cover for X, then the corresponding problem is rather easy
to solve in log space [1].

Fig. 2. The instance obtained from φ ≡ C1 ∧ C2 ∧ C3 ∧ C4 with clauses C1 = {x1, x2},
C2 = {x1, x3}, C3 = {x2, x3}, and C4 = {x1, x3}. The truth assignment σ satisfies
σ(x1) = σ(x2) = T and σ(x3) = F. All vertices in the top row of the figure are in the
set R. The exact cover Xσ exempt from R, obtained from σ, consists of vertex pairs
and triplets linked respectively by dotted lines and dotted boxes. Here, t1[3], t2[2], t2[3],
and t3[3] are omitted for simplicity.
The size parameter mset for kXCE2 satisfies mset(⟨X, R, C⟩) = |C|, provided
that all elements of X are expressed in O(log |X|) binary symbols. Obviously,
mset is a log-space size parameter. In what follows, we consider 3XCE2 parameterized
by mset, i.e., (3XCE2, mset), and prove its inter-reducibility to (2SAT3, mvbl).
Theorem 2. (3XCE2, mset) ≡^sL_m (2SAT3, mvbl).

Theorem 2 immediately implies the NL-completeness of 3XCE2. To simplify
the following proof, we recall from Sect. 2.4 the NL problem 2LP2,k and the
fact that, for any index k ≥ 3, (2LP2,k, mcol) ≡^sL_m (2SAT3, mvbl), obtained from
Lemma 4 and Proposition 1.
Proof. We begin our proof with verifying that (2SAT3, mvbl) ≤^sL_m
(3XCE2, mset). Let φ denote a 2CNF Boolean formula with n variables and
m clauses, given as an instance to 2SAT3. Let V denote the set {x1, x2, . . . , xn}
of all distinct variables in φ and let C denote the set {C1, C2, . . . , Cm} of all
distinct clauses in φ. With no loss of generality, we assume that there is no
removable literal in φ and that φ is an exact 2CNF Boolean formula in a clean
shape. We write V̄ for the set {x̄1, x̄2, . . . , x̄n} and define V̂ = V ∪ V̄. We freely
identify a clause of the form zi1 ∨ zi2 for literals zi1 and zi2 with the set {zi1, zi2},
which is also a subset of V̂. By our assumption, each variable xi should appear
at most 3 times in different clauses in the form of literals.
We want to reduce φ to an appropriately constructed instance (X, R, D) of
3XCE2. To construct such an instance, we first define the following three sets
X1, X2, and X3. Let X1 = {xi[j] | i ∈ [n], j ∈ [m], xi ∈ Cj} ∪ {x̄i[j] | i ∈ [n], j ∈
[m], x̄i ∈ Cj}, X2 = {sj | j ∈ [m]}, and X3 = {ti[j] | i ∈ [n], j ∈ [3]}. The
universe X is made up from those three sets (i.e., X = X1 ∪ X2 ∪ X3).
To understand the following construction better, we here illustrate a simple
example of φ, which is of the form C1 ∧ C2 ∧ C3 ∧ C4 with clauses C1 = {x1, x2},
C2 = {x1, x3}, C3 = {x2, x3}, and C4 = {x1, x3}. We define the set D as
drawn in Fig. 2. Take a truth assignment σ defined by σ(x1) = σ(x2) = T and
σ(x3) = F. The set R consists of the elements of the form xi[j] and x̄i[j]
for all indices i ∈ [3] and j ∈ [4]. The exact cover Xσ for X exempt from R
consists of {x1[1], s1}, {x1[2], s2}, {x2[3], s3}, {x3[4], s4}, {x1[2], t1[1], t1[2]}, and
{x3[2], t3[1], t3[2]}.
Returning to the proof, let us define two groups of sets. For each index
j ∈ [m], Aj is composed of the following 2-sets: Aj = {{zi1[j], sj}, {zi2[j], sj} |
i1, i2 ∈ [n], Cj = {zi1, zi2} ⊆ V̂}. Associated with V, we set V^(+) to be composed
of all variables xi such that xi appears in two clauses and x̄i appears in one clause.
Similarly, let V^(−) be composed of all variables xi such that xi appears in one
clause and x̄i appears in two clauses. In addition, let V^(∗) consist of all other
variables. Note that, since there is no removable literal in φ, any variable xi in
V^(∗) appears in one clause and its negation x̄i appears also in one clause. Our
assumption guarantees that V = V^(+) ∪ V^(−) ∪ V^(∗). For each variable xi ∈ V^(+),
we set Bi^(+) = {{xi[j1], ti[1]}, {xi[j2], ti[2]}, {x̄i[j3], ti[1], ti[2]}}, provided that
Cj1 and Cj2 both contain xi and Cj3 contains x̄i for certain indices j1, j2,
and j3 with j1 < j2. Similarly, for each variable xi ∈ V^(−), we set Bi^(−) =
{{x̄i[j1], ti[1]}, {x̄i[j2], ti[2]}, {xi[j3], ti[1], ti[2]}}, provided that Cj1 and Cj2 both
contain x̄i and Cj3 contains xi. In contrast, given any variable xi ∈ V^(∗), we
define Bi^(∗) = {{xi[j1], ti[1]}, {x̄i[j2], ti[1]}}. Finally, we set D = (∪j∈[m] Aj) ∪
(∪i∈[n] (Bi^(+) ∪ Bi^(−) ∪ Bi^(∗))). Notice that every element in X is covered by at most
two sets in D, and thus D is 2-overlapping. To complete our construction, the
exemption set R is defined to be X1.
Hereafter, we intend to verify that φ is satisfiable iff there exists an exact
cover for X exempt from R. Given a truth assignment σ : V → {T, F}, we
define a set Xσ as follows. We first define X1′ by including, for each index
j ∈ [m], the 2-set {z[j], sj} for one chosen literal z ∈ Cj with σ(z) = T (when
such a literal exists). For each element xi ∈ V^(+), if σ(xi) = F, then we
set X2,i′ = {{xi[j1], ti[1]}, {xi[j2], ti[2]}} ⊆ Bi^(+), and if σ(xi) = T, then we
set X2,i′ = {{x̄i[j3], ti[1], ti[2]}} ⊆ Bi^(+). Similarly, for each element xi ∈ V^(−),
we define X2,i′. For any element xi ∈ V^(∗), however, if σ(z) = F for a literal
z ∈ {xi, x̄i}, then we define X2,i′ = {{z[j1], ti[1]}} ⊆ Bi^(∗). In the end, Xσ is set
to be the union X1′ ∪ (∪i∈[n] X2,i′). Assume that φ is made true by σ. Since all clauses
Cj are true by σ, each sj in X2 has overlapping cost of 1 in Xσ. Moreover,
each ti[j] in X3 has overlapping cost of 1 in Xσ. Either xi[j] or x̄i[j] in X1 has
overlapping cost of at most 1. Thus, Xσ is an exact cover for X exempt from R
(= X1).
On the contrary, we assume that X′ is an exact cover for X exempt
from R. We define a truth assignment σ as follows. For each 2-set in Aj, if
{zid[j], sj} ∈ X′ for a certain index d ∈ [2], then we set σ(zid) = T. For
each Bi^(+), if {xi[j1], ti[1]}, {xi[j2], ti[2]} ∈ X′, then we set σ(xi) = F; if
{x̄i[j3], ti[1], ti[2]} ∈ X′, then we set σ(x̄i) = F. The case of Bi^(−) is similarly
handled. In the case of Bi^(∗), if {z[j1], ti[1]} ∈ X′ for a certain z ∈ {xi, x̄i}, then
we set σ(z) = F. Since X′ is an exact cover for X − R, for any clause Cj, there
exists exactly one z in Cj satisfying σ(z) = T. Hence, σ makes φ true.

Conversely, we intend to verify that (3XCE2, mset) ≤^sL_m (2SAT3, mvbl).
Since (2SAT3, mvbl) ≡^sL_m (2LP2,3, mcol) by Lemma 4 and Proposition 1, if we
show that (3XCE2, mset) ≤^sL_m (2LP2,3, mcol), then we immediately obtain the
desired consequence of (3XCE2, mset) ≤^sL_m (2SAT3, mvbl). Toward the claim of
(3XCE2, mset) ≤^sL_m (2LP2,3, mcol), let us take an arbitrary instance (X, R, C)
given to 3XCE2 with X = {u1, u2, . . . , un} and C = {C1, C2, . . . , Cm}. Notice
that R ⊆ X and |Ci| ≤ 3 for all i ∈ [m].
As the desired instance to 2LP2,3, we define an n × m matrix A = (aij)ij
and two (column) vectors b = (bi)i and b′ = (b′i)i of dimension n as follows. Since
each ui has overlapping cost of at most 2, if ui is in Cj1 ∩ Cj2 for two distinct
indices j1 and j2, then we set aij1 = aij2 = 1. Let bi = b′i = 1 for all i ∈ [n]
satisfying ui ∈ X − R, and let bi = 0 and b′i = 1 for all i ∈ [n] satisfying ui ∈ R.
If D is a set cover, then we define xD = (xj)j as follows: if Cj ∈ D, then we set
xj = 1; otherwise, we set xj = 0. We then want to show that D is an exact cover
for X exempt from R iff xD satisfies b′ ≥ AxD ≥ b. Assume that D is an exact
cover for X exempt from R. Note that, if ui ∈ Cj1 ∩ Cj2 for two distinct indices
j1 and j2, then Σ^m_{j=1} aij xj = aij1 xj1 + aij2 xj2 = xj1 + xj2. Since D contains
exactly one set containing ui whenever ui ∈ X − R, we obtain xj1 + xj2 = 1. If
ui ∈ R, then we obtain Σ^m_{j=1} aij xj ∈ {0, 1}. Thus, we conclude that
b′ ≥ AxD ≥ b. On the contrary, assume that b′ ≥ AxD ≥ b. We obtain
Σ^m_{j=1} aij xj = aij1 xj1 + aij2 xj2 = xj1 + xj2 for the two indices j1 and j2
satisfying aij1 ≠ 0 and aij2 ≠ 0. If ui ∉ R, then xj1 + xj2 = 1 holds because of
bi = b′i = 1, and thus exactly one of Cj1 and Cj2 must be in D. If ui ∈ R, then
1 ≥ xj1 + xj2 ≥ 0 holds, and thus at most one of Cj1 and Cj2 belongs to D.
Therefore, D is an exact cover for X exempt from R. □


Similarly to Corollary 1, we obtain the following statement concerning 3XCE.

Corollary 2. Under LSH, no polynomial-time algorithm solves (3XCE, mset)
using O(mset(x)^{1−ε}) space for any constant ε ∈ (0, 1], where x is a symbolic
input.

5 Almost All Pairs 2-Dimensional Matching


The 3-dimensional matching problem (3DM) is well-known to be NP-complete
[9] while the 2-dimensional matching problem (2DM), which is seen as a bipartite
perfect matching problem, falls into P. In fact, 2DM has been proven to be NL-hard
[2] but it is not yet known to be in NL. In this exposition, we wish to place
our interest on a natural variant of 2DM, which turns out to be NL-complete. Let
us take a finite set X and consider the Cartesian product X × X. For any two
elements (u, v), (w, z) ∈ X × X, we say that (u, v) agrees with (w, z) if either
u = w or v = z. A matching over X × X is a subset M of X × X such that no
two distinct elements in M agree with each other. Given a subset M ⊆ X × X,
we define M(1) = {u ∈ X | ∃v [(u, v) ∈ M]} and M(2) = {v ∈ X | ∃u [(u, v) ∈ M]}.
A matching M is called perfect if M(1) = M(2) = X.
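These definitions translate directly into a brute-force checker (the encoding of elements of X × X as Python tuples and the function names are our own):

```python
def agrees(p, q):
    # (u, v) agrees with (w, z) iff they share a first or a second coordinate.
    return p[0] == q[0] or p[1] == q[1]

def is_perfect_matching(X, M):
    """M is a matching iff no two distinct pairs in M agree; it is perfect
    iff both coordinate projections M(1) and M(2) equal X."""
    pairs = list(M)
    matching = all(not agrees(pairs[i], pairs[j])
                   for i in range(len(pairs))
                   for j in range(i + 1, len(pairs)))
    return (matching and
            {u for u, _ in M} == set(X) and
            {v for _, v in M} == set(X))
```

For instance, over X = {1, 2} the set {(1, 2), (2, 1)} is a perfect matching, while {(1, 1), (2, 1)} is not even a matching, since its two pairs share a second coordinate.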

We call (v, v) a trivial pair and we first include all trivial pairs in M. We
then eliminate the trivial perfect matching M′ = {(u, u) | u ∈ X} from our
consideration by introducing the following restriction. Given any subset M′ ⊆
M and two elements x, y ∈ X, we say that x is linked to y in M′ if there
exists a series z1, z2, . . . , zt ∈ X for a certain odd number t ≥ 1 such that
(x, z1), (zt, y) ∈ M′ and (zi, zi+1) ∈ M′ for any index i ∈ [t − 1]. For any
subset R of X, we say that R is uniquely connected to X − R in M if, for
any element v ∈ R, there exist two unique elements u1, u2 ∈ X − R such that
(v, u1), (u2, v) ∈ M.
As the desired variant of 2DM, we introduce the following decision problem
and study its computational complexity.
Almost All Pairs 2-Dimensional Matching Problem with Trivial
Pairs (AP2DM):
◦ Instance: a finite set X, a subset R of X, and a subset M ⊆ X × X including
all trivial pairs such that R is uniquely connected to X − R in M.
◦ Question: is it true that, for any distinct pair v, w ∈ X, if either v ∉ R or
w ∉ R, then there exists a perfect matching Mvw in M for which v is linked
to w in Mvw?

For technicality, all entries of X are assumed to be expressed using
O(log |X|) binary symbols. A natural size parameter mset is then defined as
mset(⟨X, R, M⟩) = |X|.
Let k ≥ 2. An instance (X, R, M ) to AP2DM is said to be k-overlapping
if (i) for any v ∈ X, |{u ∈ X | (u, v) ∈ M }| ≤ k and (ii) for any u ∈ X,
|{v ∈ X | (u, v) ∈ M}| ≤ k. When all instances (X, R, M) given to AP2DM
are limited to k-overlapping ones, we denote the resulting problem by AP2DMk.
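The k-overlapping condition only bounds how often an element may occur in each coordinate of M, so it reduces to a pair of counting checks (a small sketch under my own naming):

```python
from collections import Counter

def is_k_overlapping(M, k):
    """Every element occurs at most k times as a first and as a second coordinate."""
    first = Counter(u for u, _ in M)
    second = Counter(v for _, v in M)
    return (all(c <= k for c in first.values())
            and all(c <= k for c in second.values()))
```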
We intend to show that AP2DM4 parameterized by mset , (AP2DM4 , mset ),
is inter-reducible to (2SAT3 , mvbl ) by short L-T-reductions.

Theorem 3. (AP2DM4, mset) ≡^sL_T (2SAT3, mvbl).

Proof. For ease of describing the proof, we use (3DSTCON, mver) instead of
(2SAT3, mvbl) because (2SAT3, mvbl) ≡^sL_T (3DSTCON, mver) by Lemma 3(3).
As the first step, we wish to verify that (3DSTCON, mver) ≤^sL_m
(AP2DM4, mset), although this is a stronger statement than what is actually
needed for our claim (since ≤^sL_m implies ≤^sL_T). Let (G, s, t) be any instance given
to 3DSTCON with G = (V, E). Notice that G has degree at most 3. To simplify
our argument, we slightly modify G so that G has no vertex whose indegree is 3
or outdegree is 3. For convenience, we further assume that s and t are of degree
1. Notationally, we write V (−) for V − {s, t} and assume that V (−) is of the form
{v1 , v2 , . . . , vn } with |V (−) | = n.
Let us construct a target instance (X, R, M ) to which we can reduce (G, s, t)
by an appropriately chosen short L-m-reduction. For any index i ∈ {0, 1, 2}, we
prepare a new element of the form [v, i] for each v ∈ V and define Xi to be
{[v, i] | v ∈ V (−) }. The desired universe X is set to be {s, t} ∪ X0 ∪ X1 ∪ X2 .
792 T. Yamakami

Fig. 3. The subset M of X × X with X = {s, t} ∪ {[vi, j] | i ∈ [4], j ∈ [0, 2]Z} (seen here
as a bipartite graph), constructed from G = (V, E) with V = {v1, v2, v3, v4, s, t} and
E = {(s, v2), (v3, v2), (v2, v4), (v4, v3), (v3, t)}. Every pair of two adjacent vertices forms
a single element in X. The edges expressing trivial pairs are all omitted for simplicity.
Every vertex has degree at most 4 (including one omitted edge).

As subsets of X × X, we define the following seven sets: M0 = {([v, 0], [w, 0]) |
v, w ∈ V(−), (v, w) ∈ E}, M1 = {([vi, 1], [vi+1, 1]), ([vi+1, 1], [vi, 1]) | i ∈
[n − 1]}, M2 = {([vi+1, 2], [vi, 2]), ([vi, 2], [vi+1, 2]) | i ∈ [n − 1]}, M3 =
{([v, 2], [v, 0]), ([v, 0], [v, 1]) | v ∈ V(−)}, M4 = {([v1, 1], s), ([vn, 1], s), (t, [v1, 2]),
(t, [vn, 2])}, M5 = {(s, [u, 0]), ([v, 0], t) | (s, u), (v, t) ∈ E}, and M6 = {(ũ, ũ) | ũ ∈
X}. Finally, M is defined to be the union M0 ∪ M1 ∪ · · · ∪ M6 and R is set to be
{[v, 0] | v ∈ V(−)}. Note that R is uniquely connected to X − R because of M3.
To illustrate the aforementioned construction, let us consider a simple
example of G = (V, E) with V = {v1, v2, v3, v4, s, t} and E =
{(s, v2), (v3, v2), (v2, v4), (v4, v3), (v3, t)}. The universe X is the set {s, t} ∪ {[vi, j] |
i ∈ [4], j ∈ [0, 2]Z}. The constructed M from G is illustrated in Fig. 3.
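To make the construction concrete, the following sketch builds (X, R, M) from a graph in the format of this example; tuples (v, i) stand for the elements [v, i], and the set names mirror M0–M6 (this is my reading of the definitions above, not code from the paper):

```python
def build_instance(V_minus, E, s='s', t='t'):
    """Construct the AP2DM instance (X, R, M) from G = (V, E) as defined above."""
    n = len(V_minus)
    X = {s, t} | {(v, i) for v in V_minus for i in range(3)}
    # M0: level-0 copy of the edges among inner vertices.
    M0 = {((v, 0), (w, 0)) for (v, w) in E if v in V_minus and w in V_minus}
    # M1, M2: bidirected chains on levels 1 and 2.
    M1, M2 = set(), set()
    for i in range(n - 1):
        a, b = V_minus[i], V_minus[i + 1]
        M1 |= {((a, 1), (b, 1)), ((b, 1), (a, 1))}
        M2 |= {((b, 2), (a, 2)), ((a, 2), (b, 2))}
    # M3: vertical links level 2 -> 0 -> 1; M4: end links to s and t.
    M3 = {((v, 2), (v, 0)) for v in V_minus} | {((v, 0), (v, 1)) for v in V_minus}
    M4 = {((V_minus[0], 1), s), ((V_minus[-1], 1), s),
          (t, (V_minus[0], 2)), (t, (V_minus[-1], 2))}
    # M5: edges leaving s and entering t; M6: all trivial pairs.
    M5 = ({(s, (u, 0)) for (a, u) in E if a == s}
          | {((v, 0), t) for (v, b) in E if b == t})
    M6 = {(x, x) for x in X}
    M = M0 | M1 | M2 | M3 | M4 | M5 | M6
    R = {(v, 0) for v in V_minus}
    return X, R, M
```

For the example graph, X has 2 + 4·3 = 14 elements and R = {[v1, 0], …, [v4, 0]}.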
In what follows, we claim that there is a simple path from s to t in
G iff, for any two distinct elements ũ, ṽ ∈ X, there is a perfect matching,
say, Mũṽ for which ũ is linked to ṽ. To verify this claim, we first assume
that there is a simple path γst = (w1 , w2 , . . . , wk ) in G with w1 = s and
wk = t. Let T = {([v, 0], [w, 0]) | v, w ∈ γst − {s, t}, (v, w) ∈ E} and
S = {(s, [w2 , 0]), ([wk−1 , 0], t)}. We remark that s and t are linked to each other
in M because there exists a path from s to t in G. Hereafter, ũ and ṽ denote two
arbitrary distinct elements in X with either ũ ∉ R or ṽ ∉ R.
(1) Let us consider the case where ũ, ṽ ∉ {s, t}. In this case, let ũ = [vi0, l]
and ṽ = [vj0, l′] for l, l′ ∈ [0, 2]Z and i0, j0 ∈ [n]. It follows that (l, i0) ≠ (l′, j0).
We then define the desired perfect matching Mũṽ as follows, depending on the
choice of ũ and ṽ.
(Case 1) Consider the case of l, l′ ∈ {1, 2}. Let M0 = T, M1 =
{([vi, 1], [vi+1, 1]) | i ∈ [n − 1]}, M2 = {([vi+1, 2], [vi, 2]) | i ∈ [n − 1]},
M3 = {([v1, 2], [v1, 0]), ([v1, 0], [v1, 1])}, M4 = {([vn, 1], s), (t, [vn, 2])}, M5 = S,
and let M6 contain (z, z) for all other elements z. Finally, we set Mũṽ = M0 ∪ M1 ∪ · · · ∪ M6.
It then follows by the definition that Mũṽ is a perfect matching. Since s is linked
to t in Mũṽ, ũ and ṽ are also linked to each other.
(Case 2) In the case where l = 0, l′ ∈ {1, 2}, and vi0 ∉ γst, there are three
separate cases (a)–(c) to examine. The symmetric case of Case 2 can be similarly
handled and is omitted here.

(a) If i0 ≤ j0, then we define M0 = T, M1 = {([vi, 1], [vi+1, 1]) |
i ∈ [i0, n − 1]Z}, M2 = {([vi+1, 2], [vi, 2]) | i ∈ [i0, n − 1]Z}, M3 =
{([vi0, 2], [vi0, 0]), ([vi0, 0], [vi0, 1])}, M4 = {([vn, 1], s), (t, [vn, 2])}, and M5 = S.
We further define M6 to be composed of (z, z) for all the other elements z.
Finally, Mũṽ is set to be the union M0 ∪ M1 ∪ · · · ∪ M6. Clearly, ũ is linked to ṽ in Mũṽ.
(b) In the next case of i0 > j0 and l′ = 1, we define M0 = T, M1 =
{([vi, 1], [vi+1, 1]) | i ∈ [i0]}, M2 = {([vi+1, 2], [vi, 2]) | i ∈ [i0, n − 1]Z}, M3 =
{([vi0, 2], [vi0, 0]), ([vi0, 0], [vi0, 1])}, M4 = {([v1, 1], s), (t, [vn, 2])}, and M5 = S.
For all the other elements z, we place (z, z) into M6. Setting Mũṽ = M0 ∪ M1 ∪ · · · ∪ M6
makes ũ linked to ṽ in it.
(c) In the last case of i0 > j0 and l′ = 2, we define M0 = T, M1 =
{([vi, 1], [vi+1, 1]) | i ∈ [i0, n − 1]Z}, M2 = {([vi, 2], [vi+1, 2]) | i ∈ [i0]},
M3 = {([vi0, 2], [vi0, 0]), ([vi0, 0], [vi0, 1])}, M4 = {([vn, 1], s), (t, [v1, 2])}, and
M5 = S. The set M6 consists of (z, z) for all the other elements z. We then
set Mũṽ = M0 ∪ M1 ∪ · · · ∪ M6.
(Case 3) Consider the case where l = 0, l′ ∈ {1, 2}, and vi0 ∈ γst. This case
is the same as Case 1. By symmetry, the case of l ∈ {1, 2}, l′ = 0, and vj0 ∈ γst
can be similarly dealt with.
(Case 4) Consider the case of l = l′ = 0. Assuming that vi0 ∈ γst,
we define M0 = T, M1 = {([vi, 1], [vi+1, 1]) | i ∈ [j0, n − 1]Z}, M2 =
{([vi+1, 2], [vi, 2]) | i ∈ [j0, n − 1]Z}, M3 = {([vj0, 2], [vj0, 0]), ([vj0, 0], [vj0, 1])},
M4 = {([vn, 1], s), (t, [vn, 2])}, and M5 = S. As before, we form M6 by
collecting (z, z) for all the other elements z. Obviously, ũ and ṽ are linked in
Mũṽ = M0 ∪ M1 ∪ · · · ∪ M6.
(2) Consider the second case where either ũ ∈ {s, t} or ṽ ∈ {s, t}. We remark
that all the cases discussed in (1) make s (as well as t) linked to any element
of the form [vi0, l] and [vj0, l′] in the obtained matching. Therefore, we can cope
with this case by modifying the construction given in (1).
In conclusion, for any ũ, ṽ ∈ X, from (1)–(2), if either ũ ∉ R or ṽ ∉ R, then
there is a perfect matching Mũṽ in which ũ is linked to ṽ.
Conversely, assume that, for any distinct pair ũ, ṽ ∈ X, there is a perfect
matching Mũṽ in which ũ is linked to ṽ. As a special case, we choose ũ = s and
ṽ = t. By the definition of M, there is a sequence (s, [w1, 0], [w2, 0], . . . , [wk, 0], t)
such that (s, [w1, 0]), ([wi, 0], [wi+1, 0]), ([wk, 0], t) ∈ Mst for any index i ∈ [k − 1].
This implies that (s, w1, w2, . . . , wk, t) is a path in G.
As the second step, it suffices to verify that (AP2DM4, mset) ≤^sL_T
(4DSTCON, mver). This is because (3DSTCON, mver) ≡^sL_m (kDSTCON, mver)
holds for any k ≥ 3 [14] by Lemma 3(2), and thus the desired reduction of
the theorem instantly follows. We start with an arbitrary instance (X, R, M)
to AP2DM4. Remember that M contains all trivial pairs. We then define a
graph G = (V, E) by setting V = X and E = {(u, v) | u ≠ v, (u, v) ∈ M}.
Clearly, each vertex in G has degree at most 4. Assuming that Muv is a per-
fect matching, if u is linked to v, then v is also linked to u. For any distinct
pair u, v ∈ X, if either u ∉ R or v ∉ R, then it follows that there is a perfect
matching Muv for which u is linked to v iff there exist one simple path from u

to v and another simple path from v to u in G. Thus, to check the existence of
the desired perfect matching Muv, it suffices to make two queries of the forms
(G, u, v) and (G, v, u) to 4DSTCON and output YES if the oracle answers affir-
matively to both queries. We then successively check the existence of Muv for
all distinct pairs u, v ∈ X satisfying either u ∉ R or v ∉ R. Note that the size
mver(G, u, v) = mver(G, v, u) = |V| is equal to mset(⟨X, R, M⟩) = |X|. Therefore,
we can reduce (AP2DM4, mset) to (4DSTCON, mver) by short L-T-reductions.
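The oracle procedure of this second step can be sketched as follows, with a plain reachability search standing in for the 4DSTCON oracle (an illustration of the query pattern, not the log-space implementation the reduction actually requires):

```python
from itertools import combinations

def reachable(adj, u, v):
    """Stand-in for a (G, u, v) query to the 4DSTCON oracle."""
    stack, seen = [u], {u}
    while stack:
        x = stack.pop()
        if x == v:
            return True
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False

def ap2dm_via_oracle(X, R, M):
    """Answer AP2DM by two reachability queries per relevant pair."""
    adj = {}
    for u, v in M:
        if u != v:                      # drop trivial pairs
            adj.setdefault(u, []).append(v)
    return all(reachable(adj, u, v) and reachable(adj, v, u)
               for u, v in combinations(X, 2)
               if u not in R or v not in R)
```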


Theorem 3 further yields the NL-completeness of AP2DM. Another direct


consequence of Theorem 3 is the following hardness result for the parameterized
decision problem (AP2DM, mset ).

Corollary 3. Under LSH, there is no polynomial-time algorithm that solves
(AP2DM, mset) using O(mset(x)^{1−ε}) space for a certain constant ε ∈ (0, 1),
where x is a symbolic input.

6 A Brief Summary of This Exposition

Since its first proposal in [14], the linear space hypothesis (LSH) has been
expected to play a key role in showing the computational hardness of numerous
combinatorial parameterized-NL problems. However, there are few problems that
have been proven to be equivalent in computational complexity to (2SAT3 , mvbl ).
This situation has motivated us to look for natural, practical problems equiva-
lent to (2SAT3 , mvbl ). Along this line of study, the current exposition has intro-
duced three parameterized decision problems (2CVC3 , mver ), (3XCE2 , mset ),
and (AP2DM4 , mset ), and demonstrated that those problems are all equivalent
in power to (2SAT3 , mvbl ) by “short” log-space reductions.3 The use of such short
reductions is crucial in the equivalence proofs of these parameterized decision
problems presented in Sects. 3–5 because PsubLIN is unlikely to be closed under
“standard” log-space reductions, and short reductions may be more suitable for
the discussion on various real-life problems. Under the assumption of LSH, there-
fore, all parameterized decision problems that are equivalent to (2SAT3 , mvbl )
by short log-space reductions turn out to be unsolvable in polynomial time using
sub-linear space.
In the end, we remind the reader that the question of whether LSH is true
still remains open. Nevertheless, we hope to resolve this key question in the near
future.

References
1. Àlvarez, C., Greenlaw, R.: A compendium of problems complete for symmetric
logarithmic space. Comput. Complex. 9, 123–142 (2000)

3 We remark that it is unknown whether (AP2DM4, mset) ≡^sL_m (2SAT3, mvbl) holds.

2. Chandra, A., Stockmeyer, L., Vishkin, U.: Constant depth reducibility. SIAM J.
Comput. 13, 423–439 (1984)
3. Cook, S.A.: The complexity of theorem-proving procedures. In: Proceedings of the
3rd Annual ACM Symposium on Theory of Computing, pp. 151–158. ACM (1971)
4. Cook, S.A., McKenzie, P.: Problems complete for deterministic logarithmic space.
J. Algorithms 8, 385–394 (1987)
5. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory
of NP-Completeness. W.H. Freeman and Company (1979)
6. Immerman, N.: Nondeterministic space is closed under complementation. SIAM J.
Comput. 17, 935–938 (1988)
7. Jenner, B.: Knapsack problems for NL. Inf. Process. Lett. 54, 169–174 (1995)
8. Jones, N.D., Lien, Y.E., Laaser, W.T.: New problems complete for nondeterministic
log space. Math. Syst. Theory 10, 1–17 (1976)
9. Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E.,
Thatcher, J.W. (eds.) Complexity of Computer Computations, pp. 85–103. Plenum
Press, New York (1972)
10. Reif, J.H.: Symmetric complementation. J. ACM 31, 401–421 (1984)
11. Reingold, O.: Undirected connectivity in log-space. J. ACM 55(4), 17 (2008)
12. Szelepcsényi, R.: The method of forced enumeration for nondeterministic
automata. Acta Informatica 26, 279–284 (1988)
13. Yamakami, T.: Parameterized graph connectivity and polynomial-time sub-linear-
space short reductions. In: Hague, M., Potapov, I. (eds.) RP 2017. LNCS, vol.
10506, pp. 176–191. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67089-8_13
14. Yamakami, T.: The 2CNF Boolean formula satisfiability problem and the linear
space hypothesis. In: Proceedings of the 42nd International Symposium on Math-
ematical Foundations of Computer Science, vol. 83 of Leibniz International Pro-
ceedings in Informatics (LIPIcs), Leibniz-Zentrum für Informatik 2017, pp. 1–14
(2017). A corrected and complete version available as arXiv preprint
15. Yamakami, T.: State complexity characterizations of parameterized degree-
bounded graph connectivity, sub-linear space computation, and the linear space
hypothesis. Theor. Comput. Sci. 798, 2–22 (2019)
Rashomon Effect and Consistency
in Explainable Artificial Intelligence
(XAI)

Anastasia-M. Leventi-Peetz1(B) and Kai Weber2


1
Federal Office for Information Security – BSI, Bonn, Germany
[email protected]
2
Inducto GmbH, Dorfen, Germany
[email protected]

Abstract. The consistency of the explainability of artificial intelligence
(XAI), especially with regard to the Rashomon effect, is the focus
of the work presented here. The Rashomon effect names the phe-
nomenon of receiving different machine learning (ML) explanations when
employing different models to describe the same data. On the basis of
concrete examples, cases of the Rashomon effect will be visually demon-
strated and discussed to underline the difficulty of producing, in practice,
definite and unambiguous machine learning explanations and predic-
tions. Artificial intelligence (AI) presently undergoes a so-called repli-
cation and reproducibility crisis which hinders models and techniques
from being properly assessed for robustness, fairness, and safety. Study-
ing the Rashomon effect is important for understanding the causes of
the unintended variability of results which originate from within the
models and the XAI methods themselves.

Keywords: Rashomon effect · SHAP (Shapley Additive


exPlanations) · XGBoost (eXtreme Gradient Boosting) · Consistent
ML explanations · XAI · ML · Model reliability

1 Introduction
Rashomon is the name of an old Japanese film by Akira Kurosawa in which
four different witnesses, called to report about a murder, describe their different
and partly contradictory views regarding the facts of the crime. In his publica-
tion titled “Statistical Modeling: The Two Cultures” [1,12], Leo Breiman
established the notion of the Rashomon effect for ML to describe the fact that different
statistical models or different data predictors can work equally well in fitting
the same data. To explain a model’s behavior, one usually tries to identify a
subset of the model’s parameters, especially those that seem to have the
strongest influence on the model’s prediction. For a linear regression model with
thirty parameters, for example, searching for the best five-parameter approxima-
tion would mean choosing out of a set with roughly 140,000 member functions.
Naturally, each approximating function attributes different importance to each
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 796–808, 2023.
https://doi.org/10.1007/978-3-031-18461-1_52
Rashomon Effect and Consistency in XAI 797

of the implicated parameters. The seriousness of the problem of getting differ-
ent results becomes especially critical in the application of ML to prediction
problems, when several models perform equally well while some might deliver
conflicting predictions. The concept of predictive multiplicity has been defined in
this context to describe the ability of a prediction problem to admit competing
models with conflicting predictions [16]. Breiman argued that multiplicity chal-
lenges explanations if they are derived from a single predictive model. If several
models fit the data equally well and each model provides a different explanation
of the data-generating process, how could one decide which explanation is the
right one? And how reliable are explanations? An investigation of the ProPub-
lica (COMPAS) data set, used for the development of software to support U.S.
courts in the assessment of the likelihood of defendants to become recidivists,
delivered eye-opening results. It was demonstrated, for instance, that a compet-
ing model that was only 1% less accurate than the most accurate model in the
set assigned conflicting predictions to over 17% of individuals, and that the pre-
dictions of 44% of individuals were affected by model choice [6]. In the same
work it was shown that predictive multiplicity disproportionately affects eth-
nic groups, so that people in groups afflicted by higher multiplicity were more
vulnerable to the arbitrariness of the competition of models. The Rashomon
effect is not only a problem for simple model families; it is even more pro-
nounced in families of complex models with many more degrees of freedom [5].
The initial idea that the problem might become smaller with rising accuracy
of model predictions was proven false. The remarkable predictive gain of neu-
ral network (NN) models, applied for example in computer vision and natural
language processing, has been achieved through model over-parametrization and
growing model complexity. Network models often synthesize new “features”, for
example to identify objects in images, and these features are sometimes neither
perceivable nor recognized as important by human observers. This shows that
models reason about decisions in ways that are not unique and sometimes not
even compatible with human reasoning.
Experts collectively label the multiplicity problem as under-specification. For
instance, for a model belonging to a Rashomon set, it was demonstrated
that the model’s sensitivity in predicting risk on the basis of medical imaging
was affected in substantial ways by the choice of random seeds during training
[7,16]. This underlines the fact that taking model predictive performance as the
only criterion to qualify models leads to poorly constrained training problems
whose results are not reproducible [10].
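As a side note, the count of roughly 140,000 five-parameter approximations of a thirty-parameter model, quoted at the start of this introduction, is just the binomial coefficient C(30, 5) and can be checked directly:

```python
from math import comb

# Number of ways to choose 5 of 30 parameters: C(30, 5).
n_approximations = comb(30, 5)
print(n_approximations)  # 142506, i.e. roughly 140,000
```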

2 Organization and Aim of This Work


The reproducibility of ML models for establishing deterministic results and
causal interpretations is important for creating trust towards AI systems and
their applications. The topic is under ongoing investigation as part of intensive
current research. Reasons why models are often not reproducible have been dis-
cussed in a number of recent works [20,21]. The non reproducibility of models
798 A.-M. Leventi-Peetz and K. Weber

leading to non-reproducible explanations or multiple predictions is relevant to
the original definition of the Rashomon effect. In this work, however, emphasis
will also be placed on the differences between the explanations produced by different
explanation methods when applied to one and the same model. This is a further
aspect of multiple explanations, relating to the differences between the expla-
nation methods and to their non-determinism. To
discuss both aspects, examples of feature-based model explanations will be cal-
culated, graphically depicted and compared. Attribute-based explanations show
how much each input feature contributes to a model’s output, either for a specific
instance, in which case it is a local explanation, or for the whole model, in the case of
global explanations. Most model explanation techniques focus on attribute-based
explanations. In the following section, two XGBoost models will be trained with
different parameters on the same data set. Each model will then be interpreted
on the basis of different attribute-based methods or with different metrics, and
the results compared. A NN model of equal accuracy will also be trained with
the same data set and then interpreted with the same attribute-based
method. A cross-model comparison of the interpretation results will be discussed. In
the last section, conclusions and plans for future work will be indicated.

3 Attribute-Based Model Explanations

In this section, machine learning models trained with the California Housing
Prices dataset are described and the variability of their interpretations is dis-
played. The task is to practically demonstrate discrepancies created by estab-
lished model explanation techniques for models of the same accuracy trained
with the same data. Table 1 depicts published information about
the dataset used here, derived from the 1990 U.S. Census for the estimation
of the median house value for California districts.

Table 1. California housing dataset [2]. The number of instances is 20640, the number
of attributes is 8 (numeric). Target value is the median house value for California
districts.

Attribute name Information


MedInc Median income in block
HouseAge Median house age in block
AveRooms Average number of rooms
AveBedrms Average number of bedrooms
Population Block population
AveOccup Average house occupancy
Latitude House block latitude
Longitude House block longitude

3.1 XGBoost-Model
XGBoost, or eXtreme Gradient Boosting, is a highly efficient and portable open-
source implementation of the stochastic gradient boosting ensemble algorithm
for machine learning. It provides interfaces for use with Python, R, and
other programming languages. Gradient boosting refers to a class of ensemble
machine learning algorithms that can be used for classification or regression
predictive modeling problems [3]. Ensembles are here constructed from decision
tree models. During training, trees are gradually added to the ensemble and
fitted in the direction of reducing the error of the prior models. This process is
referred to as boosting. Listings 1.1 and 1.2 describe two model instantiations;
their fitting and evaluation code is given in Listing 1.3.
The two models display the same degree of accuracy (R² = 83%) and have
been used to produce the plots of Figs. 1 and 2.
Listing 1.1. Model-1
xgbr = xgboost.XGBRegressor(learning_rate=0.14,
                            n_estimators=500,
                            random_state=1001)
Listing 1.2. Model-2
xgbr = xgboost.XGBRegressor(learning_rate=0.1,
                            n_estimators=600,
                            min_samples_split=8,
                            max_leaf_nodes=3,
                            max_depth=12,
                            random_state=1001)
Concerning the parameters, n_estimators is the number of boosting stages
to perform, min_samples_split is the minimum number of samples required to
split an internal node, max_leaf_nodes is the total number of leaf nodes in the
decision tree, and max_depth is the maximum depth of the boosted trees.
Listing 1.3. Model fitting code
xgbr_model = xgbr.fit(X_train, y_train,
                      eval_metric='logloss',
                      eval_set=[(X_test, y_test)],
                      early_stopping_rounds=20)

rr = xgbr.score(X_test, y_test)

ma_test = skl.metrics.mean_absolute_error(y_test,
                                          xgbr_model.predict(X_test))
ma_train = skl.metrics.mean_absolute_error(y_train,
                                           xgbr_model.predict(X_train))
In Fig. 1 the feature importance and permutation importance of the two models
are displayed. The importance values have been extracted from the
models through the scikit-learn interface (API). In Fig. 2 the feature importance
for the two models is depicted, as extracted through the native XGBoost
plotting interface and derived with three different metrics: weight, cover and
gain. In Fig. 3 the feature importance is displayed for both models, extracted
through the native XGBoost plotting interface and derived with two further
metrics: total_cover and total_gain. The three basic metrics to measure feature
importance are described as follows:

Fig. 1. Scatter plot of prediction over real price as well as Feature Importance and Per-
mutation Importance extracted from the XGBoost-Models with Scikit-Learn, whereby
the default metric is gain, above for Model-1 and below for Model-2.

Fig. 2. Feature Importance for three different metrics: weight, cover and gain,
extracted from Model-1 (above) and Model-2 (below) using the Native XGBoost API.

Weight: number of times a feature is used to split the data across all trees.
Cover: number of times a feature is used to split the data across all trees
weighted by the number of training data points that go through those splits.
Gain: average training loss reduction gained when using a feature for splitting.

Fig. 3. Feature Importance with the metrics: total_cover and total_gain extracted
from Model-1 (above) and Model-2 (below) using the Native XGBoost API.

Fig. 4. Feature Importance as calculated and expressed by the mean absolute SHAP
values: Model-1 (above) and Model-2 (below).

A comparison of the bar charts within the upper (lower) part of Fig. 2 shows
that the kind of the employed metric: weight, cover or gain, greatly affects the
resulting order of the calculated feature importance for one and the same model,
in this case Model-1 (Model-2). Comparing the upper and lower parts of Fig. 2
for the same metric, shows that the feature importance as calculated with the
metric weight and the metric cover respectively, is different between the two
models, despite the fact that the two models were trained on the same data
set and have the same accuracy (are competing models). To the contrary, the
metric gain delivers a constant order of feature importance for both models of
Fig. 2, as is obvious by a comparison between the upper and lower plots in the
rightmost column of Fig. 2. This constant order, delivered with the metric gain,
is also independent of the employed API for the feature extraction, as shows a
comparison between the rightmost column of plots in Fig. 2 with the middle col-
umn of plots in Fig. 1. Similar arguments count when comparing the importance
order calculated for the metric total_cover and the metric total_gain respec-
tively in Fig. 3, where it is obvious that for one and the same model the results
are again metric dependent. The new metric: total_cover yields different results
when applied on the two competing models, as seen by comparing the upper and
lower elements of the left column of plots in Fig. 3. The interpretation results
with the metric: total_cover are also different than the results with the metric:
cover, as a comparison between the corresponding columns of plots in Figs. 2 and
3 respectively shows. The differences in the feature order for one and the same
model are not at all negligible. For instance, the feature MedInc is ranked first
in the list of importance when the metric total_cover is chosen for Model-1, as
displayed in the upper left part of Fig. 3, while the same feature is ranked fourth
by the metric cover for the same model, as shown in the upper middle part
of Fig. 2. In conclusion, a model can be evaluated in different ways, which
delivers different interpretations of the model results. Because the gain metric
appears to deliver a firm ranking of feature importance also for competing
models (models of the same accuracy), this metric is usually preferred. However,
the here discussed stability of the explanations relates to the native XGBoost
API and the scikit-learn feature extraction methods. Other explanation methods
deliver yet different orders of feature importance, as the example of Fig. 4 shows,
where the feature importance is calculated with the help of the mean absolute
SHAP values, to be discussed in the next section. It is not trivial to measure
the consistency and accuracy of model interpretation results, especially when
using global attribution methods, as is the case here. Understandably, the
wide variety of possible parameter combinations to configure the model training
renders the interpretation results strongly dependent on the learning strategy.

3.2 SHAP (SHapley Additive exPlanations)


SHAP is the well-known model-agnostic game theoretic framework to explain
the output of any machine learning model [14]. SHAP explanations allow for
both local and global interpretability, showing the contribution of each feature
to the shifting of a model’s output beyond an average or base value of the model,

created with the training data set. For the results of Figs. 4 and 5 respec-
tively, the Tree SHAP (shap.TreeExplainer) algorithm has been employed here,
which has been especially developed for tree ensembles such as XGBoost [15]. A short and
precise summary of the advantages and disadvantages of SHAP is provided in
Chap. 9.6 of the online book by Christoph Molnar [19]. In Fig. 4 the mean abso-
lute SHAP values for the features of the trained XGBoost models are displayed,
while Fig. 5 presents the SHAP interpretation: the importance of the features as well as
the individual influence of each feature on the model result.
Figure 6, taken from Fan et al. (2021) [8], shows the Shapley values computed
for a fully connected layer neural network (NN) model, which was trained on the
same California Housing dataset, with the same eight features. No detailed infor-
mation about the NN or the SHAP implementation that delivered the results of
Fig. 6 is given in [8]. By comparing Figs. 6 and 5, it is obvious that Fan et
al. obtained a different feature order and different value impacts, in comparison
to those calculated for the XGBoost models here. In SHAP plots the features
are ranked in descending order along the vertical axis. Higher feature values are
marked in red color, lower values are marked in blue. The horizontal deviation,
(distance from the zero axis), is associated with the scale of the impact of the
variable on the result. The partial mixing of colors in the horizontal plot shapes
indicates that the feature’s influence on the target value, (here the house price),
is ambiguous. The vertically changing shape of the bars indicates that there
exist interactions of features with other features. As an example, Fig. 5 shows
that higher values of the feature MedInc push house prices up, in which case
the feature is said to correlate with the price, whereas higher
values of the feature AveOccup push prices down; this feature anti-correlates.
In Fig. 7, the SHAP feature interaction values for the two XGBoost models is
depicted. For the SHAP values of Fig. 8, the general shap.KernelExplainer [14]

Fig. 5. SHAP values for the XGBoost Model-1, trained with the california housing
dataset [2]. The respective plot for Model-2 is omitted because it is very Similar.

Fig. 6. SHAP values calculated with a fully connected layer NN Model, trained by Fan
et al. with the california housing dataset [2, 8].

has been applied on a NN model, especially developed and trained here with
the California Housing dataset. The corresponding results in Fig. 8 display distinct
differences compared to the SHAP values for the XGBoost models in Fig. 5,
but also to the borrowed graphic of Fan et al. in Fig. 6. The exact calculation of
model feature explainability with Shapley values would demand solving an
NP-complete problem, an exercise which is exponential in the number of features
and cannot be solved in polynomial time in most cases. Therefore, various
model-specific but also model-agnostic approximate solutions have been devel-
oped under the name KernelSHAP. They perform Shapley value estimation
by solving a linear regression-based exercise. KernelSHAP utilizes data set sam-
pling approaches that lead to solving a constrained least squares problem with
a manageable number of data points [4]. The properties of KernelSHAP are not
yet thoroughly understood. It is not clear if Shapley value estimators are indeed
statistical estimators, or if the uncertainty of their results can be quantified, and
how unbiased their sampling methods are. The issue is still under investigation.
SHAP values estimated with KernelExplainers are also not deterministic, due
to the sampling methods and the background data set selection. SHAP values
also do not provide explanation causality and have to be handled with care if
used with predictive models [4,9,13,18]. In addition to technical discrepancies
between explainability plots, taken for example the two Figs. 6 and 5, which show
SHAP values of two different models built on the same data, there often exist
also subjective human interpretation factors and opposing views to make things
even more complicated towards a unique understanding of model results. Fan et
al. [8] deduce from the Shapley value analysis of their NN model that the model
is biased because the house age positively correlates with the house price, which
goes against experience, as they say. However, the house age has also a positive
Shapley value as calculated from the XGBoost models trained here and depicted
in Fig. 5, or the NN model, depicted in Fig. 8. This should not necessarily be
a sign of bias for the model or the training data as many old houses are not
mass products, can be of better quality or particular architectural design, or go
with more land. Among the disadvantages of SHAP is the production of unintuitive
feature attributions. It should also be possible to create intentionally misleading
interpretations with SHAP, which can hide biases [19].
Rashomon Effect and Consistency in XAI 805

Fig. 7. Feature SHAP interaction values for the XGBoost models: Model-1 (above)
and Model-2 (below).

Fig. 8. SHAP values calculated with the fully connected layer NN model developed
here, trained with the California Housing dataset [2].

This is
certainly a point that needs special care and further investigation. For instance,
it is intuitive that the price of a house strongly depends on the location, but it
is unintuitive that it depends on the income of the buyer. In this case there is
certainly a relation between feature and target value, because a higher income
makes it possible for a buyer to afford a more expensive house, but this relation
is not causal in the sense of a house price prediction model. In other words, the
price of the house cannot causally depend on the income of its buyer.
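The combinatorial nature of exact Shapley value computation can be made concrete with a small, self-contained sketch (an illustrative toy example, not the code used for the experiments in this work): for a hypothetical model with three features, the exact Shapley values are obtained by enumerating all coalitions, with absent features replaced by background values.

```python
from itertools import combinations
from math import factorial

# Toy model with three features (an assumed example, not the model of this paper).
def model(x):
    return 3.0 * x[0] - 2.0 * x[1] + 0.5 * x[0] * x[2]

background = [0.0, 0.0, 0.0]   # e.g. feature means of a reference data set
x = [1.0, 2.0, 4.0]            # instance to explain
n = 3

def value(subset):
    """Model output with features outside `subset` replaced by background values."""
    z = [x[i] if i in subset else background[i] for i in range(n)]
    return model(z)

def shapley(i):
    """Exact Shapley value of feature i: weighted average over all coalitions."""
    others = [j for j in range(n) if j != i]
    phi = 0.0
    for k in range(n):
        for s in combinations(others, k):
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            phi += w * (value(set(s) | {i}) - value(set(s)))
    return phi

phis = [shapley(i) for i in range(n)]
# Additivity (local accuracy): base value + sum of Shapley values == model(x).
base = value(set())
print(phis, base + sum(phis), model(x))
```

The enumeration over all coalitions per feature is what becomes intractable for realistic feature counts and motivates the sampling-based KernelSHAP approximation discussed above.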

4 Conclusions

Examples of explanations of ML models which, although trained with identical
data, yielded varying results have been produced and compared. It has been
discussed that, apart from the predictive multiplicity which originates from the
possible existence of several equally good models fitting a data set, there also
exists an explanative multiplicity. The latter is related to the choice of the
method used to explain a prediction. This means that even if reproducibility of
model training becomes possible, the reproducibility of the model explanation
will still be an open problem if explanations have to be definite, robust and
deterministic. LIME (Local Interpretable Model-agnostic Explanations) and SHAP
are two widely used explanation methods that can explain the local behavior of
any model on a single data instance. Both methods output feature attribution
scores, which measure the influence of each dimension of the data instance on
the model output.
For the production of these scores, LIME and SHAP perturb the data
instance and observe how the model output changes on the perturbations. Both
methods then solve an optimization problem over the perturbation data set.
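The perturb-and-fit procedure just described can be sketched in a few lines of NumPy (an illustrative LIME-style local surrogate with assumed kernel, sampling and toy-model choices, not the implementation of either library):

```python
import numpy as np

rng = np.random.default_rng(0)

# Black-box model to be explained locally (an assumed toy model).
def model(X):
    return np.sin(X[:, 0]) + X[:, 1] ** 2

x = np.array([0.5, 1.0])          # instance to explain
n_samples, width = 2000, 0.5

# 1. Perturb the instance with Gaussian noise around x.
Z = x + rng.normal(scale=width, size=(n_samples, 2))
y = model(Z)

# 2. Weight perturbations by proximity to x (exponential kernel).
d2 = ((Z - x) ** 2).sum(axis=1)
w = np.exp(-d2 / width ** 2)

# 3. Fit a weighted linear surrogate g(z) = b0 + b . (z - x) on the perturbations,
#    by scaling rows with sqrt(w) before ordinary least squares.
A = np.hstack([np.ones((n_samples, 1)), Z - x])
coef, *_ = np.linalg.lstsq(np.sqrt(w)[:, None] * A, np.sqrt(w) * y, rcond=None)

# coef[1:] are the local feature attributions; they roughly approximate the
# model's local gradient (cos(0.5), 2.0) at x.
print(coef[1:])
```

The recovered coefficients depend on the perturbation width and kernel: a different sampling distribution yields different attributions, which is exactly the instability of perturbation-based explanations discussed here.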
Because no assumptions about the structure of the underlying model have to
be made, these methods have been used for XAI in domains such as law,
medicine, finance, and science. However, the explanations generated in this way
vary with the distribution of the perturbed data. Recent research has shown that
perturbation-based post hoc explanation methods are also not to be trusted,
as they can generate explanations that hide discriminatory biases in models
[11]. In a recent study which compared several Shapley-value-based explanation
algorithms, it is demonstrated that although these techniques lay claim to the
axiomatic uniqueness of Shapley values, significantly different feature attribu-
tions have been produced even when evaluated exactly (without approximation)
[17]. Recent efforts are dedicated to the development of deterministic explanation
methods, for example the Deterministic Local Interpretable Model-Agnostic
Explanations (DLIME) framework [22].
The generation of inconsistent explanations is a problem for ML applications,
especially in critical areas such as healthcare and security. The work presented
here will be continued in the direction of advancing the reliability of ML models
by adding stricter constraints to the training and explanation algorithmic
processes.

References
1. Breiman, L.: Statistical modeling: the two cultures. Stat. Sci. 16(3), 199–215
(2001). https://www.jstor.org/stable/2676681
2. Scikit-Learn California Housing dataset. http://scikit-learn.org/stable/
datasets/real_world.html#california-housing-dataset. Accessed Apr 2022
3. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings
of the 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, pp.
785–794 (2016). https://doi.org/10.1145/2939672.2939785
4. Covert, I.: Understanding and improving KernelSHAP. Blog by Ian Covert (2020).
https://iancovert.com/blog/kernelshap/. Accessed Apr 2022
5. D'Amour, A.: Revisiting Rashomon: a comment on "the two cultures".
Observational Stud. 7(1) (2021). https://doi.org/10.1353/obs.2021.0022
6. Dressel, J., Farid, H.: The accuracy, fairness, and limits of predicting recidivism.
Sci. Adv. 4(1), eaao5580 (2018). https://doi.org/10.1126/sciadv.aao5580
7. Fisher, A., Rudin, C., Dominici, F.: All models are wrong, but many are useful:
learning a variable's importance by studying an entire class of prediction models
simultaneously. J. Mach. Learn. Res. 20(177), 1–81 (2019). http://jmlr.org/
papers/v20/18-760.html
8. Fan, F.L., et al.: On interpretability of artificial neural networks: a survey. IEEE
Trans. Radiat. Plasma Med. Sci. 5(6), 741–760 (2021). https://doi.org/10.1109/
TRPMS.2021.3066428
9. Gerber, E.: A new perspective on Shapley values, part II: the Naïve Shapley
method. Blog by Edden Gerber (2020). https://edden-gerber.github.io/shapley-
part-2/. Accessed Apr 2022
10. Gibney, E.: This AI researcher is trying to ward off a reproducibility crisis.
Interview with Joelle Pineau. Nature 577, 14 (2020). https://doi.org/10.1038/
d41586-019-03895-5
11. Jia, E.: Explaining explanations and perturbing perturbations. Bachelor's thesis,
Harvard College (2020). https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:
37364690
12. Koehrsen, W.: Thoughts on the two cultures of statistical modeling. Towards
Data Sci. (2019). https://towardsdatascience.com/thoughts-on-the-two-cultures-
of-statistical-modeling-72d75a9e06c2. Accessed Apr 2022
13. Kuo, C.: Explain any models with the SHAP values - use the KernelExplainer.
Towards Data Sci. (2019). https://towardsdatascience.com/explain-any-models-
with-the-shap-values-use-the-kernelexplainer-79de9464897a. Accessed Apr 2022
14. Lundberg, S.M., Lee, S.-I.: A unified approach to interpreting model predictions.
In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems
30, pp. 4765–4774 (2017). https://proceedings.neurips.cc/paper/2017/hash/
8a20a8621978632d76c43dfd28b67767-Abstract.html
15. Lundberg, S.M., et al.: From local explanations to global understanding with
explainable AI for trees. Nat. Mach. Intell. 2, 56–67 (2020). https://doi.org/10.
1038/s42256-019-0138-9
16. Marx, C.T., Calmon, F., Ustun, B.: Predictive multiplicity in classification. In:
ICML (International Conference on Machine Learning), Proceedings of Machine
Learning Research, vol. 119, pp. 6765–6774 (2020). https://proceedings.mlr.press/
v119/marx20a.html
17. Merrick, L., Taly, A.: The explanation game: explaining machine learning models
using Shapley values. In: Holzinger, A., et al. (eds.) Machine Learning and
Knowledge Extraction, vol. 12279, pp. 17–38 (2020). https://doi.org/10.1007/
978-3-030-57321-8_2
18. Mohan, A.: Kernel SHAP. Blog by A. Mohan (2020). https://www.telesens.co/
2020/09/17/kernel-shap/. Accessed Apr 2022
19. Molnar, C.: Interpretable machine learning. Free HTML version (2022). https://
christophm.github.io/interpretable-ml-book/
20. Villa, J., Zimmerman, Y.: Reproducibility in ML: why it matters and
how to achieve it. Determined AI (2018). https://www.determined.ai/blog/
reproducibility-in-ml. Accessed Apr 2022
21. Warden, P.: The machine learning reproducibility crisis. Domino Data Lab
(2018). https://blog.dominodatalab.com/machine-learning-reproducibility-crisis.
Accessed Apr 2022
22. Zafar, M.R., Khan, N.: Deterministic local interpretable model-agnostic
explanations for stable explainability. Mach. Learn. Knowl. Extr. 3(3), 525–541
(2021). https://doi.org/10.3390/make3030027
Recent Advances in Algorithmic Biases
and Fairness in Financial Services:
A Survey

Aakriti Bajracharya(B), Utsab Khakurel, Barron Harvey, and Danda B. Rawat

Howard University, Washington, DC 20059, USA


[email protected]

Abstract. Artificial intelligence capabilities and machine learning
algorithms have been widely used in different applications including financial
services. Many financial services such as loan or credit limit approval and
credit score estimation rely on automated algorithms to offer efficient and the
best possible services to customers. However, algorithms can suffer from
intentional and unintentional biases and produce unfair outcomes. This
paper presents a survey of algorithmic biases and fairness in financial
services. We study the sources of bias and the different instances of bias
existing in the prominent areas of the financial industry. We also discuss
the detection and mitigation techniques that have been proposed,
developed and used to enhance transparency and accountability.

Keywords: Algorithmic bias · Bias detection · Credit approval · Mortgage
lending · Bias mitigation

1 Introduction

AI and ML are disruptive technologies that have become valuable tools for
government organizations and large businesses including financial institutions.
In the present era of increased computational capacity, abundant digital data
and its low-cost storage, AI has seen significant growth in the financial sector.
ML algorithms can identify a wide range of non-instinctive correlations in
structured and unstructured "big data". They can therefore, conceptually, be more
efficient than humans at utilizing such data for forecasting, in a generation
where highly concentrated computing power makes it possible to generate, collect,
and store massive datasets [45].
The possibilities of AI in the different sectors of finance are unlimited and
span across the entire value chain. Combining AI with other existing technologies
such as blockchain, cloud computing, etc. increases the possibilities even further.
AI has streamlined the processes in the different stages of providing financial
services, enhanced cybersecurity, automated routine tasks such as credit scoring
and pricing, and, most importantly, improved the customer service experience.
Algorithms are also used in risk assessment and management, real-time fraud
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 809–822, 2023.
https://doi.org/10.1007/978-3-031-18461-1_53
810 A. Bajracharya et al.

detection, stock trading, providing financial advisories and so on. AI makes
decisions that have a crucial impact on the business, and it is important to
know the reasons behind such critical decisions.
Aside from the multitude of opportunities brought forward by AI, every
machine learning system is prone to retaining multiple forms of bias present
in tainted data [4]. While conventional computer programs are explicitly
instructed through code, ML algorithms are provided with an underlying
framework and are trained to learn through data observation. In the course of
learning, ML algorithms will develop biases towards certain types of input.
There are multiple forms of bias that reflect human prejudices towards race,
color, sex, religion, and many other common forms of discrimination, which are
amplified by the ML model. Instances include unjust decisions taken by ML
models that are based on historical police records, bias in under-sampled data
from minority groups, and so on. The racial wealth gap that exists between Black
and white Americans is preserved in part through such biases in credit and
lending [45].
The data fed to the computer is simplified to allow the algorithms to be
programmed to learn by example, but the example is often a faulty one, leaving
all data mining applications capable of replicating human biases [29,44].
Regardless of whether the intentions were good or bad, financial institutions
risk using either biased or selectively chosen data, or a biased algorithmic
design, that induces discriminatory outcomes towards legally protected traits
such as race, gender, religion or sexual orientation [39]. Therefore, it is
important to design algorithms in ways that mitigate the potential for bias and
ensure fairness. Software is called fair if it is not affected by any prejudice
that favors the inherent or acquired characteristics of an individual or a group.
In the context of financial institutions, the scope of bias in the ML model
expands further due to the fact that the data is collected from customers [52]. It
is a consumer-facing industry with a decreasing level of human involvement, and
it is therefore important for organizations to be aware of the potential for
algorithmic bias and have strategies for mitigation.
Bias mitigation tools and strategies continue to advance in improving
algorithms' accuracy and fairness. Most of the strategies are developed with the
objective of mitigating the effects of bias, from sampling issues, feature
selection, labeling, etc., and of preventing discrimination in a given model's
outputs [47]. In this paper we discuss the common strategies for achieving
fairness, such as pre-processing, in-processing and post-processing tools,
fairness metrics and resampling, as well as algorithm audits and the use of
"alternative" data. It can be a challenge to satisfy every fairness condition
simultaneously, and so fairness always entails some degree of trade-off with
respect to accuracy.
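As a minimal illustration of one such fairness metric (a generic sketch with hypothetical decisions, not tied to any specific tool discussed in this survey), the demographic parity difference measures the gap in positive-outcome rates between two groups:

```python
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Difference in positive-prediction rates between the two groups (0 and 1).

    y_pred: binary model decisions (1 = approved); group: binary protected attribute.
    A value of 0 means parity; larger magnitudes indicate more disparate treatment.
    """
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rate_g0 = y_pred[group == 0].mean()
    rate_g1 = y_pred[group == 1].mean()
    return rate_g1 - rate_g0

# Hypothetical loan decisions for 10 applicants from two groups.
decisions = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
groups    = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(demographic_parity_difference(decisions, groups))  # 0.2 - 0.8 = -0.6
```

A post-processing mitigation step would, for instance, adjust decision thresholds per group until this difference falls below a chosen tolerance, typically at some cost in overall accuracy.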
This paper presents a survey on the recent advances of algorithmic biases and
fairness in financial services. The remainder of the paper is organized as follows.
Section 2 discusses the different sources of algorithmic bias. Section 3 presents
an overview of different instances of biases in the major areas of financial ser-
vices. Section 4 focuses on the bias detection and mitigation techniques. Finally,
conclusions are presented in Sect. 5.

2 Sources of Algorithmic Bias


Drawing on the definition coined by Mitchell in 1980, bias was first defined
as “any basis for choosing one generalization (hypothesis) over another, other
than strict consistency with the observed training instances [36].” Algorithmic
bias has existed since the emergence of AI and has long been documented and
analyzed in various research works.
Humans develop an AI/ML model by deciding on the features and relevant
attributes and by developing classifiers, as a collective process of building
the AI/ML system. In the course of this process, the probability of inherent
human bias being encoded into the system is very high. The outcome would then
reflect the social inequalities that exist in the actual world. In this section,
we divide the different sources of bias in terms of the stages in which they can
emerge: input, training and programming.

2.1 Bias in Input Data


When data that replicates existing biases is provided as input to the
algorithm, the outcome incorporates and perpetuates those biases. Bias in input
data can arise for the following reasons:

Historical Bias. Data may incorporate the gender, racial, economic and other
biases that have existed for a long time. Geographical bias is one common
outcome of historical bias: residents of poor ZIP code areas or of minority
communities have historically had more cases of default, which is reflected in
the data and thus results in a higher proportion of declined loan applications.
The cycle goes on and reinforces the historical biases over time, a phenomenon
called a feedback loop.

Measurement Bias. Bias can arise when a study variable is inaccurately
measured, or when systematic data recording errors are stored. Such errors
generally affect the entire model, but can sometimes impact a particular group,
namely when the data collection method for just that group was faulty.

Representation Bias. The sample size could be small or skewed towards cer-
tain groups that are not representative of the entire population.
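One common countermeasure against such skewed samples, discussed among the mitigation strategies later in this paper, is resampling or reweighting the training data. The following minimal sketch (an illustrative example with synthetic group labels, using the standard inverse-frequency heuristic) assigns weights so that each group contributes equally to the training loss:

```python
import numpy as np

# Hypothetical group labels in a skewed training sample: group 1 is under-represented.
groups = np.array([0] * 90 + [1] * 10)

# Inverse-frequency weights: each group contributes equally to the training loss.
counts = np.bincount(groups)
weights = (len(groups) / (len(counts) * counts))[groups]

print(weights[0], weights[-1])      # majority samples weigh less than minority ones
print(weights[groups == 0].sum(), weights[groups == 1].sum())  # equal group totals
```

Such per-sample weights can then be passed to any learner that accepts them, instead of (or in addition to) over- or under-sampling the raw data.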

2.2 Bias in Training


Biases can be encoded in the training datasets of ML algorithms, from the
sources mentioned in the above subsection. One of the common reasons is the
training data inadequately or unequally representing the target population. Bias
could also appear in the categorization of the baseline data. When the algorithm
designer selects attributes such as age, gender, skin color, etc. to divide data
into different groups, this can advantage certain privileged groups while
disadvantaging other, underprivileged groups [13].

2.3 Bias in Programming or Algorithm Design

The concept of machine learning is to develop a smart algorithm and allow it
to learn from existing or new data. There is plenty of scope for bias to occur in
the original design or be generated in the process of learning. Algorithms tend
to learn the wrong lessons from erroneous data and often cannot differentiate
between causal relations and correlations. It becomes concerning when the
identified correlations unintentionally act as proxies for excluding legally
protected groups [41]. One of the more concerning instances is when an algorithm
conveniently identifies the easiest path for problem solving and decision-making
and entirely misses the point of the training [49].

Weighting Bias. Oftentimes, a weight is given to each feature in an algorithm
to assign it a lighter or heavier importance in the model. Weighting bias emerges
if the weights are not applied correctly, and it impacts the outcome [35].

Proxy Bias. A variable used in an algorithm could be an important indicator
for the model but still be inadvertently correlated with potentially sensitive
attributes; it can then function as a proxy for discriminating against a protected
class of people. For example, neighborhood could be correlated with a particular
ethnicity. It is often challenging to determine whether a variable is correlated
with discriminatory attributes, what the degree of correlation is, and whether to
include it in or exclude it from the training.
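A first, crude check for such proxies (an illustrative sketch with synthetic data, not a tool from the surveyed literature) is to measure the correlation between each candidate feature and the protected attribute. The hypothetical `neighborhood_score` below is constructed to be a proxy, while `typing_speed` is generated independently:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Synthetic data: a binary protected attribute and two candidate features.
protected = rng.integers(0, 2, size=n)
# 'neighborhood_score' is deliberately generated to depend on the protected
# attribute (a proxy); 'typing_speed' is generated independently of it.
neighborhood_score = 0.8 * protected + rng.normal(scale=0.5, size=n)
typing_speed = rng.normal(size=n)

def proxy_correlation(feature, protected):
    """Pearson correlation between a candidate feature and the protected attribute."""
    return np.corrcoef(feature, protected)[0, 1]

r_proxy = proxy_correlation(neighborhood_score, protected)
r_clean = proxy_correlation(typing_speed, protected)
print(round(r_proxy, 2), round(r_clean, 2))  # the first is large, the second near 0
```

A low linear correlation does not rule out a proxy, since the relation may be non-linear or only visible in combination with other features, which is why determining proxy status remains challenging in practice.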

Emergent Bias. Algorithms operate in a dynamic environment, and society, the
population, and cultural values are ever changing. Such changes may inject new
biases into a fully operational algorithm [19]. When disruptions like COVID-19
occur, algorithms may not reflect them and may need an update or retraining to
accommodate the changes. Such a contingent bias is more difficult to anticipate
and, in turn, to control for.
As Daniel James Fuchs writes “the complex network of relationships that
compose the learned bias exist as an effectively abstract object” (2018). Hence,
it is often difficult even for the algorithm's creator to detect a learned bias in
the first place, let alone understand its root cause and its impact on the data.

3 An Overview of Biases in Financial Services


Automation is rapidly replacing human judgment in all industries and the finan-
cial services industry is undeniably an adherent to this trend. More and more
financial institutions are passively outsourcing to and relying on algorithms to
make decisions, in order to get efficient results free from the subjectivity and
cognitive biases inherent in human decision-making. However, this tendency
ignores the basic fact that algorithms often use biased programmed reasoning
that is itself invented by humans.
In this section, we present the recent advances in the study of algorithmic bias
in the most prominent areas of the financial services industry.

3.1 Credit Reports and Credit Scores


Consumer credit scoring is a common practice used by insurance, rental, hous-
ing, labor and lending markets as a primary means of assessment to protect
themselves from financial loss due to asymmetric information. Credit reporting
and scoring mechanisms implemented by credit card companies and consumer
lending firms have been accused of being based on affinity profiling, and this
has been a common subject of FTC complaints [15].
An experimental study assessing whether social biases are captured in credit
scoring found that loan application information is sufficient to infer sensitive
information such as the gender and ethnicity of the customers, and that the
pertaining biases were mechanically transmitted into their evaluation [25]. Such
socially biased data are fed into the credit scoring algorithm and also utilized
for the training datasets, which subsequently reinforces a biased credit score.
FICO (Fair Isaac Corporation) introduced the FICO credit score in 1989. Today
it is widely used by financial institutions to assess the creditworthiness of
borrowers. Most FICO scores are on a range of 300 to 850 [1]. African-American
people had the lowest average credit score in 2021, as depicted in Fig. 1, which
shows the average credit scores by race. Consequently, they end up paying higher
interest and depending on subprime lenders [9]. Once their credit score is
damaged, it takes a lot of effort for individuals to improve their financial
health.

3.2 Mortgage Lending


In the U.S., home ownership is a key building block of wealth and intergener-
ational wealth transfers. It is illegal to discriminate against someone based on
their legally protected traits at any stage of the mortgage process, including
property appraisals. Hence, credit models of financial institutions are expected
to comply with a number of regulatory enforcements at local, state and federal
levels.
The U.S. is also notorious for many factors contributing to unequal access
to mortgage loans, such as redlining, geographically targeted predatory lending,
discrimination in lending standards, and racial covenants. The HMDA (Home
Mortgage Disclosure Act) dataset holds permanent records of every possible
preferential treatment and discrimination in the past and, consequently, of all
the declined loan applications. The data displays the fact that people of color
are two to three times more likely to be denied a mortgage loan in comparison to
their white counterparts [9].
Analysis of geographical bias has a growing importance given the gradually
declining mobility of Americans.

Fig. 1. Average credit scores by race for 2021 [1].

Geographical discrimination exacerbates the
lack of access to fairly priced credit and leads to racially disparate outcomes.
Lenders have shifted from excluding redlined neighborhoods from mainstream
credit to exploiting them to maximize profits [26]. For instance, Cathy
O’Neil cites an example of a borrower in the majority black neighborhood of
East Oakland, California receiving a low e-score due to historical correlations
between her ZIP code and high default rates [38]. Lenders are basically abusing
borrowers with such low credit scores and little-to-no credit histories by extend-
ing credit at arguably unreasonably high interest rates. The borrowers, despite
being aware of the exploitation, have no other option except to submit to the
unfair credit terms [29].
Predatory tactics such as aggressive advertising, consent solicitation and bait-
and-switch schemes have been historically employed by creditors to target vul-
nerable borrowers [17]. For instance, these creditors visit university campuses
annually to conduct exciting advertisement campaigns full of music and freebies
to attract students with offers of credit at teaser low interest rates. As AI
keeps advancing, the ability of such predatory creditors and lenders to target
vulnerable consumers will advance as well.
Black and Hispanic borrowers suffer from racial gaps in mortgage costs and
are more likely to be rejected when they apply for a loan. In a Northwestern Uni-
versity meta-analysis, the authors find no evidence that racial disparities in
the mortgage market have declined over the past four decades. Subtle but
persistent forms of discrimination between whites and minorities have been
compounding residential segregation, preventing minority households from
building wealth through housing and thereby maintaining the racial wealth
gaps [43].
Despite the covenants’ judicial unenforceability and illegality, racial
covenants continue to forbid the ownership or occupancy of certain land by
non-white people indicating that the properties are exclusive. Use of such pre-
viously archived racial covenant data has a long term impact on racial wealth
gaps and contributes to extreme racial and economic segregation in neighbor-
hood development [5].
Recent studies also demonstrate the persistence of a substantial wealth gap
between white and non-white families. The median wealth of a typical white
household is eight times the wealth of a typical black household and five times
that of a Hispanic household [8].

3.3 Small Business Lending and Access to Other Banking Services


Unequal access to funds results in lower levels of Black-owned businesses and
business assets, constraining their growth and consequently contributing to the
racial wealth gap. Data from a 2020 report [40] from The Brookings Institution
states, "Black people represent 12.7% of the U.S. population but only 4.3%
of the nation's 22.2 million business owners." Fairlie et al. [18] explored racial
inequality in access to startup capital and reported substantially higher levels
of loan denials for Black startups. They also observe that Black entrepreneurs
opt for alternative sources of funds, such as financial help from friends, family
and relatives, or personal savings, instead of applying for a loan, even despite a
good credit history, because they expect to be denied credit. This is by far the
largest cause of disparities in total financial capital.
According to the 2020 Small Business Credit Survey [6], the pandemic
financially impacted Black business owners more than any other group. The survey
was conducted on nearly 10,000 small businesses and another 4,500 non-employer
firms, and it found that 92% of Black-owned businesses reported struggling
financially during the pandemic, compared to 79% of white-owned businesses. In
addition, 38% of Black-owned businesses borrowed from a friend or relative, 25%
of owners worked a second job, and 74% used their personal funds to deal with
the financial challenges.
The number of Black-owned banks has declined by 62.5% in the past two
decades (from 48 in 2001 to 18 in 2020) [37]. As per the latest Minority
Depository Institutions Program report by the FDIC, the number of Black-owned
banks is 19 [3]. Apart from this, another reason for limited access to financial
services is an overall decline in the number of banks in majority Black and
Latino or Hispanic neighborhoods [9]. Although the overall number of banks in
the US is declining due to various types of market failure, racial discrimination
acts as a catalyst, further increasing banking and credit deserts in underserved
urban and rural communities. Of the 7.1 million unbanked households in the
United States [2], a disproportionate share are women and people of
color [46]. These real-life biases in human decisions are again reproduced in
biased algorithms.

3.4 Operational Models

While the majority of bias detection and mitigation approaches are
concentrated on credit assessment, there seems to be limited focus on the biases
prevalent in the operational models in financial services [33]. The accelerating
use of AI in the operation of financial firms has digitalized the customer service
area of business. AI is pervasively used in credit card fraud detection, customer
authentication, chatbots, etc. Facial recognition algorithms are known to exhibit
bias: researchers have found that they falsely identify Black and Asian faces 10
to 100 times more often than white faces. The algorithms also falsely identified
female faces more often than male faces. This especially increases the
vulnerability of Black women towards algorithmic bias [24]. On the other hand, Natural
Language Processing (NLP) models, widely used in the customer service appli-
cations, financial assistants, recruiting, and personnel management, are basically
a product of linguistic data full of discriminatory patterns that reflect human
biases, such as racism, sexism, and ableism [11].

3.5 FinTech and Big Tech

Fintech firms are experiencing remarkable growth and have reached a 64% global
adoption rate [20]. Fintech firms are increasingly attracting business from many
financial institutions claiming to provide sophisticated, analytical and model-
based interpretation of big data. Such services are primarily oriented towards
predicting the creditworthiness of borrowers and their results are treated almost
like universal truths by the financial institutions [29].
Alongside this remarkable growth, reports of bias have also increased,
especially concerns regarding privacy and fairness. Fintech
firms around the world utilize thousands of features and attributes for assessing
creditworthiness [50]. Fintech makes use of individuals' digital footprints, such
as marital and dating status, social media profiles, SMS message contents, cookie
data, facial analysis and micro-expressions, and even typing speed and accuracy,
all of which are scrutinized for assessing creditworthiness [33].
Fintech firms promise that their developers expressly program their
algorithms to prevent them from replicating statistical discrimination. Several
studies of the discrepancies in FinTech lending, including Shoag (2021), claim
that Fintech mortgage lenders show little to no gap in the lending terms provided
to Black and Hispanic borrowers after adjusting for GSE credit-pricing
determinants and loan size [48]. The findings of the research conducted by
Bartlett et al. [7] suggest that, "In addition to the efficiency gains of these
innovations, they may also serve to make the mortgage lending markets more
accessible to African-American and Latinx borrowers." They found that FinTech
algorithms also discriminate, but 40% less than face-to-face lenders. During the
pandemic, researchers at New York University [27]
found that Black business owners were 12.1 percentage points more likely to get
PPP funds from a fintech firm than from a conventional bank, while small banks
were much less likely to lend to Black businesses.
Fintech firms are now writing the most home mortgages, but they face less
regulatory scrutiny, so their AI models pose growing ethical concerns that
threaten the most marginalized individuals and families [10]. Recent federal
banking regulations adopt a deregulatory approach that enables fintech firms to
dominate the financial markets. While the approach was intended to encourage
innovation, it may amplify the exploitation of the most vulnerable communities
[29]. One study [22] finds no evidence of fintech firms working to increase
financial services to low-income borrowers. A possibility that cannot be ignored
is that fintech lenders may ingrain predatory inclusion, existing inequities, and
unconscious biases into the financial system for decades to come, continuing to
accelerate the wealth gap and constraining the development of minority
communities [29].

4 Approaches to Bias Detection and Mitigation


Bias in algorithms can affect an organization legally, through litigation and
regulatory enforcement; financially, through loss of consumer trust; and
reputationally, by damaging the firm's image. Disparate impact on any group or
community due to algorithmic decisions has been codified in US law and
regulation as an evidentiary basis for closer review and even sanction [30]. It is
therefore important to mitigate bias and create algorithms that are free from
potential bias [35].
Historically, bias in the data or in the algorithmic design was easier to assess
than in the present context, where learning algorithms are continually evolving
in sophistication. Since there is little to no transparency in the way ML
algorithms operate, even the programmer cannot always explain how an algorithm
chooses, studies, and assesses factors from a large pool of data. This makes it
challenging to directly observe learned biases from the outside, and hence
fundamentally difficult to assess the fairness of an algorithm and its
decision-making patterns. Moreover, what counts as fair or biased depends heavily
on context, which adds to this difficulty. It is now common knowledge that
algorithms can be biased even after sensitive information is withheld as an input
[30]. Over the past decades, there has been a substantial amount of research on
tools and techniques to avoid bias and achieve fairness in ML algorithms.
Fuchs [21] suggests that rather than attempting to interpret the process
and determine cases of bias, bias can be identified by observing trends in the
ML algorithm’s decisions. In [13], the authors recommend integrating routine
bias detection and mitigation in machine learning software development cycle
by using a method called Fairway which provides a combination of mitigation
tools in the pre-processing and in-processing stages to remove ethical bias from
the model. In [34], the authors put forward a holistic approach to deal with such
818 A. Bajracharya et al.

challenges by proposing an integrative framework for trustworthy and ethical AI
while also introducing AI in the educational curriculum.
In [52], the authors presented the challenges introduced by AI technology
in value-based decision making, particularly the unintentional bias, and focus
on the fairness assessment for establishing AI governance framework. Different
measures for mitigating bias were reviewed under this article such as the use of
bias mitigation tools, algorithms, fairness metrics and imbalanced data treatment
by resampling.

4.1 Pre-processing, In-Processing and Post-Processing Tools

Algorithms can be used for mitigating bias during the three stages of process-
ing namely pre-processing, in-processing (algorithm modifications), and post-
processing. Pre-processing bias mitigation involves preparing and optimizing
the data to accurately represent the population and reduce the predictability
of the protected attribute. Neglecting bias in the source data can cause greater
bias in the model's conclusions. Resampling, reweighting, massaging, and data
transformation tactics such as flipping the class labels across groups, and omit-
ting sensitive variables or proxies are some methods of bias mitigation in the
early stage [31]. In-processing tends to focus on creating a classifier and train-
ing it to optimize for both accuracy and fairness. Mitigation can range from
using adversarial techniques, ensuring underlying representations are fair such
as Kamishima's prejudice remover [32], or by framing constraints and regulariza-
tion [12]. Finally, there is an abundance of methods that focus only on adjusting
the outcome of a model i.e. post-processing. Early works in this area focus on
modifying thresholds in a group-specific manner whereas recent work has sought
to extend these ideas to regression models [42].
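The reweighting tactic mentioned above can be illustrated with a small sketch in the spirit of Kamiran and Calders [31]: each (group, label) combination receives a weight so that the protected attribute becomes statistically independent of the label in the weighted training data. The function name and toy data below are illustrative assumptions, not taken from the cited work:

```python
from collections import Counter

def reweighing(groups, labels):
    """Instance weights w(g, y) = P(g) * P(y) / P(g, y).

    After weighting, the protected attribute g is statistically
    independent of the label y in the training data."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] / n) * (label_counts[y] / n) / (joint_counts[(g, y)] / n)
        for g, y in zip(groups, labels)
    ]

# Toy data: group "a" is approved (label 1) more often than group "b".
groups = ["a", "a", "a", "b", "b", "b"]
labels = [1, 1, 0, 1, 0, 0]
weights = reweighing(groups, labels)
```

Under-represented (group, label) pairs such as ("b", 1) receive weights above 1, while over-represented pairs are down-weighted, so a weight-aware learner sees a balanced picture.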

4.2 Fairness Metrics and Open Source Libraries

In addition to the above, algorithms can be tested using bias detection techniques
such as computer simulations and comparing outcomes for different groups. Sev-
eral fairness metrics have become available in recent years among which statisti-
cal parity, equalized odds and equality of opportunity, predictive parity and cal-
ibration are widely used in testing the fairness of an algorithm [23]. Researchers
have recently developed a three-level rating system which can determine the
relative fairness of an algorithm [51]. Algorithm users and designers also have
various open source libraries to choose from such as FairML, AI Fairness 360,
Fairness comparison, etc. [52].
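Two of the metrics named above, statistical parity and equality of opportunity, reduce to simple group-wise rate comparisons. A minimal sketch for binary predictions and a two-valued protected attribute (the function names and toy data are illustrative assumptions):

```python
def statistical_parity_diff(y_pred, group):
    """Difference in positive-prediction rates between the two groups."""
    rate = {}
    for g in set(group):
        preds = [p for p, gg in zip(y_pred, group) if gg == g]
        rate[g] = sum(preds) / len(preds)
    a, b = sorted(rate)
    return rate[a] - rate[b]

def equal_opportunity_diff(y_true, y_pred, group):
    """Difference in true-positive rates (equality of opportunity)."""
    tpr = {}
    for g in set(group):
        hits = [p for p, t, gg in zip(y_pred, y_true, group) if gg == g and t == 1]
        tpr[g] = sum(hits) / len(hits)
    a, b = sorted(tpr)
    return tpr[a] - tpr[b]

# Toy loan decisions: group "a" is favoured on both metrics.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]
spd = statistical_parity_diff(y_pred, group)         # 0.75 - 0.25
eod = equal_opportunity_diff(y_true, y_pred, group)  # 1.0 - 0.5
```

Values near zero indicate parity between the groups; large positive or negative values flag a disparity worth auditing.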

4.3 Resampling

Unequal representation of data is one of the major reasons for bias because the
model lacks sufficient data of a certain class and is therefore unable to learn
about that particular class. The algorithms are much more likely to classify new
observations to the majority class. Resampling is one of the methods of treating
imbalanced data, which includes either randomly removing instances from
the majority class (under-sampling) or randomly replicating instances from the
minority class (over-sampling) to achieve balance. Some notable resampling tech-
niques are Bootstrap which simply over samples or under samples with replace-
ment and SMOTE [14] (Synthetic Minority Over-sampling Technique) which
over-samples by generating new synthetic minority samples. It is more popu-
lar as it overcomes the drawback of both under-sampling and over-sampling by
preventing loss of information and overfitting of data.
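The SMOTE idea described above, interpolating a minority sample with one of its k nearest minority-class neighbours, can be sketched in a few lines. This toy pure-Python version only illustrates the principle and is not the reference implementation of [14]:

```python
import random

def smote(minority, n_synthetic, k=2, seed=0):
    """Generate synthetic minority samples by linear interpolation
    between a sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3)]
new_points = smote(minority, n_synthetic=3)
```

Each synthetic point lies on the segment between two real minority samples, so no information is discarded (as in under-sampling) and no exact duplicates are created (as in naive over-sampling).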
While implementing the above methods, it is important to keep in mind
that balancing the dataset can escalate the biases and if a proper model is not
selected, the model itself may have a potential bias [52].

4.4 Algorithm Audit


Akula et al. [4] anticipate the emergence of algorithm audits, which would verify
the lawfulness, ethics and trustworthiness of AI. They suggest seven potential
audit phases to assess vulnerability and maintain an appropriate degree of trans-
parency. There could be technical complexities and limitations in considering AI
models and algorithms as the decision-makers in the eyes of the law.

4.5 Use of Alternative Data


Many fintech lenders are using “alternative” data to develop more inclusive credit
scoring algorithms such as utility payment, cell phone and internet data usage,
electronic records of deposit and withdrawal transactions, insurance claims, bank
account transactions, consumer's occupation and education, and so on. In [28],
the authors demonstrate that, over the years, such alternative sources of
information have contained much additional information that is not part of
traditional credit approval criteria. Some lenders do not use FICO scores at all
whereas some include FICO scores along with the alternative data to ultimately
decide on the creditworthiness. Alternative credit scoring products are relatively
new, but there is already some evidence that they can paint a more accurate
and comprehensive picture of the risks posed by borrowers with thin or impaired
credit histories than traditional scoring techniques [16]. Researchers are hopeful
that the future of big data and alternative information is bright. However, if
alternative information is used without the borrower's consent, it can raise
concerns surrounding consumer privacy. Another potential risk is the misuse of
the alternative data by the online fintech lenders to identify sensitive attributes
or consumers from vulnerable communities.
It is theoretically impossible to construct a perfect algorithm that fulfills all
the requirements of being both “fair” and “accurate” [4]. While this survey has
reviewed the recent approaches to bias detection and mitigation strategies, ML
researchers are continually working to evolve the strategies along with the evo-
lution of learning algorithms. Therefore, it is necessary to constantly scrutinize
and assess these techniques as well as the algorithms.

5 Conclusion
The advent of AI technology has the potential to generate ample AI solutions
that can upgrade traditional banking, develop new market infrastructure, and
foster the inclusion of unbanked or underbanked consumers. It has to be backed
by effective state and federal regulations that monitor the consumer lending
market and promote the accountability, transparency, and explainability of their
algorithms. In this paper, we have presented a survey of recent advances in the
algorithmic biases and fairness in financial services. Algorithmic bias continues
to prevail in all sectors of the financial industry catalyzing the deeply rooted
economic segregation and racism in the U.S. With the advent of more and more
bias detection and mitigation tools, the AI/ML solutions will gradually gain
more trust from the consumers of the financial industry and as a result, help to
facilitate a more equitable future. There is also a crucial need for a detailed study
of modern datasets to identify the current status of lending disparities.

Acknowledgments. This work was supported in part by Mastercard Inc. research
funds and Intel Corporation research funds at Howard University. However, any
opinion, finding, and conclusions or recommendations expressed in this document
are those of the authors and should not be interpreted as necessarily representing
the official policies, either expressed or implied, of the funding agencies.

References
1. Credit score, August 2021
2. How America banks: household use of banking and financial services, 2019 FDIC
survey, December 2021
3. Minority depository institutions program, December 2021
4. Akula, R., Garibay, I.: Audit and assurance of AI algorithms: a framework
to ensure ethical algorithmic practices in artificial intelligence. arXiv preprint
arXiv:2107.14046 (2021)
5. Bakelmun, A., Shoenfeld, S.J.: Open data and racial segregation: mapping the
historic imprint of racial covenants and redlining on American cities. In: Hawken,
S., Han, H., Pettit, C. (eds.) Open Cities — Open Data, pp. 57–83. Springer,
Singapore (2020). https://doi.org/10.1007/978-981-13-6605-5 3
6. Federal Reserve Banks. Small business credit survey: 2021 report on employer firms
(2021)
7. Bartlett, R., Morse, A., Stanton, R., Wallace, N.: Consumer-lending discrimination
in the era of fintech. Unpublished working paper. University of California, Berkeley
(2018)
8. Bhutta, N., Chang, A.C., Dettling, L.J., et al.: Disparities in wealth by race and
ethnicity in the 2019 survey of consumer finances (2020)
9. Broady, K.E., McComas, M., Ouazad, A.: An analysis of financial institutions
in black-majority communities: black borrowers and depositors face considerable
challenges in accessing banking services, March 2022
10. Buckley, R.P., Arner, D.W., Zetzsche, D.A., Selga, E.: The dark side of digital
financial transformation: the new risks of fintech and the rise of techrisk. In: UNSW
Law Research Paper (19-89) (2019)
11. Caliskan, A., Bryson, J.J., Narayanan, A.: Semantics derived automatically from
language corpora contain human-like biases. Science 356(6334), 183–186 (2017)
12. Celis, L.E., Huang, L., Keswani, V., Vishnoi, N.K.: Classification with fairness
constraints: a meta-algorithm with provable guarantees. In: Proceedings of the
Conference on Fairness, Accountability, and Transparency, pp. 319–328 (2019)
13. Chakraborty, J., Majumder, S., Yu, Z., Menzies, T.: Fairway: a way to build fair ml
software. In: Proceedings of the 28th ACM Joint Meeting on European Software
Engineering Conference and Symposium on the Foundations of Software Engineer-
ing, pp. 654–665 (2020)
14. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic
minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
15. Federal Trade Commission, et al.: Big data: a tool for inclusion or exclusion?
understanding the issues. FTC report (2016)
16. Demyanyk, Y., Kolliner, D.: Peer-to-peer lending is poised to grow. Economic
Trends (2014)
17. Engel, K.C., McCoy, P.A.: A tale of three markets: the law and economics of
predatory lending. Tex. L. Rev. 80, 1255 (2001)
18. Fairlie, R., Robb, A., Robinson, D.T.: Black and white: access to capital among
minority-owned start-ups. Manage. Sci. 68, 2377–2400 (2021)
19. Friedman, B., Nissenbaum, H.: Bias in computer systems. ACM Trans. Inf. Syst.
(TOIS) 14(3), 330–347 (1996)
20. Frost, J.: The economic forces driving fintech adoption across countries. The tech-
nological Revolution in Financial Services: How Banks, Fintechs, and Customers
win Together, pp. 70–89 (2020)
21. Fuchs, D.J.: The dangers of human-like bias in machine-learning algorithms.
Missouri S&T's Peer to Peer 2(1), 1 (2018)
22. Fuster, A., Plosser, M., Schnabl, P., Vickery, J.: The role of technology in mortgage
lending. Rev. Financ. Stud. 32(5), 1854–1899 (2019)
23. Garg, P., Villasenor, J., Foggo, V.: Fairness metrics: a comparative analysis. In:
2020 IEEE International Conference on Big Data (Big Data), pp. 3662–3666. IEEE
(2020)
24. Grother, P.J., Ngan, M.L., Hanaoka, K.K., et al.: Face recognition vendor test part
3: demographic effects (2019)
25. Hassani, B.K.: Societal bias reinforcement through machine learning: a credit scor-
ing perspective. AI Ethics 1(3), 239–247 (2020). https://doi.org/10.1007/s43681-
020-00026-z
26. Howell, B.: Exploiting race and space: Concentrated subprime lending as housing
discrimination. Calif. L. Rev. 94, 101 (2006)
27. Howell, S.T., Kuchler, T., Snitkof, D., Stroebel, J., Wong, J.: Racial disparities in
access to small business credit: Evidence from the paycheck protection program.
Technical report, National Bureau of Economic Research (2021)
28. Jagtiani, J., Lemieux, C.: The roles of alternative data and machine learning in
fintech lending: evidence from the lendingclub consumer platform. Financ. Manage.
48(4), 1009–1029 (2019)
29. Johnson, K., Pasquale, F., Chapman, J.: Artificial intelligence, machine learning,
and bias in finance: toward responsible innovation. Fordham L. Rev. 88, 499 (2019)
30. Kallus, N., Mao, X., Zhou, A.: Assessing algorithmic fairness with unobserved
protected class using data combination. Manage. Sci. 68(3), 1959–1981 (2022)
31. Kamiran, F., Calders, T.: Data preprocessing techniques for classification without
discrimination. Knowl. Inf. Syst. 33(1), 1–33 (2012)
32. Kamishima, T., Akaho, S., Asoh, H., Sakuma, J.: Fairness-aware classifier with
prejudice remover regularizer. In: Flach, P.A., De Bie, T., Cristianini, N. (eds.)
ECML PKDD 2012. LNCS (LNAI), vol. 7524, pp. 35–50. Springer, Heidelberg
(2012). https://doi.org/10.1007/978-3-642-33486-3 3
33. Kurshan, E., Chen, J., Storchan, V., Shen, H.: On the current and emerging chal-
lenges of developing fair and ethical AI solutions in financial services. arXiv preprint
arXiv:2111.01306 (2021)
34. Liu, X.M., Murphy, D.: A multi-faceted approach for trustworthy AI in cyberse-
curity. Journal of Strategic Innovation & Sustainability 15(6), 68–78 (2020)
35. KPMG LLP. Algorithmic bias and financial services. Technical report (2021)
36. Mitchell, T.M.: The need for biases in learning generalizations. Department of
Computer Science, Laboratory for Computer Science Research . . . (1980)
37. Neal, M., Walsh, J.: The Potential and Limits of Black-Owned Banks. Urban Insti-
tute, Washington, DC (2020)
38. O'Neil, C.: Weapons of math destruction: how big data increases inequality and
threatens democracy. Broadway Books (2016)
39. Packin, N.G.: Consumer finance and AI: the death of second opinions? NYUJ
Legis. Pub. Pol'y 22, 319 (2019)
40. Perry, A., Rothwell, J., Harshbarger, D.: Five-star reviews, one-star profits: the
devaluation of businesses in black communities. Brookings Institution (2020)
41. Petrasic, K., Saul, B., Greig, J., Bornfreund, M., Lamberth, K.: Algorithms and
bias: what lenders need to know. White & Case (2017)
42. Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., Weinberger, K.Q.: On fairness
and calibration. In: Advances in Neural Information Processing Systems, vol. 30
(2017)
43. Quillian, L., Lee, J.J., Honoré, B.: Racial discrimination in the us housing and
mortgage lending markets: a quantitative review of trends, 1976–2016. Race Soc.
Probl. 12(1), 13–28 (2020)
44. Rawal, A., McCoy, J., Rawat, D., Sadler, B., Amant, R.: Recent advances in trust-
worthy explainable artificial intelligence: status, challenges and perspectives (2021)
45. Rea, S.: A survey of fair and responsible machine learning and artificial intelligence:
implications of consumer financial services. Available at SSRN 3527034 (2020)
46. Seamster, L.: Black debt, white debt. Contexts 18(1), 30–35 (2019)
47. Selbst, A.D., Barocas, S.: The intuitive appeal of explainable machines. SSRN
Electron. J. (2018). https://doi.org/10.2139/ssrn.3126971
48. Shoag, D.: The impact of fintech on discrimination in mortgage lending. Available
at SSRN 3840529 (2021)
49. Simonite, T.: When bots teach themselves to cheat. Wired Magazine, August 2018
50. Singh, R.: GK digest (2015)
51. Srivastava, B., Rossi, F.: Towards composable bias rating of AI services. In: Pro-
ceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 284–
289 (2018)
52. Zhang, Y., Zhou, L.: Fairness assessment for artificial intelligence in financial indus-
try. arXiv preprint arXiv:1912.07211 (2019)
Predict Individuals’ Behaviors from Their
Social Media Accounts, Different
Approaches: A Survey

Abdullah Almutairi(B) and Danda B. Rawat

EECS Department, Howard University, Washington, DC 20059, USA


{almutairi,db.rawat}@ieee.org

Abstract. Predicting individual behavior has been among the key
objectives in the social sciences, helping derive important insights, such
as the individuals to target with specific marketing material. For many
years, rudimentary behavioral, geographic, and demographic methods
were used for prediction. However, different researchers have developed
advanced, computerized methods to quantify emotional and sentimental
intensity from social media posts. Therefore, the research employed a lit-
erature review methodology to determine the main approaches proposed
in the last five years. IEEE Xplore was used to select reliable research
articles about different prediction methods. Four techniques were
identified: the Lexicon approach, the Louvain algorithm, Naïve Bayes
classification, and MCDM. Based on collected information, the Lexicon
approach can be used to arrange data into neutral, negative, and positive
labels that depict prevailing sentiments. It can also be combined with
Multi-Criteria Decision Making to detect emotions. Conversely, the Lou-
vain method is a clustering algorithm that can be employed for topic
modeling, the process of extracting a group of words from a set of doc-
uments that best represent the contained information. The Naïve Bayes
approach can also predict personality and emotions from typical social
media posts. The best results are attained when the method is combined
with statistical tests.

Keywords: Lexicon · Louvain · Naïve · MCDM

1 Introduction
Predicting individual behavior is a key objective in the social sciences, ranging
from business to sociology, psychology, and economics. In marketing, prediction
helps select groups of individuals to target with promotional material. Specif-
ically, organizations can determine people most likely to take action, such as
adopting a new product. For many years, demographic, geographic, and behav-
ioral targeting were the main prediction methods. However, with the increasing
availability of social media data, it is now possible to predict the behaviors of
individuals from their accounts. For example, [3] state that by analyzing the
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 823–836, 2023.
https://doi.org/10.1007/978-3-031-18461-1_54
824 A. Almutairi and D. B. Rawat

content of posted tweets, it is possible to understand users' sentimental and
emotional intensity. In recent years, researchers have developed various tools to
predict individual behaviors. This research uses a literature review to identify
the most effective available methods, which include the Lexicon approach, the
Louvain algorithm, Naïve Bayes classification, and multi-criteria decision making.
The current paper documents the application of the Lexicon approach, the Lou-
vain algorithm, Naïve Bayes classification, and Multi-Criteria Decision Making
(MCDM) for predicting individual behaviors from social media posts. The body
is organized into sections discussing each method. Comprehensive details on the
underlying analysis steps and applied algorithms with accompanying images are
provided. A summative table (Table 1) is also drafted, presenting each method’s
goal, advantages, and limitations. The paper ends with a conclusion section that
compares the discussed approaches to provide a holistic view of the general sen-
timent analysis concept.

2 Problem Statement

Different researchers have approached the topic from diverse dimensions and
with different techniques. However, their contributions to the topic are of
significant help to future researchers and anyone interested in understanding the
topic under investigation. These contributions also serve as a strategic approach
to enhancing insights concerning the topic. Essentially, various researchers have
criticized, while at the same time highlighting, the benefits of using some of
these methods to develop solutions for complex systems.

3 Background
According to [7], social ties that exist among online users of the social systems
play a vital role in determining the behaviors of such users. In particular, social
influence is one of the critical factors that shape the behaviors of the users
of online social networks. The actions of an individual user trigger his or her
friends to follow the same trend of behavior while using the same social network.
Behavioral modes, ideas, and new technological advancements can easily spread
via social networks through the power of social influence among the users. For
instance, the interaction of people on online social networks like Facebook and
Flicker attracts huge traffic that denotes the possible influence of users on the
behavioral aspects of other users in the same social networks [7]. The extent of
influence is immense due to the huge number of users on online social platforms.
Notably, online social networks avail huge volumes of data about the actions of
their users. As a result, it is possible to extract and analyze such data for
studies regarding the influence of individuals' actions on the behaviors of fellow
users sharing the same networks.
Predict Individuals’ Behaviors 825

4 Methodology
The current paper does not generate new knowledge but compiles research on
the different approaches to predict individuals’ behaviors from their social media
accounts. A literature review methodology was used to fulfill this overarching
aim. IEEE Xplore was the main search tool, targeting peer-reviewed articles pub-
lished in the last five years. The string “predicting behaviors from social media
accounts” was initially used to select the first potential sources. Twenty-one
sources were identified and scrutinized to identify the main prediction approaches
available in literature. The main varieties identified included the Lexicon app-
roach, the Louvain algorithm, Naïve Bayes classification, and Multi-Criteria
Decision Making (MCDM). IEEE was then used to identify specific publications
on each presented method. The string “-prediction method- for predicting behav-
iors from social media accounts’ was used for each approach”. For example, the
string “Lexicon approach for predicting behaviors from social media accounts’
was used to identify sources for the said method”. Cumulatively, 10 sources were
selected using this technique. The identified references were analyzed based on
several criteria. Firstly, they had to align with the literature review’s overarching
aim. Secondly, the sources had to be peer-reviewed and published by reputable
journals or presented in major conferences. Notably, journals and conferences
focusing on computing applications, data science, analytics, and computational
linguistics were given more weight. The evidence extracted from the references
was collated, summarized, aggregated, organized, and compared. Ultimately, the
four prediction methods were elaborated based on the derived insights.

4.1 Lexicon Approach


The Lexicon approach is useful for sentiment analysis, which arranges collected
data into neutral, negative, and positive labels that depict prevailing behaviors
and tendencies. According to [2], this method is based on sentiment words. Data
must be collected from a specific social media platform and a corpus built to
perform the underlying analysis.
Consequently, the data is cleaned through various processes, such as remov-
ing numbers, URLs, punctuation, and whitespace. An appropriate matrix is then
used to convert unstructured data into structured formats. After that the Lexi-
con approach is implemented to derive sentiment scores. Notably, unlike machine
learning, this method does not need data training and labeling. However, an elab-
orate emotion dictionary is required for each dataset. Ultimately, this process
accurately depicts users’ attitudes on a particular subject.
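The cleaning and scoring steps described above can be sketched as follows. The tiny lexicon here is an invented placeholder; as noted, a real application needs an elaborate emotion dictionary for each dataset:

```python
import re

# Illustrative mini-lexicon: word -> sentiment score (assumed, not from [2]).
LEXICON = {"good": 1, "great": 2, "happy": 1,
           "bad": -1, "terrible": -2, "sad": -1}

def clean(text):
    """Remove URLs, numbers, punctuation, and extra whitespace."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)  # drops digits and punctuation
    return re.sub(r"\s+", " ", text).strip()

def sentiment(text):
    """Sum word scores and map the total to a sentiment label."""
    score = sum(LEXICON.get(token, 0) for token in clean(text).split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

URLs are stripped before punctuation so that link fragments do not leak into the token stream; the final label simply follows the sign of the summed score.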
The Lexicon approach can also be combined with Multi-Criteria Decision
Making (MCDM) to detect emotions. The MCDM employs a decision-making
matrix to arrive at a logical decision-making scenario [13].
Similar to conventional sentiment analysis, the Lexicon method requires an
emotion lexicon. For example, [1] manually built an emotion lexicon in Ara-
bic before analyzing amassed tweets. They derived five emotion types based on

Fig. 1. Co-Plot algorithm. [1]

negative and positive tweets: disgust, anger, fear, sadness, and happiness. Conse-
quently, the lexicon approach was combined with MCDM to categorize collected
tweets into the identified emotion clusters using the Co-Plot method. This pro-
cess adequately classifies fine-grained and mixed emotions without the need for
factor analysis [1]. Thus, Co-Plot (Fig. 1) analysis can depict the emotional posi-
tioning of social network posts using a two-dimensional analysis surface. This
possibility further proves the feasibility of the Lexicon approach to predict emo-
tions and their intensity.
A hybrid approach combining the Lexicon method with Multi-Criteria Decision
Making can thus be applied, using a predetermined plot to define and evaluate the
text by constructing a two-dimensional graphical analysis space in which one
dimension reflects the observations (tweets) and the other reflects the ratings.

4.2 Louvain Method


The Louvain method is a hierarchical clustering algorithm that detects commu-
nities in expansive networks. [4] state that this prediction strategy can be used
for topic modeling, the process of extracting a group of words from a set of doc-
uments that best represent the contained information. Topic modeling is essen-
tially a text mining approach that can help predict individual mannerisms. They
also propose a four-step topic modeling process (Fig. 3): pre-processing, Adaptive
Distribution of Vocabulary Frequencies (ADVF) analysis, co-occurrence extrac-
tion, and Louvain calculus and topic identification. This technique relies on
datasets comprising texts, eliminating the need for additional information from
social media.
Since the data is derived from online social networks, filtering and cleaning
in the pre-processing phase are imperative. For example, unnecessary spacing,

Fig. 2. Lexicon approach pipeline (incoming tweet → segmentation → extraction of
behavioral words against the behavioral-words lexicon → scoring algorithm →
generation of the tweet-by-emotion matrix → matrix normalization → dissimilarity
measurement → multidimensional scaling → mapping of behavioral variable scores →
tweets and behavioral state)

special characters, and links must be eliminated. Consequently, tokenization is
applied, after which ADVF analysis is implemented. The ADVF method reveals
the terms that can be regarded as noise. Specifically, a probabilistic frequency of
terms’ sensitivity to noise is created. Consequently, the N terms with a minor dif-
ference between probabilistic and real frequencies are chosen. The co-occurrence
of the selected terms is then computed to produce an adjacency list that gen-
erates a graph. After that, the Louvain method is used to split each graph into
communities by always prioritizing the modularity optimization [6].
Lastly, the derived communities are categorized as either artificial or natural
topics. The latter are associated with the base theme, while spamming activities
produce the former. Thus, the Louvain approach can help separate individual
behaviors on social media networks from bot activities, which are often
mistakenly taken as main topics.
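The co-occurrence extraction step can be sketched as counting, for each pair of selected terms, the number of documents in which both appear; the result is the weighted adjacency list fed to the Louvain stage. This is a simplified illustration with invented documents, and [4] may weight edges differently:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(documents, vocabulary):
    """Weighted adjacency list: edge (a, b) counts the documents
    that contain both selected terms a and b."""
    edges = Counter()
    for doc in documents:
        present = sorted(set(doc.split()) & set(vocabulary))
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges

docs = [
    "fintech credit bias model",
    "credit bias audit",
    "fintech credit model",
]
vocab = {"fintech", "credit", "bias", "model"}  # terms kept after filtering
graph = cooccurrence_graph(docs, vocab)
```

Sorting each pair gives a canonical edge key, so (a, b) and (b, a) accumulate into the same counter entry.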
Notably, the Louvain method is based on the concept of modularity, offering
a measure of network quality. The modularity Q is given a value between 0 and
1, whereby values closer to 1 indicate a strong community structure. Q can
be derived using Eq. 1 below.

Fig. 3. Proposed approach for topic modeling [4] (components: database →
pre-processing → ADVF → co-occurrence → Louvain → base)

Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),    (1)

where k_i = \sum_j A_{ij} is the sum of the weights of the edges attached to node i,
A_{ij} is the weight of the edge between nodes i and j, c_i is the community
containing node i, \delta(c_i, c_j) equals 1 when the two communities coincide and
0 otherwise, and m = \frac{1}{2} \sum_{ij} A_{ij} is the total edge weight.
The Louvain method comprises two stages that recur iteratively. For a graph
containing N nodes, each node initially forms its own community. In the first phase,
each node i is moved to the neighboring community that yields the maximum positive
modularity gain. This phase ends when a local maximum is attained and no additional
modularity improvement is possible. Phase two involves generating a new graph in
which the communities become the new nodes. The stages are iterated until no
further modularity gain is attainable.
These processes produce highly accurate topic modeling results.
The authors employed the Louvain algorithm as a tool for the identification
of the communities to ensure high levels of accuracy and the ability to analyze
huge datasets [11]. It also enables the separation of communities based on mod-
ularity scores leading to improved efficiency compared to the tools used by [14].
After sorting users into communities, labeling of the communities based on likes,
retweets, and hashtags follows. The authors applied a support vector machine
(SVM) algorithm to label the identified communities due to its high level of
accuracy. In particular, the efficiency of SVM hinges on its ability to demarcate
the communities using a separating line (hyperplane) [11].

4.3 Naı̈ve Bayes Classification


Naı̈ve Bayes Classification offers among the most reliable approaches to derive
diverse inferences from online social networks. The authors in [5] illustrate that
the said method can be used to predict the personality of Facebook users from
their status posts. Specifically, individuals can be categorized based on the big
five traits: neuroticism, agreeableness, extraversion, conscientiousness, and
openness (Fig. 4) [5].
The underlying process begins with collecting data and text mining. The
main steps in the latter approach are text preprocessing, tokenizing, and select-
ing features. Preprocessing involves converting unstructured data into structured
formats, while tokenizing separates words in a sentence and changes all letters
to lowercase. Finally, feature selection removes unimportant
words. Consequently, the Naïve Bayes classifier is used to determine the highest-
probability category for each data point, assigning it to the most appropriate
cluster. Equation 2 represents the typical Naïve Bayes process.

P(H|X) = P(X|H) P(H) / P(X)    (2)

where P(X) is the probability of the data X, P(X|H) is the probability of X given
hypothesis H, P(H) is the prior probability of hypothesis H, and P(H|X) is the
posterior probability of H given X; here H is the hypothesis and X is the data whose
cluster is ambiguous. Using this equation produces a high accuracy level, validating
the consequent predictions.
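Eq. 2 can be sketched for a small, hypothetical two-class setting (the priors and likelihoods below are invented for illustration). The posterior of each hypothesis is the prior times the likelihood, normalized by the evidence P(X):

```python
# Sketch of Eq. 2: posterior P(H|X) from prior P(H) and likelihood P(X|H).
def posterior(prior, likelihood):
    """Return P(H|X) for every hypothesis H.

    P(X) is the sum of P(X|H)*P(H) over all hypotheses, so the
    returned posteriors sum to 1.
    """
    joint = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(joint.values())  # P(X)
    return {h: p / evidence for h, p in joint.items()}

# Hypothetical example: 60% of training posts are "positive",
# but the observed post X is more likely under the "negative" class.
prior = {"positive": 0.6, "negative": 0.4}
likelihood = {"positive": 0.02, "negative": 0.05}  # P(X|H)
post = posterior(prior, likelihood)
print(post)  # posterior favors "negative" despite the larger prior
```

The same normalization underlies the classifier's decision rule: the category with the largest posterior is assigned to the data point.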

Fig. 4. Big Five personality model: openness, conscientiousness, neuroticism, extraversion, and agreeableness

Naı̈ve Bayes Classification can also be used to predict individuals’ emotions.


The authors in [14] combine Naïve Bayes with statistical tests to investigate the
impact of emotional tweets on user relationships. This approach begins with
collecting data from selected users. Data on the subject under investigation must
also be amassed to train the Naı̈ve Bayes classifier. This information is manually
classified as neutral, positive, or negative. Next, the average number
of negative and positive posts is computed, and a matching number of posts
is randomly selected for training. The dataset is then analyzed
using the trained classifier to determine the emotion scores. Users are then sorted
according to their emotion scores, and statistical tests validate the findings. This
process accurately predicts the variation of user relationships between negative
and positive groups [10].
The authors of [14] explored the impact of emotional Twitter posts on the relations
among online users. People across the world use Twitter to post emotional
messages because it is an easy-to-use platform. Twitter limits a message to a
maximum of 280 characters, making it easy for users to post sensational messages
from their Twitter handles and to get responses from followers [14]. Additionally,
Twitter allows a user to tag other Twitter users in a post without requesting
permission. In this regard, positive messages enable a user to attract a mass
following; on the other hand, tweets that hurt the feelings or emotions of other
users are likely to scare followers away. The study employed the machine-learning
tool known as naive Bayes, whose standard formulation is given in Eq. 3, to extract
data about the emotional tweets posted by Twitter users.

P(y | x_1, ..., x_n) = P(y) P(x_1, ..., x_n | y) / P(x_1, ..., x_n)    (3)
The use of naive Bayes enabled the classification of tweets into negative and
positive tweets [14]. As a result, this allows the study of emotional messages
posted by Twitter users and the impact of such emotional posts on the relation-
ships of the individuals using the social media network. The authors employed
the Brunner-Munzel test to carry out the analysis of user relationships and the
influence on the followers on the Twitter platform. The use of naive Bayes classi-
fication showed that accuracy increases with the use of data from many Twitter
users [14]. This machine-learning tool showed a high level of accuracy in the
classification of messages into negative, neutral, and positive compared to other
tools like the decision tree and random forest classification tools. Consequently,
this proved that the use of the naive Bayes tool yields accurate results in the
analysis of emotional Twitter messages, especially with data from many users [9].
As a result, it helps in the identification of both pessimist and optimist
aspects of the posts [14]. The use of graph theory enabled researchers to
analyze the data based on the density of followership on Twitter. The classification
of the follower networks into low co-link and high co-link categories helps in
understanding the patterns of interactions and the behaviors of the followers in
both categories. In particular, a few famous users dominated the conversations
and interactions in low co-link groups. The detection of communities is the key
aspect of the user relationships on the Twitter platform [14]. According to [14],
positive emotional tweets resulted in an increase in the number of followers
compared to negative tweets. The application of naive Bayes classification allowed
the researchers to incorporate the emotional word dictionary in gauging the feel-
ings featured in the posts as well as evaluating the effect of such emotions on
user relationships. As a result, the emotions of the users have substantial effects
on the network of followers. The nature of groups also influences the patterns of
followers and interaction among users. Non-restricted tagging enabled Twitter
users to express their feelings and emotions on Twitter, thereby availing accurate
data regarding the impact of emotional tweets on user relationships [14].
The increase in the number of online users on social networks led to exponen-
tial growth in the volumes of data from online social networks. The exponential
expansion of online social networks led to alterations in patterns of social inter-
actions over social network platforms [11]. Beyond entertainment,
social media has become inseparable from users across the globe. Twitter is one of
the social media platforms that revolutionized social interactions among peo-
ple. In particular, Twitter users with similar interests and goals tend to form
some sort of categories or communities for the propagation of information. Addi-
tionally, potential users such as decision-makers and analysts utilize such huge
volumes of data and information in making informed decisions that satisfy their
target goals and objectives [16]. However, data analysts must first detect or
identify specific communities in social networks as a way of making appropri-
ate outcomes and reports from the data. In this regard, the authors applied big
data techniques to identify the communities on Twitter. The big data solution
applied a multi-criteria approach for searching Twitter [11]. Additionally, the
study applied a social graph as a way of boosting the accuracy of the results
from the analysis.

4.4 Multiple-Criteria Decision-Making Approach

The authors in [13] examined the application of MCDM in selecting the best solution
to a problem, focusing on the contribution that MCDM makes to the Technique for
Order Preference by Similarity to Ideal Solution (TOPSIS). Fig. 5 summarizes the
method for selecting the best solution alternative from a wide range of possible
alternatives. The MCDM simplifies the TOPSIS method
by enabling an extensive assessment of each alternative against all the
criteria: a decision matrix compares a given number of criteria against a given
number of alternatives to arrive at the best possible solution. A standard
TOPSIS method focuses on the
distance between a negative ideal solution (NIS) and a positive ideal solution
(PIS) [13]. However, to arrive at each of these solutions, the TOPSIS method
requires MCDM to select the best possible PIS and NIS. In setting the appli-
cation of MCDM on TOPSIS, [13] examined case studies across various fields
such as energy management, stock exchange, marketing management, supply
chain management, and engineering, among others. Despite the improvement
in decision making by applying MCDM in TOPSIS, the method still has sev-
eral limitations, which can affect the quality of the final decision. However, the
use of MCDM in TOPSIS showed that it is possible to solve these limitations
by using variant calculation methods that enhance the strengths of alternative
assessments for better decision-making. Predicting individual behavior on social
media is therefore a complex problem because of the many variables involved.
However, the decision matrix of MCDM, which has simplified complex problems in
technical fields such as engineering, might be the best technique for this
problem [15].
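The TOPSIS steps can be sketched as follows. This is a minimal illustration, not the cited implementation: it assumes benefit-type criteria only (larger is better) and externally supplied weights, and the score matrix is hypothetical.

```python
# Hedged sketch of TOPSIS: rows = alternatives, columns = benefit criteria.
import math

def topsis(matrix, weights):
    cols = list(zip(*matrix))
    # Steps 1-3: vector-normalize each column, then apply the weights.
    norms = [math.sqrt(sum(x * x for x in col)) for col in cols]
    V = [[w * x / n for x, n, w in zip(row, norms, weights)]
         for row in matrix]
    # Step 4: PIS = column-wise best, NIS = column-wise worst.
    pis = [max(col) for col in zip(*V)]
    nis = [min(col) for col in zip(*V)]
    # Steps 5-6: separation measures and relative closeness.
    def dist(row, ref):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(row, ref)))
    closeness = [dist(r, nis) / (dist(r, nis) + dist(r, pis)) for r in V]
    # Step 7: rank alternatives, best (largest closeness) first.
    return sorted(range(len(matrix)), key=lambda i: closeness[i],
                  reverse=True)

# Three hypothetical alternatives scored on two benefit criteria.
scores = [[7, 9], [8, 7], [9, 9]]
print(topsis(scores, weights=[0.5, 0.5]))  # → [2, 0, 1]
```

Cost-type criteria would only change step 4 (swapping min and max for those columns), which is one of the variant calculations the survey alludes to.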

Fig. 5. TOPSIS method algorithm: (1) establish a decision matrix; (2) normalize the decision matrix; (3) calculate the weighted normalized matrix; (4) determine the NIS and PIS; (5) calculate the separation measures; (6) calculate the relative closeness to the ideal solution; (7) rank the preference order



Table 1. Summary of the main sentiment analysis methods

Lexicon Approach
  Goals: employs a pre-determined sentiment lexicon to combine the sentiment
  scores of all words in a document; in sentiment analysis, this approach can
  organize data into positive, neutral, and negative clusters depicting
  prevailing behaviors.
  + Fairly easy to implement
  + Does not need extensive data training
  + Can be combined with other sentiment analysis techniques, such as
    Multi-Criteria Decision Making (MCDM)
  + Sentiment lexicons can be human-, corpus-, or dictionary-based
  - Can be highly inaccurate
  - Fails to detect sarcasm in written text

Louvain Method
  Goals: essentially a hierarchical clustering algorithm for detecting
  communities in expansive networks; in sentiment analysis, the primary goal
  is to extract a cluster of words from documents that best represents the
  contained information.
  + Adequately simple and easy to implement for detecting communities in
    expansive networks
  + Facilitates zooming into communities to identify sub-communities and
    sub-sub-communities
  + Highly accurate
  - Can identify arbitrarily and badly connected communities
  - The conventional Louvain method cannot split communities after merging
  - Can experience the resolution limit problem

Naïve Bayes Classification
  Goals: founded on the Bayes Theorem and assumes predictor independence;
  aims to determine the probability of a test point fitting in a class
  instead of the test point's label.
  + Fast and easy to predict a test dataset class
  + Effective for multi-class prediction
  + Performs better than other alternatives when the independence assumption
    holds
  + Requires minimal training data
  + Well suited for categorical input variables
  - Experiences the zero-frequency problem
  - The assumption of independent predictors is mostly absent in real life

MCDM Method
  Goals: aims to produce a logical decision-making scenario using a
  decision-making matrix.
  + Relatively simple computational process based on a comprehensive logic
  + Relatively short calculation stages
  + Applicable to both qualitative and quantitative data
  + Comprises basic statistical operations
  + Highly accurate for single-dimension problems
  - Does not accommodate the integration of multiple preferences
  - Highly susceptible to inaccuracies in the case of interdependent
    objectives and variables
  - Based on the hypothesis of compensatory evaluation criteria

5 Discussion

Based on [1] findings, the Lexicon approach based on the Co-Plot method can
be decomposed into different sections, as illustrated in Fig. 1. The system starts
by accepting a tweet comprising several words. A tokenization process based
on the AMIRA toolkit is then deployed. The resultant outcome is a sequence of
segmented words from which emotion words can be extracted. Consequently, the
emotion scoring algorithm captured in Fig. 2 is launched. Notably, the percentage
method is used to handle negation scoring and intensification in the algorithm.
For example, if a tweet contains a word from the intensifier list, such as “I am
very happy”, the happiness emotion score is 4.0, and the intensifier word score
is 50%. Thus, the happiness score is 4.0 * (100% + 50%) = 6.0 [1].
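The intensification rule in this example can be sketched in a few lines. Only the "very" = 50% case comes from the text; the "extremely" weight and the token-matching scheme are illustrative assumptions.

```python
# Sketch of percentage-based intensification: score * (100% + weight).
INTENSIFIERS = {"very": 0.50, "extremely": 0.75}  # "extremely" is hypothetical

def intensified_score(base_score, tokens):
    """Apply the first intensifier found in the token list, if any."""
    for token in tokens:
        if token in INTENSIFIERS:
            return base_score * (1.0 + INTENSIFIERS[token])
    return base_score

# "I am very happy": happiness base score 4.0, "very" adds 50%
print(intensified_score(4.0, ["i", "am", "very", "happy"]))  # → 6.0
```

A negation list would work analogously, scaling the score down or flipping its sign according to the chosen percentage scheme.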
A tweet-by-emotion matrix is produced for every n number of tweets, and
the Co-Plot algorithm is triggered. This algorithm comprises four stages. The matrix
X_{n×k} is normalized into Z_{n×k} to ensure equal treatment of variables, and a
dissimilarity measure D_il ≥ 0 is computed between every pair of observations [1].
Consequently, the multidimensional scaling (MDS) approach is used to map the D_il
matrix, and k arrows are produced in the Euclidean expanse [1]. These processes
result in a graphical, analytical 2-D Euclidean space denoting the affiliation
between tweets and emotion states.
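The first two Co-Plot stages can be sketched as follows. This is an illustration under stated assumptions: column-wise z-score normalization and a Euclidean dissimilarity, since the paper does not fix the exact measure here; the tiny matrix X is hypothetical.

```python
# Sketch: normalize X (n tweets × k emotions) into Z, then compute D_il >= 0.
import math

def normalize(X):
    """Z-score each column so all emotion variables are treated equally."""
    n, k = len(X), len(X[0])
    Z = [[0.0] * k for _ in range(n)]
    for j in range(k):
        col = [row[j] for row in X]
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        for i in range(n):
            Z[i][j] = (X[i][j] - mean) / std if std else 0.0
    return Z

def dissimilarity(Z):
    """Symmetric matrix of Euclidean distances D_il between observations."""
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(zi, zl)))
             for zl in Z] for zi in Z]

X = [[4.0, 0.0], [1.0, 2.0], [0.0, 3.0]]  # 3 tweets × 2 emotion scores
D = dissimilarity(normalize(X))
print(round(D[0][2], 3))
```

The resulting D matrix is exactly what the MDS stage embeds in the 2-D Euclidean space.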
The results were presented in a web application generated
with the help of the Louvain algorithm. The machine learning tools employed
in this research led to an improvement in the real-time detection of communi-
ties. The capturing of real-time data enabled researchers to update the social
graph [12]. The multi-criteria approach used to search for information proved use-
ful in the detection of communities found in social media sites like Twitter. Con-
sequently, this allows accurate analysis of data and informed decisions based on
real-time data on issues of interest to the data analysts and users [11].
Lastly, [8] researched the ability of network-metrics-based social network
analysis to detect influential social nodes. The behavior and role of an
individual are significant indicators of importance on social media. The network
analysis conducted by [8] focused on various metrics, such as the clustering
coefficient, centrality measures, density, and PageRank, because of their
importance in visualizing influential nodes in social networks.
Because of the huge traffic of entry and exit into social networks daily, the
detection of influential nodes is a complex problem. However, the application of
a social network analysis that applies network metrics helps to yield sufficient
identification outcomes.

6 Conclusion

Previously, demographic, geographic, and behavioral targeting were the main


methods of predicting people’s behaviors. However, the advent and widespread
uptake of online social networks have presented an opportunity to advance
behavior prediction for business, sociology, psychology, and economic objectives.
Users’ posts on various platforms, such as Twitter and Facebook, provide suffi-
cient data to draw valid inferences through sentimental and emotional analyses.
The most reliable prediction systems presented in literature include the Lexi-
con approach, the Louvain algorithm, Naı̈ve Bayes classification, and MCDM.
The Lexicon method arranges collected data into neutral, negative, and positive
labels that depict prevailing behaviors and tendencies. Contrarily, Naı̈ve Bayes
Classification determines the highest probability value for categorizing data in
the most relevant category or generates emotion scores. The Louvain approach
is a hierarchical clustering algorithm that detects communities in expansive
networks, while the MCDM can predict the behavior from a user's posts. Although
these technologies vary structurally and functionally, they involve similar initial
processes, such as data collection, preprocessing, tokenizing, and selecting fea-
tures. However, the accuracy of predicting individual behavior also introduces
privacy concerns for users. Therefore, future studies should quantify the perti-
nent security threats and offer appropriate mitigations.

References
1. Abd Al-Aziz, A.M., Gheith, M., Eldin, A.S.: Lexicon based and multi-criteria deci-
sion making (MCDM) approach for detecting emotions from Arabic microblog
text. In: 2015 First International Conference on Arabic Computational Linguistics
(ACLing), pp. 100–105 (2015). https://doi.org/10.1109/ACLing.2015.21
2. Tiwari, D., Singh, N.: Sentiment analysis of digital India using lexicon approach.
In: 2019 6th International Conference on Computing for Sustainable Global Devel-
opment (INDIACom), pp. 1189–1193 (2019)
3. Chatzakou, D., et al.: Detecting Cyberbullying and Cyberaggression in Social
Media. ACM Trans. Web 13, 3, Article 17, August 2019, 51 pages. https://doi.
org/10.1145/3343484
4. Kido, G.S., Igawa, R.A., Junior, S.B.: Topic Modeling based on Louvain method
in Online Social Networks. In: Proceedings of the XII Brazilian Symposium on
Information Systems on Brazilian Symposium on Information Systems: Informa-
tion Systems in the Cloud Computing Era - Volume 1 (SBSI 2016). Brazilian
Computer Society, Porto Alegre, BRA, pp. 353–360 (2016)
5. Sarwani, M., Salafudin, M., Sani, D.: Knowing personality traits on Facebook
status using the Naı̈ve Bayes classifier. Int. J. Artif. Intell. Robot. (IJAIR) 2, 22
(2020). https://doi.org/10.25139/ijair.v2i1.2636
6. Samuel, H., Noori, B., Farazi, S., Zaiane, O.: Context prediction in the
social web using applied machine learning: a study of Canadian Tweeters. In:
IEEE/WIC/ACM International Conference on Web Intelligence (WI), vol. 2018,
pp. 230–237 (2018). https://doi.org/10.1109/WI.2018.00-85
7. Anagnostopoulos, A., Kumar, R., Mahdian, M.: Influence and correlation in social
networks. In: Proceedings of the 14th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD 2008). Association for Computing
Machinery, New York, NY, USA, 7–15 (2008). https://doi.org/10.1145/1401890.
1401897
8. Farooq, A., Joyia, G.J., Uzair, M., Akram, U.: Detection of influential nodes using
social networks analysis based on network metrics. In: 2018 International Confer-
ence on Computing, Mathematics and Engineering Technologies (iCoMET), pp.
1–6 (2018). https://doi.org/10.1109/ICOMET.2018.8346372
9. Meeragandhi, G., Muruganantham, A.: Potential influencers identification using
multi-criteria decision making (MCDM) methods. Procedia Comput. Sci. 57, 1179–
1188 (2015). https://doi.org/10.1016/j.procs.2015.07.411
10. King, I., Li, J., Chan, K.T.: A brief survey of computational approaches in social
computing. In: International Joint Conference on Neural Networks, pp.1625–1632
(2009). https://doi.org/10.1109/IJCNN.2009.5178967

11. Nizar, L., Yahya, B., Mohammed, E.: Community detection system in online social
network. In: Fifth International Symposium on Innovation in Information and
Communication Technology (ISIICT), pp. 1–6 (2018). https://doi.org/10.1109/
ISIICT.2018.8613285
12. Ozer, M., Kim, N., Davulcu, H.: Community detection in political twitter networks
using nonnegative matrix factorization methods (2016)
13. Panda, M., Jagadev, A.K.: TOPSIS in multi-criteria decision making: a survey.
In: 2018 2nd International Conference on Data Science and Business Analytics
(ICDSBA), pp. 51–54 (2018). https://doi.org/10.1109/ICDSBA.2018.00017
14. Tago, K., Jin, Q.: Analyzing influence of emotional tweets on user relationships
by Naive Bayes classification and statistical tests. In: 2017 IEEE 10th Conference
on Service-Oriented Computing and Applications (SOCA), pp. 217–222 (2017).
https://doi.org/10.1109/SOCA.2017.37
15. Çakır, E., Ulukan, Z.: An intuitionistic fuzzy MCDM approach adapted to mini-
mum spanning tree algorithm for spreading content on social media. In: 2021 IEEE
11th Annual Computing and Communication Workshop and Conference (CCWC),
pp. 0174–0179 (2021). https://doi.org/10.1109/CCWC51732.2021.9375942
16. Umamaheswari, S., Harikumar, K.: Analyzing product usage based on twitter
users based on datamining process. In: 2020 International Conference on Compu-
tation, Automation and Knowledge Management (ICCAKM), pp. 426–430 (2020).
https://doi.org/10.1109/ICCAKM46823.2020.9051488
Environmental Information System Using
Embedded Systems Aimed at Improving
the Productivity of Agricultural Crops
in the Department of Meta

Obeth Hernan Romero Ocampo1,2(B)


1 Fundación Universitaria Compensar, Cra 32 # 34 – 76, Villavicencio, Colombia
[email protected], [email protected]
2 American University of Europe, Cancún Av. Bonampak Sm. 6-Mz, Cancún, Mexico

Abstract. Agriculture is an activity that has been passed down from generation
to generation and has endured over time. The environmental factors that influence
agriculture, together with the climate variability caused by climate change, have
altered planting procedures; for that reason, precision agriculture is an important
tool for the development of these activities. It also helps guarantee agricultural
production through information systems that involve automation and real-time
information processing. In this manner, the implementation of sensor networks
makes it possible to determine the environmental conditions of temperature,
humidity, and air in order to control and monitor an agricultural system and
identify changes during a harvest period. Based on these factors, a database is
created to store the information from the sensors, and queries and decision
making are supported through a web application with internet access. Ubiquity,
distance, and communication also play an important role in the real-time control
and monitoring of agricultural crops.

Keywords: Arduino · Sensors · Agriculture · Network · Programming

1 Introduction

The problem of world overpopulation and the massive exploitation of the resources
present on the planet has caused innumerable social difficulties in different regions of
the world that affect people's quality of life and put their food security at risk.
It is also worth mentioning that this work aligns with two United Nations Sustainable
Development Goals: goal two (Zero Hunger) and goal eleven (Sustainable Cities and
Communities) [1].
The concept of food security has been used in national spheres of government
since the 1990s. However, it has not been possible to consolidate a government policy
that guarantees that people do not suffer from this scourge [2]. On the international
scene, the Food and Agriculture Organization of the United Nations has proposed

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 837–849, 2023.
https://doi.org/10.1007/978-3-031-18461-1_55
838 O. H. Romero Ocampo

sustainability policies and urged the world's governments to implement such food
and environmental sustainability policies [3]. The main obstacle of the traditional
way of doing agriculture in Colombia is that it is a manual, artisanal process in
which the inclusion of technology is very limited; productivity levels are therefore
very low compared to the technified agriculture of the great world powers, where
research and development have a direct impact on the entire production chain,
including food production [2]. In addition to this low crop productivity, free trade
agreements have opened the market, allowing large amounts of foreign food and
products to be imported into the national territory. Because these imported products
come from processes in which technology has notably increased productivity and
efficiency, the local market struggles to keep product prices at the level of such
disproportionate competition.
Information systems, automation, and advances in internet and web technologies
provide platforms where information becomes easier to access. Information systems
are therefore appropriate tools for designing an agricultural control and monitoring
system, where the interaction between electronic devices and programming languages
enables the connectivity of all physical elements in their environment, allowing
users to access and control devices from anywhere at any time. In addition, the
internet is a dynamic information network that can be adapted to any place or
situation and that allows communication between sensors and intelligent devices,
enabling decision making.
Automating a process requires more than an embedded system: although optimizing
resources and tasks is important, the data and information that the system produces
must also be taken into account, since such systems often lose data due to a lack of
storage and processing. For the development of this project, the environmental
conditions of the passion fruit crop were considered within its production process.
The aim was to control the temperature, humidity, and air variables of the fruit;
therefore, storing the data and presenting it through the web is another function of
this project, enabling traceability of the system for future analyses, as well as
providing enough information to generate predictions and improve the conditions of
agricultural production.

2 Materials and Methods


2.1 Materials
1. DHT22 ambient temperature and humidity sensor
2. MQ-135 air quality sensor
3. Arduino Mega 2560 board
4. Arduino Ethernet Shield

2.2 Methods
The research project uses a mixed method, with quantitative and qualitative
data that allow the crop's behavior to be analyzed. The environmental information
system using embedded systems, aimed at improving the productivity of agricultural
crops in the department of Meta, takes place in three fundamental stages.
Environmental Information System Using Embedded Systems 839

In the first stage, the environmental conditions of temperature, humidity and air of
an agricultural crop are determined.
The meteorological data of the Vanguardia station belonging to IDEAM, the
Institute of Hydrology, Meteorology and Environmental Studies of Colombia, are
consulted, as is the NASA weather database.
In the second stage, a control system is implemented to maintain the environmental
conditions of the crop.
The different sensors and devices for the system are identified by means of a com-
parison matrix discussed below. With the AutoCAD software, the electronic circuit is
designed to make the corresponding implementation.
In the final stage, a web application is developed to provide information on the
conditions of the crop in real time.
Several programming languages are used alongside the hardware connections: the
electronic devices are programmed with Arduino sketches; free software such as
XAMPP, PHP, and MySQL is used to store the sensor data; HTML tags are added from
the Arduino sketch code to send the data to the database; and the web application
is developed with HTML and JavaScript tools.

3 Results

Temperature and humidity data are obtained from the Vanguardia weather station (code
35035020 from the Institute of Hydrology and Environmental Studies - IDEAM),
acquired as flat files that were processed for analysis of the last 11 years and
classified in Tables 1, 2, 3 and 4 by their minimum and maximum monthly values per
year, with temperature in degrees Celsius and humidity as a percentage.

Table 1. Maximum monthly temperature value (°C)

Parameter            Year  Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
Maximum temperature  2010  37.6  38.2  35.3  33.7  32.4  32.4  32.0  32.6  33.4  33.6  32.6  32.7
Maximum temperature  2011  34.0  34.4  34.4  32.7  32.2  32.4  32.2  33.2  32.6  33.0  32.4  33.0
Maximum temperature  2012  34.0  34.4  34.4  32.7  32.2  32.4  32.2  33.2  32.6  33.0  32.4  33.0
Maximum temperature  2013  34.9  35.8  34.4  34.6  32.2  32.0  31.4  31.4  33.2  32.8  31.8  33.2
Maximum temperature  2014  34.9  35.8  34.8  34.2  33.4  32.8  32.6  33.4  33.8  32.8  32.8  33.4
Maximum temperature  2015  33.6  34.8  36.4  34.3  33.4  32.0  31.8  33.4  34.6  33.6  32.2  34.0
Maximum temperature  2016  36.2  36.4  37.2  33.4  32.0  31.8  31.6  32.6  33.0  34.0  32.3  –
Maximum temperature  2017  36.2  36.4  37.2  33.4  32.0  31.8  31.6  32.6  33.0  34.0  32.3  –
Maximum temperature  2018  33.8  35.4  35.8  32.1  32.0  30.8  30.4  32.6  33.2  33.5  32.2  33.6
Maximum temperature  2019  34.8  35.8  35.6  32.8  33.0  31.9  31.2  32.6  32.8  34.4  32.2  33.2
Maximum temperature  2020  34.2  35.8  36.0  33.1  32.4  32.6  31.4  33.6  33.2  33.0  32.7  32.8

Table 2. Minimum monthly temperature value (°C)

Parameter            Year  Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
Minimum temperature  2010  20.6  21.2  20.8  20.7  20.7  18.8  19.8  19.2  19.5  20.2  19.8  18.8
Minimum temperature  2011  19.4  21.0  19.5  19.7  19.2  18.2  19.0  18.7  18.4  18.6  19.1  19.0
Minimum temperature  2012  19.4  21.0  19.5  19.7  19.2  18.2  19.0  18.7  18.4  18.6  19.1  19.0
Minimum temperature  2013  19.9  20.0  18.5  20.2  18.1  18.5  17.3  18.3  19.1  18.8  18.3  18.5
Minimum temperature  2014  18.6  19.6  17.7  18.1  19.6  17.0  18.7  17.7  17.8  18.0  18.8  18.8
Minimum temperature  2015  18.0  20.6  20.4  20.1  20.2  19.7  19.2  19.4  19.6  19.4  20.3  19.0
Minimum temperature  2016  20.4  21.4  21.8  20.0  20.2  19.8  19.2  19.7  20.0  19.3  19.7  –
Minimum temperature  2017  20.4  21.4  21.8  20.0  20.2  19.8  19.2  19.7  20.0  19.3  19.7  –
Minimum temperature  2018  18.6  19.6  19.0  18.8  18.4  18.8  18.6  18.8  18.8  19.6  19.7  18.6
Minimum temperature  2019  19.0  21.6  20.6  19.8  19.8  18.2  18.6  19.0  19.0  17.9  19.8  19.2
Minimum temperature  2020  18.4  21.4  19.2  20.2  20.2  19.0  19.3  19.4  18.8  20.1  20.4  19.3
Environmental Information System Using Embedded Systems 841

From the maximum and minimum temperature data of the last 11 years, a monthly average was computed to characterize the temperature behavior of the area over that interval, which ranges between 17 °C and 38.2 °C; this information serves as a starting point for establishing agricultural crop conditions (Fig. 1).
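This averaging step can be sketched in C. This is a minimal illustration that uses only three sample January values from Tables 1 and 2, not the full series; the 17 to 30 °C crop range check follows [4]:

```c
#include <assert.h>

/* Monthly mean from the yearly maximum/minimum series, taken as the
   midpoint of the daily extremes. Sample values below are only a few
   entries from Tables 1 and 2, not the complete data set. */
static double monthly_mean(const double tmax[], const double tmin[], int years) {
    double sum = 0.0;
    for (int i = 0; i < years; i++)
        sum += (tmax[i] + tmin[i]) / 2.0;  /* midpoint of the extremes */
    return sum / years;
}

/* Passion fruit thrives between 17 and 30 degrees Celsius [4]. */
static int in_crop_range(double t) { return t >= 17.0 && t <= 30.0; }
```

With the three January maxima and minima of 2010 to 2012, the mean midpoint is 27.5 °C, inside the crop range.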

Fig. 1. Average temperature of the last 11 years per month in the city of Villavicencio, (own
elaboration)

The analyzed data allow comparing the environmental conditions against those required by the passion fruit crop. Since the best cultivation conditions occur at temperatures of 17 to 30 °C, the environmental conditions of the department of Meta are consistent with the requirements for passion fruit cultivation [4].

Table 3. Maximum value of monthly relative humidity

Parameter Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Maximum relative humidity 2010 88 98 99 99 99 98 100 99 98 98 97 100
Maximum relative humidity 2011 99 97 96 97 96 97 97 100 98 97 97 95
Maximum relative humidity 2012 99 97 96 97 96 97 97 100 98 97 97 95
Maximum relative humidity 2013 88 93 97 96 99 99 98 99 97 98 97 97
Maximum relative humidity 2014 98 97 96 97 97 97 97 99 96 100 99 97
Maximum relative humidity 2015 94 93 96 97 96 96 96 96 97 95 95 96
Maximum relative humidity 2016 89 95 97 97 98 97 97 97 96 97 96 –
Maximum relative humidity 2017 89 95 97 97 98 97 97 97 96 97 96 –
Maximum relative humidity 2018 97 91 95 99 99 99 98 98 98 98 99 –
Maximum relative humidity 2019 97 97 100 98 98 97 99 98 97 97 98 97
Maximum relative humidity 2020 95 92 96 97 98 97 97 97 97 97 98 97

Table 4. Minimum value of monthly relative humidity

Parameter Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Minimum relative humidity 2010 24 21 32 51 46 50 49 41 41 46 48 44
Minimum relative humidity 2011 41 35 42 43 45 47 43 39 43 43 43 34
Minimum relative humidity 2012 41 35 42 43 45 47 43 39 43 43 43 34
Minimum relative humidity 2013 33 28 32 41 48 48 48 47 43 41 44 40
Minimum relative humidity 2014 36 34 34 40 34 40 43 35 40 41 43 39
Minimum relative humidity 2015 32 34 34 38 38 38 40 40 32 37 41 34
Minimum relative humidity 2016 34 29 32 41 44 44 42 43 38 39 39 –
Minimum relative humidity 2017 34 29 32 41 44 44 42 43 38 39 39 –
Minimum relative humidity 2018 34 32 33 42 44 43 44 38 40 38 39 –
Minimum relative humidity 2019 36 32 36 42 41 36 37 33 37 40 45 39
Minimum relative humidity 2020 37 29 29 37 34 36 38 42 36 38 47 36

From the maximum and minimum relative humidity data of the last 11 years, a monthly average was computed to characterize the relative humidity behavior of the area over that interval, which ranges between 21% and 100%; this information serves as a starting point for establishing agricultural crop conditions (Fig. 2).

Fig. 2. Average relative humidity of the last 11 years per month in the city of Villavicencio, (own
elaboration)

4 Agricultural Circuit Design


The goal was to provide a low-cost system with an electronic circuit for the monitoring and control of crops. Figure 3 presents the components that interact in the system; the main sensing elements are the DHT22 (temperature and humidity) and the MQ-135 (air quality), which enable this process.

Fig. 3. Circuit monitoring and control systems

The system is designed to monitor a crop based on the following environmental variables: temperature, humidity, and air quality. As shown in Fig. 4, the protoboard illustrates the connection between the sensors and the Arduino Mega board, allowing the system to generate the different environmental readings.

Fig. 4. Monitoring and control system



5 System Architecture
The system is based on a group of devices connected through an Arduino Mega board (microcontroller) and an Ethernet Mega Shield for communication between the different hardware components. A high-level system architecture, composed of hardware and software, is presented. The proposed system was conceived using hardware capable of supporting different communication protocols and flexible enough to accommodate adaptable software (Fig. 5).

Fig. 5. System architecture, (own elaboration)

5.1 Hardware

As can be seen in Figs. 3 and 4, the hardware adopted for the implementation of the system consists of: an Arduino MEGA 2560 board; a laptop with processing power and memory, which supports compatible programming environments, input-output ports, and standard peripherals with a gateway; an Arduino Ethernet Shield for communication between the devices (sensors), which collect information on ambient temperature, relative humidity, and air quality; and the Arduino Mega board itself, which controls the sending of the data to the server, where it is stored in the database.

5.2 Software

There are several important pieces of software for the proposed architecture, divided
into two sections mentioned below:

Web Platform
This component visualizes the information obtained from the sensors for monitoring the cultivation process. The platform stores the sensor information in a MySQL database and was designed with HTML, PHP, and JavaScript for agricultural control and monitoring (Fig. 6).

Fig. 6. Web design, (own elaboration)

C Language – Arduino Sketches


This programming language allows sending the sensor information to the database through HTTP requests (the GET and POST methods) built from HTML-style tags; it also handles the communication and configuration of the network, using the IP connection parameters (gateway, subnet mask), as well as the generation of the code that controls the cultivation system (Fig. 7).

Fig. 7. Arduino sketch programming



Figure 7 shows the programming code where the sensors are parameterized and the connection to the database is configured to store the information. The code is designed so that the system responds to the temperature, humidity, and air quality conditions, allowing control of the agricultural cultivation process (Fig. 8).
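The sketch reports its readings to the server as an HTTP GET query string. A minimal C illustration of how such a request line could be assembled follows; the endpoint `add_data.php` and the parameter names `temp`, `hum`, and `air` are hypothetical, not taken from the paper:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the HTTP GET request line the sketch would send to the web server.
   The endpoint and parameter names are assumptions for illustration. */
static void build_get_request(char *buf, size_t len,
                              double temp_c, double hum_pct, int air_ppm) {
    snprintf(buf, len,
             "GET /add_data.php?temp=%.1f&hum=%.1f&air=%d HTTP/1.1",
             temp_c, hum_pct, air_ppm);
}
```

On the Arduino side, the same string would be written to an Ethernet client connection; on the server side, a PHP script reads the query parameters and inserts the row into the MySQL database.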

Fig. 8. Database, (own elaboration)

As illustrated in Figs. 3 and 4, the hardware adapted to the agricultural farming system supports a web application that monitors the environmental conditions (temperature, humidity, and air quality) of the agricultural system. It is made up of electronic devices, sensors, the Arduino MEGA 2560 board, and a laptop with processing capacity and memory, which also allows the use of programmable software and of peripheral ports for the traffic of the data reported by the sensors to the database.
Figure 9 shows the web page where the temperature data obtained from the DHT22 sensor is presented and plotted.

Fig. 9. Temperature data website

Fig. 10. Relative humidity data website

Figure 10 shows the web page where the relative humidity data obtained from the sensor is plotted versus the number of samples taken.
Figure 11 shows the web page where the air data obtained from the MQ-135 sensor is presented and plotted.
As evidenced in Figs. 9, 10 and 11, the Y axis represents the temperature in degrees Celsius, the relative humidity in percent, and the air quality in PPM, respectively; the X axis of each graph corresponds to the number of sensor data storage intervals.
The system allows real-time visualization of the meteorological data (temperature, humidity, and air) from a network of sensors with Ethernet technology, facilitating the communication and transmission of information. In this way, the measured environmental conditions are compared with the range required by the passion fruit crop, allowing the monitoring and control of the crop during its agricultural process.

Fig. 11. Air data website

6 Conclusion

The web development can be adjusted to the system conditions and to a mobile application for practicality in the field. Likewise, the system is scalable over time, since it can accommodate additional sensors or electronic devices for its continuous improvement.
The system could be self-sufficient if the devices were powered by renewable energy. In this case, and due to its location, photovoltaic energy is recommended as an alternative to grid electricity. In this way, the system can adapt its power consumption to new hardware conditions, thus being more economical.

References
1. PNUD: Objetivos de desarrollo sostenible (2019). https://www.undp.org/content/undp/es/home/sustainable-development-goals.html
2. Mejía, M.A.: Seguridad alimentaria en Colombia después de la apertura económica (2016)
3. FAO: Sistemas alimentarios (2020). Villavicencio, “Presentación”. https://www.fao.org/food-systems/es/
4. CDT CEPASS: El maracuyá en Colombia. Corporación Centro de Desarrollo Tecnológico de las Pasifloras de Colombia (2015)
An Approach of Node Model TCnNet: Trellis
Coded Nanonetworks on Graphene Composite
Substrate

Diogo F. Lima Filho1(B) and José R. Amazonas2


1 Paulista University - UNIP, São Paulo, Brazil
[email protected]
2 São Paulo University - USP, São Paulo, Brazil

Abstract. The feasibility of obtaining an integrated model of a nanonetwork node on a Graphene Composite Substrate (GCS), exploring its mechanical, electrical, and self-sustainability characteristics, motivates this work. We propose an integrated node model that applies the TCNet concepts to nanodevice networks, where the nodes are cooperatively interconnected with a low-complexity Mealy Machine (MM) topology, integrating in the same electronic system the modules necessary for independent operation in wireless sensor networks (WSNs): rectennas (RF-to-DC power converters), code generators based on a Finite State Machine (FSM) with a trellis decoder, and an on-chip transmitter/receiver, with autonomy in terms of energy sources provided by the energy harvesting technique. One of the most critical and ubiquitous problems for nodes in a network is battery life. The battery supply for the thousands of wireless sensors used in IoT networks, and the logistics of replacement and disposal with their consequences for the environment, are the main motivations of this research project for the use of energy harvesting. In addition, graphene consists of a layer of carbon atoms with the configuration of a honeycomb crystal lattice, which has attracted the attention of the scientific community due to its unique electrical characteristics.

Keywords: Wireless sensor networks · Finite state machine · Nanonetwork · Energy harvesting · Graphene · Composite substrate · Rectennas

1 Introduction
Considering that Wireless Sensor Networks (WSNs) are an important infrastructure for
the Internet of Things (IoT) and the interest in using sensor networks in the same universe
as IP networks, this work innovates by solving the limited hardware resources of the
network nodes using the new concept of “Trellis Coded Network”- (TCNet) introduced
in previous works: (i) “A new algorithm and routing protocol based on convolutional
codes using TCNet: Trellis Coded Network” [1], where the network nodes are associated
to the states of low complexity Finite State Machine (FSM) and the routing discovery
corresponds to the best path in a trellis, based on trellis theory; (ii) “Robustness situations in cases of node failure and packet collision enabled by TCNet: Trellis Coded Network - a new algorithm and routing protocol” [2], which shows the robustness of the TCNet algorithm in making decisions in cases of node failures and packet collisions, taking advantage of the regeneration capacity of the trellis. This proposal innovates by making decisions on the node itself, without the need for signaling messages such as “Route Request”, “Route Reply”, or the “Request to Send (RTS)” and “Clear to Send (CTS)” used to solve the hidden node problem, which is known to degrade the throughput of ad hoc networks due to collisions, and the exposed node problem, which results in poor performance by wasting transmission opportunities.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 850–859, 2023.
https://doi.org/10.1007/978-3-031-18461-1_56
An extension of this proposal is to apply the same concepts of TCNet to networks of
nanodevices where the nodes are cooperatively interconnected with a low-complexity
Mealy Machine (MM) topology composed of XOR gates and shift registers. This new
configuration can be called TCnNet: Trellis Coded nanonetwork, a “firmware protocol”,
where the nanonetwork node can integrate on the same substrate: rectifiers, Finite State
Machine and on-chip Transmission/Reception.
This approach considers the use of a Graphene Composite Substrate (GCS) for the integrated electronic circuits, enabling TCnNet: Trellis Coded nanonetwork to integrate the electrical and mechanical characteristics of sensor nodes required by ad hoc network scenarios with the following characteristics: limited energy sources, changing topologies, poor link quality, and bandwidth limitation.
In addition, graphene combined with semiconductor oxides is advantageous in passive and active electronic circuit applications, with an excellent cost-benefit ratio [3, 4].

1.1 Related Work

The challenge of integrating electronic systems into the most diverse IoT objects present in wireless sensor networks (WSNs), and of extending their applications to nanonetworks, requires meeting the following characteristics: autonomy in terms of energy sources, mechanical flexibility, miniaturization, and optical transparency, in addition to being ecological.
The use of a Graphene Composite Substrate (GCS) has attracted attention because it is a two-dimensional structure and responds very efficiently when used as a channel in the Field Effect Transistors (FET) employed in electronic sensor models [5]. Furthermore, a carbon-based system could benefit from the integration of antennas, rectifiers, sensors, and transmit/receive circuits on the same substrate, with the functions of sensor node and energy storage [6].
The research in [7] compares techniques to obtain maximum conductivity with graphene composite films, as shown in Fig. 1. The process in Fig. 1(a) deposits a binder-free graphene ink containing nanoflakes, dispersants, and solvent. After drying, the result is an excellent film with a porous 2D structure, as shown in Fig. 1(b). The next step of the process is to apply compression, obtaining a highly dense nanoflake laminate, Fig. 1(c).

Fig. 1. Schematic illustration of the formation of binder-free graphene laminate. No binder was used in the graphene ink due to the strong Van der Waals forces between graphene nanoflakes. The adhesion and conductivity of the graphene laminate were improved by rolling compression [7]

Other techniques for obtaining graphene composite films include growth by Chemical Vapor Deposition (CVD) [8], an alternative that produces monolayer graphene films efficiently at large scale, and even more empirical mechanical processes such as micromechanical cleavage [9].

1.2 Contributions and Proposal of the Paper


The objective of this research is to obtain integration models on the same Graphene Composite Substrate (GCS) with modules that allow the independent operation of the node in an ad hoc network. The configuration consists of rectennas (RF-to-DC power converters), code generators based on a Finite State Machine (FSM) with a trellis decoder, and an on-chip transmitter/receiver, making the RF energy harvesting environment self-sustaining to meet the limited resources of nanonetwork nodes, as shown in Fig. 2.

Fig. 2. Block diagram of the approach of WSN TCnNet node model on graphene composite
substrate – GCS

One of the most critical and ubiquitous problems for nodes in a network is battery life. Although batteries have followed the evolution of smart devices with improving efficiency, they fall short when supplying the thousands of wireless sensors used in IoT networks, due to the logistics of replacement and disposal and the consequences for the environment. In contrast to conventional copper surfaces, the conductivity of graphene yields better performance due to increased carrier concentration and minimized film resistance, optimizing the energy scavenged through the rectification of electromagnetic waves into DC power.
On the other hand, the increased demand for self-sustaining systems converges with one of the main objectives of this research project, which is the use of energy harvesting, that is, the collection of residual energy sources present in the environment. In addition to the well-known energy sources (solar, heat gradient, thermoelectric, electromagnetic, wind, and others), the radio frequency (RF) signals present in urban environments represent an interesting energy source with recycling potential, considering that they are already integrated into smart devices [10, 11].
Research on Energy Harvesting technologies is promising for the near future due
to the speed with which electronic devices with low energy consumption and use in
Wireless Sensor Networks (WSNs) are emerging.

2 TCnNet Node Model Implementation Scenario on GCS


2.1 Potential Sources of Energy - RF

An analysis of the electromagnetic spectrum shows the feasibility of applying the energy harvesting technique, considering the region of the spectrum where the best energy efficiency can be obtained for use by WSN nodes.
In this analysis, it is possible to identify the viable energy intensity distributed across the frequency bands, from very low frequency (VLF) to super high frequency (SHF), corresponding to frequencies from 10 kHz to 30 GHz [12].
Previous works show power efficiencies for energy harvesting corresponding to 7.0 µW at 900 MHz and 1.0 µW at 2.4 GHz with coverage areas of 40 m radius, and experiments in the 1.584 MHz band, corresponding to Amplitude Modulated (AM) broadcasting, with average currents of 8 µA [13].

Fig. 3. Signal patterns obtained in the research using spectrum analyzer [14]
Outdoor surveys in urban regions of cities like Tokyo, using a spectrum analyzer as receiver and a dipole antenna, obtained spectrum patterns as shown in Fig. 3, where signal levels of −15 dBm can be observed close to the 800 MHz band and levels close to 0 dBm in the 920 MHz band, corresponding to mobile telephony [14].
The application of energy harvesting in WSN scenarios is justified by the large number of low-cost sensors working collaboratively, collecting data and transmitting it to a sink node (base station) in IoT applications, which makes replacing their energy sources difficult. The combination of energy harvesting with the small charges required by the batteries guarantees the WSN's extended autonomy.
The scenario shown in Fig. 4 illustrates the power consumption of a WSN node, where current peaks occur during transmission and reception, and shows the battery recharge periods when the node is not being requested [15].

Fig. 4. Typical scenario of power consumption by the WSN node [15]

2.2 Code Generators – FSM and Trellis Decoders: TCNet Node Configuration

The TCNet node configuration is associated with the states of a Finite State Machine (FSM) acting as code generator, with trellis-based decoders implemented as firmware, contributing to building links in the network using the state transitions of convolutional codes, as proposed in [1] and [2].
The integration of the sensor node on the same Graphene Composite Substrate (GCS) allows the implementation of the FSM registers using Graphene Field Effect Transistors (GFET) in a low-complexity topology (XOR gates and shift registers) following the concepts of a Mealy Machine (MM), as shown in Fig. 5.
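A minimal sketch of such a Mealy machine in C follows. The generator polynomials G1 = 111b and G2 = 101b are an assumption for illustration; the paper does not fix the exact TCNet polynomials:

```c
#include <assert.h>

/* Rate-1/2 convolutional encoder as a Mealy machine: two shift registers
   plus XOR gates. Generator polynomials G1 = 111b and G2 = 101b are an
   assumption for illustration. */
typedef struct { unsigned d1, d2; } mm_state;  /* shift-register contents */

static void mm_step(mm_state *s, unsigned u, unsigned out[2]) {
    out[0] = u ^ s->d1 ^ s->d2;  /* c1 = u + d1 + d2 (mod 2), G1 = 111b */
    out[1] = u ^ s->d2;          /* c2 = u + d2 (mod 2),      G2 = 101b */
    s->d2 = s->d1;               /* shift the registers */
    s->d1 = u;
}
```

Starting from the all-zero state, the input sequence 1, 0, 1, 1 yields the output pairs (1,1), (1,0), (0,0), (0,1); in TCNet these output words label the trellis branches used for route discovery.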

Fig. 5. Example of: (a) The code generator, configured by an MM with the input sequence Kn
(t) generating an output sequence out n (t) = (c1, c2); (b) Trellis decoding for a 4-node network
implemented as firmware

2.3 Transmission and Reception Using Plasmonic Waves

The wireless communication of the sensor node integrated on the same Graphene Composite Substrate (GCS) is possible with the use of plasmonic antennas, or graphennas, through electromagnetic waves in the THz range (0.1 to 10 THz), below optical communications, allowing surface radiation in the transmission range of graphene-composite antennas and thus reducing the size of the radiating structures. The low complexity of the TCNet concepts justifies the application in nanodevice networks such as TCnNet using the Wireless Network-on-Chip (WNoC) paradigm [16].
The propagation and detection of plasmonic waves was proposed in 2000 by the group of Professor Harry Atwater [17]; this approach consists of the coupling between the electromagnetic (EM) field and the free charges in the metal, which propagate at the metal-dielectric interface as waveguides, called Surface Plasmon Polaritons (SPP).
Figure 6 describes SPP waves using classical electromagnetism, where the relationship between the charge distribution on the metal surface and the electric field attenuates exponentially away from the interface. The oscillation of the wave coupled to the electromagnetic field propagates as a packet in the x-direction.

Fig. 6. Visualization of the SPP wave, resulting from the coupling between the EM field and the free charges on the metal surface [17]. In TM polarization, the EM field components are Hy, Ex and Ez

The conductivity of graphene has been considered both at DC and at frequencies up to the terahertz band (0.1–10 THz). Experiments show the RF results obtained from the graphene laminate processes illustrated in Fig. 1, taking advantage of the laminate's flexibility when printed on paper or plastic, which is very important for flexible electronics such as wearables and RFID dipole antenna applications [18]. As displayed in Fig. 7(a), the gain obtained reaches peaks of −1 dBi between 930 MHz and 990 MHz, and Figs. 7(b)-(c) show a typical dipole radiation pattern, demonstrating that a printed graphene laminate dipole antenna can radiate effectively.

Fig. 7. (a) Gain of graphene laminate dipole antenna. Measured gain radiation patterns: (b)
Elevation plane and (c) Azimuth plane

Establishing links between nanonetwork nodes in the THz band takes advantage of the huge bandwidth, allowing high-speed transmission rates with very low energy consumption using low-complexity hardware. Figure 8(a) shows a simple conceptual implementation of a Wireless Network-on-Chip (WNoC). On the other hand, the complexity of channels in the THz band must be considered, limited to distances of a few meters due to attenuation and noise. Figure 8(b), from [19], illustrates the behavior of a section (L × W) of graphene subjected to THz radiation, and Fig. 8(c) shows the resonance frequency of a graphene nano-antenna with dimensions L = 5 µm and W = 10 µm.
The results obtained in a previous work [20] show the energy consumed in a WSN, considering transmission (tx), reception (rx), processing (proc), and guard band (bg) situations, using the IEEE 802.11b standard:

• Transmission Power (Ptx): 2 mW;
• Reception Power (Prx): 1 mW;
• Processing Power (Pproc): 1 mW;
• Guard Band (bg): not considered

The contribution of the node's energy consumption to the total network consumption is given by Eq. 1:
ΣE(n) = Etx + Erx + Eproc + Ebg (1)
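Taking each term of Eq. 1 as power × active time with the IEEE 802.11b figures above, the node energy can be sketched as follows; the time intervals are hypothetical:

```c
#include <assert.h>

/* Node energy per Eq. 1: E(n) = Etx + Erx + Eproc (+ Ebg, not considered),
   with each term taken as power x active time. Powers follow the paper:
   Ptx = 2 mW, Prx = 1 mW, Pproc = 1 mW. Time intervals are hypothetical. */
static double node_energy_joule(double t_tx_s, double t_rx_s, double t_proc_s) {
    const double P_TX = 2e-3, P_RX = 1e-3, P_PROC = 1e-3;  /* watts */
    return P_TX * t_tx_s + P_RX * t_rx_s + P_PROC * t_proc_s;
}
```

For example, 0.1 s in each state gives 4 × 10^−4 J, the same order of magnitude as the values in Table 1.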
Tests were done with an eight-node network using the OMNeT++ simulation environment, based on C++ [21], where the sink node sends a query with CBR traffic to verify the reachability of the nodes. Figure 9 shows the node model used, configured by an MM with k/n = 1/2, the respective trellis decoder, and the energy consumed by the nodes in the network, obtained with OMNeT++ and shown in Table 1.

Fig. 9. (a) MM with k/n = ½, resulting output words (n1, n2); (b) Trellis diagram for an 8-node
network corresponding to the MM and (c) Energy distributed among TCNet Nodes

Table 1 shows the energy consumption ΣE(n) for a network with 8 nodes, computed by Eq. 1 in the OMNeT++ simulation environment [21].

Table 1. Individual contribution of energy consumed by network nodes in joule (J) [1] and [2]

ΣE(n)  Energy (Joule)
ΣE(0)  4 × 10^−4 J
ΣE(4)  5 × 10^−4 J
ΣE(2)  6 × 10^−4 J
ΣE(5)  7 × 10^−4 J
ΣE(6)  8 × 10^−4 J
ΣE(7)  9 × 10^−4 J
ΣE(3)  10 × 10^−4 J
ΣE(1)  11 × 10^−4 J

3 Conclusion
This research studies the feasibility of obtaining an integrated model of a nanonetwork node on a Graphene Composite Substrate (GCS), exploring the mechanical, electrical, and self-sustainability characteristics necessary for Internet of Things (IoT) infrastructures. The techniques presented correspond to the state of the art of research that can be integrated into the nodes of a nanonetwork, taking advantage of the efficiency of graphene, a layer of carbon atoms with a honeycomb crystal lattice configuration that has attracted the attention of the scientific community due to its unique electrical characteristics. Graphene can contribute all the characteristics necessary for nanodevices: low energy consumption, scalability, and broadband communication in the network, in addition to innovative mechanical aspects such as flexibility, reduced thickness, and optical transparency.

References
1. Lima Filho, D.F., Amazonas, J.R.: Robustness situations in cases of node failure and packet
collision enabled by TCNet: Trellis Coded Network - a new algorithm and routing protocol.
In: Pathan, A.S., Fadlullah, Z., Guerroumi, M. (eds.) SGIoT 2018. LNICST, vol. 256, pp. 100–
110. Springer, Cham (2019). https://doi.org/10.4108/eai.7-8-2017.152992
2. Lima, D.F., Amazonas, J.R.: Robustness situations in cases of node failure and packet collision
enabled by TCNet: Trellis Coded Network – a new algorithm and routing protocol. In: The
2nd EAI International Conference on Smart Grid Assisted Internet of Things, Niagara Falls,
Canada, 11 July 2018. http://sgiot.org/2018
3. Neves, A.I.S., et al.: Transparent conductive graphene textile fibers. Sci. Rep. 5, 9866-1–
9866-7 (2015)
4. Kumar, S., Kaushik, S., Pratap, R., Raghavan, S.: Graphene on paper: a simple, low-cost
chemical sensing platform. ACS Appl. Mater. Interfaces 7(4), 2189–2194 (2015)
5. Novoselov, K.S., et al.: Electric field effect in atomically thin carbon films. Science 306(5696),
666–669 (2004)
6. Zhu, J., Yang, D., Yin, Z., Yan, Q., Zhang, H.: Graphene and graphene based materials for
energy storage applications. Small 10(17), 3480–3498 (2014)

7. Huang, X., et al.: Binder-free highly conductive graphene laminate for low cost printed
radiofrequency applications. Appl. Phys. Lett. 106(20), 203105-1–203105-4 (2015)
8. Mattevi, C., et al.: A review of chemical vapour deposition of graphene on copper. J. Mater.
Chem. 21, 3324–3334 (2011)
9. Torres, L., Armas, L., Seabra, A.: Optimization of micromechanical cleavage technique of
natural graphite by chemical treatment, January 2014. https://doi.org/10.4236/graphene.2014.31001. http://www.scirp.org/journal/graphene
10. Kim, S., Vyas, R., Niotaki, K., Collado, A., Georgiadis, A., Tentzeris, M.M.: Ambient RF energy-harvesting technologies for self-sustainable standalone wireless sensor platforms. Proc. IEEE 102(11) (2014)
11. Paradiso, A.J., Starner, T.: Energy scavenging for mobile and wireless electronics. IEEE
Pervasive Comput. 4(1), 18–27 (2005)
12. Mantiply, E.D., Pohl, K.R., Poppell, S.W., Murphy, J.A.: Summary of measured radio fre-
quency electric and magnetic fields (10 kHz to 30 GHz) in the general and work environment.
Bioelectromagnetics 18(8), 563–577 (1997)
13. Le, T.T.: Efficient power conversion interface circuits for energy harvesting applications.
Doctor of philosophy thesis, Oregon State University, USA (2008)
14. Tentzeris, M.M., Kawahara, Y.: Novel energy harvesting technologies for ICT applications.
In: IEEE International Symposium on Applications and the Internet, pp. 373–376 (2008)
15. Vullers, R.J.M., et al.: Micropower energy harvesting (2009)
16. Abadal, S., Alarcón, E., Lemme, M.C., Nemirovsky, M., Cabellos-Aparicio, A.: Graphene-
enabled wireless communication for massive multicore architectures. IEEE Commun. Mag.
51(11), 137–143 (2013)
17. Atwater, H.A.: The promise of plasmonics. Sci. Am. 296, 38–45 (2007)
18. Huang, X., et al.: Binder-free highly conductive graphene laminate for low cost printed radio
frequency applications. Appl. Phys. Lett. 105, 203105 (2015). https://doi.org/10.1063/1.4919935
19. Llatser, I., Kremers, C., Cabellos-Aparicio, A., Jornet, J.M., Alarcón, E., Chigrin, D.N.:
Graphene-based nano-patch antenna for terahertz radiation. Photonics Nanostruct. Fundam.
Appl. 10, 353–358 (2012)
20. Lima, D.F., Amazonas, J.R.: Novel IoT applications enabled by TCNet: Trellis Coded
Network. In: Proceedings of ICEIS 2018, 20th International Conference on Enterprise
Information Systems (2018). http://www.iceis.org
21. Varga, A.: OMNeT++ Discrete Event Simulation System (2011). http://www.omnetpp.org/doc/manual/usman.html
CAD Modeling and Simulation of a Large
Quadcopter with a Flexible Frame

Ajmal Roshan(B) and Rached Dhaouadi

American University of Sharjah, Sharjah, UAE


{aroshan,rdhaouadi}@aus.edu

Abstract. In this paper, we present the design issues, dynamic modeling, and feedback control problems for quadcopters having a large frame with mechanical flexibility. This situation occurs, for example, in solar quadcopters that require a large structure to carry onboard PV panels for power supply and battery recharging. Flexibility may manifest itself as mechanical oscillations and static deflections, greatly complicating the motion control of a quadcopter platform. If the time to settle the oscillations is significant relative to the cycle time of the overall task, flexibility will be a major consideration in the flight control design, and a degradation of the overall expected system performance typically occurs. For quadcopters with flexible frames, it is difficult to use the Euler-Lagrange or Newtonian approach to derive the dynamic model as is done with small and rigid quadrotors. Many relevant factors that lead to the consideration of the distributed flexibility should be analyzed. To deal with this challenge, an optimal approach for sizing the frame and platform structure is followed. This is achieved by using a modern approach to computer modeling of quadcopters through the integration of the SolidWorks CAD modeling and MATLAB/Simulink environments, followed by identifying the resonant frequencies of the model using ANSYS Workbench. Initially, a SolidWorks model is created and then imported into MATLAB Simscape Multibody. Then, the transfer functions of the quadrotor model are derived using system identification. Subsequently, the equivalent transfer functions from the Simulink model are obtained and used for the PID controller design. The validity of the quadcopter model represents an essential step to simulate the model before flying the quadcopter in the real world. It brings confidence that the commands run on the actual quadrotor will produce the same results. The design procedure for the dynamic model of a large quadcopter and the tuning of its PID-based control strategies has been successfully implemented for a solar-quadcopter prototype, creating a reliable and effective automatic navigation and control system.

Keywords: Quadcopter · SolidWorks · MATLAB · Simscape Multibody · Simulink · ANSYS Workbench · Transfer function

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 860–879, 2023.
https://doi.org/10.1007/978-3-031-18461-1_57
CAD Modeling and Simulation of a Large Quadcopter with a Flexible Frame 861

1 Introduction
There has been a growing interest in the field of robotics. In fact, several industries require robots to replace the human presence in dangerous and onerous situations. Among these, a wide area of research is dedicated to aerial robots. Vertical Take-Off and Landing (VTOL) systems represent a valuable class of flying robots. Quadrotors have been successfully used for monitoring and inspection of rural and remote areas as well as many other industrial applications.
A quadrotor, with four propellers around a main body, presents the advan-
tage of having quite simple dynamic features. Quadrotors have gained a large
amount of interest due to their high manoeuvrability and multi-purpose usage.
They are a suitable alternative for monitoring, ground mapping, agriculture, and environmental preservation [1–3]. Self-sustainable quadrotors require auto-charging to allow a long operating time and a wider coverage area. However, the
difficulties of power consumption and mission planning lead to the challenge of
optimal sizing of the power supply such as the case of solar powered quadrotors
[1]. In [1], the main objective was to allow a large enough structure and mini-
mize the total weight while maintaining the system rigidity. Results show that
the optimal design of the quadrotor platform system is dependent on PV panel
size, and total weight, which affect the output power of the PV system, as well
as the power consumption profile.
In most of the literature, the discussion on quadcopters is usually focused on rigid structural frames. In such scenarios, Newton-Euler or Lagrangian equations can be used to derive the dynamic equations. However, in the case of a large quadcopter, a different approach needs to be followed. The frame's large size leads to a flexible structure with possible bending modes, which will
be challenging for the control system design. To deal with these challenges this
paper presents a CAD based modeling approach using SolidWorks and MAT-
LAB/Simulink to generate a realistic mathematical model based on the actual
parameters and material properties of the quadcopter. To deal with the challenge
of the platform flexibility, this paper presents an optimal approach for sizing the
frame and platform structure for solar quadrotors to allow a systematic model-
ing and design of quadrotors with large size and flexible frames. This is achieved
by using a modern approach to computer modeling of quadcopters through the
integration process of SolidWorks CAD modeling and MATLAB/Simulink envi-
ronments. This is followed by identifying the resonant frequencies of the model
using ANSYS Workbench.
There are multiple parameters, including the principal axes of inertia, the moments of inertia, and the location of the centre of mass, that need to be obtained for evaluating the kinematics and dynamics of the quadcopter. This is not an easy task, and usually experiments are employed to identify them. In this work, since the material properties are incorporated while working in SolidWorks, these parameters are computed directly within the software. A procedure to export the
quadcopter along with the properties from SolidWorks to MATLAB/Simulink
is discussed. Next, the techniques for creating an improved Simulink layout are
also specified. The academic contribution of this paper is the detailed procedure
862 A. Roshan and R. Dhaouadi

and adjustments that need to be followed while working with these software tools for a large quadcopter.
The paper is organized as follows: Sect. 1 gives an introduction to quadrotors and their immense potential. Section 2 describes modeling the quadrotor in SolidWorks as well as in MATLAB. Section 3 identifies the resonant frequencies of the model. Section 4 derives the transfer functions from mathematical modeling techniques as well as from MATLAB. Section 5 presents the discussion. Section 6 summarises the work performed.

2 Using SolidWorks and MATLAB Simscape Multibody

In the case of quadcopters, studies usually focus on a first-principles approach, where the body's equations of motion are defined to determine the forces and moments that are applied to a dynamic model. With advanced modeling techniques, it is now possible to define the dynamics of the system and apply propeller forces to understand how the quadcopter will behave, and then develop the control strategies. This approach simplifies the design process, as one need not derive the equations to analyze the behaviour of the body. In addition, it is possible to import a CAD model into the simulation software, which further facilitates the overall work [2,4–6]. Before testing in the real world, simulation software helps us understand the dynamics of the system.
In this section, a quadcopter was modeled using SolidWorks software. It was
then imported to MATLAB Simscape Multibody. Certain techniques were also
implemented to create an improved block diagram layout in Simscape Multi-
body, which is also described. Finally, a lift force and a torque were given to
the propellers and the simulation was analyzed using Mechanics Explorer in
MATLAB.

2.1 Drawing the Large Quadcopter Prototype in SolidWorks and Importing into MATLAB Simscape Multibody
The components of the quadcopter were first created in SolidWorks part draw-
ings. The material properties were assigned from the SolidWorks library, except
for the Carbon Fibre. The material properties found in [7] for Standard Carbon
Fibre were assigned as input to create a new material in SolidWorks library and
assigned to the carbon fibre components in the model.
Once all the parts were created, SolidWorks Assembly was used to create the
final model. A cubical block was placed at the centre that accounts for the mass
of sensors and wirings. Special considerations were taken while assembling the components in SolidWorks Assembly, especially in mating the motors, propellers, and the body frame. The mating is carried out in a manner that reflects the fact that the motor used in the model is an outrunner motor, whose outer casing rotates together with the propeller.
Figure 1 shows the final assembly model from SolidWorks. In the SolidWorks
model, the Y-axis is pointed upwards. This will be changed to Z-axis pointed
upwards; X-axis and Y-axis along the quadcopter arms, using Simulink axis

Fig. 1. Solar quadcopter prototype in SolidWorks

transformation discussed in subsequent section. The property of the Model from


SolidWorks is depicted in Table 1.

Table 1. Properties of the quadcopter

Mass: 2215.57 g

Inertia (g·mm²):

$$\begin{bmatrix} I_{xx} & I_{xy} & I_{xz} \\ I_{yx} & I_{yy} & I_{yz} \\ I_{zx} & I_{zy} & I_{zz} \end{bmatrix} = \begin{bmatrix} 355529576.51 & 0.55 & 13071517.37 \\ 0.55 & 709414966.05 & -0.56 \\ 13071517.37 & -0.56 & 356039743.77 \end{bmatrix}$$

As can be seen from Table 1, the moments of inertia about the x and the z axes (i.e., $I_{xx}$ and $I_{zz}$) have nearly the same value, which means the quadcopter has a high degree of symmetry with respect to these axes. On the other hand, the moment of inertia about the y axis ($I_{yy}$) is almost twice that of the other two. This implies that it is easier to change the angular speed about the x or the z axis than about the y axis.
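This symmetry observation can be checked directly from the Table 1 values; the short Python sketch below simply re-reads those numbers (the 1% threshold is our own choice, not from the paper):

```python
# Inertia values copied from Table 1 (g*mm^2), as reported by SolidWorks.
Ixx = 355529576.51
Iyy = 709414966.05
Izz = 356039743.77

# Near-symmetry of the frame: Ixx and Izz differ by well under 1%.
assert abs(Ixx - Izz) / Izz < 0.01

# Iyy is almost exactly twice Ixx, matching the "almost twice" remark.
ratio = Iyy / Ixx
print(f"Iyy/Ixx = {ratio:.3f}")  # Iyy/Ixx = 1.995
```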
A few components (namely the motor mount) were assembled beforehand, and the assembled components were then imported as sub-assemblies into the final work. This was done to create a better block diagram layout in Simscape Multibody after importing. MATLAB Simscape Multibody creates a block diagram for each component in the SolidWorks model, and grouping the components as a sub-assembly helps to create improved block diagrams from the XML document while working in MATLAB Simscape Multibody.

In order to import the SolidWorks model into MATLAB Simscape Multibody, the Simscape Multibody plugin needs to be installed.
The steps involved in importing a SolidWorks model into Simscape Multibody [2,5,8] are as follows:
– Install and register the Simscape Multibody plugin.
– Export the CAD model to an XML file.
– Import the XML file into Simscape Multibody.
Using the command smimport('filename') in the MATLAB command window, the XML file generates the block diagrams in Simscape Multibody.
The algorithm for generation of an exported model is shown in Fig. 2.

Fig. 2. Algorithm for generation of an exported SolidWorks model

2.2 Working with Simscape Multibody


One of the main advantages of using MATLAB Simscape Multibody for the analysis of dynamic systems is the ability to analyze complex systems that may be difficult to treat analytically. The software solves simulation tasks by numerically approximating the solutions of mathematical models that are complicated or difficult to describe. Simscape Multibody automatically formulates the equations for a physical system in order to develop a mathematical formulation that effectively represents the system [11,12].
The imported model of the quadcopter, shown in Fig. 3, includes a block diagram representation of the nonlinear quadcopter dynamics.
The Simscape Multibody model is composed of block libraries, plus sensor and actuator blocks that connect every other element in the model. Simscape Multibody automatically sets up an absolute inertial reference frame and coordinate system called World.
Table 2 gives a brief description of the blocks in MATLAB Simulink.
The material properties of the components were defined in SolidWorks soft-
ware. The imported model contains this information including the mass, geom-
etry, centre of mass, inertia tensor, etc. The imported model almost always
requires modifications like removing unnecessary constraints between the ele-
ments of the model or changing their types. Therefore, to improve the block dia-
gram, some blocks can be transformed into subsystems. The modified Simscape
Multibody block diagram after grouping is shown in Fig. 4. The Quadrotor Plant

Fig. 3. Imported model in MATLAB Simulink

consists of the airframe elements of the quadrotor. The internal structure is


depicted in Fig. 5.
The motion of the quadcopter is measured from the 6-DoF joint that is available in the MATLAB Simulink library. The measured quantities are the translations along the X, Y, and Z axes as well as the roll, pitch, and yaw angles. In order to measure the roll, pitch, and yaw angles, a Transform Sensor from the MATLAB Simulink library is used; it outputs the orientation as a quaternion.
As can be seen from Fig. 6, the 6-DoF joint provides an output of the translations along the X, Y, and Z axes. The Transform Sensor block is used to obtain the quaternion, and the Quaternions to Rotation Angles block from the Aerospace subset of the library is used to compute the roll, pitch, and yaw angles.
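As a sketch of what this quaternion-to-Euler stage computes, here is a minimal Python conversion; the ZYX rotation order below is an assumption, since the actual order in the Simulink block depends on its settings:

```python
import math

def quat_to_euler_zyx(qw, qx, qy, qz):
    """Convert a unit quaternion to (roll, pitch, yaw) in radians.

    This mirrors the role of the 'Quaternions to Rotation Angles' block
    in the model; the ZYX convention here is an assumption."""
    roll = math.atan2(2 * (qw * qx + qy * qz), 1 - 2 * (qx * qx + qy * qy))
    pitch = math.asin(max(-1.0, min(1.0, 2 * (qw * qy - qz * qx))))
    yaw = math.atan2(2 * (qw * qz + qx * qy), 1 - 2 * (qy * qy + qz * qz))
    return roll, pitch, yaw

# Example: a 30-degree rotation about the x axis (pure roll).
half = math.radians(30) / 2
r, p, y = quat_to_euler_zyx(math.cos(half), math.sin(half), 0.0, 0.0)
print(round(math.degrees(r), 6))  # 30.0
```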

Table 2. Description of blocks used in SimMechanics

– World (Bodies): The World frame is the ground of all frame networks in a mechanical model.
– Mechanism Configuration (Bodies): Sets mechanical and simulation parameters that apply to an entire machine; defines the environment parameters.
– Solid (Bodies): Represents a solid whose geometry, material, and visual properties are read from a file.
– Spatial Contact Force (Bodies): Applies a contact force between two geometries and prevents penetration.
– 6-DOF (Joints): Represents a 6-DoF joint between two frames, with three translational and three rotational degrees of freedom: three prismatic primitives along a set of mutually orthogonal axes, plus a spherical primitive.
– Revolute (Joints): This joint has one rotational degree of freedom.
– Prismatic (Joints): This joint has one translational degree of freedom.

An axis transformation was also carried out. The axis transformation is done
in such a manner that the X-axis and the Y-axis are along the arms of the
quadcopter and the Z-axis points upwards following the right hand rule.
A platform was also incorporated such that the quadcopter rests on the
platform. The Spatial Contact Force block from the Simulink library is used for
this purpose connecting the World Frame and the quadcopter.
The XML import creates the geometry of the body. At the same time, some changes need to be made in the block diagram environment. In the imported file, 6-DoF joints were observed between the rigid transform blocks, connectors, and frames. These are unnecessary blocks that need to be eliminated; hence, all 6-DoF joints between rigid transform blocks, connectors, and frames were deleted. As we are interested in the motion of the
quadcopter with respect to the World frame, a 6-DoF joint was inserted after
the World frame. In MATLAB Simscape Multibody, revolute joints are used for
rotational motions and prismatic joints are used for translational motions. It was also observed that revolute joints and prismatic joints were positioned between multiple parts of the quadcopter where no such motions exist in the real world. Such joints were also eliminated.

Fig. 4. Modified SimMechanics model - I

Fig. 5. Modified SimMechanics model- II

The modeling of motors is shown in Fig. 7. A quadcopter has four motors. In the model, the front and back motors rotate in the clockwise (CW) direction, and the right and left motors rotate in the counter-clockwise (CCW) direction. In terms of their internal structure, they are all identical. To distinguish the direction of rotation of the motors, CW motor speeds are represented by a positive value and CCW motor speeds by a negative value. This sign convention is also followed when the values are input to the revolute joint.
The Torque and Lift Force by the propeller are proportional to the square
of the motor speed. The Lift Force and Torque were calculated by using the
following formulas:

Fig. 6. Internal structure of 6 DOF joint

Fig. 7. Motor block internal structure

$$\text{Lift Force} = C_t\,\omega^2, \qquad \text{Torque} = C_q\,\omega^2 \tag{1}$$

where $C_q = \dfrac{\rho K_q D_r^5}{(2\pi)^2}$ and $C_t = \dfrac{\rho K_t D_r^4}{(2\pi)^2}$, with

$\rho$ – air density (taken as 1.225 kg/m³)
$K_q$ – torque coefficient
$K_t$ – lift force coefficient
$D_r$ – propeller diameter
$\omega$ – motor speed (rad/s)
$C_t$ – lift force constant (taken as $44 \times 10^{-6}$ N·s²)
$C_q$ – torque constant (taken as $5.96 \times 10^{-6}$ N·m·s²)

The values for $\rho$, $C_t$, and $C_q$ were taken from [13] and [14].
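Equation (1) can be evaluated numerically with these constants; in the sketch below the motor speed of 400 rad/s is a hypothetical value, and the hover-speed estimate assumes the 2215.57 g mass from Table 1:

```python
# Per-motor lift force and torque from Eq. (1), using the constants
# quoted in the text; omega = 400 rad/s is a hypothetical motor speed.
Ct = 44e-6     # lift force constant, N*s^2
Cq = 5.96e-6   # torque constant, N*m*s^2
omega = 400.0  # rad/s

lift = Ct * omega**2     # N per motor
torque = Cq * omega**2   # N*m per motor

# With the 2215.57 g mass from Table 1, the four motors together must
# match the weight at hover, giving the required per-motor speed.
m, g = 2.21557, 9.81
omega_hover = (m * g / (4 * Ct)) ** 0.5
print(round(lift, 2), round(omega_hover, 1))  # 7.04 351.4
```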
Let $\omega_1$, $\omega_2$, $\omega_3$, and $\omega_4$ be the angular velocities of the front, right, rear, and left propellers respectively, and $U_1$, $U_2$, $U_3$, and $U_4$ be the thrust, rolling moment, pitching moment, and yawing moment respectively. $L$ is the distance between the centre of the quadcopter and the centre of the propeller. Then,

$$\begin{aligned} U_1 &= C_t(\omega_1^2 + \omega_2^2 + \omega_3^2 + \omega_4^2) \\ U_2 &= L C_t(-\omega_2^2 + \omega_4^2) \\ U_3 &= L C_t(-\omega_1^2 + \omega_3^2) \\ U_4 &= C_q(-\omega_1^2 + \omega_2^2 - \omega_3^2 + \omega_4^2) \end{aligned} \tag{2}$$

In the MATLAB model, the inputs are the thrust, rolling moment, pitching moment, and yawing moment. The angular velocities need to be calculated from them. For this, the above equations were inverted [15] as follows:

$$\begin{aligned} \omega_1^2 &= \frac{1}{4C_t}U_1 - \frac{1}{2LC_t}U_3 - \frac{1}{4C_q}U_4 \\ \omega_2^2 &= \frac{1}{4C_t}U_1 - \frac{1}{2LC_t}U_2 + \frac{1}{4C_q}U_4 \\ \omega_3^2 &= \frac{1}{4C_t}U_1 + \frac{1}{2LC_t}U_3 - \frac{1}{4C_q}U_4 \\ \omega_4^2 &= \frac{1}{4C_t}U_1 + \frac{1}{2LC_t}U_2 + \frac{1}{4C_q}U_4 \end{aligned} \tag{3}$$
These equations were used in the MATLAB Simulink model.
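Equations (2) and (3) can be sanity-checked with a round trip. The following Python sketch uses the paper's $C_t$ and $C_q$; the arm length L = 0.5 m and the input forces are hypothetical values chosen only for the test:

```python
def forces_to_speeds(U1, U2, U3, U4, Ct, Cq, L):
    """Eq. (3): thrust and moments -> squared motor speeds."""
    w1s = U1 / (4 * Ct) - U3 / (2 * L * Ct) - U4 / (4 * Cq)
    w2s = U1 / (4 * Ct) - U2 / (2 * L * Ct) + U4 / (4 * Cq)
    w3s = U1 / (4 * Ct) + U3 / (2 * L * Ct) - U4 / (4 * Cq)
    w4s = U1 / (4 * Ct) + U2 / (2 * L * Ct) + U4 / (4 * Cq)
    return w1s, w2s, w3s, w4s

def speeds_to_forces(w1s, w2s, w3s, w4s, Ct, Cq, L):
    """Eq. (2): squared motor speeds -> thrust and moments."""
    U1 = Ct * (w1s + w2s + w3s + w4s)
    U2 = L * Ct * (-w2s + w4s)
    U3 = L * Ct * (-w1s + w3s)
    U4 = Cq * (-w1s + w2s - w3s + w4s)
    return U1, U2, U3, U4

# Round trip with the paper's constants; L = 0.5 m is hypothetical.
Ct, Cq, L = 44e-6, 5.96e-6, 0.5
U = (21.7, 0.3, -0.2, 0.05)            # thrust (N) and three moments
W = forces_to_speeds(*U, Ct, Cq, L)    # squared speeds, all positive here
back = speeds_to_forces(*W, Ct, Cq, L)
assert all(abs(a - b) < 1e-9 for a, b in zip(U, back))
```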
As an outrunner motor was used in the model, a revolute joint was attached to each motor, causing the motor to spin. The propeller is mated in SolidWorks Assembly in such a way that when the motor spins, the propeller spins with it. A revolute joint is attached to all four motors. It should be noted that all revolute joints in Simscape Multibody cause the follower frame to rotate with respect to the base frame about the Y-axis only. Hence, if the desired axis of rotation of the component is not the Y-axis, a rigid transform block can be used to align the axis of rotation. In order to cause the rotation, an actuation value should be fed to the revolute joint [9,10]. The actuation torque setting in the revolute joint gives three options:

– None
– Provided by Input
– Automatically Computed

In our model, ‘Provided by Input’ option was selected, and the calculated
torque was fed to the revolute joint.
These changes were made to the imported XML file from the SolidWorks software so that it simulates real-world mechanics. In the next step, the run button in the Simulink window is selected to view the simulated model in the Mechanics Explorer in MATLAB. The Mechanics Explorer window is shown in Fig. 8.

Fig. 8. Mechanics Explorer Window in MATLAB

3 Identifying the Resonant Frequencies of the Large Flexible Quadcopter
Every system can be described in terms of a stiffness matrix that connects the displacements (the response of the system) and the forces (the input to the system). The frequencies at which such a system naturally tends to vibrate are known as its natural frequencies, also called resonant frequencies.
As the excitation frequency approaches a resonant frequency, the system responds with large displacements (high amplitude), which may result in the failure of the structure. Hence, it is of utmost importance to identify the natural frequencies of the quadcopter and to avoid operating at those frequencies.
In this section, the modal analysis of the quadcopter is carried out in ANSYS Workbench, and the natural frequencies are tabulated. The modal analysis determines the vibrational characteristics (natural frequencies and mode shapes) of the structure.
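The eigenproblem that modal analysis solves can be illustrated in miniature. For an undamped system $M\ddot x + Kx = 0$, the natural frequencies satisfy $\det(K - \omega^2 M) = 0$; ANSYS solves this generalized eigenproblem over the full quadcopter mesh. The two-mass spring chain below is a hypothetical stand-in, not part of the paper's model:

```python
import math

# Two equal masses m connected by springs k (wall-m-m chain):
#   K = [[2k, -k], [-k, k]],  M = m * I.
# det(K - w^2 M) = 0 gives the characteristic polynomial
#   lam^2 - 3*(k/m)*lam + (k/m)^2 = 0, with lam = w^2.
m, k = 1.0, 100.0          # hypothetical values: 1 kg, 100 N/m
a = k / m
lam1 = a * (3 - math.sqrt(5)) / 2   # lower mode, (rad/s)^2
lam2 = a * (3 + math.sqrt(5)) / 2   # higher mode

# Natural frequencies in Hz, as ANSYS would tabulate them.
f1 = math.sqrt(lam1) / (2 * math.pi)
f2 = math.sqrt(lam2) / (2 * math.pi)
print(round(f1, 3), round(f2, 3))  # 0.984 2.575
```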

3.1 Importing from SolidWorks to ANSYS

There are several works in the literature [16–19] where the 3D CAD modeling was completed in SolidWorks software, followed by simulations in ANSYS. The SolidWorks Connected Help webpage [20] describes a method to export a SolidWorks file into ANSYS software. A list of file types that are compatible with both SolidWorks and ANSYS [21] is given in Table 3:

Table 3. Formats that are compatible with both SolidWorks and ANSYS

Parasolid (*.x_t, *.xmt_txt)
IGES (*.igs, *.iges)
ACIS (*.sat)
Unigraphics (*.prt)

3.2 Using ANSYS - Modal Analysis for Identifying the Resonant Frequencies
In the ANSYS software, the 'Modal' module was added to the Project Schematic section. Under Engineering Data, all the materials used in the quadcopter were selected. A new material called Standard Carbon Fibre was created, and the properties documented in [7] were entered into the model.
In the Geometry section, the SolidWorks quadrotor model, saved in Parasolid format, was imported using the Imported Geometry tab.
Selecting the Model section opens the Mechanical window. Here, the parts of the quadcopter were selected, and a material was assigned to each part. A mesh was then created. Repeated trials were carried out, especially with regard to sizing, to obtain a finer and more uniform mesh before the simulation.
The first 18 modes were analyzed and the results were tabulated as shown
in Table 4.

Table 4. Resonant frequencies

Mode 1 2 3 4 5 6
Frequency (Hz) 0 0 2.4528e−3 2.4562 3.3645 3.5357
Mode 7 8 9 10 11 12
Frequency (Hz) 6.6066 9.6004 12.922 13.433 16.317 22.648
Mode 13 14 15 16 17 18
Frequency (Hz) 25.636 25.736 31.217 36.268 64.734 67.407

Figure 9 and Fig. 10 show some of the mode shapes.
An interesting argument was found in a previous work [22]. That work states that performing modal analysis without due regard to stress might be inadequate in some cases, and that stresses need to be included under certain conditions where the body is subjected to pre-stress, such as when analyzing the strings of a guitar (which are under high tension).
Hence, the quadcopter modal analysis was also carried out with the stresses that act on it while it is operating. A static structural simulation in ANSYS was first carried out; then its results were superimposed onto the modal analysis. It was found that these conditions yielded the same resonant frequencies as the no-pre-stress condition.

Fig. 9. Mode shape 12 at 22.648 Hz

Fig. 10. Mode shape 16 at 36.268 Hz

Another point to note is that, since a modal result is based on the model's properties and not on a particular load, one can interpret where the maximum and minimum deviations from the original shape will occur for a particular mode, but not their actual values [23].
Care must be taken that the quadcopter does not operate at these frequencies, so as to avoid resonance.

4 Obtaining the Transfer Functions of the Large Solar Quadcopter Prototype

A transfer function represents the relationship between the input and the output of a component or a system.
In this section, the transfer function of the quadcopter is determined by two methods:

i. The transfer function is derived from the mathematical model of the quadcopter.
ii. The transfer function is obtained using the System Identification toolbox while the quadcopter is in hovering mode.

4.1 Transfer Function from Mathematical Model of a Rigid Quadcopter
The modeling approach discussed in this section is for rigid quadcopters. The results obtained will be used for comparison with the transfer functions obtained for the large solar quadcopter.
Consider $T_i$ as the thrust from the $i$-th motor, $Q_i$ as the torque from the $i$-th motor, and $L$ as the distance between the centre of the quadcopter and the centre of the propeller. Then:

$$\begin{aligned} \text{Total Thrust } (U_1) &= T_1 + T_2 + T_3 + T_4 \\ \text{Rolling Moment } (U_2) &= L(T_3 - T_4) \\ \text{Pitching Moment } (U_3) &= L(T_1 - T_2) \\ \text{Yawing Moment } (U_4) &= Q_1 + Q_2 + Q_3 + Q_4 \end{aligned} \tag{4}$$

The rotation matrix for the body-to-Earth transformation is given by:

$$R = \begin{bmatrix} s\theta s\Phi s\psi + c\psi c\theta & s\Phi s\theta c\psi - c\theta s\psi & s\theta c\Phi \\ c\Phi s\psi & c\psi c\Phi & -s\Phi \\ c\theta s\psi s\Phi - s\theta c\psi & c\theta s\Phi c\psi + s\theta s\psi & c\theta c\Phi \end{bmatrix} \tag{5}$$

where $\Phi$, $\theta$, $\psi$ are the roll, pitch, and yaw angles respectively, and $s\Phi = \sin\Phi$, $c\Phi = \cos\Phi$, $t\Phi = \tan\Phi$, and so on.
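Equation (5) can be transcribed and checked numerically; any rotation matrix must satisfy $RR^{T} = I$, which the following sketch verifies for an arbitrary set of angles:

```python
import math

def rotation_matrix(phi, theta, psi):
    """Body-to-Earth rotation matrix of Eq. (5), transcribed to Python."""
    sF, cF = math.sin(phi), math.cos(phi)
    sT, cT = math.sin(theta), math.cos(theta)
    sP, cP = math.sin(psi), math.cos(psi)
    return [
        [sT * sF * sP + cP * cT, sF * sT * cP - cT * sP, sT * cF],
        [cF * sP,                cP * cF,                -sF],
        [cT * sP * sF - sT * cP, cT * sF * cP + sT * sP, cT * cF],
    ]

# Orthogonality check: R * R^T must equal the identity matrix.
R = rotation_matrix(0.3, -0.5, 1.2)   # arbitrary test angles (rad)
for i in range(3):
    for j in range(3):
        dot = sum(R[i][k] * R[j][k] for k in range(3))
        assert abs(dot - (1.0 if i == j else 0.0)) < 1e-12
print("R is orthogonal")
```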

Let:
$p$ – rate of change of roll angle in the body axis system
$q$ – rate of change of pitch angle in the body axis system
$r$ – rate of change of yaw angle in the body axis system

To relate the Euler angle rates to the body angular rates, we can use the rotation matrix from Eq. 5:

$$\begin{bmatrix} \dot\phi \\ \dot\theta \\ \dot\psi \end{bmatrix} = \begin{bmatrix} c\psi & -s\psi & 0 \\ \frac{s\psi}{c\phi} & \frac{c\psi}{c\phi} & 0 \\ s\psi\, t\phi & c\psi\, t\phi & 1 \end{bmatrix} \begin{bmatrix} p \\ q \\ r \end{bmatrix} \tag{6}$$
From [24–27], differentiating Eq. 6 and substituting the inertia matrix gives the following equations:

$$\ddot\Phi = \dot\psi\dot\theta\, c\Phi + \frac{c\psi\, U_2}{I_{xx}} - \frac{s\psi\, U_3}{I_{yy}} \tag{7}$$

$$\ddot\theta = \frac{\dot\psi\dot\Phi}{c\Phi} + \dot\psi\dot\theta \tan\Phi + \frac{s\psi\, U_2}{c\Phi\, I_{xx}} + \frac{c\psi\, U_3}{c\Phi\, I_{yy}} \tag{8}$$

$$\ddot\psi = \dot\Phi\dot\psi\, t\Phi + \frac{\dot\Phi\dot\theta}{c\theta} + \frac{s\psi\, t\Phi\, U_2}{I_{xx}} + \frac{c\psi\, t\Phi\, U_3}{I_{yy}} + \frac{U_4}{I_{zz}} \tag{9}$$
From Newton's second law:

$$\ddot x = \frac{U_1}{m}(s\Phi s\psi + c\Phi c\psi s\theta) \tag{10}$$

$$\ddot y = \frac{U_1}{m}(c\Phi s\psi s\theta - c\psi s\Phi) \tag{11}$$

$$\ddot z = \frac{U_1}{m}(c\Phi c\theta) - g \tag{12}$$

where $x$, $y$, and $z$ are the translational motions along the x, y, and z axes respectively.
The above equations lead to the transfer function as follows. Consider Eq. 12:

$$\ddot z = \frac{U_1}{m}(c\Phi c\theta) - g$$

Here $\ddot z$ is a function $f$ of $U_1$, $U_2$, $U_3$, $U_4$, $\phi$, $\theta$, $\psi$. The equilibrium position is defined as the hovering condition, i.e., $U_1 = mg$ and $\phi = \theta = \psi = U_2 = U_3 = U_4 = 0$. Linearizing about this point,

$$\Delta\ddot z = \frac{\partial f}{\partial U_1}\Delta U_1 + \frac{\partial f}{\partial U_2}\Delta U_2 + \frac{\partial f}{\partial U_3}\Delta U_3 + \frac{\partial f}{\partial U_4}\Delta U_4 + \frac{\partial f}{\partial \Phi}\Delta\Phi + \frac{\partial f}{\partial \theta}\Delta\theta + \frac{\partial f}{\partial \psi}\Delta\psi$$

At equilibrium, only the $U_1$ term survives:

$$\Delta\ddot z = \frac{\Delta U_1}{m}, \qquad s^2\,\Delta z(s) = \frac{\Delta U_1(s)}{m}$$

$$\frac{\Delta z(s)}{\Delta U_1(s)} = \frac{1}{s^2 m} \tag{13}$$

Here the altitude (the translational motion along the z axis) is the output and the thrust ($U_1$) is the input.
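The hover linearization behind Eq. (13) can be verified numerically with finite differences; the sketch below uses the 2215.57 g mass from Table 1:

```python
import math

# z'' = f(U1, phi, theta) = (U1/m)*cos(phi)*cos(theta) - g, Eq. (12).
m, g = 2.21557, 9.81   # mass from Table 1 (kg), gravity (m/s^2)

def z_ddot(U1, phi, theta):
    return (U1 / m) * math.cos(phi) * math.cos(theta) - g

eps = 1e-6
U1_0 = m * g  # hover thrust, the linearization point

# Central-difference partial derivative w.r.t. U1 should be 1/m ...
dU1 = (z_ddot(U1_0 + eps, 0, 0) - z_ddot(U1_0 - eps, 0, 0)) / (2 * eps)
# ... while the first-order sensitivity to a small tilt vanishes.
dphi = (z_ddot(U1_0, eps, 0) - z_ddot(U1_0, -eps, 0)) / (2 * eps)

assert abs(dU1 - 1 / m) < 1e-6
assert abs(dphi) < 1e-9
print(round(dU1, 4))  # 0.4514, i.e. 1/m
```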

4.2 Transfer Function Identification for the Flexible Quadcopter Prototype
It is important to have a good model of the quadcopter. However, some of the parameters are hard to measure. The System Identification toolbox in MATLAB is used to overcome this issue; it identifies the plant model using input and output data. The transfer functions were obtained directly from the System Identification toolbox. The quadcopter is made to hover at 1 m within T = 5 s with the help of a PID controller. The PID parameters were manually tuned. Then, at T = 5 s, a perturbation in the form of a Pseudo-Random Binary Signal (PRBS) is input to the system. The simulation is used for identifying

Fig. 11. Altitude-time plot for T<5 s

Fig. 12. PRBS signal



the transfer functions for the roll, pitch, yaw, and altitude. Figure 11 shows the altitude-time graph for the initial 5 s. The PRBS signal is shown in Fig. 12. The internal structure of the Simulink block for altitude control is shown in Fig. 13. Figure 14 shows the pitch curve with the PRBS input.
The transfer functions for roll, pitch, yaw, and altitude obtained by the System Identification toolbox are shown in Table 5.

Fig. 13. Internal structure of simulink block for altitude control

Fig. 14. Pitch-time plot with PRBS input



Table 5. Transfer Function from System Identification toolbox

For altitude: $G(s) = \dfrac{\Delta z(s)}{\Delta U_1(s)} = \dfrac{6.135(s^2 - 1.506s + 431.8)}{(s + 38.81)(s + 0.5191)(s^2 + 6.611s + 67.66)}$

For the roll: $G(s) = \dfrac{\Delta\phi(s)}{\Delta U_2(s)} = \dfrac{0.004297(s - 1.884)(s - 0.06632)}{(s^2 + 0.3575s + 0.1322)(s^2 + 1.817s + 1.597)}$

For the pitch: $G(s) = \dfrac{\Delta\theta(s)}{\Delta U_3(s)} = \dfrac{0.004982(s - 0.9959)(s - 0.1242)}{(s + 0.9572)(s + 0.2654)(s^2 + 0.4334s + 0.1165)}$

For the yaw: $G(s) = \dfrac{\Delta\psi(s)}{\Delta U_4(s)} = \dfrac{-0.358(s - 5.138)(s + 0.2242)}{(s + 1.096)(s + 0.1277)(s^2 + 7.8173s + 6.416)}$
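The identified altitude model can be checked by locating its poles; a small Python sketch using the factored denominator from Table 5:

```python
import cmath

# Poles of the identified altitude transfer function from Table 5:
# denominator (s + 38.81)(s + 0.5191)(s^2 + 6.611 s + 67.66).
b, c = 6.611, 67.66
disc = cmath.sqrt(b * b - 4 * c)        # negative discriminant: complex pair
poles = [complex(-38.81), complex(-0.5191),
         (-b + disc) / 2, (-b - disc) / 2]

# All four poles lie in the left half-plane, so the identified hover
# dynamics are stable -- in contrast with the marginally stable double
# integrator 1/(m s^2) of the rigid model in Eq. (13).
assert all(p.real < 0 for p in poles)
print("complex pole pair real part:", -b / 2)
```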

5 Discussions

The quadcopter was modeled in SolidWorks software. The properties, including the mass and the inertia matrix, were obtained from the SolidWorks model. It can be shown that the transfer functions of a quadcopter depend on the mass as well as on the inertia matrix. The mathematical modeling used for obtaining the transfer functions applies to rigid quadcopters, and it is used for comparison with the transfer functions obtained for the large quadcopter. For a large quadcopter with a flexible frame, oscillations and static deflections greatly complicate the motion control. The resonant frequencies of the model were identified with ANSYS Workbench software. MATLAB Simscape Multibody enables us to analyze complex systems. Changes were made to the body imported from SolidWorks to replicate real-world scenarios. The System Identification toolbox was used to identify the plant model from input and output data; in the next step, the transfer functions were obtained from it, with a perturbation in the form of a PRBS used as the input signal. The transfer functions obtained will be used for the purpose of control system design. Comparing the altitude transfer function obtained by the System Identification toolbox in Table 5 with that of Eq. 13 for the rigid quadrotor, it is clear that the large frame involves additional dynamics, which are mainly due to the larger size and the structural flexibility.
The results of this work will be used for the design of the quadcopter flight controller, specifically for controlling the roll, pitch, and yaw so that the quadcopter follows the desired trajectory.

6 Conclusion

A quadcopter that requires a large structure encounters issues in design, dynamic modelling, and feedback control. The Euler-Lagrange or Newtonian approach, which is commonly used with small and rigid quadcopters, presents difficulties for quadcopters with flexible frames. This paper describes the process of importing a SolidWorks file into MATLAB Simscape Multibody. SolidWorks software was used to draw the parts of a quadrotor, which were assembled using SolidWorks Assembly. Using the Simscape Multibody plug-in, the geometry was exported to an XML document to be read by MATLAB. Changes were made to the block diagrams that

were generated automatically, to simulate real-world scenarios. The modal analysis of the quadrotor was carried out using ANSYS Workbench software, and the resonant frequencies were analyzed. In the next step, transfer functions for roll, pitch, yaw, and altitude were obtained by using MATLAB's System Identification toolbox.

References
1. Dhaouadi, R., Takrouri, M., Shapsough, S., Al-Bashayreh, Q.: Modelling and design of a large quadcopter. In: Proceedings of the Future Technologies Conference (FTC), vol. 1, pp. 451–467 (2021)
2. Jatsun, S., Lushnikov, B., Leon, A.S.M.: Synthesis of SimMechanics model of quad-
copter using SolidWorks CAD translator function. In: Proceedings of 15th Interna-
tional Conference on Electromechanics and Robotics “Zavalishin’s Readings”, pp.
125–137 (2020)
3. Shreurs, R.J.A., Tao, H., Zhang, Q., Zhu, J., Xu, C.: Open loop system identifica-
tion for a quadrotor helicopter. In: 10th IEEE International Conference on Control
and Automation, Hangzhou, China, 12–14 June 2013 (2013)
4. Cekus, D., Posiadala, B., Warys, P.: Integration of Modeling in SolidWorks and
MATLAB/Simulink Environments. Archive of Mechanical Engineering, Vol. LXI
(2014)
5. Gordan, R., Kumar, P., Ruff, R.: Simulating Quadrotor Dynamics using Imported
CAD Data. Modeling and Simulating II: Aircraft, Mathworks (2013)
6. Shaqura, M., Shamma, J.S.: An automated quadcopter CAD based design and modeling platform using SolidWorks API and smart dynamic assembly. In: 14th International Conference on Informatics in Control, Automation and Robotics, vol. 2, pp. 122–131 (2017). https://doi.org/10.5220/0006438601220131
7. Performance Composites: Mechanical Properties of Carbon Fibre Composite Materials, Fibre/Epoxy Resin. http://www.performance-composites.com/carbonfibre/mechanicalproperties_2.asp
8. MathWorks R2021b: Install the Simscape Multibody Link Plugin. https://ww2.mathworks.cn/help/physmod/smlink/ug/installing-and-linking-simmechanics-link-software.html
9. MATLAB Simulink, SimMechanics User’s Guide, The MathWorks, United States
of America
10. MathWorks R2021b: Revolute Joint. https://ww2.mathworks.cn/help/physmod/sm/ref/revolutejoint.html
11. Tijonov, K.M., Tishkov, V.V.: SimMechanics Matlab as a dynamic modeling tool
complex aviation robotic systems. J. Trudy MAI 41, 1–19 (2010)
12. Blinov, O.V., Kuznecov, V.B.: The study of mechanical systems in the environment
of SimMechanics (MatLab) using the capabilities of three-dimensional modeling
programs. Ivanovo State Polytechnic University (2012)
13. International Civil Aviation Organisation (ICAO): ICAO Standard Atmosphere,
Doc 7488-CD (1993)
14. T-motor: 'Test Report - Load Testing Data', MN3508 KV380 Specifications.
https://store.tmotor.com/goods.php?id=354
15. Bresciani, T.: Modelling, Identification and Control of a Quadrotor Helicopter.
M.Sc. Lund University (2008)
CAD Modeling and Simulation of a Large Quadcopter with a Flexible Frame 879

16. Hassan, M.A., Phang, S.K.: Optimized autonomous UAV design for duration
enhancement. In: 13th International Engineering Research Conference, Malaysia,
27 November 2019, pp. 030004–1:030004–10. AIP Publishing (2020). https://doi.
org/10.1063/5.0001373
17. Pretorius, A., Boje, E.: Design and modelling of a quadrotor helicopter with vari-
able pitch rotors for aggressive manoeuvres. In: The 19th International Federation
of Automatic Control World Congress, South Africa, pp. 12208–12213 (2014)
18. Ibrahim, S., Alkali, B., Oyewole, A., Alhaji, S.B., Abdullahi, A.A., Aku, I.: Prelimi-
nary structural integrity investigation for quadcopter frame to be deployed for pest
control. Proceedings of Mechanical Engineering Research Day 2020, pp. 176–177
(2020). http://repository.futminna.edu.ng:8080/jspui/handle/123456789/9863
19. Ersoy, S., Erdem, M.: Determining unmanned aerial vehicle design parameters for
air pollution detection system. Online J. Sci. Technol. 10(1), 6–18 (2020)
20. SolidWorks Connected Help. Export Options - ANSYS, PATRAN, IDEAS, or
Exodus. http://help.solidworks.com/2021/English/SWConnected/cworks/IDH
HELP PREFERENCE EXPORT OFFSET.htm?id=f937f775202444789000e092
f83f3c2b
21. LIGO Laboratory, (2004). SW-ProE-Ansys Compatible file types.pdf. https://
labcit.ligo.caltech.edu/ctorrie/QUADETM/MPL/SW-ProE-Ansys Compatible
file types.pdf
22. Bedri, R., Al-Nais, M.O.: Prestressed modal analysis using finite element package
ANSYS. In: International Conference on Numerical Analysis and Its Applications,
pp. 171–178 (2004)
23. ANSYS: Lecture 8: Modal Analysis. Introduction to ANSYS Mechanical,
pp 10. https://www.clear.rice.edu/mech517/WB16/lectures trainee/Mechanical
Intro 16.0 L08 Modal Analysis.pdf
24. Fernando, E., De Silva, A., et al.: Modelling simulation and implementation of
a quadrotor UAV. In: 2013 IEEE 8th International Conference on Industrial and
Information Systems, pp. 207–212 (2013). https://doi.org/10.1109/ICIInfS.2013.
6731982
25. Balas, C.: Modelling and Linear Control of a Quadrotor. Master thesis. University
of Cranfield (2007)
26. Wang, P., Man, Z., Cao, Z., Zheng, J., Zhao, Y.: Dynamics modelling and linear
control of a quadcopter. In: Proceedings of the 2016 International Conference on
Advanced Mechatronic Systems, Melbourne, Australia (2016)
27. Dong, W., Gu, G.Y., Zhu, X., Ding, H.: Modelling and control of a quadrotor UAV
with aerodynamic concepts. Int. J. Aerospace Mech. Eng. 7(5), 901–906 (2013)
Cooperative Decision Making
for Selection of Application Strategies

Sylvia Encheva(B) , Erik Styhr Petersen, and Margareta Holtensdotter Lützhöft

Western Norway University of Applied Sciences, P.O. Box 7030, 5020 Bergen, Norway
{sbe,Erik.Styhr.Petersen,Margareta.Holtensdotter.Lutzhoft}@hvl.no

Abstract. The landscape for potential research funding is diverse, and


occasionally, multiple opportunities are available simultaneously. More-
over, considering that the preparation of a well-founded research appli-
cation is demanding in terms of resource expenditure, the ‘best’ opportu-
nities are continuously evaluated and selected in the hope of maximizing
the beneficial outcome of efforts invested in applications - but the defini-
tion of ‘best’, and thus the selection of opportunities has a tendency of
being subjective and individual, researcher by researcher. In this work,
we are suggesting that a research group or department will benefit from
an explicit discussion and agreement about selection criteria - KPIs - and
from applying such more objective and well understood criteria in the
selection process. To reach this goal, we are also aiming at providing a
structured approach to selection of sources for external research funding.

Keywords: Cooperative decision making · External funding · Hurwicz


criterion

1 Introduction
Our university department, like so many other organizations, can be charac-
terized by having an area of excellence, a mission, a strategy, a set number of
assignments and activities, and resources, human, financial and temporal, which
are both limited and fixed in size, at least short-term. Besides teaching, research
- and the associated publication of results - is a Key Performance Indicator
(KPI) which is constantly in focus, aimed at providing the students, as well
as any other interested stakeholder, with knowledge that is on the forefront of
the subject field, and the resources available for research are often tied to suc-
cessful research grants. Still comparable to most other organizations, sustaining
or expanding research activities in a department as ours mean that externally
funded research opportunities are continuously being considered and selected, in
which cases research applications are being prepared and submitted as appro-
priate. Nothing is new about this, and neither are the potential challenges that
results from this process. High-level, often national, regional, or even global
agendas set the direction of research subjects being offered by funding orga-
nizations and thus essentially dictate the direction of departmental research.
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 880–887, 2023.
https://doi.org/10.1007/978-3-031-18461-1_58
Cooperative Decision Making 881

This may eventually leave strategic considerations to be abandoned and other


important subjects, also in need of further research, to be left aside, or to be
reinvented/rebranded to align with new political and/or external agendas. The
overall result is that the departmental knowledge base is at risk of becoming
fragmented, and that the research strategy becomes a fait-accompli rather than
a result of conscious deliberations. Only a well-managed process carefully select-
ing the most appropriate funding opportunities can reduce such an erosion of
the subject area excellence, and being aware of the risk, we are asking ourselves
whether we have such a process sufficiently in place? Cost/benefit is, we believe,
a subject that should also be considered in this context. It is expensive to pre-
pare a good research proposal, especially for the organization that is in the lead
of the effort, a cost that is mostly in terms of time spent, time which could have
been used on something else - like writing publications, updating and improv-
ing courses or spending more time on supervising students, where the effect is
immediately visible and where the time spent provides a clear benefit to the
department. In spite of many diverse funding opportunities, writing research
applications, on the other hand, is often unsuccessful due to competition and
the reduced hit rates that follow - we know from the past that in some cases
hit-rates are less than 10%. Under these circumstances, we are arriving at a point
where we increasingly ask ourselves ‘Is it worth it?’ - or, rather, ‘When is it
worth it?’. There is no easy answer to this, obviously, but this line of think-
ing has made us look into a more rational approach to decision-making in this
area, in the hope that we can give chance less room and can spend our available
resources more effectively. To realize this, we employ tools based on formal
concept analysis (FCA) [2], the Hurwicz method, [3], data analytics [10], and
ordered weighted averaging aggregation operators [7].

2 Background
2.1 Decision Making

“The AHP provides a comprehensive framework to cope with the intuitive, the
rational, and the irrational in us at the same time. It is a method we can use
to integrate our perceptions and purposes into an overall synthesis” [4]. Multi-
criteria decision-making methods addressing the measurement of the priorities
of conflicting tangible/intangible criteria are shown in [5]. For a free web based
AHP tool we refer to [14].
Tangible and intangible factors for supplier selection are discussed in [6]. University
R&D funding strategies are analyzed in [13]. Guidelines for competing for
research funding are provided in [1].
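To make the AHP machinery referenced above concrete, the sketch below approximates a priority vector as the principal eigenvector of a pairwise-comparison matrix via power iteration. The 3 x 3 matrix and the criteria it compares are hypothetical, chosen for illustration only and not taken from [4-6].

```python
import numpy as np

def ahp_priorities(pairwise, iters=100):
    """Approximate the AHP priority vector as the principal eigenvector
    of a pairwise-comparison matrix, via simple power iteration."""
    A = np.asarray(pairwise, dtype=float)
    w = np.ones(A.shape[0]) / A.shape[0]
    for _ in range(iters):
        w = A @ w
        w = w / w.sum()        # renormalize so priorities sum to 1
    return w

# Hypothetical comparison of three criteria, e.g. cost vs. hit-rate vs.
# strategic fit: entry (i, j) says how much criterion i outweighs j.
A = np.array([[1.0, 3.0, 0.5],
              [1/3, 1.0, 0.25],
              [2.0, 4.0, 1.0]])
w = ahp_priorities(A)
print(w)  # third criterion receives the largest priority
```

In a team setting, each member's comparison matrix can be aggregated (e.g. by element-wise geometric mean) before extracting a single shared priority vector.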

2.2 Formal Methods

Let P be a non-empty ordered set. If sup{x, y} and inf {x, y} exist for all x, y ∈
P , then P is called a lattice, [2].
882 S. Encheva et al.

“A context is a triple (G, M, I) where G and M are sets and I ⊂ G × M . The


elements of G and M are called objects and attributes respectively. A concept
of the context (G, M, I) is defined to be a pair (A, B) where A ⊆ G, B ⊆ M ,
A′ = B and B′ = A. The set of all concepts of the context (G, M, I) is denoted
by B(G, M, I), where (B(G, M, I); ≤) is a complete lattice and it is known as
the concept lattice of the context (G, M, I)”, [9].
For a good selection of formal concept analysis (FCA) tools see [12].
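The definitions above can be illustrated with a small sketch: the derivation operators map a set of objects to the attributes they all share and a set of attributes to the objects that have them all, and every closed pair (B′, B) is a formal concept. The toy context below is invented for illustration, not data from this study.

```python
from itertools import chain, combinations

G = ["C1", "C2", "C3"]                     # objects
M = ["A1", "A2", "A4"]                     # attributes
I = {("C1", "A1"), ("C1", "A2"),           # incidence relation I ⊆ G x M
     ("C2", "A1"),
     ("C3", "A1"), ("C3", "A4")}

def intent(A):
    """A' : attributes shared by every object in A."""
    return frozenset(m for m in M if all((g, m) in I for g in A))

def extent(B):
    """B' : objects possessing every attribute in B."""
    return frozenset(g for g in G if all((g, m) in I for m in B))

def concepts():
    """Brute-force all formal concepts by closing every object subset."""
    subsets = chain.from_iterable(combinations(G, r) for r in range(len(G) + 1))
    found = set()
    for A in subsets:
        B = intent(frozenset(A))
        found.add((extent(B), B))          # (B', B) is closed by construction
    return found

for A, B in sorted(concepts(), key=lambda c: len(c[0])):
    print(sorted(A), sorted(B))
```

Ordering concepts by extent inclusion yields exactly the concept lattice that tools such as Galicia draw.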
Pessimism and optimism in decision making under uncertainty can be bal-
anced by using the Hurwicz [3] method for compromise aggregation. The aggre-
gated value d of n attributes, a1 , a2 , . . . , an , is defined as a weighted average of
the max and min values of that tuple

ρ max_i a_i + (1 − ρ) min_i a_i = d    (1)

where parameter ρ represents the optimism of the decision maker, 0 ≤ ρ ≤ 1.


The values of ρ are usually interpreted as follows: above 0.5 - optimistic, neutral
if equal to 0.5, and pessimistic below 0.5. Other methods worth mentioning are
maximin, minimax, and minimax regret.
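A minimal sketch of the Hurwicz rule in Eq. (1), with hypothetical per-criterion grades: ρ = 1 and ρ = 0 recover the maximax and maximin rules mentioned above.

```python
def hurwicz(values, rho):
    """Hurwicz compromise value d = rho * max + (1 - rho) * min, 0 <= rho <= 1."""
    return rho * max(values) + (1 - rho) * min(values)

scores = [0.9, 0.4, 0.7]        # hypothetical grades for one call
print(hurwicz(scores, 1.0))     # pure optimism (maximax): 0.9
print(hurwicz(scores, 0.0))     # pure pessimism (maximin): 0.4
print(hurwicz(scores, 0.5))     # neutral compromise, approx. 0.65
```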
Ordered weighted averaging aggregation operators (OWA) provide a
parametrized class of mean type aggregation operators, [7]. Such operators are
often used to model linguistically expressed aggregation instructions, [8].
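A small illustrative sketch of an OWA operator (the values are hypothetical): the weight vector is applied to the input sorted in descending order, so particular weight choices recover the maximum, the minimum, and the arithmetic mean as special cases.

```python
def owa(values, weights):
    """Ordered weighted average: weights apply to values sorted descending."""
    assert abs(sum(weights) - 1.0) < 1e-9, "OWA weights must sum to 1"
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))

vals = [0.8, 0.5, 0.9, 0.6]
print(owa(vals, [1, 0, 0, 0]))          # max: 0.9
print(owa(vals, [0, 0, 0, 1]))          # min: 0.5
print(owa(vals, [0.25] * 4))            # arithmetic mean, approx. 0.7
print(owa(vals, [0.4, 0.3, 0.2, 0.1]))  # emphasis on the higher scores
```

Note that the Hurwicz rule itself is the OWA special case with weights (ρ, 0, ..., 0, 1 − ρ).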
Orange [10] is a free data mining tool that can be used for building a word
cloud and a bag of words among other things. This software suite does not
require coding and can be downloaded locally.

3 Evaluation of Research Funding Opportunities


During brain-storming sessions, we have examined parameters which we find
weighs heavily in our decision-making. Some of them have already been men-
tioned in the foregoing, especially the cost associated to the preparation of a
research application and the hit-rate expected, but on the benefit side also the
strategic fit is of clear importance. For us, however, this is only the tip of the
iceberg, and having started the conversation we have come to realize that there
is a long, and even growing, list of significant parameters that ideally should be
considered when a more rational decision is to be reached, both tangible and
intangible. It is also very important to take into account balancing the strategy
of the group/researcher, the skills and previous expertise including publications
to show it, the current 'in' topics (e.g. greening), the time to write and to either
find or establish a consortium, the 'size' of the call (i.e. its complexity), and the
strategy of the home university.
As mentioned, choices are needed to select opportunities when they are
offered, and in some cases, this involves having to evaluate, compare and choose
from a number of simultaneous opportunities, which is illustrated in the fol-
lowing scenario, where there are several calls for proposals an application could
address. One way to consider calls' relevance for submitting a proposal is simply
to read their contents, and draw a conclusion based purely on the impression

generated. Another slightly more analytical possibility is to create a word cloud


and a bag of words for each call to create a shortlist, and thus decrease the
number of calls for further, more careful examination.
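The bag-of-words shortlisting idea can be sketched with plain word counts. The call snippets, stop-word list, and interest terms below are invented for illustration; they are not the anonymized calls of the study, which used the Orange tool [10] for this step.

```python
import re
from collections import Counter

STOP = {"the", "of", "and", "for", "to", "in", "a", "is", "on", "with"}

def bag_of_words(text):
    """Lower-case word counts with a minimal stop-word filter."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP)

def overlap(call_text, interest_terms):
    """How often a team's interest terms occur in the call text."""
    bag = bag_of_words(call_text)
    return sum(bag[t] for t in interest_terms)

# Hypothetical call snippets and a team interest profile.
calls = {
    "C1": "Maritime autonomy and green shipping: human factors in remote operation.",
    "C2": "Quantum materials for energy storage.",
    "C3": "Digitalization of maritime logistics and autonomous navigation.",
}
interests = ["maritime", "autonomy", "autonomous", "human"]
ranking = sorted(calls, key=lambda c: overlap(calls[c], interests), reverse=True)
print(ranking)  # ['C1', 'C3', 'C2']
```

Calls with zero or very low overlap can be dropped before the more careful criterion-based examination.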
However, the systematic approach we are suggesting in this paper, and which
we will be trialing internally going forward, builds on the criteria we have
presently selected for considering which calls to focus on; these criteria, along with
nominal grading and tipping points, are summarized in Table 1. Our approach is illus-
trated with an example where 7 calls (C1, C2, ..., C7), some of which are stipu-
lated, are evaluated with respect to 15 criteria (A1, A2, ..., A15); see Table 1. To
preserve anonymity, we omit details about each call. The first thirteen criteria
are classified as tangible while the last two as intangible.
A cross in a cell in Table 2 indicates that a corresponding call satisfies a
respective criterion (for tipping points see Table 1), while an empty cell indicates
the opposite.
In a case of multi-person decision-making, a team can apply AHP to facilitate
the process of reaching a consensus on entrances in Table 1, Table 2, and Table 3.
Based on data in Table 2 a concept lattice is derived in Fig. 1 using Gali-
cia [11], an open source FCA based software tool. Notations in Fig. 1 fol-
low the ones in Table 1. Calls are placed in sets 'E' in curly brackets, e.g.
E = {C1, C2, C3} while criteria that these three calls satisfy can be found in a
set I = {A1, A2, A4, A11, A15}, in the corresponding concept in Fig. 1. In FCA
‘E’ and ‘I’ are referred to as ‘extent’ and ‘intent’ of a concept. Concepts in Fig. 1
can be used to find out which sets of calls satisfy most of the listed criteria.
Focusing on the above-mentioned concepts will decrease the number of calls to
be considered and noticeably shorten the time to select those that fit best to a
team’s interests and abilities.
Suppose a team is interested in identifying three proposals. In this case, the
members should focus on calls C1, C2, and C3 since they satisfy the max number
of criteria among any three calls in this scenario, i.e. A1, A2, A4, A11, and A15.
If, however, the team is interested in identifying two instead of three calls, their
choice should involve calls C1 and C3 since they satisfy the max number of
criteria among any two calls in this scenario, i.e. A1, A2, A3, A4, A6, A11, A13,
and A15.
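The selection rule described above - pick the k calls whose commonly-satisfied criteria set is largest - can be sketched directly. The sets for C1-C3 below are seeded with the intersections reported in the text; the remaining sets are illustrative, since Table 2's exact column assignment is hard to recover from the flattened layout.

```python
from itertools import combinations

# Criteria satisfied by each call (C1-C3 consistent with the reported
# intersections; C4 and C5 are hypothetical stand-ins).
calls = {
    "C1": {"A1", "A2", "A3", "A4", "A6", "A8", "A9", "A10", "A11", "A13", "A14", "A15"},
    "C2": {"A1", "A2", "A4", "A5", "A8", "A9", "A11", "A12", "A15"},
    "C3": {"A1", "A2", "A3", "A4", "A5", "A6", "A11", "A13", "A15"},
    "C4": {"A2", "A5", "A7", "A8", "A12"},
    "C5": {"A1", "A5", "A6", "A9", "A11", "A12", "A14"},
}

def best_group(calls, k):
    """Return the k calls whose set of commonly-satisfied criteria is largest."""
    def shared(group):
        return set.intersection(*(calls[c] for c in group))
    return max(combinations(calls, k), key=lambda g: len(shared(g)))

print(best_group(calls, 3))  # ('C1', 'C2', 'C3')
print(best_group(calls, 2))  # ('C1', 'C3')
```

For realistic numbers of calls the brute-force scan over all k-subsets is entirely adequate; the concept lattice offers the same answer with added structure.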
Aggregated values for thus selected calls, belonging to a particular concept,
can be calculated afterwards to rank a team’s choices. Table 2 is used to build
Table 3 where binary representations are substituted with numerical values. The
latter should be provided by a team planning to send a proposal. All aggregated
values in Table 3 are obtained applying the Hurwicz method.
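The aggregation can be checked against the rows of Table 3. The value of ρ used there is not stated; assuming ρ = 0.2 (an inference, not a figure from the text), the listed aggregated values for C1, C2, C5, C6 and C7 are reproduced, while C3 comes out at .64 against the listed .66, suggesting slightly different rounding or ρ for that row.

```python
def hurwicz(row, rho=0.2):
    """Hurwicz aggregation of one Table 3 row; rho = 0.2 is an assumption."""
    return round(rho * max(row) + (1 - rho) * min(row), 2)

rows = {  # per-criterion grades transcribed from Table 3
    "C1": [.8, .8, .5, .5, .9, .4, .7, .5, .7, .8, .7, .9],  # listed: .5
    "C2": [.8, .8, .6, .8, .8, .5, .8, .6, .8],              # listed: .56
    "C5": [.9, .9, .5, .5, .7, .7, .8],                      # listed: .58
    "C6": [.5, .8, .4, .7, .5, .8, .6, .8, .8],              # listed: .48
    "C7": [.9, .8, .4, .5, .7, .9, .8],                      # listed: .50
}
for call, row in rows.items():
    print(call, hurwicz(row))
```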
Aggregated values in Table 3 viewed together with information derived from
concepts in Fig. 1 indicate that:

– Call C3 should be considered ahead of calls C2 and C1.


– Call C4 has the highest aggregated value, and it should therefore also be
considered even though it satisfies a smaller number of criteria than the first
three calls.

Table 1. Tipping points for criteria

Criteria | Nominal grading | Tipping point
A1 Subject expertise | Low/High | .8
A2 Strategic fit (department, faculty, university/local/regional/global) | No/Yes | .8
A3 Political goals (exposure/marketing/academic recognition) | No/Yes | .5
A4 Research interests (personal and departmental) | Low/High | .5
A5 Expected hit-rate | Low/Decent | .8
A6 Quality of partnership (if relevant) | Poor/Good | .8
A7 Known barriers/complications (based on past experiences with a particular funding source) | Many/Few | .4
A8 Capacity to undertake proposal preparation - Personnel | No/Yes | .7
A9 Capacity to undertake proposal preparation - Time | Inadequate/Adequate | .5
A10 Capacity to undertake proposal preparation - Financial resources | Unavailable/Available | .5
A11 Capacity to undertake the research if the proposal is successful - Personnel (available expertise; hiring; opportunities for PhD candidates) | Non-satisfactory/Satisfactory | .7
A12 Capacity to undertake the research if the proposal is successful - Lead time/project duration/other commitments | Insufficient/Sufficient | .6
A13 Capacity to undertake the research if the proposal is successful - Funding size/level/financial resources (own contribution) | Inadequate/Adequate | .8
A14 Gut-feeling, intuition | Poor/Good | .7
A15 Drive/passion/staff personal motivation | Weak/Strong | .8

– Call C5 has a bit higher aggregated value than C1 and C3 but satisfies a
smaller number of criteria than the other two and should therefore be excluded
from the priority list.
– Call C6 has the lowest aggregated value and should therefore be excluded
from the priority list.
– Calls C7 and C1 have equal aggregated values but C7 satisfies a smaller
number of criteria than C1 and should therefore be excluded from the priority
list.
Fig. 1. A concept lattice based on data shown in Table 2

Table 2. Calls for proposals and their relevance

Calls A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15


C1 x x x x x x x x x x x x
C2 x x x x x x x x x
C3 x x x x x x x x x
C4 x x x x x x x x x x
C5 x x x x x x x
C6 x x x x x x x x x
C7 x x x x x x x

Table 3. This is Table 2 populated with numeric values and calculated aggregated
values

Calls A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 Aggregated values
C1 .8 .8 .5 .5 .9 .4 .7 .5 .7 .8 .7 .9 .5
C2 .8 .8 .6 .8 .8 .5 .8 .6 .8 .56
C3 .8 .8 .6 .6 .8 .6 .8 .8 .8 .66
C4 .5 .8 .8 .6 .7 .5 .6 .6 .8 .8 .8
C5 .9 .9 .5 .5 .7 .7 .8 .58
C6 .5 .8 .4 .7 .5 .8 .6 .8 .8 .48
C7 .9 .8 .4 .5 .7 .9 .8 .5

Note that calculations carried out this way imply equal importance of all
criteria. We suggest application of Yager’s weights in case a team would prefer
to emphasize the relevance of some of the listed criteria. If two or more calls
have the same aggregated value and are still of interest to the team, we suggest
additional discussions in order to select the most desirable option. The same
applies in cases where aggregated values are not significantly different.

4 Conclusion

This work presents a structured approach to selecting and ranking potential


sources for external funding according to a number of predefined criteria and
grading scale chosen by a team’s members. The approach could also facilitate
certain local selection processes in an organization, like alignment of individual
and departmental interests and priorities, and for finding out which proposal
should receive internal support in order to meet requirements from a granting
organization.
Additional research based on real data is needed in order to validate the effec-
tiveness of the suggested approach. Other multi-criteria methods can be applied,
along with a variety of outcome comparisons.

References
1. Blume-Kohout, M.E., Kumar, K.B., Sood, N.: University R&D funding strategies
in a changing federal funding environment. Sci. Public Policy 42(3), 355–368 (2015)
2. Davey, B.A., Priestley, H.A.: Introduction to lattices and order. Cambridge Uni-
versity Press, Cambridge (2005)
3. Gaspars-Wieloch, H.: Modifications of the Hurwicz’s decision rule. CEJOR 22,
779–794 (2014)
4. Saaty, T.L.: The analytic hierarchy process: decision making in complex environ-
ments. In: Avenhaus, R., Huber, R.K. (eds.) Quantitative Assessment in Arms
Control. Springer, Boston (1994). https://doi.org/10.1007/978-1-4613-2805-6 12
5. Saaty, T.L., Ergu, D.: When is a decision-making method trustworthy? Criteria
for evaluating multi-criteria decision-making methods. Int. J. Inf. Technol. Decis.
Mak. (IJITDM) 14(06), 1171–1187 (2015)
6. Tahriri, F., Osman, M.R., Ali, A., Yusuff, R., Esfandiary, A.: AHP approach for
supplier evaluation and selection in a steel manufacturing company. J. Ind. Eng.
Manag. 1, 54–76 (2008)
7. Yager, R.R., Kacprzyk, J.: The Ordered Weighted Averaging Operators: Theory
and Applications. Kluwer, Norwell, MA (1997)
8. Yager, R.R.: OWA aggregation over a continuous interval argument with applica-
tions to decision making. IEEE Trans. Syst. Man Cybern. Part B 34(5), 1952–1963
(2004)
9. Wille, R.: Concept lattices and conceptual knowledge systems. Comput. Math.
Appl. 23(6–9), 493–515 (1992)
10. https://orangedatamining.com/
11. http://www.iro.umontreal.ca/∼galicia/
12. https://upriss.github.io/fca/fca.html
13. https://intranet.bloomu.edu/documents/research/ebook-funding.pdf
14. https://bpmsg.com/ahp/?lang=en
Dual-Statistics Analysis with Motion
Augmentation for Activity Recognition
with COTS WiFi

Ouyang Zhang(B)

The Ohio State University, Columbus, OH 43210, USA


[email protected]

Abstract. In recent years, WiFi signal based activity recognition


attracts attention in the community. One traction is the ubiquity of
WiFi devices. The challenge is to achieve sufficient accuracy with min-
imal infrastructure cost without compromising user experience, e.g., no
device attachment on body. In this work, we propose a novel design
paradigm called WiSen, to enhance the performance of the status quo.
WiSen is able to fully utilize the channel information in received signals.
Behind the scenes, WiSen exploits the diversity across subcarriers in the
WiFi band while solving the challenge of dual-statistics analysis. With
extensive experiments in typical environments, the dual-statistics scheme
enhances the accuracy by 36% over the traditional approach. Integration
with motion augmentation further improves the overall accuracy
by 5.2%, achieving 98% overall accuracy.

Keywords: Wireless sensing · Human activity recognition · COTS


WiFi

1 Introduction
Human activity recognition serves as the crucial part of numerous human-
centered computing services, such as smart home, elderly care, assisted living,
etc. In the past decades, researchers have explored various techniques to achieve
human activity recognition, such as camera-based [2], radar-based [1] and elec-
tronic wearable devices [3,24]. Camera-based approaches are restricted to line-
of-sight (LoS) areas and require good lighting conditions. Also, the abundant image
information may potentially threaten users' privacy. The low-cost radar system also
suffers from high directionality and a limited coverage area (tens of centimeters).
By attaching devices to the user's body, researchers can infer the activity he/she
engages in by analyzing data from sensors like accelerometers or gyroscopes. How-
ever, attached sensors are neither desirable nor available in most applications.
In contrast, WiFi devices provide the opportunity to achieve a low-cost system
and get rid of the above limitations with less security concern.
Proposed Approach. In this work, we propose WiSen, a novel design paradigm
for device-free human activity recognition. WiSen is a passive detection system
c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 888–905, 2023.
https://doi.org/10.1007/978-3-031-18461-1_59
Dual-Statistics Analysis with Motion Augmentation 889

Fig. 1. Experiment scenario for preliminary study.

built on commodity WiFi devices. The basic idea behind WiSen is to make full
use of the channel information in the received signals. To achieve this, WiSen
conducts a dual-statistics analysis to exploit the diversity across multiple sub-
carriers across the band, where a new processing methodology is applied to deal
with the high-dimensional data. What’s more, WiSen augments the recognition
performance with motion analysis.
The principle of activity recognition using WiFi signal is that different human
activities would introduce different channel conditions of wireless propagation.
By processing the channel state information (CSI) obtained from the NIC card, the
system is able to track changes in surrounding environments and infer activities.
CSI information is spread over multiple subcarriers over the frequency band,
where each subcarrier represents a small spectrum slice. Existing approaches
[22] have analyzed the distribution of CSI coefficient on a single subcarrier,
which is stored as the profile to match the corresponding activity. However,
information on a single subcarrier lacks the frequency diversity across the
full CSI vector, which other works [21,26] have shown to be important to distinguish
different channel conditions. Figure 10 shows a simple but representative example
with two-subcarrier CSI. As we can see, without utilizing the diversity between
subcarriers, CSI 1 and CSI 2 are not distinguishable since they have the same
distance from the reference CSI ref, which is the sum of distances over all subcarriers.
Inspired by this, WiSen proposes to exploit statistics of the multiple-subcarrier
CSI vector to enhance the system performance.
Technical Challenges. The main challenge comes from the high dimension of
the CSI vector, which is well-known in the literature [5,6]. The essential difficulty
of statistical analysis on high-dimensional data comes from the unknown explicit
distribution function and insufficient data, which risks over-fitting the excess
parameters. In the 20 MHz Wi-Fi band, there are 56 subcarriers, so each CSI set
is a vector of 56 complex values with a resolution of 10 bits from NIC card.
890 O. Zhang

Fig. 2. Drink.

2 System Design

2.1 Dual-Statistics Analysis

CSI to Activity Profile. In this section, we will introduce how to build profile
of activities from CSI. In a wireless environment, the transmitted and received signals
are associated with a channel coefficient h_f = |h_f| e^{-j∠h_f} on frequency f, which
is a complex value. h_f reflects channel conditions in that a longer distance results
in more fading, which makes h_f a smaller value. Moreover, signals of different
frequencies have different characteristics in shattering, fading, and power decay
during propagation. Thus, the full CSI information across the frequency band of
M subcarriers would be an M × 1 vector H = [h_1, h_2, ..., h_M]^T.
With COTS WiFi, we should be aware that phase information ∠hf is unreli-
able due to the unsynchronized clock between the sender and the receiver [20]. To
construct the activity profile, we collect the amplitudes f_t = [|h_1|, |h_2|, ..., |h_56|]^T
of CSIs during the activity at timestamp t of 56 subcarriers. Each activity gener-
ates a series of CSIs during a period of time which we can sample at t1 , t2 , ..., tT .
The profile for activity is constructed as the following matrix:

F = [f_1, f_2, ..., f_t, ..., f_T]    (1)
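A minimal sketch of the profile construction in Eq. (1), using simulated complex CSI in place of real NIC measurements:

```python
import numpy as np

M, T = 56, 500                      # subcarriers, CSI samples in one segment
rng = np.random.default_rng(0)

# Simulated complex channel coefficients h_f (stand-in for NIC output).
csi = rng.normal(size=(T, M)) + 1j * rng.normal(size=(T, M))

# f_t = [|h_1|, ..., |h_56|]^T : per-subcarrier amplitudes at timestamp t
# (phase is discarded, as it is unreliable on COTS hardware).
amplitudes = np.abs(csi)            # shape (T, M)

# F = [f_1, f_2, ..., f_T] : the activity profile, one column per timestamp.
F = amplitudes.T                    # shape (M, T)
print(F.shape)
```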

Dual-Statistics Analysis. In activity recognition, each instance of the same


activity will generate a distinct profile fingerprint, because in reality a person cannot
physically duplicate the same body trajectory without any difference. Thus,
statistics analysis across the collected instances is conducted for recognition. The
question is how to analyze the time-series data. The traditional approach adopts
two-step processing. The first step is to estimate statistics of the time-series data.

Fig. 3. Spine stretch.

Fig. 4. Draw.

The next is to apply the earth mover's distance (EMD) algorithm to the estimates
across these instances. However, due to the challenge of high-dimensional data
in our problem, we cannot easily estimate statistics of time-series CSI vectors.
As far as we know, currently there is no good solution for this dual-statistics
analysis problem.
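For intuition, the traditional two-step pipeline on a single subcarrier can be sketched as follows. For 1-D histograms of equal total mass, EMD reduces to the L1 distance between the cumulative distributions; the amplitude samples below are invented for illustration.

```python
import numpy as np

def emd_1d(p, q):
    """Earth mover's distance between two 1-D histograms of equal total
    mass: the L1 distance between their cumulative distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

# Step 1: estimate statistics (histograms) of single-subcarrier amplitude
# for two recorded instances; step 2: compare instances with EMD.
inst_a = np.histogram([0.4, 0.5, 0.5, 0.6], bins=5, range=(0, 1))[0]
inst_b = np.histogram([0.7, 0.8, 0.8, 0.9], bins=5, range=(0, 1))[0]
print(emd_1d(inst_a, inst_a))  # identical profiles -> 0.0
print(emd_1d(inst_a, inst_b))  # shifted profile -> positive distance
```

This works per subcarrier; what it cannot do is capture the joint distribution across all 56 subcarriers, which is the gap the dual-statistics design targets.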

Fig. 5. Bend-over.

To solve this challenge, we borrow the idea from computer vision community.
Recently, the vision processing area has seen giant advances by adopting neural
network models on images1 . The underlying principle behind these models is
to replace human-craft features with automatically extracted features through
multi-layer neural network. The back-propagation training can push the model
to approximate any statistical function guided by the supervised data [13], thus
eliminating the tedious and error-prone manual heuristic statistical modeling.
Different from CSI-fingerprint based localization [21,26], our problem is based
on time-series data. Thus, the simple feedforward neural network (FNN) is not
feasible because each input data is sample at one timestamp. As such, FNN
training results in a single-statistics model, which does not satisfy the require-
ment. By contract, recurrent neural network (RNN) has shown its strength in
analyzing time-series data like speech recognition. In WiSen, we propose to uti-
lize sequence model - recurrent neural network (RNN) - to learn the statistics
from activity profiles.
Furthermore, the naive RNN model associates information only in the forward
time direction. However, the loosely-defined activity in our problem does
not impose a well-defined order of motions. Thus, WiSen adopts bidirectional
recurrent neural network (BiRNN [16]) to link information back and forth. In
Sect. 4.1, we validate the effectiveness of BiRNN and its superiority over other
models.

1
Well-known models include AlexNet [10], VGG16 [17], Inception [18] and ResNet
[7].

Fig. 6. Drink.

Fig. 7. Spine stretch.



Fig. 8. Draw.

Fig. 9. Bend-over.


Fig. 10. CSI amplitude vectors with two subcarriers.



[Figure: input layer x[0..t] feeding a forward layer of cells F (states h_0..h_t) and a backward layer of cells B (states h′_t..h′_0), both connected to the output layer y[0..t]]

Fig. 11. Bi-directional recurrent neural network architecture.

Figure 11 shows the BiRNN [16] model architecture with forward and back-
ward layers. Specifically, our model uses gated recurrent unit cell (GRU) [4],
which has a similar ability with LSTM cell [15] but fewer parameters. The input
dimension is equal to the length of CSI vectors, which is 56*4 with 2 × 2 MIMO.
The dimension of the internal state is set to 120. The batch size is set to 10. To
enhance generality, we use a dropout wrapper with dropout rate as 0.5. Adam
optimizer [9] is used to adaptively change the learning rate to precisely achieve
the minimum cost. We implement this model in TensorFlow framework, accept-
ing 56*4 dimensional CSI sequence as input and connecting the hidden states to
softmax output layer for activity classification.
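A hand-rolled numpy sketch of the forward pass may help fix ideas. The dimensions follow the paper (input length 56*4, state size 120), but the weights are random, the update convention h′ = (1 − z)·h + z·h̃ is one of several used in practice, and this is an illustration of the bidirectional GRU computation, not the authors' TensorFlow model.

```python
import numpy as np

rng = np.random.default_rng(1)
D_IN, D_H = 56 * 4, 120             # CSI vector length (2x2 MIMO), state size

def init(d_in, d_h, scale=0.1):
    """Random (W, U, b) triples for the update, reset, and candidate gates."""
    return {g: (scale * rng.normal(size=(d_h, d_in)),   # W: input weights
                scale * rng.normal(size=(d_h, d_h)),    # U: recurrent weights
                np.zeros(d_h))                          # b: bias
            for g in ("z", "r", "h")}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(p, x, h):
    Wz, Uz, bz = p["z"]; Wr, Ur, br = p["r"]; Wh, Uh, bh = p["h"]
    z = sigmoid(Wz @ x + Uz @ h + bz)            # update gate
    r = sigmoid(Wr @ x + Ur @ h + br)            # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h) + bh)
    return (1 - z) * h + z * h_tilde             # blend old state and candidate

def birnn(seq):
    """Run a forward and a backward GRU over the sequence; concat final states."""
    fwd, bwd = init(D_IN, D_H), init(D_IN, D_H)
    hf = hb = np.zeros(D_H)
    for t in range(len(seq)):
        hf = gru_step(fwd, seq[t], hf)
        hb = gru_step(bwd, seq[len(seq) - 1 - t], hb)
    return np.concatenate([hf, hb])

seq = rng.normal(size=(50, D_IN))                # 50 sampled CSI vectors
state = birnn(seq)
print(state.shape)                               # (240,)
```

The concatenated state would then feed a softmax layer over the activity classes; in practice the trained framework model replaces these random weights.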
Reduce Noise in CSI Values. To reduce the random noise in CSI measure-
ments due to chipset imperfection, we average over five consecutive CSIs. With
the packet rate of 1250 p/s, 5 CSIs span over 4 ms.
Inconsistency in Activity Durations. Undoubtedly, the CSI segments of
the activity would have various lengths. Without a fixed sequence length, the
original data is unsuitable for the BiRNN model. To solve this issue, we deploy
a fixed-length subset strategy, in the awareness that a sampled subset of the data
would also represent the statistics. We assume that each activity is longer than
2 s and thus the total number of CSIs at 1250 p/s is larger than 500. In WiSen,
we use 50 as the BiRNN sequence length, i.e., time resolution of 0.04 s.2
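The two preprocessing steps above (averaging five consecutive CSIs and even fixed-length sampling) might look as follows; simulated amplitudes stand in for real measurements.

```python
import numpy as np

def preprocess(csi_amp, seq_len=50, smooth=5):
    """Average every `smooth` consecutive CSIs to reduce chipset noise,
    then evenly sample `seq_len` of them so all segments share one length."""
    T = (len(csi_amp) // smooth) * smooth
    smoothed = csi_amp[:T].reshape(-1, smooth, csi_amp.shape[1]).mean(axis=1)
    idx = np.linspace(0, len(smoothed) - 1, seq_len).astype(int)
    return smoothed[idx]

segment = np.random.rand(2600, 56)   # just over 2 s of CSIs at 1250 p/s
fixed = preprocess(segment)
print(fixed.shape)                   # (50, 56)
```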
Augmenting Training Data. Generally, in machine learning a larger train-
ing dataset can enhance generality and increase accuracy. Here, we pro-
pose a method to augment the training data. The idea is to fully utilize the data
under the above sampling strategy. Specifically, since we evenly sample 50 CSIs
as one instance, multiple training instances can be obtained by shifting the start
of the sampling. A diagram is shown in Fig. 12 to demonstrate the idea. The
rationale underneath is to account for the drifting time of detecting the start of
the activity.
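A sketch of the shift-based augmentation: the even-sampling grid is reused with different start offsets, yielding several training instances per segment. The stride and shift count below are illustrative choices, not parameters from the paper.

```python
import numpy as np

def augment(smoothed, seq_len=50, n_shifts=4):
    """Create several training instances from one smoothed segment by
    shifting the start of the sampling grid (models start-detection drift)."""
    stride = len(smoothed) // seq_len
    out = []
    for s in range(min(n_shifts, stride)):
        idx = s + stride * np.arange(seq_len)    # shifted even-sampling grid
        out.append(smoothed[idx])
    return np.stack(out)

smoothed = np.random.rand(520, 56)   # amplitudes after 5-CSI averaging
batch = augment(smoothed)
print(batch.shape)                   # (4, 50, 56)
```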

2
This resolution is good enough for human activities, with speed less than 8 m/s.

Fig. 12. Diagram of training sample extraction.

2.2 Motion-Profile Analysis

Preliminary Study. In this section, we propose that the above activity profile can be augmented by a motion profile to further boost recognition performance. To demonstrate this, we conduct experiments showing two observations. First, activities with similar status (e.g., position and orientation) have similar CSI distributions but different motion (e.g., speed), so the motion profile helps to distinguish them. However, the motion profile alone is insufficient, because the second observation is that activities at different positions may have similar motion (e.g., speed).
In a typical office (Fig. 15), we deploy two laptops equipped with Atheros WiFi NICs. The user is guided to sit on a chair performing two activities, i.e., drinking water and spine-stretch. On another chair, the user performs two different activities, i.e., drawing and bend-over, as shown in Fig. 1. We measure and process the activity profile and motion profile from the CSI collection. Figures 6, 7, 8 and 9 show the histograms of CSI amplitude, where each dashed line represents one instance. For brevity, 'L1A1' (location 1, activity 1) represents drinking water, 'L1A2' represents spine-stretch, 'L2A1' represents drawing and 'L2A2' represents bend-over. As we can see, 'drink' and 'spine-stretch' have similar histograms. This is the first observation above, which also applies to 'draw' and 'bend-over'. On the other hand, Figs. 2, 3, 4 and 5 show the earth mover's distance (EMD) [11,14] among the motion profiles. It is difficult to distinguish 'L1A2' from 'L2A2' and 'L2A1' from 'L1A1', because spine-stretch and bend-over involve the upper body while drinking and drawing involve just a hand, which is our second observation.
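The EMD comparison used above can be reproduced with SciPy's 1-D Wasserstein distance (EMD and the Wasserstein/Mallows distance coincide for 1-D distributions [11]). The synthetic amplitude samples below are illustrative stand-ins, not the paper's measurements.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
# Hypothetical CSI amplitude samples for three activity instances.
amp_a = rng.normal(10.0, 1.0, 500)   # e.g. an 'L1A1' instance
amp_b = rng.normal(10.2, 1.0, 500)   # e.g. 'L1A2': similar status, similar histogram
amp_c = rng.normal(14.0, 2.0, 500)   # an activity at a different location

print(wasserstein_distance(amp_a, amp_b))  # small: hard to distinguish
print(wasserstein_distance(amp_a, amp_c))  # large: easy to distinguish
```

A small EMD between co-located activities is exactly why the activity profile alone cannot separate them, motivating the motion profile.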
CSI to Motion Profile. WiSen uses the CSI-speed model [20] to build the motion profile. With K dynamic paths from the transmitter to the receiver, the channel property h(f, t) can be represented as:
$$h(f,t) = e^{-j2\pi\Delta f t}\Big(h_s(f,t) + \sum_{k=1}^{K} a_k(f,t)\, e^{-j2\pi l_k(t)/\lambda}\Big)$$

where $h_s(f,t)$ is the contribution of all static paths, $l_k(t)$ is the path length and $a_k(f,t)$ is the attenuation, while $\Delta f$ is the frequency offset. With the target moving at speed $v_k$, we have $l_k(t) = l_k(0) + v_k t$ within a small period $t$. As such, we can derive the power $|h(f,t)|^2$ as follows:
Dual-Statistics Analysis with Motion Augmentation 897


$$
\begin{aligned}
|h(f,t)|^2 ={} & \sum_{k=1}^{K} 2\,|h_s(f)\,a_k(f,t)|\cos\!\Big(\frac{2\pi v_k t}{\lambda} + \frac{2\pi l_k(0)}{\lambda} + \phi_{sk}\Big) \\
 & + \sum_{\substack{k,l=1\\ k\neq l}}^{K} 2\,|a_k(f,t)\,a_l(f,t)|\cos\!\Big(\frac{2\pi (v_k - v_l)t}{\lambda} + \frac{2\pi (l_k(0)-l_l(0))}{\lambda} + \phi_{kl}\Big) \\
 & + \sum_{k=1}^{K} |a_k(f,t)|^2 + |h_s(f)|^2
\end{aligned}
\tag{2}
$$

The above reveals that the target's moving speed is directly related to the frequency of the sinusoid components in $|h(f,t)|^2$. Therefore, after transforming the CSI series to the frequency domain, the speed profile of the target can be extracted.
PCA-Based Denoising. Based on the Nyquist theorem, the sampling frequency should be at least twice the variation frequency. The variation frequency of the CSIs is the number of wavelengths the target moves per second, as shown in Eq. 2. Thus, 150 Hz is the upper bound for human speeds below 8 m/s. To obtain a good de-noising effect, we choose 1250 p/s as the packet rate, as suggested in [20].
The noise in CSIs mainly comes from internal WiFi state transitions, which induce similar effects across all subcarriers. PCA-based analysis extracts the correlation across subcarriers and is thus effective in removing CSI noise, as shown in Fig. 13. Different from previous works, the noise typically exists in the fourth principal component and above, which may be due to the Atheros WiFi cards; this motivates us to keep the first three principal components.3
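The keep-the-first-three-components step can be sketched with a plain SVD-based PCA reconstruction. The function name and the choice of SVD (rather than an eigendecomposition of the covariance) are our assumptions; only the "keep three components" rule comes from the paper.

```python
import numpy as np

def pca_denoise(csi_amp, n_keep=3):
    """Project CSI amplitudes onto their first `n_keep` principal
    components across subcarriers and reconstruct the series.

    csi_amp: (n_packets, n_subcarriers).
    """
    mean = csi_amp.mean(axis=0)
    x = csi_amp - mean                       # center per subcarrier
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    s[n_keep:] = 0.0                         # drop the noisy components
    return u @ np.diag(s) @ vt + mean

csi = np.random.randn(1000, 56)              # amplitude of 56 subcarriers
den = pca_denoise(csi)
print(den.shape)                             # (1000, 56)
```

Because internal-state noise is common to all subcarriers, it concentrates in a few components, so truncating the expansion removes it while keeping the motion-induced variation.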
Motion Profile Construction. To extract the motion profile, we use the discrete wavelet transform (DWT) to convert $|h(f,t)|^2$ to the frequency domain. For each CSI segment of 240 ms, WiSen obtains the mean power at each level after decomposing into 10 levels4, which have exponentially decreasing frequency ranges5. Compared with the short-time Fourier transform (STFT), the DWT has the advantage of obtaining high-frequency components with high time resolution and low-frequency components with high frequency resolution. We move the segment window with a step of 80 ms to smooth the values. Thus, if the activity lasts 2 s, the motion profile is a collection of 25 10-dimensional vectors. Figure 13 (right) shows the heatmap of the DWT power distribution over time.
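The sliding-window DWT power computation can be sketched as below. To keep the example dependency-free we use a hand-rolled Haar transform as a stand-in (a library such as PyWavelets would normally be used, and the paper does not name its wavelet); the window count and level count may differ slightly from the paper's figures depending on boundary handling.

```python
import numpy as np

def haar_dwt_power(x, levels):
    """Mean power of the detail coefficients at each Haar DWT level.
    powers[0] is the highest-frequency level."""
    powers = []
    approx = np.asarray(x, dtype=float)
    for _ in range(levels):
        n = len(approx) // 2 * 2
        a = (approx[:n:2] + approx[1:n:2]) / np.sqrt(2)   # approximation
        d = (approx[:n:2] - approx[1:n:2]) / np.sqrt(2)   # detail
        powers.append(float(np.mean(d ** 2)))
        approx = a
    return powers

def motion_profile(power_series, fs=1250, win=0.24, step=0.08, levels=8):
    """240 ms windows, 80 ms step: one power-per-level vector per window."""
    w, s = int(win * fs), int(step * fs)
    return np.array([haar_dwt_power(power_series[i:i + w], levels)
                     for i in range(0, len(power_series) - w + 1, s)])

x = np.random.randn(2500)          # 2 s of |h(f,t)|^2 samples at 1250 p/s
print(motion_profile(x).shape)     # (23, 8)
```

Each row of the result corresponds to one time window of the heatmap in Fig. 13 (right).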
Motion-Profile Analysis. Matching the DWT distribution over time as in [19,20] is infeasible for loosely defined activities. WiSen instead uses a weighted sum as the motion intensity, denoted I, with the DWT level as the weight, because higher frequency

3 Since the information mainly exists in the first several PCA components, WiSen does not analyze the later components.
4 This is the maximum value due to the boundary effect.
5 For example, level 1 covers the range 150–300 Hz while level 2 covers 75–150 Hz.

Fig. 13. Effect of PCA-based de-noising.

Fig. 14. Motion intensities of three activities.

is induced by faster speed (Eq. 2). Figure 14 shows the values of I for activities 'h'–'j' (the activity codes are defined in Table 1).

2.3 Profile Augmentation

In this section, we discuss the augmentation strategy with the motion profile. WiSen adopts priority-based decision (PBD) to augment the recognition from dual-statistics analysis. PBD is based on the observation that dual-statistics analysis demonstrates much higher accuracy; thus, PBD puts a higher priority on its results. Specifically, we sum up the probabilities of co-located activities to get the probability of the corresponding location. If, after dual-statistics analysis is applied, there are still multiple candidates, we apply the results from motion-profile analysis.
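The PBD rule above can be sketched as follows. The confidence cutoff `thresh` and the exact tie-breaking are our assumptions; the paper specifies only the priority ordering (location from dual statistics first, motion profile to break remaining ambiguity among co-located activities).

```python
def pbd_decide(activity_probs, co_located, motion_probs, thresh=0.9):
    """Priority-based decision sketch.

    activity_probs: {activity: prob} from dual-statistics analysis.
    co_located: {location: [activities at that location]}.
    motion_probs: {activity: prob} from motion-profile analysis.
    """
    # 1. Sum co-located activity probabilities into location probabilities.
    loc_probs = {loc: sum(activity_probs[a] for a in acts)
                 for loc, acts in co_located.items()}
    best_loc = max(loc_probs, key=loc_probs.get)
    cands = co_located[best_loc]
    # 2. Keep the dual-statistics answer when it clearly dominates its
    #    location; otherwise let the motion profile pick among candidates.
    top = max(cands, key=lambda a: activity_probs[a])
    if activity_probs[top] / loc_probs[best_loc] >= thresh:
        return top
    return max(cands, key=lambda a: motion_probs[a])

probs = {"watch TV": 0.30, "video game": 0.35, "drink": 0.05, "cook": 0.30}
co = {"living room": ["watch TV", "video game", "drink"], "kitchen": ["cook"]}
motion = {"watch TV": 0.1, "video game": 0.7, "drink": 0.2, "cook": 0.0}
print(pbd_decide(probs, co, motion))   # video game
```

Here the dual-statistics output pins the location (living room) but not the activity, so the motion intensities among the co-located candidates decide.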

3 Methodology

3.1 Testbed

We conduct extensive experiments in two environments, shown in Fig. 15. One is a typical office environment with multiple desks and chairs. The other is a two-bedroom home environment. Both environments have rich multipath propagation. The apartment setup provides an NLoS scenario for our testing purposes.

[Figure: floor plans with transceiver positions Tx1, Rx1, Rx2, bedrooms B1 and B2, and room dimensions of roughly 6 m × 5 m (office) and 9 m × 7 m (apartment).]
Fig. 15. Testbed. The left is an office room and the right is a two-bedroom apartment.

3.2 Infrastructure Setup

We install Qualcomm Atheros chipsets (i.e., AR9382 and AR9462) on HP laptops as the transmitter and receiver. Each WiFi PCIe card supports two antennas. Thus, with the 802.11n WiFi protocol, the signal transmission can support a 2 × 2 MIMO stream, and for each correctly received packet, the WiFi card reports 4 spatial CSI sets. To enable the CSI calculation and reporting functionality of the WiFi card, the 802.11 WiFi standard specifies the requirement of setting the sounding flag. Thus, the Linux kernel is modified with the Atheros CSI tool [23] in an Ubuntu 14.04 LTS environment, which supports up to 9 spatial CSI sets, 4 of which are valid in our setting. In the 20 MHz WiFi band with 56 subcarriers, each CSI set is a vector of 56 complex values with a resolution of 10 bits.

Table 1. Codes and locations for tested activities

Code | Activity | Location
a | Empty | N/A
b | Sleep (bed) | Bedroom 2
c | Read (bed) | Bedroom 2
d | Phone call (bed) | Bedroom 2
e | Read (chair) | Bedroom 2
f | Bend-over (chair) | Bedroom 2
g | Type (chair) | Bedroom 2
h | Watch TV (sofa) | Living Room
i | Video Game (sofa) | Living Room
j | Drink (sofa) | Living Room
k | Wash Dishes (sink) | Kitchen
l | Eat (table) | Kitchen
m | Cook (stove) | Kitchen

3.3 Data Collection


The carrier frequency is 2.462 GHz (WLAN channel 11) and the transmission bandwidth is 20 MHz. With each received packet, the WiFi chipset calculates CSI and reports it from the kernel space to the user space. However, due to burst transmission, packet errors and congestion, the actual packet transmissions may not be evenly distributed over time. Therefore, we utilize the timestamp information to interpolate between CSI values.
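The interpolation step can be sketched with `numpy.interp` applied per feature. Linear interpolation and the per-column loop are our assumptions; the paper states only that timestamps are used to interpolate between CSI values.

```python
import numpy as np

def resample_csi(timestamps, csi, rate=1250):
    """Interpolate CSI onto an even time grid using packet timestamps.

    timestamps: (n,) seconds, strictly increasing; csi: (n, n_features).
    Returns CSI sampled at `rate` samples per second.
    """
    t_even = np.arange(timestamps[0], timestamps[-1], 1.0 / rate)
    return np.column_stack([np.interp(t_even, timestamps, csi[:, j])
                            for j in range(csi.shape[1])])

t = np.sort(np.random.uniform(0, 1, 1000))   # bursty packet arrival times
csi = np.random.randn(1000, 8)               # toy 8-feature CSI amplitudes
print(resample_csi(t, csi).shape[1])         # 8
```

This turns the irregular packet stream into the uniform 1250 p/s series that the averaging, PCA and DWT stages assume.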
We have five users, including males and females. In the testbeds in Fig. 15, we conduct different sets of activities. In the office, we test a set of four activities: drinking, drawing, bend-over and spine-stretch. In the apartment, we test 13 activities (shown in Table 1), and each participant is guided to repeat each activity 20 times with an interval of ∼4 s. The data is split in the ratio of 7:3 for training and testing, respectively. The antenna setting is introduced in the evaluation.

4 Evaluation

Fig. 16. Confusion matrix of ‘E-eyes’ w.r.t dual-statistics analysis.

4.1 Dual-Statistics Analysis


To evaluate the performance of the proposed dual-statistics analysis in WiSen, we compare it with two alternatives: the subcarrier-level distribution approach in E-eyes [22], and a single-statistics approach from the localization area with a feedforward neural network (FNN) model. Specifically, E-eyes [22] constructs a distribution histogram for each subcarrier as the profile and compares the distance between the detected profile and the reference profile using the EMD and KNN algorithms. The FNN model takes each CSI vector collected during the activity as the input feature and trains a fully-connected neural network.

Fig. 17. Accuracies of three methods w.r.t dual-statistics analysis.

Fig. 18. Confusion matrix of WiSen w.r.t motion profile.

We analyze the results on the Tx1-Rx1 link and six activities (i.e., 'h'–'m') which have strong multipath links between Tx1 and Rx1. WiSen has an average accuracy of 96%. In comparison, the 'E-eyes' approach has an average accuracy of 60.3% (Fig. 16) while that of 'FNN' is 36.7% (Fig. 17). From the confusion matrix (Fig. 16), we can see that the performance of the 'E-eyes' approach is mostly degraded by co-located activities, which demonstrates that the dual-statistics approach of WiSen is more effective. It also demonstrates that our BiRNN model is effective in handling the dual-statistics analysis task. Moreover, the improvement over the FNN model demonstrates that CSI statistics over the whole activity are more effective than the CSI at a single timestamp.

4.2 Motion-Profile Analysis


The motion-profile analysis shows an average accuracy of 44.7% (Fig. 18), which is worse than that of the dual-statistics analysis on the Tx1-Rx1 link. The reason is that several activities have similar motion patterns. For example, both 'wash' and 'cook' include shaking of the whole upper body and thus incur similar motion intensity I. The large error across activities 'j'–'m' arises because the ranges of their motion intensities overlap with each other. However, we note that it performs well in distinguishing co-located activities, i.e., with accuracy larger than 93% across 'h', 'i' and 'j'. This observation guides us to design the augmentation strategy based on the priority between the multipath and motion profiles, as elaborated in Sect. 2.3.

4.3 Holistic System Performance


The holistic system evaluation applies the PBD augmentation strategy.

Office Environment. The office area (Fig. 15) provides a line-of-sight (LoS) scenario with desks and chairs. The experiment in this section evaluates the effectiveness of the proposed approach in a typical office environment. Since the results are limited to four types of activities, we do not use them for in-depth analysis of each profile (Sect. 4.1 and Sect. 4.2).
Figure 19 shows the confusion matrix. We use cross-validation to obtain the recognition accuracy. The results show that WiSen can reliably detect and recognize all four activities in an open area with LoS links.

Fig. 19. Confusion matrix

Apartment Environment. To cover the whole area (Fig. 15), we set up two WiFi links with one transmitter and two receivers.
Figure 20 shows the performance of the individual dual-statistics analysis and motion analysis, as well as their integration with the augmentation strategy. The results show that the average accuracy of the dual-statistics analysis is 92.8% while that of the motion-profile analysis is 23.08%. Meanwhile, the integration of both profiles enhances the accuracy to 98%.6 This demonstrates that the augmentation strategy boosts the performance by combining the strengths of both profiles.

5 Related Works
Non-WiFi Based Approaches. Non-WiFi based approaches have limitations regarding their application scenarios. For example, vision approaches [2]
6 It is not obvious in the figure due to the scale.

Fig. 20. Recognition accuracy w.r.t multipath (dual-statistics), motion profile analysis
and WiSen.

need good lighting conditions and are limited to LoS areas. The privacy concern is also more serious than with WiFi signals, due to the abundant information captured by cameras. RF-based approaches such as radar [1] and other customized signals [25] incur extra infrastructure cost. Approaches that attach sensors to the body to collect data [8,24] increase the deployment cost and hurt the user experience.
WiFi-Based Approaches. The ubiquity of WiFi devices attracts researchers due to their convenient and low-cost deployment. In this literature, one group of approaches utilizes the CSI statistics during the activity. E-eyes [22] collects CSI traces from commodity devices as a database of activity profiles: it constructs a distribution histogram for each subcarrier as the profile and compares the distance between the detected profile and the reference profile using the EMD algorithm. In contrast, another group of methods [12,20] utilizes the variation in CSI traces over time. For instance, WiSee [12] infers gestures by looking into the Doppler shift, and CARM [20] builds a CSI-activity model by constructing a mapping from CSI to speed. Unlike subcarrier-level distribution analysis [22], WiSen exploits the diversity across subcarriers. Further, we propose a priority-based decision strategy (PBD) to boost the performance.

6 Conclusion and Future Works


In this work, we propose the design of a human activity recognition system, WiSen, using commodity WiFi signals. WiSen improves recognition performance by making full use of the channel state information. WiSen applies a neural network model to analyze the statistics of high-dimensional CSI data. Then, the system further increases the recognition accuracy through motion augmentation. With WiSen, we expect to provide a better activity recognition service to the public. In the future, there are still opportunities to improve the overall system performance. A smart home environment typically contains more than one WiFi link. These multiple WiFi transmission pairs can provide CSI data separately under different channel conditions. Since home activities leave footprints all over the home space, every individual link could be important in accurately detecting human motion. Therefore, it is promising to improve the overall system performance by integrating the data analysis from multiple links. We would like to explore this multi-link integration strategy in the future.

References
1. Google project soli. https://www.youtube.com/watch?v=0qnizfsspc0
2. Microsoft. x-box kinect. http://www.xbox.com
3. Philips lifeline. http://www.lifelinesys.com/content/
4. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for
statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
5. Donoho, D.L., et al.: High-dimensional data analysis: the curses and blessings of
dimensionality. AMS Math Challenges Lecture 1(2000), 32 (2000)
6. Giraud, C.: Introduction to High-Dimensional Statistics. Chapman and Hall/CRC
(2014)
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)
8. Jeong, S., Kim, T., Eskicioglu, R.: Human activity recognition using motion sen-
sors. In: Proceedings of the 16th ACM Conference on Embedded Networked Sensor
Systems, pp. 392–393. ACM (2018)
9. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
10. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. In: Advances in Neural Information Processing Sys-
tems, pp. 1097–1105 (2012)
11. Levina, E., Bickel, P.: The earth mover’s distance is the mallows distance: some
insights from statistics. In: Proceedings Eighth IEEE International Conference on
Computer Vision. ICCV 2001, vol. 2, pp. 251–256. IEEE (2001)
12. Pu, Q., Gupta, S., Gollakota, S., Patel, S.: Whole-home gesture recognition using
wireless signals. In: Proceedings of the 19th Annual International Conference on
Mobile Computing & Networking, pp. 27–38. ACM (2013)
13. Rippel, O., Adams, R.P.: High-dimensional probability estimation with deep den-
sity models. arXiv preprint arXiv:1302.5125 (2013)
14. Rubner, Y., Tomasi, C., Guibas, L.J.: A metric for distributions with applications
to image databases. In: Sixth International Conference on Computer Vision (IEEE
Cat. No. 98CH36271), pp. 59–66. IEEE (1998)
15. Sak, H., Senior, A., Beaufays, F.: Long short-term memory recurrent neural net-
work architectures for large scale acoustic modeling. In: Fifteenth Annual Confer-
ence of the International Speech Communication Association (2014)
16. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans.
Sig. Process. 45(11), 2673–2681 (1997)
17. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)
18. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
19. Venkatnarayan, R.H., Page, G., Shahzad, M.: Multi-user gesture recognition using
wifi. In: Proceedings of the 16th Annual International Conference on Mobile Sys-
tems, Applications, and Services, pp. 401–413. ACM (2018)
20. Wang, W., Liu, A.X., Shahzad, M., Ling, K., Lu, S.: Understanding and modeling
of wifi signal based human activity recognition. In: Proceedings of the 21st Annual
International Conference on Mobile Computing and Networking, pp. 65–76. ACM
(2015)

21. Wang, X., Gao, L., Mao, S., Pandey, S.: CSI-based fingerprinting for indoor local-
ization: a deep learning approach. IEEE Trans. Veh. Technol. 66(1), 763–776 (2017)
22. Wang, Y., Liu, J., Chen, Y., Gruteser, M., Yang, J., Liu, H.: E-eyes: device-free
location-oriented activity identification using fine-grained wifi signatures. In: Pro-
ceedings of the 20th Annual International Conference on Mobile Computing and
Networking, pp. 617–628. ACM (2014)
23. Xie, Y., Li, Z., Li, M.: Precise power delay profiling with commodity wifi. In:
Proceedings of the 21st Annual International Conference on Mobile Computing
and Networking, MobiCom 2015, New York, NY, USA, 2015, pp. 53–64. ACM
(2015)
24. Yatani, K., Truong, K.N.: Bodyscope: a wearable acoustic sensor for activity recog-
nition. In: Proceedings of the 2012 ACM Conference on Ubiquitous Computing,
pp. 341–350. ACM (2012)
25. Zhang, Y., et al.: Vibration-based occupant activity level monitoring system. In:
Proceedings of the 16th ACM Conference on Embedded Networked Sensor Sys-
tems, pp. 349–350. ACM (2018)
26. Zhou, R., Chen, J., Lu, X., Wu, J.: CSI fingerprinting with SVM regression to
achieve device-free passive localization. In: 2017 IEEE 18th International Sympo-
sium on A World of Wireless, Mobile and Multimedia Networks (WoWMoM), pp.
1–9. IEEE (2017)
Cascading Failure Risk Analysis of Electrical
Power Grid

Saikat Das and Zhifang Wang(B)

Virginia Commonwealth University, Richmond, VA 23220, USA


{dass10,zfwang}@vcu.edu

Abstract. This paper studies the severity of cascading failure processes of electri-
cal power grids by statistically analyzing the number of line outages per cascade,
the cascade duration, and the load shedding amount of a cascade, based on the
historical utility data of the BPA system and the simulation data of two synthetic
test cases. Both uniform and non-uniform probability distribution functions have
been considered for the initial line trips in the cascading failure simulation in
order to determine which function better approximates the cascading failure risks
of the real-world grid. The obtained simulation data and statistical analysis results
from the two 500-bus synthetic test cases are then compared with those from the
historical utility data.

Keywords: Electrical power system · Cascading failure · Probability


distribution function · Cascade duration · Line outage · Load shedding

1 Introduction
The modern electrical power system is always evolving to meet the growing electricity
demand. Any disruption in power delivery has a severe effect on our daily life. A complex
interconnected infrastructure like the power grid is prone to cascading processes, where a failure in one part of the system can affect other parts of the grid. These cascading processes may be initiated by a single line trip and result in a large number of line outages and even widespread blackouts. The best way to analyze these events is to study the past events
documented by various utility and power distribution companies. The existing simulation
models can also be validated by using the statistical parameters of the past events as a
benchmark. In [1], the authors took a complex system approach to analyze the blackout
risk of power transmission systems using the North American Electrical Reliability
Council (NERC) data. In [2], the authors evaluated the statistics of cascading line outages
spreading using utility data. In [3], the authors studied different algorithms for cascading
failure analysis in the power grid. In [4, 5], the authors incorporated a cascading failure
simulation model with ac power flow to show power systems’ vulnerability to cascading
failure with rising renewables integration. In [6], the authors estimated the distribution
of cascaded outages using both historical data and a simulation model.
The severity of a cascading failure process in the power grid can be measured by
the number of lines tripped during the event, the duration of the whole process, and

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023


K. Arai (Ed.): FTC 2022, LNNS 559, pp. 906–923, 2023.
https://doi.org/10.1007/978-3-031-18461-1_60

the amount of load shedding needed to restore the system balance. In our study, these
three parameters of a cascading failure process have been statistically analyzed using
the historical utility data of the BPA system and two synthetic 500-bus test cases. In a
cascading failure simulation model, it is critical to define an appropriate mechanism for
the initial line trip which causes the cascading failure (CF) process. In our previous work
[6], an uniform probability distribution function was adopted for the cascading failure
study. In this study, we will consider both uniform and non-uniform initial line tripping
probability and compare them with the results of the historical utility data, in order to
determine which function may give better approximates of the grid CF risks.

2 Risk Analysis of Cascading Process from Historical Utility Data


In this study, the historical outage data from the Bonneville Power Administration (BPA)
website [7] have been collected and statistically analyzed. The BPA system is an Ameri-
can federal agency operating in the Pacific Northwest. Bonneville is one of four regional
Federal power marketing agencies within the U.S. Department of Energy (DOE). We
examined 21 years of transmission line outage data (1999–2020) publicly available on
the BPA website. Every outage has information on the outage date and time, outage
duration, voltage level, outage type, cause of the outage, and so on. All the outages can
be categorized into automatic outages and planned outages. Only automatic outages are
considered for our study as these outages normally initiate CF processes in the system.
From 1999 to 2020, there are 43,240 automatic outages documented in the BPA data. These outages are then grouped into different cascades according to their start times: if two line outages are separated by more than one hour, they are grouped into two different cascades; otherwise, if the start-time difference of two consecutive outages is less than or equal to one hour, they are grouped into the same cascade. This grouping process is adopted from [2]. After grouping all the outages into separate cascades, we then analyze the risk associated with each cascade.
In order to measure the severity of CF processes, the number of tripped lines (outage number) and the duration of each cascade have been studied for the BPA system. The higher the outage number and the longer the duration of a cascade, the higher the risk it poses to the system. Since load shedding is not recorded in the BPA data, we cannot analyze this measure for it.
In Fig. 1(a), we can see the probability distribution of the number of outages in a cascade for the BPA data. The probability of a cascade with only one outage is the highest; after that, the probability decreases rapidly with an increasing number of outages. The exponential distribution function is adopted to fit the BPA outage data; according to this distribution, the mean number of outages in a cascade is 3.53. In Fig. 1(b), the probability distribution of the cascade duration is shown. The probability of a cascade lasting less than 20 min is the highest; after that, the probability decreases rapidly with increasing duration. We have also fitted the duration data with the exponential distribution function; the mean duration according to this fit is 30.9 min.
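For an exponential distribution, the maximum-likelihood estimate of the mean is simply the sample mean, so the fitting step reduces to the sketch below. The synthetic durations are illustrative, generated with the paper's reported mean of 30.9 min; the paper's actual fitting procedure is not specified beyond "exponential fit".

```python
import numpy as np

def expon_fit_mean(samples):
    """MLE of the mean of an exponential distribution: the sample mean."""
    return float(np.mean(samples))

rng = np.random.default_rng(1)
durations = rng.exponential(scale=30.9, size=5000)   # synthetic cascade durations, min
print(round(expon_fit_mean(durations), 1))           # close to 30.9
```

The same estimator applied to the cascade outage counts yields the reported mean of 3.53 outages per cascade.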

(a) Probability Distribution of Outage Number in a Cascade

(b) Probability Distribution of Cascade duration


Fig. 1. Probability distribution with exponential fitting for BPA data

3 Risk Analysis from Simulated Data


3.1 Simulation Model

The CF simulation model developed in [8] is used for simulations on two synthetic power system test cases in order to analyze the risk of CF processes. First, the optimal power flow (OPF) algorithm is applied to the forecasted load profiles to determine the initial generation dispatch under normal conditions. Then a single branch of the system is tripped manually to initiate the CF process. After every branch trip, AC power flow is used to determine the overloaded branches of the system. The Unscented Transformation (UT) method is used to determine the mean overload time of the overloaded branches [9]. These values are then used in the relay mechanism to determine whether any other branches get tripped. After every new line trip, a modified version of the OPF algorithm is used to restore the power balance of the system; here, the least-squares adjusted OPF algorithm is used to mimic the most viable path of the CF process. The CF process eventually stabilizes, leaving no more overloaded lines in the system. After every CF simulation, we determine the total number of line trips during the cascade, the duration of the cascade, and the load shedding amount (if any).

3.2 Initial Line Tripping Mechanism


We initiate our cascading failure simulation by tripping a single branch of the system. If a system has n branches, n different cascades can be initiated by tripping each branch in turn. We consider both uniform and non-uniform distributions for these initial line trips.

Uniform Initial-Line-Trip Distribution: In this case, it is assumed that every line has the same probability of being tripped to initiate a CF process. In that scenario, a system with n lines has n different cascades, each with the same probability of occurrence.

Non-uniform Initial-Line-Trip Distribution: We also consider non-uniform probabilities of initial line trips. Three different parameters are considered to assign the non-uniform probability of initial line trips, so that every cascade in the system may have a different probability.

Branch Flow: First, we consider the branch flow as the defining parameter for the probability of initial line trips. The loading level of each branch is determined as the ratio of the branch power flow under normal operating conditions to its maximum branch capacity. We then divide all branches into ten categories and assign them different weights for the CF estimation, as shown in Table 1(a).

Table 1(a). Non-uniform probability definition of initial line trips according to the loading level

Loading level (l) | Weight
0 ≤ l ≤ 0.1 | 1
0.1 < l ≤ 0.2 | 2
0.2 < l ≤ 0.3 | 3
0.3 < l ≤ 0.4 | 4
0.4 < l ≤ 0.5 | 5
0.5 < l ≤ 0.6 | 6
0.6 < l ≤ 0.7 | 7
0.7 < l ≤ 0.8 | 8
0.8 < l ≤ 0.9 | 9
0.9 < l ≤ 1.0 | 10

The higher the loading level, the higher the probability of a branch being tripped initially to start a cascading process.
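The Table 1(a) binning and the weighted draw of the initially tripped branch can be sketched as follows. The closed-form `ceil` mapping reproduces the table's bins; the function names are ours.

```python
import numpy as np

def loading_weight(l):
    """Map a branch loading level l in [0, 1] to the Table 1(a) weight:
    weight 1 for l <= 0.1, weight 2 for 0.1 < l <= 0.2, ..., weight 10
    for l > 0.9."""
    return max(1, int(np.ceil(l * 10)))

def sample_initial_trip(loading_levels, rng=np.random.default_rng()):
    """Draw the initially tripped branch index with probability
    proportional to its loading-level weight."""
    w = np.array([loading_weight(l) for l in loading_levels], dtype=float)
    return rng.choice(len(w), p=w / w.sum())

levels = [0.05, 0.35, 0.95]                  # hypothetical branch loadings
print([loading_weight(l) for l in levels])   # [1, 4, 10]
```

Normalizing the weights turns the table into a proper probability distribution over the n candidate initial trips.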

Shortest Path: In this case, we use the network topology of the power grid to assign the non-uniform probability of initial line trips. We first define all generation buses as the boundary nodes of the system and calculate the shortest path from every bus to the boundary. Distances are measured in hops along the shortest path. As every branch connects two buses, we take the minimum of the shortest paths associated with its two endpoint buses; hence, every branch has a corresponding shortest path to the boundary. We use this distance to assign a specific weight to every initial line trip. The longer the distance, the higher the probability of a line being tripped initially, as shown in Table 1(b).
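The hop distance to the boundary is a multi-source BFS from the generation buses; the branch distance is then the minimum over its two endpoints. This is a toy sketch on a 4-bus chain; whether boundary buses count as distance 0 or 1 is not specified in the paper, so the mapping from distance to the Table 1(b) weight is left to the reader.

```python
from collections import deque

def hops_to_boundary(adj, boundary):
    """BFS hop distance from every bus to the nearest generation bus.

    adj: {bus: [neighbor buses]}; boundary: iterable of generation buses.
    """
    dist = {b: 0 for b in boundary}
    q = deque(boundary)
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}   # a toy 4-bus chain, generator at bus 1
d = hops_to_boundary(adj, boundary=[1])
# Branch distance = min over the branch's two endpoint buses.
branch_dist = {(u, v): min(d[u], d[v]) for u, v in [(1, 2), (2, 3), (3, 4)]}
print(branch_dist)   # {(1, 2): 0, (2, 3): 1, (3, 4): 2}
```

Branches electrically far from all generators receive the largest distances and hence, per Table 1(b), the largest initial-trip weights.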

Connectivity: In this case, we consider the connectivity of every bus as the defining parameter. In the power grid, every bus is connected to several other buses, and for every bus we determine this connectivity number. As every branch connects two buses, for every branch we take the average connectivity number of the corresponding buses. This connectivity number is used to assign the non-uniform probability of initial line trips. The higher the connectivity, the higher the probability of a line being tripped initially and starting a CF process, as shown in Table 1(c).

Table 1(b). Non-uniform probability definition of initial line trips according to the shortest path

Shortest path to boundary (d) | Weight
d = 1 | 1
d = 2 | 2
d = 3 | 3
d = 4 | 4
d = 5 | 5
d = 6 | 6
d = 7 | 7

Table 1(c). Non-uniform probability definition of initial line trips according to the connectivity

Connectivity number of every line (n) | Weight
n = 1 | 1
n = 1.5 | 2
n = 2 | 3
n = 2.5 | 4
n = 3 | 5
n = 3.5 | 6
n = 4 | 7
n = 4.5 | 8
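The connectivity measure in Table 1(c) is the average degree of a branch's two endpoint buses, which explains the half-integer values in the table. A minimal sketch (function names are ours):

```python
def branch_connectivity(adj, branches):
    """Average endpoint degree per branch, used to rank the non-uniform
    initial-trip probability in Table 1(c): higher connectivity maps to
    a higher weight.

    adj: {bus: [neighbor buses]}; branches: list of (bus_u, bus_v).
    """
    deg = {b: len(nbrs) for b, nbrs in adj.items()}
    return {br: (deg[br[0]] + deg[br[1]]) / 2 for br in branches}

adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}   # toy 4-bus grid
print(branch_connectivity(adj, [(1, 2), (2, 3), (3, 4)]))
# {(1, 2): 2.0, (2, 3): 2.5, (3, 4): 2.0}
```

Averaging two integer degrees yields values on a 0.5 grid, matching the n = 1, 1.5, ..., 4.5 rows of Table 1(c).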

4 Result
In this study, two different synthetic 500-bus test cases have been used for the CF
simulation and statistical risk analysis.

4.1 Test Case: ACTIVSg500


The ACTIVSg500 case is a 500 bus power system test case that is entirely synthetic,
built from public information and statistical analysis of real power systems. It bears
no relation to the actual grid in this location, except that generation and load profiles
are similar [10]. This synthetic 500-bus test case contains 56 committed generators, 597
transmission lines, and 200 loads, a total online generation capacity of 8863.6 MW, and a
load of 7750.7 MW. As the system has 597 transmission lines, we will have 597 different
CF processes by initially tripping these 597 lines. For each CF simulation process, we
have calculated the number of total lines tripped during the cascade, the duration of
the cascade, and the amount of load shedding in percentage. We have considered both
uniform and non-uniform probability of initial line trips. For the uniform probability of
initial line trips, all the cascades initiated by a single line trip have the same probability.
On the contrary, for the non-uniform probability of initial line trips, all the cascades have
a different probability of occurrence. We have considered three different loading levels
for our CF simulation, and every loading level has a different weight according to the
load duration curve of the BPA control area load. These three loading levels have been
combined according to their weights to mimic the real-world scenario where loading
levels differ at different times of the day, as shown in Table 2.

Table 2. Weights of different loading levels

Loading level Weight


85% 0.1
70% 0.7
50% 0.2
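The three loading levels are combined as a weighted mixture using the Table 2 weights. A minimal sketch (the per-level statistics below are made-up placeholders for illustration, not simulation results):

```python
# Weights from Table 2 (each level's share of the BPA load duration curve).
LEVEL_WEIGHTS = {0.85: 0.1, 0.70: 0.7, 0.50: 0.2}

def combine_levels(stat_by_level, weights=LEVEL_WEIGHTS):
    """Combine a per-loading-level statistic into a single value,
    weighting each level by its share of the load duration curve."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[lvl] * stat_by_level[lvl] for lvl in weights)

# Placeholder per-level mean outage counts (illustrative only):
combined = combine_levels({0.85: 5.0, 0.70: 3.5, 0.50: 2.0})
# 0.1*5.0 + 0.7*3.5 + 0.2*2.0 = 3.35
```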

In Fig. 2, the probability distributions of the total number of outages in a cascade are
shown. Figure 2(a) considers the uniform probability of any initial line trip, and Fig. 2(b),
(c), and (d) consider the non-uniform probability of initial line trips. In Fig. 2(b), the
power flow in a branch is used as a defining parameter to assign different weights to
initial line trips. Similarly, in Fig. 2(c) and Fig. 2(d), the shortest path to the boundary and
the connectivity of the system are used as defining parameters to assign different weights
to the initial line trips, respectively. It is obvious from the figure that most of the cascades
have only one outage and did not spread any more. The probability of a cascade having
a large number of outages is very low, and this probability decreases with the increasing
number of outages. We have fitted our simulated outage number data with an exponential
distribution. In Table 3, the mean number of outages per cascade is shown. The initial
line trip probability according to branch flow gave the best estimate compared with the
BPA data, while the probability according to the shortest path underestimated the mean
outage number and the probability according to connectivity overestimated it the most.

Fig. 2. Probability distribution of the number of outages in a cascade for the ACTIVSg500
system: (a) uniform initial line trip probability; (b) initial line trip probability according to branch
flow; (c) according to the shortest path to the boundary; (d) according to connectivity.
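For an exponential distribution the maximum-likelihood fit reduces to estimating the mean, which is the (optionally cascade-probability-weighted) sample mean. A minimal sketch of this fitting step (our own variable names, not the authors' code):

```python
def fit_exponential(samples, cascade_probs=None):
    """ML fit of an exponential distribution to per-cascade data
    (outage counts, durations, or load shedding percentages).

    Returns (mean, rate): the fitted density is f(x) = rate * exp(-rate * x)
    with rate = 1 / mean. With non-uniform initial-trip probabilities,
    each cascade's contribution is weighted by its probability."""
    if cascade_probs is None:                      # uniform initial line trips
        mean = sum(samples) / len(samples)
    else:                                          # non-uniform initial line trips
        mean = (sum(p * x for p, x in zip(cascade_probs, samples))
                / sum(cascade_probs))
    return mean, 1.0 / mean

# Illustrative outage counts for five cascades (not real simulation data):
mean, rate = fit_exponential([1, 1, 1, 2, 5])      # mean 2.0, rate 0.5
```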

Table 3. Mean number of outages in a cascade for ACTIVSg500 system

Probability of initial line trip   Uniform   Non-uniform
                                             Branch flow   Shortest path   Connectivity
Mean number of outages             3.44      3.46          3.05            4.07

In Fig. 3, the probability distributions of the duration of a cascade are shown.
Figure 3(a) considers the uniform probability of any initial line trip, and Fig. 3(b),
(c), and (d) consider the non-uniform probability of initial line trips. The probability
of a cascade having a long duration is very low, and this probability decreases with
increasing duration. We have fitted our simulated cascade duration data with an
exponential distribution. In Table 4, the mean duration of a cascade is shown, which is
fairly close to the number obtained from the BPA data. The initial line trip probability
according to the connectivity of the system gave a more accurate mean duration value
compared with the BPA data, while the probability according to the shortest path
underestimated the mean duration of a cascade the most.
In Fig. 4, the probability distributions of the load shedding percentage during a
cascade are shown. Figure 4(a) considers the uniform probability of any initial line
trips and Fig. 4(b), (c), and (d) consider the non-uniform probability of initial line trips.
The probability of a cascade having a large load shedding amount is very low and this
probability decreases with the increasing load shedding amount. We have fitted our
simulated load shedding percentage data with an exponential distribution. In Table 5, the
mean load shedding percentage of a cascade is shown, where the probability according
to the shortest path resulted in the lowest amount of load shedding and the probability
according to connectivity resulted in the largest amount of load shedding.
Fig. 3. Probability distribution of the cascade duration for the ACTIVSg500 system: (a) uniform
initial line trip probability; (b) according to branch flow; (c) according to the shortest path to the
boundary; (d) according to connectivity.

Fig. 4. Probability distribution of the load shedding percentage of a cascade for the ACTIVSg500
system: (a) uniform initial line trip probability; (b) according to branch flow; (c) according to the
shortest path to the boundary; (d) according to connectivity.

Table 4. Mean duration of a cascade for ACTIVSg500 system

Probability of initial line trip   Uniform   Non-uniform
                                             Branch flow   Shortest path   Connectivity
Mean duration (minutes)            28.86     28.94         27.32           29.26

Table 5. Mean load shedding percentage of a cascade for ACTIVSg500 system

Probability of initial line trip   Uniform   Non-uniform
                                             Branch flow   Shortest path   Connectivity
Mean load shedding (%)             1.78      1.91          1.41            2.23

4.2 Test Case: ASGWECC_500_1

We have used another 500 bus test case for our CF simulation model. This test case
has been developed using AutoSynGrid, a MATLAB-based toolkit for the automatic
generation of synthetic power grids [11]. For our generated test case, the WECC
(Western Electricity Coordinating Council) system has been utilized as the reference
system for generation and load settings, as this system is closely related to the BPA
control area load. We have named this test case ASGWECC_500_1. This test case contains
103 committed generators, 875 transmission lines, 109 loads, a total online generation
capacity of 34,277.8 MW, and a load of 29,313.8 MW. Like the previous test case, as the
system has 875 transmission lines, we will have 875 different CF processes by initially
tripping these 875 lines. For each CF simulation process, we have calculated the number
of total lines tripped during the cascade, the duration of the cascade, and the amount
of load shedding in percentage. In this test case also, we have considered both uniform
and non-uniform probability of initial line trips. Just like the previous test case, we have
considered three different loading levels for our CF simulation, and every loading level
has a different weight according to the load duration curve of the BPA control area load.
These three different loading levels have been combined according to their weight to
mimic the real-world scenario where loading levels are different during different times
of the day.
In Fig. 5, the probability distributions of the total number of outages in a cascade
are shown. Figure 5(a) considers the uniform probability of any initial line trip, and
Fig. 5(b), (c), and (d) consider the non-uniform probability of initial line trips. In
Fig. 5(b), the power flow in a branch is used as a defining parameter to assign different
weights to initial line trips. Similarly, in Fig. 5(c) and Fig. 5(d), the shortest path to
the boundary and the connectivity of the system are used as defining parameters to assign
different weights to the initial line trips, respectively. It is evident from the figure that most
of the cascades have only one outage and did not spread any further. The probability of a
cascade having a large number of outages is very low, and this probability decreases with
the increasing number of outages. We have fitted our simulated outage number data with
exponential distribution. In Table 6, the mean number of outages per cascade is shown,
which is higher than the numbers obtained from the BPA data and the previous test
case. The initial line trip probability according to the shortest path to the boundary gave
the best estimate compared with the BPA data, while the probability according to the
branch flow overestimated the number the most.

Fig. 5. Probability distribution of the number of outages in a cascade for the ASGWECC_500_1
system: (a) uniform initial line trip probability; (b) initial line trip probability according to branch
flow; (c) according to the shortest path to the boundary; (d) according to connectivity.

Table 6. Mean number of outages in a cascade for the ASGWECC_500_1 system

Probability of initial line trip   Uniform   Non-uniform
                                             Branch flow   Shortest path   Connectivity
Mean number of outages             4.40      5.91          3.39            4.72

In Fig. 6, the probability distributions of the duration of a cascade are shown.
Figure 6(a) considers the uniform probability of any initial line trip, and Fig. 6(b),
(c), and (d) consider the non-uniform probability of initial line trips. The probability
of a cascade having a long duration is very low, and this probability decreases with
increasing duration. We have fitted our simulated cascade duration data with an
exponential distribution. In Table 7, the mean duration of a cascade is shown, which is
very close to the number obtained from the BPA data. The initial line trip probability
according to the branch flow of the system gave the best estimate of the mean duration
value compared with the BPA data, while the probability according to the shortest path
underestimated the mean duration the most.
In Fig. 7, the probability distributions of the load shedding percentage during a
cascade are shown. Figure 7(a) considers the uniform probability of any initial line trip,
and Fig. 7(b), (c), and (d) consider the non-uniform probability of initial line trips.
The probability of a cascade having a large load shedding amount is very low, and this
probability decreases with the increasing load shedding amount. We have fitted our
simulated load shedding percentage data with an exponential distribution. In Table 8,
the mean load shedding percentage of a cascade is shown, which is lower than in the
previous test case. The initial line trip probability according to the shortest path resulted
in the lowest amount of load shedding, and the probability according to branch flow
resulted in the largest.
In Table 9, the overall summary of the CF processes in terms of the total number of
tripped lines and cascade duration is presented. For test case ACTIVSg500, the mean
number of tripped lines in a cascade is similar to the BPA data for the uniform probability
of initial line trips. However, we obtained the best result when assuming a non-uniform
probability of initial line trips according to branch flow. For test case ASGWECC_500,
all the simulated results overestimated the mean number of tripped lines during a cascade,
and we obtained the best result in comparison with the BPA data for a non-uniform
probability assumption according to the shortest path to the boundary. For cascade
duration, the results from both test cases are similar to the BPA data. Here, the
non-uniform probability assumption according to branch flow gives the best result for
ASGWECC_500, and the non-uniform probability assumption according to the
connectivity of the system gives the best result for the ACTIVSg500 test case.
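The "best scheme per metric" statements above can be reproduced directly from the Table 9 values by picking the scheme whose simulated mean is closest to the BPA benchmark; a small sketch (the helper name is ours, the numbers are transcribed from Table 9):

```python
# Benchmark (BPA historical data) and simulated means from Table 9.
BPA = {"outages": 3.53, "duration": 30.9}
SIM = {
    "ACTIVSg500": {
        "outages":  {"uniform": 3.44, "branch flow": 3.46,
                     "shortest path": 3.05, "connectivity": 4.07},
        "duration": {"uniform": 28.86, "branch flow": 28.94,
                     "shortest path": 27.32, "connectivity": 29.26},
    },
    "ASGWECC_500": {
        "outages":  {"uniform": 4.40, "branch flow": 5.91,
                     "shortest path": 3.89, "connectivity": 4.72},
        "duration": {"uniform": 29.5, "branch flow": 31.17,
                     "shortest path": 27.78, "connectivity": 30.29},
    },
}

def best_scheme(case, metric):
    """Initial-trip probability scheme whose mean is closest to BPA."""
    means = SIM[case][metric]
    return min(means, key=lambda s: abs(means[s] - BPA[metric]))
```

Running this reproduces the text: branch flow is closest for ACTIVSg500 outages and ASGWECC_500 duration, shortest path for ASGWECC_500 outages, and connectivity for ACTIVSg500 duration.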
Fig. 6. Probability distribution of the cascade duration for the ASGWECC_500_1 system:
(a) uniform initial line trip probability; (b) according to branch flow; (c) according to the shortest
path to the boundary; (d) according to connectivity.

Fig. 7. Probability distribution of the load shedding percentage of a cascade for the
ASGWECC_500_1 system: (a) uniform initial line trip probability; (b) according to branch flow;
(c) according to the shortest path to the boundary; (d) according to connectivity.

Table 7. Mean duration of a cascade for ASGWECC_500_1 system

Probability of initial line trip   Uniform   Non-uniform
                                             Branch flow   Shortest path   Connectivity
Mean duration (minutes)            29.50     31.17         27.78           30.29

Table 8. Mean load shedding percentage of a cascade for ASGWECC_500_1 system

Probability of initial line trip   Uniform   Non-uniform
                                             Branch flow   Shortest path   Connectivity
Mean load shedding (%)             0.70      1.16          0.58            0.75

Table 9. Overall comparison of cascading failure risk analysis for the BPA system and the
synthetic test cases

System          Number of tripped lines               Cascade duration (minutes)
                Uniform  Branch   Shortest  Connec-   Uniform  Branch   Shortest  Connec-
                         flow     path      tivity             flow     path      tivity
BPA             3.53 (observed)                       30.9 (observed)
ACTIVSg500      3.44     3.46     3.05      4.07      28.86    28.94    27.32     29.26
ASGWECC_500     4.40     5.91     3.89      4.72      29.5     31.17    27.78     30.29

For load shedding, the BPA system lacks historical data, so we do not have a benchmark
for performance comparison. We can only say, from Table 10, that the load shedding
percentage is higher for the ACTIVSg500 test case for both the uniform and non-uniform
probability distributions of initial line trips. It seems that the non-uniform probability
function of initial line trips gives the most consistent results for both test cases. In the
future, more simulations will be done using more test cases, so that we will be able to
conduct a self-comparison among these test cases and figure out which probability
distribution function is better.

Table 10. Comparison of mean load shedding percentage between two synthetic test cases

System          Load shedding (%)
                Uniform  Branch flow  Shortest path  Connectivity
ACTIVSg500      1.78     1.91         1.41           2.23
ASGWECC_500     0.70     1.16         0.58           0.75

5 Conclusion
The severity of a cascading failure process in the power grid can be measured by the
number of lines that got tripped during the event, the duration of the whole CF process,
and the amount of load needed to be shed to restore the balance of the system. In this
study, we have analyzed these different parameters of a cascading failure process using
historical utility data and two synthetic test cases. For a cascading failure simulation
model, it is important to recognize the mechanism of initial line trips which initiate the
CF process. In this study, we have considered both uniform and non-uniform probability
distribution of initial line trips and compared our results with the results from historical
data. We observed that the non-uniform distribution of initial line trips gave better
estimations of the mean outage number and cascade duration. For the ACTIVSg500 test
case, the probability based on branch flow worked better for outage number estimation
and the probability based on connectivity worked better for estimating cascade duration.
On the other hand, for the ASGWECC_500 test case, the probability based on the shortest
path worked better for outage number estimation and the probability based on branch
flow worked better for estimating cascade duration. For load shedding, the BPA system
lacks information about
load shedding amount and we do not have a benchmark for performance comparison.
We can only say that the load shedding percentage is higher for the ACTIVSg500 test
case than the ASGWECC_500 test case for both uniform and non-uniform probability
distribution of initial line trips. In future work, more simulations will be done using more
test cases, so that we may be able to conduct a self-comparison among these test cases
and figure out which probability distribution function is better in general or if there are
specific cases where we need to choose a specific distribution.

References
1. Dobson, I., Newman, D.E., Carreras, B.A., Lynch, V.E.: An initial complex systems analysis
of the risks of blackouts in power transmission systems. Power Systems and Communications
Infrastructures for the Future, pp. 1–7, September 2002
2. Dobson, I., Carreras, B.A., Newman, D.E., Reynolds-Barredo, J.M.: Obtaining statistics of
cascading line outages spreading in an electric transmission network from standard utility
data. IEEE Trans. Power Syst. 31(6), 4831–4841 (2016). https://doi.org/10.1109/TPWRS.2016.2523884
3. Soltan, S., Mazauric, D., Zussman, G.: Cascading failures in power grids: analysis and
algorithms. https://doi.org/10.1145/2602044.2602066
4. Athari, M.H., Wang, Z.: Stochastic cascading failure model with uncertain generation using
unscented transform. IEEE Trans. Sustain. Energy 11(2), 1067–1077 (2020). https://doi.org/10.1109/TSTE.2019.2917842
5. Das, S., Wang, Z.: Power grid vulnerability analysis with rising renewables infiltration. In:
Proceedings of IMCIC 2021 - 12th International Multi-Conference on Complexity, Informatics
and Cybernetics, vol. 2, pp. 157–162, July 2021
6. Das, S., Wang, Z.: Estimating distribution of cascaded outages using observed utility data and
simulation modeling. In: 2021 North American Power Symposium, NAPS 2021, pp. 5–10
(2021). https://doi.org/10.1109/NAPS52732.2021.9654745
7. BPA.gov - Bonneville Power Administration. https://www.bpa.gov/. Accessed 24 Apr 2022
8. Das, S., Wang, Z.: Power grid vulnerability analysis with rising renewables infiltration. J.
Syst. Cybern. Inform. 19(3), 23–32 (2021)
9. Julier, S.J., Uhlmann, J.K.: Unscented filtering and nonlinear estimation (2004). https://doi.org/10.1109/JPROC.2003.823141
10. Birchfield, A.B., Xu, T., Gegner, K.M., Shetye, K.S., Overbye, T.J.: Grid structural characteristics
as validation criteria for synthetic networks. IEEE Trans. Power Syst. 32(4), 3258–3265
(2017). https://doi.org/10.1109/TPWRS.2016.2616385
11. Sadeghian, H., Wang, Z.: AutoSynGrid: a MATLAB-based toolkit for automatic generation
of synthetic power grids. Int. J. Electr. Power Energy Syst. 118 (2020). https://doi.org/10.1016/j.ijepes.2019.105757
A New Metaverse Mobile Application
for Boosting Decision Making of Buying
Furniture

Chutisant Kerdvibulvech(B) and Thitawee Palakawong Na Ayuttaya

Graduate School of Communication Arts and Management Innovation, National Institute of
Development Administration, 118 SeriThai Road, Klong-Chan, Bangkapi,
Bangkok 10240, Thailand
[email protected]

Abstract. In applied computational science, augmented reality (AR) has recently become
a very popular technology, as it can help people to understand the virtual world and
the real world more clearly. In this paper, we develop a new mobile application
using augmented reality for boosting decision-making in purchasing furniture.
PixLive Maker is used to design the augmented reality content, which is then
integrated into our main system for 3D furniture. After that, we evaluate the system
by distributing questionnaires about the purchasing experience to people who have
used the application. The samples are selected by the stratified random sampling
method from people who have smartphones and have been using online purchasing.
Finally, the experimental results after the respondents used the augmented reality
furniture application show that our proposed system is robust for boosting
decision-making when buying furniture.

Keywords: Augmented reality · Buying furniture · Interfaces · Decision making

1 Introduction
Technology continues to become more important for people in everyday life, and many
advancements provide people with greater convenience. People also want to be creative
and see what they imagine become reality. Companies are thus applying augmented
reality (AR) to promote their products in 3D, letting customers see the product in every
dimension, as AR combines the virtual world and the physical world, as explained by
Nunes et al. [1]. This desire has been the idea behind the creation of the metaverse, with
a focus on augmented reality. In this paper, our contribution is to apply augmented
reality to the decision-making process of buying furniture. This technology is attracting
growing interest, and many companies are attempting to use it in new applications to influence people

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 924–933, 2023.
https://doi.org/10.1007/978-3-031-18461-1_61

to buy their products. This paper is divided into five main parts. Section 1 explains a
general introduction to augmented reality and the registration method. Section 2 then
discusses the proposed system and the research methodology. Next, Sect. 3 explains
the experiment and shows the results of the proposed system. Section 4 explores the
evaluation, including discussing the results. Finally, Sect. 5 gives a conclusion to this
paper and then predicts the future direction.

1.1 Augmented Reality

Augmented reality integrates computer-generated graphical items, in the form of 3D
graphics, into the physical scene. An important feature is the ability to interact with
them in real time. For proper operation, augmented reality uses the image from a camera,
e.g., on a smartphone, to determine the actual plane by detecting color patterns and
unique shapes present on the viewed surface. Although augmented reality is not a new
concept at all, only recently have we witnessed its dynamic development. The metaverse
is a parallel world, usually based on technology and virtuality. It is intended to be a
completely changed form of the internet, thanks to which the network will cease to be
one-dimensional. The metaverse is meant to create a virtual universe in which users will
be able to perform activities assigned to the real world. Tools intended to enable, for
example, meetings in the virtual world are avatars or the already familiar VR goggles.
Thanks to them, it will be possible to move to a parallel world shared with other people
using such devices.
Augmented reality is usually mentioned as a subset of the metaverse, focusing on the
combination of computer-generated virtual imagery and physical objects. This means
that this technology can help people to interact with the virtual scenes using real ele-
ments in a unique way, which is different from virtual reality for a totally immersive
environment, as described in [2] about virtual reality technology. Azuma [3] also gives
a good description of augmented reality that has three main definitions. First, it com-
bines physical and computer-generated virtual scenes. Second, it is usually interactive
in real-time. Third, it registers the computer-generated virtual world with the physical
scene. There are now various fields that benefit from the use of the metaverse, including
science, innovation, and journalism. In addition, [4] suggested that it can be applied to
more specific fields of artificial intelligence, such as computer vision and human-computer
interaction.
This technology has been used in many applications to improve them and make them
more interesting. Several companies, including those involved in game development,
e-commerce, and tourism, use augmented reality to create interactive games, advertising,
and packaging, enhancing the entire retail experience. For instance, IKEA developed the application “IKEA Place”, which uses
metaverse to show items in their shopping catalog in 3D as well as affords the opportunity
to furnish a room using virtual models, as discussed by Kammerer et al. [5], which lets a
customer feel they are looking at a real product and not just a picture in an advertisement,
magazine or catalog. Not only can they see every dimension of the product, but they can
also place it in their home to see if it fits the space and décor of their home. They can also
determine which color and style might be the most suitable. In general, this technology

can help customers choose the right product for their home, as explained by Saraswati
[6].

1.2 Registration Method


The program used to create this augmented reality project is PixLive Maker. Vidinoti
has implemented an augmented reality CMS tool called PixLive that differentiates itself
by the fact that image-based augmented reality is only part of the platform. In addition
to image-based augmented reality, PixLive also implements beacons and GPS points,
which are both location-based triggers that offer very contextual information to end-
users; in other words, specific content can be linked to specific locations, and a user can
access that content when they are within range of one of these triggers. PixLive is a
very complex augmented reality CMS, bringing with it many features and consequently
a little more of a learning curve. The editing interface is referred to as PixLive Maker
and the scanner is referred to as PixLive Player, as developed by Kerdvibulvech and
Wang [7].

Fig. 1. AR furniture application framework

2 Proposed System and Research Methodology


This paper uses a prototype application developed for this study using the Android Studio
program. Next, the application is tested to determine whether this augmented reality
application can help people decide on furniture purchases more easily, as explained by
Al-Azzam [8]. After the application was developed, it was tested with a target sample
selected to use the application. The sampling for this research was selected using a
probability sampling technique. The sample group comprised consumers who use a
smartphone regularly and have an interest in buying products online, as presented by
Joshi [9]. They were aged between 20 and 45 years and belong to Generations X and
Y, because people in this age group are familiar with technology and have a greater chance
to buy furniture online. Huntley [10] suggested that technology devices are essential as
journalism tools. They can also be a good sign of generational identity. These devices include
smartphones, game consoles, tablet computers, and personal computers. Generation Y
has grown up in the digital world, as discussed by Wolburg and Pokrywczynski [11]. The
application was designed using Android Studio and is called “AR Smart Furniture”. The
program can run on both Windows and OSX for developing applications on the Android
platform. The first step was to develop a mind map to see the overall application plan.
The mind map made it possible to see the links between each page and their action.
The application framework can be seen in Fig. 1. The next step was to design each
program page using Adobe Photoshop. After designing the application, each page was
created using Android Studio as applications developed on Android Studio can use many
languages. The language used in AR Smart Furniture is Java. The application has two
features. The first allows users to interact with posters using PixLive Maker to create
the AR Mode. Moreover, consumers can make a purchase more easily when they see
furniture they like. The AR model was developed using PixLive Maker, a very useful
program for people who want to create augmented reality applications as it is easy to
learn and has useful functions. Moreover, developers do not have to code a program but
can instead create only visual art and add it automatically to the program. In addition,
this program is free of charge. A developer can just access the website and begin their
work, whereas many other programs require payment or are free only for a limited period
of time. Furthermore, developers can also carry out coding in Android Studio, but they
must first learn how to code from the basic level, which can take a long time. Therefore,
the PixLive Maker is the best tool for developers from the beginner to professional levels.
It can be employed effectively not just for AR mode, but also for manual mode. If the
user knows the size of a space or room, they can input this. After that, the program
will select the furniture that can fit this size. This function can help customers select a
product more easily. They can also purchase the product from stores where it is sold. It
will automatically link to the page where they can make their purchase, which makes
this even more convenient for customers.
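The manual-mode size filter described above can be sketched as follows (a Python sketch of the logic only; the application itself is written in Java, and the catalog entries here are hypothetical, not from the real stores):

```python
def fits(room, item):
    """True if the item's footprint fits the room (all sizes in cm),
    allowing the item to be rotated by 90 degrees."""
    room_w, room_d = room
    w, d = item["width"], item["depth"]
    return (w <= room_w and d <= room_d) or (d <= room_w and w <= room_d)

def select_furniture(room, catalog):
    """Return only the catalog entries that fit the user's room size."""
    return [item for item in catalog if fits(room, item)]

# Hypothetical catalog entries:
catalog = [
    {"name": "sofa A", "width": 220, "depth": 95},
    {"name": "sofa B", "width": 160, "depth": 80},
]
picks = select_furniture((200, 210), catalog)   # only "sofa B" fits
```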
Augmented reality in this application has been developed by PixLive Maker which
is the program that can create augmented reality. This application has the structure as
indicated in Fig. 1. Figure 2 shows an example of an AR Furniture application. It contains
the scanned augmented reality furniture poster which will pop up in the augmented reality
that has been created. The second mode is the manual mode, which allows the user to
choose the furniture they want. After this, the user will go to the page they selected,
which will link to a page according to the selected furniture. The developer tried to
design this page to be as simple as possible to make the application user-friendly. Also,
Fig. 2 shows the real design of the poster when the user picks up the furniture. Note that
the furniture photo icons used in this research are from www.ikea.com/,
www.konceptfurniture.com/, and www.google.com for research and academic use. The
size of the chair should match the room, so the user can get a feel for the look of the
furniture when it is placed in the room space.

Fig. 2. Our application interface. Note that the furniture photo icons are from www.ikea.com/,
www.konceptfurniture.com/, and www.google.com for academic use.

3 Experimental Results
After testing the application to confirm it worked properly, it was tested by the sample
group. To be sure the application could solve the research problem and meet the
objectives, users were asked to pick different pieces of furniture to see how they looked
and whether the system worked smoothly. As shown in Fig. 3, the participants tried
different sofas, and the program worked properly. The users could pick what they wanted
from the picture. After they picked a sofa, it would appear in the room. The users liked
the poster because they could interact and play with it. After they selected a sofa, they
could go directly to the store. After letting the sample group test the application, they
were given an online questionnaire to determine how much they liked the application.
This also helped us to learn what could be improved in the future. The participants could
choose different sofas from the poster via their smartphone, and then the sofa would
appear in the room.

Fig. 3. The different types of sofa (left) and selecting the furniture (right)

As shown in Fig. 3, the sofa that the user selected would appear in the room, so the
user could see how it looked in the space. The price would also be displayed, along
with a link to order the furniture directly from the store. Users showed more interest
in the poster because they could interact with it and then go directly to the purchasing
process on the website; being able to act immediately to buy the furniture can be a
strong motivator to purchase the product.

Fig. 4. Images appearing in the space room (left) and purchasing process (right)

This is a very important part of the purchasing process, as illustrated in Fig. 4.


The website page reached from the poster leads consumers to where they can purchase a
product. If the user decides to buy the product, the application is a success: not only
will the user buy a product, but they will also visit the website more often, so website
traffic will increase. If companies can create more interesting or attractive websites,
the number of customers who visit the page will increase. After the sample group used
the application, they were asked to complete an online questionnaire. There were 30
respondents, both male and female, aged between 25 and 45 years. They had experience
buying furniture online, mostly used Android smartphones, and, consistent with
Huntley's research on this generation [10], were familiar with technology and mobile
applications. The application interface, developed with Android Studio, is shown in
Fig. 5: there are three interface screens in the application plus a test of the AR mode
on the poster. The first picture shows the first page of the application, the second
shows the augmented reality mode, and the third shows the manual mode. The pictures at
the bottom show the poster being tested as the respondents used the application.

Fig. 5. Our proposed application interface in each step

4 Evaluation and Result Discussion

To confirm whether this application is useful, we set up an online questionnaire to
assess its effectiveness. As shown in Table 1, the questions were divided into three
parts: the first covers demographic factors, the second covers experience of buying
furniture online, and the last covers respondents' experience of using augmented
reality in a mobile application. We wanted to confirm that our application is useful
for the user and effective. Moreover, we included an open-ended question: beyond the
fixed questions, we wanted users to give feedback on what they would like to have and
what they did not like about this application, so that we could collect the data and
develop the application further in the future. The objectives of this study are (1) to
study augmented reality and how this technology can be used to boost decision-making
in purchasing furniture, and (2) to study how to create a mobile application that can
serve and provide more channels for consumers who want to buy products online,
especially furniture and home decorations. The questionnaire contained both
closed-format and open-format questions. The first part asked about demographic
factors, including whether the respondents had been using a smartphone. The second
part asked whether the respondents had been buying furniture online. The last part
aimed to learn respondents' satisfaction levels after using the application. One aim
of the study was to provide consumers with channels, through an application, that let
them compare products of different brands. From the research,
it was found that most respondents (52%) had experience buying furniture online, but
for the most part only once or twice a year. Moreover, it was found that customers do
not buy furniture online at prices of 10,000 baht or higher, the reason being that they
want to be sure the product is good and satisfies their needs when they pay such a sum.
In this research, the respondents used the application, developed to incorporate AR
technology, to influence people to buy furniture. After they used the application, they
tended to like it; they mostly felt they would like it more if it had a greater variety
of products, and they said they would be willing to download the application if it were
launched. Even though the feedback was good and people liked the application, the
respondents still did not strongly agree with many questions in the questionnaire, for
example the question about the application being easy to use. The respondents did
agree that it is easy to use, but the application should be further developed to make
it more user-friendly. Moreover, they felt the application could save them time and
money because it focuses on e-commerce, which means they do not have to go to a store
to make their purchase.

Table 1. Questions and results from the users, with the demographic factor, the experience of
buying furniture online, and the experience of using augmented reality on the mobile application
of respondents

Questions Score
5 4 3 2 1
Application AR Furniture is easy to use 10 17 3 0 0
Application AR Furniture is not complicated 8 18 4 0 0
Helps the customers to decide about buying furniture 15 12 3 0 0
An attractive and interesting application 11 17 2 0 0
Helps the customers to save money over going to the store 21 9 0 0 0
The system of the AR Furniture application is convenient and easy 12 16 2 0 0
The system of the AR Furniture application is accurate 8 19 3 0 0
This technology is up-to-date and convenient to use 18 12 0 0 0
A user is interested in buying more furniture after using this application 15 14 1 0 0
A user is willing to download this application 16 14 0 0 0
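The counts in Table 1 can be aggregated in the usual way for Likert data. The short script below (our own illustrative sketch, not part of the study's tooling; the row labels are abbreviated) computes each question's mean score and the share of respondents giving the top score of 5. The top-box share reproduces the percentages reported in the text, e.g. 21/30 = 70% for the money-saving question.

```python
# Response counts from Table 1, ordered as (score 5, 4, 3, 2, 1); 30 respondents per row.
table = {
    "Easy to use": (10, 17, 3, 0, 0),
    "Not complicated": (8, 18, 4, 0, 0),
    "Helps decide about buying": (15, 12, 3, 0, 0),
    "Attractive and interesting": (11, 17, 2, 0, 0),
    "Saves money over going to the store": (21, 9, 0, 0, 0),
    "Convenient and easy system": (12, 16, 2, 0, 0),
    "Accurate system": (8, 19, 3, 0, 0),
    "Up-to-date and convenient": (18, 12, 0, 0, 0),
    "Interested in buying more furniture": (15, 14, 1, 0, 0),
    "Willing to download": (16, 14, 0, 0, 0),
}

def likert_stats(counts):
    """Return (mean score, top-box %) for one row of score-5..1 counts."""
    n = sum(counts)
    mean = sum(score * c for score, c in zip((5, 4, 3, 2, 1), counts)) / n
    top_box = 100 * counts[0] / n  # share of respondents answering 5
    return round(mean, 2), round(top_box, 1)

for question, counts in table.items():
    mean, top = likert_stats(counts)
    print(f"{question}: mean {mean}, top-box {top}%")
```

For instance, "Saves money over going to the store" yields a mean of 4.7 with a 70.0% top-box share, and "Willing to download" yields 53.3%, matching the figures discussed below the table.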

The results do show that respondents felt this application must be improved. The
accuracy of the application must be further developed, because the furniture database
is insufficient and needs more products. Currently, when customers use the manual
mode, they want to find furniture that suits and matches their room; however, the
furniture in the database is rather limited, so it is difficult to find what they are
looking for. The augmented reality mode also does not have enough products. If there
were more products and different posters, the application would be considered more
accurate. After the respondents used this application, they were willing to buy more
furniture and willing to download the application.
Half of the respondents wanted to buy more furniture, and 53.3% of the respondents
were willing to download this application. Moreover, 70% said that this application
can help customers save money compared with going to the store, 60% think that this
application is convenient to use and up-to-date, and 50% answered that this
application can help them decide about buying furniture.
Research similar to our work concerns methods for e-commerce using the metaverse, such
as Lu and Smith's work [12]. Their research studies a type of e-commerce system in
which customers can bring goods into the real scene visually and interactively. It
applies augmented reality technology to a new way of doing e-commerce so that
customers can experience the new technology, and it mentions furniture e-commerce on
the website to show that the approach can link with any e-commerce platform. Their
system generates the object in a 3D layer: the developers project the virtual image
into real space via a computer with a camera. However, our work has a different goal.
Our objective is to create a 3D model of a specific sofa to interest the user in
buying that sofa, so the shopper can go straight to the product page; this can bring
greater shopping confidence, which may help increase the possibility of selling the
goods. The difference is therefore in the type of technology and the way it is used
and applied. As technology develops quickly, people mostly use smartphones as a
communication tool that can do almost the same things as a laptop or personal
computer. This research therefore moves from the website to a mobile application to
keep the approach up-to-date. Moreover, the application adds more information about
the products and shows augmented reality objects; the user can interact with the
augmented reality in the application, which can increase users' confidence about
buying furniture. This research brings the idea of making the process more up-to-date
and more modern.

5 Conclusion and Future Plan


This study has explored how augmented reality technology can be used to boost
decision-making in buying furniture. Based on the recommendations of the respondents,
the application should be further developed to become more accurate; to accomplish
this, the database needs to be expanded with a much wider variety of products. With
this added data, the application can offer more furniture for a customer to compare
and select from. For the AR mode, more data and posters should be created as well,
with a wider variety of products. Furniture can then be categorized and placed in
groups on the same poster, so customers can focus more easily on the furniture they
are looking for. The stores that can use this application are not only big companies
but also local furniture outlets, which can create their own posters. A local shop can
use PixLive Maker to create a poster and then let people access it through PixLive
Player. Advertising costs can then be minimal, because the local shop can design its
posters itself and add augmented reality to make them more interesting using PixLive
Maker; the work is easy enough that the shop does not have to hire someone to make the
poster. Furthermore, a poster is relatively cheap and easy to make. This can be a tool
for many furniture shops
to promote their own products. The benefit of connecting with local shops is that the
application will offer a greater variety of furniture for customers to choose from,
and a local shop may be easier to negotiate with than a big company. When an
application has access to a large amount of data, users can enjoy using it and, as a
result, be motivated to purchase a product, leading to the growing success of this
application. This application can also be further beneficial for related work on
conversational commerce [13, 14].

Conflict of Interest. The authors declare that they have no conflict of interest.

References
1. Nunes, F.B., et al.: A dynamic approach for teaching algorithms: integrating immersive environments and virtual learning environments. Comput. Appl. Eng. Educ. 25(5), 732–751 (2017)
2. Siriborvornratanakul, T.: A study of virtual reality headsets and physiological extension possibilities. In: Gervasi, O., et al. (eds.) ICCSA 2016. LNCS, vol. 9787, pp. 497–508. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-42108-7_38
3. Azuma, R.T.: The most important challenge facing augmented reality. Presence 25(3), 234–238 (2016)
4. Siriborvornratanakul, T.: Through the realities of augmented reality. In: Stephanidis, C. (ed.) HCII 2019. LNCS, vol. 11786, pp. 253–264. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-30033-3_20
5. Kammerer, K., Pryss, R., Sommer, K., Reichert, M.: Towards context-aware process guidance in cyber-physical systems with augmented reality. In: RESACS@RE 2018, pp. 44–51 (2018)
6. Saraswati, T.G.: Driving factors of consumer to purchase furniture online on IKEA Indonesia website. J. Secr. Bus. Adm. 2(1), 19–28 (2018)
7. Kerdvibulvech, C., Wang, C.-C.: A new 3D augmented reality application for educational games to help children in communication interactively. In: Gervasi, O., et al. (eds.) ICCSA 2016. LNCS, vol. 9787, pp. 465–473. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-42108-7_35
8. Al-Azzam, A.F.M.: Evaluating effect of social factors affecting consumer behavior in purchasing home furnishing products in Jordan (2014)
9. Joshi, M.S.: A study of online buying behavior among adults in Pune city 13(1) (2017)
10. Huntley, R.: The World According to Y: Inside the New Adult Generation. Allen & Unwin, Australia (2006)
11. Wolburg, J.M., Pokrywczynski, J.: A psychographic analysis of Generation Y college students. J. Advert. Res. 41, 33–52 (2001)
12. Lu, Y., Smith, S.: Augmented reality e-commerce assistant system: trying while shopping. In: Jacko, J.A. (ed.) HCI 2007. LNCS, vol. 4551, pp. 643–652. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-73107-8_72
13. Rungvithu, T., Kerdvibulvech, C.: Conversational commerce and cryptocurrency research in urban office employees in Thailand. Int. J. Collab. 15(3), 34–48 (2019)
14. Liew, T.W., Tan, S.-M., Tee, J., Goh, G.G.G.: The effects of designing conversational commerce chatbots with expertise cues. In: HSI 2021, pp. 1–6 (2021)
Author Index

A
Ab Rahman, Ab Al-Hadi, 100
Abdalla, Hassan I., 307
Abdellatef, Hamdan, 100
Ahmad, Imran Shafiq, 79
Ajila, Samuel, 275
Akinrinade, Olusoji, 275
Ali, Megat Syahirul Amin Megat, 260
Almutairi, Abdullah, 823
Amazonas, José R., 850
Amer, Ali A., 307
An, Peng, 328
Ankargren, Sebastian, 548
Anthes, Christoph, 731
Ashar, Nur Dalila Khirul, 260
Ayat, Sayed Omid, 100
Ayuttaya, Thitawee Palakawong Na, 924

B
Baek, Stephen, 230
Bajracharya, Aakriti, 809
Bakirov, Akhat, 368
Bin, Chen, 526
Bitri, Aida, 669
Boufama, Boubakeur, 79
Bui, Len T., 18

C
Chiddarwar, Shital, 185
Chov, Bunhov, 563
Churchill, Akubuwe Tochukwu, 439

D
Dalvi, Mohsin, 185
Daniel, Azel, 753
Das, Amit Kumar, 145
Das, Saikat, 906
Dhaouadi, Rached, 860
Diaz-Arias, Alec, 230
Dimitrakopoulos, George, 343
Dodich, Frank, 493
Dombi, József, 719
Domnauer, Colin, 378
Du, Chunglin, 275

E
Elgarhy, Aya, 578
Elseddawy, Ahmed, 578
Encheva, Sylvia, 880

F
Fawcett, Nathanel, 317
Ficocelli, Ryan, 493
Fu, Swee Tee, 32

G
Gałuszka, Adam, 615
Garyfallou, Antonios, 343
Guliashki, Vassil, 669

H
Hao, Jia, 328
Happonen, Ari, 287
Harvey, Barron, 809
Heiden, Bernhard, 414
Heng, Yi Sheng, 596
Hino, Takanori, 133
Hridoy, Al-Amin Islam, 145
Huang, Xiao, 696
Huong, Nguyen Thi Viet, 426
Hussain, Abrar, 719

J
Jayaramireddy, Charitha Sree, 165
Jędrasiak, Karol, 615
Jodlbauer, Herbert, 731
Jung, Ayeon, 644

K
Kagawa, Tomomichi, 133
Kamimura, Ryotaro, 1
Kato, Shigeru, 133
Kaur, Harjinder, 770
Kaur, Tarandeep, 770
Kerdvibulvech, Chutisant, 924
Khakurel, Utsab, 809
Kher, Shubhalaxmi, 121
Khushbu, Sabrina Alam, 145
Kim, Miso, 625
Kitajima, Ryozo, 1
Klimczak, Katarzyna, 615

L
Lau, Bee Theng, 32
Le, Thai H., 18
Lee, Seunghyun, 53
Leventi-Peetz, Anastasia-M., 796
Li, Naomi Fengqi, 198
Li, Xiaolin, 511
Lima Filho, Diogo F., 850
Loh, Brian Chung Shiong, 32
Long, Yongsong, 328
Lützhöft, Margareta Holtensdotter, 880

M
Magdaleno-Palencia, Jose Sergio, 358
Manan, Shahidatul Sadiah Abdul, 100
Marinova, Galia, 669
Marquez, Bogart Yail, 358
Mekni, Mehdi, 165
Messmore, Mitchell, 230
Mohammed, Asad, 753
Mohammed, Phaedra, 753
Mungal, Jason, 753
Musa, Martha Ozohu, 248

N
Naraharisetti, Sree Veera Venkata Sai Saran, 165
Nassar, Mohamad, 165
Nawrat, Aleksander, 615
Ngalamoum, Lucien, 317
Nguyen, Loc, 307
Ni, Xuelei Sherry, 696
Nobuhara, Hajime, 133

O
Offor, Kennedy John, 439
Oghenekaro, Linda Uchenna, 248
Okengwu, Ugochi Adaku, 248
Olowookere, Toluwase A., 275
Omowonuola, Victor, 121
Onyejegbu, Laeticia Nneka, 248
Orozco-Garibay, J. Jose R., 358
Osama, Ashrakat, 578
Ou, Phichhang, 563
Ozioko, Ekene Frank, 439

P
Panagiotopoulos, Ilias, 343
Park, Andrew J., 493
Park, Junhong, 625
Patterson, Lee, 493
Petersen, Erik Styhr, 880
Politi, Elena, 343
Pomorski, Krzysztof, 477
Prathibamol, C. P., 217
Probierz, Eryka, 615
Pyeatt, Larry D., 681

Q
Quezada, Ángeles, 358

R
Rahul, M. R., 185
Randrianasolo, Arisoa S., 681
Rawat, Danda B., 809, 823
Revanth, A., 217
Riegler, Andreas, 731
Rizvi, Shahriyar Masud, 100
Robinson, Chris, 393
Romero Ocampo, Obeth Hernan, 837
Rosenberg, Louis, 378
Roshan, Ajmal, 860
Roza, Felippe Schmoeller, 538

S
Schultzberg, Mårten, 548
Shin, Dmitriy, 230
Siddique, Shaykh, 145
Sonwane, Satish, 185
Soun, Sreypich, 563
Spicer, Valerie, 493
Subramanian, Preethi, 596
Suh, Jeanne, 65
Suleimenov, Ibragim, 368
Sun, Ting, 656
Syed, Sameer Akhtar, 79

T
Tahir, Nooritawati Md, 260
Tee, Mark Kit Tsun, 32
Thuan, Nguyen Dinh, 426
Tonino-Heiden, Bianca, 414
Tripathi, Shailesh, 731
Tsang, Herbert H., 493

U
Ugbari, Augustine Obolor, 248
Usmani, Usman Ahmad, 287

V
Varlamis, Iraklis, 343
Vitulyova, Yelizaveta, 368
Vo, Duy K., 18

W
Wang, Yan, 696
Wang, Zhifang, 906
Wang, Zizhao, 328
Watada, Junzo, 287
Weber, Kai, 796
Wilkerson, Bryce, 121
Willcox, Gregg, 378
Wiśniewski, Tomasz, 615

X
Xiao, Hua, 328

Y
Yamakami, Tomoyuki, 776
Ye, Wenbin, 328
Yonggang, Wang, 526

Z
Zamri, Nurul Farhana Mohamad, 260
Zhang, Alexandros Shikun, 198
Zhang, Ouyang, 888

© The Editor(s) (if applicable) and The Author(s), under exclusive license
to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FTC 2022, LNNS 559, pp. 935–937, 2023.
https://doi.org/10.1007/978-3-031-18461-1
