Learning With Fractional Orthogonal Kernel Classifiers in Support Vector Machines


Industrial and Applied Mathematics

Jamal Amani Rad
Kourosh Parand
Snehashish Chakraverty
Editors

Learning with Fractional Orthogonal Kernel Classifiers in Support Vector Machines
Theory, Algorithms and Applications
Industrial and Applied Mathematics

Editors-in-Chief
G. D. Veerappa Gowda, TIFR Centre For Applicable Mathematics, Bengaluru,
Karnataka, India
S. Kesavan, Institute of Mathematical Sciences, Chennai, Tamil Nadu, India
Fahima Nekka, Université de Montréal, Montréal, QC, Canada

Editorial Board
Akhtar A. Khan, Rochester Institute of Technology, Rochester, USA
Govindan Rangarajan, Indian Institute of Science, Bengaluru, India
K. Balachandran, Bharathiar University, Coimbatore, Tamil Nadu, India
K. R. Sreenivasan, NYU Tandon School of Engineering, Brooklyn, USA
Martin Brokate, Technical University, Munich, Germany
M. Zuhair Nashed, University of Central Florida, Orlando, USA
N. K. Gupta, Indian Institute of Technology Delhi, New Delhi, India
Noore Zahra, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
Pammy Manchanda, Guru Nanak Dev University, Amritsar, India
René Pierre Lozi, University Côte d’Azur, Nice, France
Zafer Aslan, İstanbul Aydın University, İstanbul, Turkey
The Industrial and Applied Mathematics series publishes high-quality research-level monographs, lecture notes, textbooks, and contributed volumes focusing on areas where mathematics is used in a fundamental way, such as industrial mathematics, bio-mathematics, financial mathematics, applied statistics, operations research, and computer science.
Jamal Amani Rad · Kourosh Parand · Snehashish Chakraverty
Editors

Learning with Fractional Orthogonal Kernel Classifiers in Support Vector Machines
Theory, Algorithms and Applications
Editors
Jamal Amani Rad
Department of Cognitive Modeling
Institute for Cognitive and Brain Sciences
Shahid Beheshti University
Tehran, Iran

Kourosh Parand
Department of Computer and Data Sciences, Faculty of Mathematical Sciences
Shahid Beheshti University
Tehran, Iran
Department of Statistics and Actuarial Science
University of Waterloo
Waterloo, ON, Canada

Snehashish Chakraverty
Department of Mathematics
National Institute of Technology Rourkela
Rourkela, Odisha, India

ISSN 2364-6837   ISSN 2364-6845 (electronic)
Industrial and Applied Mathematics
ISBN 978-981-19-6552-4   ISBN 978-981-19-6553-1 (eBook)
https://doi.org/10.1007/978-981-19-6553-1

Mathematics Subject Classification: 68T05, 65-04, 33C45, 65Lxx, 65Mxx, 65Nxx, 65Rxx

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface

In recent years, machine learning has been applied in many areas of science and engineering, including computer science, medical science, cognitive science, and psychology. In these fields, researchers who are not machine learning specialists use it to address their own problems.
One of the most popular families of algorithms in machine learning is the support vector machine (SVM). Traditionally, SVM algorithms were used for binary classification problems, but they have since been applied in many other areas, including numerical analysis and computer vision, and their popularity has risen accordingly.
At the heart of an SVM is its kernel function, and the performance of a given SVM depends on the power of that kernel. Different kernels give SVM algorithms different capabilities; therefore, understanding the properties of kernel functions is a crucial part of using SVMs. Researchers have developed many kernel functions, and one of the most significant families is the orthogonal kernels, which have been attracting much attention. Although their computational power has been demonstrated in the last few years, orthogonal kernel functions have rarely been used in real applications, for several reasons, some of which are summarized in the following:
1. The mathematical complexity of the formulation of orthogonal kernel functions.
2. The lack of a simple and comprehensive resource describing the orthogonal kernels and their properties.
3. The difficulty of implementing these kernels and the lack of a convenient package that implements them.
To solve the aforementioned issues, in this book we present a simple and comprehensive tutorial on orthogonal kernel functions, together with a Python package named ORSVM that implements them. We chose Python as the language of the ORSVM package because it is open source, very popular, and easy to learn, with many tutorials available. In addition:


1. Python has many packages for manipulating data, which can be used alongside ORSVM when solving a machine learning problem.
2. Python is a multi-platform language and can be run on different operating systems.
In addition to the existing orthogonal kernels, we introduce some new kernel functions called fractional orthogonal kernels. The name fractional comes from the order being a positive real number instead of an integer; the fractional orthogonal kernels are thus extensions of the integer-order orthogonal functions. All fractional orthogonal kernels introduced in this book are implemented in the ORSVM package, and their performance is illustrated by testing on real datasets.
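To make the idea of a non-integer order concrete, the sketch below evaluates Chebyshev polynomials under the substitution x ↦ 2x^α − 1 with a real order parameter α, which is one common way fractional Chebyshev functions are defined on [0, 1]; the function names here are illustrative and are not the ORSVM API.

```python
import numpy as np

def chebyshev(n, t):
    """Classical Chebyshev polynomial T_n(t) via the recurrence
    T_0 = 1, T_1 = t, T_n = 2 t T_{n-1} - T_{n-2}."""
    t = np.asarray(t, dtype=float)
    if n == 0:
        return np.ones_like(t)
    prev, curr = np.ones_like(t), t
    for _ in range(n - 1):
        prev, curr = curr, 2 * t * curr - prev
    return curr

def fractional_chebyshev(n, x, alpha):
    """Fractional Chebyshev function on [0, 1]: T_n(2 x**alpha - 1),
    where alpha may be any positive real number."""
    return chebyshev(n, 2 * np.asarray(x, dtype=float) ** alpha - 1)

x = np.linspace(0, 1, 5)
print(fractional_chebyshev(2, x, alpha=0.5))   # non-integer order
print(fractional_chebyshev(2, x, alpha=1.0))   # integer-order (shifted) case
```

For α = 1 the construction reduces to the classical shifted Chebyshev polynomials, which is exactly the sense in which the fractional functions extend the integer-order ones.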
This book contains 12 chapters, together with an appendix at the end that covers the programming preliminaries. Chapter 1 presents the fundamental concepts of machine learning. In this chapter, we explain the notions of pattern and similarity, and then a geometrical intuition of the SVM algorithm is given. At the end of the chapter, a historical review of the SVM and its current applications are discussed.
In Chap. 2, we present the basics of SVM and least-squares SVM (LS-SVM). The mathematical background of SVM is presented in detail, and Mercer's theorem and the kernel trick are discussed as well. The last part of the chapter illustrates function approximation using the SVM.
In Chap. 3, the discussion turns to Chebyshev polynomials. First, the properties of Chebyshev polynomials and fractional Chebyshev functions are explained. After that, a review of Chebyshev kernel functions is presented and the fractional Chebyshev kernel functions are introduced. In the final section of the chapter, the performance of fractional Chebyshev kernels on real datasets is illustrated and compared with other state-of-the-art kernels.
In Chap. 4, the Legendre polynomials are considered. First, the properties of the Legendre polynomials and fractional Legendre functions are explained. Then, after a review of the Legendre kernel functions, the fractional Legendre kernel functions are introduced. Finally, the performance of fractional Legendre kernels is illustrated by applying them to real datasets.
Another orthogonal polynomial family, the Gegenbauer polynomials, is discussed in Chap. 5. Similar to the previous chapters, this chapter covers the properties of the Gegenbauer polynomials and the fractional Gegenbauer functions, reviews the Gegenbauer kernels, introduces the fractional Gegenbauer kernels, and shows their performance on real datasets.
In Chap. 6, we focus on Jacobi polynomials. This family generalizes the polynomials presented in Chaps. 3–5, so the relations between those polynomials and their kernels are discussed in this chapter. Apart from these relations, the structure of the chapter follows that of the three previous chapters.

In Chaps. 7 and 8, some applications of the SVM in scientific computing are presented, and the procedure of using LS-SVM to solve ordinary and partial differential equations is explained. These chapters also discuss the mathematical basics of ordinary and partial differential equations and the traditional numerical algorithms for approximating their solutions.
Chapter 9 covers the basics of integral equations, traditional analytical and numerical algorithms for solving them, and a numerical algorithm based on LS-SVR for solving various kinds of integral equations.
Another group of dynamic models in real-world problems is distributed-order
fractional equations, which have received much attention recently. In Chap. 10, we
discuss in detail the numerical simulation of these equations using LS-SVM based
on orthogonal kernels and evaluate the power of this method in fractional dynamic
models.
The aim of Chap. 11 is to parallelize and accelerate SVM algorithms that use orthogonal kernels by means of graphics processing units (GPUs).
Finally, we have released a free online software package called orthogonal SVM (ORSVM), which provides an SVM classifier with some novel orthogonal kernels. This library covers the complete path of using the SVM classifier, from normalization to the calculation of the SVM equations and the final evaluation. The last chapter presents a comprehensive tutorial on the ORSVM package. For more information and to use this package, please visit orsvm.readthedocs.io.
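Independently of ORSVM's own interface (which the last chapter documents), the general idea of plugging an orthogonal kernel into an SVM classifier can be sketched with scikit-learn's support for callable kernels. The `chebyshev_kernel` below is a simplified inner-product variant built from an explicit Chebyshev feature map — published Chebyshev kernels typically include an extra weighting factor — so treat it as an illustration, not the package's implementation.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebvander
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def chebyshev_kernel(X, Z, order=4):
    """Simplified Chebyshev kernel: map each (scaled) feature to its
    Chebyshev values T_0..T_order and take inner products of the maps.
    Inputs are assumed scaled into [-1, 1], the orthogonality interval."""
    def features(A):
        # chebvander gives T_0..T_order per entry; flatten per sample.
        return chebvander(A, order).reshape(A.shape[0], -1)
    return features(X) @ features(Z).T

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X = X / np.abs(X).max(axis=0)   # crude scaling into [-1, 1] (a sketch;
                                # in practice fit scaling on training data only)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(kernel=chebyshev_kernel).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```

Because this kernel is an explicit feature-map inner product, its Gram matrix is positive semi-definite by construction, which is what Mercer's condition (discussed in Chap. 2) requires of a valid SVM kernel.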
Since the goal of this book is that readers can use it without needing any other resources, the Python programming preliminaries are also covered. In the Appendix, we present a tutorial on Python programming for those who are not familiar with it, and some popular data-handling packages such as NumPy, Pandas, and Matplotlib are explained with examples.
The present book is written for graduate students and researchers who want to apply the SVM method to their machine learning problems. It can also be useful to scientists working on applications of orthogonal functions in different fields of science and engineering. The book is self-contained, and no advanced background in mathematics or programming is needed.

Tehran, Iran        Jamal Amani Rad
Waterloo, Canada    Kourosh Parand
Rourkela, India     Snehashish Chakraverty
Contents

Part I Basics of Support Vector Machines


1 Introduction to SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Hadi Veisi
2 Basics of SVM Method and Least Squares SVM . . . . . . . . . . . . . . . . . . 19
Kourosh Parand, Fatemeh Baharifard, Alireza Afzal Aghaei,
and Mostafa Jani

Part II Special Kernel Classifiers


3 Fractional Chebyshev Kernel Functions: Theory
and Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Amir Hosein Hadian Rasanan, Sherwin Nedaei Janbesaraei,
and Dumitru Baleanu
4 Fractional Legendre Kernel Functions: Theory
and Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Amirreza Azmoon, Snehashish Chakraverty, and Sunil Kumar
5 Fractional Gegenbauer Kernel Functions: Theory
and Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Sherwin Nedaei Janbesaraei, Amirreza Azmoon,
and Dumitru Baleanu
6 Fractional Jacobi Kernel Functions: Theory and Application . . . . . . 119
Amir Hosein Hadian Rasanan, Jamal Amani Rad,
Malihe Shaban Tameh, and Abdon Atangana

Part III Applications of Orthogonal Kernels


7 Solving Ordinary Differential Equations by LS-SVM . . . . . . . . . . . . . 147
Mohsen Razzaghi, Simin Shekarpaz, and Alireza Rajabi


8 Solving Partial Differential Equations by LS-SVM . . . . . . . . . . . . . 171
Mohammad Mahdi Moayeri and Mohammad Hemami
9 Solving Integral Equations by LS-SVR . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Kourosh Parand, Alireza Afzal Aghaei, Mostafa Jani,
and Reza Sahleh
10 Solving Distributed-Order Fractional Equations by LS-SVR . . . . . . 225
Amir Hosein Hadian Rasanan, Arsham Gholamzadeh Khoee,
and Mostafa Jani

Part IV Orthogonal Kernels in Action


11 GPU Acceleration of LS-SVM, Based on Fractional
Orthogonal Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Armin Ahmadzadeh, Mohsen Asghari, Dara Rahmati,
Saeid Gorgin, and Behzad Salami
12 Classification Using Orthogonal Kernel Functions: Tutorial
on ORSVM Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
Amir Hosein Hadian Rasanan, Sherwin Nedaei Janbesaraei,
Amirreza Azmoon, Mohammad Akhavan, and Jamal Amani Rad

Appendix: Python Programming Prerequisite . . . . . . . . . . . . . . . . . . . . . . . . 279


Editors and Contributors

About the Editors

Jamal Amani Rad is Assistant Professor at the Department of Cognitive Modeling,
Institute for Cognitive and Brain Sciences, Shahid Beheshti University (SBU),
Tehran, Iran. He received his Ph.D. in numerical analysis (scientific computing)
from SBU in 2015. Following his Ph.D., he started a one-year postdoctoral fellow-
ship at SBU in May 2016 where he is currently midway through his fifth year as
Assistant Professor. He is currently focused on the development of mathematical
models for cognitive processes in mathematical psychology, especially in risky or
perceptual decision making. With H-index 20, he has published 94 research papers
with 1320 citations. He also has contributed a chapter to the book, Mathematical
Methods in Interdisciplinary Sciences, as well as published two books in Persian.
He has supervised 21 M.Sc. theses and 9 Ph.D. theses. He has so far developed a
good scientific collaboration with leading international researchers in mathematical
modelling: E. Larsson and L. Sydow at Uppsala University, L.V. Ballestra at the
University of Bologna, and E. Scalas at the University of Sussex. He is a reviewer for several reputed journals and has organized a number of international conferences and workshops on deep learning and neural networks.

Kourosh Parand is Professor at the Department of Computer and Data Sciences,
Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran. He
received his Ph.D. in numerical analysis from the Amirkabir University of Tech-
nology, Iran, in 2007. He has published more than 250 research papers in reputed jour-
nals and conferences and has more than 4261 citations. His fields of interest include
partial differential equations, ordinary differential equations, fractional calculus,
spectral methods, numerical methods and mathematical physics. Currently, he is
working on machine learning techniques such as least squares support vector regres-
sion and deep learning for some engineering and neuroscience problems. He also
collaborates, as a reviewer, with different prestigious international journals.


Snehashish Chakraverty is Professor at the Department of Mathematics (Applied
Mathematics Group), National Institute of Technology Rourkela, Odisha, and Dean
of Student Welfare. Earlier, he worked with the CSIR-Central Building Research
Institute, Roorkee, India. He had been Visiting Professor at Concordia University and
McGill University, Canada, during 1997–1999, and the University of Johannesburg,
South Africa, during 2011–2014.
After completing his graduation from St. Columba’s College (now Ranchi Univer-
sity, Jharkhand, India), he did his M.Sc. (Mathematics), M.Phil. (Computer Appli-
cations) and Ph.D. from the University of Roorkee (now, the Indian Institute of
Technology Roorkee), securing the first position in 1992. Thereafter, he did his post-
doctoral research at the Institute of Sound and Vibration Research (ISVR), University
of Southampton, the U.K. and at the Faculty of Engineering and Computer Science,
Concordia University, Canada. Professor Chakraverty is a recipient of several presti-
gious awards: Indian National Science Academy (INSA) nomination under Interna-
tional Collaboration/Bilateral Exchange Program (with the Czech Republic), Plat-
inum Jubilee ISCA Lecture Award (2014), CSIR Young Scientist Award (1997),
BOYSCAST Fellow (DST), UCOST Young Scientist Award (2007, 2008), Golden
Jubilee Director’s (CBRI) Award (2001), INSA International Bilateral Exchange
Award (2015), Roorkee University Gold Medals (1987, 1988) for securing the first
positions in M.Sc. and M.Phil. (Computer Applications). His present research area
includes differential equations (ordinary, partial and fractional), numerical analysis
and computational methods, structural dynamics (FGM, nano) and fluid dynamics,
mathematical and uncertainty modeling, soft computing and machine intelligence
(artificial neural network, fuzzy, interval and affine computations).
With more than 30 years of experience as a researcher and teacher, he has authored
23 books, published 382 research papers (till date) in journals and conferences. He is
on the editorial boards of various international journals, book series and conference
proceedings. Professor Chakraverty is the Chief Editor of the International Journal
of Fuzzy Computation and Modelling, Associate Editor of the journal, Computational
Methods in Structural Engineering, Frontiers in Built Environment, and on the edito-
rial board of several other book series and journals: Modeling and Optimization in
Science and Technologies (Springer Nature), Coupled Systems Mechanics, Curved
and Layered Structures, Journal of Composites Science, Engineering Research
Express, and Applications and Applied Mathematics: An International Journal. He
also is a reviewer of around 50 international journals of repute. He was the President of the Section of Mathematical Sciences (including Statistics) of the “Indian Science Congress” (2015–2016) and the Vice President of the Orissa Mathematical Society (2011–2013).

Contributors

Alireza Afzal Aghaei Department of Computer and Data Science, Faculty of
Mathematical Sciences, Shahid Beheshti University, Tehran, Iran
Armin Ahmadzadeh School of Computer Science, Institute for Research in
Fundamental Sciences, Tehran, Iran
Mohammad Akhavan School of Computer Science, Institute for Research in
Fundamental Sciences, Tajrish, Iran
Mohsen Asghari School of Computer Science, Institute for Research in Funda-
mental Sciences, Tehran, Iran
Abdon Atangana Faculty of Natural and Agricultural Sciences, Institute for
Groundwater Studies, University of the Free State, Bloemfontein, South Africa
Amirreza Azmoon Department of Computer Science, The Institute for Advance
Studies in Basic Sciences (IASBS), Zanjan, Iran
Fatemeh Baharifard School of Computer Science, Institute for Research in
Fundamental Sciences (IPM), Tehran, Iran
Dumitru Baleanu Department of Mathematics, Cankaya University, Ankara,
Turkey
Snehashish Chakraverty Department of Mathematics, National Institute of Tech-
nology Rourkela, Sundargarh, OR, India
Saeid Gorgin Department of Electrical Engineering and Information Technology,
Iranian Research Organization for Science and Technology (IROST), Tehran, Iran
Amir Hosein Hadian Rasanan Department of Cognitive Modeling, Institute for
Cognitive and Brain Sciences, Shahid Beheshti University, G.C. Tehran, Iran
Mohammad Hemami Department of Cognitive Modeling, Institute for Cognitive
and Brain Sciences, Shahid Beheshti University, Tehran, Iran
Mostafa Jani Department of Computer and Data Science, Faculty of Mathematical
Sciences, Shahid Beheshti University, Tehran, Iran
Arsham Gholamzadeh Khoee Department of Computer Science, School of Math-
ematics, Statistics, and Computer Science, University of Tehran, Tehran, Iran
Sunil Kumar Department of Mathematics, National Institute of Technology,
Jamshedpur, JH, India
Mohammad Mahdi Moayeri Department of Cognitive Modeling, Institute for
Cognitive and Brain Sciences, Shahid Beheshti University, Tehran, Iran
Sherwin Nedaei Janbesaraei School of Computer Science, Institute for Research
in Fundamental Sciences (IPM), Tehran, Iran

Kourosh Parand Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran;
Department of Statistics and Actuarial Science, University of Waterloo, Waterloo,
ON, Canada
Jamal Amani Rad Department of Cognitive Modeling, Institute for Cognitive and
Brain Sciences, Shahid Beheshti University, Tehran, Iran
Dara Rahmati Computer Science and Engineering (CSE) Department, Shahid
Beheshti University, Tehran, Iran
Alireza Rajabi Department of Computer and Data Science, Faculty of Mathemat-
ical Sciences, Shahid Beheshti University, Tehran, Iran
Mohsen Razzaghi Department of Mathematics and Statistics, Mississippi State
University, Starkville, USA
Behzad Salami Barcelona Supercomputing Center (BSC), Barcelona, Spain
Simin Shekarpaz Department of Applied Mathematics, Brown University, Provi-
dence, RI, USA
Malihe Shaban Tameh Department of Chemistry, University of Minnesota,
Minneapolis, MN, USA
Hadi Veisi Faculty of New Sciences and Technologies, University of Tehran,
Tehran, Iran
Part I
Basics of Support Vector Machines
Chapter 1
Introduction to SVM

Hadi Veisi

Abstract In this chapter, a review of the machine learning (ML) and pattern recog-
nition concepts is given, and basic ML techniques (supervised, unsupervised, and
reinforcement learning) are described. Also, a brief history of ML development from
the primary works before the 1950s (including Bayesian theory) up to the most recent
approaches (including deep learning) is presented. Then, an introduction to the sup-
port vector machine (SVM) with a geometric interpretation is given, and its basic
concepts and formulations are described. A history of SVM progress (from Vap-
nik’s primary works in the 1960s up to now) is also reviewed. Finally, various ML
applications of SVM in several fields such as medical, text classification, and image
classification are presented.

Keywords Machine learning · Pattern recognition · Support vector machine · History

1.1 What Is Machine Learning?

Recognizing a person from his/her face, reading a handwritten letter, understanding a speech lecture, deciding to buy the stock of a company after analyzing the company's profile, and driving a car in a busy street are some examples of using human
intelligence. Artificial Intelligence (AI) refers to the simulation of human intelli-
gence in machines, i.e., computers. Machine Learning (ML) as a subset of AI is the
science of automatically learning computers from experiences to do intelligent and
human-like tasks. Similar to other actions in computer science and engineering, ML
is realized by computer algorithms that can learn from their environment (i.e., data)
and can generalize this training to act intelligently in new environments. Nowadays,
computers can recognize people from their faces using face recognition algorithms,
convert a handwritten letter to its editable form using handwriting recognition,

H. Veisi (B)
Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
J. A. Rad et al. (eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support
Vector Machines, Industrial and Applied Mathematics,
https://doi.org/10.1007/978-981-19-6553-1_1

understand a speech lecture using speech recognition and natural language understanding, buy the stock of a company using algorithmic trading methods, and drive a car automatically in self-driving cars. The term machine learning was coined in 1959 by Arthur Samuel, an American pioneer in the field of computer gaming and artificial intelligence, who defined it as “it gives computers the ability to learn without being explicitly programmed” Samuel (2000). In 1997, Tom Mitchell, an
American computer scientist and a former Chair of the Machine Learning Depart-
ment at the Carnegie Mellon University (CMU) gave a mathematical and relational
definition that “A computer program is said to learn from experience E with respect
to some task T and some performance measure P, if its performance on T, as mea-
sured by P, improves with experience E” Mitchell (1997). So, if you want your ML
program to predict the growth of a stock (task T) to decide for buying it, you can
run a machine learning algorithm with data about past price patterns of this stock
(experience E, which is called training data) and, if it has successfully “learned”, it
will then do better at predicting future prices (performance measure P). The primary works in ML date back to the 1950s, and the field has seen several improvements during the last 70 years. Here is a short history of ML:
• Before the 1950s: Several ML-related theories were developed, including Bayesian theory Bayes (1991), Markov chains Gagniuc (2017), and regression and estimation theories Fisher (1922). Also, in 1949 Donald Hebb Hebb (1949) presented his model of brain neuron interactions, which is the basis of McCulloch and Pitts's neural networks McCulloch and Pitts (1943).
• The 1950s: In this decade, ML pioneers have proposed the primary ideas and
algorithms for machine learning. The Turing test, originally called the imitation
game, was proposed by Alan Turing as a test of a machine’s ability to exhibit
intelligent behavior equivalent to, or indistinguishable from, that of a human.
Arthur Samuel of IBM developed a computer program for playing checkers Samuel
(1959). Frank Rosenblatt extended Hebb’s learning model of brain cell interaction
with Arthur Samuel’s ideas and created the perceptron Rosenblatt (1958).
• The 1960s: Bayesian methods were introduced for probabilistic inference Solomonoff (1964). The primary idea of Support Vector Machines (SVMs) was given by Vapnik and Lerner Vapnik (1963). Widrow and Hoff developed the delta learning rule for neural networks, which was the precursor of the backpropagation algorithm Vapnik (1963). Sebestyen Sebestyen (1962) and Nilsson Nilsson (1965) proposed the nearest neighbor idea. Donald Michie used reinforcement learning to play Tic-tac-toe Michie (1963). The decision tree was introduced by Morgan and Sonquist Morgan and Sonquist (1963).
• The 1970s: The quiet years, also known as the AI Winter, caused by pessimism about the effectiveness of machine learning due to the limitation of the ML methods to solving only linearly separable problems Minsky and Papert (1969).
• The 1980s: The birth of brilliant ideas resulted in renewed enthusiasm. Backpropagation, publicized by Rumelhart et al. (1986), caused a resurgence in machine learning. Hopfield popularized his recurrent neural networks Hopfield (1982). Watkins developed Q-learning in reinforcement learning Watkins (1989). Fukushima published his work on the neocognitron neural network Fukushima (1988), which later inspired Convolutional Neural Networks (CNNs). The Boltzmann machine Hinton (1983) was proposed, which was later used in Deep Belief Networks (DBNs).
• The 1990s: This is the decade of the birth of today's form of SVM by Vapnik and his colleagues in Boser et al. (1992), extended in Vapnik (1995) and Cortes and Vapnik (1995). In these years, ML work shifted from knowledge-driven, rule-based approaches to data-driven approaches, and other learning algorithms such as Recurrent Neural Networks (RNNs) were introduced. Hochreiter and Schmidhuber invented long short-term memory (LSTM) recurrent neural networks Hochreiter and Schmidhuber (1997), which became a practically successful method for sequential data modeling. IBM's Deep Blue beat the world champion at chess, grandmaster Garry Kasparov Warwick (2017). Tin Kam Ho introduced random decision forests Ho (1995). Boosting algorithms were proposed Schapire (1990).
• The 2000s: The use of ML methods in real applications, the creation of datasets, and the organization of ML challenges became widespread. Support Vector Clustering and other kernel methods were introduced Ben-Hur et al. (2001). The deep belief network was proposed by Hinton, which is among the starting points of deep learning Hinton et al. (2006).
• The 2010s: Deep learning became popular, overcame most other ML methods, and became integral to many real-world applications Goodfellow et al. (2016). Various deep neural networks such as autoencoders Liu et al. (2017), Convolutional Neural Networks (CNNs) Khan et al. (2020), and Generative Adversarial Networks (GANs) Pan et al. (2019) were introduced. ML achieved higher performance than humans in various fields such as lipreading (e.g., LipNet Assael et al. (2016)), playing Go (e.g., Google's AlphaGo and AlphaGo Zero programs Silver et al. (2017)), and information retrieval using natural language processing (e.g., IBM's Watson in the Jeopardy! competition Ferrucci et al. (2013)).
The highly complex nature of most real-world problems often means that invent-
ing specialized algorithms that will solve them perfectly every time is impractical, if
not impossible. Examples of machine learning problems include “Can we recognize
the spoken words by only looking at the lip movements?”, “Is this cancer in this
mammogram?”, “Which of these people are good friends with each other?”, and
“Will this person like this movie?”. Such problems are excellent targets for ML, and,
in fact, machine learning has been applied to such problems with great success, as
mentioned in the history. Machine learning is also highly related to another similar
topic, pattern recognition. These two terms can now be viewed as two facets of the
same fields; however, machine learning grew out of computer science whereas pat-
tern recognition has its origins in engineering Bishop (2006). Another similar topic
is data mining which utilizes ML techniques in discovering patterns in large data
and transforming row data into information and decision. Within the field of data
analytics, machine learning is used to devise complex models and algorithms that
lend themselves to prediction which is known as predictive analytics in commercial
6 H. Veisi

applications. These analytical models allow researchers, data scientists, engineers,
and analysts to “produce reliable, repeatable decisions and results” and uncover “hidden
insights” through learning from historical relationships and trends in the data
(i.e., input).
1.1.1 Classification of Machine Learning Techniques
Machine learning techniques are classified into the following three categories, depending
on the nature of the learning data or learning process:
1. Supervised learning: In this type of learning, there is a supervisor to teach
the machine a concept. This means that the algorithm learns from labeled
data (i.e., training data), which includes example data and related target responses,
i.e., input and output pairs. If we consider the ML algorithm as a system (e.g., a
face identification), in the training phase of the system we provide both the input
sample (e.g., an image of a face) and the corresponding output (e.g., the ID of
the person to whom the face belongs) to the system. The collection of labeled data
requires skilled human agents (e.g., a translator to translate a text from one language
to another) or a physical experiment (e.g., determining whether there is rock or
metal near the sonar system of a submarine), which is costly and time-consuming.
The supervised learning methods can be divided into classification and regres-
sion. When the number of classes of the data is limited (i.e., the output label of the
data is a discrete variable) the learning is called classification (e.g., classifying an
email into spam and not-spam classes), and when the output label is a continuous
variable (e.g., the price of a stock index) the topic is called regression. Examples
of the most widely used supervised learning algorithms are SVM Boser et al.
(1992), Vapnik (1995), Cortes and Vapnik (1995), artificial neural networks (e.g.,
multi-layer perceptron Rumelhart et al. (1986), LSTM Hochreiter and Schmid-
huber (1997), Gers et al. (1999)), linear and logistic regression Cramer (2002),
Naïve Bayes Hand and Yu (2001), decision trees Morgan and Sonquist (1963),
and K-Nearest Neighbor (KNN) Sebestyen (1962), Nilsson (1965).
2. Unsupervised learning: In this case, there is no supervision in the learning
and the ML algorithm works on unlabeled data. It means that the algorithm
learns from plain examples without any associated response, leaving the algorithm
to determine the data patterns based on the similarities in the data. This
type of algorithm tends to restructure the data and cluster them. From a system
viewpoint, this kind of learning receives sample data as the input (e.g., human
faces) without the corresponding output and groups similar samples
in the same clusters. The categorization of unlabeled data is commonly called
clustering. Also, association, as another type of unsupervised learning, refers to
methods which can discover rules that describe large portions of data, such as
people who buy product X also tend to buy product Y. Dimensionality reduction,
as another type of unsupervised learning, denotes methods that transform data
from a high-dimensional space into a low-dimensional space. Examples of
well-known unsupervised learning methods Xu and Wunsch (2005) are k-means,
hierarchical clustering, density-based spatial clustering of applications
with noise (DBSCAN), neural networks (e.g., autoencoders, self-organizing maps),
Expectation-Maximization (EM), and Principal Component Analysis (PCA).

1 Introduction to SVM 7

Fig. 1.1 Various types of machine learning: supervised (classification, regression),
unsupervised (clustering, association, dimensionality reduction), and reinforcement
3. Reinforcement learning: In this category of learning, an agent learns from an
action-reward mechanism by interacting with an environment, just like a human
learning to play chess by practicing and by trial and error. Reinforcement
algorithms are presented with examples without labels but receive
positive (e.g., reward) or negative (e.g., penalty) feedback from
the environment. Two widely used reinforcement models are the Markov Decision
Process (MDP) and Q-learning Sutton and Barto (2018).
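The supervised setting can be made concrete with a minimal sketch of one of the listed methods, K-Nearest Neighbor, in plain Python; the toy (height, weight) samples, labels, and the value of k below are invented for illustration:

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among the k closest labeled samples.

    `train` is a list of (feature_vector, label) pairs -- the input/output
    examples provided by the supervisor."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy labeled data: (height, weight) -> class
train = [((50, 30), "dog"), ((55, 35), "dog"), ((25, 4), "cat"), ((30, 5), "cat")]
pred = knn_classify(train, (52, 33))  # a query near the "dog" samples → "dog"
```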
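The clustering idea can likewise be sketched with a bare-bones k-means in plain Python; the two-blob data and hand-picked initial centroids are made up for illustration (real implementations use smarter initialization such as k-means++):

```python
import math

def kmeans(points, init, iters=20):
    """Bare-bones k-means: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    centroids = [tuple(c) for c in init]
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old centroid if its cluster becomes empty
                centroids[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centroids, clusters

# Two unlabeled "blobs"; no target outputs are ever shown to the algorithm.
pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(pts, init=[pts[0], pts[3]])
```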
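Q-learning, mentioned above, can be sketched on a tiny corridor MDP; the states, rewards, and hyperparameters below are invented for illustration only:

```python
import random

# Toy corridor MDP: states 0..4, actions 0 = left, 1 = right;
# reaching state 4 yields reward +1 and ends the episode.
N_STATES, GOAL = 5, 4

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def q_learning(episodes=300, alpha=0.5, gamma=0.9, eps=0.3, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy: explore with probability eps, else act greedily
            a = rng.randrange(2) if rng.random() < eps else (1 if q[s][1] > q[s][0] else 0)
            s2, r, done = step(s, a)
            # Q-learning update: move Q(s,a) toward reward + discounted best next value
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

q = q_learning()
# After training, "right" should dominate "left" in every non-goal state.
```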
The types of machine learning are summarized in Fig. 1.1. In addition to the
mentioned three categories, there is another type called semi-supervised learning
(that is also referred to as either transductive learning or inductive learning) which
falls between supervised and unsupervised learning methods. It combines a small
amount of labeled data with a large amount of unlabeled data for training.
From another point of view, machine learning techniques are classified into
the generative approach and discriminative approach. Generative models explic-
itly model the actual distribution of each class of data while discriminative models
learn the (hard or soft) boundary between classes. From the statistical viewpoint,
both of these approaches ultimately predict the conditional probability P(Class|Data),
but the two models learn different probabilities. In generative methods, the joint
distribution P(Class, Data) is learned and the prediction is performed according to this
distribution. On the other hand, discriminative models make predictions by estimating
conditional probability P(Class|Data).
Examples of generative methods are deep generative models (DGMs) such as
Variational Autoencoder (VAE) and GANs, Naïve Bayes, Markov random fields,
and Hidden Markov Models (HMM). SVM is a discriminative method that learns
the decision boundary like some other methods such as logistic regression, traditional
neural networks such as multi-layer perceptron (MLP), KNN, and Conditional Ran-
dom Fields (CRFs).
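The generative/discriminative distinction can be illustrated with a toy counting example (the binary feature, labels, and counts below are invented): the generative route estimates the joint P(Class, Data) and applies Bayes' rule, while the discriminative route estimates P(Class|Data) directly; with plain frequency estimates both arrive at the same posterior.

```python
# Toy data: one binary feature x, binary class y.
data = [(1, "spam")] * 30 + [(0, "spam")] * 10 + [(1, "ham")] * 5 + [(0, "ham")] * 55

def generative_posterior(x):
    """Model the joint P(class, x) by counting, then apply Bayes' rule
    to obtain P(class | x)."""
    total = len(data)
    joint = {}
    for xi, yi in data:
        joint[(xi, yi)] = joint.get((xi, yi), 0) + 1
    p_x = sum(joint.get((x, y), 0) for y in ("spam", "ham")) / total
    return {y: (joint.get((x, y), 0) / total) / p_x for y in ("spam", "ham")}

def discriminative_posterior(x):
    """Estimate P(class | x) directly from the examples having this x,
    without modeling how x itself is distributed."""
    matching = [yi for xi, yi in data if xi == x]
    return {y: matching.count(y) / len(matching) for y in ("spam", "ham")}
```

Both routes agree numerically here; the point is the modeling path, not the answer.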
1.2 What Is the Pattern?
In ML, we seek to design and build machines that can learn and recognize patterns,
a task also called pattern recognition. To do this, the data need to have a regularity
or arrangement, called a pattern, to be learned by ML algorithms. The data may be
created by humans, such as stock prices or a signature, or have a natural origin, such
as speech signals or DNA. Therefore, a pattern includes elements that are repeated
in a predictable manner. The patterns in natural data, e.g., speech signals, are often
chaotic and stochastic and do not exactly repeat. There are various types of natural
patterns which include spirals, meanders, waves, foams, tilings, cracks, and those
created by symmetries of rotation and reflection. Some types of patterns such as a
geometric pattern in an image can be directly observed while abstract patterns in
a huge amount of data or a language may become observable after analyzing the
data using pattern discovery methods. In both cases, the underlying mathematical
structure of a pattern can be analyzed by machine learning techniques which are
mainly empowered by mathematical tools. The techniques can learn the patterns to
predict or recognize them or can search them to find the regularities. Accordingly, if
a dataset lacks any regularities and repeatable templates, the modeling result
of ML techniques will not be promising.

1.3 An Introduction to SVM with a Geometric Interpretation
Support Vector Machine (SVM), also known as support vector network, is a supervised
learning approach used for classification and regression. Given a set of labeled
training examples belonging to two classes, the SVM training algorithm builds a decision
boundary between the samples of these classes. SVM does this in such a way
that it optimally discriminates between the two classes by maximizing the margin between
the two data categories. For data samples in an N -dimensional space, SVM constructs an
(N − 1)-dimensional separating hyperplane to discriminate the two classes. To describe
Fig. 1.2 a A binary classification example (samples of data in the vector space), b Some possible decision boundaries (several possible separating lines)
the SVM, assume a binary classification example in a two-dimensional space, e.g.,
distinguishing a cat from a dog using the values of their height (x1 ) and weight (x2 ).
An example of the labeled training samples for these classes in the feature space is
given in Fig. 1.2a, in which the samples for the dog are considered as the positive
samples and the cat samples are presented as the negative ones. To do the classification,
probably the most intuitive approach is to draw a separating line between the
positive and negative samples. However, as shown in Fig. 1.2b, there are many
possible lines that could be the decision boundary between these classes. Now, the question
is: which line is better and should be chosen as the boundary?
Although all the lines given in Fig. 1.2b correctly separate the two classes, none of
them seems the best fit. Alternatively, a line drawn between the two
classes that has the maximum distance from both classes is a better choice. In this
regard, the data points that lie closest to the decision boundary are the most difficult
samples to classify, and they have a direct bearing on the optimum location of the
boundary. These samples, called support vectors, are the closest to the decision
boundary and influence its position and orientation (see Fig. 1.3). According to
these vectors, the maximum distance between the two classes is determined. This
distance is called the margin, and a decision line at half of this margin (i.e., midway
between the classes) seems to be the optimum boundary. This line maximizes the margin
and is therefore called the maximum-margin line between the two classes. The SVM
classifier attempts to find this optimal line.
In SVM, to construct the optimal decision boundary, a vector w is considered that is
perpendicular to the margin. Now, to classify an unknown vector x, we can project
it onto w by computing w.x and determine on which side of the decision boundary
x lies by checking whether w.x ≥ t for a constant threshold t. It means that if the value of
w.x is more than t, i.e., x is far enough away, sample x is classified as a positive example. By
assuming t = −w0 , the given decision rule can be written as w.x + w0 ≥ 0. Now, the
question is, how can we determine the values of w and w0 ? To do this, the following
constraints are considered, which mean that a sample is classified as positive if the value
is equal to or greater than 1, and as negative if the value is −1 or less:
Fig. 1.3 a SVM optimum decision boundary (support vectors and optimum decision boundary), b Definition of the notations (decision boundary equation and maximum margin)
w.x+ + w0 ≥ 1, and w.x− + w0 ≤ −1. (1.1)
These two equations can be integrated into one inequality, as in Eq. 1.2, by introducing
a supplementary variable yi which is equal to +1 for positive samples and
to −1 for negative samples. This inequality is considered as an equality, i.e.,
yi (w.xi + w0 ) − 1 = 0, to define the main constraint of the problem, which means
that for the examples lying on the margins (i.e., the support vectors) the expression is
constrained to 0. This equation is equivalent to a line that is the answer to our problem.
This decision boundary line in this example becomes a hyperplane in the general
N -dimensional case.
yi (w.xi + w0 ) − 1 ≥ 0. (1.2)
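As a quick numeric check of the decision rule and the margin width (the weight vector, bias, and points below are made up for illustration):

```python
import math

# Hypothetical weight vector and bias chosen for illustration.
w, w0 = (1.0, 1.0), -3.0

def decision(x):
    """Decision rule w.x + w0 >= 0: +1 on the positive side, -1 otherwise."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return 1 if s >= 0 else -1

# Points lying exactly on the two margin boundaries, where
# y_i (w.x_i + w0) - 1 = 0 holds:
x_plus, x_minus = (2.0, 2.0), (1.0, 1.0)   # w.x+ + w0 = +1, w.x- + w0 = -1

# Margin width 2 / ||w||
width = 2 / math.sqrt(sum(wi * wi for wi in w))
```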
To find the maximum margin separating the positive and negative examples,
we need to know the width of the margin. To calculate it, (x+ − x− ) is projected
onto the unit vector w/‖w‖, so the width is (x+ − x− ) · w/‖w‖. Using
yi (w.xi + w0 ) − 1 = 0 to substitute w.x+ = 1 − w0 and w.x− = −1 − w0 , the
final value for the width is obtained as 2/‖w‖ (see Fig. 1.3). Finally, maximizing
this margin is equivalent to Eq. 1.3, which
is a quadratic function:
min_{w,w0} (1/2)‖w‖² ,
subject to yi (w.xi + w0 ) − 1 ≥ 0. (1.3)

This is a constrained optimization problem and can be solved by the Lagrange
multiplier method. After writing the Lagrangian equation as in Eq. 1.4, computing
the partial derivative with respect to w and setting it to zero results in
w = Σi αi yi xi , and doing the same with respect to w0 gives Σi αi yi = 0. It
means that w is a linear combination of the samples. Using these values in Eq. 1.4
results in a Lagrangian
Fig. 1.4 Transforming data from a nonlinear space into a linear higher dimensional space
equation in which the problem depends only on dot products of pairs of data samples.
Also, αi = 0 for the training examples that are not support vectors, which means these
examples do not affect the decision boundary. Another interesting fact about this
optimization problem is that it is a convex problem and it is guaranteed to always
find a global optimum.
L(w, w0 , α) = (1/2)‖w‖² − Σi αi [yi (w.xi + w0 ) − 1]. (1.4)
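Substituting w = Σi αi yi xi back into Eq. 1.4 eliminates w and w0 and yields the dual problem, in which only dot products of pairs of samples appear; a standard rendering of this dual, consistent with the derivation above, is:

```latex
\max_{\alpha}\; \sum_i \alpha_i \;-\; \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j),
\qquad \text{s.t.}\;\; \alpha_i \ge 0,\;\; \sum_i \alpha_i y_i = 0 .
```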
The above-described classification problem and its solution using the SVM
assumes the data is linearly separable. However, in most real-life applications, this
assumption is not correct and most problems are not classified simply using a linear
boundary. The SVM decision boundary is originally linear Vapnik (1963) but has
been extended to handle nonlinear cases as well Boser et al. (1992). To do this, SVM
proposes a method called kernel trick in which an input vector is transformed using a
nonlinear function like φ(.) into a higher dimensional space. Then, in this new space,
the maximum-margin linear boundary is found. It means that a nonlinear problem
is converted into a linearly separable problem in the new higher dimensional space
without affecting the convexity of the problem. A simple example of this technique
is given in Fig. 1.4 in which one-dimensional data samples, xi , are transformed into
two-dimensional space using (xi , xi × xi ) transform. In this case, the dot product of
two samples, i.e., xi .x j , in the optimization problem is replaced with φ(xi ).φ(x j ). In
practice, if we have a function K such that K (xi , x j ) = φ(xi ).φ(x j ), then we do
not need to know the transformation φ(.); only the function K (.) (which is called the
kernel function) is required. Some common kernel functions are linear, polynomial,
sigmoid, and radial basis functions (RBF).
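The kernel identity can be verified numerically; the sketch below (the quadratic feature map and sample points are chosen for illustration) checks that K(x, y) = (x·y)² equals φ(x)·φ(y) for φ(x) = (x1², √2·x1x2, x2²):

```python
import math

def phi(x):
    """Explicit feature map into 3-D space for the quadratic kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def kernel(x, y):
    """K(x, y) = (x . y)^2 -- computed without ever forming phi."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = kernel(x, y)
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))
# lhs equals rhs up to floating-point error
```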
Although the kernel trick is a clever method to handle the nonlinearity, the SVM
still assumes that the data is linearly separable in this transformed space. This assump-
tion is not true in most real-world applications. Therefore, another type of SVM is
proposed which is called soft-margin SVM Cortes and Vapnik (1995). The described
SVM method described up to now is known as hard-margin SVM. As stated, hard-margin SVMs
assume the data is linearly separable without any errors, whereas soft-margin SVMs
allow some misclassification, resulting in a more robust decision on nonlinearly
separable data. Today, soft-margin SVMs are the most common SVM technique in
ML; they utilize so-called slack variables in the optimization problem to control the
amount of misclassification.
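With slack variables ξi ≥ 0 and a penalty parameter C controlling the trade-off between margin size and the amount of allowed misclassification, the soft-margin objective extending Eq. 1.3 is commonly written as:

```latex
\min_{w,\, w_0,\, \xi}\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i,
\qquad \text{s.t.}\;\; y_i (w \cdot x_i + w_0) \ge 1 - \xi_i,\;\; \xi_i \ge 0 .
```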
1.4 History of SVMs
Vladimir Vapnik, the Russian statistician, is the main originator of the SVM tech-
nique. The primary work on the SVM algorithm was proposed by Vapnik (1963) as
the Generalized Portrait algorithm for pattern recognition. However, it was not the
first algorithm for pattern recognition, and Fisher in 1936 had proposed a method
for this purpose Fisher (1936). Also, Frank Rosenblatt had proposed the perceptron
linear classifier, which was an early feedforward neural network Rosenblatt
(1958). One year after the primary work of Vapnik and Lerner, in 1964, Vapnik
further developed the Generalized Portrait algorithm Vapnik and Chervonenkis (1964).
In this year, a geometrical interpretation of the kernels was introduced in Aizerman
et al. (1964) as inner products in a feature space. The kernel theory which is the
main concept in the development of SVM and is called “kernel trick” was previously
proposed in Aronszajn (1950). In 1965, a large margin hyperplane in the input space
was introduced in Cover (1965) which is another key idea of the SVM algorithm.
At the same time, a similar optimization concept was used in pattern recognition
by Mangasarian (1965). Another important research that defines the basic idea of
the soft-margin concept in SVM was introduced by Smith (1968). This idea was
presented as the use of slack variables to overcome the problem of noisy samples
that are not linearly separable. In the history of SVM development, the breakthrough
work is the formulation of statistical learning framework or VC theory proposed
by Vapnik and Chervonenkis (1974), which presents one of the most robust prediction
methods. It is not surprising to say that the rise of SVM began in this decade,
and this reference has been translated from Russian into other languages such as
German Vapnik and Chervonenkis (1979) and English Vapnik (1982). The use of
polynomial kernel in SVM was proposed by Poggio (1975) and the improvement
of kernel techniques for regression was presented by Wahba (1990). Studying the
connection between neural networks and kernel regression was done by Poggio and
Girosi (1990). The improvement of the previous work on slack variables in Smith
(1968) was done by Bennett and Mangasarian (1992). Another main milestone in the
development of SVM is in 1992, when SVM was presented in its present-day form
by Boser et al. (1992). In this work, the optimal margin classifier of linear classifiers
(from Vapnik (1963)) was extended to nonlinear cases by utilizing the kernel trick
to maximum-margin hyperplanes Aizerman et al. (1964). In 1995, soft margin of
SVM classifiers to handle noisy and not linearly separable data was introduced using
slack variables by Cortes and Vapnik (1995). In 1996, the algorithm was extended to
the case of regression Drucker et al. (1996) which is called Support Vector Regres-
sion (SVR). The rapid growth of SVM and using this technique in various applica-
tions has been increased after 1995. Also, the theoretical aspects of SVM have been
Table 1.1 A brief history of SVM development
Decade Year Researcher(s) SVM development
1950 1950 Aronszajn (1950) Introducing the “Theory of Reproducing Kernels”
1960 1963 Vapnik and Lerner (1963) Introducing the Generalized Portrait algorithm (the
algorithm implemented by support vector machines
is a nonlinear generalization of the Generalized
Portrait algorithm)
1964 Vapnik (1964) Developing the Generalized Portrait algorithm
1964 Aizerman et al. (1964) Introducing the geometrical interpretation of the
kernels as inner products in a feature space
1965 Cover (1965) Discussing large margin hyperplanes in the input
space and also sparseness
1965 Mangasarian (1965) Studying optimization techniques for pattern
recognition similar to large margin hyperplanes
1968 Smith (1968) Introducing the use of slack variables to overcome
the problem of noise and non-separability
1970 1973 Duda and Hart (1973) Discussing large margin hyperplanes in the input
space
1974 Vapnik and Chervonenkis Writing a book on “statistical learning theory” (in
(1974) Russian) which can be viewed as the starting of
SVMs
1975 Poggio (1975) Proposing the use of polynomial kernel in SVM
1979 Vapnik and Chervonenkis Translating of Vapnik and Chervonenkis’s 1974
(1979) book to German
1980 1982 Vapnik (1982) Writing an English translation of his 1979 book
1990 1990 Poggio and Girosi (1990) Studying the connection between neural networks
and kernel regression
1990 Wahba (1990) Improving the kernel method for regression
1992 Bennett and Mangasarian Improving Smith’s 1968 work on slack variables
(1992)
1992 Boser et al. (1992) Presenting SVM in today’s form at the COLT 1992
conference
1995 Cortes and Vapnik (1995) Introducing the soft-margin classifier
1996 Drucker et al. (1996) Extending the algorithm to the case of regression,
called SVR
1997 Muller et al. (1997) Extending SVM for time-series prediction
1998 Bartlett (1998) Providing the statistical bounds on the
generalization of hard margin
2000 2000 Shawe-Taylor and Giving the statistical bounds on the generalization
Cristianini (2000) of soft margin and the regression case
2001 Ben-Hur et al. (2001) Extending SVM to the unsupervised case
2005 Duan and Keerthi (2005) Extending SVM from the binary classification into
multiclass SVM
2010 2011 Polson and Scott (2011) Studying the graphical model representation of SVM
and its Bayesian interpretation
2017 Wenzel et al. (2017) Developing a scalable version of Bayesian SVM
for big data applications
studied and it has been extended to domains other than classification. The statistical
bounds on the generalization of hard margin were given by Bartlett (1998) and
it was presented for soft margin and the regression case in 2000 by Shawe-Taylor
and Cristianini (2000). The SVM was originally developed for supervised learning
which has been extended to the unsupervised case in Ben-Hur et al. (2001) called
support vector clustering. Another improvement of SVM was its extension from the
binary classification into multiclass SVM by Duan and Keerthi in Duan and Keerthi
(2005) by distinguishing between one of the labels and the rest (one-versus-all) or
between every pair of classes (one-versus-one). In 2011, SVM was analyzed as a
graphical model and it was shown that it admits a Bayesian interpretation using data
augmentation technique Polson and Scott (2011). Accordingly, a scalable version of
the Bayesian SVM was developed in Wenzel et al. (2017) enabling the application
of Bayesian SVMs in big data applications. A summary of the research related to
SVM development is given in Table 1.1.
1.5 SVM Applications
Today, machine learning algorithms are on the rise and are widely used in real
applications. Every year new techniques are proposed that overcome the current
leading algorithms. Some of them are only little advances or combinations of existing
algorithms and others are newly created and lead to astonishing progress. Although
deep learning techniques are dominant in many real applications such as image
processing (e.g., for image classification) and sequential data modeling (e.g., in
natural language processing tasks such as machine translation), these techniques
require a huge amount of training data for successful modeling. Large-scale labeled
datasets are not available in many applications in which other ML techniques (called
classical ML methods) such as SVM, decision tree, and Bayesian family methods
have higher performance than deep learning techniques. Furthermore, there is another
fact in ML. Each task in ML applications can be solved using various methods;
however, there is no single algorithm that will work well for all tasks. This fact is
known as the No Free Lunch Theorem in ML Wolpert (1996). Each task that we
want to solve has its idiosyncrasies and there are various ML algorithms to suit the
problem. Among the non-deep learning methods, SVM, as a well-known machine
learning technique, is widely used in various classification and regression tasks today
due to its high performance and reliability across a wide variety of problem domains
and datasets Cervantes et al. (2020). Generally, SVM can be applied to any ML task
in any application such as computer vision and image processing, natural language
processing (NLP), medical applications, biometrics, and cognitive science. In the
following, some common applications of SVM are reviewed.
• ML Applications in Medicine: SVM is widely applied in medical applications
including medical image processing (e.g., cancer diagnosis in mammography
Azar and El-Said (2014)), bioinformatics (e.g., patient and gene classification)
Byvatov and Schneider (2003), and health signal analysis (e.g., electrocardiogram
signal classification for identifying heart anomalies Melgani and
Bazi (2008) and electroencephalogram signal processing in psychology and neuroscience
Li et al. (2013)). SVM, which is among the successful methods in this
field, is used in the diagnosis and prognosis of various types of diseases. Also, SVM
is utilized for the classification of genes, of patients based on genes, and for
other biological problems Pavlidis et al. (2001).
• Text Classification: There are various applications for text classification including
topic identification (i.e., categorization of a given document into predefined classes
such as scientific, sport, or political); author identification/verification of a written
document, spam, and fake news/email/comment detection; language identification
(i.e., determining the language of a document); polarity detection (e.g., finding that
a given comment in a social media or e-commerce website is positive or negative);
and word sense disambiguation (i.e., determining the meaning of an ambiguous
word in a sentence, such as “bank”). SVM is competitive with the other
ML techniques in this area Joachims (1999), Aggarwal and Zhai (2012).
• Image Classification: The process of assigning a label to an image is generally
known as image recognition which can be used in various fields such as in biomet-
rics (e.g., face detection/identification/verification), medical (e.g., processing MRI
and CT images for disease diagnosis), object recognition, remote sensing classifi-
cation (e.g., categorization of satellite images), automated image organization for
social networks and websites, visual searches, and image retrieval. Although deep
learning techniques, especially CNNs, are leading this area for representation
and feature extraction, SVM is commonly utilized as the classifier Miranda et al.
(2016), Tuia et al. (2011), both with classical image processing-based techniques
and deep neural network methods Li (2019). The image recognition services are
now generally offered by the AI teams of technology companies such as Microsoft,
IBM, Amazon, and Google.
• Steganography Detection: Steganography is the practice of concealing a message
in an appropriate multimedia carrier such as an image, an audio file, or a video file.
This technique is used for secure communication in security-focused organizations
to conceal both the fact that a secret message is being sent and its contents.
This method has the advantage over cryptography, in which only the content of a
message is protected. On the other hand, the act of detecting whether an image
(or another file) is stego or not is called steganalysis. The analysis of an image to
determine whether it is a stego image is a binary classification problem in which
SVM is widely used Li (2011).
References
Aggarwal, C.C., Zhai, C.: A survey of text classification algorithms. In: Aggarwal, C.C., Zhai, C.
(eds.) Mining Text Data, 163–222. Springer, Boston (2012)
Aizerman, M.A., Braverman, E.M., Rozonoer, L.I.: Theoretical foundations of the potential function
method in pattern recognition learning. Autom. Remote. 25, 821–837 (1964)
Aronszajn, N.: Theory of reproducing kernels. Trans. Am. Math. Soc. 68, 337–404 (1950)
Assael, Y.M., Shillingford, B., Whiteson, S., De Freitas, N.: Lipnet: End-to-end Sentence-Level
Lipreading (2016). arXiv:1611.01599
Azar, A.T., El-Said, S.A.: Performance analysis of support vector machines classifiers in breast
cancer mammography recognition. Neural. Comput. Appl. 24, 1163–1177 (2014)
Bartlett, P.L.: The sample complexity of pattern classification with neural networks: the size of the
weights is more important than the size of the network. IEEE Trans. Inf. Theory 44, 525–536
(1998)
Bayes, T.: An essay towards solving a problem in the doctrine of chances. MD Comput. 8, 157
(1991)
Ben-Hur, A., Horn, D., Siegelmann, H.T., Vapnik, V.: Support vector clustering. J. Mach. Learn.
Res. 2, 125–137 (2001)
Bennett, K.P., Mangasarian, O.L.: Robust linear programming discrimination of two linearly insep-
arable sets. Optim. Methods. Softw. 1, 23–34 (1992)
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Singapore (2006)
Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In:
COLT92: 5th Annual Workshop Computers Learning Theory, PA (1992)
Byvatov, E., Schneider, G.: Support vector machine applications in bioinformatics. Appl. Bioinf.
2, 67–77 (2003)
Cervantes, J., Garcia-Lamont, F., Rodríguez-Mazahua, L., Lopez, A.: A comprehensive survey on
support vector machine classification: applications, challenges and trends. Neurocomputing 408,
189–215 (2020)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)
Cover, T.M.: Geometrical and statistical properties of systems of linear inequalities with applications
in pattern recognition. IEEE Trans. Elect. Comput. 3, 326–334 (1965)
Cramer, J.S.: The origins of logistic regression (Technical report). Tinbergen Inst. 167–178 (2002)
Drucker, H., Burges, C.J., Kaufman, L., Smola, A., Vapnik, V.: Support vector regression machines.
Adv. Neural Inf. Process Syst. 9, 155–161 (1996)
Duan, K. B., Keerthi, S. S.: Which is the best multiclass SVM method? An empirical study. In:
International Workshop on Multiple Classifier Systems, pp. 278–285. Springer, Heidelberg (2005)
Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis. Wiley, New York (1973)
Ferrucci, D., Levas, A., Bagchi, S., Gondek, D., Mueller, E.T.: Watson: beyond jeopardy! Artif.
Intell. 199, 93–105 (2013)
Fisher, R.A.: The goodness of fit of regression formulae, and the distribution of regression coeffi-
cients. J. R. Stat. Soc. 85, 597–612 (1922)
Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 179–188
(1936)
Fukushima, K.: Neocognitron: a hierarchical neural network capable of visual pattern recognition.
Neural Netw. 1, 119–130 (1988)
Gagniuc, P.A.: Markov Chains: from Theory to Implementation and Experimentation. Wiley, Hobo-
ken, NJ (2017)
Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with LSTM. In:
International Conference on Artificial Neural Networks, Edinburgh (1999)
Goodfellow, I., Bengio, Y., Courville, A., Bengio, Y.: Deep Learning. MIT Press, England (2016)
Hand, D.J., Yu, K.: Idiot’s Bayes-not so stupid after all? Int. Stat. Rev. 69, 385–398 (2001)
Hebb, D.: The Organization of Behavior. Wiley, New York (1949)
Hinton, G.E.: Analyzing Cooperative Computation. 5th COGSCI, Rochester (1983)
Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput.
18, 1527–1554 (2006)
Ho, T.K.: Random decision forests. In: Proceedings of 3rd International Conference. on Document
Analysis and Recognition, Montreal (1995)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–1780 (1997)
Hopfield, J.J.: Neurons with graded response have collective computational properties like those of
two-state neurons. Proc. Natl. Acad. Sci. 81, 3088–3092 (1984)
Joachims, T.: Transductive inference for text classification using support vector machines. In: ICML
99: Proceedings of the Sixteenth International Conference on Machine Learning, 200–209 (1999)
Khan, A., Sohail, A., Zahoora, U., Qureshi, A.S.: A survey of the recent architectures of deep
convolutional neural networks. Artif. Intell. Rev. 53, 5455–5516 (2020)
Li, B., He, J., Huang, J., Shi, Y.Q.: A survey on image steganography and steganalysis. J. Inf. Hiding
Multimedia Signal Process. 2, 142–172 (2011)
Li, S., Zhou, W., Yuan, Q., Geng, S., Cai, D.: Feature extraction and recognition of ictal EEG using
EMD and SVM. Comput. Biol. Med. 43, 807–816 (2013)
Li, Y., Li, J., Pan, J.S.: Hyperspectral image recognition using SVM combined deep learning. J.
Internet Technol. 20, 851–859 (2019)
Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., Alsaadi, F.E.: A survey of deep neural network
architectures and their applications. Neurocomputing 234, 11–26 (2017)
Mangasarian, O.L.: Linear and nonlinear separation of patterns by linear programming. Oper. Res.
13, 444–452 (1965)
McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. Bull.
Math. Biol. 5, 115–133 (1943)
Melgani, F., Bazi, Y.: Classification of electrocardiogram signals with support vector machines and
particle swarm optimization. IEEE Trans. Inf. Technol. Biomed. 12, 667–677 (2008)
Michie, D.: Experiments on the mechanization of game-learning Part I. Characterization of the
model and its parameters. Comput. J. 6, 232–236 (1963)
Minsky, M., Papert, S.A.: Perceptrons: An Introduction to Computational Geometry. MIT Press,
England (1969)
Miranda, E., Aryuni, M., Irwansyah, E.: A survey of medical image classification techniques. In:
International Conference on Information Management and Technology, pp. 56–61 (2016)
Mitchell, T.M.: Machine Learning. McGraw-Hill Higher Education, New York (1997)
Morgan, J.N., Sonquist, J.A.: Problems in the analysis of survey data, and a proposal. J. Am. Stat.
Assoc. 58, 415–434 (1963)
Müller, K.R., Smola, A.J., Rätsch, G., Schölkopf, B., Kohlmorgen, J., Vapnik, V.: Predicting time
series with support vector machines. In: International Conference on Artificial Neural Networks,
pp. 999–1004. Springer, Heidelberg (1997)
Nilsson, N.J.: Learning Machines. McGraw-Hill, New York (1965)
Pan, Z., Yu, W., Yi, X., Khan, A., Yuan, F., Zheng, Y.: Recent progress on generative adversarial
networks (GANs): a survey. IEEE Access 7, 36322–36333 (2019)
Pavlidis, P., Weston, J., Cai, J., Grundy, W N.: Gene functional classification from heterogeneous
data. In: Proceedings of the Fifth Annual International Conference on Computational Biology,
pp. 249–255 (2001)
Poggio, T.: On optimal nonlinear associative recall. Biol. Cybern. 19, 201–209 (1975)
Poggio, T., Girosi, F.: Networks for approximation and learning. Proc. IEEE 78, 1481–1497 (1990)
Polson, N.G., Scott, S.L.: Data augmentation for support vector machines. Bayesian Anal. 6, 1–23
(2011)
Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in
the brain. Psychol. Rev. 65, 386–408 (1958)
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors.
Nature 323, 533–536 (1986)
Samuel, A.L.: Some studies in machine learning using the game of checkers. IBM J. Res. Dev. 3,
210–229 (1959)
18 H. Veisi
Samuel, A.L.: Some studies in machine learning using the game of checkers. IBM J. Res. Dev. 44,
206–226 (2000)
Schapire, R.E.: The strength of weak learnability. Mach. Learn. 5, 197–227 (1990)
Sebestyen, G.S.: Decision-Making Processes in Pattern Recognition. Macmillan, New York (1962)
Shawe-Taylor, J., Cristianini, N.: Margin distribution and soft margin. In: Smola, A.J., Bartlett, P.,
Scholkopf, B., Schuurmans, D., (eds.), Advances in Large Margin Classifiers, pp. 349–358. MIT
Press, England (2000)
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker,
L., Lai, M., Bolton, A., Chen, Y.: Mastering the game of go without human knowledge. Nature
550, 354–359 (2017)
Smith, F.W.: Pattern classifier design by linear programming. IEEE Trans. Comput. 100, 367–372
(1968)
Solomonoff, R.J.: A formal theory of inductive inference. Part II. Inf. Control. 7, 224–254 (1964)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, USA (2018)
Tuia, D., Volpi, M., Copa, L., Kanevski, M., Munoz-Mari, J.: A survey of active learning algorithms
for supervised remote sensing image classification. IEEE J. Sel. Topics Signal Process. 5, 606–617
(2011)
Vapnik, V.: Pattern recognition using generalized portrait method. Autom. Remote. Control. 24,
774–780 (1963)
Vapnik, V.N.: Estimation of Dependencies Based on Empirical Data. Springer, New York (1982)
Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1995)
Vapnik, V.N., Chervonenkis, A.Y.: On a class of perceptrons. Autom. Remote. 25, 103–109 (1964)
Vapnik, V., Chervonenkis, A.: Theory of pattern recognition: statistical problems of learning (Russian). Nauka, Moscow (1974)
Vapnik, V., Chervonenkis, A.: Theory of Pattern Recognition (German). Akademie, Berlin (1979)
Wahba, G.: Spline Models for Observational Data. SIAM, PA (1990)
Warwick, K.: A Brief History of Deep Blue. IBM’s Chess Computer, Mental Floss (2017)
Watkins, C.J.C.H.: Learning from delayed rewards. Ph.D. thesis, University of Cambridge, England
(1989)
Wenzel, F., Galy-Fajou, T., Deutsch, M., Kloft, M.: Bayesian nonlinear support vector machines for
big data. In: European Conference on Machine Learning and Knowledge Discovery in Databases,
pp. 307–322. Springer, Cham (2017)
Widrow, B., Hoff, M.E.: Adaptive Switching Circuits (No. TR-1553-1). Stanford University California, Stanford Electronics Labs (1960)
Wolpert, D.H.: The lack of a priori distinctions between learning algorithms. Neural Comput. 8,
1341–1390 (1996)
Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Trans. Neural Netw. Learn. Syst. 16,
645–678 (2005)
Chapter 2
Basics of SVM Method and Least Squares SVM

Kourosh Parand, Fatemeh Baharifard, Alireza Afzal Aghaei, and Mostafa Jani
Abstract The learning process of support vector machine algorithms leads to solv-
ing a convex quadratic programming problem. Since this optimization problem has
a unique solution and also satisfies the Karush–Kuhn–Tucker conditions, it can be
solved very efficiently. In this chapter, the formulation of optimization problems
which have arisen in the various forms of support vector machine algorithms is
discussed.

Keywords Support vector machine · Classification · Regression · Kernel trick

2.1 Linear SVM Classifiers

As mentioned in the previous chapter, the linear support vector machine method
for categorizing separable data was introduced by Vapnik and Chervonenkis (1964).
This method finds the best discriminator hyperplane, which separates the samples of two classes in a training dataset. Consider D = {(xi, yi) | i = 1, . . . , N, xi ∈ Rd and yi ∈ {−1, +1}} as a set of training data, where the samples in classes C1 and C2 have +1 and −1 labels, respectively.
In this section, the SVM method is explained for two cases. The first case occurs
when the training samples are linearly separable and the goal is to find a linear
separator by the hard margin SVM method. The second is about the training samples

K. Parand (B)
Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid Beheshti
University, Tehran, Iran
e-mail: [email protected]
Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON, Canada
F. Baharifard
School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
A. A. Aghaei · M. Jani
Department of Computer and Data Science, Faculty of Mathematical Sciences,
Shahid Beheshti University, Tehran, Iran
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 19
J. A. Rad et al. (eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support
Vector Machines, Industrial and Applied Mathematics,
https://doi.org/10.1007/978-981-19-6553-1_2
which are not linearly separable (e.g., due to noise), for which the soft margin SVM method has to be used.

2.1.1 Hard Margin SVM

Assume that the input data is linearly separable. Figure 2.1 shows an example of such data and indicates that the separating hyperplane is not unique. The aim of the SVM method is to find a unique hyperplane that has the maximum distance to the closest points of both classes. The equation of this hyperplane can be considered as follows:

⟨w, x⟩ + w0 = wᵀx + w0 = Σ_{i=1}^{d} wi xi + w0 = 0,  (2.1)

where the wi's are the unknown weights of the problem.
The margin of a separator is the distance between the hyperplane and the closest training samples. A larger margin provides better generalization to unseen data. The hyperplane with the largest margin has equal distances to the closest samples of both classes. Suppose the distance from the separating hyperplane to the nearest sample of each of the classes C1 and C2 is equal to D/‖w‖, where ‖w‖ is the Euclidean norm of w. So, the margin will be equal to 2D/‖w‖. The purpose of the hard margin SVM method is to maximize this margin so that all training data is categorized correctly. Therefore, the following conditions must be satisfied:

∀xi ∈ C1 (yi = +1) → wᵀxi + w0 ≥ D,
∀xi ∈ C2 (yi = −1) → wᵀxi + w0 ≤ −D.  (2.2)

Fig. 2.1 Different separating hyperplanes for the same dataset in a classification problem
Fig. 2.2 Hard margin SVM method: a unique hyperplane with the maximum margin

So, the following optimization problem is obtained to find the separating hyperplane, which is depicted in Fig. 2.2 for two-dimensional space:

max_{D,w,w0} 2D/‖w‖  (2.3)
s.t. wᵀxi + w0 ≥ D, ∀xi ∈ C1,
     wᵀxi + w0 ≤ −D, ∀xi ∈ C2.

One can set w ← w/D and w0 ← w0/D and combine the above two inequalities into the single inequality yi(wᵀxi + w0) ≥ 1 to have

max_{w,w0} 2/‖w‖  (2.4)
s.t. yi(wᵀxi + w0) ≥ 1, i = 1, . . . , N.

Moreover, in order to maximize the margin, one can equivalently minimize ‖w‖. This gives us the following primal problem for the normalized vectors w and w0 (here ‖w‖ is substituted with ‖w‖² = wᵀw for simplicity):

min_{w,w0} (1/2) wᵀw  (2.5)
s.t. yi(wᵀxi + w0) ≥ 1, i = 1, . . . , N.
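As a concrete illustration, the primal problem Eq. 2.5 can be solved numerically with a general-purpose constrained optimizer. This is only a sketch: the tiny 2-D dataset and the choice of SciPy's SLSQP solver are assumptions for illustration, not from the text (dedicated QP solvers are normally used in practice):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny, hypothetical linearly separable dataset: two points per class
X = np.array([[2.0, 2.0], [2.5, 1.5], [0.0, 0.0], [-0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Primal objective (1/2) w^T w over the variable vector v = [w1, w2, w0]
def objective(v):
    w = v[:2]
    return 0.5 * w @ w

# Constraints y_i (w^T x_i + w0) - 1 >= 0, in the form SLSQP expects
constraints = [
    {"type": "ineq", "fun": (lambda v, i=i: y[i] * (v[:2] @ X[i] + v[2]) - 1.0)}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, w0 = res.x[:2], res.x[2]
# Every training point should now satisfy the margin constraint
margin_ok = all(y[i] * (w @ X[i] + w0) >= 1 - 1e-4 for i in range(len(y)))
```

Because the problem is a convex QP, the solver's feasible solution is the global minimum.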

The Quadratic Programming (QP) problem, as a special case of nonlinear programming, is the problem of minimizing or maximizing a quadratic objective function of several variables subject to some linear constraints defined on these variables. The advantage of QP problems is that there are efficient computational methods to solve them Frank and Wolfe (1956), Murty and Yu (1988). The general format of this problem is shown below, where Q is a real symmetric matrix:

min_x (1/2) xᵀQx + cᵀx  (2.6)
s.t. Ax ≤ b,
     Ex = d.

According to the above definition, the problem Eq. 2.5 can be considered as a convex QP problem (Q is an identity matrix, the vector c equals zero, and the constraints are reformulated into the Ax ≤ b form). If the problem has a feasible solution, any local minimum is a global minimum and can be obtained by an efficient method.
On the other hand, instead of solving the primal form of the problem, the dual
form of it can be solved. The dual problem is often easier and helps us to have
better intuition about the optimal hyperplane. More importantly, it enables us to take
advantage of the kernel trick which will be explained later.
A common method for solving constrained optimization problems is using the
technique of Lagrangian multipliers. In the Lagrangian multipliers method, a new
function, namely, the Lagrangian function is formed from the objective function and
constraints, and the goal is to obtain the stationary point of this function. Consider
the minimization problem as follows:

p* = min_x f(x)  (2.7)
s.t. gi(x) ≤ 0, i = 1, . . . , m,
     hi(x) = 0, i = 1, . . . , p.

The Lagrangian function of this problem is


L(x, α, λ) = f(x) + Σ_{i=1}^{m} αi gi(x) + Σ_{i=1}^{p} λi hi(x),  (2.8)

where α = [α1 , . . . , αm ] and λ = [λ1 , . . . , λ p ] are Lagrangian multiplier vectors.


According to the following equation, by setting αi ≥ 0, the maximum of L(x, α, λ) over the multipliers is equivalent to f(x):

max_{αi≥0, λi} L(x, α, λ) = { ∞,    if gi(x) > 0 for some i,
                              ∞,    if hi(x) ≠ 0 for some i,  (2.9)
                              f(x), otherwise.

Therefore, the problem Eq. 2.7 can be written as

p* = min_x max_{αi≥0, λi} L(x, α, λ).  (2.10)
2 Basics of SVM Method and Least Squares SVM 23

The dual form of Eq. 2.10 will be obtained by swapping the order of max and min:

d* = max_{αi≥0, λi} min_x L(x, α, λ).  (2.11)

Weak duality always holds, so d* ≤ p*. In addition, because the primal problem is convex, strong duality is also established and d* = p*. So, the primal optimal objective and the dual optimal objective are equal and, instead of solving the primal problem, the dual problem can be solved.
Here, the dual formulation of the problem Eq. 2.5 is discussed. Combining the
objective function and the constraints gives us

L(w, w0, α) = (1/2)‖w‖² + Σ_{i=1}^{N} αi (1 − yi(wᵀxi + w0)),  (2.12)

which leads to the following optimization problem:

min_{w,w0} max_{αi≥0} [ (1/2)‖w‖² + Σ_{i=1}^{N} αi (1 − yi(wᵀxi + w0)) ].  (2.13)

Corresponding to the strong duality, the following dual optimization problem can
be examined to find the optimal solution of the above problem

max_{αi≥0} min_{w,w0} [ (1/2)‖w‖² + Σ_{i=1}^{N} αi (1 − yi(wᵀxi + w0)) ].  (2.14)

The solution is characterized by the saddle point of the problem. To find min_{w,w0} L(w, w0, α), taking the first-order partial derivatives of L with respect to w and w0 gives

∇w L(w, w0, α) = 0 → w = Σ_{i=1}^{N} αi yi xi,
∂L(w, w0, α)/∂w0 = 0 → Σ_{i=1}^{N} αi yi = 0.  (2.15)

In these equations, w0 has been eliminated and a global constraint has been set on α. By substituting w from the above equation into the Lagrangian function of Eq. 2.12, we have

L(α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj xiᵀxj.  (2.16)
So, the dual form of problem Eq. 2.5 is as follows:

max_α Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj xiᵀxj  (2.17)
s.t. Σ_{i=1}^{N} αi yi = 0,
     αi ≥ 0, i = 1, . . . , N.

The problem Eq. 2.17 can equivalently be considered as follows, which is a quadratic program in the sense of Eq. 2.6:

min_α (1/2) αᵀQα + (−1v)ᵀα  (2.18)
s.t. −α ≤ 0,
     yᵀα = 0,

where Q is the N×N matrix with entries Qij = yi yj xiᵀxj.

After finding α by the QP process, the vector w can be calculated from the equality w = Σ_{i=1}^{N} αi yi xi, and the only unknown variable that remains is w0. For computing w0, support vectors must be introduced first. To express the concept of support vectors, we need to explain the Karush–Kuhn–Tucker (KKT) conditions Karush (1939), Kuhn and Tucker (1951) beforehand.
The necessary conditions for the optimal solution of a nonlinear program are called the Karush–Kuhn–Tucker (KKT) conditions. In fact, if there exists some saddle point (w*, w0*, α*) of L(w, w0, α), then it satisfies the following KKT conditions:

1. yi(w*ᵀxi + w0*) ≥ 1, i = 1, . . . , N,
2. ∇w L(w, w0, α)|_(w*, w0*, α*) = 0 and ∂L(w, w0, α)/∂w0 |_(w*, w0*, α*) = 0,
3. αi* (1 − yi(w*ᵀxi + w0*)) = 0, i = 1, . . . , N,
4. αi* ≥ 0, i = 1, . . . , N.  (2.19)

The first condition indicates the feasibility of the solution and states that the constraints must not be violated at the optimal point. The second condition ensures that there is no direction that can both improve the objective function and remain feasible. The third condition is the complementary slackness condition, which together with the fourth condition indicates that the Lagrangian multipliers of inactive inequality constraints are zero, while the multipliers of active constraints may be positive. In other words, for an active constraint yi(wᵀxi + w0) = 1, αi can be greater than zero and the corresponding xi is defined as a support vector. But for an inactive constraint yi(wᵀxi + w0) > 1, αi = 0 and xi is not a support vector (see Fig. 2.3).
Fig. 2.3 The support vectors (α > 0) in the hard margin SVM method

If we define the set of support vectors as SV = {xi | αi > 0} and consequently define S = {i | xi ∈ SV}, the direction of the hyperplane, which is related to w, can be found as follows:

w = Σ_{s∈S} αs ys xs.  (2.20)

Moreover, any sample whose Lagrangian multiplier is greater than zero lies on the margin and can be used to compute w0. Using a sample xsj with sj ∈ S, for which the equality ysj(wᵀxsj + w0) = 1 holds, we have

w0 = ysj − wᵀxsj.  (2.21)

By assuming that the linear classifier is y = sign(w0 + wᵀx), new samples can be classified using only the support vectors of the problem. Consider a new sample x; the label of this sample (ŷ) can be computed as follows:

ŷ = sign(w0 + wᵀx)  (2.22)
  = sign( ysj − (Σ_{s∈S} αs ys xs)ᵀ xsj + (Σ_{s∈S} αs ys xs)ᵀ x )
  = sign( ysj − Σ_{s∈S} αs ys xsᵀxsj + Σ_{s∈S} αs ys xsᵀx ).
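To make the dual route concrete, the sketch below (toy data and the SciPy SLSQP solver are illustrative assumptions, not from the text) solves the dual problem Eq. 2.17 numerically, recovers w via Eq. 2.20 and w0 via Eq. 2.21, and classifies the training points with Eq. 2.22:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy data
X = np.array([[2.0, 2.0], [2.5, 1.5], [0.0, 0.0], [-0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

# Q_ij = y_i y_j x_i^T x_j, so the dual objective is sum(a) - (1/2) a^T Q a
Q = (y[:, None] * X) @ (y[:, None] * X).T

res = minimize(
    lambda a: 0.5 * a @ Q @ a - a.sum(),               # minimize the negated dual
    x0=np.zeros(N),
    method="SLSQP",
    bounds=[(0.0, None)] * N,                          # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
)
alpha = res.x

w = (alpha * y) @ X              # Eq. 2.20: w = sum_s alpha_s y_s x_s
s = int(np.argmax(alpha))        # index of a support vector (alpha_s > 0)
w0 = y[s] - w @ X[s]             # Eq. 2.21
y_hat = np.sign(X @ w + w0)      # Eq. 2.22, specialized to the linear kernel
```

Note that only the samples with nonzero αi contribute to w and to the classification of new points.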
2.1.2 Soft Margin SVM

Soft margin SVM is a method for obtaining a linear classifier for some training sam-
ples which are not actually linearly separable. The overlapping classes or separable
classes that include some noise data are examples of these problems. In these cases,
the hard margin SVM does not work well or even does not get the right answer.
One solution for these problems is trying to minimize the number of misclassified
points as well as maximize the number of samples that are categorized correctly.
But this counting solution falls into the category of NP-complete problems Cortes
and Vapnik (1995). So, an efficient solution is defining a continuous problem that
is solvable deterministically in a polynomial time by some optimization techniques.
The extension of the hard margin method, called the soft margin SVM method, was
introduced by Cortes and Vapnik (1995) for this purpose.
In the soft margin method, the samples are allowed to violate the conditions,
while the total amount of these violations should not be increased. In fact, the soft
margin method tries to maximize the margin while minimizing the total violations.
Therefore, for each sample xi , a slack variable ξi ≥ 0, which indicates the amount of
violation of it from the correct margin, should be defined and the inequality of this
sample should be relaxed to

yi(wᵀxi + w0) ≥ 1 − ξi.  (2.23)

Moreover, the total violation is Σ_{i=1}^{N} ξi, which should be minimized. So, the primal optimization problem to obtain the separating hyperplane becomes

min_{w,w0,{ξi}} (1/2) wᵀw + C Σ_{i=1}^{N} ξi  (2.24)
s.t. yi(wᵀxi + w0) ≥ 1 − ξi, i = 1, . . . , N,
     ξi ≥ 0,

where C is a regularization parameter. A large value of C makes it costly to violate the conditions, so as C → ∞ this method becomes equivalent to the hard margin SVM. Conversely, if C is small, the conditions are easily relaxed and a large margin will be achieved. Therefore, a value of C appropriate to the problem and the purpose should be used.
If sample xi is categorized correctly, but located inside the margin, then 0 < ξi <
1. Otherwise ξi > 1, which is because sample xi is in the wrong category and is a
misclassified point (see Fig. 2.4).
As can be seen in Eq. 2.24, this problem is still a convex QP and unlike the hard
margin method, it has always obtained a feasible solution.
Here, the dual formulation of the soft margin SVM is discussed. The Lagrange
formulation is
Fig. 2.4 Soft margin SVM method: types of support vectors and their corresponding ξ values (margin support vectors: α > 0, ξ = 0)

L(w, w0, ξ, α, β) = (1/2)‖w‖² + C Σ_{i=1}^{N} ξi + Σ_{i=1}^{N} αi (1 − ξi − yi(wᵀxi + w0)) − Σ_{i=1}^{N} βi ξi,  (2.25)
where the αi's and βi's are the Lagrangian multipliers. The Lagrangian should be minimized with respect to w, w0, and the ξi's while being maximized with respect to the positive Lagrangian multipliers αi's and βi's. To find min_{w,w0,ξ} L(w, w0, ξ, α, β),
we have
∇w L(w, w0, ξ, α, β) = 0 → w = Σ_{i=1}^{N} αi yi xi,
∂L(w, w0, ξ, α, β)/∂w0 = 0 → Σ_{i=1}^{N} αi yi = 0,  (2.26)
∂L(w, w0, ξ, α, β)/∂ξi = 0 → C − αi − βi = 0.

By substituting w from the above equation in the Lagrangian function of Eq. 2.25,
the same equation as Eq. 2.16 is achieved. But, here two constraints on α are created.
One is the same as before and the other is 0 ≤ αi ≤ C. It should be noted that βi does
not appear in L(α) of Eq. 2.16 and just need to consider that βi ≥ 0. Therefore, the
condition C − αi − βi = 0 can be replaced by condition 0 ≤ αi ≤ C. So, the dual
form of problem Eq. 2.24 becomes as follows:

max_α Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj xiᵀxj  (2.27)
s.t. Σ_{i=1}^{N} αi yi = 0,
     0 ≤ αi ≤ C, i = 1, . . . , N.
After solving the above QP problem, w and w0 can be obtained based on Eqs. (2.20) and (2.21), respectively, with S = {i | xi ∈ SV} defined as before. As mentioned before, the set of support vectors (SV) is obtained based on the complementary slackness criterion αi* (1 − ξi − yi(w*ᵀxi + w0*)) = 0 for sample xi. Another complementary slackness condition of Eq. 2.24 is βi* ξi = 0. According to these, the support vectors are divided into two categories, the margin support vectors and the non-margin support vectors:

• If 0 < αi < C, then from the condition C − αi − βi = 0 we have βi > 0, and so ξi must be zero. Therefore, yi(wᵀxi + w0) = 1 holds and sample xi is located on the margin and is a margin support vector.
• If αi = C, we have βi = 0 and so ξi can be greater than zero. So, yi(wᵀxi + w0) ≤ 1 and sample xi is on or over the margin and is a non-margin support vector.

Also, the classification formula of a new sample in the soft margin SVM is the same
as hard margin SVM which is discussed in Eq. 2.22.

2.2 Nonlinear SVM Classifiers

Some problems have nonlinear decision surfaces and therefore linear methods do not
provide a suitable answer for them. The nonlinear SVM classifier was introduced by Vapnik (2000) for these problems. In this method, the input data is mapped
to a new feature space by a nonlinear function and a hyperplane is found in the
transformed feature space. By applying reverse mapping, the hyperplane becomes a
curve or a surface in the input data space (see Fig. 2.5).
Applying a transformation φ : Rd → Rk on the input space, sample x = [x1 , . . .
xd ] can be represented by φ(x) = [φ1 (x), . . . , φk (x)] in the transformed space, where
φi (x) : Rd → R. The primal problem of soft margin SVM in the transformed space
is as follows:

Fig. 2.5 Mapping input data to a high-dimensional feature space to have a linear separator hyperplane in the transformed space
min_{w,w0,{ξi}} (1/2) wᵀw + C Σ_{i=1}^{N} ξi  (2.28)
s.t. yi(wᵀφ(xi) + w0) ≥ 1 − ξi, i = 1, . . . , N,
     ξi ≥ 0,

where w ∈ Rk and w0 are the separator parameters that should be obtained.
The dual problem Eq. 2.27 also changes in the transformed space, as follows:

max_α Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj φ(xi)ᵀφ(xj)  (2.29)
s.t. Σ_{i=1}^{N} αi yi = 0,
     0 ≤ αi ≤ C, i = 1, . . . , N.

In this case, the classifier is defined as

ŷ = sign(w0 + wᵀφ(x)),  (2.30)

where w = Σ_{αi>0} αi yi φ(xi) and w0 = ysj − wᵀφ(xsj), such that xsj is a sample from the support vectors set.
The challenge of this method is choosing a suitable transformation. If the proper
mapping is used, the problem can be modeled with good accuracy in the transformed
space. But, if k ≫ d, then there are more parameters to learn and this takes more
computational time. In this case, it is better to use the kernel trick which is mentioned
in the next section. The kernel trick is a technique to use feature mapping without
explicitly applying it to the input data.

2.2.1 Kernel Trick and Mercer Condition

In the kernel-based approach, a linear separator in a high-dimensional space is


obtained without applying the transformation function to the input data. In the dual
form of the optimization problem Eq. 2.29 only the dot product of each pair of train-
ing samples exists. By computing the value of these inner products, the calculation
of α = [α1 , . . . , α N ] only remains. So, unlike the primal problem Eq. 2.28, learning
k parameters is not necessary here. The importance of this is well seen when k ≫ d.
Let us define K (x, t) = φ(x)T φ(t) as the kernel function. The basis of the kernel
trick is the calculation of K (x, t) without mapping samples x and t to the transformed
space. For example, consider a two-dimensional input dataset and define a kernel
function as follows:
K(x, t) = (1 + xᵀt)²  (2.31)
        = (1 + x1t1 + x2t2)²
        = 1 + 2x1t1 + 2x2t2 + x1²t1² + x2²t2² + 2x1t1x2t2.

As can be seen, the expansion of this kernel function equals the inner product of the following second-order φ's:

φ(x) = [1, √2 x1, √2 x2, x1², x2², √2 x1x2],  (2.32)
φ(t) = [1, √2 t1, √2 t2, t1², t2², √2 t1t2].

So, we can substitute the dot product φ(x)ᵀφ(t) with the kernel function K(x, t) = (1 + xᵀt)² without directly calculating φ(x) and φ(t).
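This identity is easy to verify numerically. The sketch below (the two test vectors are arbitrary) evaluates both sides of K(x, t) = φ(x)ᵀφ(t) for the second-order map of Eq. 2.32:

```python
import numpy as np

def phi(v):
    # Explicit second-order feature map of Eq. 2.32 for 2-D input
    v1, v2 = v
    return np.array([1.0, np.sqrt(2) * v1, np.sqrt(2) * v2,
                     v1 ** 2, v2 ** 2, np.sqrt(2) * v1 * v2])

x = np.array([0.3, -1.2])
t = np.array([2.0, 0.7])

kernel_value = (1.0 + x @ t) ** 2   # K(x, t) computed directly in the 2-D input space
feature_value = phi(x) @ phi(t)     # inner product in the 6-D feature space
```

The kernel evaluation touches only the 2-D inputs, while the right-hand side works in the 6-D feature space; both produce the same number.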
The polynomial kernel function can similarly be generalized to a d-dimensional feature space x = [x1, . . . , xd], where the φ's are polynomials of order M. For M = 2,

φ(x) = [1, √2 x1, . . . , √2 xd, x1², . . . , xd², √2 x1x2, . . . , √2 x1xd, √2 x2x3, . . . , √2 xd−1xd]ᵀ.

The following polynomial kernel function can indeed be efficiently computed with a cost proportional to d (the dimension of x) instead of k (the dimension of φ(x)):

K(x, t) = (1 + xᵀt)^M = (1 + x1t1 + x2t2 + · · · + xdtd)^M.  (2.33)

In many cases, the inner product in the embedding space can be computed efficiently by defining a kernel function. Some common kernel functions are listed below:

• K(x, t) = xᵀt,
• K(x, t) = (xᵀt + 1)^M,
• K(x, t) = exp(−‖x − t‖²/γ),
• K(x, t) = tanh(a xᵀt + b).

These functions are known as the linear, polynomial, Gaussian, and sigmoid kernels, respectively Cheng et al. (2017).
A valid kernel corresponds to an inner product in some feature space. A necessary and sufficient condition to check the validity of a kernel function is the Mercer condition Mercer (1909). This condition states that any symmetric positive definite matrix can be regarded as a kernel matrix. By restricting a kernel function to a set of points {x1, . . . , xN}, the corresponding kernel matrix K_{N×N} is the matrix whose element in row i and column j is K(xi, xj):

K = [ K(x1, x1) . . . K(x1, xN) ]
    [     ⋮       ⋱       ⋮    ]
    [ K(xN, x1) . . . K(xN, xN) ].  (2.34)
In other words, a real-valued function K(x, t) satisfies the Mercer condition if for any square-integrable function g(x), we have

∫∫ g(x) K(x, t) g(t) dx dt ≥ 0.  (2.35)
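On a finite sample, the Mercer condition amounts to the kernel matrix of Eq. 2.34 being symmetric positive semi-definite. A quick numerical sketch (the random data and the Gaussian kernel with a particular γ are arbitrary illustrative choices) is:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))    # 20 random points in R^3 (arbitrary sample)
gamma = 2.0

# Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / gamma)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / gamma)

symmetric = np.allclose(K, K.T)
# For a Mercer kernel, all eigenvalues are >= 0 up to round-off
min_eigenvalue = np.linalg.eigvalsh(K).min()
```

A matrix failing this check (a negative eigenvalue well below round-off) signals that the candidate "kernel" is not a valid inner product in any feature space.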

Therefore, using a valid and suitable kernel function, the optimization problem Eq. 2.29 becomes the following problem:

max_α Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj K(xi, xj)  (2.36)
s.t. Σ_{i=1}^{N} αi yi = 0,
     0 ≤ αi ≤ C, i = 1, . . . , N,

which is still a convex QP problem and is solved to find the αi's. Moreover, for classifying new data, the similarity of the input sample x is compared with all training data corresponding to the support vectors by the following formula:

ŷ = sign( w0 + Σ_{αi>0} αi yi K(xi, x) ),  (2.37)

where

w0 = ysj − Σ_{αi>0} αi yi K(xi, xsj),  (2.38)

such that xsj is an arbitrary sample from the support vectors set.
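As a sketch of the full pipeline of Eqs. 2.36–2.38 (the XOR-style dataset, the Gaussian kernel width, the value of C, and the use of a general-purpose SLSQP solver are all illustrative assumptions), a problem that no linear separator can solve becomes separable with a Gaussian kernel:

```python
import numpy as np
from scipy.optimize import minimize

# XOR-style data: not linearly separable in the input space
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
N, C = len(y), 10.0

def kernel(a, b):
    return np.exp(-np.sum((a - b) ** 2))   # Gaussian kernel with gamma = 1

K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
Q = (y[:, None] * y[None, :]) * K          # Q_ij = y_i y_j K(x_i, x_j)

res = minimize(
    lambda a: 0.5 * a @ Q @ a - a.sum(),   # negated dual objective of Eq. 2.36
    x0=np.zeros(N),
    method="SLSQP",
    bounds=[(0.0, C)] * N,                 # 0 <= alpha_i <= C
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],
)
alpha = res.x

s = int(np.argmax(alpha))                                      # a support vector
w0 = y[s] - sum(alpha[i] * y[i] * K[i, s] for i in range(N))   # Eq. 2.38
y_hat = np.sign(w0 + (alpha * y) @ K)                          # Eq. 2.37 at the training points
```

Note that the feature map is never formed explicitly: only kernel evaluations appear.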


Now it is important to mention some theorems about producing new kernels based on known kernel functions.

Theorem 2.1 A non-negative linear combination of some valid kernel functions is also a valid kernel function [Zanaty and Afifi (2011)].

Proof Let K1, . . . , Km satisfy the Mercer condition. We are going to show that K_new = Σ_{i=1}^{m} ai Ki, where ai ≥ 0, satisfies the Mercer condition too. We have

∫∫ g(x) K_new(x, t) g(t) dx dt = ∫∫ g(x) ( Σ_{i=1}^{m} ai Ki(x, t) ) g(t) dx dt
= Σ_{i=1}^{m} ai ∫∫ g(x) Ki(x, t) g(t) dx dt ≥ 0.  □
Theorem 2.2 The product of some valid kernel functions is also a valid kernel function [Zanaty and Afifi (2011)].

Proof A similar argument can be applied to this theorem.  □

The first implication of Theorems 2.1 and 2.2 is that if K is a valid Mercer kernel function, then any polynomial of K with positive coefficients is also a valid Mercer kernel function. In addition, exp(K) is also a valid Mercer kernel function (consider the Maclaurin expansion of exp(·) for the proof) Genton (2001), Shawe-Taylor and Cristianini (2004).
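These closure properties can be sanity-checked numerically. In the sketch below (the random data, the two base kernels, and the coefficients are arbitrary), a non-negative combination and the element-wise (Schur) product of two valid kernel matrices both remain positive semi-definite:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((15, 2))    # arbitrary sample points

K1 = X @ X.T                                          # linear kernel matrix
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K2 = np.exp(-sq)                                      # Gaussian kernel matrix

K_sum = 2.0 * K1 + 3.0 * K2   # Theorem 2.1: non-negative linear combination
K_prod = K1 * K2              # Theorem 2.2: element-wise product of the kernel matrices

min_eig_sum = np.linalg.eigvalsh(K_sum).min()
min_eig_prod = np.linalg.eigvalsh(K_prod).min()
```

The product of kernel functions corresponds to the element-wise product of their kernel matrices, which the Schur product theorem keeps positive semi-definite.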

2.3 SVM Regressors

The support vector machine model for classification tasks can be modified for regression problems. This can be achieved by applying the ε-insensitive loss function to the model Drucker et al. (1997). This loss function is defined as

loss(y, ŷ; ε) = |y − ŷ|_ε = { 0,            |y − ŷ| ≤ ε,
                              |y − ŷ| − ε,  otherwise,

where ε ∈ R⁺ is a hyper-parameter which specifies the ε-tube. Every predicted point outside this tube is incorporated into the model with a penalty. The unknown function is defined as

y(x) = wᵀφ(x) + w0,

where x ∈ Rd and y ∈ R. The formulation of the primal form of the support vector regression model is constructed as

min_{w,w0,ξ,ξ*} (1/2) wᵀw + C Σ_{i=1}^{N} (ξi + ξi*)  (2.39)
s.t. yi − wᵀφ(xi) − w0 ≤ ε + ξi, i = 1, . . . , N,
     wᵀφ(xi) + w0 − yi ≤ ε + ξi*, i = 1, . . . , N,
     ξi, ξi* ≥ 0, i = 1, . . . , N.
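Of the pieces above, the ε-insensitive loss itself is the simplest to make concrete. The sketch below (example values are arbitrary) shows that errors inside the ε-tube cost nothing, while larger errors are penalized linearly:

```python
import numpy as np

def eps_insensitive_loss(y, y_hat, eps):
    # |y - y_hat|_eps: zero inside the eps-tube, linear outside it
    return np.maximum(np.abs(y - y_hat) - eps, 0.0)

inside = eps_insensitive_loss(1.0, 1.05, eps=0.1)    # |error| = 0.05 <= eps
outside = eps_insensitive_loss(1.0, 1.30, eps=0.1)   # |error| = 0.30 > eps
```

The slack variables ξi and ξi* in Eq. 2.39 measure exactly the part of the error that exceeds the tube, one variable per side.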

Using the Lagrangian multipliers, the dual form of this optimization problem leads to

max_{α,α*} −(1/2) Σ_{i,j=1}^{N} (αi − αi*)(αj − αj*) K(xi, xj) − ε Σ_{i=1}^{N} (αi + αi*) + Σ_{i=1}^{N} yi (αi − αi*)
s.t. Σ_{i=1}^{N} (αi − αi*) = 0,
     αi, αi* ∈ [0, C].

The bias variable w0 follows from the KKT conditions:

w0 = (1/|S|) Σ_{s∈S} ( ys − ε − Σ_{i=1}^{N} (αi − αi*) K(xs, xi) ),
where |S| is the cardinality of the support vectors set. The unknown function y(x) in the dual form can be computed as

y(x) = Σ_{i=1}^{N} (αi − αi*) K(x, xi) + w0.

2.4 LS-SVM Classifiers

The least squares support vector machine (LS-SVM) is a modification of SVM for-
mulation for machine learning tasks which was originally proposed by Suykens and
Vandewalle (1999). The LS-SVM replaces the inequality constraints of SVM’s pri-
mal model with equality ones. Also, the slack variables loss function changes to the
squared error loss function. Taking advantage of these changes, the dual problem
leads to a system of linear equations. Solving this system of linear equations can
be more computationally efficient than a quadratic programming problem in some
cases. Although this reformulation preserves the kernel trick property, the sparseness
of the model is lost. Here, the formulation of LS-SVM for two-class classification
tasks will be described.
As in the support vector machine, the LS-SVM considers a separating hyperplane in a feature space:

y(x) = sign(wᵀφ(x) + w0).
Then, the primal form is formulated as

min_{w,w0,e} (1/2) wᵀw + (γ/2) Σ_{i=1}^{N} ei²
s.t. yi(wᵀφ(xi) + w0) = 1 − ei, i = 1, . . . , N,

where γ is the regularization parameter and the ei are slack variables which can be positive or negative. The corresponding Lagrangian function is

L(w, w0, e, α) = (1/2) wᵀw + (γ/2) Σ_{i=1}^{N} ei² − Σ_{i=1}^{N} αi ( yi(wᵀφ(xi) + w0) − 1 + ei ).

The optimality conditions yield

∂L/∂w = 0 → w = Σ_{i=1}^{N} αi yi φ(xi),
∂L/∂w0 = 0 → Σ_{i=1}^{N} αi yi = 0,
∂L/∂ei = 0 → αi = γ ei, i = 1, . . . , N,  (2.40)
∂L/∂αi = 0 → yi(wᵀφ(xi) + w0) − 1 + ei = 0, i = 1, . . . , N,

which can be written in the matrix form

[ 0   yᵀ      ] [ w0 ]   [ 0  ]
[ y   Ω + I/γ ] [ α  ] = [ 1v ],

where

Zᵀ = [φ(x1)ᵀ y1; . . . ; φ(xN)ᵀ yN],
y = [y1; . . . ; yN],
1v = [1; . . . ; 1],
e = [e1; . . . ; eN],
α = [α1; . . . ; αN],

and

Ω_{i,j} = yi yj φ(xi)ᵀφ(xj) = yi yj K(xi, xj), i, j = 1, . . . , N.
The classifier in the dual form takes the form

y(x) = sign( Σ_{i=1}^{N} αi yi K(xi, x) + w0 ),

where K(x, t) is a valid kernel function.
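Because the LS-SVM dual is just an (N+1)×(N+1) linear system, training amounts to a single linear solve. The sketch below (the toy data, linear kernel, and value of γ are illustrative assumptions) assembles and solves the system above:

```python
import numpy as np

# Hypothetical linearly separable toy data, linear kernel K(x, t) = x^T t
X = np.array([[2.0, 2.0], [2.5, 1.5], [0.0, 0.0], [-0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N, gamma = len(y), 100.0

K = X @ X.T
Omega = (y[:, None] * y[None, :]) * K        # Omega_ij = y_i y_j K(x_i, x_j)

# Block system [[0, y^T], [y, Omega + I/gamma]] [w0; alpha] = [0; 1_v]
A = np.zeros((N + 1, N + 1))
A[0, 1:] = y
A[1:, 0] = y
A[1:, 1:] = Omega + np.eye(N) / gamma
b = np.concatenate(([0.0], np.ones(N)))

solution = np.linalg.solve(A, b)
w0, alpha = solution[0], solution[1:]

y_hat = np.sign((alpha * y) @ K + w0)        # dual classifier at the training points
```

Unlike the QP-based SVM, every αi here is typically nonzero, which is the loss of sparseness mentioned above.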

2.5 LS-SVM Regressors

The LS-SVM for function estimation, known as LS-SVR, is another type of LS-SVM, which deals with regression problems Suykens et al. (2002). The LS-SVR primal form can be related to the ridge regression model, but for a feature map that is not explicitly defined, a dual form in the kernel-trick sense can be constructed. The LS-SVM for function estimation considers the unknown function as

y(x) = wᵀφ(x) + w0,

and then formulates the primal optimization problem as

min_{w,w0,e} (1/2) wᵀw + (γ/2) Σ_{i=1}^{N} ei²
s.t. yi = wᵀφ(xi) + w0 + ei, i = 1, . . . , N.

By applying the Lagrangian function and computing the optimality conditions, the dual form of the problem takes the form

$$\begin{bmatrix} 0 & 1_v^T \\ 1_v & \Omega + I/\gamma \end{bmatrix} \begin{bmatrix} w_0 \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix},$$

where

$$y = [y_1; \ldots; y_N], \quad 1_v = [1; \ldots; 1], \quad \alpha = [\alpha_1; \ldots; \alpha_N],$$
$$\Omega_{i,j} = \phi(x_i)^T \phi(x_j) = K(x_i, x_j).$$

Also, the unknown function in the dual form can be described as

$$y(x) = \sum_{i=1}^{N} \alpha_i K(x_i, x) + w_0,$$

with a valid kernel function K (x, t).



References

Cheng, K., Lu, Z., Wei, Y., Shi, Y., Zhou, Y.: Mixed kernel function support vector regression for
global sensitivity analysis. Mech. Syst. Signal Process. 96, 201–214 (2017)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)
Drucker, H., Burges, C.J.C., Kaufman, L., Smola, A.J., Vapnik, V.: Support vector regression
machines. Adv. Neural Inf. Process. Syst. 9, 155–161 (1997)
Frank, M., Wolfe, P.: An algorithm for quadratic programming. Nav. Res. Logist. Q. 3, 95–110
(1956)
Genton, M.G.: Classes of kernels for machine learning: a statistics perspective. J. Mach. Learn.
Res. 2, 299–312 (2001)
Karush, W.: Minima of functions of several variables with inequalities as side constraints. M.Sc.
Dissertation. Department of Mathematics, University of Chicago (1939)
Kuhn, H.W., Tucker, A.W.: Nonlinear programming. In: Berkeley Symposium on Mathematical
Statistics and Probability. University of California Press, Berkeley (1951)
Mercer, J.: XVI. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London, Series A 209, 415–446 (1909)
Murty, K.G., Yu, F.T.: Linear Complementarity, Linear and Nonlinear Programming. Helderman,
Berlin (1988)
Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press,
UK (2004)
Suykens, J.A.K., Vandewalle, J.: Least squares support vector machine classifiers. Neural Process.
Lett. 9, 293–300 (1999)
Suykens, J.A.K., Gestel, T.V., Brabanter, J.D., Moor, B.D., Vandewalle, J.: Least Squares Support
Vector Machines. World Scientific, Singapore (2002)
Vapnik, V., Chervonenkis, A.: A note on one class of perceptrons. Autom. Remote Control 44, 103–109 (1964)
Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Berlin (2000)
Zanaty, E.A., Afifi, A.: Support vector machines (SVMs) with universal kernels. Appl. Artif. Intell. 25, 575–589 (2011)
Part II
Special Kernel Classifiers
Chapter 3
Fractional Chebyshev Kernel Functions: Theory and Application

Amir Hosein Hadian Rasanan, Sherwin Nedaei Janbesaraei, and Dumitru Baleanu

Abstract Orthogonal functions have many useful properties and can be used for
different purposes in machine learning. One of the main applications of the orthogonal
functions is producing powerful kernel functions for the support vector machine
algorithm. Perhaps the simplest orthogonal functions that can be used for producing kernel functions are the Chebyshev polynomials. In this chapter, after reviewing some
essential properties of Chebyshev polynomials and fractional Chebyshev functions,
various Chebyshev kernel functions are presented, and fractional Chebyshev kernel
functions are introduced. Finally, the performance of the various Chebyshev kernel
functions is illustrated on two sample datasets.

Keywords Chebyshev polynomial · Fractional Chebyshev functions · Kernel trick · Orthogonal functions · Mercer's theorem

3.1 Introduction

As mentioned in the previous chapters, the kernel function plays a crucial role in the
performance of the SVM algorithms. In the literature of the SVM, various kernel
functions have been developed and applied on several datasets (Hussain et al. 2011;
Ozer et al. 2011; An-na et al. 2010; Padierna et al. 2018; Tian and Wang 2017), but
each kernel function has its own benefits and limitations (Achirul Nanda et al. 2018;
Hussain et al. 2011). The radial basis function (RBF) (Musavi et al. 1992; Scholkopf

A. H. Hadian Rasanan (B)


Department of Cognitive Modeling, Institute for Cognitive and Brain Sciences, Shahid Beheshti
University, Tehran, Iran
e-mail: [email protected]
S. Nedaei Janbesaraei
School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
D. Baleanu
Department of Mathematics, Cankaya University, Ankara 06530, Turkey
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 39
J. A. Rad et al. (eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support
Vector Machines, Industrial and Applied Mathematics,
https://doi.org/10.1007/978-981-19-6553-1_3
40 A. H. Hadian Rasanan et al.

et al. 1997) and polynomial kernel functions (Reddy et al. 2014; Yaman and Pele-
canos 2013) perhaps are the most popular ones, because they are easy to learn,
have acceptable performance in pattern classification, and are very computationally
efficient. However, there are many other examples where the performance of those
kernels is not satisfactory (Moghaddam and Hamidzadeh 2016). One of the well-
established alternatives to these two kernels is the orthogonal kernel functions which
have many useful properties embedded in their nature. These orthogonal functions
are very useful in various fields of science as well as machine learning (Hajimo-
hammadi et al. 2021; Tian and Wang 2017; Sun et al. 2015). It can be said that the
simplest family of these functions is Chebyshev. This family of orthogonal functions
has been used in different cases such as signal and image processing (Shuman et al.
2018), digital filtering (Pavlović et al. 2013), spectral graph theory (Hadian Rasanan
et al. 2019), astronomy (Capozziello et al. 2018), numerical analysis (Sedaghat et al.
2012; Zhao et al. 2017; Shaban et al. 2013; Kazem et al. 2012; Parand et al. 2019;
Hadian-Rasanan and Rad 2020; Kazem et al. 2017), and machine learning (Mall
and Chakraverty 2020). On the other hand, in the literature of numerical analysis
and scientific computing, the Chebyshev polynomials have been used for solving
various problems in fluid dynamics (Parand et al. 2017), theoretical physics (Parand
and Delkhosh 2017), control (Hassani et al. 2019), and finance (Glau et al. 2019;
Mesgarani et al. 2021). One of the exciting applications of the Chebyshev polyno-
mials has been introduced by Mall and Chakraverty (Mall and Chakraverty 2015),
where they used the Chebyshev polynomials as an activation function of functional
link neural network. The functional link neural network is a kind of single-layer neu-
ral network which utilizes orthogonal polynomials as the activation function. This
framework based on Chebyshev polynomials has been used for solving various types
of equations such as ordinary, partial, or system of differential equations (Mall and
Chakraverty 2017; Chakraverty and Mall 2020; Omidi et al. 2021).

Chebyshev polynomials are named after the Russian mathematician Pafnuty Lvovich Chebyshev (1821–1894). P.L. Chebyshev, the "extraordinary Russian mathematician", made significant contributions to mathematics during his career. His papers on the calculation of roots of equations, multiple integrals, the convergence of Taylor series, the theory of probability, Poisson's weak law of large numbers, and integration by means of logarithms brought him worldwide fame. Chebyshev wrote an important book titled "Teoria sravneny", which was submitted for his doctorate in 1849 and won a prize from the Académie des Sciences of Paris. In 1854, he introduced the famous Chebyshev orthogonal polynomials. He held a professorship at St. Petersburg University for 22 years. He died in St. Petersburg on November 26, 1894.a

a For more information about Chebyshev and his contribution to orthogonal polynomials visit: http://mathshistory.st-andrews.ac.uk/Biographies/Chebyshev.html.
3 Fractional Chebyshev Kernel Functions: Theory and Application 41

Utilizing the properties of orthogonal functions in machine learning has always been attractive for researchers. The first Chebyshev kernel function was introduced in 2006 by Ye et al. (2006). Thereafter, the use of orthogonal Chebyshev kernels has been investigated by many researchers (Ozer and Chen 2008; Ozer et al. 2011; Jafarzadeh et al. 2013; Zhao et al. 2013). In 2011, Ozer et al. proposed a set of generalized Chebyshev kernels which made it possible for the input to be a vector rather than a single variable (Ozer et al. 2011). The generalized Chebyshev kernel made it easy to build modified Chebyshev kernels in which the weight function can be an exponential function; for example, Ozer et al. specifically used the Gaussian kernel function, which brings more nonlinearity into the kernel. Hence, the combination of the Chebyshev kernel with other well-known kernels has been investigated too; for instance, the Chebyshev-Gaussian and Chebyshev-Wavelet (Jafarzadeh et al. 2013) kernel functions are based on such combinations. Also, Jinwei Zhao et al. (2013) proposed the unified Chebyshev polynomials, a new sequence of orthogonal polynomials, and consequently the unified Chebyshev kernel, obtained by combining Chebyshev polynomials of the first and second kinds, which has been shown to have excellent generalization performance and prediction accuracy.

The classical orthogonal polynomials proved to be highly efficient in many applied problems. Their uniform proximity and orthogonality have attracted researchers in the kernel learning domain. In fact, the approximation of real-valued functions using n-dimensional Hermite polynomials was first introduced by Vladimir Vapnik (Statistical Learning Theory, Wiley, USA, 1998).
in 2006, Ye and colleagues Ye et al. (2006) constructed an orthogonal Cheby-
shev kernel based on the Chebyshev polynomials of the first kind. Based on
what is said in Ye et al. (2006), “As Chebyshev polynomial has the best
uniform proximity and its orthogonality promises the minimum data redun-
dancy in feature space, it is possible to represent the data with fewer support
vectors.”

One of the recent improvements in the field of special functions is the development
of fractional orthogonal polynomials (Kazem et al. 2013). The fractional orthogonal
functions can be obtained by some nonlinear transformation function and have much
better performance in function approximation (Dabiri et al. 2017; Kheyrinataj and
Nazemi 2020; Habibli and Noori Skandari 2019). However, these functions have not
been used as a kernel. Thus, this chapter first aims to present essential backgrounds
for Chebyshev polynomial and its fractional version, then bring all the Chebyshev
kernel functions together, and finally introduce and examine the fractional Chebyshev
kernel functions.
This chapter is organized as follows. The basic definitions and properties of
orthogonal Chebyshev polynomials and the fractional Chebyshev functions are pre-
sented in Sect. 3.2. Then, the ordinary Chebyshev kernel function and two previously
proposed kernels based on these polynomials are discussed and the novel fractional
Chebyshev kernel functions are introduced in Sect. 3.3. In Sect. 3.4, the results of

experiments of both the ordinary Chebyshev kernel and the fractional one are cov-
ered and then a comparison between the obtained accuracy results of the mentioned
kernels and RBF and polynomial kernel functions in the SVM algorithm on well-
known datasets is exhibited to specify the validity and efficiency of the fractional
kernel functions. Finally, in Sect. 3.5, conclusion remarks of this chapter are pre-
sented.

3.2 Preliminaries

Definition and basic properties of Chebyshev orthogonal polynomials of the first kind are presented in this section. In addition to the basics of these polynomials, the fractional form of this family is presented and discussed.

3.2.1 Properties of Chebyshev Polynomials

There are four kinds of Chebyshev polynomials, but the focus of this chapter is on
introducing the first kind of Chebyshev polynomials which is denoted by Tn (x). The
interested readers can investigate the other types of Chebyshev polynomials in Boyd
(2001).
The power of polynomials originally comes from their relation with trigonometric
functions (sine and cosine) that are very useful in describing all kinds of natural
phenomena (Mason and Handscomb 2002). Since T_n(x) is a polynomial, it is possible to define T_n(x) using trigonometric relations.
Let z be a complex number on the unit circle |z| = 1, where θ ∈ [0, 2π] is the argument of z; in other words,

$$x = \Re z = \frac{1}{2}\left(z + z^{-1}\right) = \cos\theta \in [-1, 1], \tag{3.1}$$

where ℜ denotes the real part of a complex number.
Since the Chebyshev polynomials of the first kind are denoted by T_n(x), one can define (Mason and Handscomb 2002; Boyd 2001; Shen et al. 2011; Asghari et al. 2022):

$$T_n(x) = \Re z^n = \frac{1}{2}\left(z^n + z^{-n}\right) = \cos n\theta. \tag{3.2}$$
Therefore, the n-th order Chebyshev polynomial can be obtained by

$$T_n(x) = \cos n\theta, \quad \text{where } x = \cos\theta. \tag{3.3}$$

The relation between x, z and θ is illustrated in Fig. 3.1:



Fig. 3.1 The plot of the relation between x, z, and θ

Based on definition (3.2), the Chebyshev polynomials for some n are defined as
follows:
$$\Re z^0 = \frac{1}{2}\left(z^0 + z^0\right) = 1 \;\Longrightarrow\; T_0(x) = 1, \tag{3.4}$$
$$\Re z^1 = \frac{1}{2}\left(z + z^{-1}\right) \;\Longrightarrow\; T_1(x) = x, \tag{3.5}$$
$$\Re z^2 = \frac{1}{2}\left(z^2 + z^{-2}\right) \;\Longrightarrow\; T_2(x) = 2x^2 - 1, \tag{3.6}$$
$$\Re z^3 = \frac{1}{2}\left(z^3 + z^{-3}\right) \;\Longrightarrow\; T_3(x) = 4x^3 - 3x, \tag{3.7}$$
$$\Re z^4 = \frac{1}{2}\left(z^4 + z^{-4}\right) \;\Longrightarrow\; T_4(x) = 8x^4 - 8x^2 + 1. \tag{3.8}$$
Consequently, by considering this definition for Tn+1 (x) as follows:
$$T_{n+1}(x) = \frac{1}{2}\left(z^{n+1} + z^{-n-1}\right) = \frac{1}{2}\left(z^n + z^{-n}\right)\left(z + z^{-1}\right) - \frac{1}{2}\left(z^{n-1} + z^{1-n}\right) = 2\cos(n\theta)\cos(\theta) - \cos((n-1)\theta), \tag{3.9}$$

the following recursive relation can be obtained for the Chebyshev polynomials:

$$T_0(x) = 1, \quad T_1(x) = x, \quad T_n(x) = 2x T_{n-1}(x) - T_{n-2}(x), \quad n \geq 2. \tag{3.10}$$

Thus, any order of the Chebyshev polynomials can be generated using this recursive
formula.
Additional to the recursive formula, the Chebyshev polynomials can be obtained
directly by the following expansion for n ∈ Z+ (Mason and Handscomb 2002; Boyd
2001; Shen et al. 2011; Asghari et al. 2022):

$$T_n(x) = \sum_{k=0}^{\lfloor n/2 \rfloor} (-1)^k \frac{n!}{(2k)!\,(n-2k)!} \left(1 - x^2\right)^k x^{n-2k}. \tag{3.11}$$
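The expansion (3.11) can be cross-checked against the trigonometric definition (3.3); the following sketch compares the two definitions at a few sample points in [−1, 1].

```python
import math

def T_explicit(n, x):
    # Eq. (3.11): sum_{k=0}^{floor(n/2)} (-1)^k n!/((2k)!(n-2k)!) (1-x^2)^k x^(n-2k)
    return sum((-1) ** k * math.factorial(n)
               / (math.factorial(2 * k) * math.factorial(n - 2 * k))
               * (1 - x * x) ** k * x ** (n - 2 * k)
               for k in range(n // 2 + 1))

def T_cos(n, x):
    # Eq. (3.3): T_n(x) = cos(n arccos x), valid for -1 <= x <= 1
    return math.cos(n * math.acos(x))

for n in range(8):
    for xv in (-0.9, -0.3, 0.0, 0.4, 0.8):
        assert abs(T_explicit(n, xv) - T_cos(n, xv)) < 1e-9
print("Eq. (3.11) agrees with cos(n arccos x)")
```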

The other way to obtain the Chebyshev polynomials is by solving their Sturm-
Liouville differential equation.
Theorem 3.1 (Mason and Handscomb (2002); Boyd (2001); Shen et al. (2011);
Asghari et al. (2022)) Tn (x) is the solution of the following second-order linear
Sturm-Liouville differential equation:
$$(1 - x^2)\frac{d^2 y}{dx^2} - x\frac{dy}{dx} + n^2 y = 0, \tag{3.12}$$
where −1 < x < 1 and n is an integer number.
Proof By considering \(\frac{d}{dx}\cos^{-1}(x) = -\frac{1}{\sqrt{1-x^2}}\), we have

$$\frac{d}{dx}T_n(x) = \frac{d}{dx}\cos(n\cos^{-1}x) = \frac{n}{\sqrt{1-x^2}}\sin(n\cos^{-1}x). \tag{3.13}$$

In a similar way, we can write that
$$\frac{d^2}{dx^2}T_n(x) = \frac{d}{dx}\left(\frac{n}{\sqrt{1-x^2}}\sin(n\cos^{-1}x)\right) = \frac{nx}{(1-x^2)^{3/2}}\sin(n\cos^{-1}x) - \frac{n^2}{1-x^2}\cos(n\cos^{-1}x).$$

So, by replacing the derivatives with their obtained formulas in (3.12), the following is yielded:

$$(1 - x^2)\frac{d^2 T_n(x)}{dx^2} - x\frac{dT_n(x)}{dx} + n^2 T_n(x) = \frac{nx}{\sqrt{1-x^2}}\sin(n\cos^{-1}x) - n^2\cos(n\cos^{-1}x) - \frac{nx}{\sqrt{1-x^2}}\sin(n\cos^{-1}x) + n^2\cos(n\cos^{-1}x) = 0,$$
which yields that Tn (x) is a solution to (3.12). 


There are several ways to use these polynomials in Python such as using the
exact formula (3.3), using the NumPy package,1 and also implementing the recursive
formula (3.10). So, the symbolic Python implementation of this recursive formula
using the sympy library is

Program Code

import sympy

x = sympy.Symbol("x")

def Tn(x, n):
    if n == 0:
        return 1
    elif n == 1:
        return x
    elif n >= 2:
        return 2 * x * Tn(x, n - 1) - Tn(x, n - 2)

sympy.expand(sympy.simplify(Tn(x, 3)))

> 4x^3 − 3x

In the above code, the third-order Chebyshev polynomial of the first kind is generated, which is equal to 4x^3 − 3x.
In order to explore the behavior of the Chebyshev polynomials, there are some
useful properties that are available. The first property of Chebyshev polynomials is
their orthogonality.
Theorem 3.2 (Mason and Handscomb (2002); Boyd (2001); Shen et al. (2011))
{Tn (x)} forms a sequence of orthogonal polynomials which are orthogonal to each
other over the interval [−1, 1], with respect to the following weight function:
$$w(x) = \frac{1}{\sqrt{1 - x^2}}, \tag{3.14}$$

so it can be concluded that
1 https://numpy.org/doc/stable/reference/routines.polynomials.chebyshev.html.
$$\int_{-1}^{1} T_n(x) T_m(x) w(x)\, dx = \frac{\pi c_n}{2}\,\delta_{n,m}, \tag{3.15}$$

where c_0 = 2, c_n = 1 for n ≥ 1, and δ_{n,m} is the Kronecker delta function.

Proof Suppose x = cos θ and n ≠ m; then we have

$$\int_{-1}^{1} T_n(x) T_m(x)\frac{1}{\sqrt{1-x^2}}\, dx = \int_{\pi}^{0} T_n(\cos\theta) T_m(\cos\theta)\frac{-\sin\theta}{\sqrt{1-\cos^2\theta}}\, d\theta \tag{3.16}$$
$$= \int_{0}^{\pi} \cos n\theta \cos m\theta\, d\theta = \int_{0}^{\pi} \frac{1}{2}\left[\cos(n+m)\theta + \cos(n-m)\theta\right] d\theta \tag{3.17}$$
$$= \frac{1}{2}\left[\frac{1}{n+m}\sin(n+m)\theta + \frac{1}{m-n}\sin(m-n)\theta\right]_0^{\pi} = 0.$$

In the case of n = m ≠ 0, we have

$$\int_{-1}^{1} T_n(x) T_m(x)\frac{1}{\sqrt{1-x^2}}\, dx = \int_{0}^{\pi} \cos^2 n\theta\, d\theta = \int_{0}^{\pi} \frac{1}{2}(1 + \cos 2n\theta)\, d\theta = \frac{1}{2}\left[\theta + \frac{1}{2n}\sin 2n\theta\right]_0^{\pi} = \frac{\pi}{2}. \tag{3.18}$$

Also, in the case where n = m = 0, we can write

$$\int_{-1}^{1} T_n(x) T_m(x)\frac{1}{\sqrt{1-x^2}}\, dx = \int_{0}^{\pi} 1\, d\theta = \pi. \tag{3.19}$$

The orthogonality of these polynomials provides an efficient framework for computing the kernel matrix, which will be discussed later. Computational efficiency is not the only notable property of the Chebyshev polynomials. For example, the maximum absolute value of T_n(x) is no more than 1, since for −1 ≤ x ≤ 1, T_n(x) is defined as the cosine of an argument. This property prevents overflow during computation. Besides, the Chebyshev polynomial of order n > 0 has exactly n roots and n + 1 local extrema in the interval [−1, 1], two of which are the endpoints ±1 (Mason and Handscomb 2002).

Fig. 3.2 Chebyshev polynomials of first kind up to sixth order

So, for k = 1, 2, ..., n, the roots of T_n(x) are obtained as follows:
$$x_k = \cos\frac{(2k-1)\pi}{2n}. \tag{3.20}$$

Note that x = 0 is a root of T_n(x) for all odd orders n; the other roots are symmetrically placed on either side of x = 0. For k = 0, 1, ..., n, the extrema of the Chebyshev polynomial of the first kind can be found using

$$x = \cos\frac{\pi k}{n}. \tag{3.21}$$
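Formulas (3.20) and (3.21) are easy to verify numerically; a short check for n = 7 using the trigonometric definition (3.3):

```python
import math

def T(n, x):
    # T_n(x) = cos(n arccos x)
    return math.cos(n * math.acos(x))

n = 7
roots = [math.cos((2 * k - 1) * math.pi / (2 * n)) for k in range(1, n + 1)]
extrema = [math.cos(math.pi * k / n) for k in range(n + 1)]

# T_n vanishes at the n roots and reaches |T_n| = 1 at the n + 1 extrema
print(max(abs(T(n, r)) for r in roots))
print(max(abs(abs(T(n, e)) - 1.0) for e in extrema))
```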

Chebyshev polynomials have even and odd symmetry, meaning even orders contain only even powers of x and odd orders only odd powers of x. Therefore, we have

$$T_n(x) = (-1)^n T_n(-x) = \begin{cases} T_n(-x), & n \text{ even} \\ -T_n(-x), & n \text{ odd} \end{cases}. \tag{3.22}$$

Additionally, there exist some other properties of the Chebyshev polynomials at the boundaries:
• Tn (1) = 1,
• Tn (−1) = (−1)n ,
• T2n (0) = (−1)n ,
• T2n+1 (0) = 0.
Figure 3.2 demonstrates Chebyshev polynomials of the first kind up to sixth order.

3.2.2 Properties of Fractional Chebyshev Functions

Fractional Chebyshev functions are a generalization of the Chebyshev polynomials. The main difference between the Chebyshev polynomials and the fractional Chebyshev functions is that the power of x can be any positive real number in the latter. This generalization seems to improve the approximation power of the Chebyshev polynomials. To introduce the fractional Chebyshev functions, one first needs a mapping function which is used to define the fractional Chebyshev function of order α over a finite interval [a, b] (Parand and Delkhosh 2016):

$$x' = 2\left(\frac{x-a}{b-a}\right)^{\alpha} - 1. \tag{3.23}$$

By utilizing this transformation, the fractional Chebyshev functions can be obtained as follows (Parand and Delkhosh 2016):

$$FT_n^{\alpha}(x) = T_n(x') = T_n\left(2\left(\frac{x-a}{b-a}\right)^{\alpha} - 1\right), \quad a \leq x \leq b, \tag{3.24}$$

where α ∈ R⁺ is the "fractional order" of the function, chosen according to the context. Considering Eq. 3.10, the recursive form of the fractional Chebyshev functions is defined as (Parand and Delkhosh 2016)

$$FT_0^{\alpha}(x) = 1, \quad FT_1^{\alpha}(x) = 2\left(\frac{x-a}{b-a}\right)^{\alpha} - 1,$$
$$FT_n^{\alpha}(x) = 2\left(2\left(\frac{x-a}{b-a}\right)^{\alpha} - 1\right) FT_{n-1}^{\alpha}(x) - FT_{n-2}^{\alpha}(x), \quad n \geq 2. \tag{3.25}$$

The formulation of some of these functions is presented here:

$$FT_0^{\alpha}(x) = 1,$$
$$FT_1^{\alpha}(x) = 2\left(\frac{x-a}{b-a}\right)^{\alpha} - 1,$$
$$FT_2^{\alpha}(x) = 8\left(\frac{x-a}{b-a}\right)^{2\alpha} - 8\left(\frac{x-a}{b-a}\right)^{\alpha} + 1,$$
$$FT_3^{\alpha}(x) = 32\left(\frac{x-a}{b-a}\right)^{3\alpha} - 48\left(\frac{x-a}{b-a}\right)^{2\alpha} + 18\left(\frac{x-a}{b-a}\right)^{\alpha} - 1,$$
$$FT_4^{\alpha}(x) = 128\left(\frac{x-a}{b-a}\right)^{4\alpha} - 256\left(\frac{x-a}{b-a}\right)^{3\alpha} + 160\left(\frac{x-a}{b-a}\right)^{2\alpha} - 32\left(\frac{x-a}{b-a}\right)^{\alpha} + 1. \tag{3.26}$$

The readers can use the following Python code to generate any order of Chebyshev
Polynomials of Fractional Order symbolically:

Program Code

import sympy

x = sympy.Symbol("x")
alpha = sympy.Symbol(r'\alpha')
a = sympy.Symbol("a")
b = sympy.Symbol("b")
# the fractional mapping x' = 2((x - a)/(b - a))**alpha - 1
x = sympy.sympify(2 * ((x - a) / (b - a))**alpha - 1)

def FTn(x, n):
    if n == 0:
        return 1
    elif n == 1:
        return x
    elif n >= 2:
        return 2 * x * FTn(x, n - 1) - FTn(x, n - 2)

For example, the fifth order can be generated as

Program Code

sympy.expand(sympy.simplify(FTn(x, 5)))

> 512((x−a)/(b−a))^{5α} − 1280((x−a)/(b−a))^{4α} + 1120((x−a)/(b−a))^{3α} − 400((x−a)/(b−a))^{2α} + 50((x−a)/(b−a))^{α} − 1
As the Chebyshev polynomials are orthogonal with respect to the weight function \(\frac{1}{\sqrt{1-x^2}}\) over [−1, 1], using \(x' = 2\left(\frac{x-a}{b-a}\right)^{\alpha} - 1\) one can conclude that the fractional Chebyshev functions are also orthogonal over a finite interval with respect to the following weight function:

$$w(x) = \frac{1}{\sqrt{1 - \left(2\left(\frac{x-a}{b-a}\right)^{\alpha} - 1\right)^{2}}}. \tag{3.27}$$

Therefore, we can define the orthogonality relation of the fractional Chebyshev polynomials as

$$\int_{-1}^{1} T_n(x') T_m(x') w(x')\, dx' = \int_{a}^{b} FT_n^{\alpha}(x) FT_m^{\alpha}(x) w(x)\, dx = \frac{\pi}{2} c_n \delta_{mn}, \tag{3.28}$$

where c_0 = 2, c_n = 1 for n ≥ 1, and δ_{mn} is the Kronecker delta function.
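Relation (3.28) can be checked numerically with a Gauss–Chebyshev quadrature rule transplanted through the fractional mapping: with N nodes placed at the zeros of FT_N^α, the rule ∫ f w dx ≈ (π/N) Σ_k f(x_k) is exact for the low-order products below. The choices a = 0, b = 1, α = 0.5, and N = 16 are illustrative.

```python
import math

def FT(n, x, alpha):
    # FT_n^alpha on [0, 1]: T_n evaluated at x' = 2 x**alpha - 1
    xp = min(1.0, max(-1.0, 2 * x ** alpha - 1))
    return math.cos(n * math.acos(xp))

alpha, N = 0.5, 16
# Chebyshev quadrature nodes pulled back through the fractional mapping
nodes = [((1 - math.cos((2 * k - 1) * math.pi / (2 * N))) / 2) ** (1 / alpha)
         for k in range(1, N + 1)]

def inner(n, m):
    # quadrature approximation of the weighted inner product in Eq. (3.28)
    return (math.pi / N) * sum(FT(n, x, alpha) * FT(m, x, alpha) for x in nodes)

print(inner(2, 3), inner(3, 3), inner(0, 0))
```

The three printed values approximate 0, π/2, and π, matching (3.28) with c_n = 1 for n ≥ 1 and c_0 = 2.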


Fig. 3.3 Fractional Chebyshev functions of the first kind up to sixth order where a = 0, b = 5, and α = 0.5

Fig. 3.4 Fractional Chebyshev functions of the first kind of order 5, for different α values

The fractional Chebyshev function FT_n^α(x) has exactly n distinct zeros over the finite interval (0, η), which are

$$x_k = \left(\frac{1 - \cos\left(\frac{(2k-1)\pi}{2n}\right)}{2}\right)^{\frac{1}{\alpha}}, \quad k = 1, 2, \ldots, n. \tag{3.29}$$

Moreover, the derivative of the fractional Chebyshev function has exactly n − 1 distinct zeros over the finite interval (0, η), which are the extrema:

$$x_k = \left(\frac{1 - \cos\left(\frac{k\pi}{n}\right)}{2}\right)^{\frac{1}{\alpha}}, \quad k = 1, 2, \ldots, n - 1. \tag{3.30}$$
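As a quick consistency check, the closed-form zeros (3.29) can be plugged into the recursive definition (3.25); here a = 0, b = 1, n = 5, and α = 0.5 are illustrative values.

```python
import math

def FT_recursive(n, x, alpha):
    # Recurrence (3.25) with a = 0, b = 1
    xp = 2 * x ** alpha - 1
    f_prev, f_curr = 1.0, xp          # FT_0 and FT_1
    if n == 0:
        return f_prev
    for _ in range(n - 1):
        f_prev, f_curr = f_curr, 2 * xp * f_curr - f_prev
    return f_curr

n, alpha = 5, 0.5
zeros = [((1 - math.cos((2 * k - 1) * math.pi / (2 * n))) / 2) ** (1 / alpha)
         for k in range(1, n + 1)]
print(max(abs(FT_recursive(n, z, alpha)) for z in zeros))
```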

Figure 3.3 shows the fractional Chebyshev functions of the first kind up to sixth
order where a = 0, b = 5, and α is 0.5.
Also, Fig. 3.4 depicts the fractional Chebyshev functions of the first kind of order
5 for different values of α while η is fixed at 5.

3.3 Chebyshev Kernel Functions

Following the Chebyshev polynomial principles and their fractional type discussed
in the previous sections, this section presents the formulation of Chebyshev kernels.
Therefore, in the following, first, the ordinary Chebyshev kernel function, then some
other versions of this kernel function, and finally the fractional Chebyshev kernels
will be explained.

3.3.1 Ordinary Chebyshev Kernel Function

Many kernels constructed from orthogonal polynomials have been proposed for SVM and other kernel-based learning algorithms (Ye et al. 2006; Moghaddam and Hamidzadeh 2016; Tian and Wang 2017). Such kernels have attracted attention for characteristics such as lower data redundancy in feature space; on some occasions they also need fewer support vectors during the fitting procedure, which leads to less execution time (Jung and Kim 2013). Moreover, orthogonal kernels have shown superior performance in classification problems compared to traditional kernels like RBF and polynomial (Moghaddam and Hamidzadeh 2016; Ozer et al. 2011; Padierna et al. 2018).
Now, using the fundamental definition of the orthogonal kernel we will formulate
the Chebyshev kernel. As we know, the unweighted orthogonal polynomial kernel
function for SVM, for scalar inputs, is defined as


$$K(x, z) = \sum_{i=0}^{n} T_i(x) T_i(z), \tag{3.31}$$

where T(.) denotes the evaluation of the polynomial, x and z are the kernel's input arguments, and n is the highest polynomial order. In most real-world applications, the input data is a multidimensional vector; hence, two approaches have been proposed to extend the one-dimensional polynomial kernel to multidimensional vector inputs (Ozer et al. 2011; Vert et al. 2007):
1. Pairwise
According to Vapnik’s theorem (Vapnik 2013):
Let a multidimensional set of functions be defined by the basis functions that are the
tensor products of the coordinate-wise basis functions. Then the kernel that defines the
inner product in the n-dimensional basis is the product of one-dimensional kernels.

The multidimensional orthogonal polynomial kernel is constructed as a tensor product of one-dimensional kernels as

$$K(x, z) = \prod_{j=1}^{m} K_j(x_j, z_j). \tag{3.32}$$

In this approach, the kernel is evaluated on each pair of corresponding elements of the input vectors x and z, and the overall kernel output is the product of these per-element values. Therefore, for m-dimensional input vectors x = {x_1, x_2, ..., x_m} ∈ R^m and z = {z_1, z_2, ..., z_m} ∈ R^m, the unweighted Chebyshev kernel is defined as

$$K_{Cheb}(x, z) = \prod_{j=1}^{m} K_j(x_j, z_j) = \prod_{j=1}^{m} \sum_{i=0}^{n} T_i(x_j) T_i(z_j), \tag{3.33}$$

where {T_i(.)} are Chebyshev polynomials. By incorporating the weight function (3.14), with respect to which the Chebyshev polynomials of the first kind are orthogonal (see (3.15)), one can construct the pairwise orthogonal Chebyshev kernel function as

$$K_{Cheb}(x, z) = \frac{\sum_{i=0}^{n} T_i(x) T_i(z)}{\sqrt{1 - xz}}, \tag{3.34}$$
where x and z are scalar-valued inputs; for vector inputs we have

$$K_{Cheb}(x, z) = \prod_{j=1}^{m} \frac{\sum_{i=0}^{n} T_i(x_j) T_i(z_j)}{\sqrt{1 - x_j z_j}}, \tag{3.35}$$

where m is the dimension of the vectors x and z.


Many kernels have been constructed using this pairwise method, which was proposed first (Ye et al. 2006; Vert et al. 2007; Zhou et al. 2007; Pan et al. 2021), but kernels of this type suffer from two drawbacks. First, the computational complexity is considerably high, which can be a serious problem when the input vectors are high dimensional (Ozer et al. 2011). Second, kernels constructed with the pairwise method generalize poorly if one of the factors (T_i(x) or T_i(z)) is close to zero while the two vectors x and z are quite similar to each other (Ozer and Chen 2008).

2. Vectorized
In this method, proposed by Ozer and Chen (2008), the generalization problem of the pairwise approach is tackled by applying the input vectors as a whole rather than per element; this is done by evaluating the inner product of the input vectors. Based on the fact that the inner product of two vectors x and z is defined as ⟨x, z⟩ = xz^T, and considering Eq. 3.31, one can construct the unweighted generalized Chebyshev kernel as (Ozer and Chen 2008; Ozer et al. 2011)


$$K(x, z) = \sum_{i=0}^{n} T_i(x) T_i^T(z) = \sum_{i=0}^{n} \left\langle T_i(x), T_i(z) \right\rangle. \tag{3.36}$$

In a similar fashion to Eq. 3.35, the generalized Chebyshev kernel is defined as (Ozer and Chen 2008; Ozer et al. 2011)

$$K_{G\_Cheb}(x, z) = \frac{\sum_{i=0}^{n} \left\langle T_i(x), T_i(z) \right\rangle}{\sqrt{m - \langle x, z \rangle}}, \tag{3.37}$$
where m is the dimension of the input vectors x and z.


Note that as the Chebyshev polynomials are orthogonal within [−1, 1], the input data should be normalized according to

$$x_{new} = \frac{2(x - Min)}{Max - Min} - 1, \tag{3.38}$$

where Min and Max are the minimum and maximum values of the input data, respectively. It is clear that if the input data is a dataset, normalization should be done column-wise.
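As a concrete sketch, the pairwise kernel (3.35), the generalized kernel (3.37), and the normalization (3.38) can be put together with NumPy's Chebyshev utilities. The toy data below is made up for illustration; also note that the pairwise weight 1/√(1 − x_j z_j) blows up when x_j z_j = 1, so in practice inputs are often scaled strictly inside (−1, 1).

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def normalize(X):
    # Eq. (3.38), applied column-wise: scale every feature into [-1, 1]
    mn, mx = X.min(axis=0), X.max(axis=0)
    return 2 * (X - mn) / (mx - mn) - 1

def cheb_eval(v, n):
    # rows i = 0..n hold T_i evaluated at every component of v
    return np.stack([C.chebval(v, [0] * i + [1]) for i in range(n + 1)])

def pairwise_cheb_kernel(x, z, n=4):
    # Eq. (3.35): weighted per-dimension sums, multiplied over dimensions
    Tx, Tz = cheb_eval(x, n), cheb_eval(z, n)
    return float(np.prod((Tx * Tz).sum(axis=0) / np.sqrt(1 - x * z)))

def generalized_cheb_kernel(x, z, n=4):
    # Eq. (3.37): <T_i(x), T_i(z)> summed over i, with one shared weight
    Tx, Tz = cheb_eval(x, n), cheb_eval(z, n)
    return float((Tx * Tz).sum() / np.sqrt(len(x) - x @ z))

X = np.array([[1.0, 25.0], [2.0, 10.0], [3.0, 40.0], [4.0, 30.0]])
Xn = normalize(X)
K = np.array([[generalized_cheb_kernel(u, v) for v in Xn] for u in Xn])
print(K.shape, pairwise_cheb_kernel(Xn[0], Xn[3]))
```

Because the generalized kernel is a valid Mercer kernel (Theorem 3.3 below), the Gram matrix K is symmetric positive semi-definite, which is what an SVM solver requires.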

3.3.1.1 Validation of Chebyshev Kernel

According to the Mercer theorem introduced in 2.2.12, a valid kernel function needs to be positive semi-definite, or equivalently it must satisfy the necessary and sufficient conditions of Mercer's theorem. Here, the validity of the generalized Chebyshev kernel will be proved.
Theorem 3.3 (Ozer et al. (2011)) The generalized Chebyshev kernel introduced in Eq. 3.37 is a valid Mercer kernel.

Proof As Mercer's theorem states, any valid SVM kernel must be nonnegative in the following sense:

$$\iint K(x, z) f(x) f(z)\, dx\, dz \geq 0. \tag{3.39}$$

The multiplication of two valid kernels is also a valid kernel. Therefore, one can
express any order of the Chebyshev kernel as a product of two kernel functions:

K (x, z) = k(1) (x, z)k(2) (x, z), (3.40)

where

$$k_{(1)}(x, z) = \sum_{j=0}^{n} T_j(x) T_j^T(z) = T_0(x) T_0^T(z) + T_1(x) T_1^T(z) + \cdots + T_n(x) T_n^T(z) \tag{3.41}$$

and

$$k_{(2)}(x, z) = \frac{1}{\sqrt{m - \langle x, z \rangle}}. \tag{3.42}$$

Considering f : R^m → R and assuming that each element is independent of the other elements of the kernel, we can evaluate the Mercer condition for k_{(1)}(x, z) as follows:

$$\begin{aligned}
\iint k_{(1)}(x, z) f(x) f(z)\, dx\, dz &= \iint \sum_{j=0}^{n} T_j(x) T_j^T(z) f(x) f(z)\, dx\, dz \\
&= \sum_{j=0}^{n} \iint T_j(x) T_j^T(z) f(x) f(z)\, dx\, dz \\
&= \sum_{j=0}^{n} \left( \int T_j(x) f(x)\, dx \right) \left( \int T_j^T(z) f(z)\, dz \right) \geq 0.
\end{aligned} \tag{3.43}$$

Therefore, the kernel k_{(1)}(x, z) is a valid Mercer kernel. To prove that k_{(2)}(x, z) = \frac{1}{\sqrt{m - \langle x, z \rangle}} (in the simplest form, where m = 1) is a valid kernel function, note that ⟨x, y⟩ is the linear kernel and the Taylor series of \frac{1}{\sqrt{1 - \langle x, y \rangle}} is

$$1 + \frac{1}{2}\langle x, y \rangle + \frac{3}{8}\langle x, y \rangle^2 + \frac{5}{16}\langle x, y \rangle^3 + \frac{35}{128}\langle x, y \rangle^4 + \cdots,$$

in which each coefficient \frac{\prod_{i=1}^{n}(2i-1)}{n!\,2^n} \geq 0; therefore k_{(2)}(x, z) = \frac{1}{\sqrt{m - \langle x, z \rangle}} is also a valid kernel. □

3.3.2 Other Chebyshev Kernel Functions

3.3.2.1 Chebyshev-Wavelet Kernel

A valid kernel should satisfy the Mercer conditions (Vapnik 2013; Schölkopf et al. 2002), and based on Mercer's theorem, a kernel made from the multiplication or summation of two valid Mercer kernels is also a valid kernel. This idea motivated Jafarzadeh et al. (2013) to construct the Chebyshev-Wavelet kernel, which is in fact the multiplication of the generalized Chebyshev and wavelet kernels:
$$k_{G\_Cheb}(x, z) = \frac{\sum_{i=0}^{n} \left\langle T_i(x), T_i(z) \right\rangle}{\sqrt{m - \langle x, z \rangle}},$$
$$k_{wavelet}(x, z) = \prod_{j=1}^{m} \cos\left(1.75\,\frac{x_j - z_j}{\alpha}\right) \exp\left(-\frac{\|x - z\|^2}{2\alpha^2}\right). \tag{3.44}$$

Thus, the Chebyshev-Wavelet kernel is introduced as (Jafarzadeh et al. 2013)

$$k_{Cheb-wavelet}(x, z) = \frac{\sum_{i=0}^{n} \left\langle T_i(x), T_i(z) \right\rangle}{\sqrt{m - \langle x, z \rangle}} \times \prod_{j=1}^{m} \cos\left(1.75\,\frac{x_j - z_j}{\alpha}\right) \exp\left(-\frac{\|x - z\|^2}{2\alpha^2}\right), \tag{3.45}$$

where n is the Chebyshev kernel parameter (the order), α is the wavelet kernel parameter, and m is the dimension of the input vector.
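A direct transcription of (3.45) in NumPy might look as follows; n = 3 and α = 1 are illustrative parameter choices, and the inputs are assumed to be already normalized into (−1, 1).

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def cheb_wavelet_kernel(x, z, n=3, alpha=1.0):
    # generalized Chebyshev factor (Eq. 3.37)
    m = len(x)
    Tx = np.stack([C.chebval(x, [0] * i + [1]) for i in range(n + 1)])
    Tz = np.stack([C.chebval(z, [0] * i + [1]) for i in range(n + 1)])
    cheb = (Tx * Tz).sum() / np.sqrt(m - x @ z)
    # wavelet factor (Eq. 3.45)
    wavelet = np.prod(np.cos(1.75 * (x - z) / alpha)) \
        * np.exp(-np.sum((x - z) ** 2) / (2 * alpha ** 2))
    return float(cheb * wavelet)

x = np.array([0.1, -0.2])
z = np.array([0.3, 0.4])
print(cheb_wavelet_kernel(x, z))
```

Since cosine is an even function and the exponential depends only on x − z through its squared norm, the kernel is symmetric in its two arguments, as a kernel must be.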

3.3.2.2 Unified Chebyshev Kernel (UCK)

As already mentioned, there are four kinds of Chebyshev polynomials. Zhao et al. (2013) proposed a new kernel by combining the Chebyshev polynomials of the first (T_n) and second (U_n) kinds. It should be noted that the Chebyshev polynomials of the second kind are orthogonal on the interval (−1, 1) with respect to the following weight function (Mason and Handscomb 2002):

$$w_{U_n}(x) = \sqrt{1 - x^2}, \tag{3.46}$$

While the trigonometric relation of the Chebyshev polynomials of the first kind is introduced in 3.3, polynomials of the second kind are described by (Mason and Handscomb 2002)

$$U_n(\cos\theta) = \frac{\sin((n+1)\theta)}{\sin(\theta)}, \quad n = 0, 1, 2, \ldots. \tag{3.47}$$

Similar to the generating function of the Chebyshev polynomials of the first kind, \(\frac{1 - tx}{1 - 2tx + t^2} = \sum_{n=0}^{\infty} T_n(x) t^n\), the generating function for the second kind is defined as (Mason and Handscomb 2002)

$$\frac{1}{1 - 2xt + t^2} = \sum_{n=0}^{\infty} U_n(x) t^n, \quad |x| \leq 1, \ |t| \leq 1. \tag{3.48}$$

According to defined weight function Eq. 3.46, Zhao et al. (2013) defined the
orthogonality relation for the Chebyshev polynomials of the second kind as
56 A. H. Hadian Rasanan et al.

$$\int_{-1}^{1} \sqrt{1 - x^2}\, U_m(x) U_n(x)\, dx = \begin{cases} \dfrac{\pi}{2}, & m = n \neq 0, \\ 0, & (m \neq n) \ \text{or} \ (m = n = 0). \end{cases} \tag{3.49}$$

Moreover, they used P_n(·) to denote the unified Chebyshev polynomials (UCP), but
in this book the notation P_n(·) is used for the Legendre orthogonal polynomials. Therefore,
in order to avoid any ambiguity, we consciously use UCP_n(·) instead. Considering
the generating functions of the first and the second kinds, Zhao et al. (2013) constructed
the generating function of the UCP:

$$\frac{1 - xt}{1 - axt} \times \frac{1}{1 - 2xt + t^2} = \sum_{n=0}^{\infty} UCP_n(x)\, t^n, \tag{3.50}$$

where x ∈ [−1, 1], t ∈ [−1, 1], n = 0, 1, 2, 3 . . ., and U C Pn (x) is the UCP of the
nth order. It is clear that the Chebyshev polynomials of the first and the second kinds
are special cases of UCP where a = 0 and a = 1, respectively. Also, Zhao et al.
(2013) introduced the recurrence relation of the nth order of UCP as

$$UCP_n(x) = (a + 2)x\, UCP_{n-1}(x) - (2ax^2 + 1)\, UCP_{n-2}(x) + ax\, UCP_{n-3}(x). \tag{3.51}$$
Therefore, using (3.51), some instances of these polynomials are

$$\begin{aligned}
UCP_0(x) &= 1,\\
UCP_1(x) &= ax + x,\\
UCP_2(x) &= (a^2 + a + 2)x^2 - 1,\\
UCP_3(x) &= (a^3 + a^2 + 2a + 4)x^3 - (a + 3)x,\\
UCP_4(x) &= (a^4 + a^3 + 2a^2 + 4a + 8)x^4 - (a^2 + 3a + 8)x^2 + 1.
\end{aligned}$$
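The recurrence (3.51), together with the first three instances above, is enough to evaluate any UCP_n numerically, and the special cases a = 0 and a = 1 give a convenient self-check against T_n and U_n. A minimal sketch (the function name `ucp` is ours):

```python
import numpy as np

def ucp(n, x, a):
    """Evaluate the unified Chebyshev polynomial UCP_n(x) via the
    recurrence of Eq. 3.51; a=0 reproduces T_n, a=1 reproduces U_n."""
    x = np.asarray(x, dtype=float)
    # The first three polynomials, read off from the listed instances.
    p = [np.ones_like(x), (a + 1) * x, (a ** 2 + a + 2) * x ** 2 - 1]
    for k in range(3, n + 1):
        p.append((a + 2) * x * p[k - 1]
                 - (2 * a * x ** 2 + 1) * p[k - 2]
                 + a * x * p[k - 3])
    return p[n]
```

With a = 0 this reproduces T_3(x) = 4x³ − 3x, and with a = 1 it reproduces U_3(x) = 8x³ − 4x, matching the special cases noted above.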

For the sake of simplicity, we do not go deeper into the formulations provided in the
corresponding paper; the details and proofs are investigated in (Zhao et al. 2013).
On the other hand, Zhao et al. (2013) constructed the UCP kernel so that

$$\int\!\!\int (m - \langle x, z \rangle + r)^{-\frac{1}{2}} \sum_{j=0}^{n} UCP_j(x)\, UCP_j^{T}(z)\, f(x) f(z)\, dx\, dz \ge 0. \tag{3.52}$$

Hence,

$$k_{UCP}(X, Z) = \frac{\sum_{j=0}^{n} UCP_j(x)\, UCP_j(z)}{\sqrt{m - \langle x, z \rangle + r}}, \tag{3.53}$$

where m is the dimension of the input vector and r is a small positive constant that
prevents the denominator from becoming 0.
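A Gram-matrix sketch of the UCP kernel of Eq. 3.53 follows; the function name `ucp_kernel` and its defaults are ours, inputs are assumed scaled into [−1, 1], and r is the small positive offset from the text.

```python
import numpy as np

def ucp_kernel(X, Z, n=3, a=0.5, r=1e-6):
    """Gram matrix of the UCP kernel (Eq. 3.53) between the rows of X and Z."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    Z = np.atleast_2d(np.asarray(Z, dtype=float))
    m = X.shape[1]

    def ucp_all(V):
        # UCP_0 .. UCP_n evaluated entry-wise via the recurrence (3.51).
        polys = [np.ones_like(V), (a + 1) * V, (a ** 2 + a + 2) * V ** 2 - 1]
        for _ in range(3, n + 1):
            polys.append((a + 2) * V * polys[-1]
                         - (2 * a * V ** 2 + 1) * polys[-2]
                         + a * V * polys[-3])
        return polys[:n + 1]

    # Numerator: sum_j <UCP_j(x), UCP_j(z)> for every pair of rows.
    num = sum(p @ q.T for p, q in zip(ucp_all(X), ucp_all(Z)))
    return num / np.sqrt(m - X @ Z.T + r)
```

On a sample, the resulting Gram matrix is symmetric and, up to rounding, positive semi-definite, as Eq. 3.52 requires.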

3.3.3 Fractional Chebyshev Kernel

Using the weight function Eq. 3.27, with the same approach performed in Eqs. 3.35
and 3.37, we can introduce the corresponding fractional kernels as follows:
$$k_{FCheb}(x, z) = \prod_{j=1}^{m} \frac{\sum_{i=0}^{n} FT_i^{\alpha}(x_j)\, FT_i^{\alpha}(z_j)}{\sqrt{1 - \left(2\left(\dfrac{x_j z_j - a}{b - a}\right)^{\alpha} - 1\right)^{2}}}, \tag{3.54}$$

$$k_{Gen\text{-}FCheb}(x, z) = \frac{\sum_{i=0}^{n} \langle FT_i^{\alpha}(x), FT_i^{\alpha}(z) \rangle}{\sqrt{m - \left(2\left(\dfrac{\langle x, z \rangle - a}{b - a}\right)^{\alpha} - 1\right)}}, \tag{3.55}$$

where m is the dimension of vectors x and z.
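A sketch of Eq. 3.55 as a scikit-learn-compatible kernel callable. This encodes our reading of the construction: FT_i^α(x) is taken as T_i evaluated at the fractional map 2((x − a)/(b − a))^α − 1, the name `frac_cheb_kernel` and its parameters are ours, and inputs are assumed to lie in [a, b] = [0, 1].

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval

def frac_cheb_kernel(X, Z, n=3, alpha=0.5, a=0.0, b=1.0):
    """Gram matrix of the generalized fractional Chebyshev kernel (Eq. 3.55)."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    Z = np.atleast_2d(np.asarray(Z, dtype=float))
    m = X.shape[1]

    def feats(V):
        # FT_i^alpha(V): T_i evaluated at the fractional map of V into [-1, 1].
        T = 2.0 * ((V - a) / (b - a)) ** alpha - 1.0
        return [chebval(T, [0] * i + [1]) for i in range(n + 1)]

    num = sum(p @ q.T for p, q in zip(feats(X), feats(Z)))
    mapped = 2.0 * ((X @ Z.T - a) / (b - a)) ** alpha - 1.0
    return num / np.sqrt(m - mapped)
```

Such a callable can be passed directly as `SVC(kernel=frac_cheb_kernel)` in scikit-learn; for inputs scaled well inside [0, 1] and α ≤ 0.5 the denominator stays positive.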

3.3.3.1 Validation of Fractional Chebyshev Kernel

To prove the validity of the fractional Chebyshev kernels introduced in Eq. 3.55, a
procedure similar to the proof of the generalized Chebyshev kernel, Eq. 3.37, is followed.

Theorem 3.4 The generalized Chebyshev kernel of fractional order introduced in
Eq. 3.55 is a valid Mercer kernel.

Proof Mercer's theorem states that, to be valid, an SVM kernel must satisfy the
non-negativity condition (see (3.39)); using the fact that the multiplication of two valid
kernels is also a kernel (see (3.40)), we write

$$k_{(1)}(x, z) = \sum_{j=0}^{n} FT_j^{\alpha}(x)\, FT_j^{\alpha}(z), \qquad k_{(2)}(x, z) = w(x) = \frac{1}{\sqrt{1 - \left(2\left(\dfrac{x - a}{b - a}\right)^{\alpha} - 1\right)^{2}}}.$$

Since f is a function f : R^m → R, we have



  
$$\begin{aligned}
\int\!\!\int k_{(1)}(x, z)\, f(x) f(z)\, dx\, dz &= \int\!\!\int \sum_{j=0}^{n} FT_j^{\alpha}(x)\, (FT_j^{\alpha})^{T}(z)\, f(x) f(z)\, dx\, dz \\
&= \sum_{j=0}^{n} \int\!\!\int FT_j^{\alpha}(x)\, (FT_j^{\alpha})^{T}(z)\, f(x) f(z)\, dx\, dz \\
&= \sum_{j=0}^{n} \left(\int FT_j^{\alpha}(x) f(x)\, dx\right) \left(\int (FT_j^{\alpha})^{T}(z) f(z)\, dz\right) \\
&= \sum_{j=0}^{n} \left(\int FT_j^{\alpha}(x) f(x)\, dx\right)^{2} \ge 0.
\end{aligned} \tag{3.56}$$

Therefore, the kernel k_(1)(x, z) is a valid Mercer kernel. In order to prove that
k_(2)(x, z) is a valid Mercer kernel too, one can show that k_(2)(x, z) is positive semi-definite. According to the definition in Eq. 3.24, both x and b are positive because 0 ≤
x ≤ b, b ∈ R⁺, and α ∈ R⁺; in other words, under the mapping defined in Eq. 3.24, the
output is always positive. Hence, the second kernel, that is, the weight function, is
positive semi-definite, so k_(2)(x, z) ≥ 0. □
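The Mercer argument can also be sanity-checked numerically: k_(1) is a finite feature-map kernel, so its Gram matrix on any finite sample must be positive semi-definite up to rounding error. A small check, assuming our reading FT_j^α(x) = T_j(2x^α − 1) for scalar inputs on [0, 1] (the function name `k1_gram` is ours):

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval

def k1_gram(xs, n=4, alpha=0.5):
    """Gram matrix of k_(1)(x, z) = sum_j FT_j^alpha(x) FT_j^alpha(z)
    for scalar inputs xs in [0, 1]."""
    x = np.asarray(xs, dtype=float).ravel()
    t = 2.0 * x ** alpha - 1.0              # fractional map into [-1, 1]
    # Feature matrix F[j, i] = FT_j^alpha(x_i); the Gram matrix is F^T F.
    F = np.array([chebval(t, [0] * j + [1]) for j in range(n + 1)])
    return F.T @ F

xs = np.linspace(0.05, 0.95, 12)
K = k1_gram(xs)
smallest_eig = float(np.linalg.eigvalsh(K).min())
```

Because K = FᵀF by construction, `smallest_eig` can only be negative by floating-point rounding, mirroring the chain of equalities in Eq. 3.56.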

3.4 Application of Chebyshev Kernel Functions on Real Datasets

In this section, we will illustrate the use of the fractional Chebyshev kernel in SVM
applied to some real datasets which are widely used by machine learning experts to
examine the accuracy of a given model. We will then compare the accuracy of
the fractional Chebyshev kernel with that of the Chebyshev kernel and two
other well-known kernels used in kernel-based learning algorithms, the RBF
and polynomial kernels. As we know, applying SVM to a dataset involves a number of
pre-processing tasks, such as data cleansing and generation of the training/test sets. We
do not dwell on these steps, except to recall that normalization of the dataset, as
already mentioned, is mandatory when Chebyshev polynomials of any kind are used as
the kernel function.
There are several online data stores available for public use; a widely used
one is the UCI Machine Learning Repository2 of the University of California,
Irvine, and another is Kaggle.3 In this section, four datasets are selected from UCI, which
are well known and widely used by machine learning practitioners.
The polynomial kernel is widely used in kernel-based models and, in general, is
defined as
$$K(X_1, X_2) = (a + X_1^{T} X_2)^{b}. \tag{3.57}$$

2 https://archive.ics.uci.edu/ml/datasets.php.
3 https://www.kaggle.com/datasets.

Also, the RBF kernel, a.k.a. the Gaussian kernel, is another popular kernel for SVM,
and is defined as

$$K(X_1, X_2) = \exp\left(-\frac{\|X_1 - X_2\|^2}{2\sigma^2}\right). \tag{3.58}$$
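Both baselines are easy to reproduce as kernel callables; the sketch below (our names, a toy dataset, scikit-learn assumed available) plugs the RBF kernel of Eq. 3.58 into `SVC` the same way a Chebyshev kernel would be.

```python
import numpy as np
from sklearn.svm import SVC

def poly_kernel(X1, X2, a=1.0, b=3):
    """Polynomial kernel of Eq. 3.57: (a + X1^T X2)^b."""
    return (a + np.atleast_2d(X1) @ np.atleast_2d(X2).T) ** b

def rbf_kernel(X1, X2, sigma=0.73):
    """Gaussian RBF kernel of Eq. 3.58."""
    X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

# Toy two-class problem: points inside vs. outside the unit circle.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)
clf = SVC(kernel=rbf_kernel).fit(X, y)
train_acc = clf.score(X, y)
```

Any kernel expressed as such a callable (including the Chebyshev kernels of this chapter) slots into `SVC` the same way, which is how the comparisons below can be reproduced in spirit.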

3.4.1 Spiral Dataset

The Spiral dataset is a famous classic dataset; many datasets have a spiraling
distribution of data points such as the one considered here, and there are also
simulated ones with similar characteristics and even better-controlled
distributional properties. The Spiral dataset consists of 1000 data points, equally
clustered into 3 numerical labels, with 2 other attributes, "Chem 1" and "Chem 2", of
float type. SVM is inherently a binary classification algorithm; therefore, to deal
with multi-class datasets, the common method is to split such datasets into multiple
binary-class datasets. Two examples of such methods are

• One-vs-All (OVA),
• One-vs-One (OVO).

Using OVA, the Spiral classification task with numerical classes ∈ {1, 2, 3} will be
divided into 3 binary classification tasks of the following forms:
• Binary Classification task 1: class 1 versus class {2, 3},
• Binary Classification task 2: class 2 versus class {1, 3},
• Binary Classification task 3: class 3 versus class {1, 2}.
Also, using the OVO method, the classification task is divided into every possible
pairwise binary classification; for the Spiral dataset, we have
• Binary Classification task 1: class 1 versus class 2,
• Binary Classification task 2: class 1 versus class 3,
• Binary Classification task 3: class 2 versus class 3.
In our example, the number of split datasets happens to be equal for both methods,
but in general the OVO method generates more datasets to classify. For instance, assume
a multi-class dataset with 4 numerical labels {1, 2, 3, 4}; using the first method, the
binary classification tasks are
• Binary Classification task 1: class 1 versus class {2, 3, 4},
• Binary Classification task 2: class 2 versus class {1, 3, 4},
• Binary Classification task 3: class 3 versus class {1, 2, 4},
• Binary Classification task 4: class 4 versus class {1, 2, 3}.
While using the second method, we have the following binary datasets to classify:
• Binary Classification task 1: class 1 versus class 2,
• Binary Classification task 2: class 1 versus class 3,

Fig. 3.5 Spiral dataset, different fractional order

• Binary Classification task 3: class 1 versus class 4,


• Binary Classification task 4: class 2 versus class 3,
• Binary Classification task 5: class 2 versus class 4,
• Binary Classification task 6: class 3 versus class 4.
Generally, the OVA method is often preferred due to computational and time considerations.
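scikit-learn exposes both splitting strategies as meta-estimators, which makes the count of induced binary problems easy to verify; the sketch below uses synthetic 4-class data as a stand-in for a real dataset (all names are ours, scikit-learn assumed available).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# Synthetic 4-class data, standing in for a multi-class dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = rng.integers(0, 4, size=200)     # labels in {0, 1, 2, 3}

ova = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)

# k classes induce k OVA problems but k(k-1)/2 OVO problems.
n_ova = len(ova.estimators_)         # 4 binary classifiers
n_ovo = len(ovo.estimators_)         # 6 binary classifiers
```

For the 3-class Spiral dataset both counts happen to be 3, which is why the two methods coincide there.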
The following figures show how the transformation to fractional order affects
the original dataset’s distribution, and how the Chebyshev kernel chooses decision
boundaries in different orders. Figure 3.5 depicts the transformed Spiral dataset in its
original state and in fractional form with 0.1 ≤ α ≤ 0.9. As the fractional order gets
lower, approaching 0.1, the transformation moves more data points from the original
dataset toward the upper-right side of the space, more precisely into the positive
quadrant of the new space, where x1 > 0 and x2 > 0. Although it seems that the data points are conglomerated
on one corner of the axis on a 2D plot, this is not necessarily what happens in 3D
space when the kernel function is applied. In other words, when a kernel function
such as the Chebyshev kernel function is applied, data points may spread over a new
axis while concentrating on a 2D axis and making it possible to classify those points
considering the higher dimension instead. This kind of transformation combined
with a kernel function creates a new dimension for the original dataset and makes it
possible for the kernel function’s decision boundary to capture and classify the data
points with a better chance of accuracy.

Fig. 3.6 Normal and fractional Chebyshev kernel of order 3 and alpha = 0.3 applied on Spiral
dataset

In order to have a better intuition of how a kernel function takes the dataset to a
higher dimension, see Fig. 3.6, which shows the Chebyshev kernel function of the
first kind (both normal and fractional forms) applied to the Spiral dataset. The figure
on the left depicts the dataset in a higher dimension when the normal Chebyshev
kernel of order 3 is applied, and the one at right demonstrates how the data points are
changed when the fractional Chebyshev kernel of order 3 and alpha 0.3 is applied to
the original dataset. It is clear that the transformation has moved the data points to
the positive side of the axes x, y, and also z.
After the transformation of the dataset and taking the data points to a higher
dimension, it is time to determine the decision boundaries of the classifier.4 The
decision boundary is highly dependent on the kernel function. In the case of orthogonal
polynomials, the order of these functions plays a critical role. For example, the
Chebyshev kernel function of order 3 yields a different decision boundary than the
same kernel function of order 6. There is no general rule for knowing which order
yields the most suitable decision boundary. Hence, trying different orders and comparing
the resulting decision boundaries is a practical way to find the best order, or even
the best kernel function itself.
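That search can be sketched as a small cross-validated grid over the order and α. Everything below is our own scaffolding: `make_frac_cheb_kernel` encodes one plausible reading of the generalized fractional Chebyshev kernel with a = 0 and b = 1, the toy data stands in for a real dataset, and inputs are assumed scaled into [0, 1].

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def make_frac_cheb_kernel(n, alpha):
    """Return an SVC-compatible kernel callable for a given order n and alpha."""
    def kernel(X, Z):
        X, Z = np.atleast_2d(X), np.atleast_2d(Z)
        m = X.shape[1]
        feats = lambda V: [chebval(2.0 * V ** alpha - 1.0, [0] * i + [1])
                           for i in range(n + 1)]
        num = sum(p @ q.T for p, q in zip(feats(X), feats(Z)))
        return num / np.sqrt(m + 1.0 - 2.0 * (X @ Z.T) ** alpha)
    return kernel

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(90, 2))
y = (X[:, 0] > X[:, 1]).astype(int)

# Score each (order, alpha) pair by 3-fold cross-validation.
scores = {(n, al): cross_val_score(SVC(kernel=make_frac_cheb_kernel(n, al)),
                                   X, y, cv=3).mean()
          for n in (3, 4) for al in (0.3, 0.5)}
best_order, best_alpha = max(scores, key=scores.get)
```

The same loop extends to the regularization parameter C, which (as noted later) also affects the reported accuracies.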
In order to have an intuition of how the decision boundaries are different, take
a look at Fig. 3.7 which depicts corresponding Chebyshev classifiers of different
orders {3, 4, 5, 6} on the original Spiral dataset where the binary classification of
1-versus-{2, 3} is chosen. From these plots, one can see that the decision boundaries
become more twisted and twirled at higher orders. At order 6 (the last subplot), the
decision boundary is slightly more complex in comparison with order 3 (the first
subplot). The complexity of the decision boundary increases further in fractional
space. Figure 3.8 demonstrates the decision boundary of the fractional Chebyshev
classifier with orders 3, 4, 5, and 6, where

4 Please note that “decision boundary” refers to the 2D space and the “decision surface” refers to
the 3D space.

Fig. 3.7 Chebyshev kernel with orders of 3, 4, 5, and 6 on Spiral dataset

α = 0.3. Again, the decision boundary becomes more complicated as one moves from
order 3 to order 6.
The purpose is not to specify which one has better or worse accuracy; it is all
about which one is most suitable given the dataset's distribution. Each decision
boundary (or decision surface) classifies the data points in a specific form. As the
decision boundary/surface of a kernel function of each order is fixed and determined,
it is the degree of correlation between that specific shape and the data points that
determines the final accuracy.
Finally, the following tables provide a comparison of the experiments on the
Spiral dataset. All decision boundaries depicted in the previous figures are applied to
the dataset, and the accuracy score is adopted as the final metric; moreover, the three
possible binary classifications have been examined according to the One-versus-All
method. As is clear from Table 3.1, the RBF kernel and the fractional Chebyshev
kernel show better performance for the class 1-versus-{2, 3}. However, for the other
binary classifications, Tables 3.2 and 3.3, the RBF and polynomial kernels outperform
both the normal and fractional kinds of Chebyshev kernels. It is worth mentioning that
these accuracy scores are not necessarily the best outcomes of those kernels, because
the accuracy of classification with SVM depends heavily on multiple parameters, such
as the regularization parameter C and the random state. Our aim here is not to obtain
the best outcome but rather to compare the kernels under similar conditions.

Fig. 3.8 Fractional Chebyshev kernel with orders of 3, 4, 5, and 6 on Spiral dataset (α = 0.3)

Table 3.1 Comparison of RBF, polynomial, Chebyshev, and fractional Chebyshev kernels on the Spiral dataset (class 1-versus-{2, 3}). The RBF and fractional Chebyshev kernels achieve the best accuracies

Kernel                 Sigma    Power    Order    Alpha (α)    Accuracy
RBF                    0.73     –        –        –            0.97
Polynomial             –        8        –        –            0.9533
Chebyshev              –        –        5        –            0.9667
Fractional Chebyshev   –        –        3        0.3          0.9733

Table 3.2 Comparison of RBF, polynomial, Chebyshev, and fractional Chebyshev kernels on the Spiral dataset. It is clear that the RBF kernel outperforms the other kernels

Kernel                 Sigma    Power    Order    Alpha (α)    Accuracy
RBF                    0.1      –        –        –            0.9867
Polynomial             –        5        –        –            0.9044
Chebyshev              –        –        8        –            0.9411
Fractional Chebyshev   –        –        8        0.5          0.9467

Table 3.3 Comparison of RBF, polynomial, Chebyshev, and fractional Chebyshev kernels on the Spiral dataset. It is clear that the RBF and polynomial kernels are better than the others

Kernel                 Sigma    Power    Order    Alpha (α)    Accuracy
RBF                    0.73     –        –        –            0.9856
Polynomial             –        5        –        –            0.9856
Chebyshev              –        –        6        –            0.9622
Fractional Chebyshev   –        –        6        0.6          0.9578

3.4.2 Three Monks’ Dataset

The Monk’s problem is a classic one introduced in 1991, according to “Monk’s


Problems—A Performance Comparison of Different Learning algorithms” by Thrun
et al. (1991); Sun et al. (2015):
The Monk’s problems rely on an artificial robot domain in which robots are described by
six different attributes:

x1 : head shape ∈ {r ound, squar e, octagon}


x2 : body shape ∈ {r ound, squar e, octagon}
x3 : is smiling ∈ {yes, no}
x4 : holding ∈ {swor d, balloon, f lag}
x5 : jacket color ∈ {r ed, yellow, gr een, blue}
x6 : has tie ∈ {yes, no}

The learning task is a binary classification one. Each problem is given by a logical description
of a class. Robots belong either to this class or not, but instead of providing a complete
class description to the learning problem, only a subset of all 432 possible robots with its
classification is given. The learning task is then to generalize over these examples and, if the
particular learning technique at hand allows this, to derive a simple class description.

Then the three problems are

• Problem M1: (head shape = body shape) or (jacket color = red). From 432
possible examples, 124 are randomly selected for the training set. There are no
misclassifications.
• Problem M2: Exactly two of the six attributes have their first value. (E.g., body
shape = head shape = round implies that the robot is not smiling, holding no
sword, jacket color is not red, and has no tie, since then exactly two attributes
(body shape and head shape) have their first value.) From 432 possible examples,
169 are randomly selected. Again, there is no noise.
• Problem M3: (jacket color is green and holding a sword) or (jacket color is
not blue and body shape is not octagon). From 432 examples, 122 are selected
randomly, and among them there are 5% misclassifications, i.e., noise in the
training set.
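Before any Chebyshev-type kernel can be applied to the Monk's robots, the six categorical attributes must be encoded numerically and normalized (the normalization requirement was stressed earlier). A sketch with an ordinal encoding; the value orderings below are our own choice, not part of the dataset specification:

```python
import numpy as np

# Attribute domains of the Monk's robot description.
DOMAINS = {
    "head_shape": ["round", "square", "octagon"],
    "body_shape": ["round", "square", "octagon"],
    "is_smiling": ["yes", "no"],
    "holding": ["sword", "balloon", "flag"],
    "jacket_color": ["red", "yellow", "green", "blue"],
    "has_tie": ["yes", "no"],
}

def encode(robot):
    """Map one robot (a dict of attribute values) into [0, 1]^6."""
    row = []
    for name, values in DOMAINS.items():
        idx = values.index(robot[name])
        row.append(idx / (len(values) - 1))   # ordinal index scaled to [0, 1]
    return np.array(row)

robot = {"head_shape": "square", "body_shape": "round", "is_smiling": "yes",
         "holding": "flag", "jacket_color": "blue", "has_tie": "no"}
x = encode(robot)
```

One-hot encoding is an equally valid alternative; whichever is used, the [0, 1] scaling keeps the fractional map and the Chebyshev domain requirements satisfied.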

Table 3.4 Comparison of RBF, polynomial, Chebyshev, and fractional Chebyshev kernels on the Monk's first problem. It is clear that the RBF kernel outperforms the other kernels

Kernel                 Sigma    Power    Order    Alpha (α)    Accuracy
RBF                    2.844    –        –        –            0.8819
Polynomial             –        3        –        –            0.8681
Chebyshev              –        –        3        –            0.8472
Fractional Chebyshev   –        –        3        1/16         0.8588

Table 3.5 Comparison of RBF, polynomial, Chebyshev, and fractional Chebyshev kernels on the Monk's second problem. The fractional Chebyshev kernel outperforms the other kernels with the highest accuracy, 0.9653

Kernel                 Sigma    Power    Order    Alpha (α)    Accuracy
RBF                    5.5896   –        –        –            0.875
Polynomial             –        3        –        –            0.8657
Chebyshev              –        –        3        –            0.8426
Fractional Chebyshev   –        –        6        1/15         0.9653

Considering the above definitions, we applied the RBF, polynomial, Chebyshev, and
fractional Chebyshev kernels to all three Monk's problems. Table 3.4 illustrates
the output accuracy of each model for the first problem, where the RBF kernel has the
best accuracy (σ ≈ 2.84) at 0.8819 and the Chebyshev kernel has the worst among
them at 0.8472. Table 3.5 shows the results for Monk's second problem, where the
fractional Chebyshev kernel has the best accuracy at 0.9653 while the Chebyshev
kernel has the worst performance with an accuracy of 0.8426. Finally, Table 3.6
presents the results for the third Monk's problem, where both the RBF and fractional
Chebyshev kernels share the best accuracy at 0.91.

Table 3.6 Comparison of RBF, polynomial, Chebyshev, and fractional Chebyshev kernels on the Monk's third problem. The fractional Chebyshev and RBF kernels achieve the same accuracy, 0.91

Kernel                 Sigma    Power    Order    Alpha (α)    Accuracy
RBF                    2.1586   –        –        –            0.91
Polynomial             –        3        –        –            0.875
Chebyshev              –        –        6        –            0.8958
Fractional Chebyshev   –        –        5        1/5          0.91

3.5 Conclusion

This chapter started with a brief history and the basics of Chebyshev orthogonal
polynomials. Chebyshev polynomials have been used in many applications and, recently,
as kernel functions in kernel-based learning algorithms, where they have proved to
surpass traditional kernels in many situations. The construction of Chebyshev kernels
was explained and their validity proved. It was demonstrated in this chapter that the
fractional form of Chebyshev polynomials extends the applicability of the Chebyshev
kernel; thus, by using fractional Chebyshev kernel functions, a wider set of problems
can be tackled. The experiments demonstrate that the fractional Chebyshev kernel
improves the accuracy of classification with SVM when used as the kernel in the SVM
kernel trick. The results of these experiments were presented in the last section.

References

Achirul Nanda, M., Boro Seminar, K., Nandika, D., Maddu, A.: A comparison study of kernel
functions in the support vector machine and its application for termite detection. Information 9,
5–29 (2018)
An-na, W., Yue, Z., Yun-tao, H., Yun-lu, L.I.: A novel construction of SVM compound kernel
function. In: 2010 International Conference on Logistics Systems and Intelligent Management
(ICLSIM), vol. 3, pp. 1462–1465 (2010)
Asghari, M., Hadian Rasanan, A.H., Gorgin, S., Rahmati, D., Parand, K.: FPGA-orthopoly: a
hardware implementation of orthogonal polynomials. Eng. Comput. (2022). https://doi.org/10.1007/s00366-022-01612-x
Boyd, J.P.: Chebyshev and Fourier Spectral Methods. Courier Corporation (2001)
Capozziello, S., D’Agostino, R., Luongo, O.: Cosmographic analysis with Chebyshev polynomials.
MNRAS 476, 3924–3938 (2018)
Chakraverty, S., Mall, S.: Single layer Chebyshev neural network model with regression-based
weights for solving nonlinear ordinary differential equations. Evol. Intell. 13, 687–694 (2020)
Dabiri, A., Butcher, E.A., Nazari, M.: Coefficient of restitution in fractional viscoelastic compliant
impacts using fractional Chebyshev collocation. J. Sound Vib. 388, 230–244 (2017)
Glau, K., Mahlstedt, M., Pötz, C.: A new approach for American option pricing: the dynamic
Chebyshev method. SIAM J. Sci. Comput. 41, B153–B180 (2019)
Habibli, M., Noori Skandari, M.H.: Fractional Chebyshev pseudospectral method for fractional
optimal control problems. Optim. Control Appl. Methods 40, 558–572 (2019)
Hadian Rasanan, A.H., Rahmati, D., Gorgin, S., Rad, J.A.: MCILS: Monte-Carlo interpolation
least-square algorithm for approximation of edge-reliability polynomial. In: 9th International
Conference on Computer and Knowledge Engineering (ICCKE), pp. 295–299 (2019)
Hadian Rasanan, A.H., Rahmati, D., Gorgin, S., Parand, K.: A single layer fractional orthogonal
neural network for solving various types of Lane-Emden equation. New Astron. 75, 101307
(2020)
Hadian-Rasanan, A.H., Rad, J.A.: Brain activity reconstruction by finding a source parameter in an
inverse problem. In: Chakraverty, S. (ed.) Mathematical Methods in Interdisciplinary Sciences,
pp. 343–368. Wiley, Amsterdam (2020)
Hajimohammadi, Z., Baharifard, F., Ghodsi, A., Parand, K.: Fractional Chebyshev deep neural
network (FCDNN) for solving differential models. Chaos, Solitons Fractals 153, 111530 (2021)

Hassani, H., Machado, J.T., Naraghirad, E.: Generalized shifted Chebyshev polynomials for frac-
tional optimal control problems. Commun. Nonlinear Sci. Numer. Simul. 75, 50–61 (2019)
Hussain, M., Wajid, S.K., Elzaart, A., Berbar, M.: A comparison of SVM kernel functions for
breast cancer detection. In: 2011 Eighth International Conference Computer Graphics, Imaging
and Visualization, pp. 145–150 (2011)
Jafarzadeh, S.Z., Aminian, M., Efati, S.: A set of new kernel function for support vector machines:
an approach based on Chebyshev polynomials. In: ICCKE, pp. 412–416 (2013)
Jung, H.G., Kim, G.: Support vector number reduction: survey and experimental evaluations. IEEE
Trans. Intell. Transp. Syst. 15, 463–476 (2013)
Kazem, S., Shaban, M., Rad, J.A.: Solution of the coupled Burgers equation based on operational
matrices of d-dimensional orthogonal functions. Zeitschrift für Naturforschung A 67, 267–274
(2012)
Kazem, S., Abbasbandy, S., Kumar, S.: Fractional-order Legendre functions for solving fractional-
order differential equations. Appl. Math. Model. 37, 5498–5510 (2013)
Kazem, S., Shaban, M., Rad, J.A.: A new Tau homotopy analysis method for MHD squeezing flow
of second-grade fluid between two parallel disks. Appl. Comput. Math. 16, 114–132 (2017)
Kheyrinataj, F., Nazemi, A.: Fractional Chebyshev functional link neural network-optimization
method for solving delay fractional optimal control problems with Atangana-Baleanu derivative.
Optim. Control Appl. Methods 41, 808–832 (2020)
Mall, S., Chakraverty, S.: Numerical solution of nonlinear singular initial value problems of Emden-
Fowler type using Chebyshev Neural Network method. Neurocomputing 149, 975–982 (2015)
Mall, S., Chakraverty, S.: Single layer Chebyshev neural network model for solving elliptic partial
differential equations. Neural Process. Lett. 45, 825–840 (2017)
Mall, S., Chakraverty, S.: A novel Chebyshev neural network approach for solving singular arbitrary
order Lane-Emden equation arising in astrophysics. NETWORK-COMP NEURAL 31, 142–165
(2020)
Mason, J.C., Handscomb, D.C.: Chebyshev Polynomials. Chapman and Hall/CRC (2002)
Mesgarani, H., Beiranvand, A., Aghdam, Y.E.: The impact of the Chebyshev collocation method
on solutions of the time-fractional Black-Scholes. Math. Sci. 15, 137–143 (2021)
Moghaddam, V.H., Hamidzadeh, J.: New Hermite orthogonal polynomial kernel and combined
kernels in support vector machine classifier. Pattern Recognit. 60, 921–935 (2016)
Musavi, M.T., Ahmed, W., Chan, K.H., Faris, K.B., Hummels, D.M.: On the training of radial basis
function classifiers. Neural Netw. 5, 595–603 (1992)
Omidi, M., Arab, B., Hadian Rasanan, A.H., Rad, J.A., Parand, K.: Learning nonlinear dynamics
with behavior ordinary/partial/system of the differential equations: looking through the lens of
orthogonal neural networks. Eng. Comput. 1–20 (2021)
Ozer, S., Chen, C.H.: Generalized Chebyshev kernels for support vector classification. In: 19th
International Conference on Pattern Recognition, pp. 1–4 (2008)
Ozer, S., Chen, C.H., Cirpan, H.A.: A set of new Chebyshev kernel functions for support vector
machine pattern classification. Pattern Recognit. 44, 1435–1447 (2011)
Padierna, L.C., Carpio, M., Rojas-Domínguez, A., Puga, H., Fraire, H.: A novel formulation of
orthogonal polynomial kernel functions for SVM classifiers: the Gegenbauer family. Pattern
Recognit. 84, 211–225 (2018)
Pan, Z.B., Chen, H., You, X.H.: Support vector machine with orthogonal Legendre kernel. In: 2012
International Conference on Wavelet Analysis and Pattern Recognition, pp. 125–130 (2012)
Parand, K., Delkhosh, M.: Solving Volterra’s population growth model of arbitrary order using the
generalized fractional order of the Chebyshev functions. Ric. Mat. 65, 307–328 (2016)
Parand, K., Delkhosh, M.: Accurate solution of the Thomas-Fermi equation using the fractional
order of rational Chebyshev functions. J. Comput. Appl. Math. 317, 624–642 (2017)
Parand, K., Moayeri, M.M., Latifi, S., Delkhosh, M.: A numerical investigation of the boundary
layer flow of an Eyring-Powell fluid over a stretching sheet via rational Chebyshev functions.
Eur. Phys. J. 132, 1–11 (2017)

Parand, K., Moayeri, M.M., Latifi, S., Rad, J.A.: Numerical study of a multidimensional dynamic
quantum model arising in cognitive psychology especially in decision making. Eur. Phys. J. Plus
134, 109 (2019)
Pavlović, V.D., Dončov, N.S., Ćirić, D.G.: 1D and 2D economical FIR filters generated by Cheby-
shev polynomials of the first kind. Int. J. Electron. 100, 1592–1619 (2013)
Reddy, S.V.G., Reddy, K.T., Kumari, V.V., Varma, K.V.: An SVM based approach to breast cancer
classification using RBF and polynomial kernel functions with varying arguments. IJCSIT 5,
5901–5904 (2014)
Scholkopf, B., Sung, K.K., Burges, C.J., Girosi, F., Niyogi, P., Poggio, T., Vapnik, V.: Comparing
support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Trans.
Signal Process. 45, 2758–2765 (1997)
Schölkopf, B., Smola, A.J., Bach, F.: Learning with Kernels: Support Vector Machines, Regular-
ization, Optimization, and Beyond. MIT press, Cambridge (2002)
Sedaghat, S., Ordokhani, Y., Dehghan, M.: Numerical solution of the delay differential equations
of pantograph type via Chebyshev polynomials. Commun. Nonlinear Sci. Numer. Simul. 17,
4815–4830 (2012)
Shaban, M., Kazem, S., Rad, J.A.: A modification of the homotopy analysis method based on
Chebyshev operational matrices. Math. Comput. Model. 57, 1227–1239 (2013)
Shen, J., Tang, T., Wang, L. L.: Spectral methods: algorithms, analysis and applications, vol. 41.
Springer Science & Business Media, Berlin (2011)
Shuman, D.I., Vandergheynst, P., Kressner, D., Frossard, P.: Distributed signal processing via Cheby-
shev polynomial approximation. IEEE Trans. Signal Inf. Process. Netw. 4, 736–751 (2018)
Sun, L., Toh, K.A., Lin, Z.: A center sliding Bayesian binary classifier adopting orthogonal poly-
nomials. Pattern Recognit. 48, 2013–2028 (2015)
Thrun, S.B., Bala, J.W., Bloedorn, E., Bratko, I., Cestnik, B., Cheng, J., De Jong, K.A., Dzeroski,
S., Fisher, D.H., Fahlman, S.E. Hamann, R.: The monk’s problems: a performance comparison
of different learning algorithms (1991)
Tian, M., Wang, W.: Some sets of orthogonal polynomial kernel functions. Appl. Soft Comput. 61,
742–756 (2017)
Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Berlin (2013)
Vert, J.P., Qiu, J., Noble, W.S.: A new pairwise kernel for biological network inference with support
vector machines. BMC Bioinform. BioMed Cent. 8, 1–10 (2007)
Yaman, S., Pelecanos, J.: Using polynomial kernel support vector machines for speaker verification.
IEEE Signal Process. Lett. 20, 901–904 (2013)
Ye, N., Sun, R., Liu, Y., Cao, L.: Support vector machine with orthogonal Chebyshev kernel. In:
18th International Conference on Pattern Recognition (ICPR’06), vol. 2, pp. 752–755 (2006)
Zhao, J., Yan, G., Feng, B., Mao, W., Bai, J.: An adaptive support vector regression based on a new
sequence of unified orthogonal polynomials. Pattern Recognit. 46, 899–913 (2013)
Zhao, F., Huang, Q., Xie, J., Li, Y., Ma, L., Wang, J.: Chebyshev polynomials approach for numeri-
cally solving system of two-dimensional fractional PDEs and convergence analysis. Appl. Math.
Comput. 313, 321–330 (2017)
Zhou, F., Fang, Z., Xu, J.: Constructing support vector machine kernels from orthogonal polynomials
for face and speaker verification. In: Fourth International Conference on Image and Graphics
(ICIG), pp. 627–632 (2007)
Chapter 4
Fractional Legendre Kernel Functions:
Theory and Application

Amirreza Azmoon, Snehashish Chakraverty, and Sunil Kumar

Abstract The support vector machine algorithm has been able to show great flexibil-
ity in solving many machine learning problems due to the use of different functions
as a kernel. Linear, radial basis functions, and polynomial functions are the most
common functions used in this algorithm. Legendre polynomials are among the
most widely used orthogonal polynomials that have achieved excellent results in the
support vector machine algorithm. In this chapter, some basic features of Legendre
and fractional Legendre functions are introduced and reviewed, and then the kernels
of these functions are introduced and validated. Finally, the performance of these
functions in solving two problems (two sample datasets) is measured.

Keywords Legendre polynomial · Fractional Legendre functions · Kernel trick · Orthogonal functions · Mercer's theorem

4.1 Introduction

Another family of classic complete and orthogonal polynomials is the family of
Legendre polynomials, which was discovered by the French mathematician
Adrien-Marie Legendre (1752–1833) in 1782.

A. Azmoon (B)
Department of Computer Science, The Institute for Advance Studies in Basic Sciences (IASBS),
Zanjan, Iran
e-mail: [email protected]
S. Chakraverty
Department of Mathematics, National Institute of Technology Rourkela, Sundargarh, OR, India
S. Kumar
Department of Mathematics, National Institute of Technology, Jamshedpur 831014, JH, India
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
J. A. Rad et al. (eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support
Vector Machines, Industrial and Applied Mathematics,
https://doi.org/10.1007/978-981-19-6553-1_4
70 A. Azmoon et al.

Adrien-Marie Legendre (1752–1833) was a French mathematician who made
numerous contributions to mathematics. Legendre got his education from
Collège Mazarin of Paris and defended his thesis at the age of 18. In 1782,
he won a prize from the Berlin Academy for an essay titled "Research on
the Trajectories of Projectiles in a Resistant Medium", which brought significant
attention to this young mathematician. During his career, he published many papers
on elliptic functions, number theory, and the method of least squares. In a
publication titled "Research on the Shape of Planets", he introduced Legendre
polynomials. In honor of Legendre, a crater on the moon is named after him.a

a For more information about A.M. Legendre and his contribution, please see:
https://mathshistory.st-andrews.ac.uk/Biographies/Legendre/.

Like other functions in the family of orthogonal functions, Legendre polynomials
are used in various fields of science and industry, such as topography filtering
(Haitjema 2021), solving differential equations in the crystallization process in
chemistry (Chang and Wang 1984), expansions of hypergeometric functions (Holdeman
and Jonas 1970; Sánchez-Ruiz and Dehesa 1988), calculating the scattering of
charged particles in physics (Spencer 1953; Kaghashvili et al. 2004), financial
market prediction models (Dash 2020; Dash and Dash 2017), and modeling epileptic
seizures (Moayeri et al. 2020). Other important and common applications of Legendre
functions in industry include the analysis of time-varying linear control systems
(Hwang et al. 1985); model reduction in control systems (Chang and Wang 1983);
estimation algorithms in PMU systems (Qian et al. 2014) in the field of power; wave
motion in propagation (Gao et al. 2021; Dahmen et al. 2016; Zheng et al. 2019);
and reflection (Gao et al. 2019). These examples are some of the fields in
which orthogonal functions, especially Legendre functions, have been extensively
used both to improve the scientific level of projects and to improve industrial
efficiency. One of the most common and attractive applications of Legendre functions is
solving differential equations of various orders with the help of Legendre wavelet
functions (Mohammadi and Hosseini 2011; Chang and Isah 2016) and shifted Legendre
functions (Chang and Wang 1983; Zeghdane 2021). In particular, Mall and
Chakraverty (2016) designed neural networks using Legendre functions to solve
ordinary differential equations. This method, based on a single-layer Legendre neural
network model, has been developed to solve initial and boundary value problems.
In the proposed approach, a Legendre-polynomial-based functional link artificial
neural network is developed; in fact, the hidden layer is eliminated by expanding the
input pattern using Legendre polynomials. On the other hand, due to the diversity of
available kernels, support vector machine algorithms (Pan et al. 2012) are among the
most frequently used and varied machine learning methods. In a recent study on this subject
(Afifi and Zanaty 2019), the usage of orthogonal functions as kernels has expanded
substantially, so that these functions are now utilized as fractional, wavelet, or shift-
ing order kernels in addition to regular orthogonal functions Afifi and Zanaty (2019),
4 Fractional Legendre Kernel Functions: Theory and Application 71

Table 4.1 Some applications for different kinds of Legendre polynomials

Legendre polynomials: Legendre polynomials are used frequently in different fields
of science. For example, they are used as an orthogonal kernel in support vector
machines Marianela and Gómez (2014), or as basis functions for solving various
differential equations such as fractional equations Saadatmandi and Dehghan (2010),
integral equations Bhrawy et al. (2016), optimal control problems Ezz-Eldien et al.
(2017), and so on Kazem et al. (2012). Moreover, they have been used in long
short-term memory neural networks Voelker et al. (2019)

Fractional Legendre functions: Based on fractional-order Legendre polynomials,
Benouini et al. proposed a new set of moments that can be used to extract pattern
features Pan et al. (2012), Benouini et al. (2019). These functions have many other
applications, for instance, solving fractional differential equations Kazem et al.
(2013), Bhrawy et al. (2016)

Rational Legendre functions: They are used to solve differential equations on
semi-infinite intervals Parand and Razzaghi (2004), Parand et al. (2009, 2010)

Marianela and Gómez (2014), Benouini et al. (2019). In Table 4.1, some applications
of different kinds of Legendre polynomials are mentioned.
The rest of this chapter is organized as follows: Section 4.2 presents the
fundamental definitions and characteristics of the orthogonal Legendre polynomials
and the fractional Legendre functions. In Sect. 4.3, the ordinary Legendre kernel is
presented, and the novel fractional Legendre kernel is introduced. In Sect. 4.4, the
results of experiments with both the ordinary and the fractional Legendre kernels are
covered, and the accuracy of these kernels in the SVM algorithm is compared with that
of the standard polynomial and Gaussian kernels and of the ordinary and fractional
Chebyshev kernels on well-known datasets, to demonstrate the validity and efficiency
of the proposed kernels. Finally, in Sect. 4.5, the concluding remarks of this
chapter are presented.

4.2 Preliminaries

This section presents the definition and basic properties of the orthogonal Legendre
polynomials. In addition to the basics of these polynomials, the fractional form of
this family is discussed.

4.2.1 Properties of Legendre Polynomials

Legendre polynomials of degree n, P_n(x), are solutions to the following
Sturm-Liouville differential equation Shen (1994), Kazem et al. (2013), Hadian
Rasanan et al. (2020), Rad et al. (2014), Asghari et al. (2022):

(1 - x^2) \frac{d^2 y}{dx^2} - 2x \frac{dy}{dx} + n(n + 1)y = 0,    (4.1)
where n is a positive integer. This family is also orthogonal over the interval [-1, 1]
with respect to the weight function w(x) = 1 Olver et al. (2010), such that

\int_{-1}^{1} P_n(x) P_m(x) \, dx = \frac{2}{2n + 1} \delta_{nm},    (4.2)

in which \delta_{nm} is the Kronecker delta, defined as follows:

\delta_{nm} = \begin{cases} 1, & n = m, \\ 0, & \text{otherwise}. \end{cases}    (4.3)
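The orthogonality relation in Eq. 4.2 is easy to check numerically. The sketch below is our own illustration (the function name legendre_inner is not from the text); it approximates the integral with Gauss-Legendre quadrature from NumPy, which is exact for polynomial integrands of this degree:

```python
import numpy as np
from numpy.polynomial import legendre as leg

def legendre_inner(n, m, num_nodes=50):
    """Approximate the integral of P_n(x) P_m(x) over [-1, 1]
    with Gauss-Legendre quadrature."""
    nodes, weights = leg.leggauss(num_nodes)
    # A unit coefficient vector picks out the single polynomial P_k
    pn = leg.legval(nodes, [0] * n + [1])
    pm = leg.legval(nodes, [0] * m + [1])
    return np.sum(weights * pn * pm)

# Diagonal entries equal 2 / (2n + 1); off-diagonal entries vanish
print(round(legendre_inner(3, 3), 6))       # 2/7 = 0.285714
print(round(abs(legendre_inner(3, 5)), 6))  # 0.0
```

The same check with any pair (n, m) reproduces Eq. 4.2 to machine precision.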

Besides the Sturm-Liouville differential equation, there is a generating function
that can generate each order of the Legendre polynomials. By setting |t| < 1 and
|x| <= 1, the generating function of the Legendre polynomials is obtained as follows
Shen (1994), Kazem et al. (2013), Hadian Rasanan et al. (2020), Rad et al. (2014),
Asghari et al. (2022):

\frac{1}{\sqrt{1 - 2tx + t^2}} = \sum_{n=0}^{\infty} t^n P_n(x).    (4.4)

Thus, the coefficient of t^n is a polynomial in x of degree n, which is the n-th
Legendre polynomial. Expanding this generating function up to t^1 yields

P_0(x) = 1, \quad P_1(x) = x.    (4.5)

Similar to the Chebyshev polynomials, the Legendre polynomials can be obtained by
utilizing a recursive formula, which is given as follows Doman (2015), Shen (1994),
Kazem et al. (2013), Hadian Rasanan et al. (2020), Rad et al. (2014), Asghari et al.
(2022):

P_0(x) = 1,
P_1(x) = x,
(n + 1)P_{n+1}(x) - (2n + 1)x P_n(x) + n P_{n-1}(x) = 0, \quad n \geq 1.    (4.6)

Fig. 4.1 The first six orders of Legendre polynomials

So, the first few Legendre polynomials in explicit form are

P_0(x) = 1,
P_1(x) = x,
P_2(x) = (3x^2 - 1)/2,
P_3(x) = (5x^3 - 3x)/2,
P_4(x) = (35x^4 - 30x^2 + 3)/8,
P_5(x) = (63x^5 - 70x^3 + 15x)/8,
P_6(x) = (231x^6 - 315x^4 + 105x^2 - 5)/16,    (4.7)

which are depicted in Fig. 4.1.


In addition to the recursive relation in Eq. 4.6, using the generating function of
Eq. 4.4, the Legendre polynomials satisfy the following two recursive relations:

n P_n(x) = x P'_n(x) - P'_{n-1}(x),
P'_{n+1}(x) = x P'_n(x) + (n + 1)P_n(x).    (4.8)

Also, the following recursive relations, obtained by combining the above relations
and their derivatives with each other, provide other forms of recursive relations of
the Legendre polynomials Lamb Jr (2011):

(2n + 1)P_n(x) = P'_{n+1}(x) - P'_{n-1}(x),
(1 - x^2)P'_n(x) = n[P_{n-1}(x) - x P_n(x)].    (4.9)

The reader can use the following Python code to generate any order of Legendre
polynomials symbolically:

Program Code

import sympy

x = sympy.Symbol("x")

def Ln(x, n):
    # Recurrence of Eq. 4.6: n*P_n(x) = (2n - 1)*x*P_{n-1}(x) - (n - 1)*P_{n-2}(x)
    if n == 0:
        return 1
    elif n == 1:
        return x
    elif n >= 2:
        return ((2 * n - 1) * (x * Ln(x, n - 1)) - (n - 1) * Ln(x, n - 2)) / n

sympy.expand(sympy.simplify(Ln(x, 3)))

> \frac{5x^3}{2} - \frac{3x}{2}.
Similar to other classical orthogonal polynomials, Legendre polynomials follow
symmetry. Therefore, Legendre polynomials of even order have even symmetry and
contain only even powers of x, and similarly, Legendre polynomials of odd order have
odd symmetry and contain only odd powers of x:

P_n(x) = (-1)^n P_n(-x) = \begin{cases} P_n(-x), & n \text{ is even}, \\ -P_n(-x), & n \text{ is odd}, \end{cases}    (4.10)

P'_n(x) = (-1)^{n+1} P'_n(-x) = \begin{cases} -P'_n(-x), & n \text{ is even}, \\ P'_n(-x), & n \text{ is odd}. \end{cases}    (4.11)

For every n = 0, 1, ..., the polynomial P_n(x) has exactly n zeros in [-1, 1], all
real and distinct from each other Hadian Rasanan et al. (2020). If these zeros are
regarded as dividing the interval [-1, 1] into n + 1 sub-intervals, then each
sub-interval contains exactly one zero of P_{n+1} Hadian Rasanan et al. (2020). Also,
P_n(x) has n - 1 local minima and maxima in (-1, 1) Hadian Rasanan et al. (2020). On
the other hand, these polynomials have the following basic properties:

P_n(1) = 1,    (4.12)

P_n(-1) = \begin{cases} 1, & n = 2m, \\ -1, & n = 2m + 1, \end{cases}    (4.13)

P_n(0) = \begin{cases} \frac{(-1)^m}{4^m}\binom{2m}{m} = \frac{(-1)^m (2m)!}{2^{2m}(m!)^2}, & n = 2m, \\ 0, & n = 2m + 1. \end{cases}    (4.14)

Above all, to shift the Legendre polynomials from [-1, 1] to [a, b], we should
use the following transformation:

x = \frac{2t - a - b}{b - a},    (4.15)

where x \in [-1, 1] and t \in [a, b]. It is worth mentioning that all the properties
of the Legendre polynomials remain unchanged for the shifted Legendre polynomials,
with the difference that the position of -1 transfers to a and the position of 1
transfers to b.
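As a quick illustration of Eq. 4.15 (the interval [0, 5] and the helper name shifted_legendre are our own choices, not from the text), a shifted polynomial can be evaluated by composing P_n with the affine map:

```python
import numpy as np
from numpy.polynomial import legendre as leg

def shifted_legendre(n, t, a=0.0, b=5.0):
    """Evaluate the degree-n Legendre polynomial shifted to [a, b],
    using the map x = (2t - a - b) / (b - a) of Eq. 4.15."""
    x = (2.0 * t - a - b) / (b - a)
    return leg.legval(x, [0] * n + [1])

# The endpoint values travel with the interval: the value at b is
# P_n(1) = 1 and the value at a is P_n(-1) = (-1)^n.
print(round(shifted_legendre(4, 5.0), 6))  # 1.0
print(round(shifted_legendre(3, 0.0), 6))  # -1.0
```

The special values of Eqs. 4.12-4.14 therefore carry over directly, with a and b playing the roles of -1 and 1.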

4.2.2 Properties of Fractional Legendre Functions

Legendre functions of fractional order \alpha (denoted FP_n^\alpha(x)) over the
finite interval [a, b] are defined by use of the mapping
x' = 2\left(\frac{x - a}{b - a}\right)^\alpha - 1 (\alpha > 0), where x' \in [-1, 1],
as Kazem et al. (2013)

FP_n^\alpha(x) = P_n(x') = P_n\left(2\left(\frac{x - a}{b - a}\right)^\alpha - 1\right).    (4.16)

Also, the generating function for the fractional Legendre functions is defined in the
same way as the generating function of the Legendre polynomials, with the difference
that x is replaced by 2\left(\frac{x - a}{b - a}\right)^\alpha - 1 on the interval
[a, b] Shen (1994), Kazem et al. (2013), Hadian Rasanan et al. (2020), Rad et al.
(2014), Asghari et al. (2022):

\frac{1}{\sqrt{1 - 2t\left(2\left(\frac{x - a}{b - a}\right)^\alpha - 1\right) + t^2}} = \sum_{n=0}^{\infty} FP_n^\alpha(x) t^n = \sum_{n=0}^{\infty} P_n\left(2\left(\frac{x - a}{b - a}\right)^\alpha - 1\right) t^n.    (4.17)

On the other hand, the fractional Legendre functions, like the Legendre polynomials,
have the property of orthogonality, with respect to the weight function
w(x) = x^{\alpha - 1} Hadian Rasanan et al. (2020). Over the interval [0, 1], for
example, this property reads

\int_0^1 FP_n^\alpha(x) FP_m^\alpha(x) \, x^{\alpha - 1} \, dx = \frac{1}{(2n + 1)\alpha} \delta_{nm},    (4.18)

where \delta_{nm} is the Kronecker delta defined in Eq. 4.3.
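Equation 4.18 can be confirmed numerically on [0, 1] (with a = 0, b = 1). A convenient trick, used in the sketch below (our own illustration, not from the text), is the substitution t = x^\alpha, which absorbs the weight x^{\alpha - 1} and leaves the smooth integrand P_n(2t - 1) P_m(2t - 1)/\alpha, so plain Gauss-Legendre quadrature is exact:

```python
import numpy as np
from numpy.polynomial import legendre as leg

def frac_legendre_inner(n, m, alpha, num_nodes=60):
    """Integral of FP_n^a(x) FP_m^a(x) x^(a-1) over [0, 1], computed
    after the substitution t = x**alpha, which turns it into
    (1/alpha) times the integral of P_n(2t-1) P_m(2t-1) over [0, 1]."""
    nodes, weights = leg.leggauss(num_nodes)
    t = 0.5 * (nodes + 1.0)   # map quadrature nodes from [-1, 1] to [0, 1]
    w = 0.5 * weights         # Jacobian of that affine map
    pn = leg.legval(2.0 * t - 1.0, [0] * n + [1])
    pm = leg.legval(2.0 * t - 1.0, [0] * m + [1])
    return np.sum(w * pn * pm) / alpha

# Diagonal entries equal 1 / ((2n + 1) * alpha); off-diagonal entries vanish
print(round(frac_legendre_inner(2, 2, 0.5), 6))       # 1/(5 * 0.5) = 0.4
print(round(abs(frac_legendre_inner(2, 4, 0.5)), 6))  # 0.0
```

The substitution also explains where the extra factor 1/\alpha in the normalization of Eq. 4.18 comes from.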


It can also be easily shown that for fractional Legendre polynomials, the recursive
relation will be given as follows Kazem et al. (2013), Hadian Rasanan et al. (2020):

F P0α (x) = 1,
x −a α
F P1α (x) = 2( ) − 1,
b−a
α 2n + 1 x −a α n
F Pn+1 (x) = ( )(2( ) − 1)F Pnα (x  ) − ( α
)F Pn−1 (x  ).
n+1 b−a n+1
(4.19)

With this in mind, the first few fractional Legendre functions in explicit form are
(writing s = \frac{x - a}{b - a} for brevity)

FP_0^\alpha(x) = 1,
FP_1^\alpha(x) = 2s^\alpha - 1,
FP_2^\alpha(x) = 6s^{2\alpha} - 6s^\alpha + 1,
FP_3^\alpha(x) = 20s^{3\alpha} - 30s^{2\alpha} + 12s^\alpha - 1,
FP_4^\alpha(x) = 70s^{4\alpha} - 140s^{3\alpha} + 90s^{2\alpha} - 20s^\alpha + 1,
FP_5^\alpha(x) = 252s^{5\alpha} - 630s^{4\alpha} + 560s^{3\alpha} - 210s^{2\alpha} + 30s^\alpha - 1,
FP_6^\alpha(x) = 924s^{6\alpha} - 2772s^{5\alpha} + 3150s^{4\alpha} - 1680s^{3\alpha} + 420s^{2\alpha} - 42s^\alpha + 1.    (4.20)

Interested readers can use the following Python code to generate any order of the
Legendre functions of fractional order symbolically:

Fig. 4.2 The first six orders of the fractional Legendre functions over the finite interval [0, 5],
where α = 0.5

Program Code

import sympy

x = sympy.Symbol("x")
alpha = sympy.Symbol(r"\alpha")
a = sympy.Symbol("a")
b = sympy.Symbol("b")
# Map the variable as in Eq. 4.16: x -> 2*((x - a)/(b - a))**alpha - 1
x = sympy.sympify(2 * ((x - a) / (b - a)) ** alpha - 1)

def FLn(x, n):
    if n == 0:
        return 1
    elif n == 1:
        return x
    elif n >= 2:
        return ((2 * n - 1) * (x * FLn(x, n - 1)) - (n - 1) * FLn(x, n - 2)) / n

sympy.simplify(FLn(x, 3))

> 20\left(\frac{x - a}{b - a}\right)^{3\alpha} - 30\left(\frac{x - a}{b - a}\right)^{2\alpha} + 12\left(\frac{x - a}{b - a}\right)^{\alpha} - 1
Figure 4.2 shows the fractional Legendre functions up to the sixth order, where
a = 0, b = 5, and α = 0.5.

Fig. 4.3 The fifth-order fractional Legendre function over the finite interval [0, 5] with different values of α

Also, Fig. 4.3 depicts the fractional Legendre function of order 5 for different
values of α.

4.3 Legendre Kernel Functions

In this section, the formula of the ordinary Legendre polynomial kernel is presented
first. Then, some other versions of this kernel function are introduced, and at the
end, the fractional Legendre polynomial kernel is introduced.

4.3.1 Ordinary Legendre Kernel Function

Like the other families of classical orthogonal polynomials, the Legendre polynomials
are orthogonal; their distinguishing feature is that their weight function is the
constant function of value 1 (w(x) = 1). So, according to Eqs. 4.2 and 4.3, for
different m and n, the Legendre polynomials are orthogonal to each other with weight
function 1. This enables us to construct the Legendre kernel without denominators.
Therefore, for scalar inputs x and z, the Legendre kernel is defined as follows:

K(x, z) = \sum_{i=0}^{n} P_i(x) P_i(z) = \langle \varphi(x), \varphi(z) \rangle,    (4.21)

where \langle \cdot, \cdot \rangle is an inner product, the unique parameter n is the
highest order of the Legendre polynomials, and the nonlinear mapping determined by
the Legendre kernel is

\varphi(x) = (P_0(x), P_1(x), ..., P_n(x)) \in \mathbb{R}^{n+1}.    (4.22)

Now, with this formula, it can be seen that the components of \varphi(x) are
orthogonal to each other, indicating that the Legendre kernel is an orthogonal kernel
Shawe-Taylor and Cristianini (2004).
Theorem 4.1 (Pan et al. (2012)) The Legendre kernel is a valid Mercer kernel.

Proof According to the Mercer theorem introduced in Sect. 2.2.1, a valid kernel
function needs to be positive semi-definite or, equivalently, should satisfy the
necessary and sufficient conditions of Mercer's theorem. Mercer's theorem states
that, for any SVM kernel to be valid, it must satisfy

\int \int K(x, z) w(x, z) f(x) f(z) \, dx \, dz \geq 0,    (4.23)

where

K(x, z) = \sum_{i=0}^{n} P_i(x) P_i^T(z),    (4.24)

w(x, z) = 1, and f is any function f: \mathbb{R}^m \to \mathbb{R}. Assuming each
element is independent of the others, the Mercer condition for K(x, z) can be
evaluated as follows:

\int \int K(x, z) w(x, z) f(x) f(z) \, dx \, dz = \int \int \sum_{j=0}^{n} P_j(x) P_j^T(z) f(x) f(z) \, dx \, dz
= \sum_{j=0}^{n} \int \int P_j(x) P_j^T(z) f(x) f(z) \, dx \, dz
= \sum_{j=0}^{n} \left( \int P_j(x) f(x) \, dx \right) \left( \int P_j^T(z) f(z) \, dz \right)
= \sum_{j=0}^{n} \left( \int P_j(x) f(x) \, dx \right)^2
\geq 0.    (4.25)

The Legendre kernels up to the third order can be expressed as

K(x, z) = 1 + xz,
K(x, z) = 1 + xz + \frac{9(xz)^2 - 3x^2 - 3z^2 + 1}{4},
K(x, z) = 1 + xz + \frac{9(xz)^2 - 3x^2 - 3z^2 + 1}{4} + \frac{25(xz)^3 - 15x^3 z - 15xz^3 + 9xz}{4}.    (4.26)

This kernel can be expanded to vector inputs x, z \in \mathbb{R}^d as follows:

K(x, z) = \prod_{j=1}^{d} K_j(x_j, z_j) = \prod_{j=1}^{d} \sum_{i=0}^{n} P_i(x_j) P_i(z_j),    (4.27)

where x_j is the j-th element of the vector x.


Given that the multiplication of two valid kernels is still a valid kernel, this
kernel is also valid. As with the Chebyshev kernel function for vector inputs, each
feature of the input vector for the Legendre kernel function must lie in [-1, 1]. So
the input data has to be normalized to [-1, 1] via the following formula:

x_i^{new} = \frac{2(x_i^{old} - Min_i)}{Max_i - Min_i} - 1,    (4.28)

where x_i is the i-th feature of the vector x, and Max_i and Min_i are the maximum
and minimum values along the i-th dimension of all the training and test data,
respectively.
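Putting Eqs. 4.27 and 4.28 together, a minimal NumPy sketch of the vector-input Legendre kernel might look as follows. This is our own illustration, not the authors' reference implementation; the function names are ours, and the inputs of legendre_kernel are assumed to be already scaled to [-1, 1]:

```python
import numpy as np

def scale_to_unit(X, lo, hi):
    """Eq. 4.28: min-max scale each feature to [-1, 1]."""
    return 2.0 * (np.asarray(X, dtype=float) - lo) / (hi - lo) - 1.0

def legendre_basis(x, n):
    """Stack P_0(x), ..., P_n(x) using the recurrence of Eq. 4.6."""
    x = np.asarray(x, dtype=float)
    p = [np.ones_like(x), x]
    for k in range(1, n):
        p.append(((2 * k + 1) * x * p[k] - k * p[k - 1]) / (k + 1))
    return np.array(p[: n + 1])

def legendre_kernel(x, z, n=3):
    """Eq. 4.27: product over features of sum_i P_i(x_j) P_i(z_j)."""
    px = legendre_basis(x, n)   # shape (n + 1, d)
    pz = legendre_basis(z, n)
    return float(np.prod(np.sum(px * pz, axis=0)))

# For scalars and n = 1, Eq. 4.26 gives K(x, z) = 1 + x z
print(round(legendre_kernel([0.5], [0.2], n=1), 6))  # 1.1
```

A Gram matrix for an SVM with a precomputed kernel can then be assembled by evaluating legendre_kernel over all training pairs.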

4.3.2 Other Legendre Kernel Functions

Apart from the ordinary form of the Legendre kernel function, other kernels with
unique properties and special weight functions have been defined based on the
Legendre polynomials, some of which are introduced in this section.

4.3.2.1 Generalized Legendre Kernel

Ozer et al. (2011) applied kernel functions to vector inputs directly instead of
applying them to each input element separately. In this spirit, the generalized
Legendre kernel, built on the generalized Legendre polynomials, was proposed as
follows Tian and Wang (2017):

K_{G-Legendre}(x, z) = \sum_{i=0}^{n} P_i(x) P_i^T(z),    (4.29)

where

P_0(x) = 1,
P_1(x) = x,
P_n(x) = \frac{2n - 1}{n} x^T P_{n-1}(x) - \frac{n - 1}{n} P_{n-2}(x).    (4.30)
Compared with the generalized Legendre kernels, the generalized Chebyshev kernels
have more complex expressions due to their more complex weight function, so the
generalized Chebyshev kernels can capture more abundant nonlinear information.
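Note that in the recursion of Eq. 4.30, the term x^T P_{n-1}(x) alternately contracts a vector to a scalar or rescales x by a scalar, so even-order terms are scalars and odd-order terms are vectors. The sketch below is our reading of Eqs. 4.29-4.30 (function names are ours), which reduces to the ordinary scalar kernel when d = 1:

```python
import numpy as np

def gen_legendre_terms(x, n):
    """Generalized Legendre recursion (Eq. 4.30) on a whole vector x:
    P_0 = 1, P_1 = x, and x^T P_{k-1} either contracts a vector term
    to a scalar or rescales x by a scalar term."""
    x = np.asarray(x, dtype=float)
    terms = [np.array([1.0]), x]
    for k in range(2, n + 1):
        prev, prev2 = terms[k - 1], terms[k - 2]
        if prev.size > 1:                      # vector term: x^T P -> scalar
            xp = np.array([np.dot(x, prev)])
        else:                                  # scalar term: x * P -> vector
            xp = x * prev[0]
        terms.append(((2 * k - 1) * xp - (k - 1) * prev2) / k)
    return terms[: n + 1]

def gen_legendre_kernel(x, z, n=3):
    """Eq. 4.29: sum of inner products of matching terms."""
    tx = gen_legendre_terms(x, n)
    tz = gen_legendre_terms(z, n)
    return float(sum(np.dot(a, b) for a, b in zip(tx, tz)))

# With d = 1 this collapses to the ordinary sum of P_i(x) P_i(z)
print(round(gen_legendre_kernel([0.5], [0.2], n=1), 6))  # 1 + 0.1 = 1.1
```

The kernel is symmetric by construction, since each summand is an inner product of matching terms.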

4.3.2.2 Exponentially Legendre Kernel

The exponentially modified orthogonal polynomial kernels are actually the product
of the well-known Gaussian kernel and the corresponding generalized orthogonal
polynomial kernels (without weighting function). Since an exponential function (the
Gaussian kernel) can capture local information along the decision surface better
than the square root function, Ozer et al. (2011) replaced the weighting function
with the Gaussian kernel \frac{1}{\exp(\gamma \|x - z\|^2)} and defined the
exponentially modified Legendre kernel as Tian and Wang (2017)

K_{exp-Legendre}(x, z) = \frac{\sum_{i=0}^{n} P_i(x) P_i^T(z)}{\exp(\gamma \|x - z\|^2)}.    (4.31)

It should be noted that the modified generalized orthogonal polynomial kernels can
be seen as semi-local kernels. Also, having two parameters, n and \gamma > 0, makes
these kernels more difficult to optimize than the generalized orthogonal polynomial
kernels Tian and Wang (2017).

4.3.2.3 Triangularly Modified Legendre Kernel

The triangular kernel, which is basically an affine function of the Euclidean
distance d(i, j) between the points in the original space, is expressed as Fleuret
and Sahbi (2003)

K(x, z) = \left(1 - \frac{\|x - z\|}{\lambda}\right)_+,    (4.32)

where (\cdot)_+ forces this mapping to be positive and ensures that this expression
is a kernel. Therefore, the triangularly modified Legendre kernel can be written as
follows Tian and Wang (2017), Belanche Muñoz (2013), Fleuret and Sahbi (2003):

K_{Tri-Legendre}(x, z) = \left(1 - \frac{\|x - z\|}{\lambda}\right)_+ \sum_{i=0}^{n} P_i(x) P_i^T(z),    (4.33)

where \lambda = \max\{d(x_i, \bar{x}) \mid \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i,
x_i \in X\}, X is a finite sample set, and N is the number of samples. Thus, all the
data live in a ball of radius \lambda. Since the parameter \lambda only depends on
the input data, the triangularly modified orthogonal polynomial kernels have a
unique parameter chosen from a small set of integers.
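The data-dependent radius \lambda and the clipped triangular factor of Eqs. 4.32-4.33 can be sketched as follows (our own illustration, with our own function names and a toy four-point sample; multiplying this factor by the polynomial sum of Eq. 4.29 yields the full kernel):

```python
import numpy as np

def triangular_radius(X):
    """lambda = max distance of any sample to the sample mean, so that
    all the data live in a ball of radius lambda around the mean."""
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)
    return float(np.max(np.linalg.norm(X - center, axis=1)))

def triangular_factor(x, z, lam):
    """(1 - ||x - z|| / lambda)_+ : clipped at zero so the factor
    stays non-negative, as Eq. 4.32 requires."""
    dist = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(z, dtype=float))
    return max(0.0, 1.0 - dist / lam)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
lam = triangular_radius(X)
print(round(lam, 6))                                 # sqrt(1/2) = 0.707107
print(round(triangular_factor(X[0], X[3], lam), 6))  # distance 2*lam -> 0.0
```

Since \lambda is computed from the data, the only remaining free parameter of the triangularly modified kernel is the polynomial order n.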

4.3.3 Fractional Legendre Kernel

Given the fractional weight function fw(x, z) = (xz^T)^{\alpha - 1} and proceeding
in analogy with the Legendre kernel, the corresponding fractional form is introduced
as

K_{FLegendre}(X, Z) = \prod_{j=1}^{m} \sum_{i=0}^{n} FP_i^\alpha(X_j) FP_i^\alpha(Z_j) \, fw(X_j, Z_j),    (4.34)

where m is the dimension of the vectors X and Z, and X_j denotes the j-th feature of
X. According to the procedure we follow in this book, after introducing a kernel we
should guarantee its validity, so we continue with the following theorem.
Theorem 4.2 The fractional Legendre kernel is a valid Mercer kernel.

Proof According to Mercer's theorem introduced in Sect. 2.2.1, a valid kernel
should satisfy the sufficient conditions of Mercer's theorem; that is, any SVM
kernel, to be valid, must satisfy

\int \int K(x, z) w(x, z) f(x) f(z) \, dx \, dz \geq 0,    (4.35)

where

K(x, z) = \sum_{i=0}^{n} FP_i^\alpha(x) FP_i^\alpha(z)^T fw(x, z),    (4.36)

fw(x, z) = (xz^T)^{\alpha - 1}, and f is any function f: \mathbb{R}^m \to \mathbb{R}.
Thus, we have

\int \int K(x, z) w(x, z) f(x) f(z) \, dx \, dz = \int \int \sum_{i=0}^{n} FP_i^\alpha(x) FP_i^\alpha(z)^T (xz^T)^{\alpha - 1} f(x) f(z) \, dx \, dz
= \sum_{i=0}^{n} \int \int FP_i^\alpha(x) x^{\alpha - 1} FP_i^\alpha(z)^T z^{\alpha - 1} f(x) f(z) \, dx \, dz
= \sum_{i=0}^{n} \left( \int FP_i^\alpha(x) x^{\alpha - 1} f(x) \, dx \right) \left( \int FP_i^\alpha(z)^T z^{\alpha - 1} f(z) \, dz \right)
= \sum_{i=0}^{n} \left( \int FP_i^\alpha(x) x^{\alpha - 1} f(x) \, dx \right)^2
\geq 0.    (4.37)

4.4 Application of Legendre Kernel Functions on Real Datasets

In this section, the application of the ordinary Legendre kernel and the fractional
Legendre kernel in SVM is shown. The results obtained on two real datasets are also
compared with the results of the RBF, ordinary polynomial, ordinary Chebyshev, and
fractional Chebyshev kernels.

4.4.1 Spiral Dataset

The Spiral dataset has already been introduced in the previous chapter. In this
section, the Legendre kernel and the fractional Legendre kernel are used to classify
the Spiral dataset with SVM. As mentioned before, this multi-class classification
dataset can be split into three binary classification datasets. It was also explained
previously (Fig. 3.5) that, by transferring the data to the fractional space using
the coefficient α (0.1 ≤ α ≤ 0.9), the data density gradually decreases as the value
of α decreases, and the central axis of the data density is transferred from the
point (0, 0) to the point (1, 1). Although the data points appear to be conglomerated
in one corner of the axes in the 2D plot (Fig. 4.4), this is not necessarily what
happens in 3D space when the kernel function is applied (Fig. 4.5).
Using the data transferred to the fractional space, the decision boundaries are found
for the problem with the above-mentioned divisions with the help of the Legendre
and fractional Legendre kernel functions.
Fig. 4.4 Spiral dataset, in normal and fractional spaces

Fig. 4.5 Normal and fractional Legendre kernels of order 3, with α = 0.4, applied on
the Spiral dataset in 3D space

In order to gain an intuition of how the decision boundaries differ, we can look at
Fig. 4.6, which depicts the corresponding Legendre classifiers of orders 3, 4, 5, and
6 on the original Spiral dataset, where the binary classification 1-vs-{2, 3} is
chosen. From these figures, it can be seen that the decision boundaries become more
and more twisted.
The complexity of the decision boundary gets even worse in the fractional space.
Figure 4.7 demonstrates the decision boundary of the fractional Legendre classifier
with orders 3, 4, 5, and 6, where α = 0.4. Again, the decision boundary becomes more
complicated as one moves from order 3 to order 6.¹
Each decision boundary, or decision surface in 3D space, classifies the data points
in a specific form. As the decision boundary or surface of a kernel function for each
order is fixed and determined, it is the degree of correlation between that specific
shape and the data points that determines the final accuracy.
The experimental results are summarized in the following tables. In particular,
Table 4.2 summarizes the results for class 1-vs-{2, 3}. As can be seen, the
fractional Legendre kernel outperforms the other kernels, and the fractional
Chebyshev kernel has the second-best accuracy. The classification accuracies of the
mentioned kernels for class 2-vs-{1, 3} are summarized in Table 4.3, where the RBF
kernel has the best accuracy score. Finally, Table 4.4 gives the classification
accuracy scores for class 3-vs-{1, 2}, in which the fractional Legendre kernel has
the best performance.

¹ Based on many studies, it was concluded that the fractional Legendre kernel of
order 3 is not a suitable option for use on the Spiral dataset.

Fig. 4.6 Legendre kernel of orders 3, 4, 5, and 6 on Spiral dataset

4.4.2 Three Monks’ Dataset

As another case example, the three Monks’ problem is considered here (see Chap. 3
for more information about the dataset). The Legendre kernel (i.e., Eq. 4.21) and the
fractional Legendre kernel (i.e., Eq. 4.34) are applied to the datasets of the three
Monks’ problems. Table 4.5 reports the accuracy of each model on the first Monks’
problem, where the RBF kernel (σ ≈ 2.844) has the best accuracy at 0.8819 and the
Legendre kernel has the worst among them at 0.8333.
Table 4.6 shows the accuracy of each model on the second Monks’ problem, with the
fractional Legendre kernel (α = 0.1) having the best accuracy at 1 and the Legendre
kernel having the worst accuracy at 0.8032.
Finally, Table 4.7 reports the accuracy of each model on the third Monks’ problem,
where the fractional Chebyshev kernel (α = 1/5) and the RBF kernel have the best
accuracies at 0.91 and the fractional Legendre kernel has the worst among them at
0.8379.

Fig. 4.7 Fractional Legendre kernel functions of orders 3, 4, 5, and 6 on Spiral dataset with α = 0.4

Table 4.2 Comparison of RBF, polynomial, Chebyshev, fractional Chebyshev, Legendre,
and fractional Legendre accuracy scores on the 1-vs-{2, 3} Spiral dataset. The
fractional Legendre kernel outperforms the other kernels

Kernel                 Sigma   Power   Order   Alpha (α)   Lambda (λ)   Accuracy
RBF                    0.73    –       –       –           –            0.97
Polynomial             –       8       –       –           –            0.9533
Chebyshev              –       –       5       –           –            0.9667
Fractional Chebyshev   –       –       3       0.3         –            0.9733
Legendre               –       –       7       –           –            0.9706
Fractional Legendre    –       –       7       0.4         –            0.9986

Table 4.3 Comparison of RBF, polynomial, Chebyshev, fractional Chebyshev, Legendre,
and fractional Legendre accuracy scores on the 2-vs-{1, 3} Spiral dataset. The RBF
kernel outperforms the other kernels

Kernel                 Sigma   Power   Order   Alpha (α)   Lambda (λ)   Accuracy
RBF                    0.1     –       –       –           –            0.9867
Polynomial             –       5       –       –           –            0.9044
Chebyshev              –       –       6       –           –            0.9289
Fractional Chebyshev   –       –       6       0.8         –            0.9344
Legendre               –       –       8       –           –            0.9773
Fractional Legendre    –       –       8       0.4         –            0.9853

Table 4.4 Comparison of RBF, polynomial, Chebyshev, fractional Chebyshev, Legendre,
and fractional Legendre accuracy scores on the 3-vs-{1, 2} Spiral dataset. The
fractional Legendre kernel performs best, with the RBF and polynomial kernels close
behind

Kernel                 Sigma   Power   Order   Alpha (α)   Lambda (λ)   Accuracy
RBF                    0.73    –       –       –           –            0.98556
Polynomial             –       5       –       –           –            0.98556
Chebyshev              –       –       6       –           –            0.9622
Fractional Chebyshev   –       –       6       0.6         –            0.9578
Legendre               –       –       7       –           –            0.9066
Fractional Legendre    –       –       5       0.4         –            0.9906

Table 4.5 Comparison of RBF, polynomial, Chebyshev, fractional Chebyshev, Legendre,
and fractional Legendre kernels on the first Monks’ problem

Kernel                 Sigma   Power   Order   Alpha (α)   Lambda (λ)   Accuracy
RBF                    2.844   –       –       –           –            0.8819
Polynomial             –       3       –       –           –            0.8681
Chebyshev              –       –       3       –           –            0.8472
Fractional Chebyshev   –       –       3       1/16        –            0.8588
Legendre               –       –       4       –           –            0.8333
Fractional Legendre    –       –       4       0.1         –            0.8518

Table 4.6 Comparison of RBF, polynomial, Chebyshev, fractional Chebyshev, Legendre,
and fractional Legendre kernels on the second Monks’ problem

Kernel                 Sigma    Power   Order   Alpha (α)   Lambda (λ)   Accuracy
RBF                    5.5896   –       –       –           –            0.875
Polynomial             –        3       –       –           –            0.8657
Chebyshev              –        –       3       –           –            0.8426
Fractional Chebyshev   –        –       3       1/16        –            0.9653
Legendre               –        –       3       –           –            0.8032
Fractional Legendre    –        –       3       0.1         –            1

Table 4.7 Comparison of RBF, polynomial, Chebyshev, fractional Chebyshev, Legendre,
and fractional Legendre kernels on the third Monks’ problem

Kernel                 Sigma    Power   Order   Alpha (α)   Lambda (λ)   Accuracy
RBF                    2.1586   –       –       –           –            0.91
Polynomial             –        3       –       –           –            0.875
Chebyshev              –        –       6       –           –            0.895
Fractional Chebyshev   –        –       5       1/5         –            0.91
Legendre               –        –       4       –           –            0.8472
Fractional Legendre    –        –       3       0.8         –            0.8379

4.5 Conclusion

In this chapter, the Legendre polynomial kernel of fractional order for support
vector machines was introduced, built on the ordinary Legendre polynomial kernel.
The Legendre polynomial kernel can extract good features from data thanks to the
orthogonality of the elements of the feature vector, thus reducing data redundancy.
Moreover, based on the results of the experiments on the two datasets in the previous
section, it can be concluded that SVMs with the Legendre and fractional Legendre
kernels can separate nonlinear data well.

References

Afifi, A., Zanaty, E.A.: Generalized Legendre polynomials for support vector machines (SVMs)
classification. Int. J. Netw. Secur. Appl. (IJNSA) 11, 87–104 (2019)
Asghari, M., Hadian Rasanan, A.H., Gorgin, S., Rahmati, D., Parand, K.: FPGA-orthopoly: a
hardware implementation of orthogonal polynomials. Eng. Comput. (2022). https://doi.org/10.
1007/s00366-022-01612-x
Belanche Muñoz, L. A.: Developments in kernel design. In ESANN 2013 Proceedings: European
Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning,
pp. 369–378 (2013)
Benouini, R., Batioua, I., Zenkouar, Kh., Mrabti, F.: New set of generalized Legendre moment
invariants for pattern recognition. Pattern Recognit. Lett. 123, 39–46 (2019)
Bhrawy, A.H., Abdelkawy, M.A., Machado, J.T., Amin, A.Z.M.: Legendre-Gauss-Lobatto collo-
cation method for solving multi-dimensional Fredholm integral equations. Comput. Math. Appl.
4, 1–13 (2016)
Bhrawy, A.H., Doha, E.H., Ezz-Eldien, S.S., Abdelkawy, M.A.: A numerical technique based on
the shifted Legendre polynomials for solving the time-fractional coupled KdV equations. Calcolo
53, 1–17 (2016)
Chang, P., Isah, A.: Legendre Wavelet Operational Matrix of fractional Derivative through wavelet-
polynomial transformation and its Applications in Solving Fractional Order Brusselator system.
J. Phys.: Conf. Ser. 693 (2016)
Chang, R.Y., Wang, M.L.: Model reduction and control system design by shifted Legendre poly-
nomial functions. J. Dyn. Syst. Meas. Control 105, 52–55 (1983)
Chang, R.Y., Wang, M.L.: Optimal control of linear distributed parameter systems by shifted Leg-
endre polynomial functions. J. Dyn. Syst. Meas. Control 105, 222–226 (1983)
Chang, R.Y., Wang, M.L.: Shifted Legendre function approximation of differential equations; appli-
cation to crystallization processes. Comput. Chem. Eng. 8, 117–125 (1984)
Dahmen, S., Morched, B.A., Mohamed Hédi, B.G.: Investigation of the coupled Lamb waves prop-
agation in viscoelastic and anisotropic multilayer composites by Legendre polynomial method.
Compos. Struct. 153, 557–568 (2016)
Dash, R., Dash, P. K.: MDHS-LPNN: a hybrid FOREX predictor model using a Legendre polynomial
neural network with a modified differential harmony search technique. Handbook of Neural
Computation, pp. 459–486. Academic Press (2017)
Dash, R.: Performance analysis of an evolutionary recurrent Legendre Polynomial Neural Network
in application to FOREX prediction. J. King Saud Univ.—Comput. Inf. Sci. 32, 1000–1011 (2020)
Doman, B.G.S.: The Classical Orthogonal Polynomials. World Scientific, Singapore (2015)
Ezz-Eldien, S.S., Doha, E.H., Baleanu, D., Bhrawy, A.H.: A numerical approach based on Legendre
orthonormal polynomials for numerical solutions of fractional optimal control problems. J VIB
Control 23, 16–30 (2017)
Fleuret, F., Sahbi, H.: Scale-invariance of support vector machines based on the triangular kernel.
In: 3rd International Workshop on Statistical and Computational Theories of Vision, pp. 1–13
(2003)
Gao, J., Lyu, Y., Zheng, M., Liu, M., Liu, H., Wu, B., He, C.: Application of Legendre orthogonal
polynomial method in calculating reflection and transmission coefficients of multilayer plates.
Wave Motion 84, 32–45 (2019)
Gao, J., Lyu, Y., Zheng, M., Liu, M., Liu, H., Wu, B., He, C.: Application of state vector formalism
and Legendre polynomial hybrid method in the longitudinal guided wave propagation analysis
of composite multi-layered pipes. Wave Motion 100, 102670 (2021)
Hadian Rasanan, A.H., Rahmati, D., Gorgin, S., Parand, K.: A single layer fractional orthogonal
neural network for solving various types of Lane-Emden equation. New Astron. 75, 101307
(2020)

Hadian Rasanan, A.H., Bajalan, N., Parand, K., Rad, J.A.: Simulation of nonlinear fractional dynam-
ics arising in the modeling of cognitive decision making using a new fractional neural network.
Math. Methods Appl. Sci. 43, 1437–1466 (2020)
Haitjema, H.: Surface profile and topography filtering by Legendre polynomials. Surf. Topogr. 9,
15–17 (2021)
Hwang, C., Chen, M.-Y.: Analysis and optimal control of time-varying linear systems via shifted
Legendre polynomials. Int. J. Control 41, 1317–1330 (1985)
Kaghashvili, E.K., Zank, G.P., Lu, J.Y., Dröge, W. : Transport of energetic charged particles. Part
2. Small-angle scattering. J. Plasma Phys. 70, 505–532 (2004)
Kazem, S., Shaban, M., Rad, J.A.: Solution of the coupled Burgers equation based on operational
matrices of d-dimensional orthogonal functions. Zeitschrift für Naturforschung A 67, 267–274
(2012)
Kazem, S., Abbasbandy, S., Kumar, S.: Fractional-order Legendre functions for solving fractional-
order differential equations. Appl. Math. Model. 37, 5498–5510 (2013)
Lamb, G.L., Jr.: Introductory Applications of Partial Differential Equations: with Emphasis on Wave
Propagation and Diffusion. Wiley, Amsterdam (2011)
Holdeman, J.T., Jr.: Legendre polynomial expansions of hypergeometric functions with
applications. J. Math. Phys. 11, 114–117 (1970)
Mall, S., Chakraverty, S.: Application of Legendre neural network for solving ordinary differential
equations. Appl. Soft Comput. 43, 347–356 (2016)
Marianela, P., Gómez, J.C.: Legendre polynomials based feature extraction for online signature
verification. Consistency analysis of feature combinations. Pattern Recognit. 47, 128–140 (2014)
Moayeri, M.M., Rad, J.A., Parand, K.: Dynamical behavior of reaction-diffusion neural networks
and their synchronization arising in modeling epileptic seizure: A numerical simulation study.
Comput. Math. with Appl. 80, 1887–1927 (2020)
Mohammadi, F., Hosseini, M.M.: A new Legendre wavelet operational matrix of derivative and its
applications in solving the singular ordinary differential equations. J. Franklin Inst. 348, 1787–
1796 (2011)
Parand, K., Delafkar, Z., Rad, J.A., Kazem, S.: Numerical study on wall temperature and surface
heat flux natural convection equations arising in porous media by rational Legendre pseudo-
spectral approach. Int. J. Nonlinear Sci. 9, 1–12 (2010)
Olver, F.W.J., Lozier, D.W., Boisvert, R.F., Clark, C.W.: NIST Handbook of Mathematical Functions
Hardback and CD-ROM. Cambridge University Press, Singapore (2010)
Ozer, S., Chen, C.H., Cirpan, H.A.: A set of new Chebyshev kernel functions for support
vector machine pattern classification. Pattern Recognit. 44, 1435–1447 (2011)
Pan, Z.B., Chen, H., You, X.H.: Support vector machine with orthogonal Legendre kernel. In:
International Conference on Wavelet Analysis and Pattern Recognition, pp. 125–130. IEEE (2012)
Parand, K., Razzaghi, M.: Rational Legendre approximation for solving some physical problems
on semi-infinite intervals. Phys. Scr. 69, 353 (2004)
Parand, K., Shahini, M., Dehghan, M.: Rational Legendre pseudospectral approach for solving
nonlinear differential equations of Lane-Emden type. J. Comput. Phys. 228, 8830–8840 (2009)
Qian, C., Bi, T., Li, J., Liu, H.: Synchrophasor estimation algorithm using Legendre
polynomials. In: IEEE PES General Meeting Conference and Exposition (2014)
Rad, J.A., Kazem, S., Shaban, M., Parand, K., Yildirim, A.H.M.E.T.: Numerical solution of frac-
tional differential equations with a Tau method based on Legendre and Bernstein polynomials.
Math. Methods Appl. Sci. 37, 329–342 (2014)
Saadatmandi, A., Dehghan, M.: A new operational matrix for solving fractional-order differential
equations. Comput. Math. Appl. 59, 1326–1336 (2010)
Sánchez-Ruiz, J., Dehesa, J.S.: Expansions in series of orthogonal hypergeometric polynomials. J.
Comput. Appl. Math. 89, 155–170 (1998)
Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press,
Cambridge (2004)
4 Fractional Legendre Kernel Functions: Theory and Application 91

Shen, J.: Efficient spectral-Galerkin method I. Direct solvers of second-and fourth-order equations
using Legendre polynomials. SISC 15, 1489–1505 (1994)
Spencer, L.V.: Calculation of peaked angular distributions from Legendre polynomial expansions
and an application to the multiple scattering of charged particles. Phys. Rev. 90, 146–150 (1953)
Tian, M., Wang, W.: Some sets of orthogonal polynomial kernel functions. Appl. Soft Comput. 61,
742–756 (2017)
Voelker, A., Kajić, I., Eliasmith, C.: Legendre memory units: Continuous-time representation in
recurrent neural networks. Adv. Neural Inf. Process. Syst. 32 (2019)
Zeghdane, R.: Numerical approach for solving nonlinear stochastic Itô-Volterra integral equations
using shifted Legendre polynomials. Int. J. Dyn. Syst. Diff. Eqs. 11, 69–88 (2021)
Zheng, M., He, C., Lyu, Y., Wu, B.: Guided waves propagation in anisotropic hollow cylinders by
Legendre polynomial solution based on state-vector formalism. Compos. Struct. 207, 645–657
(2019)
Chapter 5
Fractional Gegenbauer Kernel Functions: Theory and Application

Sherwin Nedaei Janbesaraei, Amirreza Azmoon, and Dumitru Baleanu

Abstract Because many different functions can serve as a kernel, the support vector machine method has demonstrated remarkable versatility in tackling numerous machine learning problems. Gegenbauer polynomials, like the Chebyshev and Legendre polynomials introduced in previous chapters, are among the most commonly used orthogonal polynomials and have produced outstanding results in the support vector machine method. In this chapter, some essential properties of Gegenbauer and fractional Gegenbauer functions are presented and reviewed, followed by the kernels of these functions, which are introduced and validated. Finally, the performance of these kernels on two example datasets is evaluated.

Keywords Gegenbauer polynomial · Fractional Gegenbauer functions · Kernel trick · Orthogonal functions · Mercer's theorem

5.1 Introduction

Gegenbauer polynomials, better known as ultra-spherical polynomials, are another
family of classical orthogonal polynomials, named after the Austrian mathematician
Leopold Bernhard Gegenbauer (1849–1903). As the name itself suggests, ultra-
spherical polynomials provide a natural extension of spherical harmonics to higher

S. Nedaei Janbesaraei (B)


School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
e-mail: [email protected]
A. Azmoon
Department of Computer Science, The Institute for Advance Studies in Basic Sciences (IASBS),
Zanjan, Iran
e-mail: [email protected]
D. Baleanu
Department of Mathematics, Cankaya University, Ankara 06530, Turkey
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 93
J. A. Rad et al. (eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support
Vector Machines, Industrial and Applied Mathematics,
https://doi.org/10.1007/978-981-19-6553-1_5
dimensions (Avery 2012). Spherical harmonics are special functions defined on the
surface of a sphere. They are often being used in applied mathematics for solving dif-
ferential equations (Doha 1998; Elliott 1960). In recent years, the classical orthogonal
polynomials, especially the Gegenbauer orthogonal polynomials, have been used to
address pattern recognition (Liao et al. 2002; Liao and Chen 2013; Herrera-Acosta
et al. 2020), classification (Hjouji et al. 2021; Eassa et al. 2022), and kernel-based
learning in many fields such as physics (Ludlow and Everitt 1995), medical image
processing (Arfaoui et al. 2020; Stier et al. 2021; Öztürk et al. 2020), electronics
(Tamandani and Alijani 2022), and other basic fields (Soufivand et al. 2021; Park
2009; Feng and Varshney 2021). In 2006, Pawlak (2006) showed the benefits of the
image reconstruction process based on Gegenbauer polynomials and corresponding
moments. Then, in the years that followed, Hosny (2011) addressed the high computational
cost of Gegenbauer moments and proposed novel methods for image analysis and
recognition; these studies introduced fractional-order shifted Gegenbauer moments
(Hosny et al. 2020) and Gegenbauer moment invariants (Hosny 2014). Also,
Abd Elaziz et al. (2019) used an artificial bee colony based on orthogonal Gegenbauer
moments to classify galaxy images with the help
of support vector machines. In 2009, with the help of the orthogonality property of
the Gegenbauer polynomial, Langley and Zhao (2009) introduced a new 3D phase
unwrapping algorithm for the analysis of magnetic resonance imaging (MRI). Also,
in 1995, Ludlow studied the application of Gegenbauer polynomials to the emission
of light from spheres (Ludlow and Everitt 1995). Clifford polynomials are partic-
ularly well suited as kernel functions for a higher dimensional continuous wavelet
transform. In 2004, Brackx et al. (2004) constructed Clifford–Gegenbauer and gener-
alized Clifford–Gegenbauer polynomials as new specific wavelet kernel functions for
a higher dimensional continuous wavelet transform. Then, in 2011, Ilić and Pavlović (2011),
using the Christoffel–Darboux formula for Gegenbauer orthogonal polynomials,
presented a filter function solution that exhibits optimum amplitude as well
as optimum group delay characteristics.
Flexibility in using different kernels in the SVM algorithm is one of the rea-
sons why orthogonal classical polynomials have recently been used as kernels. The
Gegenbauer polynomial is one of those that have been able to show acceptable results
in this area. In 2018, Padierna et al. (2018) introduced a formulation of orthogonal
polynomial kernel functions for SVM classification. Also in 2020, the same group (Padierna et al.
2020) used this kernel to classify peripheral arterial disease in patients with type 2
diabetes. In addition to using orthogonal polynomials as kernels in SVM to solve
classification problems, these polynomials can be used as kernels in support vector
regression (SVR) to solve regression problems as well as help examine time series
problems. In 1992, Azari et al. (1992) proposed a corresponding ultra-spherical ker-
nel. Also, in 2019, Feng suggested using extended support vector regression (X-SVR),
a new machine-learning-based metamodel, for the reliability study of dynamic sys-
tems using first-passage theory. Furthermore, the power of X-SVR is strengthened
by a novel kernel function built from the vectorized Gegenbauer polynomial, specif-
ically for handling complicated engineering problems (Feng et al. 2019). On the
other hand, in 2001, Ferrara and Guégan (2001) dealt with the k-factor extension
of the long memory Gegenbauer process, in which this model investigated the pre-
dictive ability of the k-factor Gegenbauer model on real data of urban transport
traffic in the Paris area. Other applications of orthogonal Gegenbauer polynomi-
als include their effectiveness in building and developing neural networks. In 2018,
Zhang et al. (2018) investigated and constructed a two-input Gegenbauer orthogonal
neural network (TIGONN) using probability theory, polynomial interpolation, and
approximation theory to avoid the inherent problems of the back-propagation (BP)
training algorithm. Then, in 2019, a novel type of neural network based on Gegen-
bauer orthogonal polynomials, termed as GNN, was constructed and investigated
(He et al. 2019). This model could overcome the computational robustness problems
of extreme learning machines (ELM), while still having comparable structural sim-
plicity and approximation capability (He et al. 2019). In addition to the applications
mentioned here, Table 5.1 can be seen for some applications for different kinds of
these polynomials.

Leopold Bernhard Gegenbauer (1849–1903) was an Austrian mathematician.
He entered the University of Vienna to study history and linguistics,
but he graduated as a mathematics and physics teacher. His professional
career in mathematics began with a grant awarded to him, which
made it possible for him to undertake research at the University of Berlin
for 2 years, where he could attend lectures of great mathematicians of that
time such as Karl Weierstrass, Eduard Kummer, Hermann Helmholtz, and
Leopold Kronecker. Leopold Gegenbauer had many interests in mathematics,
such as number theory and function theory. He introduced a class of orthogonal
polynomials, now known as the Gegenbauer orthogonal polynomials,
in his doctoral thesis of 1875. The name Gegenbauer appears in
many mathematical concepts, such as Gegenbauer transforms, Gegenbauer's
integral inequalities, Gegenbauer approximation, Fourier–Gegenbauer sums,
and the Gegenbauer oscillator. Another important contribution
of Gegenbauer was designing a course on "Insurance Theory" at the
University of Vienna, where he was appointed as a full professor of mathematics
from 1893 until his death in 1903.a

a For more information about L.B. Gegenbauer and his contributions, please see: https://mathshistory.st-andrews.ac.uk/Biographies/Gegenbauer/.

In this chapter, a new type of Gegenbauer orthogonal polynomials is introduced,
called fractional Gegenbauer functions, and the corresponding kernels are constructed,
which are excellently applicable in kernel-based learning methods such as support
vector machines. In Sect. 5.2, the basic definitions and properties of Gegenbauer
polynomials and their fractional form are introduced. Then, in Sect. 5.3, in addition
to explaining the previously proposed Gegenbauer kernel functions, the fractional
Gegenbauer kernel function is proposed. Also, in Sect. 5.4, the results of SVM classifica-
tion with the Gegenbauer and fractional Gegenbauer kernels on well-known datasets are

Table 5.1 Some applications for different kinds of Gegenbauer polynomials

Gegenbauer polynomials: Srivastava et al. (2019) suggested a potentially helpful novel approach for solving the Bagley–Torvik problem, based on the Gegenbauer wavelet expansion and operational matrices of fractional integrals and block-pulse functions.

Rational Gegenbauer functions: Based on rational Gegenbauer functions, Parand has numerically solved a third-order nonlinear differential equation (Parand et al. 2013) and a sequence of linear ordinary differential equations (ODEs) obtained with the quasi-linearization method (QLM) (Parand et al. 2018).

Generalized Gegenbauer functions: Belmehdi (2001) created a differential-difference relation, and this sequence solves the second-order and fourth-order differential equations fulfilled by the association (of arbitrary order) of the generalized Gegenbauer polynomials. Also, Cohl (2013) and Liu and Wang (2020) have demonstrated applications of these functions in their studies. Then, in 2020, Yang et al. (2020) proposed a novel looseness state recognition approach for bolted structures based on multi-domain sensitive features derived from the quasi-analytic wavelet packet transform (QAWPT) and a generalized Gegenbauer support vector machine (GGSVM).

Fractional order of rational Gegenbauer functions: To solve the Thomas–Fermi problem, Hadian-Rasanan et al. (2019) proposed two numerical methods based on the Newton iteration method and spectral algorithms. The spectral technique was used in both approaches and was based on the fractional order of rational Gegenbauer functions (Hadian-Rasanan et al. 2019).

compared with similar results from the previous chapters. Finally, in Sect. 5.5, the
concluding remarks summarize the whole chapter.

5.2 Preliminaries

In this section, the basics of Gegenbauer polynomials are covered. The polynomials
are defined through the related differential equation; in what follows, the properties
of Gegenbauer polynomials are introduced, and their fractional form is defined
together with its properties.

5.2.1 Properties of Gegenbauer Polynomials

Gegenbauer polynomials of degree n, $G_n^\lambda(x)$, of order $\lambda > -\frac{1}{2}$ are solutions to the
following Sturm–Liouville differential equation (Doman 2015; Padierna et al. 2018;
Asghari et al. 2022):

$$(1 - x^2)\frac{d^2 y}{dx^2} - (2\lambda + 1)x\frac{dy}{dx} + n(n + 2\lambda)y = 0, \qquad (5.1)$$

where n is a positive integer and λ is a real number greater than −0.5. Gegenbauer
polynomials are orthogonal on the interval [−1, 1] with respect to the weight function
(Padierna et al. 2018; Parand et al. 2018; Asghari et al. 2022):

$$w(x) = (1 - x^2)^{\lambda - \frac{1}{2}}. \qquad (5.2)$$

Therefore, the orthogonality relation is defined as (Ludlow and Everitt 1995; Parand
et al. 2018; Hadian-Rasanan et al. 2019; Asghari et al. 2022)

$$\int_{-1}^{1} G_n^\lambda(x)\, G_m^\lambda(x)\, w(x)\, dx = \frac{\pi\, 2^{1-2\lambda}\, \Gamma(n + 2\lambda)}{n!\, (n + \lambda)\, (\Gamma(\lambda))^2}\, \delta_{nm}, \qquad (5.3)$$

where $\delta_{nm}$ is the Kronecker delta function (El-Kalaawy et al. 2018). The standard
Gegenbauer polynomial $G_n^{\lambda}(x)$ can also be defined as follows (Parand et al. 2018;
Hadian-Rasanan et al. 2019):

$$G_n^\lambda(x) = \sum_{j=0}^{\lfloor n/2 \rfloor} (-1)^j\, \frac{\Gamma(n + \lambda - j)}{j!\, (n - 2j)!\, \Gamma(\lambda)}\, (2x)^{n-2j}, \qquad (5.4)$$

where $\Gamma(\cdot)$ is the Gamma function.
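As a quick numerical sanity check of Eqs. 5.3 and 5.4, one can integrate against the weight function by quadrature. The sketch below is an illustration (not from the text), assuming SciPy is available; scipy.special.gegenbauer uses the same standard normalization as $G_n^\lambda$:

```python
import math
from scipy.integrate import quad
from scipy.special import gegenbauer

lam, n, m = 1.5, 3, 5
Gn, Gm = gegenbauer(n, lam), gegenbauer(m, lam)
w = lambda t: (1 - t**2)**(lam - 0.5)          # weight function of Eq. 5.2

# off-diagonal inner product (n != m) should vanish by orthogonality
inner = quad(lambda t: Gn(t)*Gm(t)*w(t), -1, 1)[0]

# squared norm should match the constant on the right-hand side of Eq. 5.3
norm = quad(lambda t: Gn(t)**2*w(t), -1, 1)[0]
expected = (math.pi*2**(1 - 2*lam)*math.gamma(n + 2*lam)
            / (math.factorial(n)*(n + lam)*math.gamma(lam)**2))
```

For λ = 1.5 and n = 3 the closed-form constant reduces to 120/27 ≈ 4.444, and the quadrature value agrees to machine precision.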


Assuming $\lambda \neq 0$ and $\lambda \in \mathbb{R}$, the generating function for the Gegenbauer polynomials
is given as (Cohl 2013; Doman 2015; Asghari et al. 2022)

$$G^\lambda(x, z) = \frac{1}{(1 - 2xz + z^2)^\lambda}. \qquad (5.5)$$

It can be shown that this holds for $|z| < 1$, $|x| \leq 1$, $\lambda > -\frac{1}{2}$ (Cohl 2013; Doman 2015).
Considering that for fixed x the function is holomorphic in $|z| < 1$, it can be expanded
in a Taylor series (Reimer 2013):

$$G^\lambda(x, z) = \sum_{n=0}^{\infty} G_n^\lambda(x)\, z^n. \qquad (5.6)$$

Gegenbauer polynomials can be obtained by the following recursive formula
(Olver et al. 2010; Yang et al. 2020; Hadian-Rasanan et al. 2019):

$$G_0^\lambda(x) = 1, \quad G_1^\lambda(x) = 2\lambda x,$$
$$G_n^\lambda(x) = \frac{1}{n}\left[2x(n + \lambda - 1)G_{n-1}^\lambda(x) - (n + 2\lambda - 2)G_{n-2}^\lambda(x)\right]. \qquad (5.7)$$
Also, other recursive relations for these orthogonal polynomials are as follows
(Doman 2015):

$$(n + 2)G_{n+2}^\lambda(x) = 2(\lambda + n + 1)x\, G_{n+1}^\lambda(x) - (2\lambda + n)G_n^\lambda(x), \qquad (5.8)$$
$$n\, G_n^\lambda(x) = 2\lambda\left[x\, G_{n-1}^{\lambda+1}(x) - G_{n-2}^{\lambda+1}(x)\right], \qquad (5.9)$$
$$(n + 2\lambda)G_n^\lambda(x) = 2\lambda\left[G_n^{\lambda+1}(x) - x\, G_{n-1}^{\lambda+1}(x)\right], \qquad (5.10)$$
$$n\, G_n^\lambda(x) = (n - 1 + 2\lambda)x\, G_{n-1}^\lambda(x) - 2\lambda(1 - x^2)G_{n-2}^{\lambda+1}(x), \qquad (5.11)$$
$$\frac{d}{dx}G_n^\lambda(x) = 2\lambda\, G_{n-1}^{\lambda+1}(x). \qquad (5.12)$$
Everyone who has tried to generate Gegenbauer polynomials has experienced
some difficulty, as the number of terms rises at each higher order; see, for example,
the orders zero to four:

$$G_0^\lambda(x) = 1,$$
$$G_1^\lambda(x) = 2\lambda x,$$
$$G_2^\lambda(x) = (2\lambda^2 + 2\lambda)x^2 - \lambda,$$
$$G_3^\lambda(x) = \left(\tfrac{4}{3}\lambda^3 + 4\lambda^2 + \tfrac{8}{3}\lambda\right)x^3 + (-2\lambda^2 - 2\lambda)x,$$
$$G_4^\lambda(x) = \left(\tfrac{2}{3}\lambda^4 + 4\lambda^3 + \tfrac{22}{3}\lambda^2 + 4\lambda\right)x^4 + (-2\lambda^3 - 6\lambda^2 - 4\lambda)x^2 + \frac{\lambda^2 + \lambda}{2},$$
$$\qquad (5.13)$$

which are depicted in Figs. 5.1 and 5.2; there will be more terms at higher orders.
To overcome this difficulty, Pochhammer polynomials have been used for
simplification (Doman 2015). The Pochhammer polynomial, or rising
factorial, is defined as


$$x^{(n)} = x(x+1)(x+2)\cdots(x+n-1) = \prod_{k=1}^{n}(x+k-1). \qquad (5.14)$$
5 Fractional Gegenbauer Kernel Functions: Theory and Application 99

Consequently, some Pochhammer polynomials are

$$x^{(0)} = 1,$$
$$x^{(1)} = x,$$
$$x^{(2)} = x(x+1) = x^2 + x,$$
$$x^{(3)} = x(x+1)(x+2) = x^3 + 3x^2 + 2x,$$
$$x^{(4)} = x(x+1)(x+2)(x+3) = x^4 + 6x^3 + 11x^2 + 6x,$$
$$x^{(5)} = x^5 + 10x^4 + 35x^3 + 50x^2 + 24x,$$
$$x^{(6)} = x^6 + 15x^5 + 85x^4 + 225x^3 + 274x^2 + 120x. \qquad (5.15)$$

By means of Pochhammer polynomials, the first few orders of Gegenbauer polynomials
of order λ are

$$G_0^\lambda(x) = 1,$$
$$G_1^\lambda(x) = 2\lambda x,$$
$$G_2^\lambda(x) = 2a^{(2)}x^2 - \lambda,$$
$$G_3^\lambda(x) = \frac{4a^{(3)}x^3}{3} - 2a^{(2)}x,$$
$$G_4^\lambda(x) = \frac{2a^{(4)}x^4}{3} - 2a^{(3)}x^2 + \frac{a^{(2)}}{2},$$
$$G_5^\lambda(x) = \frac{4a^{(5)}x^5}{15} - \frac{4a^{(4)}x^3}{3} + a^{(3)}x,$$
$$G_6^\lambda(x) = \frac{4a^{(6)}x^6}{45} - \frac{2a^{(5)}x^4}{3} + a^{(4)}x^2 - \frac{a^{(3)}}{6}, \qquad (5.16)$$

where $a^{(n)} = \lambda^{(n)}$ are Pochhammer polynomials evaluated at λ.
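The Pochhammer-form coefficients above can be checked symbolically against sympy's built-in Gegenbauer polynomials. The snippet below is an illustrative check (not part of the authors' code), using sympy.rf for the rising factorial $a^{(k)} = (\lambda)_k$:

```python
import sympy

lam, x = sympy.symbols('lambda x')
a = lambda k: sympy.rf(lam, k)      # Pochhammer (rising factorial) (lambda)_k

# G_6 written in the Pochhammer form of Eq. 5.16
G6 = (sympy.Rational(4, 45)*a(6)*x**6 - sympy.Rational(2, 3)*a(5)*x**4
      + a(4)*x**2 - sympy.Rational(1, 6)*a(3))

# the difference with sympy's gegenbauer(6, lambda, x) should expand to zero
diff = sympy.expand(G6 - sympy.gegenbauer(6, lam, x))
```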


Here, for the sake of convenience, a Python code is introduced that can
generate any order of the Gegenbauer polynomials of order λ symbolically, using the
sympy library:

Program Code

import sympy

x = sympy.Symbol("x")
lambd = sympy.Symbol(r'\lambda')

def Gn(x, n):
    if n == 0:
        return 1
    elif n == 1:
        return 2*lambd*x
    elif n >= 2:
        return sympy.Rational(1, n)*(2*x*(n + lambd - 1)*Gn(x, n - 1)
                                     - (n + 2*lambd - 2)*Gn(x, n - 2))

Program Code

sympy.expand(sympy.simplify(Gn(x, 2)))
> $2\lambda^2 x^2 + 2\lambda x^2 - \lambda$

For $G_n^\lambda(x)$, $n = 0, 1, \ldots$, $\lambda > -\frac{1}{2}$, which is orthogonal with respect to the weight
function $(1 - x^2)^{\lambda - \frac{1}{2}}$, there exist exactly n zeros in [−1, 1], which are denoted by
$x_{nk}(\lambda)$, $k = 1, \ldots, n$, and are enumerated in decreasing order $1 > x_{n1}(\lambda) > x_{n2}(\lambda) >
\cdots > x_{nn}(\lambda) > -1$ (Reimer 2013). The zeros of $G_n^\lambda(x)$ and $G_m^\lambda(x)$, $m > n$, separate
each other, and between any two zeros of $G_n^\lambda(x)$ there is at least one zero of $G_m^\lambda(x)$
(Olver et al. 2010).
Gegenbauer polynomials follow the same symmetry as other classical orthogonal
polynomials. Therefore, Gegenbauer polynomials of even order have even symmetry
and contain only even powers of x; similarly, Gegenbauer polynomials of odd order
have odd symmetry and contain only odd powers of x (Olver et al. 2010):

$$G_n^\lambda(x) = (-1)^n G_n^\lambda(-x) = \begin{cases} G_n^\lambda(-x), & n \text{ even}, \\ -G_n^\lambda(-x), & n \text{ odd}. \end{cases} \qquad (5.17)$$

Also, some special values of $G_n^\lambda(x)$ can be written as follows:

$$G_n^\lambda(1) = \frac{(2\lambda)_n}{n!}, \qquad (5.18)$$
$$G_{2n}^\lambda(0) = \frac{(-1)^n (\lambda)_n}{n!}, \qquad (5.19)$$
$$\frac{d}{dx}G_{2n+1}^\lambda(0) = \frac{2(-1)^n (\lambda)_{n+1}}{n!}. \qquad (5.20)$$
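These special values can be confirmed numerically. The sketch below is an illustration (assuming SciPy is available); poch is SciPy's Pochhammer symbol, and .deriv() differentiates the polynomial for the derivative value of Eq. 5.20:

```python
import math
from scipy.special import gegenbauer, poch

lam = 0.75
# Eq. 5.18: G_n(1) = (2*lam)_n / n!
at_one = [(gegenbauer(n, lam)(1.0),
           poch(2*lam, n)/math.factorial(n)) for n in range(6)]
# Eq. 5.19: G_{2n}(0) = (-1)^n (lam)_n / n!
at_zero = [(gegenbauer(2*n, lam)(0.0),
            (-1)**n*poch(lam, n)/math.factorial(n)) for n in range(3)]
# Eq. 5.20: d/dx G_{2n+1}(0) = 2(-1)^n (lam)_{n+1} / n!
deriv_zero = [(gegenbauer(2*n + 1, lam).deriv()(0.0),
               2*(-1)**n*poch(lam, n + 1)/math.factorial(n)) for n in range(3)]
```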

5.2.2 Properties of Fractional Gegenbauer Polynomials

The Gegenbauer polynomial of fractional order over a finite interval [a, b], obtained
by the mapping $x' = 2\left(\frac{x-a}{b-a}\right)^\alpha - 1$, where $\alpha > 0$ and $x' \in [-1, 1]$, is defined as (Parand
and Delkhosh 2016) follows:

$$FG_n^{\alpha,\lambda}(x) = G_n^\lambda(x') = G_n^\lambda\left(2\left(\frac{x-a}{b-a}\right)^\alpha - 1\right), \qquad (5.21)$$
where $\alpha \in \mathbb{R}^+$ is the "fractional order of the function", which is determined with respect
to the context, and $\lambda > -\frac{1}{2}$ is the Gegenbauer polynomial order.
Regarding the recursive relation already introduced for the Gegenbauer polynomials,
Eq. 5.7, one can define the recursive relation for the fractional Gegenbauer functions
simply by inserting the mapping $x' = 2\left(\frac{x-a}{b-a}\right)^\alpha - 1$ into Eq. 5.7:

$$FG_n^{\alpha,\lambda}(x) = \frac{1}{n}\left[2x'(n + \lambda - 1)FG_{n-1}^{\alpha,\lambda}(x) - (n + 2\lambda - 2)FG_{n-2}^{\alpha,\lambda}(x)\right],$$
$$FG_0^{\alpha,\lambda}(x) = 1, \quad FG_1^{\alpha,\lambda}(x) = 2\lambda\left(2\left(\frac{x-a}{b-a}\right)^\alpha - 1\right). \qquad (5.22)$$

Consequently, the first few orders of the Gegenbauer polynomials of fractional order are

$$FG_0^{\alpha,\lambda}(x) = 1,$$
$$FG_1^{\alpha,\lambda}(x) = 4\lambda\left(\frac{x-a}{b-a}\right)^\alpha - 2\lambda,$$
$$FG_2^{\alpha,\lambda}(x) = 8\lambda^2\left(\frac{x-a}{b-a}\right)^{2\alpha} - 8\lambda^2\left(\frac{x-a}{b-a}\right)^{\alpha} + 2\lambda^2 + 8\lambda\left(\frac{x-a}{b-a}\right)^{2\alpha} - 8\lambda\left(\frac{x-a}{b-a}\right)^{\alpha} + \lambda.$$

For higher orders there are too many terms to write out conveniently, so one can use
the Python code below to generate any order of the Gegenbauer polynomials of
fractional order:

Program Code

import sympy

x = sympy.Symbol("x")
a = sympy.Symbol("a")
b = sympy.Symbol("b")
lambd = sympy.Symbol(r'\lambda')
alpha = sympy.Symbol(r'\alpha')
# apply the mapping x' = 2*((x - a)/(b - a))**alpha - 1
x = sympy.sympify(2*((x - a)/(b - a))**alpha - 1)

def FGn(x, n):
    if n == 0:
        return 1
    elif n == 1:
        return 2*lambd*x
    elif n >= 2:
        return sympy.Rational(1, n)*(2*x*(n + lambd - 1)*FGn(x, n - 1)
                                     - (n + 2*lambd - 2)*FGn(x, n - 2))

For example, the degree-two polynomial can be generated as follows:


Program Code

sympy.expand(sympy.simplify(FGn(x, 2)))
> $8\lambda^2\left(\frac{x-a}{b-a}\right)^{2\alpha} - 8\lambda^2\left(\frac{x-a}{b-a}\right)^{\alpha} + 2\lambda^2 + 8\lambda\left(\frac{x-a}{b-a}\right)^{2\alpha} - 8\lambda\left(\frac{x-a}{b-a}\right)^{\alpha} + \lambda$

To define a generating function for the fractional Gegenbauer polynomials, one can
follow the same process as for the recursive relation of Eq. 5.22: replace x in Eq. 5.5
with the mapped variable $x' = 2\left(\frac{x-a}{b-a}\right)^\alpha - 1$ and rewrite the equation for the
fractional Gegenbauer polynomials $FG_n^{\alpha,\lambda}(x)$:

$$FG^{\alpha,\lambda}(x, z) = G^\lambda(x', z) = \frac{1}{\left(1 - 2\left(2\left(\frac{x-a}{b-a}\right)^\alpha - 1\right)z + z^2\right)^\lambda}. \qquad (5.23)$$

Similarly, the weight function with respect to which the fractional Gegenbauer functions are
orthogonal is simple to define. Considering the weight function in Eq. 5.2 and the
transition $x' = 2\left(\frac{x-a}{b-a}\right)^\alpha - 1$, we have

$$w^{\alpha,\lambda}(x') = 2\left(\frac{x-a}{b-a}\right)^{\alpha-1}\left(4\left(\frac{x-a}{b-a}\right)^{\alpha} - 4\left(\frac{x-a}{b-a}\right)^{2\alpha}\right)^{\lambda-\frac{1}{2}}. \qquad (5.24)$$

The fractional Gegenbauer polynomials are orthogonal over a finite interval with
respect to the weight function in Eq. 5.24, and therefore one can define the orthogonality
relation as (Dehestani et al. 2020) follows:

$$\int_{-1}^{1} G_n^\lambda(x')\, G_m^\lambda(x')\, w(x')\, dx' = \int_{0}^{1} FG_n^{\alpha,\lambda}(x)\, FG_m^{\alpha,\lambda}(x)\, w^{\alpha,\lambda}(x)\, dx = \frac{2^{1-4\lambda}\, \pi\, \Gamma(2\lambda + m)}{(\lambda + m)\, m!\, \Gamma^2(\lambda)}\, \delta_{mn}, \qquad (5.25)$$

where $\delta_{nm}$ is the Kronecker delta function.
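The orthogonality in Eq. 5.25 can be spot-checked numerically on [0, 1] (i.e., a = 0, b = 1). The sketch below is an illustration assuming SciPy; only the vanishing of the off-diagonal inner product is asserted, since the normalization constant depends on the convention adopted for the weight. It composes $FG_n^{\alpha,\lambda}(x) = G_n^\lambda(2x^\alpha - 1)$ with the weight of Eq. 5.24:

```python
from scipy.integrate import quad
from scipy.special import gegenbauer

lam, alpha = 1.0, 2.0
FG = lambda n, t: gegenbauer(n, lam)(2*t**alpha - 1)    # Eq. 5.21 with a=0, b=1
# weight of Eq. 5.24 on [0, 1]
w = lambda t: 2*t**(alpha - 1)*(4*t**alpha - 4*t**(2*alpha))**(lam - 0.5)

off_diag = quad(lambda t: FG(2, t)*FG(4, t)*w(t), 0, 1)[0]   # should vanish
diag = quad(lambda t: FG(3, t)**2*w(t), 0, 1)[0]             # strictly positive
```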

5.3 Gegenbauer Kernel Functions

In this section, the ordinary Gegenbauer kernel function is covered, addressing the
annihilation and explosion problems and proving the validity of such a kernel.
Furthermore, the generalized Gegenbauer kernel function and the fractional form of
the Gegenbauer kernel function are covered, and the validity of the latter is proved
according to Mercer's theorem.
5.3.1 Ordinary Gegenbauer Kernel Function

Gegenbauer polynomials have been used as a kernel function because they need fewer
support vectors while acquiring high accuracy in comparison with well-known kernel
functions such as RBF and many other classical orthogonal kernels (Padierna
et al. 2018). That is because Gegenbauer polynomials, like other orthogonal
polynomial kernels, produce kernel matrices with fewer significant eigenvalues, which
in turn means fewer support vectors are needed (Padierna et al. 2018).
As we know, a multidimensional SVM kernel function can be defined as follows:

$$K(X, Z) = \prod_{j=1}^{d} K_j(x_j, z_j). \qquad (5.26)$$

It can be seen that two undesired results may be produced. The first is the annihilation
effect: when either or both of $x_j$ and $z_j$ in $k(x_j, z_j)$ are close to zero, the kernel
outputs very small values. The second is the explosion effect, which refers to a very
large kernel output, $\left|\prod_{j=1}^{d} K(x_j, z_j)\right| \to \infty$, leading to numerical difficulties
(Padierna et al. 2018). To overcome the annihilation and explosion effects, Padierna
et al. (2018) proposed a new formulation for the SVM kernel:

$$k(X, Z) = \langle \phi(X), \phi(Z) \rangle = \prod_{j=1}^{d} \sum_{i=0}^{n} p_i(x_j)\, p_i(z_j)\, w(x_j, z_j)\, u(p_i)^2, \qquad (5.27)$$

where $w(x_j, z_j)$ is the weight function, the scaling function is $u(p_i) \equiv \beta\, |p_i(\cdot)|_{\max}^{-1} \in \mathbb{R}^+$,
i.e., the reciprocal of the maximum absolute value of an ith-degree polynomial
within its operational range, and β is a convenient positive scalar factor
(Padierna et al. 2018).
The Gegenbauer polynomials $G_n^\lambda$ can be classified into two groups according to
the value of λ:
1. −0.5 < λ ≤ 0.5
As is clear in Figs. 5.1 and 5.2, the amplitude satisfies $-1 \leq G_n^\lambda(x) \leq 1$, so a scaling
function is not required; also, the weight function for this group is equal to 1
(Padierna et al. 2018).
2. λ > 0.5
In this group of Gegenbauer polynomials, the amplitude may grow much bigger
than 1, so an explosion effect is probable (Padierna et al. 2018);
as mentioned before, a weight function and a scaling function come into play (see
Figs. 5.3, 5.4, 5.5). In this case, the weight function for the Gegenbauer polynomials
is

$$w_\lambda(x) = (1 - x^2)^{\lambda - \frac{1}{2}}. \qquad (5.28)$$
Fig. 5.1 Gegenbauer polynomials with λ = 0.25 in the positive range of λ ∈ (−0.5, 0.5]

Fig. 5.2 Gegenbauer polynomials with λ = −0.25 in the negative range of λ ∈ (−0.5, 0.5]

According to the formulation in Eq. 5.27, the scaling function is $u(p_i) \equiv \beta\, |p_i(\cdot)|_{\max}^{-1} \in \mathbb{R}^+$,
and for Gegenbauer polynomials the maximum amplitudes
are reached at x = ±1. Thereupon, by means of the Pochhammer operator
$(a)_n = a(a+1)(a+2)\cdots(a+n-1)$, where λ > 0.5 and i > 0, Padierna et al.
(2018) proposed the scaling function. By setting $\beta = \frac{1}{\sqrt{n+1}}$, we have

$$u(p_i) = u(G_i^\lambda) = \left(\sqrt{n+1}\, |G_i^\lambda(1)|\right)^{-1}. \qquad (5.29)$$
u( pi ) = u(G iλ ) = ( n + 1|G iλ (1)|)−1 . (5.29)

The weight function introduced in Eq. 5.28 is the univariate weight function,
i.e., the weight function for one variable. To use the Gegenbauer polynomial
kernel function, a bivariate weight function is needed, which one can define as
a product of the corresponding univariate weight functions (Dunkl and Yuan 2014):

$$w_\lambda(x, z) = \left((1 - x^2)(1 - z^2)\right)^{\lambda - \frac{1}{2}} + \varepsilon. \qquad (5.30)$$
Fig. 5.3 The original polynomials in the second group, where λ = 2

Fig. 5.4 The weighted Gegenbauer polynomials in the second group, where λ = 2

The ε term in Eq. 5.30 prevents the weight function from becoming zero at values on
the border in the second group of Gegenbauer polynomials (Figs. 5.4 and 5.5) by
adding a small value to the output (e.g., ε = 0.01) (Padierna et al. 2018).
Using the relations in Eqs. 5.27, 5.29, and 5.30, one can define the Gegenbauer kernel
function as

$$K_{Geg}(X, Z) = \prod_{j=1}^{d} \sum_{i=0}^{n} G_i^\lambda(x_j)\, G_i^\lambda(z_j)\, w_\lambda(x_j, z_j)\, u(G_i^\lambda)^2. \qquad (5.31)$$
Fig. 5.5 The weighted-and-scaled Gegenbauer polynomials in the second group, where λ = 2

5.3.2 Validation of Gegenbauer Kernel Function

Theorem 5.1 (Padierna et al. (2018)) The Gegenbauer kernel function introduced in
Eq. 5.31 is a valid Mercer kernel.

Proof A valid kernel function needs to be positive semi-definite or, equivalently,
must satisfy the necessary and sufficient conditions of Mercer's theorem introduced
in Sect. 2.2.1. Furthermore, the positive semi-definiteness property ensures that the
optimization problem of the SVM algorithm can be solved with convex optimization
programming (Wu and Zhou 2005). According to Mercer's theorem, a symmetric and
continuous function K(X, Z) is a kernel function if it satisfies

$$\iint K(x, z)\, f(x)\, f(z)\, dx\, dz \geq 0. \qquad (5.32)$$

Bearing in mind that $K_{Geg}(X, Z) = \prod_{j=1}^{d} K_{Geg}(x_j, z_j)$, denoting the multidimensionality
of the Gegenbauer kernel function, the Gegenbauer kernel for scalars
x and z is

$$K_{Geg}(x, z) = \sum_{i=0}^{n} G_i^\lambda(x)\, G_i^\lambda(z)\, w_\lambda(x, z)\, u(G_i^\lambda)^2. \qquad (5.33)$$

By substitution of Eq. 5.33 into Eq. 5.32, we conclude

$$\iint K_{Geg}(x, z)\, g(x)\, g(z)\, dx\, dz = \iint \sum_{i=0}^{n} G_i^\lambda(x)\, G_i^\lambda(z)\, w_\lambda(x, z)\, u(G_i^\lambda)^2\, g(x)\, g(z)\, dx\, dz. \qquad (5.34)$$

Also, by inserting the weight function from Eq. 5.30 into Eq. 5.34, we have

$$= \iint \sum_{i=0}^{n} G_i^\lambda(x)\, G_i^\lambda(z)\left[(1 - x^2)^{\lambda-\frac{1}{2}}(1 - z^2)^{\lambda-\frac{1}{2}} + \varepsilon\right] u(G_i^\lambda)^2\, g(x)\, g(z)\, dx\, dz. \qquad (5.35)$$

It should be noted that $u(G_i^\lambda)$ is always positive and independent of the data, so

$$= \sum_{i=0}^{n} u(G_i^\lambda)^2 \iint G_i^\lambda(x)\, G_i^\lambda(z)\, (1 - x^2)^{\lambda-\frac{1}{2}}(1 - z^2)^{\lambda-\frac{1}{2}}\, g(x)\, g(z)\, dx\, dz$$
$$\quad + \sum_{i=0}^{n} \varepsilon\, u(G_i^\lambda)^2 \iint G_i^\lambda(x)\, G_i^\lambda(z)\, g(x)\, g(z)\, dx\, dz \qquad (5.36)$$

$$= \sum_{i=0}^{n} u(G_i^\lambda)^2 \left(\int G_i^\lambda(x)\, (1 - x^2)^{\lambda-\frac{1}{2}}\, g(x)\, dx\right)^2 + \sum_{i=0}^{n} \varepsilon\, u(G_i^\lambda)^2 \left(\int G_i^\lambda(x)\, g(x)\, dx\right)^2 \geq 0.$$

Therefore, K(x, z) is a valid kernel. Considering that the product of two kernels is also
a kernel, $K_{Geg}(X, Z) = \prod_{j=1}^{d} K_{Geg}(x_j, z_j)$ for the vectors X and Z is also a
valid kernel. Consequently, one can deduce that the Gegenbauer kernel function in
Eq. 5.31 is a valid kernel too. □

5.3.3 Other Gegenbauer Kernel Functions

In this section, the generalized Gegenbauer kernel is introduced, which was
recently proposed by Yang et al. (2020).

5.3.3.1 Generalized Gegenbauer Kernel

The generalized Gegenbauer kernel (GGK) is constructed by using the partial sum of
the inner products of generalized Gegenbauer polynomials. The generalized Gegenbauer
polynomials obey the following recursive relation (Yang et al. 2020):

$$GG_0^\lambda(x) = 1,$$
$$GG_1^\lambda(x) = 2\lambda x,$$
$$GG_n^\lambda(x) = \frac{1}{n}\left[2x(n + \lambda - 1)GG_{n-1}^\lambda(x) - (n + 2\lambda - 2)GG_{n-2}^\lambda(x)\right], \qquad (5.37)$$

where $x \in \mathbb{R}^n$ denotes the vector of input variables. The output of the generalized
Gegenbauer polynomial $GG_n^\lambda(x)$ is a scalar for even orders n and a vector for odd
orders n.
Yang et al. (2020) then proposed the generalized Gegenbauer kernel function
$K_{GG}(x_i, x_j)$ of order n for two input vectors $x_i$ and $x_j$ as

$$K_{GG}(x_i, x_j) = \frac{\sum_{l=0}^{n} GG_l^\lambda(x_i)^T\, GG_l^\lambda(x_j)}{\exp(\sigma \|x_i - x_j\|_2^2)}, \qquad (5.38)$$

where each element of $x_i$ and $x_j$ is normalized to the range [−1, 1]. In this context,
both λ and σ are considered the kernel scales, or the so-called decaying parameters,
of the proposed kernel function.
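In the one-dimensional case the generalized polynomials $GG_l^\lambda$ reduce to the ordinary $G_l^\lambda$, and Eq. 5.38 can be sketched directly. The snippet below is an illustrative reduction (not the full vector-valued recursion of Eq. 5.37; function name and defaults are ours):

```python
import numpy as np
from scipy.special import gegenbauer

def ggk_1d(xi, xj, lam=0.5, n=3, sigma=1.0):
    """1-D sketch of Eq. 5.38: a Gegenbauer partial sum damped by a Gaussian factor."""
    num = sum(gegenbauer(l, lam)(xi)*gegenbauer(l, lam)(xj) for l in range(n + 1))
    return num/np.exp(sigma*(xi - xj)**2)
```

At $x_i = x_j$ the numerator is a sum of squares (its first term is $G_0^2 = 1$), so the kernel is strictly positive on the diagonal, and swapping the arguments leaves the value unchanged.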
Theorem 5.2 (Padierna et al. (2018)) The proposed GGK is a valid Mercer kernel.

Proof The proposed GGK can be alternatively formulated as the product of two
kernel functions such that

$$K_1(x_i, x_j) = \exp(-\sigma \|x_i - x_j\|_2^2),$$
$$K_2(x_i, x_j) = \sum_{l=0}^{n} GG_l^\lambda(x_i)^T\, GG_l^\lambda(x_j),$$
$$K_{GG}(x_i, x_j) = K_1(x_i, x_j)\, K_2(x_i, x_j). \qquad (5.39)$$

As already discussed in Sect. 2.2.1, the multiplication of two valid Mercer kernels
is also a valid kernel function. Since $K_1(x_i, x_j)$ is the Gaussian kernel (σ > 0),
which satisfies Mercer's theorem, $K_{GG}(x_i, x_j)$ can be proved to be a valid kernel by
verifying that $K_2(x_i, x_j)$ satisfies Mercer's theorem. Given an arbitrary square-integrable
function g(x) defined as $g: \mathbb{R}^n \to \mathbb{R}$ and assuming each element in $x_i$
and $x_j$ is independent of the others, we can conclude that

$$\begin{aligned}
\iint K_2(x_i, x_j)\, g(x_i)\, g(x_j)\, dx_i\, dx_j
&= \iint \sum_{l=0}^{n} GG_l^{\lambda}(x_i)^T GG_l^{\lambda}(x_j)\, g(x_i)\, g(x_j)\, dx_i\, dx_j \\
&= \sum_{l=0}^{n} \iint GG_l^{\lambda}(x_i)^T GG_l^{\lambda}(x_j)\, g(x_i)\, g(x_j)\, dx_i\, dx_j \\
&= \sum_{l=0}^{n} \left(\int GG_l^{\lambda}(x_i)\, g(x_i)\, dx_i\right)^T \left(\int GG_l^{\lambda}(x_j)\, g(x_j)\, dx_j\right) \\
&= \sum_{l=0}^{n} \left(\int GG_l^{\lambda}(x)\, g(x)\, dx\right)^T \left(\int GG_l^{\lambda}(x)\, g(x)\, dx\right) \geq 0.
\end{aligned} \qquad (5.40)$$

5 Fractional Gegenbauer Kernel Functions: Theory and Application 109

Thus, K 2 (xi , x j ) is a valid Mercer kernel, and it can be concluded that the proposed
GGK K GG (xi , x j ) is an admissible Mercer kernel function. 

5.3.4 Fractional Gegenbauer Kernel Function

Similar to the Gegenbauer kernel function introduced in Eq. 5.31 and the fractional weight function introduced in Eq. 5.28, a fractional Gegenbauer kernel can be introduced. First, the bivariate form of the fractional weight function has to be defined; the approach is again similar to the corresponding earlier definition in Eq. 5.30, i.e.,

$$fw^{\alpha,\lambda}(x, z) = \left[\left(1 - \left(\frac{x-a}{b-a}\right)^{2\alpha}\right)\left(1 - \left(\frac{z-a}{b-a}\right)^{2\alpha}\right)\right]^{\lambda - \frac{1}{2}} + \epsilon, \qquad (5.41)$$

where $\epsilon$ is a small positive slack term.

Therefore, the fractional Gegenbauer kernel function is

$$K_{FGeg}(X, Z) = \prod_{j=1}^{d} \sum_{i=0}^{n} G_i^{\lambda}(x_j)\, G_i^{\lambda}(z_j)\, fw^{\alpha,\lambda}(x_j, z_j)\, u(G_i^{\lambda})^2. \qquad (5.42)$$

Theorem 5.3 The fractional Gegenbauer kernel function introduced in Eq. 5.42 is a valid Mercer kernel.

Proof According to Mercer's theorem, a valid kernel must satisfy its sufficient conditions; in a precise way, any SVM kernel, to be a valid kernel, must satisfy the non-negativity condition

$$\iint K(x, z)\, w(x, z)\, f(x)\, f(z)\, dx\, dz \geq 0. \qquad (5.43)$$

Since the kernel here is the one given in Eq. 5.42, it can be seen by a simple substitution that

$$\iint K_{FGeg}(x, z)\, g(x)\, g(z)\, dx\, dz = \iint \sum_{i=0}^{n} G_i^{\lambda}(x)\, G_i^{\lambda}(z)\, fw^{\alpha,\lambda}(x, z)\, u(G_i^{\lambda})^2\, g(x)\, g(z)\, dx\, dz. \qquad (5.44)$$

In the last equation, the fractional bivariate weight function (i.e., Eq. 5.41) is considered, so we have

$$\iint \sum_{i=0}^{n} G_i^{\lambda}(x)\, G_i^{\lambda}(z) \left\{\left[\left(1-\left(\frac{x-a}{b-a}\right)^{2\alpha}\right)\left(1-\left(\frac{z-a}{b-a}\right)^{2\alpha}\right)\right]^{\lambda-\frac{1}{2}} + \epsilon\right\} u(G_i^{\lambda})^2\, g(x)\, g(z)\, dx\, dz. \qquad (5.45)$$
Note that $u(G_i^{\lambda})$ is always positive and independent of the data; therefore,

$$= \sum_{i=0}^{n} u(G_i^{\lambda})^2 \iint G_i^{\lambda}(x)\, G_i^{\lambda}(z) \left[\left(1-\left(\frac{x-a}{b-a}\right)^{2\alpha}\right)\left(1-\left(\frac{z-a}{b-a}\right)^{2\alpha}\right)\right]^{\lambda-\frac{1}{2}} g(x)\, g(z)\, dx\, dz$$
$$\quad + \epsilon \sum_{i=0}^{n} u(G_i^{\lambda})^2 \iint G_i^{\lambda}(x)\, G_i^{\lambda}(z)\, g(x)\, g(z)\, dx\, dz. \qquad (5.46)$$

Both terms are non-negative, since in each of them the weighted integrand factorizes into identical functions of $x$ and $z$, exactly as in the proof of Theorem 5.2; hence the whole expression is non-negative and $K_{FGeg}$ is a valid Mercer kernel. □

5.4 Application of Gegenbauer Kernel Functions on Real Datasets

In this section, the results of the Gegenbauer and fractional Gegenbauer kernels are compared on several datasets with other well-known kernels such as RBF, polynomial, and the Chebyshev and fractional Chebyshev kernels introduced in the previous chapters. Depending on the dataset, some preprocessing steps may be needed to obtain a clean classification. These steps are not the focus here, but they are mandatory when using Gegenbauer polynomials as the kernel. For this section, two datasets have been selected, which are well known and helpful for machine learning practitioners.
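Before running such experiments, the candidate kernel can be sanity-checked numerically. The sketch below uses a simplified Gegenbauer kernel (without the weight and normalization factors of Eq. 5.31, so it is only illustrative) to build a Gram matrix on random points in [−1, 1] and verify the symmetry and positive semi-definiteness required by Mercer's theorem; such a callable can also be passed as the `kernel` argument of scikit-learn's `SVC`.

```python
import numpy as np

def gegenbauer(n, lam, x):
    # Three-term Gegenbauer recurrence C_n^lambda(x), vectorized over x.
    c0, c1 = np.ones_like(x), 2.0 * lam * x
    if n == 0:
        return c0
    for k in range(2, n + 1):
        c0, c1 = c1, (2.0 * x * (k + lam - 1) * c1 - (k + 2 * lam - 2) * c0) / k
    return c1

def gegenbauer_gram(X, Z, order=4, lam=0.6):
    # Per feature j: S_j = sum_i C_i(x_j) C_i(z_j); the full Gram is the
    # elementwise product of the per-feature Grams (product of kernels).
    K = np.ones((len(X), len(Z)))
    for j in range(X.shape[1]):
        S = np.zeros((len(X), len(Z)))
        for i in range(order + 1):
            S += np.outer(gegenbauer(i, lam, X[:, j]), gegenbauer(i, lam, Z[:, j]))
        K *= S
    return K

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 2))
K = gegenbauer_gram(X, X)
print(np.allclose(K, K.T))                   # True: symmetric
print(np.linalg.eigvalsh(K).min() > -1e-8)   # True: numerically PSD
```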

5.4.1 Spiral Dataset

The spiral dataset, already introduced in Chap. 3, is one of the well-known multi-class classification tasks. Using the one-versus-all (OVA) method, this multi-class dataset is split into three binary classification datasets, and the SVM with the Gegenbauer kernel is applied to each. Figure 5.6 depicts the data points of the spiral dataset in normal and fractional space.

Fig. 5.6 Spiral dataset, 1000 data points

Fig. 5.7 3D Gegenbauer spiral dataset
Despite the density of the data in the fractional mode of the spiral dataset, when more features (three or more dimensions) are used, it can be seen that the Gegenbauer kernel separates the classes more clearly and simply (see Fig. 5.7).
Also, Fig. 5.8 shows how the classifiers of the Gegenbauer kernel with different orders (3, 4, 5, and 6) and λ = 0.6 have chosen the decision boundaries. Similarly, Fig. 5.6 depicts the data points of the spiral dataset after transforming into fractional space of order α = 0.5. Thereby, Fig. 5.9 depicts the decision boundaries of the corresponding fractional Gegenbauer kernel, where α = 0.5 and λ = 0.6.
On the other hand, the following tables provide a comparison of the experiments on the spiral dataset. The three possible binary classifications have been examined according to the one-versus-all method. As is clear from Table 5.2, the fractional Legendre kernel shows the best performance for the first binary task (class 1 versus the rest). For the other binary classifications on this spiral dataset, according to the results in Tables 5.3 and 5.4, the RBF kernel is the strongest in Table 5.3, while the fractional Legendre kernel again achieves the best accuracy in Table 5.4.

Fig. 5.8 Gegenbauer kernel with orders of 3, 4, 5, and 6 on Spiral dataset with λ = 0.6

Table 5.2 Comparison of RBF, polynomial, Chebyshev, fractional Chebyshev, Legendre, fractional Legendre, Gegenbauer, and fractional Gegenbauer kernel functions on the Spiral dataset

Kernel                  Sigma   Power   Order   Alpha(α)   Lambda(λ)   Accuracy
RBF                     0.73    –       –       –          –           0.97
Polynomial              –       8       –       –          –           0.9533
Chebyshev               –       –       5       –          –           0.9667
Fractional Chebyshev    –       –       3       0.3        –           0.9733
Legendre                –       –       7       –          –           0.9706
Fractional Legendre     –       –       7       0.4        –           0.9986
Gegenbauer              –       –       6       –          0.3         0.9456
Fractional Gegenbauer   –       –       6       0.3        0.7         0.9533

Fig. 5.9 Fractional Gegenbauer kernel with orders of 3, 4, 5, and 6 on Spiral dataset with α = 0.5
and λ = 0.6

Table 5.3 Comparison of RBF, polynomial, Chebyshev, fractional Chebyshev, Legendre, fractional Legendre, Gegenbauer, and fractional Gegenbauer kernels on the Spiral dataset

Kernel                  Sigma   Power   Order   Alpha(α)   Lambda(λ)   Accuracy
RBF                     0.1     –       –       –          –           0.9867
Polynomial              –       5       –       –          –           0.9044
Chebyshev               –       –       6       –          –           0.9289
Fractional Chebyshev    –       –       6       0.8        –           0.9344
Legendre                –       –       8       –          –           0.9773
Fractional Legendre     –       –       8       0.4        –           0.9853
Gegenbauer              –       –       5       –          0.3         0.9278
Fractional Gegenbauer   –       –       4       0.6        0.6         0.9356

Table 5.4 Comparison of RBF, polynomial, Chebyshev, fractional Chebyshev, Legendre, fractional Legendre, Gegenbauer, and fractional Gegenbauer kernels on the Spiral dataset

Kernel                  Sigma   Power   Order   Alpha(α)   Lambda(λ)   Accuracy
RBF                     0.73    –       –       –          –           0.9856
Polynomial              –       5       –       –          –           0.9856
Chebyshev               –       –       6       –          –           0.9622
Fractional Chebyshev    –       –       6       0.6        –           0.9578
Legendre                –       –       7       –          –           0.9066
Fractional Legendre     –       –       5       0.4        –           0.9906
Gegenbauer              –       –       6       –          0.3         0.9611
Fractional Gegenbauer   –       –       6       0.9        0.3         0.9644

Table 5.5 Comparison of RBF, polynomial, Chebyshev, fractional Chebyshev, Legendre, fractional Legendre, Gegenbauer, and fractional Gegenbauer kernels on the Monk's first problem. It is shown that the Gegenbauer kernel outperforms the others, and the fractional Gegenbauer kernel attains the most desirable accuracy, which is 1

Kernel                  Sigma   Power   Order   Alpha(α)   Lambda(λ)   Accuracy
RBF                     2.844   –       –       –          –           0.8819
Polynomial              –       3       –       –          –           0.8681
Chebyshev               –       –       3       –          –           0.8472
Fractional Chebyshev    –       –       3       1/16       –           0.8588
Legendre                –       –       4       –          –           0.8333
Fractional Legendre     –       –       4       0.1        –           0.8518
Gegenbauer              –       –       3       –          –0.2        0.9931
Fractional Gegenbauer   –       –       3       0.7        0.2         1

5.4.2 Three Monks’ Dataset

Another case in point is the three Monks' problem, which is addressed here (see Chap. 3 for more information about the dataset). We applied the Gegenbauer kernel introduced in Eq. 5.31 to the datasets of the three Monks' problems and report the results in Tables 5.5, 5.6, and 5.7. It can be seen that the fractional Gegenbauer kernel shows strong performance on these datasets, specifically on the first problem, where it reaches 100% accuracy; on the third dataset, both kinds of Gegenbauer kernels have the best accuracy among all kernels under comparison. Tables 5.5, 5.6, and 5.7 illustrate the details of these comparisons.

Table 5.6 Comparison of RBF, polynomial, Chebyshev, fractional Chebyshev, Legendre, fractional Legendre, Gegenbauer, and fractional Gegenbauer kernels on the Monk's second problem. It is shown that the fractional Chebyshev kernel has the second best result, following the fractional Legendre kernel

Kernel                  Sigma    Power   Order   Alpha(α)   Lambda(λ)   Accuracy
RBF                     5.5896   –       –       –          –           0.875
Polynomial              –        3       –       –          –           0.8657
Chebyshev               –        –       3       –          –           0.8426
Fractional Chebyshev    –        –       3       1/16       –           0.9653
Legendre                –        –       3       –          –           0.8032
Fractional Legendre     –        –       3       0.1        –           1
Gegenbauer              –        –       3       –          0.5         0.7824
Fractional Gegenbauer   –        –       3       0.1        0.5         0.9514

Table 5.7 Comparison of RBF, polynomial, Chebyshev, fractional Chebyshev, Legendre, fractional Legendre, Gegenbauer, and fractional Gegenbauer kernels on the Monk's third problem. Note that the Gegenbauer and also fractional Gegenbauer kernels have the best result

Kernel                  Sigma    Power   Order   Alpha(α)   Lambda(λ)   Accuracy
RBF                     2.1586   –       –       –          –           0.91
Polynomial              –        3       –       –          –           0.875
Chebyshev               –        –       6       –          –           0.895
Fractional Chebyshev    –        –       5       1/5        –           0.91
Legendre                –        –       4       –          –           0.8472
Fractional Legendre     –        –       3       0.8        –           0.8379
Gegenbauer              –        –       4       –          –0.2        0.9259
Fractional Gegenbauer   –        –       3       0.7        –0.2        0.9213

5.5 Conclusion

Gegenbauer (ultraspherical) orthogonal polynomials occupy an important place among orthogonal polynomials, since they have been used to address many differential equation problems, especially in spherical spaces, as their name suggests. This chapter gave a brief background on the research history, explained the basics and properties of these polynomials, and reviewed the construction of the related kernel function. Moreover, the novel fractional Gegenbauer orthogonal polynomials and the corresponding kernels were introduced, and the last section showed how to use this new kernel successfully in kernel-based learning algorithms such as SVM through experiments on well-known datasets.

References

Abd Elaziz, M., Hosny, K.M., Selim, I.M.: Galaxies image classification using artificial bee colony
based on orthogonal Gegenbauer moments. Soft. Comput. 23, 9573–9583 (2019)
Arfaoui, S., Ben Mabrouk, A., Cattani, C.: New type of Gegenbauer-Hermite monogenic polyno-
mials and associated Clifford wavelets. J. Math. Imaging Vis. 62, 73–97 (2020)
Asghari, M., Hadian Rasanan, A.H., Gorgin, S., Rahmati, D., Parand, K.: FPGA-orthopoly: a
hardware implementation of orthogonal polynomials. Eng. Comput. (2022). https://doi.org/10.
1007/s00366-022-01612-x
Avery, J.S.: Hyperspherical Harmonics: Applications in Quantum Theory, vol. 5. Springer Science
& Business Media, Berlin (2012)
Azari, A.S., Mack, Y.P., Müller, H.G.: Ultraspherical polynomial, kernel and hybrid estimators for
non parametric regression. Sankhya: Indian J. Stat. 80–96 (1992)
Belmehdi, S.: Generalized Gegenbauer orthogonal polynomials. J. Comput. Appl. Math. 133, 195–
205 (2001)
Brackx, F., De Schepper, N., Sommen, F.: The Clifford-Gegenbauer polynomials and the associated
continuous wavelet transform. Integr. Transform. Spec. Funct. 15, 387–404 (2004)
Cohl, H.S.: On a generalization of the generating function for Gegenbauer polynomials. Integr.
Transform. Spec. Funct. 24, 807–816 (2013)
Dehestani, H., Ordokhani, Y., Razzaghi, M.: Application of fractional Gegenbauer functions in
variable-order fractional delay-type equations with non-singular kernel derivatives. Chaos, Soli-
tons Fractals 140, 110111 (2020)
Doha, E.H.: The ultraspherical coefficients of the moments of a general-order derivative of an
infinitely differentiable function. J. Comput. Appl. Math. 89, 53–72 (1998)
Doman, B.G.S.: The Classical Orthogonal Polynomials. World Scientific, Singapore (2015)
Dunkl, C.F., Yuan, X.: Orthogonal Polynomials of Several Variables. Cambridge University Press,
Cambridge (2014)
Eassa, M., Selim, I.M., Dabour, W., Elkafrawy, P.: Automated detection and classification of galaxies
based on their brightness patterns. Alex. Eng. J. 61, 1145–1158 (2022)
El-Kalaawy, A.A., Doha, E.H., Ezz-Eldien, S.S., Abdelkawy, M.A., Hafez, R.M., Amin, A.Z.M.,
Zaky, M.A.: A computationally efficient method for a class of fractional variational and optimal
control problems using fractional Gegenbauer functions. Rom. Rep. Phys. 70, 90109 (2018)
Elliott, D.: The expansion of functions in ultraspherical polynomials. J. Aust. Math. Soc. 1, 428–438
(1960)
Feng, B.Y., Varshney, A.: SIGNET: efficient neural representation for light fields. In: Proceedings
of the IEEE/CVF (2021)
Feng, J., Liu, L., Wu, D., Li, G., Beer, M., Gao, W.: Dynamic reliability analysis using the extended
support vector regression (X-SVR). Mech. Syst. Signal Process. 126, 368–391 (2019)
Ferrara, L., Guégan, D.: Forecasting with k-factor Gegenbauer processes: theory and applications.
J. Forecast. 20, 581–601 (2001)
Hadian-Rasanan, A.H., Nikarya, M., Bahramnezhad, A., Moayeri, M.M., Parand, K.: A comparison
between pre-Newton and post-Newton approaches for solving a physical singular second-order
boundary problem in the semi-infinite interval. arXiv:1909.04066
He, J., Chen, T., Zhang, Z.: A Gegenbauer neural network with regularized weights direct determi-
nation for classification (2019). arXiv:1910.11552
Herrera-Acosta, A., Rojas-Domínguez, A., Carpio, J.M., Ornelas-Rodríguez, M., Puga, H.: Gegenbauer-based image descriptors for visual scene recognition. In: Intuitionistic and Type-2 Fuzzy Logic Enhancements in Neural and Optimization Algorithms: Theory and Applications, pp. 629–643 (2020)
Hjouji, A., Bouikhalene, B., EL-Mekkaoui, J., Qjidaa, H.: New set of adapted Gegenbauer-
Chebyshev invariant moments for image recognition and classification. J. Supercomput. 77,
5637–5667 (2021)
Hosny, K.M.: Image representation using accurate orthogonal Gegenbauer moments. Pattern Recog-
nit. Lett. 32, 795–804 (2011)
Hosny, K.M.: New set of Gegenbauer moment invariants for pattern recognition applications. Arab.
J. Sci. Eng. 39, 7097–7107 (2014)
Hosny, K.M., Darwish, M.M., Eltoukhy, M.M.: New fractional-order shifted Gegenbauer moments
for image analysis and recognition. J. Adv. Res. 25, 57–66 (2020)
Ilić, A.D., Pavlović, V.D.: New class of filter functions generated most directly by Christoffel-
Darboux formula for Gegenbauer orthogonal polynomials. Int. J. Electron. 98, 61–79 (2011)
Langley, J., Zhao, Q.: A model-based 3D phase unwrapping algorithm using Gegenbauer polyno-
mials. Phys. Med. Biol. 54, 5237–5252 (2009)
Pawlak, M.: Image analysis by moments: reconstruction and computational aspects. Oficyna Wydawnicza Politechniki Wrocławskiej (2006)
Liao, S., Chiang, A., Lu, Q., Pawlak, M.: Chinese character recognition via Gegenbauer moments.
In: Object Recognition Supported by User Interaction for Service Robots, vol. 3, pp. 485–488
(2002)
Liao, S., Chen, J.: Object recognition with lower order Gegenbauer moments. Lect. Notes Softw.
Eng. 1, 387 (2013)
Liu, W., Wang, L.L.: Asymptotics of the generalized Gegenbauer functions of fractional degree. J.
Approx. Theory 253, 105378 (2020)
Ludlow, I.K., Everitt, J.: Application of Gegenbauer analysis to light scattering from spheres: theory.
Phys. Rev. E 51, 2516–2526 (1995)
Olver, F.W.J., Lozier, D.W., Boisvert, R.F., Clark, C.W.: NIST Handbook of Mathematical Functions
Hardback and CD-ROM. Cambridge University Press, Cambridge (2010)
Öztürk, Ş, Ahmad, R., Akhtar, N.: Variants of artificial Bee Colony algorithm and its applications
in medical image processing. Appl. Soft Comput. 97, 106799 (2020)
Padierna, L.C., Carpio, M., Rojas-Dominguez, A., Puga, H., Fraire, H.: A novel formulation of
orthogonal polynomial kernel functions for SVM classifiers: the Gegenbauer family. Pattern
Recognit. 84, 211–225 (2018)
Padierna, L.C., Amador-Medina, L.F., Murillo-Ortiz, B.O., Villaseñor-Mora, C.: Classification
method of peripheral arterial disease in patients with type 2 diabetes mellitus by infrared ther-
mography and machine learning. Infrared Phys. Technol. 111, 103531 (2020)
Parand, K., Delkhosh, M.: Solving Volterra’s population growth model of arbitrary order using the
generalized fractional order of the Chebyshev functions. Ricerche mat. 65, 307–328 (2016)
Parand, K., Dehghan, M., Baharifard, F.: Solving a laminar boundary layer equation with the rational
Gegenbauer functions. Appl. Math. Model. 37, 851–863 (2013)
Parand, K., Bahramnezhad, A., Farahani, H.: A numerical method based on rational Gegenbauer
functions for solving boundary layer flow of a Powell-Eyring non-Newtonian fluid. Comput.
Appl. Math. 37, 6053–6075 (2018)
Park, R.W.: Optimal compression and numerical stability for Gegenbauer reconstructions with
applications. Arizona State University (2009)
Reimer, M.: Multivariate Polynomial Approximation. Springer Science & Business Media, Berlin
(2003)
Soufivand, F., Soltanian, F., Mamehrashi, K.: An operational matrix method based on the Gegen-
bauer polynomials for solving a class of fractional optimal control problems. Int. J. Ind. Electron.
4, 475–484 (2021)
Srivastava, H.M., Shah, F.A., Abass, R.: An application of the Gegenbauer wavelet method for
the numerical solution of the fractional Bagley-Torvik equation. Russ. J. Math. Phys. 26, 77–93
(2019)

Stier, A.C., Goth, W., Hurley, A., Feng, X., Zhang, Y., Lopes, F.C., Sebastian, K.R., Fox, M.C.,
Reichenberg, J.S., Markey, M.K., Tunnell, J.W.: Machine learning and the Gegenbauer kernel
improve mapping of sub-diffuse optical properties in the spatial frequency domain. In: Molecular-
Guided Surgery: Molecules, Devices, and Applications VII, vol. 11625, p. 1162509 (2021)
Tamandani, A., Alijani, M.G.: Development of an analytical method for pattern synthesizing of
linear and planar arrays with optimal parameters. Int. J. Electron. Commun. 146, 154135 (2022)
Wu, Q., Zhou, D.X.: SVM soft margin classifiers: linear programming versus quadratic program-
ming. Neural Comput. 17, 1160–1187 (2005)
Yang, W., Zhang, Z., Hong, Y.: State recognition of bolted structures based on quasi-analytic
wavelet packet transform and generalized Gegenbauer support vector machine. In: 2020 IEEE
International Instrumentation and Measurement Technology Conference (I2MTC), pp. 1–6 (2020)
Zhang, Z., He, J., Tang, L. : Two-input gegenbauer orthogonal neural network with growing-and-
pruning weights and structure determination. In: International Conference on Cognitive Systems
and Signal Processing, pp. 288–300 (2018)
Chapter 6
Fractional Jacobi Kernel Functions:
Theory and Application

Amir Hosein Hadian Rasanan, Jamal Amani Rad, Malihe Shaban Tameh,
and Abdon Atangana

Abstract The orthogonality property of some kinds of polynomials has drawn significant attention to them. Classical orthogonal polynomials have been under investigation
for many years, with Jacobi polynomials being the most common. In particular,
these polynomials are used to tackle multiple mathematical, physics, and engineering
problems as well as their usage as the kernels has been investigated in the SVM
algorithm classification problem. Here, by introducing the novel fractional form,
a corresponding kernel is going to be proposed to extend the SVM’s application.
Through transforming the input dataset to a fractional space, the fractional Jacobi
kernel-based SVM shows the excellent capability of solving nonlinear problems
in the classification task. In this chapter, the literature, basics, and properties of
this family of polynomials will be reviewed and the corresponding kernel will be
explained, the fractional form of Jacobi polynomials will be introduced, and the
validation according to Mercer conditions will be proved. Finally, a comparison of
the obtained results over a well-known dataset will be provided, using the mentioned
kernels with some other orthogonal kernels as well as RBF and polynomial kernels.

Keywords Jacobi polynomial · Fractional Jacobi functions · Kernel trick · Orthogonal functions · Mercer's theorem

A. H. Hadian Rasanan (B) · J. A. Rad


Department of Cognitive Modeling, Institute for Cognitive and Brain Sciences, Shahid Beheshti
University, Tehran, Iran
e-mail: [email protected]
M. S. Tameh
Department of Chemistry, University of Minnesota, Minneapolis, MN 55455, USA
A. Atangana
Faculty of Natural and Agricultural Sciences, Institute for Groundwater Studies, University of the
Free State, Bloemfontein 9300, South Africa
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 119
J. A. Rad et al. (eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support
Vector Machines, Industrial and Applied Mathematics,
https://doi.org/10.1007/978-981-19-6553-1_6

6.1 Introduction

Classical orthogonal functions have many applications in approximation theory Mastroianni and Milovanovic (2008), Boyd (2001), Askey and Wilson (1985) as well as
in various numerical algorithms such as spectral methods Kazem (2013), Bhrawy
and Alofi (2012), Abdelkawy et al. (2017). The most general type of these orthogonal
functions is Jacobi polynomials Askey and Wilson (1985), Milovanovic et al. (1994),
Asghari et al. (2022). There are some special cases of this orthogonal polynomial
family such as Legendre, the four kinds of Chebyshev, and Gegenbauer polynomials
Ezz-Eldien and Doha (2019), Doha et al. (2011). The Jacobi polynomials have been
used to solve various problems in different fields of science Parand et al. (2016),
Parand et al. (2019), Moayeri et al. (2021). Ping et al. used Jacobi–Fourier moments
for a deterministic image, and then they reconstructed the original image with the
moments Ping et al. (2007). In 2013, Kazem (2013) used Jacobi polynomials to solve
linear and nonlinear fractional differential equations, which we know that the frac-
tional differential equations have many use cases to show model physical and engi-
neering processes. Shojaeizadeh et al. (2021) used the shifted Jacobi polynomials to
address the optimal control problem in advection–diffusion reaction. Also, Bahrawy
et.al (2012), Abdelkawy et al. (2017), Bhrawy (2016), Bhrawy and Zaky (2016) have
many contributions to Jacobi applications. In Bhrawy and Alofi (2012), they proposed
a shifted Jacobi–Gauss collocation method to solve the nonlinear Lane–Emden equa-
tions. Also, in Abdelkawy et al. (2017), the authors used Jacobi polynomials for
solving multi-dimensional Volterra integral equations. In Bhrawy (2016), Bhrawy
used Jacobi spectral collocation method for solving multi-dimensional nonlinear
fractional sub-diffusion equations. Then, Bhrawy and Zaky (2016) used fractional
order of Jacobi functions to solve time fractional problems. On the other hand, Jacobi
polynomials are also used as a kernel in high-performance heterogeneous computers, for example in "the Jacobi iterative method", to gain higher speedup, in Morris and Abed (2012). In addition, the orthogonal rotation-invariant moments
have many use cases in image processing and pattern recognition, including Jacobi
moments. In particular, Rahul Upneja et al. (2015) used Jacobi–Fourier moments for
invariant image processing using a novel fast method leveraging the time complexity.
It can be said explicitly that the applications of these polynomials are so numerous that they have penetrated various other branches of science, such as neuroscience, where they have found important applications. In 2020, Moayeri et al. (2020) used Jacobi
polynomials for the collocation method (the generalized Lagrange–Jacobi–Gauss–
Lobatto, precisely) to simulate nonlinear spatio-temporal neural dynamic models and
also the relevant synchronizations that resulted in a lower computational complex-
ity. Hadian Rasanan et al. (2020) have used recently the fractional order of Jacobi
polynomials as the activation functions of neural networks to simulate the nonlin-
ear fractional dynamics arising in the modeling of cognitive decision-making. Also,
Nkengfack et al. (2021) used Jacobi polynomials for the classification of EEG signals
for epileptic seizures detection and eye states identification. As another application
of these polynomials, Abdallah et al. (2021) have introduced contiguous relations for

Wilson polynomials using Jacobi polynomials. Also, Khodabandehlo et al. (2021) used the shifted Jacobi operational matrix to present generalized nonlinear delay differential equations of fractional variable order. For more applications of different types
of Jacobi polynomials, the interested readers can see Doha et al. (2012), Bhrawy
(2016), Bhrawy et al. (2015, 2016).
If one had to choose the most general family among the classical orthogonal polynomials, without a doubt it would be the Jacobi polynomials $J_n^{\psi,\omega}$. The Jacobi polynomials, also known as hypergeometric polynomials, are named after Carl Gustav Jacob Jacobi (1804–1851). In fact, the Jacobi polynomials are the most general class on the domain $[-1, 1]$; all of the other families are special cases of this family. For example, $\psi = \omega = 0$ gives the Legendre polynomials and $\psi = \omega$ the Gegenbauer family.
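The $\psi = \omega = 0$ special case is easy to confirm numerically; a small check using SciPy (external to the chapter's own code) that these Jacobi polynomials coincide with the Legendre polynomials:

```python
import numpy as np
from scipy.special import eval_jacobi, eval_legendre

x = np.linspace(-1.0, 1.0, 9)
for n in range(6):
    # J_n^{0,0} is the Legendre polynomial P_n
    assert np.allclose(eval_jacobi(n, 0.0, 0.0, x), eval_legendre(n, x))
print("psi = omega = 0 reproduces Legendre for n = 0..5")
```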

Carl Gustav Jacob Jacobi (1804–1851) was a Prussian (German) genius


mathematician who could reach the standards of entering university at the age
of 12; however, he had to wait until the age of 16 according to the University
of Berlin’s laws. He received his doctorate at the age of 21 from the University
of Berlin. Jacobi’s contribution is mainly recognized by his theory of elliptic
functions and its relations with theta functions in 1829. Moreover, he also had
many discoveries in number theory. His results on cubic residues impressed
Carl Friedrich Gauss, and brought significant attention to him among other
mathematicians such as Friedrich Bessel and Adrien-Marie Legendre. Jacobi
had important research on partial differential equations and their applications
and also determinants, and as a result functional determinants also known
as Jacobian determinants. Due to the numerous contributions of Jacobi, his
name has appeared in many mathematical concepts such as Jacobi’s elliptic
functions, Jacobi symbol, Jacobi polynomials, Jacobi transform, and many
others. His political point of view, during the revolution days of Prussia,
made his last years of life full of disturbances. He died under infection to
smallpox, at the age of 46.a

aFor more information about Carl Gustav Jacob Jacobi and his contribution, please see:
https://mathshistory.st-andrews.ac.uk/Biographies/Jacobi/.

6.2 Preliminaries

This section covers the basic definitions and properties of Jacobi orthogonal polyno-
mials. Moreover, the fractional form of Jacobi orthogonal polynomials is introduced
and relevant properties are clarified.

6.2.1 Properties of Jacobi Polynomials

Let’s start with a simple recap on the definition of orthogonal polynomials. In Szeg
(1939), Szego defined the orthogonal polynomials using a function, such as F(x),
to be non-decreasing which includes many points of increase in the interval [a, b].
Suppose the following moments exist for this function:
 b
cn = x n d F(x), n = 0, 1, 2, . . . . (6.1)
a

By orthogonalizing the set of non-negative powers of $x$, i.e., $1, x, x^2, x^3, \ldots, x^n, \ldots$, he obtained a set of polynomials as follows Szegő (1939), Asghari et al. (2022):

$$J_0(x), J_1(x), J_2(x), \ldots, J_n(x), \ldots \qquad (6.2)$$

which can be determined uniquely by the following conditions Szegő (1939):


• $J_n(x)$ is a polynomial of exactly degree $n$, where the coefficient of $x^n$ is positive.
• $\{J_n(x)\}$ is orthonormal, i.e.,

$$\int_a^b J_n(x)\, J_m(x)\, dF(x) = \delta_{nm}, \quad n, m = 0, 1, 2, \ldots, \qquad (6.3)$$

where $\delta_{nm}$ is Kronecker's delta function.


Using Eq. 6.3 one can define the Jacobi orthogonal polynomials. The Jacobi polynomials, denoted by $J_n^{\psi,\omega}(x)$, are orthogonal over the interval $[-1, 1]$ with respect to the weight function $w^{(\psi,\omega)}(x)$ Hadian et al. (2020), Askey and Wilson (1985), Kazem (2013), Asghari et al. (2022):

$$w^{(\psi,\omega)}(x) = (1 - x)^{\psi} (1 + x)^{\omega}, \qquad (6.4)$$

where $\psi, \omega > -1$. However, the formulation of the weight function in Eq. 6.4 suffers from computational difficulties at two distinct input points when $\psi, \omega < 0$ Doman (2015). The input data has to be normalized into the interval $[-1, 1]$. In general, the orthogonal polynomials with the weight function $(b - x)^{\psi} (x - a)^{\omega}$ on the interval $[a, b]$ can be expressed, for any orthogonal polynomial denoted by $J_n^{\psi,\omega}(x)$, in the following form Doman (2015), Bhrawy and Zaky (2016):

$$J_n^{(\psi,\omega)}\left(2\left(\frac{x - a}{b - a}\right) - 1\right), \qquad (6.5)$$
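The affine change of variables in Eq. 6.5 is straightforward in code; a minimal sketch (the function name is ours) of the map from $[a, b]$ to the reference interval $[-1, 1]$:

```python
def shift_to_reference(x, a, b):
    # Eq. 6.5: map x in [a, b] to 2*((x - a)/(b - a)) - 1 in [-1, 1].
    return 2.0 * (x - a) / (b - a) - 1.0

print(shift_to_reference(0.0, 0.0, 10.0))   # -1.0
print(shift_to_reference(5.0, 0.0, 10.0))   # 0.0
print(shift_to_reference(10.0, 0.0, 10.0))  # 1.0
```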

where $a = -1$ and $b = 1$. Clearly, the minimum and maximum possible inputs are $-1$ and $+1$, respectively. In practice, when $\psi < 0$ and $x = +1$, the term $(1 - x)^{\psi}$ becomes zero raised to a negative power, which yields infinity.
Fig. 6.1 Ordinary Jacobi weight function, using different values of ψ and ω

Fig. 6.2 Jacobi weight function using ε = 0.0001 and different values of ψ and ω

Similarly, this happens for the second factor $(1 + x)^{\omega}$ when $\omega < 0$ and $x = -1$. To tackle this issue, adding a trivial noise to $x$ is proposed, by summing with a slack variable $\epsilon = 10^{-4}$ Tian and Wang (2017). Consequently, one can rewrite the relation (6.4) as follows:

$$w^{(\psi,\omega)}(x) = (1 - x + \epsilon)^{\psi} (1 + x + \epsilon)^{\omega}. \qquad (6.6)$$

Note that in programming languages like Python, the floating-point precision of $1 - x$ and $1 + x$ should be handled and set equal to the precision of the slack variable. For example, the following figures compare Eqs. 6.4 and 6.6 for different $\psi$, $\omega$, and $\epsilon = 0.0001$. Since there are some difficulties with the calculation of the ordinary weight function on the boundaries, we have used Maple software for plotting. Figure 6.1 demonstrates the weight function of the Jacobi polynomials without a noise term for different $\psi$ and $\omega$, whereas Fig. 6.2 depicts the weight function of the Jacobi polynomials for different $\psi$ and $\omega$ with the noise ($\epsilon = 0.0001$) added to the terms. As these plots show, there is no considerable difference in the outputs of the related functions.
Thus, the orthogonality relation is Bhrawy et al. (2016), Doman (2015), Askey (1975):

$$\int_{-1}^{1} J_m^{\psi,\omega}(x)\, J_n^{\psi,\omega}(x)\, w^{(\psi,\omega)}(x)\, dx = 0, \quad m \neq n. \qquad (6.7)$$

The Jacobi polynomials can be defined as eigenfunctions of a singular Sturm–Liouville differential equation as follows Doman (2015), Askey (1975), Bhrawy and Zaky (2016), Hadian et al. (2020):

$$\frac{d}{dx}\left[(1 - x + \epsilon)^{\psi+1} (1 + x + \epsilon)^{\omega+1} \frac{d}{dx} J_n^{\psi,\omega}(x)\right] + (1 - x + \epsilon)^{\psi} (1 + x + \epsilon)^{\omega}\, \rho_n\, J_n^{\psi,\omega}(x) = 0, \qquad (6.8)$$

where $\rho_n = n(n + \psi + \omega + 1)$.


Considering the general form of the recurrence relation that orthogonal polynomials follow, for the Jacobi polynomials there exists the following relation Doman (2015), Askey (1975), Bhrawy and Zaky (2016), Hadian et al. (2020):

$$J_{n+1}^{\psi,\omega}(x) = (A_n x + B_n)\, J_n^{\psi,\omega}(x) - C_n\, J_{n-1}^{\psi,\omega}(x), \quad n \geq 1, \qquad (6.9)$$

while

$$J_0^{\psi,\omega}(x) = 1, \qquad J_1^{\psi,\omega}(x) = \frac{1}{2}(\psi + \omega + 2)x + \frac{1}{2}(\psi - \omega), \qquad (6.10)$$

where

$$A_n = \frac{(2n + \psi + \omega + 1)(2n + \psi + \omega + 2)}{2(n + 1)(n + \psi + \omega + 1)},$$
$$B_n = \frac{(\omega^2 - \psi^2)(2n + \psi + \omega + 1)}{2(n + 1)(n + \psi + \omega + 1)(2n + \psi + \omega)},$$
$$C_n = \frac{(n + \psi)(n + \omega)(2n + \psi + \omega + 2)}{(n + 1)(n + \psi + \omega + 1)(2n + \psi + \omega)}. \qquad (6.11)$$

The first few Jacobi polynomials are those given in Eq. 6.10. Because higher-order Jacobi polynomials form lengthy expressions, they are not presented here. Instead, the following code, which uses Python's sympy module to compute symbolically rather than numerically, can be used to obtain them:

6 Fractional Jacobi Kernel Functions … 125

Program Code

import sympy

x = sympy.Symbol("x")
psi = sympy.Symbol(r"\psi")
omega = sympy.Symbol(r"\omega")

def A(n):
    return (2*n + psi + omega + 1)*(2*n + psi + omega + 2) \
        / (2*(n + 1)*(n + psi + omega + 1))

def B(n):
    return (psi**2 - omega**2)*(2*n + psi + omega + 1) \
        / (2*(n + 1)*(n + psi + omega + 1)*(2*n + psi + omega))

def C(n):
    return (n + psi)*(n + omega)*(2*n + psi + omega + 2) \
        / ((n + 1)*(n + psi + omega + 1)*(2*n + psi + omega))

def Jacobi(x, n):
    # three-term recurrence (6.9): J_n = (A_{n-1} x + B_{n-1}) J_{n-1} - C_{n-1} J_{n-2}
    if n == 0:
        return 1
    elif n == 1:
        return (psi + omega + 2)*x/2 + (psi - omega)/2
    else:
        return (A(n - 1)*x + B(n - 1))*Jacobi(x, n - 1) - C(n - 1)*Jacobi(x, n - 2)
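A quick numerical sanity check of the recurrence in the listing above: evaluated with concrete parameter values, it should reproduce scipy.special.jacobi (the values ψ = 0.5, ω = 1.5 are illustrative):

```python
import numpy as np
from scipy.special import jacobi

psi, omega = 0.5, 1.5      # illustrative parameter values

def A(n):
    return (2*n + psi + omega + 1)*(2*n + psi + omega + 2) / (2*(n + 1)*(n + psi + omega + 1))

def B(n):
    return (psi**2 - omega**2)*(2*n + psi + omega + 1) / (2*(n + 1)*(n + psi + omega + 1)*(2*n + psi + omega))

def C(n):
    return (n + psi)*(n + omega)*(2*n + psi + omega + 2) / ((n + 1)*(n + psi + omega + 1)*(2*n + psi + omega))

def J(n, x):
    # three-term recurrence (6.9), evaluated numerically
    if n == 0:
        return np.ones_like(x)
    if n == 1:
        return (psi + omega + 2)*x/2 + (psi - omega)/2
    return (A(n - 1)*x + B(n - 1))*J(n - 1, x) - C(n - 1)*J(n - 2, x)

x = np.linspace(-1, 1, 7)
for n in range(6):
    assert np.allclose(J(n, x), jacobi(n, psi, omega)(x))
print("recurrence matches scipy.special.jacobi for n = 0..5")
```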

Moreover, these polynomials have some special properties, which are introduced in Eqs. 6.12–6.14 Doman (2015), Askey (1975):

\[
J_n^{\psi,\omega}(-x) = (-1)^n J_n^{\omega,\psi}(x), \tag{6.12}
\]

\[
J_n^{\psi,\omega}(1) = \frac{\Gamma(n + \psi + 1)}{n!\, \Gamma(\psi + 1)}, \tag{6.13}
\]

\[
\frac{d^m}{dx^m}\big(J_n^{\psi,\omega}(x)\big) = \frac{\Gamma(m + n + \psi + \omega + 1)}{2^m\, \Gamma(n + \psi + \omega + 1)}\, J_{n-m}^{\psi+m,\omega+m}(x). \tag{6.14}
\]
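Properties (6.12) and (6.13) can be verified numerically with SciPy (the values of n, ψ, ω, and the sample point are illustrative):

```python
import math
from scipy.special import jacobi, gamma

psi, omega, n, x = 0.5, 1.5, 4, 0.37

# (6.12): reflection swaps psi and omega
left = jacobi(n, psi, omega)(-x)
right = (-1) ** n * jacobi(n, omega, psi)(x)
assert abs(left - right) < 1e-8

# (6.13): value at x = 1 equals Gamma(n+psi+1) / (n! * Gamma(psi+1))
at_one = jacobi(n, psi, omega)(1.0)
expected = gamma(n + psi + 1) / (math.factorial(n) * gamma(psi + 1))
assert abs(at_one - expected) < 1e-8
print("properties (6.12) and (6.13) hold numerically")
```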

Theorem 6.1 (Hadian et al. (2020)) The Jacobi polynomial \(J_n^{\psi,\omega}(x)\) has exactly n real zeros on the interval (−1, 1).

Proof Referring to Hadian et al. (2020) shows that the n zeros of the Jacobi polynomial \(J_n^{\psi,\omega}(x)\) can be obtained by calculating the eigenvalues of the following three-diagonal matrix:

\[
K_n = \begin{bmatrix}
\rho_1 & \gamma_2 & & & \\
\gamma_2 & \rho_2 & \gamma_3 & & \\
& \gamma_3 & \rho_3 & \ddots & \\
& & \ddots & \ddots & \gamma_n \\
& & & \gamma_n & \rho_n
\end{bmatrix},
\]

where

\[
\rho_{i+1} = \frac{\int_{-1}^{1} x\, J_i^{\psi,\omega}(x)\, J_i^{\psi,\omega}(x)\, w^{\psi,\omega}(x)\, dx}{\int_{-1}^{1} J_i^{\psi,\omega}(x)\, J_i^{\psi,\omega}(x)\, w^{\psi,\omega}(x)\, dx}, \tag{6.15}
\]

\[
\gamma_{i+1} =
\begin{cases}
0, & i = 0,\\[6pt]
\sqrt{\dfrac{\int_{-1}^{1} J_i^{\psi,\omega}(x)\, J_i^{\psi,\omega}(x)\, w^{\psi,\omega}(x)\, dx}{\int_{-1}^{1} J_{i-1}^{\psi,\omega}(x)\, J_{i-1}^{\psi,\omega}(x)\, w^{\psi,\omega}(x)\, dx}}, & i = 1, 2, \ldots \tag{6.16}
\end{cases}
\]
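SciPy's roots_jacobi implements exactly this eigenvalue (Golub–Welsch) construction of the zeros, so Theorem 6.1 can be illustrated directly (the parameter values are illustrative):

```python
import numpy as np
from scipy.special import jacobi, roots_jacobi

psi, omega, n = -0.5, 0.2, 6
zeros, _ = roots_jacobi(n, psi, omega)       # zeros from the tridiagonal eigenproblem

assert zeros.shape == (n,)                   # exactly n zeros
assert np.all((zeros > -1) & (zeros < 1))    # all real, inside (-1, 1)
assert np.allclose(jacobi(n, psi, omega)(zeros), 0, atol=1e-6)
print("all", n, "zeros lie in (-1, 1)")
```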

Jacobi polynomials satisfy the following ordinary differential equation Doman (2015), Askey (1975), Hadian et al. (2020):

\[
(1 - x^2)\frac{d^2 y}{dx^2} + \big[\omega - \psi - (\psi + \omega + 2)x\big]\frac{dy}{dx} + \lambda y = 0. \tag{6.17}
\]

The solution can be expressed by means of a power series \(y = \sum_{n=0}^{\infty} a_n x^n\):

\[
(1 - x^2) \sum_{n=0}^{\infty} n(n-1)\, a_n x^{n-2} + \big[\omega - \psi - (\psi + \omega + 2)x\big] \sum_{n=0}^{\infty} n\, a_n x^{n-1} + \lambda \sum_{n=0}^{\infty} a_n x^n = 0. \tag{6.18}
\]

Also, the generating function for the Jacobi polynomials can be defined as follows Doman (2015), Askey (1975):

\[
\frac{2^{\psi+\omega}}{R\,(1 + R - z)^{\psi} (1 + R + z)^{\omega}} = \sum_{n=0}^{\infty} J_n^{\psi,\omega}(x)\, z^n, \tag{6.19}
\]

where \(R = \sqrt{1 - 2xz + z^2}\) and \(|z| < 1\).

This family of orthogonal polynomials follows the same symmetry as the orthogonal families already introduced, i.e.,

\[
J_n^{\psi,\omega}(-x) = (-1)^n J_n^{\omega,\psi}(x) =
\begin{cases}
J_n^{\omega,\psi}(x), & n \text{ even},\\
-J_n^{\omega,\psi}(x), & n \text{ odd}.
\end{cases} \tag{6.20}
\]

Figures 6.3 and 6.4 illustrate Jacobi polynomials of orders 0 to 6, with negative ψ and positive ω in Fig. 6.3 versus positive ψ and negative ω in Fig. 6.4. Figures 6.5 and 6.6 depict Jacobi polynomials of order 5 with varying ψ at fixed ω, and with fixed ψ at varying ω, respectively.

Fig. 6.3 Jacobi polynomials of orders 0 to 6, negative ψ and positive ω

Fig. 6.4 Jacobi polynomials of orders 1 to 6, positive ψ and negative ω

Fig. 6.5 Jacobi polynomials of order 5, different ψ and fixed ω

Fig. 6.6 Jacobi polynomials of order 5, fixed ψ and different ω

6.2.2 Properties of Fractional Jacobi Functions

In order to obtain the fractional order of the Jacobi functions, the transformation \(x \mapsto 2\big(\frac{x-a}{b-a}\big)^{\alpha} - 1\) is used, where a and b are the minimum and maximum of the input domain and α > 0. By applying this transformation, the fractional order of the Jacobi functions is obtained, denoted by \(FJ_n^{\psi,\omega,\alpha}(x)\), as follows:

\[
FJ_n^{\psi,\omega,\alpha}(x) = J_n^{\psi,\omega}\Big(2\Big(\frac{x-a}{b-a}\Big)^{\alpha} - 1\Big). \tag{6.21}
\]
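A minimal numerical sketch of Eq. 6.21: evaluate the ordinary Jacobi polynomial at the mapped argument (the values of n, ψ, ω, α, a, b below are illustrative):

```python
import numpy as np
from scipy.special import jacobi

def frac_jacobi(n, psi, omega, alpha, a, b):
    # FJ_n(x) = J_n(2*((x - a)/(b - a))**alpha - 1), Eq. 6.21
    J = jacobi(n, psi, omega)
    return lambda x: J(2 * ((x - a) / (b - a)) ** alpha - 1)

FJ3 = frac_jacobi(3, -0.5, 0.2, alpha=0.5, a=0.0, b=1.0)
xs = np.linspace(0, 1, 5)
print(FJ3(xs))                 # FJ_3 sampled on [a, b] = [0, 1]
```

At the endpoints the mapping sends x = a to −1 and x = b to 1, so the fractional function inherits the endpoint values of the ordinary polynomial.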

The fractional order of the Jacobi functions is orthogonal over the interval [a, b] with the following weight function Doman (2015), Askey (1975), Hadian et al. (2020):

\[
w^{\psi,\omega,\alpha}(x) = \Big(1 - \Big(2\Big(\frac{x-a}{b-a}\Big)^{\alpha} - 1\Big) + \epsilon\Big)^{\psi} \Big(1 + \Big(2\Big(\frac{x-a}{b-a}\Big)^{\alpha} - 1\Big) + \epsilon\Big)^{\omega}. \tag{6.22}
\]
Similarly, there exists a Sturm–Liouville differential equation for the fractional order of the Jacobi functions. This Sturm–Liouville equation is as follows Doman (2015), Askey (1975), Hadian et al. (2020):

\[
\frac{d}{dx}\Big[\big(1 - t(x) + \epsilon\big)^{\psi+1} \big(1 + t(x) + \epsilon\big)^{\omega+1} \frac{d}{dx} FJ_n^{\psi,\omega,\alpha}(x)\Big] + \big(1 - t(x) + \epsilon\big)^{\psi} \big(1 + t(x) + \epsilon\big)^{\omega}\, \rho_n\, FJ_n^{\psi,\omega,\alpha}(x) = 0, \tag{6.23}
\]

where \(t(x) = 2\big(\frac{x-a}{b-a}\big)^{\alpha} - 1\) and \(\rho_n = n(n + \psi + \omega + 1)\). By applying the mapping \(z = t(x)\) to the generating function defined for the Jacobi polynomials, one can get the equation for the fractional form:

\[
\frac{2^{\psi+\omega}}{R\,\big(1 + R - t(x)\big)^{\psi} \big(1 + R + t(x)\big)^{\omega}} = \sum_{n=0}^{\infty} FJ_n^{\psi,\omega,\alpha}(x)\, t(x)^n, \tag{6.24}
\]

where

\[
R = \sqrt{1 - 2x\,t(x) + t(x)^2},
\]

and \(t(x) \in [-1, 1]\).

Fractional Jacobi polynomials are also orthogonal. These polynomials are orthog-
onal in interval [−1, 1] with respect to the weight function similar to Eq. 6.6 where
x−a α
the input x is mapped by means of 2( b−a ) − 1. So the proper weight function for
orthogonality relation of fractional Jacobi polynomials is Bhrawy and Zaky (2016),
Hadian et al. (2020), Kazem (2013)

x −a α x −a α
Fwψ,ω,α (x) = (1 − (2( ) − 1) + )ψ (1 − (2( ) − 1) + )ω . (6.25)
b−a b−a

Now one can define the orthogonality relation for the fractional Jacobi polyno-
mials as  b
F Pmψ,ω,α (x)F Pnψ,ω,α (x)Fwψ,ω,α (x)d x. (6.26)
a
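The orthogonality can be checked discretely: pulling Gauss–Jacobi nodes \(u_i\) back to [a, b] through the inverse of the mapping, the fractional functions take the values \(J_m(u_i)\) there, so a Gauss–Jacobi rule (which accounts for the weight in the mapped variable) reproduces the orthogonality. The parameter values below are illustrative:

```python
import numpy as np
from scipy.special import jacobi, roots_jacobi

psi, omega, alpha, a, b = 0.5, 0.5, 0.4, 0.0, 2.0
u, w = roots_jacobi(8, psi, omega)                 # Gauss-Jacobi rule on [-1, 1]
x = a + (b - a) * ((u + 1) / 2) ** (1 / alpha)     # nodes pulled back to [a, b]

def FJ(n, x):
    # fractional Jacobi function, Eq. 6.21
    return jacobi(n, psi, omega)(2 * ((x - a) / (b - a)) ** alpha - 1)

val = np.sum(w * FJ(2, x) * FJ(5, x))              # discrete inner product
assert abs(val) < 1e-10
print("FJ_2 and FJ_5 are orthogonal under the mapped weight")
```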

The recursive relation for the fractional Jacobi polynomials can be defined in the same way as before, by using the mapped x in the ordinary relation Doman (2015), Askey (1975), Bhrawy and Zaky (2016), Hadian et al. (2020):

\[
\begin{aligned}
FJ_0^{\psi,\omega,\alpha}(x) &= 1,\\
FJ_1^{\psi,\omega,\alpha}(x) &= \frac{1}{2}(\psi + \omega + 2)\Big(2\Big(\frac{x-a}{b-a}\Big)^{\alpha} - 1\Big) + \frac{1}{2}(\psi - \omega),\\
FJ_{n+1}^{\psi,\omega,\alpha}(x) &= \Big(A_n \Big(2\Big(\frac{x-a}{b-a}\Big)^{\alpha} - 1\Big) + B_n\Big)\, FJ_n^{\psi,\omega,\alpha}(x) - C_n\, FJ_{n-1}^{\psi,\omega,\alpha}(x), \quad n \geq 1,
\end{aligned} \tag{6.27}
\]

where

\[
A_n = \frac{(2n + \psi + \omega + 1)(2n + \psi + \omega + 2)}{2(n + 1)(n + \psi + \omega + 1)}, \quad
B_n = \frac{(\psi^2 - \omega^2)(2n + \psi + \omega + 1)}{2(n + 1)(n + \psi + \omega + 1)(2n + \psi + \omega)}, \quad
C_n = \frac{(n + \psi)(n + \omega)(2n + \psi + \omega + 2)}{(n + 1)(n + \psi + \omega + 1)(2n + \psi + \omega)}. \tag{6.28}
\]

Any order of the fractional Jacobi polynomials can be obtained by means of the following Python code as well:

Program Code

import sympy

x = sympy.Symbol("x")
a = sympy.Symbol("a")
b = sympy.Symbol("b")
psi = sympy.Symbol(r"\psi")
omega = sympy.Symbol(r"\omega")
alpha = sympy.Symbol(r"\alpha")

# fractional mapping of Eq. 6.21
t = 2*((x - a)/(b - a))**alpha - 1

def A(n):
    return (2*n + psi + omega + 1)*(2*n + psi + omega + 2) \
        / (2*(n + 1)*(n + psi + omega + 1))

def B(n):
    return (psi**2 - omega**2)*(2*n + psi + omega + 1) \
        / (2*(n + 1)*(n + psi + omega + 1)*(2*n + psi + omega))

def C(n):
    return (n + psi)*(n + omega)*(2*n + psi + omega + 2) \
        / ((n + 1)*(n + psi + omega + 1)*(2*n + psi + omega))

def FJacobi(t, n):
    # same recurrence as Eq. 6.27, evaluated at the mapped argument t
    if n == 0:
        return 1
    elif n == 1:
        return (psi + omega + 2)*t/2 + (psi - omega)/2
    else:
        return (A(n - 1)*t + B(n - 1))*FJacobi(t, n - 1) - C(n - 1)*FJacobi(t, n - 2)

Here, Fig. 6.7 illustrates fractional Jacobi polynomials of orders 0 to 6 with ψ = −0.5, ω = 0.5, and α = 0.5, and Fig. 6.8 compares fractional Jacobi polynomials of order 5 with different α.

Fig. 6.7 Fractional Jacobi polynomials of orders 0 to 6, positive ψ and ω

Fig. 6.8 Fractional Jacobi polynomials of order 5, positive ψ and ω and different α

6.3 Jacobi Kernel Functions

In this section, the ordinary Jacobi kernel function is introduced and its validity is proved according to the Mercer condition. Moreover, the Jacobi wavelet kernel, which has recently attracted the attention of researchers, is introduced. The last subsection is devoted to the fractional Jacobi kernel.

6.3.1 Ordinary Jacobi Kernel Function

As we already know, the unweighted orthogonal polynomial kernel function for SVM can be written as follows:

\[
K(x, z) = \sum_{i=0}^{n} J_i(x)\, J_i(z), \tag{6.29}
\]

where J(.) denotes the evaluation of the polynomial, x and z are the kernel's input arguments, and n is the highest polynomial order. Using this definition, and since the multi-dimensional form of the kernel function is needed, one can introduce the Jacobi kernel function by evaluating the inner product of the input vectors (\(\langle x, z \rangle = x z^T\)) Nadira et al. (2019), Ozer et al. (2011):

\[
k_{Jacobi}(x, z) = \sum_{i=0}^{n} J_i^{\psi,\omega}(x)\, J_i^{\psi,\omega}(z)^T\, w^{\psi,\omega}(x, z). \tag{6.30}
\]
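Equation 6.30 leaves the vector evaluation implicit; one plausible reading, following the vector formulation of the generalized Chebyshev kernel of Ozer et al. (2011), evaluates each polynomial elementwise on the input vectors, takes the dot product, and multiplies by the weight. The sketch below builds a small kernel matrix this way; the parameter values and ε (the noise term) are illustrative:

```python
import numpy as np
from scipy.special import jacobi

def jacobi_kernel(x, z, n=3, psi=-0.2, omega=0.3, eps=1e-4):
    d = x.shape[0]
    # sum of elementwise polynomial evaluations, dotted together
    s = sum(np.dot(jacobi(i, psi, omega)(x), jacobi(i, psi, omega)(z))
            for i in range(n + 1))
    # weight built from the inner product of the input vectors
    w = (d - x @ z + eps) ** psi * (d + x @ z + eps) ** omega
    return s * w

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(6, 2))    # inputs normalized to [-1, 1]^2
G = np.array([[jacobi_kernel(xi, xj) for xj in X] for xi in X])
assert np.allclose(G, G.T)             # a kernel matrix must be symmetric
assert np.all(np.diag(G) > 0)          # K(x, x) > 0 for these parameters
```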

Theorem 6.2 (Nadira et al. (2019)) The Jacobi kernel introduced in Eq. 6.30 is a valid Mercer kernel.

Proof Mercer's theorem states that an SVM kernel should be positive semi-definite or non-negative; in other words, the kernel should satisfy the relation below:

\[
\iint K(x, z)\, f(x) f(z)\, dx\, dz \geq 0. \tag{6.31}
\]

By using the fact that the multiplication of two valid kernels is also a valid kernel, one can divide the Jacobi kernel introduced in Eq. 6.30 into two kernels, one the inner product and the other the weight function; therefore:

\[
K^{(1)}(x, z) = \sum_{i=0}^{n} J_i^{\psi,\omega}(x)\, J_i^{\psi,\omega}(z)^T, \tag{6.32}
\]

\[
K^{(2)}(x, z) = w^{\psi,\omega}(x, z) = (d - \langle x, z \rangle + \epsilon)^{\psi} (d + \langle x, z \rangle + \epsilon)^{\omega}, \tag{6.33}
\]

where d is the dimension of the inputs x and z. Considering a function \(f: \mathbb{R}^m \to \mathbb{R}\), one can evaluate the Mercer condition for \(K^{(1)}(x, z)\) as follows, assuming each element is independent of the others:

\[
\begin{aligned}
\iint K^{(1)}(x, z) f(x) f(z)\, dx\, dz
&= \iint \sum_{i=0}^{n} J_i^{\psi,\omega}(x)\, J_i^{\psi,\omega}(z)^T f(x) f(z)\, dx\, dz\\
&= \sum_{i=0}^{n} \iint J_i^{\psi,\omega}(x)\, J_i^{\psi,\omega}(z)^T f(x) f(z)\, dx\, dz\\
&= \sum_{i=0}^{n} \Big(\int J_i^{\psi,\omega}(x) f(x)\, dx\Big) \Big(\int J_i^{\psi,\omega}(z)^T f(z)\, dz\Big) \geq 0. \tag{6.34}
\end{aligned}
\]

Therefore, the kernel \(K^{(1)}(x, z)\) is a valid Mercer kernel. To prove \(K^{(2)}(x, z) \geq 0\), it suffices to show that the weight function in Eq. 6.6 is positive semi-definite, because it is easier to consider intuitively; the general weight function of the kernel, reformulated for two vector inputs, then becomes clear. Due to the normalization of the input data, this weight function is positive semi-definite, so the weight function \(w^{\psi,\omega}(x, z) = (d - \langle x, z \rangle)^{\psi} (d + \langle x, z \rangle)^{\omega}\), which is its generalized form for the two input vectors of the kernel, is positive semi-definite too. □
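In the proof, \(K^{(1)}\) is an inner product of the feature maps \(\Phi(x) = (J_0(x), \ldots, J_n(x))\), so any Gram matrix built from it is positive semi-definite by construction. A quick numerical confirmation on scalar inputs (the parameter values are illustrative):

```python
import numpy as np
from scipy.special import jacobi

psi, omega, n = -0.2, 0.3, 4
xs = np.linspace(-1, 1, 25)
# feature map: rows are Phi(x) = (J_0(x), ..., J_n(x))
Phi = np.stack([jacobi(i, psi, omega)(xs) for i in range(n + 1)], axis=1)
G = Phi @ Phi.T                        # K1 evaluated on all pairs of points
eig = np.linalg.eigvalsh(G)
assert eig.min() > -1e-8               # positive semi-definite up to roundoff
print("smallest eigenvalue:", eig.min())
```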

6.3.2 Other Jacobi Kernel Functions

SVM with the kernel-trick heavily depends on the proper kernel to choose according
to data, which is still an attractive topic to introduce and examine new kernels for
SVM. Orthogonality properties of some polynomials such as Jacobi have made them
a hot alternative for such use cases. By the way, some previously introduced kernels
such as wavelet have been used successfully Nadira et al. (2019). Combining the
wavelet and orthogonal Kernel functions such as Chebyshev, Hermit and Legendre
have already been proposed and examined in signal processing Garnier et al. (2003),
solving differential equations Imani et al. (2011), Khader and Adel (2018), optimal
control Razzaghi and Yousefi (2002), Elaydi et al. (2012), and calculus variation
problems Bokhari et al. (2018).

6.3.2.1 Regularized Jacobi Wavelet Kernel Function

Abassa et al. Nadira et al. (2019) recently introduced SVM kernels based on Jacobi wavelets. The proposed kernel is only glanced at here; interested readers may find the proof and details in the original paper Nadira et al. (2019). It should be noted that some notations have been changed to preserve the integrity of the book, but the formulation is intact and the same as in the original work.

The Jacobi polynomials \(J_m^{(\psi,\omega)}\) are defined as follows:

\[
J_m^{(\psi,\omega)}(x) = \frac{(\psi + \omega + 2m - 1)\big[\psi^2 - \omega^2 + x(\psi + \omega + 2m)(\psi + \omega + 2m - 2)\big]}{2m(\psi + \omega + 2m - 2)(\psi + \omega + m)}\, J_{m-1}^{(\psi,\omega)}(x) - \frac{(\psi + m - 1)(\omega + m - 1)(\psi + \omega + 2m)}{m(\psi + \omega + 2m - 2)(\psi + \omega + m)}\, J_{m-2}^{(\psi,\omega)}(x),
\]

where \(\psi > -1\), \(\omega > -1\), and

\[
J_0^{(\psi,\omega)}(x) = 1, \qquad J_1^{(\psi,\omega)}(x) = \frac{\psi + \omega + 2}{2}\, x + \frac{\psi - \omega}{2}.
\]

These polynomials belong to the weighted space \(L_w^2([-1, 1])\), i.e.,

\[
\big\langle J_m^{(\psi,\omega)}, J_{m'}^{(\psi,\omega)} \big\rangle_{L_w^2} = h_m^{(\psi,\omega)}\, \delta_{m,m'}, \qquad \forall m, m' \in \mathbb{N},
\]

where

\[
h_m^{(\psi,\omega)} = \big\| J_m^{(\psi,\omega)} \big\|^2 = \frac{2^{\psi+\omega+1}\, \Gamma(\psi + m + 1)\, \Gamma(\omega + m + 1)}{(2m + 1 + \psi + \omega)\, m!\, \Gamma(\psi + \omega + m + 1)},
\]

and \(w(x) = (1 - x)^{\psi} (1 + x)^{\omega}\). In addition, \(\delta_{n,m}\) is the Kronecker delta, \(\Gamma\) is the Euler gamma function, and \(\langle \cdot, \cdot \rangle_{L_w^2}\) denotes the inner product of \(L_w^2([-1, 1])\). The family \(\big(J_m^{(\psi,\omega)}\big)_{m \in \mathbb{N}}\) forms an orthogonal basis for \(L_w^2([-1, 1])\).

6.3.3 Fractional Jacobi Kernel

Since the weight function of the fractional Jacobi functions is equal to Eq. 6.25, defining the fractional Jacobi kernel is easy. One can construct the fractional Jacobi kernel as follows:

\[
K_{FJacobi}(x, z) = \sum_{i=0}^{n} FJ_i^{\psi,\omega,\alpha}(x)\, FJ_i^{\psi,\omega,\alpha}(z)^T\, Fw^{\psi,\omega,\alpha}(x, z). \tag{6.35}
\]

Theorem 6.3 The Jacobi kernel introduced in Eq. 6.35 is a valid Mercer kernel.

Proof Similar to the proof of Theorem 6.2, the fractional Jacobi kernel function of Eq. 6.35 can be considered as the multiplication of two kernels:

\[
K^{(1)}(x, z) = \sum_{i=0}^{n} FJ_i^{\psi,\omega,\alpha}(x)\, FJ_i^{\psi,\omega,\alpha}(z)^T, \tag{6.36}
\]

\[
K^{(2)}(x, z) = Fw^{\psi,\omega,\alpha}(x, z) \tag{6.37}
\]

\[
= \Big(d - \Big\langle 2\Big(\frac{x-a}{b-a}\Big)^{\alpha} - 1,\; 2\Big(\frac{z-a}{b-a}\Big)^{\alpha} - 1 \Big\rangle + \epsilon\Big)^{\psi} \Big(d + \Big\langle 2\Big(\frac{x-a}{b-a}\Big)^{\alpha} - 1,\; 2\Big(\frac{z-a}{b-a}\Big)^{\alpha} - 1 \Big\rangle + \epsilon\Big)^{\omega}, \tag{6.38}
\]

where d is the dimension of the inputs x and z. Consider \(f: \mathbb{R}^m \to \mathbb{R}\), and assume each element is independent of the others. One can evaluate the Mercer condition for \(K^{(1)}(x, z)\) as follows:

\[
\begin{aligned}
\iint K^{(1)}(x, z) f(x) f(z)\, dx\, dz
&= \iint \sum_{i=0}^{n} FJ_i^{\psi,\omega,\alpha}(x)\, FJ_i^{\psi,\omega,\alpha}(z)^T f(x) f(z)\, dx\, dz\\
&= \sum_{i=0}^{n} \iint FJ_i^{\psi,\omega,\alpha}(x)\, FJ_i^{\psi,\omega,\alpha}(z)^T f(x) f(z)\, dx\, dz\\
&= \sum_{i=0}^{n} \Big(\int FJ_i^{\psi,\omega,\alpha}(x) f(x)\, dx\Big) \Big(\int FJ_i^{\psi,\omega,\alpha}(z)^T f(z)\, dz\Big) \geq 0. \tag{6.39}
\end{aligned}
\]

Therefore, Eq. 6.36 is a valid Mercer kernel. The validity of \(K^{(2)}(x, z)\) can be considered similarly to the weight function of the ordinary Jacobi kernel function. It can be deduced that the output of \(K^{(2)}(x, z)\) is never less than zero: the inner product of two vectors which are normalized over [−1, 1] lies in the range [−d, d], where d is the dimension of the input vectors x or z, and the effect of a negative inner product is neutralized by the parameter d. Therefore, Eq. 6.37 is also a valid Mercer kernel. □

6.4 Application of Jacobi Kernel Functions on Real Datasets

In this section, the results of the Jacobi and fractional Jacobi kernels on some well-known datasets are compared with other kernels, such as the RBF, polynomial, Chebyshev, fractional Chebyshev, Gegenbauer, fractional Gegenbauer, Legendre, and fractional Legendre kernels introduced in the previous chapters. For a neat classification, some preprocessing steps may need to be applied to a dataset; these steps are not the focus here, except for normalization, which is mandatory when using Jacobi polynomials as a kernel. There are some online data stores available for public use; a widely used one is the UCI Machine Learning Repository1 of the University of California, Irvine, and another is Kaggle.2 For this section, four datasets from UCI are used that are well known to machine learning practitioners.
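Experiments of this kind can be reproduced with scikit-learn, whose SVC accepts a callable that returns the Gram matrix. The sketch below is not the book's exact experimental setup: the jacobi_gram helper, the parameter values, and the synthetic two-moons data are all assumptions for illustration, with the inputs normalized to [−1, 1] as the text requires:

```python
import numpy as np
from scipy.special import jacobi
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def jacobi_gram(X, Z, n=3, psi=-0.2, omega=0.3, eps=1e-4):
    # Gram matrix of the vector-form Jacobi kernel for data in [-1, 1]^d
    d = X.shape[1]
    S = sum(jacobi(i, psi, omega)(X) @ jacobi(i, psi, omega)(Z).T
            for i in range(n + 1))
    ip = X @ Z.T
    W = (d - ip + eps) ** psi * (d + ip + eps) ** omega
    return S * W

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X = 2 * (X - X.min(0)) / (X.max(0) - X.min(0)) - 1    # normalize to [-1, 1]
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

clf = SVC(kernel=jacobi_gram).fit(Xtr, ytr)
print("test accuracy:", clf.score(Xte, yte))
```

A precomputed-kernel variant (SVC(kernel="precomputed") fed with jacobi_gram(Xtr, Xtr)) works as well and avoids recomputing the Gram matrix during hyperparameter search.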

6.4.1 Spiral Dataset

The Spiral dataset has already been introduced in detail in Chap. 3. As a quick recap, Fig. 6.9 depicts the original Spiral dataset (on the left) and also the same dataset in the fractional space of order α = 0.3 (on the right), consisting of 1000 data points. In the classification task on this dataset, 90% of the data points are chosen as the test set. As the 2D plot shows, in the fractional space the data points are concentrated in the positive quarter. As already discussed, this is only a step in preparing the dataset before applying the kernel function and finding the most suitable classifier.

Figure 6.10 demonstrates the same dataset after the Jacobi kernel function is applied. The plot on the left depicts the original dataset after the Jacobi kernel of order 3 with ψ = −0.2 and ω = 0.3 is applied to it. The plot on the right demonstrates the Spiral dataset in the fractional space of order 0.3, where the fractional Jacobi kernel of order 3, with ψ = −0.2, ω = 0.3, and fractional order 0.3, is applied.
Applying the Jacobi kernel has given a higher dimension to each data point of the Spiral dataset. Thus, a new order has emerged, and consequently a chance to find a decision surface (in 3D) for binary classification.

1 https://archive.ics.uci.edu/ml/datasets.php.
2 https://www.kaggle.com/datasets.

Fig. 6.9 Spiral dataset

Fig. 6.10 Jacobi kernel applied to Spiral dataset, both normal and fractional

Although this may not seem like much of an improvement, one has to wait for the Jacobi classifiers; afterward, one can reach a better judgment. Figure 6.11 depicts the classifiers of different settings of the Jacobi kernel on the Spiral dataset. Due to computational difficulties of the Jacobi kernel, orders 7 and 8 have been ignored; the following figures only present the classifiers of the Jacobi and fractional Jacobi kernels of orders 3 to 6. As is clear in the following plots, with the fixed parameters ψ = −0.5 and ω = 0.2, the corresponding classifiers get more curved and twisted as the order rises from 3 to 6. Nevertheless, this characteristic does not necessarily mean a better classification opportunity, due to its high dependence on multiple parameters besides the kernel-specific ones. According to these plots, one can deduce that the Jacobi kernels of orders 3 and 5 with ψ = −0.5 and ω = 0.2 are slightly better classifiers for the binary classification of class 1-v-[2, 3] on the Spiral dataset in comparison to the same kernel of orders 4 and 6.
Figure 6.12 demonstrates the corresponding plots of the fractional Jacobi kernel of orders 3 to 6, with fixed parameters ψ = −0.5 and ω = 0.2 and fractional order 0.3. Clearly, in the fractional space the classifier gets more twisted and intricate in comparison to the normal one. The fractional Jacobi kernel is fairly successful in this classification task, in both the normal and the fractional space. The following figure depicts how the fractional Jacobi kernel determines the decision boundary.

Fig. 6.11 Jacobi kernel applied to Spiral dataset, corresponding classifiers of order 3 to 6, at fixed
parameters ψ = −0.5, ω = 0.2

The following tables summarize the experimental results of SVM classification on the Spiral dataset. Using the OVA method, all possible formations of classes 1-v-[2, 3], 2-v-[1, 3], and 3-v-[1, 2] are chosen, and the relevant results are summarized in Tables 6.1, 6.2, and 6.3, respectively. In these experiments, the target was not to obtain the best accuracy scores; reaching a higher accuracy score is technically possible through finding the best number of support vectors, setting the generalization parameter of the SVM, and trying different values for the kernel parameters. Table 6.1 is the complete accuracy comparison between the introduced orthogonal kernels and the RBF and polynomial kernels on the binary classification 1-v-[2, 3] on the Spiral dataset discussed in Chap. 3, in which the fractional Legendre kernel came close to 100% accuracy, while the lowest accuracy score, about 94%, belonged to the Gegenbauer kernel. Table 6.2 summarizes the second possible binary classification task on the Spiral dataset (2-v-[1, 3]). The RBF and fractional Legendre kernels outperform the other kernels with accuracy scores of about 98%, while the next best score is close to 93%. Finally, Table 6.3 summarizes the third form of binary classification on the Spiral dataset, which is 3-v-[1, 2]. The fractional Legendre kernel has the best performance with 99% accuracy, while the fractional Jacobi kernel yields an accuracy close to 97%.

Fig. 6.12 Fractional Jacobi kernel applied to Spiral dataset, corresponding classifiers of order 3 to
6, at fixed parameters ψ = −0.5, ω = 0.2, and fractional order = 0.3

Table 6.1 Class 1-v-[2, 3]: comparison of the RBF, polynomial, Chebyshev, fractional Chebyshev, Legendre, fractional Legendre, Gegenbauer, fractional Gegenbauer, Jacobi, and fractional Jacobi kernels on the Spiral dataset. The RBF, fractional Chebyshev, and fractional Jacobi kernels come close to the best results
Kernel Sigma Power Order Alpha(α) Lambda(λ) Psi(ψ) Omega(ω) Accuracy
RBF 0.73 – – – – – – 0.97
Polynomial – 8 – – – – – 0.9533
Chebyshev – – 5 – – – – 0.9667
Fractional Chebyshev – – 3 0.3 – – – 0.9733
Legendre – – 6 – – – – 0.9706
Fractional Legendre – – 7 0.8 – – – 0.9986
Gegenbauer – – 6 – 0.3 – – 0.9456
Fractional Gegenbauer – – 6 0.3 0.7 – – 0.9533
Jacobi – – 3 – – −0.8 0 0.96
Fractional Jacobi – – 7 0.7 – −0.5 0.6 0.9711

6.4.2 Three Monks’ Dataset

The datasets of this section come from “The MONK's Problems: A Performance Comparison of Different Learning Algorithms”. For more details, please refer to the related section in Chap. 3, i.e., Sect. 3.4.2.

Table 6.2 Class 2-v-[1, 3]: comparison of the RBF, polynomial, Chebyshev, fractional Chebyshev, Legendre, fractional Legendre, Gegenbauer, fractional Gegenbauer, Jacobi, and fractional Jacobi kernels on the Spiral dataset. The RBF kernel outperforms the other kernels
Kernel Sigma Power Order Alpha(α) Lambda(λ) Psi(ψ) Omega(ω) Accuracy
RBF 0.1 – – – – – – 0.9867
Polynomial – 5 – – – – – 0.9044
Chebyshev – – 6 – – – – 0.9289
Fractional Chebyshev – – 6 0.8 – – – 0.9344
Legendre – – 8 – – – – 0.9344
Fractional Legendre – – 8 0.4 – – – 0.9853
Gegenbauer – – 5 – 0.3 – – 0.9278
Fractional Gegenbauer – – 4 0.6 0.6 – – 0.9356
Jacobi – – 5 – – −0.2 0.4 0.9144
Fractional Jacobi – – 3 0.3 – −0.2 0 0.9222

Table 6.3 Class 3-v-[1, 2]: comparison of the RBF, polynomial, Chebyshev, fractional Chebyshev, Legendre, fractional Legendre, Gegenbauer, fractional Gegenbauer, Jacobi, and fractional Jacobi kernels on the Spiral dataset. The RBF and polynomial kernels have the best accuracy, followed by the fractional Jacobi kernel
Kernel Sigma Power Order Alpha(α) Lambda(λ) Psi(ψ) Omega(ω) Accuracy
RBF 0.73 – – – – – – 0.9856
Polynomial – 5 – – – – – 0.9856
Chebyshev – – 6 – – – – 0.9622
Fractional Chebyshev – – 6 0.6 – – – 0.9578
Legendre – – 6 – – – – 0.9066
Fractional Legendre – – 6 0.9 – – – 0.9906
Gegenbauer – – 6 – 0.3 – – 0.9611
Fractional Gegenbauer – – 6 0.9 0.3 – – 0.9644
Jacobi – – 5 – – −0.8 0.3 0.96
Fractional Jacobi – – 7 0.9 – −0.8 0 0.9722

The Jacobi kernels introduced in Eqs. 6.30 and 6.35 are applied to the datasets of the three Monks' problems, and the relevant results are reported in Tables 6.4, 6.5, and 6.6. The fractional Jacobi kernel showed slightly better performance on these datasets in comparison with the other fractional and ordinary kernels on Monks' problems M1 and M2.

Finally, the support vector machine kernels are summarized in Table 6.7. A brief explanation of each kernel introduced in this book is presented there, highlighting its most notable characteristics and giving a suitable comparison list for glancing at the introduced orthogonal kernels.

Table 6.4 Comparison of the RBF, polynomial, Chebyshev, fractional Chebyshev, Gegenbauer, fractional Gegenbauer, Legendre, fractional Legendre, Jacobi, and fractional Jacobi kernels on Monk's first problem. The fractional Gegenbauer and fractional Jacobi kernels reach the most desirable accuracy of 1
Kernel Sigma Power Order Alpha(α) Lambda(λ) Psi(ψ) Omega(ω) Accuracy
RBF 2.844 – – – – – – 0.8819
Polynomial – 3 – – – – – 0.8681
Chebyshev – – 3 – – – – 0.8472
Fractional Chebyshev – – 3 1/16 – – – 0.8588
Legendre – – 3 – – – – 0.8333
Fractional Legendre – – 3 0.8 – – – 0.8518
Gegenbauer – – 3 – −0.2 – – 0.9931
Fractional Gegenbauer – – 3 0.7 0.2 – – 1
Jacobi – – 4 – – −0.2 −0.5 0.9977
Fractional Jacobi – – 4 0.4 – −0.2 −0.5 1

Table 6.5 Comparison of the RBF, polynomial, Chebyshev, fractional Chebyshev, Gegenbauer, fractional Gegenbauer, Legendre, fractional Legendre, Jacobi, and fractional Jacobi kernels on the second Monk's problem. The fractional Legendre and fractional Jacobi kernels have the best accuracy of 1
Kernel Sigma Power Order Alpha(α) Lambda(λ) Psi(ψ) Omega(ω) Accuracy
RBF 5.5896 – – – – – – 0.875
Polynomial – 3 – – – – – 0.8657
Chebyshev – – 3 – – – – 0.8426
Fractional Chebyshev – – 3 1/16 – – – 0.9653
Legendre – – 3 – – – – 0.8032
Fractional Legendre – – 3 0.8 – – – 1
Gegenbauer – – 3 – 0.5 – – 0.7824
Fractional Gegenbauer – – 3 0.1 0.5 – – 0.9514
Jacobi – – 3 – – −0.5 −0.2 0.956
Fractional Jacobi – – 3 0.1 – −0.2 −0.5 1

6.5 Summary and Conclusion

Jacobi orthogonal polynomials, the most general kind of classical orthogonal polynomials, have been covered in this chapter; they have been used in many applications. The basics and the properties of these polynomials were explained, and the ordinary Jacobi kernel function was constructed. Moreover, the fractional form of the Jacobi polynomials and the corresponding fractional Jacobi kernel function were introduced, which extend the applicability of such a kernel by transforming the input data into a fractional space, an approach that has proved to leverage the accuracy of classification. Also, a comprehensive comparison was provided between the introduced Jacobi kernels and all the other kernels discussed in Chaps. 3, 4, and 5. The experiments showed the efficiency of the Jacobi kernels, which makes them a suitable kernel function for kernel-based learning algorithms such as SVM.

Table 6.6 Comparison of the RBF, polynomial, Chebyshev, fractional Chebyshev, Gegenbauer, fractional Gegenbauer, Legendre, fractional Legendre, Jacobi, and fractional Jacobi kernels on the third Monk's problem. The Gegenbauer and fractional Gegenbauer kernels have the best accuracy scores, followed by the Jacobi, fractional Jacobi, RBF, and fractional Chebyshev kernels
Kernel Sigma Power Order Alpha(α) Lambda(λ) Psi(ψ) Omega(ω) Accuracy
RBF 2.1586 – – – – – – 0.91
Polynomial – 3 – – – – – 0.875
Chebyshev – – 6 – – – – 0.895
Fractional Chebyshev – – 5 1/5 – – – 0.91
Legendre – – 3 – – – – 0.8472
Fractional Legendre – – 3 0.8 – – – 0.8379
Gegenbauer – – 4 – −0.2 – – 0.9259
Fractional Gegenbauer – – 3 0.7 −0.2 – – 0.9213
Jacobi – – 5 – – −0.5 0.0 0.919
Fractional Jacobi – – 4 – – −0.5 0.0 0.9167

Table 6.7 A summary of SVM kernels

Kernel function Description
Legendre The Legendre kernel was presented in Pan et al. (2012) and has the considerable merit of not requiring a weight function. Although it is immune to the explosion effect, it suffers from the annihilation effect
Chebyshev This kernel was presented by Ye et al. (2006), who reported it as the first orthogonal polynomial kernel for SVM classifiers. It is immune to neither the explosion nor the annihilation effect
Generalized Chebyshev This kernel was proposed by Ozer et al. (2011) and was the first orthogonal kernel with a vector formulation. By introducing this kernel, they found an effective way to avoid the annihilation effect
Modified Chebyshev This kernel was introduced by Ozer et al. (2011) and has the same characteristics as the generalized Chebyshev kernel. The vector form of this kernel performs better on nonlinear problems in comparison with the generalized Chebyshev kernel
Chebyshev Wavelet Introduced by Jafarzadeh et al. (2013) using Mercer rules; it is in fact the multiplication of the Chebyshev and wavelet kernels and was shown to have competitive accuracy
Unified Chebyshev Zhao et al. (2013) introduced a new kernel by unifying the Chebyshev kernels of the first and second kind: a new kernel with two notable properties, orthogonality and adaptivity. This kernel was also shown to have low computational cost and competitive accuracy
Gegenbauer Introduced in Padierna et al. (2018); it is not only immune to both the annihilation and explosion effects, but it also includes the Chebyshev and Legendre kernels as special cases and improves on them
Generalized Gegenbauer Introduced by Yang et al. (2020), originally to address an engineering problem of state recognition of bolted structures; it showed considerable accuracy in that specific context
Regularized Jacobi Wavelets This kernel function was introduced by Abassa in Nadira et al. (2019). Although this kernel has high time complexity, it still provides competitive results compared to other kernels

References

Abdallah, N.B., Chouchene, F.: New recurrence relations for Wilson polynomials via a system of
Jacobi type orthogonal functions. J. Math. Anal. Appl. 498, 124978 (2021)
Abdelkawy, M.A., Amin, A.Z., Bhrawy, A.H., Machado, J.A.T., Lopes, A.M.: Jacobi collocation
approximation for solving multi-dimensional Volterra integral equations. Int. J. Nonlinear Sci.
Numer. Simul. 18, 411–425 (2017)
Asghari, M., Hadian Rasanan, A.H., Gorgin, S., Rahmati, D., Parand, K.: FPGA-orthopoly: a
hardware implementation of orthogonal polynomials. Eng. Comput. (2022). https://doi.org/10.
1007/s00366-022-01612-x
Askey, R., Wilson, J.A.: Some basic hypergeometric orthogonal polynomials that generalize Jacobi
polynomials. Am. Math. Soc. 319 (1985)
Askey, R.: Orthogonal Polynomials and Special Functions. Society for Industrial and Applied
Mathematics, Pennsylvania (1975)
Bhrawy, A.H.: A Jacobi spectral collocation method for solving multi-dimensional nonlinear frac-
tional sub-diffusion equations. Numer. Algorith. 73, 91–113 (2016)
Bhrawy, A.H., Alofi, A.S.: Jacobi-Gauss collocation method for solving nonlinear Lane-Emden
type equations. Commun. Nonlinear Sci. Numer. Simul. 17, 62–70 (2012)
Bhrawy, A.H., Zaky, M.A.: Shifted fractional-order Jacobi orthogonal functions: application to a
system of fractional differential equations. Appl. Math. Model. 40, 832–845 (2016)
Bhrawy, A., Zaky, M.: A fractional-order Jacobi Tau method for a class of time-fractional PDEs
with variable coefficients. Math. Methods Appl. Sci. 39, 1765–1779 (2016)
Bhrawy, A.H., Hafez, R.M., Alzaidy, J.F.: A new exponential Jacobi pseudospectral method for
solving high-order ordinary differential equations. Adv. Differ. Equ. 2015, 1–15 (2015)
Bhrawy, A.H., Doha, E.H., Saker, M.A., Baleanu, D.: Modified Jacobi-Bernstein basis transforma-
tion and its application to multi-degree reduction of Bézier curves. J. Comput. Appl. Math. 302,
369–384 (2016)
Bokhari, A., Amir, A., Bahri, S.M.: A numerical approach to solve quadratic calculus of variation
problems. Dyn. Contin. Discr. Impuls. Syst. 25, 427–440 (2018)
Boyd, J.P.: Chebyshev and Fourier Spectral Methods. Courier Corporation, MA (2001)
Doha, E.H., Bhrawy, A.H., Ezz-Eldien, S.S.: Efficient Chebyshev spectral methods for solving
multi-term fractional orders differential equations. Appl. Math. Model. 35, 5662–5672 (2011)
Doha, E.H., Bhrawy, A.H., Ezz-Eldien, S.S.: A new Jacobi operational matrix: an application for
solving fractional differential equations. Appl. Math. Model. 36, 4931–4943 (2012)
Doman, B.G.S.: The Classical Orthogonal Polynomials. World Scientific, Singapore (2015)
Elaydi, H.A., Abu Haya, A.: Solving optimal control problem for linear time invariant systems via
Chebyshev wavelet. Int. J. Electr. Eng. 5 (2012)
Ezz-Eldien, S.S., Doha, E.H.: Fast and precise spectral method for solving pantograph type Volterra
integro-differential equations. Numer. Algorithm. 81, 57–77 (2019)
Garnier, H., Mensler, M.I.C.H.E.L., Richard, A.L.A.I.N.: Continuous-time model identification
from sampled data: implementation issues and performance evaluation. Int. J. Control 76, 1337–
1357 (2003)
Guo, B.Y., Shen, J., Wang, L.L.: Generalized Jacobi polynomials/functions and their applications.
Appl. Numer. Math. 59, 1011–1028 (2009)
Hadian Rasanan, A.H., Bajalan, N., Parand, K., Rad, J.A.: Simulation of nonlinear fractional dynam-
ics arising in the modeling of cognitive decision making using a new fractional neural network.
Math. Methods Appl. Sci. 43, 1437–1466 (2020)
Imani, A., Aminataei, A., Imani, A.: Collocation method via Jacobi polynomials for solving non-
linear ordinary differential equations. Int. J. Math. Math. Sci. 2011, 673085 (2011)
Jafarzadeh, S.Z., Aminian, M., Efati, S.: A set of new kernel function for support vector machines:
an approach based on Chebyshev polynomials. In: ICCKE, pp. 412–416 (2013)
Kazem, S.: An integral operational matrix based on Jacobi polynomials for solving fractional-order
differential equations. Appl. Math. Model. 37, 1126–1136 (2013)

Khader, M.M., Adel, M.: Chebyshev wavelet procedure for solving FLDEs. Acta Appl. Math. 158,
1–10 (2018)
Khodabandehlo, H.R., Shivanian, E., Abbasbandy, S.: Numerical solution of nonlinear delay dif-
ferential equations of fractional variable-order using a novel shifted Jacobi operational matrix.
Eng. Comput. (2021). https://doi.org/10.1007/s00366-021-01422-7
Mastroianni, G., Milovanovic, G.: Interpolation Processes: Basic Theory and Applications. Springer
Science & Business Media, Berlin (2008)
Milovanovic, G.V., Rassias, T.M., Mitrinovic, D.S.: Topics In Polynomials: Extremal Problems.
Inequalities. Zeros. World Scientific, Singapore (1994)
Moayeri, M.M., Hadian Rasanan, A.H., Latifi, S., Parand, K., Rad, J.A.: An efficient space-splitting
method for simulating brain neurons by neuronal synchronization to control epileptic activity.
Eng. Comput. (2020). https://doi.org/10.1007/s00366-020-01086-9
Moayeri, M.M., Rad, J.A., Parand, K.: Desynchronization of stochastically synchronized neural
populations through phase distribution control: a numerical simulation approach. Nonlinear Dyn.
104, 2363–2388 (2021)
Morris, G.R., Abed, K.H.: Mapping a Jacobi iterative solver onto a high-performance heterogeneous
computer. IEEE Trans. Parallel Distrib. Syst. 24, 85–91 (2012)
Nadira, A., Abdessamad, A., Mohamed, B.S.: Regularized Jacobi Wavelets Kernel for support
vector machines. Stat. Optim. Inf. Comput. 7, 669–685 (2019)
Nkengfack, L.C.D., Tchiotsop, D., Atangana, R., Louis-Door, V., Wolf, D.: Classification of EEG
signals for epileptic seizures detection and eye states identification using Jacobi polynomial
transforms-based measures of complexity and least-square support vector machine. Inform. Med.
Unlocked 23, 100536 (2021)
Ozer, S., Chen, C.H., Cirpan, H.A.: A set of new Chebyshev kernel functions for support vector
machine pattern classification. Pattern Recognit. 44, 1435–1447 (2011)
Padierna, L.C., Carpio, M., Rojas-Dominguez, A., Puga, H., Fraire, H.: A novel formulation of
orthogonal polynomial kernel functions for SVM classifiers: the Gegenbauer family. Pattern
Recognit. 84, 211–225 (2018)
Pan, Z.B., Chen, H., You, X. H.: Support vector machine with orthogonal Legendre kernel. In:
International Conference on Wavelet Analysis and Pattern Recognition, pp. 125–130 (2012)
Parand, K., Rad, J.A., Ahmadi, M.: A comparison of numerical and semi-analytical methods for
the case of heat transfer equations arising in porous medium. Eur. Phys. J. Plus 131, 1–15 (2016)
Parand, K., Moayeri, M.M., Latifi, S., Rad, J.A.: Numerical study of a multidimensional dynamic
quantum model arising in cognitive psychology especially in decision making. Eur. Phys. J. Plus
134, 109 (2019)
Ping, Z., Ren, H., Zou, J., Sheng, Y., Bo, W.: Generic orthogonal moments: Jacobi-Fourier moments
for invariant image description. Pattern Recognit. 40, 1245–1254 (2007)
Razzaghi, M., Yousefi, S.: Legendre wavelets method for constrained optimal control problems.
Math. Methods Appl. Sci. 25, 529–539 (2002)
Shojaeizadeh, T., Mahmoudi, M., Darehmiraki, M.: Optimal control problem of advection-
diffusion-reaction equation of kind fractal-fractional applying shifted Jacobi polynomials. Chaos
Solitons Fract. 143, 110568 (2021)
Szegő, G.: Orthogonal Polynomials. American Mathematical Society, Rhode Island (1939)
Tian, M., Wang, W.: Some sets of orthogonal polynomial kernel functions. Appl. Soft Comput. 61,
742–756 (2017)
Upneja, R., Singh, C.: Fast computation of Jacobi-Fourier moments for invariant image recognition.
Pattern Recognit. 48, 1836–1843 (2015)
Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Berlin (2013)
144 A. H. Hadian Rasanan et al.
Yang, W., Zhang, Z., Hong, Y.: State recognition of bolted structures based on quasi-analytic
wavelet packet transform and generalized Gegenbauer support vector machine. In: 2020 IEEE
International Instrumentation and Measurement Technology Conference (I2MTC), pp. 1–6 (2020)
Ye, N., Sun, R., Liu, Y., Cao, L.: Support vector machine with orthogonal Chebyshev kernel. In:
18th International Conference on Pattern Recognition (ICPR’06), vol. 2, pp. 752–755 (2006)
Zhao, J., Yan, G., Feng, B., Mao, W., Bai, J.: An adaptive support vector regression based on a new
sequence of unified orthogonal polynomials. Pattern Recognit. 46, 899–913 (2013)
Part III
Applications of Orthogonal Kernels
Chapter 7
Solving Ordinary Differential Equations by LS-SVM

Mohsen Razzaghi, Simin Shekarpaz, and Alireza Rajabi
Abstract In this chapter, we propose a machine learning method for solving a class
of linear and nonlinear ordinary differential equations (ODEs) which is based on
the least squares-support vector machines (LS-SVM) with collocation procedure.
One of the most important and practical models in this category is Lane-Emden
type equations. By using LS-SVM for solving these types of equations, the solution
is expanded based on rational Legendre functions and the LS-SVM formulation is
presented. Based on this, the linear problems are solved in dual form and a system
of linear algebraic equations is obtained. Finally, by presenting some numerical
examples, the results of the current method are compared with other methods. The
comparison shows that the proposed method is fast and highly accurate with expo-
nential convergence.

Keywords Lane-Emden differential equation · Collocation method · Rational Legendre functions

7.1 Introduction

Differential equations are mathematical equations that can be used to model many physical and engineering problems in real life, such as dynamics
M. Razzaghi
Department of Mathematics and Statistics, Mississippi State University, Starkville, USA
e-mail: [email protected]
S. Shekarpaz (B)
Department of Applied Mathematics, Brown University, Providence, RI 02912, USA
e-mail: [email protected]
A. Rajabi
Department of Computer and Data Science, Faculty of Mathematical Sciences, Shahid Beheshti
University, Tehran, Iran
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 147
J. A. Rad et al. (eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support
Vector Machines, Industrial and Applied Mathematics,
https://doi.org/10.1007/978-981-19-6553-1_7
148 M. Razzaghi et al.
of oscillators, cosmology, study of the solar system, the study of unsteady gases,
fluid dynamics, and many other applications (see, e.g., Liu et al. 2018; Rodrigues
et al. 2018; Anderson et al. 2016; Khoury et al. 2018; Farzaneh-Gord and Rahbari
2016; Lusch et al. 2018; Bristeau et al. 1979; Parand et al. 2011; Parand and Rad
2012; Kazem et al. 2011; Parand et al. 2016; Kazem et al. 2012; Parand et al. 2012,
2017; Abbasbandy et al. 2013). In the meantime, Lane-Emden type equations are
an important class of ordinary differential equations on the semi-infinite domain
which were introduced by Jonathan Homer Lane and by Robert Emden as a model for the
temperature of the sun Lane (1870); Emden (1907). These equations have found many
interesting applications in modeling many physical phenomena such as the theory
of stellar structure, thermal behavior of a spherical cloud of gas, isothermal gas
spheres, and the theory of thermionic currents Chandrasekhar and Chandrasekhar
(1957), Wood (1921). As one of the important applications, it can be said that in
astrophysics, this equation describes the equilibrium density distribution in the self-
gravitating sphere of polytropic isothermal gas.
The general form of the Lane-Emden equation is as follows Hadian Rasanan et al.
(2020), Parand et al. (2010), Omidi et al. (2021):

$$y''(x) + \frac{k}{x}\,y'(x) + f(x, y(x)) = h(x), \qquad y(x_0) = A, \quad y'(x_0) = B, \tag{7.1}$$

where A, B are constants and f (x, y) and h(x) are given functions of x and y.
Some of these equations do not have an exact solution, and as a result of their singularity at x = 0, their numerical solution is a challenge for scientists Parand et al.
(2009, 2010). Many semi-analytical and numerical approaches have been applied to
solve the Lane-Emden equations, which can be presented as follows: In Bender et al. (1989), a new perturbation technique based on an artificial parameter δ was proposed to solve the Lane-Emden equation. Also, a non-perturbative analytical solution of this
equation was derived in Shawagfeh (1993) by the Adomian decomposition method
(ADM). On the other hand, Mandelzweig et al. (2001) used the quasi-linearization
method for solving the standard Lane-Emden equation, and Liao (2003) produced an
analytical framework based on the Adomian decomposition method for Lane-Emden
type equations. The approach based on semi-analytical methods continued as He
(2003) obtained the analytical solutions to the problem by using Ritz’s variational
method. Also, in Wazwaz (2006), the modified decomposition method for the ana-
lytic behavior of nonlinear differential equations was used, and Yildirim et al. (2007)
obtained approximate solutions of a class of Lane-Emden equations by homotopy
perturbation method (HPM). Following this approach, Ramos (2005) solved Lane-
Emden equations by using a linearization method and a series solution has been
suggested in Ramos (2008). Those solutions were obtained by writing this equation
as a Volterra integral equation and assuming that the nonlinearities are smooth. Then,
Dehghan (2006; 2008) proposed the Adomian decomposition method for differential equations with an alternate procedure to overcome the difficulty of singularity.
Also, Aslanov (2008) introduced an improved Adomian decomposition method for
7 Solving Ordinary Differential Equations by LS-SVM 149
non-homogeneous, linear, and nonlinear Lane-Emden type equations. On the other
hand, Dehghan and Shakeri (2008) used an exponential transformation for the Lane-
Emden type equations to overcome the difficulty of a singular point at x = 0 and the
resulting nonsingular problem solved by the variational iteration method (VIM). The
use of two important analytical approaches, namely the homotopy analysis method
(HAM) and HPM, seems to have been so common. For example, Singh et al. (2009)
proposed an efficient analytic algorithm for the Lane-Emden type equations using a
modified homotopy analysis method. Also, Bataineh et al. (2009) proposed a homo-
topy analysis method to obtain the analytical solutions of the singular IVPs of the
Emden-Fowler type, and in Chowdhury and Hashim (2009), Chowdhury et al. applied
an algorithm based on the homotopy perturbation method to solve singular IVPs of
the Emden-Fowler type. However, robust numerical approaches have also been used
for these studies. In Yousefi (2006), a numerical method was provided for solving
the Lane-Emden equations. In that work, by using the integral operator and Legendre wavelet approximation, the problem is converted into an integral equation. Also, Marzban et al. (2008) applied the properties of a hybrid of block-pulse functions together with the Lagrange interpolating polynomials for solving the nonlinear
second-order initial value problems and the Lane-Emden equation. Then, Parand et
al. (2009) presented a numerical technique based on the rational Legendre approach
to solving higher ordinary differential equations such as Lane-Emden. In Parand
et al. (2010), have applied an algorithm for the nonlinear Lane-Emden problems,
using Hermite functions collocation methods. On the other hand, Pandey and Kumar
(2010) presented an efficient and accurate method for solving Lane-Emden type
equations arising in astrophysics using Bernstein polynomials. Recently, in Yüzbaşı
and Sezer (2013), the linear Lane-Emden equations were solved by the Bessel collocation method and the error of the problem was also obtained. More recently, Parand and
Khaleqi (2016) also suggested a rational Chebyshev of the second kind collocation
method to solve the Lane-Emden equation. For analysis of the Lane-Emden equation based on the Lie symmetry approach, we refer the interested reader to Kara and Mahomed (1992, 1993). This important model has been carefully examined from a
simulation point of view with more numerical methods (see Hossayni et al. 2015;
Parand et al. 2013), but due to the weaknesses of the performance of these approaches
in complex dynamic models, one of the current trends for solving and simulating
dynamic models based on ODEs and partial differential equations (PDEs) is neural networks (NNs). In fact, an NN is a machine learning technique for obtaining the numerical solution of higher-order ODEs and PDEs, which has been proposed by many researchers. For example, Lagaris et al. (1998) solved some
well-known ODEs by using a neural network. Also, in Malek and Beidokhti (2006),
a hybrid numerical method based on an artificial neural network and optimization
technique was studied for solving directly higher-order ODEs. More importantly, a
regression-based artificial neural network method was analyzed by Chakraverty and Mall in Mall and Chakraverty (2014) to solve ODEs, where a single-layer Chebyshev
neural network was suggested to solve the Lane-Emden equation. On the other hand,
Mall and Chakraverty (2015) solved Emden-Fowler equations by using Chebyshev
neural network. They also used a Legendre neural network to solve the Lane-Emden
150 M. Razzaghi et al.

equation Mall and Chakraverty (2016). More recently, the partial differential equation Omidi et al. (2021), system, and fractional Hadian Rasanan et al. (2020) versions of the Lane-Emden equation have been solved by orthogonal neural networks.
In this chapter, the LS-SVM algorithm together with the rational Legendre func-
tions as kernel functions is used to solve the Lane-Emden equation. In our proposed
method, the constraints are the collocation form of the residual function. Then the
coefficients of approximate solutions together with the error function are minimized.
To obtain the solution to the minimization problem, the Lagrangian function is used
and the problem is solved in dual form.
The remainder of this chapter is organized as follows. In Sect. 7.2, the LS-SVM
formulation is discussed which is used to solve differential equations. The kernel
trick is also introduced in Sect. 7.3, and their properties together with the operational
matrix of differentiation are presented. In Sect. 7.4, the LS-SVM formulation of the
Lane-Emden equation is given and the problem is solved in dual form. Finally, the
numerical results are presented which show the efficiency of the proposed technique.

7.2 LS-SVM Formulation

We consider the general form of an m-th order initial value problem (IVP), which is
as follows:

$$L[y(x)] - F(x) = 0, \quad x \in [a, c], \qquad y^{(i)}(a) = p_i, \quad i = 0, 1, \ldots, m-1, \tag{7.2}$$

where L represents an ordinary differential operator, i.e.,

$$L[y(x)] = L(x, y(x), y'(x), \ldots, y^{(m)}(x)),$$

and $F(x)$ and $\{p_i\}_{i=0}^{m-1}$ are known. The aim is to find a learning solution to this
equation by using the LS-SVM formulation.
We assume that we have training data $\{x_i, y_i\}_{i=0}^{N}$, in which $x_i \in \mathbb{R}$ is the input data and $y_i \in \mathbb{R}$ is the output data. To solve an ODE, one-dimensional data is sufficient. Then, the solution of our regression problem is approximated by $y(x) = \mathbf{w}^T \varphi(x)$, where $\mathbf{w} = [w_0, \ldots, w_N]^T$ and $\varphi(x) = [\varphi_0(x), \varphi_1(x), \ldots, \varphi_N(x)]^T$ is a vector of arbitrary
basis functions. It is worth mentioning that the basis functions are selected in a way
that they satisfy the initial conditions. The primal LS-SVM model is as follows (see,
e.g., Suykens et al. (2002)):

$$\min_{\mathbf{w},\,\varepsilon}\ \frac{1}{2}\mathbf{w}^T\mathbf{w} + \frac{\gamma}{2}\varepsilon^T\varepsilon \quad \text{subject to} \quad y_i = \mathbf{w}^T\varphi(x_i) + \varepsilon_i, \quad i = 0, 1, \ldots, N, \tag{7.3}$$
where $\varepsilon_i$ is the error of the $i$-th training point, $\mathbf{w} \in \mathbb{R}^h$, $\varphi(\cdot): \mathbb{R} \to \mathbb{R}^h$ is a feature map, and $h$ is the dimension of the feature space. In this formulation, the term $\frac{1}{2}\mathbf{w}^T\mathbf{w}$ can be interpreted as a regularization that yields stable solutions.
The problem with this formulation is that, when solving a differential equation, the values $y_i$ are not available; therefore, we put the differential equation itself into the model: the collocation form of the equation is used as the constraint of our optimization problem.
To solve the optimization problem, the kernel trick is used and the problem is
transformed into a dual form.

7.2.1 Collocation Form of LS-SVM

In the case that LS-SVM is used to solve a linear ODE, the approximate solution is
obtained by solving the following optimization problem Mehrkanoon et al. (2012),
Pakniyat et al. (2021):
$$\min_{\mathbf{w},\,\varepsilon}\ \frac{1}{2}\mathbf{w}^T\mathbf{w} + \frac{\gamma}{2}\varepsilon^T\varepsilon \quad \text{subject to} \quad L[\tilde{y}_i] - F(x_i) = \varepsilon_i, \quad i = 0, 1, \ldots, N, \tag{7.4}$$

where $\varepsilon = [\varepsilon_0, \ldots, \varepsilon_N]^T$ and $\tilde{y}$ is the approximate solution, which is written in terms of the basis functions so as to satisfy the initial conditions. The collocation points $\{x_i\}_{i=0}^{N}$ are chosen to be the roots of the rational Legendre functions in the learning phase.
Now we introduce a dual form of the problem:

Theorem 7.1 (Mehrkanoon et al. (2012)) Suppose that $\varphi_i$ are the basis functions and $\alpha = [\alpha_0, \alpha_1, \ldots, \alpha_N]^T$ are the Lagrange multipliers of the dual form of the minimization problem Eq. 7.4. Then the solution of the minimization problem is

$$\mathbf{w} = M\alpha = \begin{pmatrix} L(\varphi_0(x_0)) & L(\varphi_0(x_1)) & \cdots & L(\varphi_0(x_N)) \\ L(\varphi_1(x_0)) & L(\varphi_1(x_1)) & \cdots & L(\varphi_1(x_N)) \\ \vdots & \vdots & \ddots & \vdots \\ L(\varphi_N(x_0)) & L(\varphi_N(x_1)) & \cdots & L(\varphi_N(x_N)) \end{pmatrix}\begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_N \end{pmatrix}, \qquad \varepsilon = \frac{-\alpha}{\gamma} = \begin{pmatrix} -\alpha_0/\gamma \\ -\alpha_1/\gamma \\ \vdots \\ -\alpha_N/\gamma \end{pmatrix},$$

where
$$\left(M^T M + \frac{1}{\gamma} I\right)\alpha = F, \tag{7.5}$$
and $F = [F(x_0), F(x_1), \ldots, F(x_N)]^T$.
Proof First, we construct the Lagrangian function of our optimization problem as follows:
$$G = \frac{1}{2}\mathbf{w}^T\mathbf{w} + \frac{\gamma}{2}\varepsilon^T\varepsilon - \sum_{i=0}^{N}\alpha_i\left(\sum_{j=0}^{N} w_j\,L(\varphi_j(x_i)) - F_i - \varepsilon_i\right), \tag{7.6}$$

where $\alpha_i \geq 0$ ($i = 0, 1, \ldots, N$) are Lagrange multipliers. To obtain the optimal solution, by using the Karush-Kuhn-Tucker (KKT) optimality conditions Mehrkanoon et al. (2012), Mehrkanoon and Suykens (2013), Cortes and Vapnik (1995), one gets

$$\frac{\partial G}{\partial w_k} = w_k - \sum_{i=0}^{N}\alpha_i\,L(\varphi_k(x_i)) = 0, \qquad \frac{\partial G}{\partial \varepsilon_k} = \gamma\,\varepsilon_k + \alpha_k = 0, \qquad \frac{\partial G}{\partial \alpha_k} = \sum_{j=0}^{N} w_j\,L(\varphi_j(x_k)) - F_k - \varepsilon_k = 0. \tag{7.7}$$

So, the solution is obtained by the following relations:
$$w_k = \sum_{i=0}^{N} M_{ki}\,\alpha_i, \quad M_{ki} = L(\varphi_k(x_i)), \quad k = 0, 1, 2, \ldots, N, \qquad \text{i.e.,} \quad \mathbf{w} = M\alpha, \tag{7.8}$$

and by using the second relation in Eq. 7.7, one obtains
$$\varepsilon_k = \frac{-\alpha_k}{\gamma} \;\Rightarrow\; \varepsilon = \frac{-1}{\gamma}\,\alpha. \tag{7.9}$$

Considering the third equation results in
$$\sum_{j=0}^{N} w_j\,L(\varphi_j(x_k)) - F_k - \varepsilon_k = 0, \qquad (M^T\mathbf{w})_k - F_k - \varepsilon_k = 0 \;\Rightarrow\; M^T\mathbf{w} - \varepsilon = F. \tag{7.10}$$

So, by using Eqs. 7.8–7.10, we have
$$M^T M\alpha + \frac{1}{\gamma}\,\alpha = F \;\Rightarrow\; \left(M^T M + \frac{1}{\gamma} I\right)\alpha = F, \tag{7.11}$$
where $F_i = F(x_i)$; so the vector of Lagrange multipliers $\alpha$ is obtained. □
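Theorem 7.1 can be exercised end to end on a toy problem. The sketch below uses entirely hypothetical choices — the first-order IVP $y' + y = 2x + x^2$ with $y(0) = 0$ and exact solution $y = x^2$, and a monomial basis $\varphi_k(x) = x^{k+1}$ instead of the rational Legendre functions of the next section — to assemble $M$, solve Eq. 7.5, and recover $\mathbf{w} = M\alpha$:

```python
import numpy as np

N, gamma = 4, 1e8
x = np.linspace(0.2, 1.0, N + 1)          # collocation points

# M[k, i] = L(phi_k)(x_i) with L[y] = y' + y and phi_k(x) = x^(k+1),
# a basis that satisfies the initial condition y(0) = 0
M = np.array([[(k + 1) * xi**k + xi**(k + 1) for xi in x]
              for k in range(N + 1)])
F = 2 * x + x**2                          # F(x_i) for y' + y = 2x + x^2

alpha = np.linalg.solve(M.T @ M + np.eye(N + 1) / gamma, F)   # Eq. 7.5
w = M @ alpha                             # w = M alpha (Theorem 7.1)

xt = np.linspace(0.0, 1.0, 50)            # testing points
y_tilde = sum(w[k] * xt**(k + 1) for k in range(N + 1))
err = np.max(np.abs(y_tilde - xt**2))     # error against the exact solution y = x^2
```

Since the exact solution lies in the span of this basis, the remaining error comes only from the regularization term $\alpha/\gamma$ and from roundoff.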
7.3 Rational Legendre Kernels

The kernel function plays an essential role in LS-SVM; therefore, the choice of kernel function is important. Kernel functions can be constructed by using orthogonal polynomials. The operational matrices of orthogonal polynomials are sparse and their derivatives are obtained exactly, which makes the method fast and leads to well-posed systems.
fractional Legendre functions, as well as the Legendre kernel functions are discussed
in Chap. 4, we recommend that readers refer to that chapter of the book for review,
and as a result, in this section, we will focus more on the rational Legendre kernels. In particular, Guo et al. (2000) introduced a new set of rational Legendre functions which are mutually orthogonal in $L^2(0, +\infty)$ with respect to the weight function $w(x) = \frac{2L}{(L+x)^2}$, as follows:
$$RP_n(x) = P_n\!\left(\frac{x-L}{x+L}\right). \tag{7.12}$$

Thus,
$$RP_0(x) = 1, \qquad RP_1(x) = \frac{x-L}{x+L},$$
$$n\,RP_n(x) = (2n-1)\left(\frac{x-L}{x+L}\right)RP_{n-1}(x) - (n-1)\,RP_{n-2}(x), \quad n \geq 2, \tag{7.13}$$

where {Pn (x)} are Legendre polynomials which were defined in Chap 5. These
functions are used to solve problems on semi-infinite domains, and they can also
produce sparse matrices and have a high convergence rate Guo et al. (2000), Parand
and Razzaghi (2004).
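The three-term recurrence of Eq. 7.13 is straightforward to evaluate numerically. A minimal sketch (the value $L = 1$ and the evaluation grid are arbitrary choices):

```python
import numpy as np

def rational_legendre(n, x, L=1.0):
    """Return the list [RP_0(x), ..., RP_n(x)] via the recurrence of Eq. 7.13."""
    u = (x - L) / (x + L)                 # maps [0, inf) onto [-1, 1)
    RP = [np.ones_like(u), u]
    for k in range(2, n + 1):
        RP.append(((2 * k - 1) * u * RP[k - 1] - (k - 1) * RP[k - 2]) / k)
    return RP[: n + 1]

x = np.linspace(0.0, 50.0, 500)
RP = rational_legendre(6, x)
# Because |P_n| <= 1 on [-1, 1], every rational Legendre function is bounded by 1.
assert max(np.abs(r).max() for r in RP) <= 1.0 + 1e-12
```

The final assertion is the boundedness property noted just below: since the argument $(x-L)/(x+L)$ stays in $[-1, 1)$ for $x \geq 0$, the functions inherit the bound of the Legendre polynomials.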
Since the range of Legendre polynomials is $[-1, 1]$, we have $|RP_n(x)| \leq 1$. The operational matrix of the derivative is also a lower Hessenberg matrix which can be calculated as Parand and Razzaghi (2004), Parand et al. (2009):
$$D = \frac{1}{L}(D_1 + D_2), \tag{7.14}$$
where $D_1$ is a tridiagonal matrix
$$D_1 = \mathrm{Diag}\!\left(\frac{7i^2 - i - 2}{2(2i+1)},\ -i,\ \frac{i(i+1)}{2(2i+1)}\right), \quad i = 0, \ldots, n-1,$$
and $D_2 = [d_{ij}]$ such that
$$d_{ij} = \begin{cases} 0, & j \geq i-1, \\ (-1)^{i+j+1}(2j+1), & j < i-1. \end{cases}$$
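The structure of Eq. 7.14 can be assembled directly. The sketch below assumes that Diag(·, ·, ·) lists the sub-, main, and super-diagonal sequences in that order, indexed by $i = 0, \ldots, n-1$ — an assumption, since the text does not spell the convention out — and checks only the lower Hessenberg structure stated above:

```python
import numpy as np

def rational_legendre_diff_matrix(n, L=1.0):
    """Assemble D = (1/L)(D1 + D2) of Eq. 7.14 (Diag convention assumed)."""
    i = np.arange(n, dtype=float)
    sub = (7 * i**2 - i - 2) / (2 * (2 * i + 1))
    sup = i * (i + 1) / (2 * (2 * i + 1))
    D1 = np.diag(sub[1:], k=-1) + np.diag(-i) + np.diag(sup[:-1], k=1)
    D2 = np.zeros((n, n))
    for r in range(n):
        for c in range(r - 1):            # D2 is nonzero only for c < r - 1
            D2[r, c] = (-1) ** (r + c + 1) * (2 * c + 1)
    return (D1 + D2) / L

D = rational_legendre_diff_matrix(8)
```

Everything above the first superdiagonal of the resulting matrix is zero, which is exactly the lower Hessenberg pattern.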
The rational Legendre kernel function for non-vector data $x$ and $z$ is recommended as follows:
$$K(x, z) = \sum_{i=0}^{N} RP_i(x)\,RP_i(z). \tag{7.15}$$
This function is a valid SVM kernel if it satisfies the conditions of the Mercer theorem (see, e.g., Suykens et al. (2002)).

Theorem 7.2 To be a valid SVM kernel, for any finite function $g(x)$, the following integral should always be nonnegative for the given kernel function $K(x, z)$:
$$\iint K(x, z)\,g(x)\,g(z)\,dx\,dz \geq 0. \tag{7.16}$$

Proof Consider a function $g(x)$ and the kernel $K(x, z)$ defined in Eq. 7.15; then, by checking the Mercer condition, one obtains
$$\begin{aligned} \iint K(x, z)g(x)g(z)\,dx\,dz &= \iint \sum_{i=0}^{N} RP_i(x)RP_i(z)\,g(x)g(z)\,dx\,dz \\ &= \sum_{i=0}^{N}\left(\int RP_i(x)g(x)\,dx\right)\left(\int RP_i(z)g(z)\,dz\right) \\ &= \sum_{i=0}^{N}\left(\int RP_i(x)g(x)\,dx\right)^2 \geq 0. \end{aligned} \tag{7.17}$$
Therefore, the proposed function is a valid kernel. □
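A practical numerical counterpart of Theorem 7.2: since $K(x, z) = \varphi(x)^T\varphi(z)$ with $\varphi = (RP_0, \ldots, RP_N)$, the Gram matrix on any finite point set has the form $\Phi^T\Phi$ and must be positive semi-definite. A quick check (the grid and the parameters are arbitrary choices):

```python
import numpy as np

def rp_matrix(x, n, L=1.0):
    """Rows are RP_0, ..., RP_n evaluated on the grid x (recurrence of Eq. 7.13)."""
    u = (x - L) / (x + L)
    RP = np.zeros((n + 1, len(x)))
    RP[0], RP[1] = 1.0, u
    for k in range(2, n + 1):
        RP[k] = ((2 * k - 1) * u * RP[k - 1] - (k - 1) * RP[k - 2]) / k
    return RP

x = np.linspace(0.0, 10.0, 40)
Phi = rp_matrix(x, n=8)
K = Phi.T @ Phi                 # K[i, j] = sum_n RP_n(x_i) RP_n(x_j), i.e. Eq. 7.15
eigs = np.linalg.eigvalsh(K)    # all eigenvalues nonnegative up to roundoff
```

All eigenvalues of `K` are nonnegative up to floating-point roundoff, in agreement with the Mercer condition.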
7.4 Collocation Form of LS-SVM for Lane-Emden Type Equations

In this section, we implement the proposed method to solve the Lane-Emden type
equations. As we said before, these types of equations are an important class of
ordinary differential equations in the semi-infinite domain. The general form of the
Lane-Emden equation is as follows:
$$y''(x) + \frac{k}{x}\,y'(x) + f(x, y(x)) = h(x), \qquad y(x_0) = A, \quad y'(x_0) = B, \tag{7.18}$$

where A, B are constants and f (x, y(x)) and h(x) are given functions of x and y.
Since the function $f(x, y(x))$ can be either linear or nonlinear, in this chapter we assume that it can be rewritten as $f(x, y(x)) = f(x)y(x)$. In this case,
there is no need for any linearization method and the LS-SVM method can be directly
applied to Eq. 7.18. For the nonlinear cases, the quasi-linearization method can be
applied to Eq. 7.18 first and then the solution is approximated by using the LS-SVM
algorithm. Now we can consider the LS-SVM formulation of this equation, which is

$$\min_{\mathbf{w},\,\varepsilon}\ \frac{1}{2}\mathbf{w}^T\mathbf{w} + \frac{\gamma}{2}\varepsilon^T\varepsilon \quad \text{subject to} \quad \frac{d^2\tilde{y}_i}{dx^2} + \frac{k}{x_i}\,\frac{d\tilde{y}_i}{dx} + f(x_i)\,\tilde{y}(x_i) - h(x_i) = \varepsilon_i, \quad k > 0, \tag{7.19}$$

where $\tilde{y}_i = \tilde{y}(x_i)$ is the approximate solution. We use the Lagrangian function, and then
$$G = \frac{1}{2}\mathbf{w}^T\mathbf{w} + \frac{\gamma}{2}\varepsilon^T\varepsilon - \sum_{i=0}^{N}\alpha_i\left(\frac{d^2\tilde{y}_i}{dx^2} + \frac{k}{x_i}\,\frac{d\tilde{y}_i}{dx} + f(x_i)\,\tilde{y}(x_i) - h(x_i) - \varepsilon_i\right). \tag{7.20}$$
So, by expanding the solution in terms of the rational Legendre kernels, one obtains
$$G = \frac{1}{2}\sum_{i=0}^{N} w_i^2 + \frac{\gamma}{2}\sum_{i=0}^{N}\varepsilon_i^2 - \sum_{i=0}^{N}\alpha_i\left(\sum_{j=0}^{N} w_j\,\varphi_j''(x_i) + \frac{k}{x_i}\sum_{j=0}^{N} w_j\,\varphi_j'(x_i) + f(x_i)\sum_{j=0}^{N} w_j\,\varphi_j(x_i) - h(x_i) - \varepsilon_i\right), \tag{7.21}$$

where $\{\varphi_j\}$ are the rational Legendre kernels. Then, by employing the Karush-Kuhn-Tucker (KKT) optimality conditions for $0 \leq l \leq N$, we conclude that
$$\begin{aligned} \frac{\partial G}{\partial w_l} &= w_l - \sum_{i=0}^{N}\alpha_i\left(\varphi_l''(x_i) + \frac{k}{x_i}\,\varphi_l'(x_i) + f(x_i)\,\varphi_l(x_i)\right) = 0,\\ \frac{\partial G}{\partial \alpha_l} &= \sum_{j=0}^{N} w_j\,\varphi_j''(x_l) + \frac{k}{x_l}\sum_{j=0}^{N} w_j\,\varphi_j'(x_l) + f(x_l)\sum_{j=0}^{N} w_j\,\varphi_j(x_l) - h(x_l) - \varepsilon_l = 0,\\ \frac{\partial G}{\partial \varepsilon_l} &= \gamma\,\varepsilon_l + \alpha_l = 0, \end{aligned} \tag{7.22}$$
and, by using the above relations,
$$w_l = \sum_{i=0}^{N}\alpha_i\left(\varphi_l''(x_i) + \frac{k}{x_i}\,\varphi_l'(x_i) + f(x_i)\,\varphi_l(x_i)\right) = \sum_{i=0}^{N}\alpha_i\,\psi_l(x_i), \tag{7.23}$$

where $\psi_l(x_i) = \varphi_l''(x_i) + \frac{k}{x_i}\,\varphi_l'(x_i) + f(x_i)\,\varphi_l(x_i)$. Now, we can show the matrix form as follows:
$$\begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_N \end{bmatrix} = \begin{bmatrix} \psi_0(x_0) & \psi_0(x_1) & \psi_0(x_2) & \cdots & \psi_0(x_N) \\ \psi_1(x_0) & \psi_1(x_1) & \psi_1(x_2) & \cdots & \psi_1(x_N) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \psi_N(x_0) & \psi_N(x_1) & \psi_N(x_2) & \cdots & \psi_N(x_N) \end{bmatrix}\begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_N \end{bmatrix},$$
which can be summarized as $\mathbf{w} = M\alpha$, where $M_{li} = \psi_l(x_i)$, $0 \leq l, i \leq N$.
By using the third relation in Eq. 7.22,
$$\varepsilon_l = \frac{-\alpha_l}{\gamma}, \quad \forall\, l, \tag{7.24}$$

and substituting $w_j$ in the second equation of Eq. 7.22 results in
$$\sum_{j=0}^{N}\left(\sum_{i=0}^{N}\alpha_i\,\psi_j(x_i)\right)\psi_j(x_l) + \frac{\alpha_l}{\gamma} = h(x_l), \quad 0 \leq l \leq N, \tag{7.25}$$

and in matrix form, one gets
$$\begin{bmatrix} \psi_0(x_0) & \psi_1(x_0) & \cdots & \psi_N(x_0) \\ \psi_0(x_1) & \psi_1(x_1) & \cdots & \psi_N(x_1) \\ \vdots & \vdots & \ddots & \vdots \\ \psi_0(x_N) & \psi_1(x_N) & \cdots & \psi_N(x_N) \end{bmatrix}\begin{bmatrix} \psi_0(x_0) & \psi_0(x_1) & \cdots & \psi_0(x_N) \\ \psi_1(x_0) & \psi_1(x_1) & \cdots & \psi_1(x_N) \\ \vdots & \vdots & \ddots & \vdots \\ \psi_N(x_0) & \psi_N(x_1) & \cdots & \psi_N(x_N) \end{bmatrix}\begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_N \end{bmatrix} + \frac{1}{\gamma}\begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_N \end{bmatrix} = \begin{bmatrix} h_0 \\ h_1 \\ \vdots \\ h_N \end{bmatrix},$$
which can be written as $\left(M^T M + \frac{1}{\gamma} I\right)\alpha = \mathbf{h}$, where $I = I_{(N+1)\times(N+1)}$, $h_i = h(x_i)$, and $\mathbf{h} = [h_0, h_1, \ldots, h_N]^T$.
Now, we need to define the derivatives of the kernel function $K(x_l, x_i) = [\varphi(x_l)]^T\varphi(x_i)$. Making use of Mercer's Theorem Vapnik (1998); Lázaro et al. (2005), derivatives of the feature map can be written in terms of derivatives of the kernel function. Let us consider the following differential operator, which will be used in this section:
$$\nabla^{m,n} = \frac{\partial^{m+n}}{\partial u^m\,\partial v^n}.$$
We define $\chi^{(m,n)}(u, v) = \nabla^{m,n}K(u, v)$ and $\chi^{(m,n)}_{l,i} = \nabla^{m,n}K(u, v)\big|_{u=x_l,\,v=x_i}$, and Eq. 7.25 is written as follows:


$$h_l = \sum_{i=0}^{N}\alpha_i\left[\chi^{(2,2)}_{l,i} + \frac{k}{x_l}\,\chi^{(1,2)}_{l,i} + f_l\,\chi^{(0,2)}_{l,i} + \frac{k}{x_i}\,\chi^{(2,1)}_{l,i} + \frac{k^2}{x_i x_l}\,\chi^{(1,1)}_{l,i} + \frac{k}{x_i}\,f_l\,\chi^{(0,1)}_{l,i} + f_i\,\chi^{(2,0)}_{l,i} + f_i\,\frac{k}{x_l}\,\chi^{(1,0)}_{l,i} + f_i\,f_l\,\chi^{(0,0)}_{l,i}\right] + \frac{\alpha_l}{\gamma}, \quad 0 \leq l \leq N. \tag{7.26}$$
So we calculate $\{\alpha_i\}_{0\leq i\leq N}$, and by substituting into Eq. 7.23, $\{w_l\}_{0\leq l\leq N}$ is computed; then
$$\tilde{y}(x) = \sum_{i=0}^{N}\alpha_i\left(\chi^{(2,0)}(x_i, x) + \frac{k}{x_i}\,\chi^{(1,0)}(x_i, x) + f_i\,\chi^{(0,0)}(x_i, x)\right) \tag{7.27}$$
will be the dual form of the approximate solution of the Lane-Emden equation.

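The scheme of this section can be sketched numerically on the linear problem of Test case 3 below ($y'' + \frac{2}{x}y' + y = 6 + 12x + x^2 + x^3$, $y(0) = y'(0) = 0$, exact solution $y = x^2 + x^3$). For illustration only, the rational Legendre kernels are replaced by the hypothetical basis $\varphi_j(x) = x^{j+2}$, which satisfies both initial conditions; then $\psi_j(x) = \varphi_j'' + \frac{2}{x}\varphi_j' + \varphi_j = (j+2)(j+3)x^j + x^{j+2}$:

```python
import numpy as np

N, gamma = 4, 1e8
x = np.linspace(0.5, 2.0, N + 1)          # collocation points, away from x = 0

# psi_j(x_i) for phi_j = x^(j+2), k = 2 and f(x) = 1
M = np.array([[(j + 2) * (j + 3) * xi**j + xi**(j + 2) for xi in x]
              for j in range(N + 1)])     # M[j, i] = psi_j(x_i)
h = 6 + 12 * x + x**2 + x**3              # right-hand side at the collocation points

alpha = np.linalg.solve(M.T @ M + np.eye(N + 1) / gamma, h)   # (M^T M + I/gamma) alpha = h
w = M @ alpha                             # w = M alpha

xt = np.linspace(0.5, 2.0, 50)            # testing points
y_tilde = sum(w[j] * xt**(j + 2) for j in range(N + 1))
err = np.max(np.abs(y_tilde - (xt**2 + xt**3)))   # testing error vs exact solution
```

Because the exact solution lies in the span of the first two basis functions, the recovered coefficient vector is close to $(1, 1, 0, 0, 0)$, and the residual is governed by the regularization term $\alpha/\gamma$.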
7.5 Numerical Examples

In this section, the collocation form of the least squares-support vector regression is applied for solving various forms of Lane-Emden type equations based on the rational Legendre kernels.
The value of the regularization parameter γ affects the performance of the LS-SVM model: in our experience, a larger regularization parameter results in a smaller error. Therefore, the chosen value for γ is 10^8 in the presented examples, which is obtained by trial and error.
For solving this problem by using the proposed method, we train our algorithm by using the shifted roots of Legendre polynomials, and the testing points are considered to be a set of equidistant points on the domain. Then the training and testing errors are computed, which are defined by e = y − ỹ. In order to show the efficiency and capability of the proposed method, the numerical approximations are compared with other published results. The convergence of our method is also illustrated by the numerical results.

Test case 1: Ramos (2005), Yıldırım and Öziş (2009), Chowdhury and Hashim (2007). Let us consider $f(x, y) = -2(2x^2 + 3)y$, $k = 2$, $h(x) = 0$, $A = 1$, and $B = 0$ in Eq. 7.18; then the linear Lane-Emden equation is as follows:
$$y''(x) + \frac{2}{x}\,y'(x) - 2(2x^2 + 3)\,y(x) = 0, \quad x \geq 0, \qquad y(0) = 1, \quad y'(0) = 0. \tag{7.28}$$
The exact solution is $y(x) = e^{x^2}$. This type of equation has been solved with
linearization, VIM, and HPM methods (see, e.g., Ramos (2005), Yıldırım and Öziş
(2009), Chowdhury and Hashim (2007)).
By using the proposed method, the numerical results of this test case in [0, 1]
have been obtained which are depicted in Fig. 7.1 with 30 training points. The results
and related training error function with 180 training points for solving the problem
in [0, 2] are also shown in Fig. 7.2.
The testing errors obtained by using 50 equidistant points for solving the problem on the intervals [0, 1] and [0, 2] are displayed in Fig. 7.3a, b, where the maximum testing error in this example is 2.23 × 10^-13 and 4.46 × 10^-14, respectively. Moreover, the absolute errors for
arbitrary testing data have been computed and presented in Table 7.1 for N = 180
and the optimal value of L = 4.59.
The maximum norm of testing errors with different numbers of basis functions
(training points) has also been presented in Table 7.2, which indicates the convergence
of the method for solving this kind of linear Lane-Emden equation.
Test case 2: Ramos (2005), Chowdhury and Hashim (2009), Bataineh et al. (2009), Zhang et al. (2006). Considering $f(x, y) = xy$, $h(x) = x^5 - x^4 + 44x^2 - 30x$, $k = 8$, $A = 0$, and $B = 0$ in Eq. 7.18, we have the linear Lane-Emden equation
$$y''(x) + \frac{8}{x}\,y'(x) + x\,y(x) = x^5 - x^4 + 44x^2 - 30x, \quad x \geq 0, \qquad y(0) = 0, \quad y'(0) = 0, \tag{7.29}$$

which has the exact solution $y(x) = x^4 - x^3$ and has been solved with linearization, HPM, HAM, and two-step ADM (TSADM) methods (see, e.g., (Ramos, 2005;
Fig. 7.1 a Numerical results for training points in [0, 1]. b Obtained training errors (Test case 1)
Fig. 7.2 a Numerical results for training points in [0, 2]. b Obtained training errors (Test case 1)
Fig. 7.3 Error function for testing points a in [0, 1] with M = 50 and N = 30 and b in [0, 2] with
M = 50 and N = 180 (Test case 1)

Chowdhury and Hashim, 2009; Bataineh et al., 2009; Zhang et al., 2006)). By apply-
ing Eqs. 7.26–7.27, the approximate solutions are calculated. The numerical solu-
tions together with the training error function by using 30 points in [0, 10] have been
presented in Fig. 7.4.
The proposed algorithm is also tested with some arbitrary points in Table 7.3 for
N = 60 and L = 18. Moreover, in Fig. 7.5, the error function in equidistant testing
data has been plotted. The maximum norm of error by using 50 testing points is
7.77 × 10^-12.
In Table 7.4, the norm of testing error for different values of N has been obtained.
We can see that, with increasing the number of training points, the testing error
decreases, which shows the good performance and convergence of our algorithm to
solve this example.
Test case 3: Ramos (2005), Yıldırım and Öziş (2009), Chowdhury and Hashim
(2007), Zhang et al. (2006). In this test case, we consider $f(x, y) = y$, $h(x) = 6\, +$
Table 7.1 The absolute errors of the present method for testing points in [0, 2] with N = 180 and
L = 4.59 (Test case 1)
Testing data Error Exact value
0.00 0.00000 1.000000000
0.01 2.2730 × 10^-17 1.000100005
0.02 3.6336 × 10^-18 1.000400080
0.05 1.0961 × 10^-16 1.002503127
0.10 1.4026 × 10^-16 1.010050167
0.20 1.1115 × 10^-15 1.040810774
0.50 7.4356 × 10^-16 1.284025416
0.70 2.8378 × 10^-15 1.632316219
0.80 1.5331 × 10^-15 1.896480879
0.90 8.3772 × 10^-15 2.247907986
1.00 1.8601 × 10^-14 2.718281828
1.10 4.0474 × 10^-15 3.353484653
1.20 2.6672 × 10^-14 4.220695816
1.50 3.9665 × 10^-14 9.487735836
1.70 9.3981 × 10^-15 17.99330960
1.80 4.1961 × 10^-14 25.53372174
1.90 3.0212 × 10^-14 36.96605281
2.00 3.6044 × 10^-16 54.59815003
Table 7.2 Maximum norm of errors for testing data in [0, 2] with M = 50 and different values of
N (Test case 1)
N Error norm N Error norm
12 1.41 × 10^-1 80 2.79 × 10^-11
20 2.51 × 10^-4 100 3.03 × 10^-12
30 1.31 × 10^-6 120 6.41 × 10^-13
40 4.20 × 10^-8 140 2.15 × 10^-13
50 5.69 × 10^-9 150 1.30 × 10^-13
60 4.77 × 10^-10 180 4.46 × 10^-14
$12x + x^2 + x^3$, $k = 2$, $A = 0$, and $B = 0$ in the general form of the Lane-Emden equation; then we have
$$y''(x) + \frac{2}{x}\,y'(x) + y(x) = 6 + 12x + x^2 + x^3, \quad x \geq 0, \qquad y(0) = 0, \quad y'(0) = 0, \tag{7.30}$$
Fig. 7.4 a Numerical results for training points in [0, 10]. b Obtained training errors (Test case 2)
Table 7.3 The absolute errors of the present method for testing points with N = 60 and L = 18
(Test case 2)
Testing data Error Exact value
0.00 0 0.0000000000
0.01 4.8362 × 10^-16 −0.0000009900
0.10 1.4727 × 10^-15 −0.0009000000
0.50 1.7157 × 10^-15 −0.0625000000
1.00 7.8840 × 10^-15 0.0000000000
2.00 2.7002 × 10^-15 8.000000000
3.00 2.4732 × 10^-14 54.00000000
4.00 2.2525 × 10^-14 192.0000000
5.00 4.1896 × 10^-14 500.0000000
6.00 6.3632 × 10^-15 1080.000000
7.00 5.4291 × 10^-14 2058.000000
8.00 7.0818 × 10^-14 3584.00000
9.00 1.0890 × 10^-14 5832.00000
10.00 6.5032 × 10^-16 9000.000000
which has the exact solution $y(x) = x^2 + x^3$. This example has also been solved in
Ramos (2005), Yıldırım and Öziş (2009), Chowdhury and Hashim (2007), and Zhang
et al. (2006) with linearization, VIM, HPM, and TSADM methods, respectively.
The proposed method is used and the numerical solutions of this example are
obtained in 30 training points which can be seen in Fig. 7.6. It should be noted that
the optimal value of L = 14 has been used in this example. The testing error function
is also shown in Fig. 7.7 with 50 equidistant points. In Table 7.5, the numerical results
in arbitrary testing data with N = 60 have been reported which show the efficiency
of the LS-SVM model for solving this kind of problem. The maximum norm of
162 M. Razzaghi et al.

Fig. 7.5 Error function for 50 equidistant testing points with N = 30 and L = 18 (Test case 2)

Table 7.4 Maximum norm of testing errors obtained for M = 50 and L = 18 with different values
of N (Test case 2)
N Error norm
8    3.78 × 10^-1
12   2.14 × 10^-4
20   1.31 × 10^-9
30   7.77 × 10^-12
40   1.32 × 10^-12
50   2.30 × 10^-13
60   7.08 × 10^-14


Fig. 7.6 a Numerical results with training points in [0, 10]. b obtained training errors (Test
case 3)

Fig. 7.7 Error function in equidistant testing points with N = 30 and L = 14 (Test case 3)

Table 7.5 The absolute errors of the present method in testing points with N = 60 and L = 14
(Test case 3)
Testing data Error Exact value
0.00   2.9657 × 10^-28  0.0000000000
0.01   3.6956 × 10^-17  0.0001010000
0.10   2.1298 × 10^-16  0.0110000000
0.50   6.8302 × 10^-17  0.3750000000
1.00   8.2499 × 10^-17  2.0000000000
2.00   8.3684 × 10^-17  12.000000000
3.00   6.7208 × 10^-17  36.000000000
4.00   2.2338 × 10^-17  80.000000000
5.00   1.5048 × 10^-16  150.00000000
6.00   3.9035 × 10^-17  252.00000000
7.00   1.3429 × 10^-16  392.00000000
8.00   2.4277 × 10^-17  576.00000000
9.00   4.1123 × 10^-17  810.00000000
10.00  6.5238 × 10^-18  1100.0000000

testing errors with different values of N and M = 50 is recorded in Table 7.6; hence,
the convergence of the LS-SVM model can be concluded.
Test case 4: Hadian Rasanan et al. (2020), Omidi et al. (2021) (Standard Lane-Emden
equation) Considering f(x, y) = y^m, k = 2, h(x) = 0, A = 1, and B = 0 in
Eq. 7.18, the standard Lane-Emden equation is

$$
y''(x) + \frac{2}{x}\, y'(x) + y^m(x) = 0, \quad x \ge 0, \qquad (7.31)
$$
$$
y(0) = 1, \quad y'(0) = 0,
$$

Table 7.6 Maximum norm of testing errors for M = 50 and L = 14 with different values of N
(Test case 3)
N Error norm
8    4.88 × 10^-2
12   5.85 × 10^-5
20   9.15 × 10^-11
30   4.42 × 10^-14
40   2.61 × 10^-15
50   3.89 × 10^-16
60   1.82 × 10^-16


Fig. 7.8 Numerical results with training points in [0, 10] (a) and obtained training errors (b) for
m = 0 (Test case 4)

where m ≥ 0 is a constant. Substituting m = 0, 1, and 5 into Eq. 7.31 leads to the
following exact solutions:

$$
y(x) = 1 - \frac{1}{3!}\, x^2, \qquad y(x) = \frac{\sin(x)}{x}, \qquad y(x) = \left(1 + \frac{x^2}{3}\right)^{-1/2}, \qquad (7.32)
$$
respectively. By applying the LS-SVM formulation to the standard Lane-Emden
equation, the approximate solutions are calculated. The numerical solutions of this
example for m = 0 with 30 training points, together with the error function, are
shown in Fig. 7.8. To test the algorithm, the error function obtained at 50 equidistant
points with 30 training points is presented in Fig. 7.9. Moreover, the numerical
approximations for arbitrary testing data are reported in Table 7.7, which shows the
accuracy of the proposed method.
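The closed forms in Eq. 7.32 are easy to verify numerically; a small self-contained check (pure Python, illustrative only — the derivatives below are computed by hand and are not part of the chapter's method):

```python
import math

def residual(y, dy, d2y, m, x):
    """Residual of the standard Lane-Emden equation y'' + (2/x) y' + y**m at x."""
    return d2y(x) + (2.0 / x) * dy(x) + y(x) ** m

# Exact solutions of Eq. 7.31 for m = 0, 1, 5 with hand-computed derivatives.
cases = {
    0: (lambda x: 1 - x**2 / 6,
        lambda x: -x / 3,
        lambda x: -1 / 3),
    1: (lambda x: math.sin(x) / x,
        lambda x: math.cos(x) / x - math.sin(x) / x**2,
        lambda x: -math.sin(x) / x - 2 * math.cos(x) / x**2 + 2 * math.sin(x) / x**3),
    5: (lambda x: (1 + x**2 / 3) ** -0.5,
        lambda x: -(x / 3) * (1 + x**2 / 3) ** -1.5,
        lambda x: -(1 / 3) * (1 + x**2 / 3) ** -1.5 + (x**2 / 3) * (1 + x**2 / 3) ** -2.5),
}

worst = max(abs(residual(y, dy, d2y, m, x))
            for m, (y, dy, d2y) in cases.items()
            for x in (0.5, 1.0, 2.0, 5.0))
print(f"largest Lane-Emden residual over the sample points: {worst:.2e}")
```

The m = 1 solution at x = 1 is sin(1) ≈ 0.8414709848, matching the exact value reported in Table 7.8.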

Fig. 7.9 Error function in equidistant testing points with m = 0, N = 30, and L = 30 (Test case 4)

Table 7.7 The absolute errors of proposed method in testing points with m = 0, N = 60, and
L = 30 (Test case 4)
Testing data Error Exact value
0     0.00000           1.00000000
0.1   1.2870 × 10^-22   0.99833333
0.5   4.3059 × 10^-22   0.95833333
1.0   7.0415 × 10^-22   0.83333333
5.0   1.1458 × 10^-20   −3.16666666
6.0   7.4118 × 10^-21   −5.00000000
6.8   1.5090 × 10^-20   −6.70666666

The numerical results for m = 1 are presented in Fig. 7.10 with 30 training points,
together with the related error function. The testing error function is displayed in
Fig. 7.11, where the maximum norm of error equals 1.56 × 10^-12. The numerical
results for arbitrary values of testing data are also shown in Table 7.8 for m = 1 and
N = 60.
For m = 0, 1, the maximum norms of error for different numbers of training points
are reported in Table 7.9, which shows the convergence of the method.


Fig. 7.10 Numerical results with training points in [0, 10] (a) and obtained training errors (b) for m = 1 (Test case 4)

Fig. 7.11 Error function in equidistant testing points with m = 1, N = 30, and L = 30 (Test case 4)

Table 7.8 The absolute errors of the present method in testing points with m = 1, N = 60, and
L = 30 (Test case 4)
Testing data Error Exact value
0     1.80 × 10^-32   1.0000000000
0.1   6.28 × 10^-21   0.9983341665
0.5   1.52 × 10^-20   0.9588510772
1.0   4.27 × 10^-20   0.8414709848
5.0   2.86 × 10^-19   −0.1917848549
6.0   3.71 × 10^-19   −0.0465692497
6.8   1.52 × 10^-19   0.0726637280

Table 7.9 Maximum norm of testing errors for M = 50 and L = 30 with different values of N
(Test case 4)
N Error norm (m = 0) Error norm (m = 1)
8    7.99 × 10^-7    9.42 × 10^-3
12   4.30 × 10^-9    6.10 × 10^-4
20   6.78 × 10^-18   2.60 × 10^-9
30   8.46 × 10^-19   1.56 × 10^-12
40   8.71 × 10^-20   2.91 × 10^-15
50   5.81 × 10^-20   1.73 × 10^-17
60   1.54 × 10^-20   3.72 × 10^-19

7.6 Conclusion

In this chapter, we developed the least squares support vector machine model for
solving various forms of Lane-Emden type equations. The collocation LS-SVM
formulation was applied and the problem was solved in dual form. The rational
Legendre functions were employed to construct the kernel function because of their
good properties for approximating functions on semi-infinite domains.
We used the shifted roots of the Legendre polynomial for training our algorithm
and equidistant points as the testing points. The numerical results at the training
and testing points show the accuracy of the proposed model. Moreover, the
exponential convergence of the method was verified by varying the numbers of
training points and basis functions: as the number of training points increases, the
norm of the errors decreases exponentially.
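This exponential decay can be checked directly against the tabulated errors; a quick least-squares fit of log10(error) versus N, using the Table 7.4 data (Test case 2), is sketched below (pure Python, illustrative only):

```python
import math

# (N, maximum testing-error norm) pairs taken from Table 7.4 (Test case 2).
data = [(8, 3.78e-1), (12, 2.14e-4), (20, 1.31e-9), (30, 7.77e-12),
        (40, 1.32e-12), (50, 2.30e-13), (60, 7.08e-14)]

# Least-squares line log10(err) ~ a + b*N; a clearly negative slope b means
# the error shrinks by a roughly constant factor as N grows.
xs = [n for n, _ in data]
ys = [math.log10(e) for _, e in data]
k = len(data)
xbar, ybar = sum(xs) / k, sum(ys) / k
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
    sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar
print(f"fitted decay: error ~ 10^({a:.2f} {b:+.3f} N)")
```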

References

Abbasbandy, S., Modarrespoor, D., Parand, K., Rad, J.A.: Analytical solution of the transpira-
tion on the boundary layer flow and heat transfer over a vertical slender cylinder. Quaestiones
Mathematicae 36, 353–380 (2013)
Anderson, D., Yunes, N., Barausse, E.: Effect of cosmological evolution on Solar System constraints
and on the scalarization of neutron stars in massless scalar-tensor theories. Phys. Rev. D 94,
104064 (2016)
Aslanov, A.: Determination of convergence intervals of the series solutions of Emden-Fowler equa-
tions using polytropes and isothermal spheres. Phys. Lett. A 372, 3555–3561 (2008)
Aslanov, A.: A generalization of the Lane-Emden equation. Int. J. Comput. Math. 85, 1709–1725
(2008)
Bataineh, A.S., Noorani, M.S.M., Hashim, I.: Homotopy analysis method for singular IVPs of
Emden-Fowler type. Commun. Nonlinear Sci. Numer. Simul. 14, 1121–1131 (2009)
Bender, C.M., Milton, K.A., Pinsky, S.S., Simmons, L.M., Jr.: A new perturbative approach to
nonlinear problems. J. Math. Phys. 30, 1447–1455 (1989)

Bristeau, M.O., Pironneau, O., Glowinski, R., Periaux, J., Perrier, P.: On the numerical solution of
nonlinear problems in fluid dynamics by least squares and finite element methods (I) least square
formulations and conjugate gradient solution of the continuous problems. Comput. Methods
Appl. Mech. Eng. 17, 619–657 (1979)
Chandrasekhar, S.: An Introduction to the Study of Stellar Structure, vol. 2. Courier Corporation,
North Chelmsford (1957)
Chowdhury, M.S.H., Hashim, I.: Solutions of a class of singular second-order IVPs by Homotopy-
perturbation method. Phys. Lett. A 365, 439–447 (2007)
Chowdhury, M.S.H., Hashim, I.: Solutions of Emden-Fowler equations by Homotopy-perturbation
method. Nonlinear Anal. Real World Appl. 10, 104–115 (2009)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)
Dehghan, M., Shakeri, F.: The use of the decomposition procedure of Adomian for solving a delay
differential equation arising in electrodynamics. Phys. Scr. 78, 065004 (2008)
Dehghan, M., Shakeri, F.: Approximate solution of a differential equation arising in astrophysics
using the variational iteration method. New Astron. 13, 53–59 (2008)
Dehghan, M., Tatari, M.: The use of Adomian decomposition method for solving problems in
calculus of variations. Math. Probl. Eng. 2006, 1–12 (2006)
Emden, R.: Gaskugeln: Anwendungen der mechanischen Warmetheorie auf kosmologische und
meteorologische Probleme, BG Teubner (1907)
Farzaneh-Gord, M., Rahbari, H.R.: Unsteady natural gas flow within pipeline network, an analytical
approach. J. Nat. Gas Sci. Eng. 28, 397–409 (2016)
Guo, B.Y., Shen, J., Wang, Z.Q.: A rational approximation and its applications to differential equa-
tions on the half line. J. Sci. Comput. 15, 117–147 (2000)
Hadian Rasanan, A.H., Rahmati, D., Gorgin, S., Parand, K.: A single layer fractional orthogonal
neural network for solving various types of Lane-Emden equation. New Astron. 75, 101307
(2020)
He, J.H.: Variational approach to the Lane-Emden equation. Appl. Math. Comput. 143, 539–541
(2003)
Horedt, G.P.: Polytropes: Applications in Astrophysics and Related Fields. Kluwer Academic Pub-
lishers, New York (2004)
Hossayni, S.A., Rad, J.A., Parand, K., Abbasbandy, S.: Application of the exact operational matrices
for solving the Emden-Fowler equations, arising in astrophysics. Int. J. Ind. Math. 7, 351–374
(2015)
Kara, A.H., Mahomed, F.M.: Equivalent Lagrangians and the solution of some classes of non-linear
equations. Int. J. Non-Linear Mech. 27, 919–927 (1992)
Kara, A.H., Mahomed, F.M.: A note on the solutions of the Emden-Fowler equation. Int. J. Non-
Linear Mech. 28, 379–384 (1993)
Kazem, S., Rad, J.A., Parand, K., Abbasbandy, S.: A new method for solving steady flow of a third-
grade fluid in a porous half space based on radial basis functions. Zeitschrift für Naturforschung
A 66, 591–598 (2011)
Kazem, S., Rad, J.A., Parand, K., Shaban, M., Saberi, H.: The numerical study on the unsteady flow
of gas in a semi-infinite porous medium using an RBF collocation method. Int. J. Comput. Math.
89, 2240–2258 (2012)
Khoury, J., Sakstein, J., Solomon, A.R.: Superfluids and the cosmological constant problem. J.
Cosmol. Astropart. Phys. 2018, 024 (2018)
Lagaris, I.E., Likas, A., Fotiadis, D.I.: Artificial neural networks for solving ordinary and partial
differential equations. IEEE Trans. Neural Netw. 9, 987–1000 (1998)
Lane, H.J.: On the theoretical temperature of the sun, under the hypothesis of a gaseous mass
maintaining its volume by its internal heat, and depending on the laws of gases as known to
terrestrial experiment. Am. J. Sci. 2, 57–74 (1870)
Lázaro, M., Santamaría, I., Pérez-Cruz, F., Artés-Rodríguez, A.: Support vector regression for the
simultaneous learning of a multivariate function and its derivatives. Neurocomputing 69, 42–61
(2005)

Liao, S.: A new analytic algorithm of Lane-Emden type equations. Appl. Math. Comput. 142, 1–16
(2003)
Liu, Q.X., Liu, J.K., Chen, Y.M.: A second-order scheme for nonlinear fractional oscillators based
on Newmark-β algorithm. J. Comput. Nonlinear Dyn. 13, 084501 (2018)
Lusch, B., Kutz, J.N., Brunton, S.L.: Deep learning for universal linear embeddings of nonlinear
dynamics. Nat. Commun. 9, 1–10 (2018)
Malek, A., Beidokhti, R.S.: Numerical solution for high order differential equations using a hybrid
neural network-optimization method. Appl. Math. Comput. 183, 260–271 (2006)
Mall, S., Chakraverty, S.: Chebyshev neural network based model for solving Lane-Emden type
equations. Appl. Math. Comput. 247, 100–114 (2014)
Mall, S., Chakraverty, S.: Numerical solution of nonlinear singular initial value problems of Emden-
Fowler type using Chebyshev Neural Network method. Neurocomputing 149, 975–982 (2015)
Mall, S., Chakraverty, S.: Application of Legendre neural network for solving ordinary differential
equations. Appl. Soft Comput. 43, 347–356 (2016)
Mandelzweig, V.B., Tabakin, F.: Quasilinearization approach to nonlinear problems in physics with
application to nonlinear ODEs. Comput. Phys. Commun 141, 268–281 (2001)
Marzban, H.R., Tabrizidooz, H.R., Razzaghi, M.: Hybrid functions for nonlinear initial-value prob-
lems with applications to Lane-Emden type equations. Phys. Lett. A 372, 5883–5886 (2008)
Mehrkanoon, S., Suykens, J.A.: LS-SVM based solution for delay differential equations. J. Phys.:
Conf. Ser. 410, 012041 (2013)
Mehrkanoon, S., Falck, T., Suykens, J.A.: Approximate solutions to ordinary differential equations
using least squares support vector machines. IEEE Trans. Neural Netw. Learn. Syst. 23, 1356–
1367 (2012)
Omidi, M., Arab, B., Hadian Rasanan, A.H., Rad, J.A., Parand, K.: Learning nonlinear dynamics
with behavior ordinary/partial/system of the differential equations: looking through the lens of
orthogonal neural networks. Eng. Comput. 1–20 (2021)
Pakniyat, A., Parand, K., Jani, M.: Least squares support vector regression for differential equations
on unbounded domains. Chaos Solitons Fract. 151, 111232 (2021)
Parand, K., Nikarya, M., Rad, J.A., Baharifard, F.: A new reliable numerical algorithm based on the
first kind of Bessel functions to solve Prandtl-Blasius laminar viscous flow over a semi-infinite
flat plate. Zeitschrift für Naturforschung A 67, 665–673 (2012)
Parand, K., Khaleqi, S.: The rational Chebyshev of second kind collocation method for solving a
class of astrophysics problems. Eur. Phys. J. Plus 131, 1–24 (2016)
Parand, K., Pirkhedri, A.: Sinc-collocation method for solving astrophysics equations. New Astron.
15, 533–537 (2010)
Parand, K., Rad, J.A.: Exp-function method for some nonlinear PDE’s and a nonlinear ODE’s. J.
King Saud Univ.-Sci. 24, 1–10 (2012)
Parand, K., Razzaghi, M.: Rational Legendre approximation for solving some physical problems
on semi-infinite intervals. Phys. Scr. 69, 353–357 (2004)
Parand, K., Shahini, M., Dehghan, M.: Rational Legendre pseudospectral approach for solving
nonlinear differential equations of Lane-Emden type. J. Comput. Phys. 228, 8830–8840 (2009)
Parand, K., Dehghan, M., Rezaei, A.R., Ghaderi, S.M.: An approximation algorithm for the solution
of the nonlinear Lane-Emden type equations arising in astrophysics using Hermite functions
collocation method. Comput. Phys. Commun. 181, 1096–1108 (2010)
Parand, K., Abbasbandy, S., Kazem, S., Rad, J.A.: A novel application of radial basis functions for
solving a model of first-order integro-ordinary differential equation. Commun. Nonlinear Sci.
Numer. Simul. 16, 4250–4258 (2011)
Parand, K., Nikarya, M., Rad, J.A.: Solving non-linear Lane-Emden type equations using Bessel
orthogonal functions collocation method. Celest. Mech. Dyn. Astron. 116, 97–107 (2013)
Parand, K., Hossayni, S.A., Rad, J.A.: An operation matrix method based on Bernstein polynomials
for Riccati differential equation and Volterra population model. Appl. Math. Model. 40, 993–1011
(2016)

Parand, K., Lotfi, Y., Rad, J.A.: An accurate numerical analysis of the laminar two-dimensional
flow of an incompressible Eyring-Powell fluid over a linear stretching sheet. Eur. Phys. J. Plus
132, 1–21 (2017)
Ramos, J.I.: Linearization techniques for singular initial-value problems of ordinary differential
equations. Appl. Math. Comput. 161, 525–542 (2005)
Ramos, J.I.: Series approach to the Lane-Emden equation and comparison with the homotopy
perturbation method. Chaos Solitons Fract. 38, 400–408 (2008)
Rodrigues, C., Simoes, F.M., da Costa, A.P., Froio, D., Rizzi, E.: Finite element dynamic analysis
of beams on nonlinear elastic foundations under a moving oscillator. Eur. J. Mech. A Solids 68,
9–24 (2018)
Shawagfeh, N.T.: Nonperturbative approximate solution for Lane-Emden equation. J. Math. Phys.
34, 4364–4369 (1993)
Singh, O.P., Pandey, R.K., Singh, V.K.: An analytic algorithm of Lane-Emden type equations arising
in astrophysics using modified Homotopy analysis method. Comput. Phys. Commun. 180, 1116–
1124 (2009)
Suykens, J.A.K., Van Gestel, T., De Brabanter, J., De Moor, B., Vandewalle, J.: Least Squares
Support Vector Machines. World Scientific, NJ (2002)
Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
Wazwaz, A.M.: A new algorithm for solving differential equations of Lane-Emden type. Appl.
Math. Comput. 118, 287–310 (2001)
Wazwaz, A.M.: The modified decomposition method for analytic treatment of differential equations.
Appl. Math. Comput. 173, 165–176 (2006)
Wood, D.O.: Monographs on physics. In: The Emission of Electricity from Hot Bodies. Longmans,
Green and Company (1921)
Yıldırım, A., Öziş, T.: Solutions of singular IVPs of Lane-Emden type by Homotopy perturbation
method. Phys. Lett. A 369, 70–76 (2007)
Yıldırım, A., Öziş, T.: Solutions of singular IVPs of Lane-Emden type by the variational iteration
method. Nonlinear Anal. Theory Methods Appl. 70, 2480–2484 (2009)
Yousefi, S.A.: Legendre wavelets method for solving differential equations of Lane-Emden type.
Appl. Math. Comput. 181, 1417–1422 (2006)
Yüzbaşı, Ş, Sezer, M.: An improved Bessel collocation method with a residual error function to
solve a class of Lane-Emden differential equations. Math. Comput. Model. 57, 1298–1311 (2013)
Zhang, B.Q., Wu, Q.B., Luo, X.G.: Experimentation with two-step Adomian decomposition method
to solve evolution models. Appl. Math. Comput. 175, 1495–1502 (2006)
Chapter 8
Solving Partial Differential Equations by LS-SVM

Mohammad Mahdi Moayeri and Mohammad Hemami

Abstract In recent years, much attention has been paid to machine learning-based
numerical approaches due to their applications in solving difficult high-dimensional
problems. In this chapter, a numerical method based on support vector machines is
proposed to solve second-order time-dependent partial differential equations. This
method is called the least squares support vector machines (LS-SVM) collocation
approach. In this approach, first, the time dimension is discretized by the Crank–
Nicolson algorithm, then, the optimal representation of the solution is obtained in
the primal setting. Using KKT optimality conditions, the dual formulation is derived,
and at the end, the problem is converted to a linear system of algebraic equations that
can be solved by standard solvers. The Fokker–Planck and generalized Fitzhugh–
Nagumo equations are considered as test cases to demonstrate the effectiveness of
the proposed scheme. Moreover, two kinds of orthogonal kernel functions are
introduced for each example, and their performances are compared.

Keywords Partial differential equation · Fokker–Planck equation ·


Fitzhugh–Nagumo equation · Collocation LS-SVM

8.1 Introduction

Nowadays, most mathematical models of problems in science have more than one
independent variable, usually representing time and space (Moayeri
et al. 2020b; Mohammadi and Dehghan 2020, 2019; Hemami et al. 2021). These
models lead to partial differential equations (PDEs). However, these equations usu-
ally do not have exact/analytical solutions due to their complexity. Thus, numerical
methods can help us approximate the solution of equations and simulate models. Let
us consider the following general form of a second-order PDE:

M. M. Moayeri (B) · M. Hemami


Department of Cognitive Modeling, Institute for Cognitive and Brain Sciences,
Shahid Beheshti University, Tehran, Iran
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
J. A. Rad et al. (eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support
Vector Machines, Industrial and Applied Mathematics,
https://doi.org/10.1007/978-981-19-6553-1_8

$$
A \frac{\partial^2 u}{\partial x^2} + B \frac{\partial^2 u}{\partial x \partial y} + C \frac{\partial^2 u}{\partial y^2} + D \frac{\partial u}{\partial x} + E \frac{\partial u}{\partial y} + F u + G = 0, \qquad (8.1)
$$

where A, B, C, D, E, F, and G can be constants or functions of x, y, and u. It is
clear that when G = 0 we have a homogeneous equation, and otherwise a
nonhomogeneous one. According to these coefficients, there are three kinds of
PDEs (Smith 1985): the equation is called elliptic when B^2 − 4AC < 0, parabolic
when B^2 − 4AC = 0, and hyperbolic when B^2 − 4AC > 0. In science, most of the
problems that include a time dimension are parabolic or hyperbolic, such as the heat
and wave equations (Smith 1985). In addition, the Laplace equation is an example
of the elliptic type (Lindqvist 2019; Kogut et al. 2019). On the other hand,
if A, B, C, D, E, or F is a function of the dependent variable u, the equation becomes
a nonlinear PDE, which is usually more complex than the linear one.
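The discriminant test above is mechanical, so it is easy to encode; a tiny illustrative helper (the three sample PDEs in the comments are standard):

```python
def classify_second_order_pde(A, B, C):
    """Classify A u_xx + B u_xy + C u_yy + (lower-order terms) = 0
    by the sign of the discriminant B^2 - 4AC."""
    d = B * B - 4 * A * C
    if d < 0:
        return "elliptic"
    if d == 0:
        return "parabolic"
    return "hyperbolic"

# Laplace: u_xx + u_yy = 0; heat: u_t - u_xx = 0 (no second t-derivative);
# wave: u_tt - u_xx = 0 (here y plays the role of t).
print(classify_second_order_pde(1, 0, 1),
      classify_second_order_pde(1, 0, 0),
      classify_second_order_pde(1, 0, -1))
```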
Since analytical and semi-analytical methods can be used only for simple differen-
tial equations, scientists have developed various numerical approaches to solve more
complex and nonlinear ordinary/partial differential equations. These approaches can
be divided into six general categories:
• Finite difference method (FDM) (Smith 1985; Strikwerda 2004): The intervals
are discretized to a finite number of steps, and the derivatives are approximated
by finite difference formulas. Usually, the approximate solution of a problem is
obtained by solving a system of algebraic equations obtained by employing FDM
on PDE. Some examples of this method are implicit method (Hemami et al. 2021;
Moayeri et al. 2021), explicit method (Smith 1985; Strikwerda 2004), Crank–
Nicolson method (Liu and Hao 2022; Abazari and Yildirim 2019), etc. (Trefethen
1996; Meerschaert and Tadjeran 2006; Dehghan and Taleei 2010).
• Finite element method (FEM) (Bath and Wilson 1976; Ottosen et al. 1992): This
approach is based on mesh generating. In fact, FEM subdivides the space into
smaller parts called finite elements. Then, the variation calculus and low-order
local basis functions are used in these elements. Finally, all sets of element equa-
tions are combined into a global system of equations for the final calculation.
Different kinds of FEMs are standard FEM (Hughes 2012; Zienkiewicz et al.
2005), nonconforming FEM (Carstensen and Köhler 2017; Wilson et al. 2021),
mixed FEM (Bruggi and Venini 2008; Bossavit and Vérité 1982), discontinuous
FEM (Kanschat 2004; Chien and Wu 2001), etc. (Pozrikidis 2014; Wang et al.
2019; Yeganeh et al. 2017; Zhao et al. 2017).
• Finite volume method (FVM) (Chien and Wu 2001; Eymard et al. 2000): FVM is
similar to FEM and needs mesh construction. In this approach, these mesh elements
are called control volumes. In fact, in this method, volume integrals are used with
the divergence theorem. An additional feature is the local conservative of the
numerical fluxes; that is, the numerical flux is conserved from one discretization
cell to its neighbor. Due to this property, FVM is appropriate for the models
where the flux is of importance (Zhao et al. 1996; Kassab et al. 2003). Actually,
a finite volume method evaluates exact expressions for the average value of the
solution over some volume, and uses this data to construct approximations of the
solution within cells (Fallah et al. 2000). Some famous FVMs are cell-centered

FVM (Ghidaglia et al. 2001; Bertolazzi and Manzini 2001), vertex-centered FVM
(Asouti et al. 2011; Zhang and Zou 2013), Petrov–Galerkin FVM (Dubois 2000;
Moosavi and Khelil 2008), etc. (Zhao et al. 1996; Liu et al. 2014; Fallah 2004).
• Spectral method (Shizgal 2015): Unlike the previous approaches, the spectral
method is a high-order global method. In the spectral method, the solution of the
differential equation is considered as a finite sum of basis functions,
$\sum_{i=0}^{n} a_i \varphi_i(x)$.
Usually, the basis functions φ_i(x) are orthogonal polynomials because of
their orthogonality properties which make the calculations easier (Asghari et al.
2022). Now, different strategies are developed to calculate the coefficients in the
sum in order to satisfy the differential equation as well as possible. Generally, in
simple geometries for smooth problems, the spectral methods offer exponential
rates of convergence/spectral accuracy (Spalart et al. 1991; Mai-Duy 2006). This
method is divided into three main categories: collocation (Zayernouri and Kar-
niadakis 2014, 2015), Galerkin (Shen 1994; Chen et al. 2008), Petrov–Galerkin
methods (Gamba and Rjasanow 2018; Zayernouri et al. 2015), etc. (Moayeri et al.
2020b; Kopriva 2009; Delkhosh and Parand 2021; Latifi and Delkhosh 2020).
• Meshfree method (Fasshauer 2007; Liu 2003): The meshfree method is used to
establish a system of algebraic equations for the whole problem domain without
the use of a predefined mesh, or uses easily generable meshes in a much more
flexible or freer manner. It can be said that meshfree methods essentially use a
set of nodes scattered within the problem domain as well as on the boundaries
to represent the problem domain and its boundaries. The field functions are then
approximated locally using these nodes (Liu 2003). Meshfree methods have been
considered by many researchers in the last decade due to their high flexibility and
high accuracy of the numerical solution (Rad and Parand 2017a; Dehghan and
Shokri 2008). However, there are still many challenges and questions about these
methods that need to be answered (Abbasbandy and Shirzadi 2010, 2011), such
as optimal shape parameter selection in methods based on radial basis functions
and enforcing boundary conditions in meshfree methods. Some of the meshfree
methods are radial basis function approach (RBF) (Rad et al. 2012, 2014; Parand
et al. 2017; Kazem and Rad 2012; Kazem et al. 2012b; Rashedi et al. 2014;
Parand and Rad 2013), radial basis function generated finite difference (RBF-
FD) (Mohammadi et al. 2021; Abbaszadeh and Dehghan 2020a), meshfree local
Petrov–Galerkin (MLPG) (Rad and Parand 2017a, b; Abbaszadeh and Dehghan
2020b; Rad et al. 2015a), element free Galerkin (EFG) (Belytschko et al. 1994;
Dehghan and Narimani 2020), meshfree local radial point interpolation method
(MLRPIM) (Liu et al. 2005; Liu and Gu 2001; Rad and Ballestra 2015; Rad et al.
2015b), etc. (Hemami et al. 2020, 2019; Mohammadi and Dehghan 2020, 2019).
• Machine learning-based methods (Cortes and Vapnik 1995; Jordan and Mitchell
2015): With the growth of available scientific data and improving machine learn-
ing approaches, recently, researchers of scientific computing have been trying to
develop machine learning and deep learning algorithms for solving differential
equations (Aarts and Van Der Veer 2001; Cheung and See 2021). Especially, they
have had some successful attempts to solve some difficult problems that common
numerical methods are not well able to solve such as PDEs by noisy observations

or very high-dimension PDEs. Some of the recent machine learning-based numer-


ical methods are physics-informed neural networks (PINN) (Raissi et al. 2019;
Cai et al. 2022), neural network collocation method (NNCM) (Brink et al. 2021;
Liaqat et al. 2001), deep Galerkin method (DGM) (Sirignano and Spiliopoulos
2018; Saporito and Zhang 2021), deep Ritz method (DRM) (Yu 2018; Lu et al.
2021), differential equation generative adversarial networks (DEGAN) (Lee et al.
2021; Kadeethumm et al. 2021), fractional orthogonal neural networks (Hadian-
Rasanan et al. 2019, 2020), least squares support vector machines (LS-SVM)
(Shivanian et al. 2022; Hajimohammadi et al. 2022), etc. (Pang et al. 2019; Parand
et al. 2021a; Hajimohammadi and Parand 2021).
Each of these methods has its advantages and disadvantages; for instance, the spec-
tral method has a spectral convergence for smooth functions, but it is not a good
choice for problems with a complex domain. On the other hand, the finite element
method can handle domain complexity and singularity easily. Moreover, scientists
have combined these methods with each other to introduce new efficient numerical
approaches; for example, the spectral element approach (Karniadakis and Sherwin
2005), finite volume spectral element method (Shakeri and Dehghan 2011), radial
basis function-finite difference method (RBF-FD) (Mohammadi et al. 2021), and
deep RBF collocation method (Saha and Mukhopadhyay 2020).
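The "spectral accuracy" mentioned for the spectral method above is easy to see in miniature: expand a smooth function in a truncated Chebyshev series and watch the maximum error collapse as the degree grows. A sketch using numpy's polynomial module (the test function cos and the chosen degrees are arbitrary illustrative choices):

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

f = np.cos                        # a smooth test function on [-1, 1]
xs = np.linspace(-1.0, 1.0, 1001)
errors = {}
for deg in (2, 4, 8, 16):
    # Degree-`deg` truncated Chebyshev expansion fitted at Chebyshev points.
    p = cheb.Chebyshev.interpolate(f, deg)
    errors[deg] = float(np.max(np.abs(p(xs) - f(xs))))
for deg, e in errors.items():
    print(f"degree {deg:2d}: max error {e:.2e}")
```

Each doubling of the degree buys several orders of magnitude of accuracy, the hallmark of exponential (spectral) convergence for smooth functions.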
In this chapter, we utilize a machine learning-based method using support vec-
tor machines called the least squares support vector machines collocation algorithm
(Mehrkanoon and Suykens 2015, 2012) in conjunction with the Crank–Nicolson
method to solve well-known second-order PDEs. In this algorithm, modified Cheby-
shev and Legendre orthogonal functions are considered as kernels. In the proposed
method, first, the temporal dimension is discretized by the Crank–Nicolson approach.
With the aid of the LS-SVM collocation method, the obtained equation at each time
step is converted to an optimization problem in a primal form whose constraints
are the discretized equation with its boundary conditions. Then, the KKT optimality
condition is utilized to derive the dual form. Afterward, the problem is minimized in
dual form using the Lagrange multipliers method (Parand et al. 2021b). Finally, the
problem is converted to a linear system of algebraic equations that can be solved by
a standard method such as QR or LU decomposition.

8.2 LS-SVM Method for Solving Second-Order Partial Differential Equations

We consider a time-dependent second-order PDE with Dirichlet boundary conditions
in the following form:

$$
\frac{\partial u}{\partial t} = A(x, t, u)\, u + B(x, t, u) \frac{\partial u}{\partial x} + C(x, t, u) \frac{\partial^2 u}{\partial x^2}, \qquad (8.2)
$$

where x ∈ Ω ⊂ ℝ and t ∈ [0, T].

There are two strategies to solve this PDE by SVM. The first is to apply the LS-SVM
approach to both time and space dimensions simultaneously, as proposed in
Mehrkanoon and Suykens (2015). The second is a semi-discrete method: to solve
Eq. 8.2, a time discretization scheme is applied first, and LS-SVM is then used to
solve the resulting problem at each time step. Details of the proposed algorithm are
described below.

8.2.1 Temporal Discretization

The Crank–Nicolson method is chosen for time discretization due to its good
convergence and unconditional stability. Applying the Crank–Nicolson
finite-difference formula to Eq. 8.2, we have

$$
\frac{u^{i+1}(x) - u^{i}(x)}{\Delta t} = \frac{1}{2}\Big[ A(x, t_i, u^i)\, u^{i+1}(x) + B(x, t_i, u^i) \frac{\partial u^{i+1}}{\partial x} + C(x, t_i, u^i) \frac{\partial^2 u^{i+1}}{\partial x^2} \Big]
$$
$$
+\; \frac{1}{2}\Big[ A(x, t_i, u^i)\, u^{i}(x) + B(x, t_i, u^i) \frac{\partial u^{i}}{\partial x} + C(x, t_i, u^i) \frac{\partial^2 u^{i}}{\partial x^2} \Big], \quad i = 0, \ldots, m-1, \qquad (8.3)
$$

where u^i(x) = u(x, t_i) and t_i = iΔt. Moreover, Δt = T/m is the size of the time steps.
This equation can be rewritten as

$$
u^{i+1}(x) - \frac{\Delta t}{2}\Big[ A(x, t_i, u^i)\, u^{i+1}(x) + B(x, t_i, u^i) \frac{\partial u^{i+1}}{\partial x} + C(x, t_i, u^i) \frac{\partial^2 u^{i+1}}{\partial x^2} \Big] =
$$
$$
u^{i}(x) + \frac{\Delta t}{2}\Big[ A(x, t_i, u^i)\, u^{i}(x) + B(x, t_i, u^i) \frac{\partial u^{i}}{\partial x} + C(x, t_i, u^i) \frac{\partial^2 u^{i}}{\partial x^2} \Big]. \qquad (8.4)
$$
Now, the LS-SVM algorithm is applied to Eq. 8.4 to find the solution of problem
Eq. 8.2.
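To make the time-stepping in Eqs. 8.3–8.4 concrete, the sketch below applies the same Crank–Nicolson scheme to the constant-coefficient special case A = B = 0, C = c (the heat equation u_t = c u_xx), with an ordinary finite-difference matrix standing in for the spatial step; it is purely illustrative and not the chapter's LS-SVM solver:

```python
import numpy as np

def crank_nicolson_heat(u0, c, dx, dt, steps):
    """Advance u_t = c u_xx with homogeneous Dirichlet BCs by Crank-Nicolson."""
    n = len(u0)
    r = c * dt / (2.0 * dx**2)
    # (dt/2) * c * D2 on the interior points, D2 the second-difference matrix.
    L = np.zeros((n - 2, n - 2))
    np.fill_diagonal(L, -2.0 * r)
    np.fill_diagonal(L[1:], r)      # subdiagonal
    np.fill_diagonal(L[:, 1:], r)   # superdiagonal
    I = np.eye(n - 2)
    u = u0.copy()
    for _ in range(steps):
        # (I - (dt/2) c D2) u^{i+1} = (I + (dt/2) c D2) u^i, cf. Eq. 8.4
        u[1:-1] = np.linalg.solve(I - L, (I + L) @ u[1:-1])
    return u

x = np.linspace(0.0, 1.0, 51)
u0 = np.sin(np.pi * x)              # exact solution: e^{-pi^2 c t} sin(pi x)
u = crank_nicolson_heat(u0, c=1.0, dx=x[1] - x[0], dt=1e-3, steps=100)
exact = np.exp(-np.pi**2 * 0.1) * np.sin(np.pi * x)
print(f"max error at t = 0.1: {np.max(np.abs(u - exact)):.2e}")
```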

8.2.2 LS-SVM Collocation Method


At each time step j, we have training data $\{x_i, u_i^j\}_{i=0}^{n}$, where $x_i \in \Omega$
and $u_i^j$ are the input and output data, respectively. The solution of the problem at
time step j is approximated as $\hat{u}^j(x) = \sum_{i=1}^{n} w_i \varphi_i(x) + b = \mathbf{w}^T \varphi(x) + b$,
where $\{\varphi_i(x)\}$ are arbitrary basis functions. Consider

$$
r^{i}(x) = u^{i}(x) + \frac{\Delta t}{2}\Big[ A(x, t_i, u^i)\, u^{i}(x) + B(x, t_i, u^i) \frac{\partial u^{i}}{\partial x} + C(x, t_i, u^i) \frac{\partial^2 u^{i}}{\partial x^2} \Big].
$$

The solution u^j(x) is obtained by solving the following optimization problem:

$$
\underset{\mathbf{w},\, e}{\text{minimize}} \quad \frac{1}{2}\,\mathbf{w}^T\mathbf{w} + \frac{\gamma}{2}\, e^T e \qquad (8.5)
$$
$$
\text{s.t.} \quad \tilde{A}_i \big(\mathbf{w}^T \varphi(x_i) + b\big) + \tilde{B}_i\, \mathbf{w}^T \varphi'(x_i) + \tilde{C}_i\, \mathbf{w}^T \varphi''(x_i) = r^j(x_i) + e_i, \quad i = 0, \ldots, n,
$$
$$
\mathbf{w}^T \varphi(x_0) + b = p_1, \qquad \mathbf{w}^T \varphi(x_n) + b = p_2,
$$

where $\tilde{A}_i = 1 - \frac{\Delta t}{2} A_i$, $\tilde{B}_i = -\frac{\Delta t}{2} B_i$, and $\tilde{C}_i = -\frac{\Delta t}{2} C_i$. The collocation
points $\{x_i\}_{i=0}^{n}$ are the Gauss–Lobatto–Legendre or Chebyshev points (Bhrawy and Baleanu 2013;
Heydari and Avazzadeh 2020). Now, the dual form representation of the problem is
derived. The Lagrangian of problem Eq. 8.5 is as follows:

$$
G=\frac{1}{2}\mathbf{w}^{T}\mathbf{w}+\frac{\gamma}{2}\mathbf{e}^{T}\mathbf{e}
-\sum_{i=2}^{n-1}\alpha_i\Big[\mathbf{w}^{T}\big(\tilde{A}_i\boldsymbol{\varphi}(x_i)+\tilde{B}_i\boldsymbol{\varphi}'(x_i)+\tilde{C}_i\boldsymbol{\varphi}''(x_i)\big)+\tilde{A}_i b-r^{j}(x_i)-e_i\Big] \\
-\beta_1\big(\mathbf{w}^{T}\boldsymbol{\varphi}(x_0)+b-p_1\big)-\beta_2\big(\mathbf{w}^{T}\boldsymbol{\varphi}(x_n)+b-p_2\big), \tag{8.6}
$$

where $\{\alpha_i\}_{i=2}^{n-1}$, $\beta_1$, and $\beta_2$ are Lagrange multipliers. Then, the KKT optimality conditions, for $l=2,\ldots,n-1$, are as follows:

$$
\frac{\partial G}{\partial \mathbf{w}}=0\;\rightarrow\;\mathbf{w}=\sum_{i=2}^{n-1}\alpha_i\big(\tilde{A}_i\boldsymbol{\varphi}(x_i)+\tilde{B}_i\boldsymbol{\varphi}'(x_i)+\tilde{C}_i\boldsymbol{\varphi}''(x_i)\big)+\beta_1\boldsymbol{\varphi}(x_0)+\beta_2\boldsymbol{\varphi}(x_n),
$$
$$
\frac{\partial G}{\partial b}=0\;\rightarrow\;\sum_{i=2}^{n-1}\alpha_i\tilde{A}_i+\beta_1+\beta_2=0,
$$
$$
\frac{\partial G}{\partial e_l}=0\;\rightarrow\;e_l=-\frac{\alpha_l}{\gamma},
$$
$$
\frac{\partial G}{\partial \alpha_l}=0\;\rightarrow\;\mathbf{w}^{T}\big(\tilde{A}_l\boldsymbol{\varphi}(x_l)+\tilde{B}_l\boldsymbol{\varphi}'(x_l)+\tilde{C}_l\boldsymbol{\varphi}''(x_l)\big)+\tilde{A}_l b-e_l=r^{j}(x_l),
$$
$$
\frac{\partial G}{\partial \beta_1}=0\;\rightarrow\;\mathbf{w}^{T}\boldsymbol{\varphi}(x_0)+b=p_1,
$$
$$
\frac{\partial G}{\partial \beta_2}=0\;\rightarrow\;\mathbf{w}^{T}\boldsymbol{\varphi}(x_n)+b=p_2. \tag{8.7}
$$
∂β2
8 Solving Partial Differential Equations by LS-SVM 177

By substituting the first and third equations into the fourth one, the primal variables are eliminated. Since products of the basis functions and of their derivatives appear in the resulting equations, we need the derivatives of the kernel function $K(x_i,x_j)=\boldsymbol{\varphi}(x_i)^{T}\boldsymbol{\varphi}(x_j)$. According to Mercer's theorem, derivatives of the feature map can be written in terms of derivatives of the kernel function, so we use the differential operator $\nabla^{m,n}$ defined in Chaps. 3 and 4. Hence, in time step $k$, it can be written that


$$
\sum_{j=2}^{n-1}\alpha_j\Big[\tilde{A}_i\big(\chi^{(0,0)}_{j,i}\tilde{A}_j+\chi^{(1,0)}_{j,i}\tilde{B}_j+\chi^{(2,0)}_{j,i}\tilde{C}_j\big)+\tilde{B}_i\big(\chi^{(0,1)}_{j,i}\tilde{A}_j+\chi^{(1,1)}_{j,i}\tilde{B}_j+\chi^{(2,1)}_{j,i}\tilde{C}_j\big)+\tilde{C}_i\big(\chi^{(0,2)}_{j,i}\tilde{A}_j+\chi^{(1,2)}_{j,i}\tilde{B}_j+\chi^{(2,2)}_{j,i}\tilde{C}_j\big)\Big] \\
+\beta_1\big(\chi^{(0,0)}_{1,i}\tilde{A}_i+\chi^{(0,1)}_{1,i}\tilde{B}_i+\chi^{(0,2)}_{1,i}\tilde{C}_i\big)
+\beta_2\big(\chi^{(0,0)}_{n,i}\tilde{A}_i+\chi^{(0,1)}_{n,i}\tilde{B}_i+\chi^{(0,2)}_{n,i}\tilde{C}_i\big)
+\frac{\alpha_i}{\gamma}+\tilde{A}_i b=r_i^{k},\quad i=2,\ldots,n-1,
$$
$$
\sum_{j=2}^{n-1}\alpha_j\big(\chi^{(0,0)}_{j,1}\tilde{A}_j+\chi^{(1,0)}_{j,1}\tilde{B}_j+\chi^{(2,0)}_{j,1}\tilde{C}_j\big)+\chi^{(0,0)}_{1,1}\beta_1+\chi^{(0,0)}_{n,1}\beta_2+b=p_1,
$$
$$
\sum_{j=2}^{n-1}\alpha_j\big(\chi^{(0,0)}_{j,n}\tilde{A}_j+\chi^{(1,0)}_{j,n}\tilde{B}_j+\chi^{(2,0)}_{j,n}\tilde{C}_j\big)+\chi^{(0,0)}_{1,n}\beta_1+\chi^{(0,0)}_{n,n}\beta_2+b=p_2,
$$
$$
\sum_{j=2}^{n-1}\alpha_j\tilde{A}_j+\beta_1+\beta_2=0. \tag{8.8}
$$
j=2

According to the aforementioned equations, the following linear system is defined:

$$
\begin{bmatrix}
\chi^{(0,0)}_{1,1} & M_1 & \chi^{(0,0)}_{1,n} & 1\\[2pt]
P_1 & \mathbf{K}+\frac{1}{\gamma}\mathbf{I} & P_n & \mathbf{A}^{T}\\[2pt]
\chi^{(0,0)}_{n,1} & M_2 & \chi^{(0,0)}_{n,n} & 1\\[2pt]
1 & \mathbf{A} & 1 & 0
\end{bmatrix}
\begin{bmatrix}\beta_1\\ \boldsymbol{\alpha}\\ \beta_2\\ b\end{bmatrix}
=\begin{bmatrix}p_1\\ \mathbf{r}^{k}\\ p_2\\ 0\end{bmatrix}, \tag{8.9}
$$

where

$$
\mathbf{A}=[\tilde{A}_{2:n-1}],\quad \mathbf{B}=[\tilde{B}_{2:n-1}],\quad \mathbf{C}=[\tilde{C}_{2:n-1}],
$$
$$
M_1=\chi^{(0,0)}_{1,2:n-1}\mathrm{Diag}(\mathbf{A})+\chi^{(1,0)}_{1,2:n-1}\mathrm{Diag}(\mathbf{B})+\chi^{(2,0)}_{1,2:n-1}\mathrm{Diag}(\mathbf{C}),
$$
$$
P_1=\mathrm{Diag}(\mathbf{A})\chi^{(0,0)}_{2:n-1,1}+\mathrm{Diag}(\mathbf{B})\chi^{(0,1)}_{2:n-1,1}+\mathrm{Diag}(\mathbf{C})\chi^{(0,2)}_{2:n-1,1},
$$
$$
\mathbf{K}=\mathrm{Diag}(\mathbf{A})\big(\chi^{(0,0)}\mathrm{Diag}(\mathbf{A})+\chi^{(1,0)}\mathrm{Diag}(\mathbf{B})+\chi^{(2,0)}\mathrm{Diag}(\mathbf{C})\big) \\
+\mathrm{Diag}(\mathbf{B})\big(\chi^{(0,1)}\mathrm{Diag}(\mathbf{A})+\chi^{(1,1)}\mathrm{Diag}(\mathbf{B})+\chi^{(2,1)}\mathrm{Diag}(\mathbf{C})\big) \\
+\mathrm{Diag}(\mathbf{C})\big(\chi^{(0,2)}\mathrm{Diag}(\mathbf{A})+\chi^{(1,2)}\mathrm{Diag}(\mathbf{B})+\chi^{(2,2)}\mathrm{Diag}(\mathbf{C})\big),
$$
$$
P_n=\mathrm{Diag}(\mathbf{A})\chi^{(0,0)}_{2:n-1,n}+\mathrm{Diag}(\mathbf{B})\chi^{(0,1)}_{2:n-1,n}+\mathrm{Diag}(\mathbf{C})\chi^{(0,2)}_{2:n-1,n},
$$
$$
M_2=\chi^{(0,0)}_{n,2:n-1}\mathrm{Diag}(\mathbf{A})+\chi^{(1,0)}_{n,2:n-1}\mathrm{Diag}(\mathbf{B})+\chi^{(2,0)}_{n,2:n-1}\mathrm{Diag}(\mathbf{C}),
$$
$$
\mathbf{r}^{k}=[r_i^{k}]_{2:n-1}^{T}.
$$

By solving system Eq. 8.9, $\{\alpha_i\}_{i=2}^{n-1}$, $\beta_1$, $\beta_2$, and $b$ are calculated. Then, the solution in time step $k$, in the dual form, can be obtained as follows:

$$
\hat{u}^{k}(x)=\sum_{i=2}^{n-1}\alpha_i\Big(\tilde{A}(x_i)\chi^{(0,0)}(x_i,x)+\tilde{B}(x_i)\chi^{(1,0)}(x_i,x)+\tilde{C}(x_i)\chi^{(2,0)}(x_i,x)\Big)+\beta_1\chi^{(0,0)}(x_0,x)+\beta_2\chi^{(0,0)}(x_n,x)+b.
$$

Going through all the above steps, we can say that the solution of PDE Eq. 8.2 at
each time step is approximated.
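To make the assembly of Eqs. 8.8–8.9 concrete, the sketch below performs one Crank–Nicolson step of the linear model problem $u_t = u_x + u_{xx}$ with exact solution $u = x + t$ (Example 1 of Sect. 8.3.1), using a third-order modified Chebyshev kernel whose cross-derivatives $\chi^{(m,k)}$ are generated symbolically. The parameter values ($n = 8$, $\delta = 2$, $\gamma = 10^8$), the use of all interior nodes as collocation points, and all function names are our illustrative choices, not the chapter's:

```python
import numpy as np
import sympy as sp

# One CN step of u_t = u_x + u_xx, exact u = x + t, so A = 0, B = C = 1
# and Atil = 1, Btil = Ctil = -dt/2 (cs collects [Atil, Btil, Ctil]).
n, dt, gamma, delta = 8, 1e-3, 1e8, 2.0
x = (1 - np.cos(np.pi * np.arange(n + 1) / n)) / 2   # Chebyshev-Gauss-Lobatto on [0, 1]
cs = [1.0, -dt / 2, -dt / 2]

# Modified Chebyshev kernel of order 3 (Eq. 8.10) and its cross-derivatives
# chi[m, k](a, b) = d^m/da^m d^k/db^k K(a, b), generated symbolically.
a, b = sp.symbols('a b')
Ksym = sum(sp.chebyshevt(i, a) * sp.chebyshevt(i, b) for i in range(4)) \
    * sp.exp(-delta * (a - b) ** 2)
chi = {}
for m in range(3):
    dm = sp.diff(Ksym, a, m) if m else Ksym
    for k in range(3):
        chi[m, k] = sp.lambdify((a, b), sp.diff(dm, b, k) if k else dm)

def Lboth(xj, xi):    # CN operator applied to both kernel arguments
    return sum(cs[m] * cs[k] * chi[m, k](xj, xi) for m in range(3) for k in range(3))

def Lfirst(xj, xi):   # operator applied to the first argument only
    return sum(cs[m] * chi[m, 0](xj, xi) for m in range(3))

def Lsecond(xj, xi):  # operator applied to the second argument only
    return sum(cs[k] * chi[0, k](xj, xi) for k in range(3))

xin = x[1:-1]                        # interior collocation nodes
N = len(xin)
r = xin + dt / 2                     # r(x_i) = u^0 + (dt/2)(B u^0_x + C u^0_xx), u^0 = x
p1, p2 = dt, 1 + dt                  # boundary values of the exact solution at t_1

# Unknowns ordered [beta1, alpha_1..alpha_N, beta2, bias], mirroring Eq. 8.9.
M = np.zeros((N + 3, N + 3))
rhs = np.zeros(N + 3)
for i, xc in enumerate(xin):         # interior collocation equations
    M[i, 0] = Lsecond(x[0], xc)
    M[i, 1:N + 1] = [Lboth(xj, xc) for xj in xin]
    M[i, 1 + i] += 1 / gamma
    M[i, N + 1] = Lsecond(x[-1], xc)
    M[i, N + 2] = cs[0]
    rhs[i] = r[i]
for row, xb, p in ((N, x[0], p1), (N + 1, x[-1], p2)):   # boundary equations
    M[row, 0] = chi[0, 0](x[0], xb)
    M[row, 1:N + 1] = [Lfirst(xj, xb) for xj in xin]
    M[row, N + 1] = chi[0, 0](x[-1], xb)
    M[row, N + 2] = 1.0
    rhs[row] = p
M[N + 2, 0] = M[N + 2, N + 1] = 1.0                      # bias (last) row
M[N + 2, 1:N + 1] = cs[0]

sol = np.linalg.solve(M, rhs)
beta1, alpha, beta2, bias = sol[0], sol[1:N + 1], sol[N + 1], sol[N + 2]

xt = np.linspace(0, 1, 21)           # dual-form evaluation of u^1
u1 = np.array([sum(al * Lfirst(xj, xx) for al, xj in zip(alpha, xin))
               + beta1 * chi[0, 0](x[0], xx) + beta2 * chi[0, 0](x[-1], xx) + bias
               for xx in xt])
err = np.max(np.abs(u1 - (xt + dt)))
assert err < 1e-2
```

Generating the kernel derivatives with SymPy avoids hand-deriving the nine $\chi^{(m,k)}$ expressions; a production solver would precompute and vectorize them.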

8.3 Numerical Simulations

In this part, the proposed approach is applied to several examples. Fokker–Planck


and generalized Fitzhugh–Nagumo equations are provided as test cases, and each
test case contains three different examples. These examples have exact solutions, so
we can calculate the exact error to evaluate the precision of the proposed method.
Moreover, we solve examples by different kernel functions and compare the obtained
solutions to each other.
Modified Chebyshev/Legendre kernels of orders three and six are used in the following examples; they are defined as follows (Ozer et al. 2011):

$$
K(x,z)=\frac{\sum_{i=0}^{n}L_i(x)L_i(z)}{\exp\big(\delta\,\|x-z\|^{2}\big)}, \tag{8.10}
$$

where $\{L_i(\cdot)\}$ are the Chebyshev or Legendre polynomials and $n$ is the polynomial order, which in our examples is 3 or 6. In addition, $\delta$ is the decaying parameter.
In order to illustrate the accuracy of the proposed method, the $L_2$ and root-mean-square (RMS) errors are computed as follows:

$$
\mathrm{RMS}=\sqrt{\frac{\sum_{i=1}^{n}\big(\hat{u}(x_i)-u(x_i)\big)^{2}}{n}},\qquad
L_2=\sqrt{\sum_{i=1}^{n}\big|\hat{u}(x_i)-u(x_i)\big|^{2}}.
$$
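The two metrics differ only by the $1/n$ normalization, so $L_2=\sqrt{n}\,\mathrm{RMS}$; a small sketch (names are ours):

```python
import numpy as np

def l2_error(u_hat, u):
    return np.sqrt(np.sum(np.abs(np.asarray(u_hat) - np.asarray(u)) ** 2))

def rms_error(u_hat, u):
    d = np.asarray(u_hat) - np.asarray(u)
    return np.sqrt(np.mean(d ** 2))

u = np.array([1.0, 2.0, 3.0, 4.0])
u_hat = u + 1e-3                # uniform perturbation of n = 4 values
assert np.isclose(rms_error(u_hat, u), 1e-3)
assert np.isclose(l2_error(u_hat, u), 2e-3)   # L2 = sqrt(n) * RMS
```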

8.3.1 Fokker–Planck Equation

Fokker–Planck equations have various applications in astrophysics (Chavanis 2006),


biology (Frank et al. 2003), chemical physics (Grima et al. 2011), polymers (Chau-
viére and Lozinski 2001), circuit theory (Kumar 2013), dielectric relaxation (Tan-
imura 2006), economics (Furioli et al. 2017), electron relaxation in gases (Braglia
et al. 1981), nucleation (Reguera et al. 1998), optical bistability (Gronchi and Lugiato
1973), quantum optics (D’ariano et al. 1994), reactive systems (De Decker and Nico-
lis 2020), solid-state physics (Kumar 2013), finance (Rad et al. 2018), cognitive psy-
chology (Hadian-Rasanan et al. 2021), etc. (Risken 1989; Kazem et al. 2012a). In
fact, this equation is derived from the description of the Brownian motion of particles
(Ullersma 1966; Uhlenbeck and Ornstein 1930). For instance, one of the equations
describing the Brownian motion in potential is the Kramer equation which is a spe-
cial case of the Fokker–Planck equation (Jiménez-Aquino and Romero-Bastida 2006;
Friedrich et al. 2006). The general form of the Fokker–Planck equation is

∂u  ∂ ∂2 
= − ψ1 (x, t, u) + 2 ψ2 (x, t, u) u, (8.11)
∂t ∂x ∂x
in which ψ1 and ψ2 are the drift and diffusion coefficients, respectively. If these
coefficients depend on x and t, the PDE is called forward Kolmogorov equation
(Conze et al. 2022; Risken 1989). There is another type of Fokker–Planck equation
similar to the forward Kolmogorov equation called backward Kolmogorov equation
(Risken 1989; Flandoli and Zanco 2016) that is in the form (Parand et al. 2018):

∂u  ∂ ∂2 
= − ψ1 (x, t) + ψ2 (x, t) 2 u. (8.12)
∂t ∂x ∂x
Moreover, if ψ1 and ψ2 are dependent on u in addition to time and space, then
we have the nonlinear Fokker–Planck equation which has important applications in
biophysics (Xing et al. 2005), neuroscience (Hemami et al. 2021; Moayeri et al.
2021), engineering (Kazem et al. 2012a), laser physics (Blackmore et al. 1986),
nonlinear hydrodynamics (Zubarev and Morozov 1983), plasma physics (Peeters
and Strintzi 2008), pattern formation (Bengfort et al. 2016), and so on (Barkai 2001;
Tsurui and Ishikawa 1986).

There are some fruitful studies that explored the Fokker–Planck equation by clas-
sical numerical methods. For example, Vanaja (1992) presented an iterative algorithm
for solving this model. For example, Zorzano et al. (1999) employed the finite-difference approach for the two-dimensional Fokker–Planck equation. In 2006, Dehghan and Tatari (2006) developed He's variational iteration method (VIM) to approximate the solution of this equation. Moreover, Tatari
et al. (2007) investigated the application of the Adomian decomposition method
for solving different types of Fokker–Planck equations. One year later, Lakestani
and Dehghan (2008) obtained the numerical solution of the Fokker–Planck equation
using cubic B-spline scaling functions. In addition, Kazem et al. (2012a) proposed a
meshfree approach to solve linear and nonlinear Fokker–Planck equations. Recently,
a pseudo-spectral method was applied to approximate the solution of the Fokker–
Planck equation with high accuracy (Parand et al. 2018).
By simplifying Eq. 8.11, the Fokker–Planck equation takes the form of Eq. 8.2. It is worth mentioning that if the equation is linear, then $A(x,t,u)=0$. In the following, we present various linear and nonlinear examples, all defined over $\Omega=[0,1]$.
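The mapping from Eq. 8.11 to the $A$, $B$, $C$ coefficients of Eq. 8.2 can be checked symbolically. The helper below (our naming) expands $-\partial_x(\psi_1 u)+\partial_x^2(\psi_2 u)$ and reads off the coefficients; it is written for the case where $\psi_1,\psi_2$ do not depend on $u$ (nonlinear $\psi$'s fold extra $u_x$ terms into $B$, as in Example 3 below). For the backward form Eq. 8.12 the coefficients are simply $B=-\psi_1$ and $C=\psi_2$ with $A=0$.

```python
import sympy as sp

x, t = sp.symbols('x t')
u = sp.Function('u')(x, t)
ux, uxx = sp.diff(u, x), sp.diff(u, x, 2)

def fp_coefficients(psi1, psi2):
    """Read A, B, C off -(psi1*u)_x + (psi2*u)_xx = A*u + B*u_x + C*u_xx,
    assuming psi1 and psi2 do not depend on u."""
    rhs = sp.expand(-sp.diff(psi1 * u, x) + sp.diff(psi2 * u, x, 2))
    C = rhs.coeff(uxx)
    B = (rhs - C * uxx).coeff(ux)
    A = sp.simplify((rhs - C * uxx - B * ux) / u)
    return A, B, C

print(fp_coefficients(sp.Integer(-1), sp.Integer(1)))   # Example 1 -> (0, 1, 1)
```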
It should be noted that the average CPU times over all examples of the Fokker–Planck equation and the generalized FHN equation are about 2.48 and 2.5 s, respectively, depending on the number of spatial nodes and time steps. In addition, the time complexity of the proposed method is $O(Mn^{3})$, in which $M$ is the number of time steps and $n$ is the number of spatial nodes.

8.3.1.1 Example 1

Consider Eq. 8.11 with $\psi_1(x)=-1$ and $\psi_2(x)=1$, and initial condition $f(x)=x$. Using these fixed values, we can write $A(x,t,u)=0$, $B(x,t,u)=1$, and $C(x,t,u)=1$. The exact solution of this test problem is $u(x,t)=x+t$ (Kazem et al. 2012a; Lakestani and Dehghan 2008).
In this example, we consider $n=15$, $\Delta t=0.001$, and $\gamma=10^{15}$. Figure 8.1 shows the obtained result for $u(x,t)$ with the third-order Chebyshev kernel as an example. Also, the $L_2$ and RMS errors of different kernels at various times are represented in
Tables 8.1 and 8.2. Note that in these tables, the number next to the polynomial name
indicates its order. Moreover, the value of the decaying parameter for each kernel is
specified in the tables.

8.3.1.2 Example 2

Consider the backward Kolmogorov equation (8.12) with $\psi_1(x)=-(x+1)$, $\psi_2(x,t)=x^{2}\exp(t)$, and initial condition $f(x)=x+1$. This equation has the exact solution $u(x,t)=(x+1)\exp(t)$ (Kazem et al. 2012a; Lakestani and Dehghan 2008). In this example, we have $A(x,t,u)=0$, $B(x,t,u)=(x+1)$, and $C(x,t,u)=x^{2}\exp(t)$. The parameters are set as $n=10$, $\Delta t=0.0001$, and $\gamma=10^{15}$.

Fig. 8.1 Approximated solution of Example 1 by the third-order Chebyshev kernel and $n=10$, $\Delta t=10^{-4}$

Table 8.1 Numerical absolute errors (L 2 ) of the method for Example 1 with different kernels
t Chebyshev-3 Chebyshev-6 Legendre-3 Legendre-6
(δ = 3) (δ = 1.5) (δ = 2) (δ = 1)
0.01 2.4178e-04 0.0010 1.4747e-04 4.0514e-04
0.25 3.8678e-04 0.0018 2.5679e-04 8.1972e-04
0.5 3.8821e-04 0.0018 2.5794e-04 8.2382e-04
0.7 3.8822e-04 0.0018 2.5795e-04 8.2385e-04
1 3.8822e-04 0.0018 2.5795e-04 8.2385e-04

Table 8.2 RMS errors of the method for Example 1 with different kernels
t Chebyshev-3 Chebyshev-6 Legendre-3 Legendre-6
(δ = 3) (δ = 1.5) (δ = 3) (δ = 1)
0.01 6.2427e-05 2.6776e-04 3.8077e-05 1.0461e-04
0.25 9.9866e-05 0.0013 6.6303e-05 2.1165e-04
0.5 1.0024e-04 0.0013 6.6600e-05 2.1271e-04
0.7 1.0024e-04 0.0013 6.6602e-05 2.1272e-04
1 1.0024e-04 0.0013 6.6602e-05 2.1272e-04

Fig. 8.2 Approximated solution of Example 2 by the third-order Legendre kernel and $n=15$, $\Delta t=0.0001$

Table 8.3 Numerical absolute errors (L 2 ) of the method for Example 2 with different kernels
t Chebyshev-3 Chebyshev-6 Legendre-3 Legendre-6
(δ = 3) (δ = 1.5) (δ = 3) (δ = 1.5)
0.01 3.6904e-04 0.0018 2.0748e-04 7.1382e-04
0.25 6.0740e-04 0.0043 2.7289e-04 0.0020
0.5 8.6224e-04 0.0064 3.6194e-04 0.0031
0.7 0.0011 0.0083 4.5389e-04 0.0040
1 0.0016 0.0116 6.3598e-04 0.0055

Table 8.4 RMS errors of the method for Example 2 with different kernels
t Chebyshev-3 Chebyshev-6 Legendre-3 Legendre-6
(δ = 3) (δ = 1.5) (δ = 3) (δ = 1)
0.01 9.5285e-05 4.7187e-04 5.3572e-05 1.8431e-04
0.25 1.5683e-04 0.0011 7.0461e-05 5.1859e-04
0.5 2.2263e-04 0.0017 9.3453e-05 7.8761e-04
0.7 2.8540e-04 0.0021 1.1719e-04 0.0010
1 4.0064e-04 0.0030 1.6421e-04 0.0014

The obtained solution by the proposed method (with third-order Legendre kernel)
is demonstrated in Fig. 8.2. Additionally, Tables 8.3 and 8.4 depict the numerical
absolute errors and RMS errors of the presented method with different kernels. It
can be deduced that the Legendre kernel is generally a better option for this example.

8.3.1.3 Example 3

Consider the nonlinear Fokker–Planck equation with $\psi_1(x,t,u)=\frac{7}{2}u$, $\psi_2(x,t,u)=xu$, and initial condition $f(x)=x$. Expanding the derivatives in this equation, it can be concluded that $A(x,t,u)=0$, $B(x,t,u)=-3u+2xu_x$, and $C(x,t,u)=2xu$. The exact solution of this problem is $u(x,t)=\frac{x}{t+1}$ (Kazem et al. 2012a; Lakestani and Dehghan 2008). In order to overcome the nonlinearity of the problem, we use the values of $u$ and its derivatives from the previous steps. For this example we set $n=20$, $\Delta t=0.0001$, and $\gamma=10^{12}$.
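The quoted coefficients and exact solution can be sanity-checked symbolically; in the sketch below (our construction), `u` is the claimed closed form rather than the numerical solution, and the lagged scheme simply evaluates these $B$ and $C$ with the previous step's $u$:

```python
import sympy as sp

x, t = sp.symbols('x t')
u = x / (t + 1)                     # claimed exact solution of Example 3
ux, uxx = sp.diff(u, x), sp.diff(u, x, 2)
B = -3 * u + 2 * x * ux             # coefficients stated in the text (A = 0)
C = 2 * x * u
residual = sp.simplify(sp.diff(u, t) - (B * ux + C * uxx))
assert residual == 0                # u_t = B u_x + C u_xx holds identically
```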
The approximated solution of this example is displayed in Fig. 8.3. Also, the
numerical errors for Example 3 are demonstrated in Tables 8.5 and 8.6.

Fig. 8.3 Approximated solution of Example 3 by the sixth-order Legendre kernel and $n=20$, $\Delta t=0.0001$

Table 8.5 Numerical absolute errors (L 2 ) of the method for Example 3 with different kernels
t Chebyshev-3 Chebyshev-6 Legendre-3 Legendre-6
(δ = 4) (δ = 3) (δ = 4) (δ = 4)
0.01 3.0002e-04 0.0010 2.1898e-04 5.0419e-04
0.25 3.0314e-04 0.0011 2.1239e-04 4.5307e-04
0.5 2.1730e-04 7.8259e-04 1.4755e-04 3.3263e-04
0.7 1.7019e-04 6.2286e-04 1.1327e-04 2.6523e-04
1 1.2321e-04 4.5854e-04 8.0082e-05 1.9611e-04

Table 8.6 RMS errors of the method for Example 3 with different kernels
t Chebyshev-3 Chebyshev-6 Legendre-3 Legendre-6
(δ = 4) (δ = 1.5) (δ = 4) (δ = 4)
0.01 6.7087e-05 2.2583e-04 4.8965e-05 1.1274e-04
0.25 6.7784e-05 2.3724e-04 4.7492e-05 1.0131e-04
0.5 4.8591e-05 1.7499e-04 3.2994e-05 7.4379e-05
0.7 3.8055e-05 1.3927e-04 2.5328e-05 5.9307e-05
1 2.7551e-05 1.0253e-04 1.7907e-05 4.3852e-05

8.3.2 Generalized Fitzhugh–Nagumo Equation

The Fitzhugh–Nagumo (FHN) model is based on electrical transmission in a nerve


cell at the axon surface (FitzHugh 1961; Gordon et al. 1999). In other words, this
model is a simplified equation of the famous Hodgkin–Huxley (HH) model (Hodgkin
and Huxley 1952), because it uses a simpler structure to interpret the electrical trans-
mission at the surface of the neuron as opposed to the complex HH design (Moayeri
et al. 2020a; Hemami et al. 2019, 2020; Moayeri et al. 2020b). In electrical analysis,
the HH model is simulated using a combination of a capacitor, resistor, rheostat,
and current source elements (Hemami et al. 2019, 2020; Moayeri et al. 2020a, b),
while the FHN model uses a combination of resistor, inductor, capacitor, and diode
elements (Hemami et al. 2019, 2020; Moayeri et al. 2020a, b). Also, the HH model
involves three channels of sodium, potassium, and leak in the generation of electrical
signals and is simulated by a rheostat and battery, while the FHN model simulates
behavior by only one diode and inductor (Hemami et al. 2019, 2020; Moayeri et al.
2020a, b). For this reason, this model has attracted the attention of many researchers, and it has been used in various other fields such as flame propagation (Van Gorder 2012), logistic population growth (Appadu and Agbavon 2019), neurophysiology (Aronson and Weinberger 1975), branching Brownian motion processes (Ali et al. 2020), autocatalytic chemical reactions (Ïnan 2018), and nuclear reactor theory (Bhrawy 2013), among others (Abbasbandy 2008; Abdusalam 2004; Aronson and Weinberger 1978; Browne et al. 2008; Kawahara and Tanaka 1983; Li and Guo 2006; Jiwari et al. 2014). The classical Fitzhugh–Nagumo equation is given by (Triki and Wazwaz 2013; Abbasbandy 2008)

$$
\frac{\partial u}{\partial t}=\frac{\partial^{2}u}{\partial x^{2}}-u(1-u)(\rho-u), \tag{8.13}
$$

where 0 ≤ ρ ≤ 1, and u(x, t) is the unknown function depending on the temporal


variable t and the spatial variable x. It should be noted that this equation combines
diffusion and nonlinearity which is controlled by the term u(1 − u)(ρ − u). The
generalized FHN equation can be represented as follows (Bhrawy 2013):

$$
\frac{\partial u}{\partial t}=-v(t)\frac{\partial u}{\partial x}+\mu(t)\frac{\partial^{2}u}{\partial x^{2}}+\eta(t)\,u(1-u)(\rho-u). \tag{8.14}
$$

Table 8.7 Numerical methods used to solve different types of FHN models
Authors Method Type of FHN Year
Li and Guo (2006) First integral method 1D-FHN 2006
Abbasbandy (2008) Homotopy analysis 1D-FHN 2008
method
Olmos and Shizgal (2009) Pseudospectral method 1D- and 2D-FHN systems 2009
Hariharan and Kannan Haar wavelet method 1D-FHN 2010
(2010)
Van Gorder and Vajravelu Variational formulation Nagumo-Telegraph 2010
(2010)
Dehghan and Taleei Homotopy perturbation 1D-FHN 2010
(2010) method
Bhrawy (2013) Jacobi–Gauss–Lobatto Generalized FHN 2013
collocation
Jiwari et al. (2014) Polynomial quadrature Generalized FHN 2014
method
Moghaderi and Dehghan two-grid finite difference 1D- and 2D-FHN systems 2016
(2016) method
Kumar et al. (2018) q-homotopy analysis 1D fractional FHN 2018
method
Hemami et al. (2019) CS-RBF method 1D- and 2D-FHN systems 2019
Hemami et al. (2020) RBF-FD method 2D-FHN systems 2020
Moayeri et al. (2020a) Generalized Lagrange 1D- and 2D-FHN systems 2020
method
Moayeri et al. (2020b) Legendre spectral element 1D- and 2D-FHN systems 2020
Abdel-Aty et al. (2020) Improved B-spline method 1D fractional FHN 2020

Note that when we set $v(t)=0$, $\mu(t)=1$, and $\eta(t)=-1$ in this model, we recover Eq. 8.13.
Different types of FHN equations have been studied numerically by several
researchers, as shown in Table 8.7.

8.3.2.1 Example 1

Consider a non-classical FHN model Eq. 8.14 with $v(t)=0$, $\mu(t)=1$, $\eta(t)=-1$, $\rho=2$, initial condition $u(x,0)=\frac{1}{2}+\frac{1}{2}\tanh\!\big(\frac{x}{2\sqrt{2}}\big)$, and domain $(x,t)\in[-10,10]\times[0,1]$. In this example, we have $A(x,t,u)=-\rho+(1-\rho)u-u^{2}$, $B(x,t,u)=0$, and $C(x,t,u)=-1$. The exact solution of this test problem is

$$
u(x,t)=\frac{1}{2}+\frac{1}{2}\tanh\!\Big(\frac{x-\frac{2\rho-1}{\sqrt{2}}\,t}{2\sqrt{2}}\Big)
$$

(Bhrawy 2013; Jiwari et al. 2014; Wazwaz 2007).

Fig. 8.4 Approximated solution of Example 1 by the sixth-order Legendre kernel and $n=40$, $\Delta t=2.5\times10^{-4}$

Table 8.8 Numerical absolute errors (L 2 ) of the method for Example 1 with different kernels
t δ Chebyshev-3 Chebyshev-6 Legendre-3 Legendre-6
0.01 30 9.8330e-07 9.6655e-07 9.8167e-07 9.7193e-07
0.25 30 2.4919e-05 2.4803e-05 2.5029e-05 2.4956e-05
0.50 30 5.1011e-05 5.0812e-05 5.1289e-05 5.1132e-05
0.70 30 7.2704e-05 7.2357e-05 7.3108e-05 7.2815e-05
1.00 30 1.0648e-04 1.0576e-04 1.0706e-04 1.0644e-04
0.01 50 4.7375e-06 1.4557e-06 3.9728e-06 1.3407e-06
0.25 50 5.2206e-05 2.6680e-05 4.4305e-05 2.6237e-05
0.50 50 6.9759e-05 5.1818e-05 6.3688e-05 5.1808e-05
0.07 50 8.6685e-05 7.3064e-05 8.2108e-05 7.3286e-05
1.00 50 1.1599e-04 1.0624e-04 1.1299e-04 1.0675e-04

In this example, we set $n=40$, $\Delta t=2.5\times10^{-4}$, and $\gamma=10^{15}$. Figure 8.4 shows the obtained result for $u(x,t)$ with the sixth-order Legendre kernel as an example. Also, the $L_2$ and RMS errors of different kernels at various times are represented in Tables 8.8 and 8.9.

Table 8.9 RMS errors of the method for Example 1 with different kernels
t δ Chebyshev-3 Chebyshev-6 Legendre-3 Legendre-6
0.01 30 1.5547e-07 1.5283e-07 1.5522e-07 1.5368e-07
0.25 30 7.8801e-07 7.8433e-07 7.9149e-07 7.8917e-07
0.50 30 1.1406e-06 1.1362e-06 1.1469e-06 1.1433e-06
0.70 30 1.3740e-06 1.3674e-06 1.3816e-06 1.3761e-06
1.00 30 1.6836e-06 1.6723e-06 1.6927e-06 1.6829e-06
0.01 50 7.4906e-07 2.3017e-07 6.2816e-07 2.1198e-07
0.25 50 1.6509e-06 8.4368e-07 1.4010e-06 8.2970e-07
0.50 50 1.5598e-06 1.1587e-06 1.4241e-06 1.1585e-06
0.07 50 1.6382e-06 1.3808e-06 1.5517e-06 1.3850e-06
1.00 50 1.8340e-06 1.6798e-06 1.7865e-06 1.6878e-06

8.3.2.2 Example 2

Consider the Fisher type of the classical FHN equation with $v(t)=0$, $\mu(t)=1$, $\eta(t)=-1$, $\rho=\frac{1}{2}$, initial condition $u(x,0)=\frac{3}{4}+\frac{1}{4}\tanh\!\big(\frac{\sqrt{2}}{8}x\big)$, and domain $(x,t)\in[0,1]\times[0,1]$. So, we have $A(x,t,u)=-\rho+(1-\rho)u-u^{2}$, $B(x,t,u)=0$, and $C(x,t,u)=-1$. This equation has the exact solution

$$
u(x,t)=\frac{1+\rho}{2}+\Big(\frac{1}{2}-\frac{\rho}{2}\Big)\tanh\!\Big(\frac{\sqrt{2}(1-\rho)}{4}\,x+\frac{1-\rho^{2}}{4}\,t\Big)
$$

(Kawahara and Tanaka 1983; Jiwari et al. 2014; Wazwaz and Gorguis 2004).

Fig. 8.5 Approximated solution of Example 2 by the sixth-order Legendre kernel and $n=40$, $\Delta t=2.5\times10^{-4}$

Table 8.10 Numerical absolute errors (L 2 ) of the method for Example 2 with different kernels
t δ Chebyshev-3 Chebyshev-6 Legendre-3 Legendre-6
0.01 8e3 2.1195e-06 1.7618e-06 1.7015e-06 1.5138e-06
0.25 8e3 3.5893e-06 3.1130e-06 2.0973e-06 2.7182e-06
0.50 8e3 3.5383e-06 2.9330e-06 2.0790e-06 2.6138e-06
0.70 8e3 3.4609e-06 2.9069e-06 1.8524e-06 3.1523e-06
1.00 8e3 3.3578e-06 2.7996e-06 2.3738e-06 2.4715e-06
0.01 1e4 9.1689e-06 9.2317e-06 9.1726e-06 9.1931e-06
0.25 1e4 1.5310e-05 1.5417e-05 1.5318e-05 1.5352e-05
0.50 1e4 1.5089e-05 1.5195e-05 1.5097e-05 1.5130e-05
0.70 1e4 1.4869e-05 1.4974e-05 1.4876e-05 1.4909e-05
1.00 1e4 1.4472e-05 1.4575e-05 1.4479e-05 1.4512e-05

Table 8.11 RMS errors of the method for Example 2 with different kernels
t δ Chebyshev-3 Chebyshev-6 Legendre-3 Legendre-6
0.01 8e3 3.3513e-07 2.7856e-07 2.6904e-07 2.3935e-07
0.25 8e3 5.6752e-07 4.9222e-07 3.3162e-07 4.2978e-07
0.50 8e3 5.5945e-07 4.6375e-07 3.2871e-07 4.1328e-07
0.70 8e3 5.4722e-07 4.5962e-07 2.9290e-07 4.9843e-07
1.00 8e3 5.3091e-07 4.4266e-07 3.7532e-07 3.9077e-07
0.01 1e4 1.4497e-06 1.4597e-06 1.4503e-06 1.4536e-06
0.25 1e4 2.4208e-06 2.4376e-06 2.4220e-06 2.4273e-06
0.50 1e4 2.3858e-06 2.4025e-06 2.3870e-06 2.3923e-06
0.70 1e4 2.3510e-06 2.3675e-06 2.3520e-06 2.3574e-06
1.00 1e4 2.2882e-06 2.3045e-06 2.2893e-06 2.2946e-06

In this example, we assume $n=40$, $\Delta t=2.5\times10^{-4}$, and $\gamma=10^{15}$. Figure 8.5 shows the obtained result for $u(x,t)$ with the sixth-order Legendre kernel as an example. Also, the $L_2$ and RMS errors of different kernels at various times are represented in Tables 8.10 and 8.11.

8.3.2.3 Example 3

Consider the nonlinear time-dependent generalized Fitzhugh–Nagumo equation with $v(t)=\cos(t)$, $\mu(t)=\cos(t)$, $\eta(t)=-2\cos(t)$, $\rho=\frac{3}{4}$, initial condition $u(x,0)=\frac{3}{8}+\frac{3}{8}\tanh\!\big(\frac{3x}{8}\big)$, and domain $(x,t)\in[-10,10]\times[0,1]$. In addition, we have $A(x,t,u)=-\rho+(1-\rho)u-u^{2}$, $B(x,t,u)=1$, and $C(x,t,u)=-1$. This equation has the exact solution $u(x,t)=\frac{\rho}{2}+\frac{\rho}{2}\tanh\!\big(\frac{\rho}{2}\big(x-(3-\rho)\sin(t)\big)\big)$ (Bhrawy

Fig. 8.6 Approximated solution of Example 3 by the sixth-order Legendre kernel and $n=40$, $\Delta t=2.5\times10^{-4}$

Table 8.12 Numerical absolute errors (L 2 ) of the method for Example 3 with different kernels
t δ Chebyshev-3 Chebyshev-6 Legendre-3 Legendre-6
0.01 80 4.6645e-05 1.1548e-04 3.2880e-05 2.3228e-04
0.25 80 2.3342e-04 1.7331e-04 1.8781e-04 3.2000e-03
0.50 80 2.4288e-04 1.7697e-04 1.9577e-04 4.7000e-03
0.70 80 2.4653e-04 1.6815e-04 1.9935e-04 5.4000e-03
1.00 80 2.5033e-04 1.4136e-04 2.0333e-04 5.6000e-03
0.01 100 6.2498e-05 1.1966e-04 4.8828e-05 1.9291e-04
0.25 100 3.6141e-04 1.9420e-04 2.9325e-04 2.5000e-03
0.50 100 3.7364e-04 1.9940e-04 3.0328e-04 3.5000e-03
0.70 100 3.7937e-04 1.9260e-04 3.0863e-04 3.9000e-03
1.00 100 3.8635e-04 1.7084e-04 3.1527e-04 4.0000e-03

2013; Jiwari et al. 2014; Triki and Wazwaz 2013). In this example, we set $n=40$, $\Delta t=2.5\times10^{-4}$, and $\gamma=10^{15}$. Figure 8.6 shows the obtained result for $u(x,t)$ with the sixth-order Legendre kernel as an example. Also, the $L_2$ and RMS errors of different kernels at various times are represented in Tables 8.12 and 8.13.
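Even with the time-dependent coefficients, the quoted exact solution can be verified symbolically; the sketch below (our construction, with $\rho=\frac34$) checks that the residual of Eq. 8.14 with $v=\mu=\cos(t)$ and $\eta=-2\cos(t)$ vanishes numerically:

```python
import sympy as sp

x, t = sp.symbols('x t')
rho = sp.Rational(3, 4)
u = rho / 2 * (1 + sp.tanh(rho / 2 * (x - (3 - rho) * sp.sin(t))))
# Eq. 8.14 with v = mu = cos(t), eta = -2 cos(t):
residual = (sp.diff(u, t) + sp.cos(t) * sp.diff(u, x)
            - sp.cos(t) * sp.diff(u, x, 2)
            + 2 * sp.cos(t) * u * (1 - u) * (rho - u))
f = sp.lambdify((x, t), residual)
assert all(abs(f(xv, tv)) < 1e-12
           for xv, tv in [(-2.0, 0.3), (0.5, 0.7), (4.0, 1.0)])
```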

Table 8.13 RMS errors of the method for Example 3 with different kernels
t δ Chebyshev-3 Chebyshev-6 Legendre-3 Legendre-6
0.01 80 7.3752e-06 1.8259e-05 5.1988e-06 3.6726e-05
0.25 80 3.6907e-05 2.7403e-05 2.9695e-05 5.0071e-04
0.50 80 3.8403e-05 2.7981e-05 3.0954e-05 7.4712e-04
0.70 80 3.8980e-05 2.6586e-05 3.1521e-05 8.4603e-04
1.00 80 3.9580e-05 2.2351e-05 3.2149e-05 8.8481e-04
0.01 100 9.8818e-06 1.8921e-05 7.7204e-06 3.0502e-05
0.25 100 5.7144e-05 3.0706e-05 4.6367e-05 3.9744e-04
0.50 100 5.9077e-05 3.1528e-05 4.7954e-05 5.5947e-04
0.70 100 5.9983e-05 3.0452e-05 4.8798e-05 6.1710e-04
1.00 100 6.1087e-05 2.7012e-05 4.9849e-05 6.2629e-04

8.4 Conclusion

In this chapter, we introduced a machine learning-based numerical algorithm called


the least squares support vector machine approach to solve partial differential equa-
tions. First, the temporal dimension was discretized by the Crank–Nicolson method.
Then, the collocation LS-SVM approach was applied to the obtained equations at
each time step. By employing the dual form, the problem was converted to a system
of algebraic equations that can be solved by standard solvers. The modified Cheby-
shev and Legendre orthogonal kernel functions were used in LS-SVM formulations.
In order to evaluate the effectiveness of the proposed method, it was applied to two
well-known second-order partial differential equations, i.e. Fokker–Planck and gen-
eralized Fitzhugh–Nagumo equations. The obtained results from various orthogonal
kernels were reported in terms of numerical absolute error and root-mean-square
error. It can be concluded that the proposed approach has acceptable accuracy and is
effective to solve linear and nonlinear partial differential equations.

References

Aarts, L.P., Van Der Veer, P.: Neural network method for solving partial differential equations.
Neural Proc. Lett. 14, 261–271 (2001)
Abazari, R., Yildirim, K.: Numerical study of Sivashinsky equation using a splitting scheme based
on Crank-Nicolson method. Math. Method. Appl. Sci. 16, 5509–5521 (2019)
Abbasbandy, S.: Soliton solutions for the Fitzhugh-Nagumo equation with the homotopy analysis
method. Appl. Math. Modell. 32, 2706–2714 (2008)
Abbasbandy, S., Shirzadi, A.: A meshless method for two-dimensional diffusion equation with an
integral condition. Eng. Anal. Bound. Elem. 34, 1031–1037 (2010)
Abbasbandy, S., Shirzadi, A.: MLPG method for two-dimensional diffusion equation with Neu-
mann’s and non-classical boundary conditions. Appl. Numer. Math. 61, 170–180 (2011)

Abbaszadeh, M., Dehghan, M.: Simulation flows with multiple phases and components via the radial
basis functions-finite difference (RBF-FD) procedure: Shan-Chen model. Eng. Anal. Bound.
Elem. 119, 151–161 (2020a)
Abbaszadeh, M., Dehghan, M.: Direct meshless local Petrov-Galerkin method to investigate
anisotropic potential and plane elastostatic equations of anisotropic functionally graded mate-
rials problems. Eng. Anal. Bound. Elem. 118, 188–201 (2020b)
Abdel-Aty, A.H., Khater, M., Baleanu, D., Khalil, E.M., Bouslimi, J., Omri, M.: Abundant distinct
types of solutions for the nervous biological fractional FitzHugh-Nagumo equation via three
different sorts of schemes. Adv. Diff. Eq. 476, 1–17 (2020)
Abdusalam, H.A.: Analytic and approximate solutions for Nagumo telegraph reaction diffusion
equation. Appl. Math. Comput. 157, 515–522 (2004)
Ali, H., Kamrujjaman, M., Islam, M.S.: Numerical computation of FitzHugh-Nagumo equation: a
novel Galerkin finite element approach. Int. J. Math. Res. 9, 20–27 (2020)
Appadu, A.R., Agbavon, K.M.: Comparative study of some numerical methods for FitzHugh-
Nagumo equation. AIP Conference Proceedings, AIP Publishing LLC, vol. 2116 (2019), p.
030036
Aronson, D.G., Weinberger, H.F.: Nonlinear diffusion in population genetics, combustion, and nerve
pulse propagation. Partial differential equations and related topics. Springer, Berlin (1975), pp.
5–49
Aronson, D.G., Weinberger, H.F.: Multidimensional nonlinear diffusion arising in population genet-
ics. Adv. Math. 30, 33–76 (1978)
Asghari, M., Hadian Rasanan, A.H., Gorgin, S., Rahmati, D., Parand, K.: FPGA-orthopoly: a
hardware implementation of orthogonal polynomials. Eng. Comput. (inpress) (2022)
Asouti, V.G., Trompoukis, X.S., Kampolis, I.C., Giannakoglou, K.C.: Unsteady CFD computations
using vertex-centered finite volumes for unstructured grids on graphics processing units. Int. J.
Numer. Methods Fluids 67, 232–246 (2011)
Barkai, E.: Fractional Fokker-Planck equation, solution, and application. Phys. Rev. E. 63, 046118
(2001)
Bath, K.J., Wilson, E.: Numerical Methods in Finite Element Analysis. Prentice Hall, New Jersey
(1976)
Belytschko, T., Lu, Y.Y., Gu, L.: Element-free Galerkin methods. Int. J. Numer. Methods Eng. 37,
229–256 (1994)
Bengfort, M., Malchow, H., Hilker, F.M.: The Fokker-Planck law of diffusion and pattern formation
in heterogeneous environments. J. Math. Biol. 73, 683–704 (2016)
Bertolazzi, E., Manzini, G.: A cell-centered second-order accurate finite volume method for
convection-diffusion problems on unstructured meshes. Math. Models Methods Appl. Sci. 14,
1235–1260 (2001)
Bhrawy, A.H.: A Jacobi-Gauss-Lobatto collocation method for solving generalized Fitzhugh-
Nagumo equation with time-dependent coefficients. Appl. Math. Comput. 222, 255–264 (2013)
Bhrawy, A.H., Baleanu, D.: A spectral Legendre-Gauss-Lobatto collocation method for a space-
fractional advection diffusion equations with variable coefficients. Reports Math. Phy. 72, 219–
233 (2013)
Blackmore, R., Weinert, U., Shizgal, B.: Discrete ordinate solution of a Fokker-Planck equation in
laser physics. Transport Theory Stat. Phy. 15, 181–210 (1986)
Bossavit, A., Vérité, J.C.: A mixed FEM-BIEM method to solve 3-D eddy-current problems. IEEE
Trans. Magn. 18, 431–435 (1982)
Braglia, G.L., Caraffini, G.L., Diligenti, M.: A study of the relaxation of electron velocity distribu-
tions in gases. Il Nuovo Cimento B 62, 139–168 (1981)
Brink, A.R., Najera-Flores, D.A., Martinez, C.: The neural network collocation method for solving
partial differential equations. Neural Comput. App. 33, 5591–5608 (2021)
Browne, P., Momoniat, E., Mahomed, F.M.: A generalized Fitzhugh-Nagumo equation. Nonlinear
Anal. Theory Methods Appl. 68, 1006–1015 (2008)

Bruggi, M., Venini, P.: A mixed FEM approach to stress-constrained topology optimization. Int.
J. Numer. Methods Eng. 73, 1693–1714 (2008)
Cai, S., Mao, Z., Wang, Z., Yin, M., Karniadakis, G.E.: Physics-informed neural networks (PINNs)
for fluid mechanics: a review. Acta Mech. Sinica. 1–12 (2022)
Carstensen, C., Köhler, K.: Nonconforming FEM for the obstacle problem. IMA J. Numer. Anal.
37, 64–93 (2017)
Chauviére, C., Lozinski, A.: Simulation of dilute polymer solutions using a Fokker-Planck equation.
Comput. Fluids. 33, 687–696 (2004)
Chavanis, P.H.: Nonlinear mean-field Fokker-Planck equations and their applications in physics,
astrophysics and biology. Comptes. Rendus. Phys. 7, 318–330 (2006)
Chen, Y., Yi, N., Liu, W.: A Legendre-Galerkin spectral method for optimal control problems
governed by elliptic equations. SIAM J. Numer. Anal. 46, 2254–2275 (2008)
Cheung, K.C., See, S.: Recent advance in machine learning for partial differential equation. CCF
Trans. High Perf. Comput. 3, 298–310 (2021)
Chien, C.C., Wu, T.Y.: A particular integral BEM/time-discontinuous FEM methodology for solving
2-D elastodynamic problems. Int. J. Solids Struct. 38, 289–306 (2001)
Conze, A., Lantos, N., Pironneau, O.: The forward Kolmogorov equation for two dimensional
options. Commun. Pure Appl. Anal. 8, 195 (2009)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)
D’ariano, G.M., Macchiavello, C., Moroni, S.: On the monte carlo simulation approach to Fokker-
Planck equations in quantum optics. Modern Phys. Lett. B. 8, 239–246 (1994)
De Decker, Y., Nicolis, G.: On the Fokker-Planck approach to the stochastic thermodynamics of
reactive systems. Physica A: Stat. Mech. Appl. 553, 124269 (2020)
Dehghan, M., Narimani, N.: The element-free Galerkin method based on moving least squares
and moving Kriging approximations for solving two-dimensional tumor-induced angiogenesis
model. Eng. Comput. 36, 1517–1537 (2020)
Dehghan, M., Shokri, A.: A numerical method for solution of the two-dimensional sine-Gordon
equation using the radial basis functions. Math. Comput. Simul. 79, 700–715 (2008)
Dehghan, M., Taleei, A.: A compact split-step finite difference method for solving the nonlinear
Schrödinger equations with constant and variable coefficients. Comput. Phys. Commun. 181,
80–90 (2010)
Dehghan, M., Tatari, M.: Numerical solution of two dimensional Fokker-Planck equations. Phys.
Scr. 74, 310–316 (2006)
Dehghan, M., Manafian Heris, J., Saadatmandi, A.: Application of semi-analytic methods for the
Fitzhugh-Nagumo equation, which models the transmission of nerve impulses. Math. Methods
Appl. Sci. 33, 1384–1398 (2010)
Delkhosh, M., Parand, K.: A new computational method based on fractional Lagrange functions to
solve multi-term fractional differential equations. Numer. Algor. 88, 729–766 (2021)
Dubois, F.: Finite volumes and mixed Petrov-Galerkin finite elements: the unidimensional problem.
Numer. Methods Partial Diff. Eq. Int. J. 16, 335–360 (2000)
Eymard, R., Gallouët, T., Herbin, R.: Finite volume methods. Handbook of Numerical Analysis,
vol. 7 (2000), pp. 713–1018
Fallah, N.: A cell vertex and cell centred finite volume method for plate bending analysis. Comput.
Methods Appl. Mech. Eng. 193, 3457–3470 (2004)
Fallah, N.A., Bailey, C., Cross, M., Taylor, G.A.: Comparison of finite element and finite volume
methods application in geometrically nonlinear stress analysis. Appl. Math. Model. 24, 439–455
(2000)
Fasshauer, G.E.: Meshfree Approximation Methods with MATLAB. World Scientific, Singapore
(2007)
FitzHugh, R.: Impulses and physiological states in theoretical models of nerve membrane. Biophys.
J. 1, 445–466 (1961)
Flandoli, F., Zanco, G.: An infinite-dimensional approach to path-dependent Kolmogorov equations.
Annals Probab. 44, 2643–2693 (2016)
8 Solving Partial Differential Equations by LS-SVM 193
Frank, T.D., Beek, P.J., Friedrich, R.: Fokker-Planck perspective on stochastic delay systems: Exact
solutions and data analysis of biological systems. Phys. Rev. E 68, 021912
(2003)
Friedrich, R., Jenko, F., Baule, A., Eule, S.: Exact solution of a generalized Kramers-Fokker-Planck
equation retaining retardation effects. Phys. Rev. E. 74, 041103 (2006)
Furioli, G., Pulvirenti, A., Terraneo, E., Toscani, G.: Fokker-Planck equations in the modeling of
socio-economic phenomena. Math. Models Methods Appl. Sci. 27, 115–158 (2017)
Gamba, I.M., Rjasanow, S.: Galerkin-Petrov approach for the Boltzmann equation. J. Comput. Phys.
366, 341–365 (2018)
Ghidaglia, J.M., Kumbaro, A., Le Coq, G.: On the numerical solution to two fluid models via a cell
centered finite volume method. Eur. J. Mech. B Fluids. 20, 841–867 (2001)
Gordon, A., Vugmeister, B.E., Dorfman, S., Rabitz, H.: Impulses and physiological states in theo-
retical models of nerve membrane. Biophys. J. 233, 225–242 (1999)
Grima, R., Thomas, P., Straube, A.V.: How accurate are the nonlinear chemical Fokker-Planck and
chemical Langevin equations? J. Chem. Phys. 135, 084103 (2011)
Gronchi, M., Lugiato, A.: Fokker-Planck equation for optical bistability. Lettere Al Nuovo Cimento
23, 593–598 (1973)
Hadian-Rasanan, A.H., Bajalan, N., Parand, K., Rad, J.A.: Simulation of nonlinear fractional dynamics
arising in the modeling of cognitive decision making using a new fractional neural network. Math.
Methods Appl. Sci. 43, 1437–1466 (2020)
Hadian-Rasanan, A.H., Rad, J.A., Sewell, D.K.: Are there jumps in evidence accumulation, and
what, if anything, do they reflect psychologically? An analysis of Lévy-Flights models of decision-
making. PsyArXiv (2021). https://doi.org/10.31234/osf.io/vy2mh
Hadian-Rasanan, A.H., Rahmati, D., Girgin, S., Parand, K.: A single layer fractional orthogonal
neural network for solving various types of Lane-Emden equation. New Astron. 75, 101307
(2019)
Hajimohammadi, Z., Shekarpaz, S., Parand, K.: The novel learning solutions to nonlinear differential
models on a semi-infinite domain. Eng. Comput. 1–18 (2022)
Hajimohammadi, Z., Parand, K.: Numerical learning approximation of time-fractional sub diffusion
model on a semi-infinite domain. Chaos Solitons Frac. 142, 110435 (2021)
Hariharan, G., Kannan, K.: Haar wavelet method for solving FitzHugh-Nagumo equation. Int. J.
Math. Comput. Sci. 4, 909–913 (2010)
Hemami, M., Parand, K., Rad, J.A.: Numerical simulation of reaction-diffusion neural dynamics
models and their synchronization/desynchronization: application to epileptic seizures. Comput.
Math. Appl. 78, 3644–3677 (2019)
Hemami, M., Rad, J.A., Parand, K.: The use of space-splitting RBF-FD technique to simulate the
controlled synchronization of neural networks arising from brain activity modeling in epileptic
seizures. J. Comput. Sci. 42, 101090 (2020)
Hemami, M., Rad, J.A., Parand, K.: Phase distribution control of neural oscillator populations using
local radial basis function meshfree technique with application in epileptic seizures: A numerical
simulation approach. Commun. Nonlinear SCI. Numer. Simul. 103, 105961 (2021)
Heydari, M.H., Avazzadeh, Z.: Chebyshev-Gauss-Lobatto collocation method for variable-order
time fractional generalized Hirota-Satsuma coupled KdV system. Eng. Comput. 1–10 (2020)
Hodgkin, A.L., Huxley, A.F.: Currents carried by sodium and potassium ions through the membrane
of the giant axon of Loligo. J. Physiol. 116, 449–472 (1952)
Hughes, T.J.: The finite element method: linear static and dynamic finite element analysis. Courier
Corporation, Chelmsford (2012)
İnan, B.: A finite difference method for solving generalized FitzHugh-Nagumo equation. AIP Con-
ference Proceedings, AIP Publishing LLC, vol. 1926 (2018), p. 020018
Jiménez-Aquino, J.I., Romero-Bastida, M.: Fokker-Planck-Kramers equation for a Brownian gas
in a magnetic field. Phys. Rev. E. 74, 041117 (2006)
194 M. M. Moayeri and M. Hemami
Jiwari, R., Gupta, R.K., Kumar, V.: Polynomial differential quadrature method for numerical solu-
tions of the generalized Fitzhugh-Nagumo equation with time-dependent coefficients. Ain Shams
Eng. J. 5, 1343–1350 (2014)
Jordan, M.I., Mitchell, T.M.: Machine learning: trends, perspectives, and prospects. Science 349,
255–260 (2015)
Kadeethumm, T., O’Malley, D., Fuhg, J.N., Choi, Y., Lee, J., Viswanathan, H.S., Bouklas, N.: A
framework for data-driven solution and parameter estimation of PDEs using conditional genera-
tive adversarial networks. Nat. Comput. Sci. 1, 819–829 (2021)
Kanschat, G.: Multilevel methods for discontinuous Galerkin FEM on locally refined meshes.
Comput. Struct. 82, 2437–2445 (2004)
Karniadakis, G.E., Sherwin, S.J.: Spectral/hp Element Methods for Computational Fluid Dynamics.
Oxford University Press, New York (2005)
Kassab, A., Divo, E., Heidmann, J., Steinthorsson, E., Rodriguez, F.: BEM/FVM conjugate heat
transfer analysis of a three-dimensional film cooled turbine blade. Int. J. Numer. Methods Heat
Fluid Flow. 13, 581–610 (2003)
Kawahara, T., Tanaka, M.: Interactions of traveling fronts: an exact solution of a nonlinear diffusion
equation. Phys. Lett. A 97, 311–314 (1983)
Kazem, S., Rad, J.A.: Radial basis functions method for solving of a non-local boundary value
problem with Neumann’s boundary conditions. Appl. Math. Modell. 36, 2360–2369 (2012)
Kazem, S., Rad, J.A., Parand, K.: Radial basis functions methods for solving Fokker-Planck equa-
tion. Eng. Anal. Bound. Elem. 36, 181–189 (2012a)
Kazem, S., Rad, J.A., Parand, K.: A meshless method on non-Fickian flows with mixing length
growth in porous media based on radial basis functions: a comparative study. Comput. Math.
Appl. 64, 399–412 (2012b)
Kogut, P.I., Kupenko, O.P.: On optimal control problem for an ill-posed strongly nonlinear elliptic
equation with p-Laplace operator and L^1-type of nonlinearity. Discrete Cont. Dyn-B 24, 1273–
1295 (2019)
Kopriva, D.: Implementing Spectral Methods for Partial Differential Equations. Springer, Berlin
(2009)
Kumar, S.: Numerical computation of time-fractional Fokker-Planck equation arising in solid state
physics and circuit theory. Z NATURFORSCH A. 68, 777–784 (2013)
Kumar, D., Singh, J., Baleanu, D.: A new numerical algorithm for fractional Fitzhugh-Nagumo
equation arising in transmission of nerve impulses. Nonlinear Dyn. 91, 307–317 (2018)
Lakestani, M., Dehghan, M.: Numerical solution of Fokker-Planck equation using the cubic B-spline
scaling functions. Numer. Method. Part. D. E. 25, 418–429 (2008)
Latifi, S., Delkhosh, M.: Generalized Lagrange Jacobi-Gauss-Lobatto vs Jacobi-Gauss-Lobatto col-
location approximations for solving (2 + 1)-dimensional Sine-Gordon equations. Math. Methods
Appl. Sci. 43, 2001–2019 (2020)
Lee, Y.Y., Ruan, S.J., Chen, P.C.: Predictable coupling effect model for global placement using
generative adversarial networks with an ordinary differential equation solver. IEEE Trans. Circuits
Syst. II: Express Briefs (2021), pp. 1–5
LeVeque, R.J.: Finite Volume Methods for Hyperbolic Problems. Cambridge University Press,
Cambridge (2002)
Li, H., Guo, Y.: New exact solutions to the FitzHugh-Nagumo equation. Appl. Math. Comput. 180,
524–528 (2006)
Liaqat, A., Fukuhara, M., Takeda, T.: Application of neural network collocation method to data
assimilation. Computer Phys. Commun. 141, 350–364 (2001)
Lindqvist, P.: Notes on the Stationary p-Laplace Equation. Springer International Publishing, Berlin
(2019)
Liu, G.R.: Mesh Free Methods: Moving Beyond the Finite Element Method. CRC Press, Florida
(2003)
Liu, G.R., Gu, Y.T.: A local radial point interpolation method (LRPIM) for free vibration analyses
of 2-D solids. J. Sound Vib. 246, 29–46 (2001)
Liu, J., Hao, Y.: Crank-Nicolson method for solving uncertain heat equation. Soft Comput. 26,
937–945 (2022)
Liu, G.R., Zhang, G.Y., Gu, Y., Wang, Y.Y.: A meshfree radial point interpolation method (RPIM)
for three-dimensional solids. Comput. Mech. 36, 421–430 (2005)
Liu, F., Zhuang, P., Turner, I., Burrage, K., Anh, V.: A new fractional finite volume method for
solving the fractional diffusion equation. Appl. Math. Model. 38, 3871–3878 (2014)
Lu, Y., Lu, J., Wang, M.: The Deep Ritz Method: a priori generalization analysis of the deep
Ritz method for solving high dimensional elliptic partial differential equations. Conference on
Learning Theory, PMLR (2021), pp. 3196–3241
Mai-Duy, N.: An effective spectral collocation method for the direct solution of high-order ODEs.
Commun. Numer. Methods Eng. 22, 627–642 (2006)
Meerschaert, M.M., Tadjeran, C.: Finite difference approximations for two-sided space-fractional
partial differential equations. Appl. Numer. Math. 56, 80–90 (2006)
Mehrkanoon, S., Suykens, J.A.K.: Approximate solutions to ordinary differential equations using
least squares support vector machines. IEEE Trans. Neural Netw. Learn. Syst. 23, 1356–1362
(2012)
Mehrkanoon, S., Suykens, J.A.K.: Learning solutions to partial differential equations using LS-
SVM. Neurocomputing 159, 105–116 (2015)
Moayeri, M.M., Hadian-Rasanan, A.H., Latifi, S., Parand, K., Rad, J.A.: An efficient space-splitting
method for simulating brain neurons by neuronal synchronization to control epileptic activity.
Eng. Comput. 1–28 (2020a)
Moayeri, M.M., Rad, J.A., Parand, K.: Dynamical behavior of reaction-diffusion neural networks
and their synchronization arising in modeling epileptic seizure: A numerical simulation study.
Comput. Math. Appl. 80, 1887–1927 (2020b)
Moayeri, M.M., Rad, J.A., Parand, K.: Desynchronization of stochastically synchronized neural
populations through phase distribution control: a numerical simulation approach. Nonlinear Dyn.
104, 2363–2388 (2021)
Moghaderi, H., Dehghan, M.: Mixed two-grid finite difference methods for solving one-dimensional
and two-dimensional Fitzhugh-Nagumo equations. Math. Methods Appl. Sci. 40, 1170–1200
(2016)
Mohammadi, V., Dehghan, M.: Simulation of the phase field Cahn-Hilliard and tumor growth
models via a numerical scheme: element-free Galerkin method. Comput. Methods Appl. Mech.
Eng. 345, 919–950 (2019)
Mohammadi, V., Dehghan, M.: A meshless technique based on generalized moving least squares
combined with the second-order semi-implicit backward differential formula for numerically
solving time-dependent phase field models on the spheres. Appl. Numer. Math. 153, 248–275
(2020)
Mohammadi, V., Dehghan, M., De Marchi, S.: Numerical simulation of a prostate tumor growth
model by the RBF-FD scheme and a semi-implicit time discretization. J. Comput. Appl. Math.
388, 113314 (2021)
Moosavi, M.R., Khelil, A.: Accuracy and computational efficiency of the finite volume method
combined with the meshless local Petrov-Galerkin in comparison with the finite element method
in elasto-static problem. ICCES 5, 211–238 (2008)
Olmos, D., Shizgal, B.D.: Pseudospectral method of solution of the Fitzhugh-Nagumo equation.
Math. Comput. Simul. 79, 2258–2278 (2009)
Ottosen, N., Petersson, H., Saabye, N.: Introduction to the Finite Element Method. Prentice Hall,
New Jersey (1992)
Ozer, S., Chen, C.H., Cirpan, H.A.: A set of new Chebyshev kernel functions for support vector
machine pattern classification. Pattern Recogn. 44, 1435–1447 (2011)
Pang, G., Lu, L., Karniadakis, G.E.: fPINNs: fractional physics-informed neural networks. SIAM
J. Sci. Comput. 41, A2603–A2626 (2019)
Parand, K., Rad, J.A.: Kansa method for the solution of a parabolic equation with an unknown
spacewise-dependent coefficient subject to an extra measurement. Comput. Phys. Commun. 184,
582–595 (2013)
Parand, K., Hemami, M., Hashemi-Shahraki, S.: Two meshfree numerical approaches for solving
high-order singular Emden-Fowler type equations. Int. J. Appl. Comput. Math. 3, 521–546 (2017)
Parand, K., Latifi, S., Moayeri, M.M., Delkhosh, M.: Generalized Lagrange Jacobi Gauss-Lobatto
(GLJGL) collocation method for solving linear and nonlinear Fokker-Planck equations. Eng.
Anal. Bound. Elem. 69, 519–531 (2018)
Parand, K., Aghaei, A.A., Jani, M., Ghodsi, A.: Parallel LS-SVM for the numerical simulation of
fractional Volterra’s population model. Alexandria Eng. J. 60, 5637–5647 (2021a)
Parand, K., Aghaei, A.A., Jani, M., Ghodsi, A.: A new approach to the numerical solution of
Fredholm integral equations using least squares-support vector regression. Math Comput. Simul.
180, 114–128 (2021b)
Peeters, A.G., Strintzi, D.: The Fokker-Planck equation, and its application in plasma physics.
Annalen der Physik. 17, 142–157 (2008)
Pozrikidis, C.: Introduction to Finite and Spectral Element Methods Using MATLAB, 2nd edn.
Oxford CRC Press (2014)
Qin, C., Wu, Y., Springenberg, J.T., Brock, A., Donahue, J., Lillicrap, T., Kohli, P.: Training gener-
ative adversarial networks by solving ordinary differential equations. Adv. Neural Inf. Process.
Syst. 33, 5599–5609 (2020)
Rad, J.A., Ballestra, L.V.: Pricing European and American options by radial basis point interpolation.
Appl. Math. Comput. 251, 363–377 (2015)
Rad, J.A., Parand, K.: Numerical pricing of American options under two stochastic factor models
with jumps using a meshless local Petrov-Galerkin method. Appl. Numer. Math. 115, 252–274
(2017a)
Rad, J.A., Parand, K.: Pricing American options under jump-diffusion models using local weak
form meshless techniques. Int. J. Comput. Math. 94, 1694–1718 (2017b)
Rad, J.A., Kazem, S., Parand, K.: A numerical solution of the nonlinear controlled Duffing oscillator
by radial basis functions. Comput. Math. Appl. 64, 2049–2065 (2012)
Rad, J.A., Kazem, S., Parand, K.: Optimal control of a parabolic distributed parameter system via
radial basis functions. Commun. Nonlinear Sci. Numer. Simul. 19, 2559–2567 (2014)
Rad, J.A., Parand, K., Abbasbandy, S.: Pricing European and American options using a very fast
and accurate scheme: the meshless local Petrov-Galerkin method. Proc. Natl. Acad. Sci. India
Sect. A: Phys. Sci. 85, 337–351 (2015a)
Rad, J.A., Parand, K., Abbasbandy, S.: Local weak form meshless techniques based on the radial
point interpolation (RPI) method and local boundary integral equation (LBIE) method to evaluate
European and American options. Commun. Nonlinear Sci. Numer. Simul. 22, 1178–1200 (2015b)
Rad, J.A., Höök, J., Larsson, E., Sydow, L.V.: Forward deterministic pricing of options using
Gaussian radial basis functions. J. Comput. Sci. 24, 209–217 (2018)
Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics-informed neural networks: a deep learning
framework for solving forward and inverse problems involving nonlinear partial differential equa-
tions. J. Comput. Phys. 378, 686–707 (2019)
Rashedi, K., Adibi, H., Rad, J.A., Parand, K.: Application of meshfree methods for solving the
inverse one-dimensional Stefan problem. Eng. Anal. Bound. Elem. 40, 1–21 (2014)
Reguera, D., Rubı, J.M., Pérez-Madrid, A.: Fokker-Planck equations for nucleation processes revis-
ited. Physica A: Stat. Mech. Appl. 259, 10–23 (1998)
Risken, H.: The Fokker-Planck Equation: Method of Solution and Applications. Springer, Berlin
(1989)
Saha, P., Mukhopadhyay, S.: A deep learning-based collocation method for modeling unknown
PDEs from sparse observation (2020). arXiv:2011.14965
Saporito, Y.F., Zhang, Z.: Path-dependent deep Galerkin method: a neural network approach to
solve path-dependent partial differential equations. SIAM J. Financ. Math. 12, 912–940 (2021)
Shakeri, F., Dehghan, M.: A finite volume spectral element method for solving magnetohydrody-
namic (MHD) equations. Appl. Numer. Math. 61, 1–23 (2011)
Shen, J.: Efficient spectral-Galerkin method I. Direct solvers of second-and fourth-order equations
using Legendre polynomials. SIAM J. Sci. Comput. 15, 1489–1505 (1994)
Shivanian, E., Hajimohammadi, Z., Baharifard, F., Parand, K., Kazemi, R.: A novel learning
approach for different profile shapes of convecting-radiating fins based on shifted Gegenbauer
LSSVM. New Math. Natural Comput. 1–27 (2022)
Shizgal, B.: Spectral Methods in Chemistry and Physics. Scientific Computing. Springer, Berlin
(2015)
Sirignano, J., Spiliopoulos, K.: DGM: a deep learning algorithm for solving partial differential
equations. J. Comput. Phys. 375, 1339–1364 (2018)
Smith, G.D.: Numerical Solutions of Partial Differential Equations Finite Difference Methods, 3rd
edn. Oxford University Press, New York (1985)
Spalart, P.R., Moser, R.D., Rogers, M.M.: Spectral methods for the Navier-Stokes equations with
one infinite and two periodic directions. J. Comput. Phys. 96, 297–324 (1991)
Strikwerda, J.C.: Finite Difference Schemes and Partial Differential Equations. Society for Industrial
and Applied Mathematics, Pennsylvania (2004)
Tanimura, Y.: Stochastic Liouville, Langevin, Fokker-Planck, and master equation approaches to
quantum dissipative systems. J. Phys. Soc. Japan 75, 082001 (2006)
Tatari, M., Dehghan, M., Razzaghi, M.: Application of the Adomian decomposition method for the
Fokker-Planck equation. Phys. Scr. 45, 639–650 (2007)
Trefethen, L.N.: Finite Difference and Spectral Methods for Ordinary and Partial Differential Equa-
tions. Cornell University, New York (1996)
Triki, H., Wazwaz, A.M.: On soliton solutions for the Fitzhugh-Nagumo equation with time-
dependent coefficients. Appl. Math. Model. 37, 3821–3828 (2013)
Tsurui, A., Ishikawa, H.: Application of the Fokker-Planck equation to a stochastic fatigue crack
growth model. Struct. Safety. 63, 15–29 (1986)
Uhlenbeck, G.E., Ornstein, L.S.: On the theory of the Brownian motion. Phys. Rev. 36, 823–841
(1930)
Ullersma, P.: An exactly solvable model for Brownian motion: II. Derivation of the Fokker-Planck
equation and the master equation. Physica 32, 56–73 (1966)
Van Gorder, R.A., Vajravelu, K.: A variational formulation of the Nagumo reaction-diffusion equa-
tion and the Nagumo telegraph equation. Nonlinear Anal.: Real World Appl. 11, 2957–2962
(2010)
Van Gorder, R.A.: Gaussian waves in the Fitzhugh-Nagumo equation demonstrate one role of the
auxiliary function H (x, t) in the homotopy analysis method. Commun. Nonlinear Sci. Numer.
Simul. 17, 1233–1240 (2012)
Vanaja, V.: Numerical solution of a simple Fokker-Planck equation. Appl. Numer. Math. 9, 533–540
(1992)
Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
Wang, C.H., Feng, Y.Y., Yue, K., Zhang, X.X.: Discontinuous finite element method for combined
radiation-conduction heat transfer in participating media. Int. Commun. Heat Mass. 108, 104287
(2019)
Wazwaz, A.M., Gorguis, A.: An analytic study of Fisher’s equation by using adomian decomposition
method. Appl. Math. Comput. 154, 609–620 (2004)
Wazwaz, A.M.: The tanh-coth method for solitons and kink solutions for nonlinear parabolic equa-
tions. Appl. Math. Comput. 188, 1467–1475 (2007)
Wilson, P., Teschemacher, T., Bucher, P., Wüchner, R.: Non-conforming FEM-FEM coupling
approaches and their application to dynamic structural analysis. Eng. Struct. 241, 112342 (2021)
Xing, J., Wang, H., Oster, G.: From continuum Fokker-Planck models to discrete kinetic models.
Biophys. J. 89, 1551–1563 (2005)
Yang, L., Zhang, D., Karniadakis, G.E.: Physics-informed generative adversarial networks for
stochastic differential equations. SIAM J. Sci. Comput. 46, 292–317 (2020)
Yeganeh, S., Mokhtari, R., Hesthaven, J.S.: Space-dependent source determination in a time-
fractional diffusion equation using a local discontinuous Galerkin method. Bit Numer. Math.
57, 685–707 (2017)
Yu, B.: The Deep Ritz Method: a deep learning-based numerical algorithm for solving variational
problems. Commun. Math. Stat. 6, 1–12 (2018)
Zayernouri, M., Karniadakis, G.E.: Fractional spectral collocation method. SIAM J Sci. Comput.
36, A40–A62 (2014)
Zayernouri, M., Karniadakis, G.E.: Fractional spectral collocation methods for linear and nonlinear
variable order FPDEs. J. Comput. Phys. 293, 312–338 (2015)
Zayernouri, M., Ainsworth, M., Karniadakis, G.E.: A unified Petrov-Galerkin spectral method for
fractional PDEs. Comput. Methods Appl. Mech. Eng. 283, 1545–1569 (2015)
Zhang, Z., Zou, Q.: Some recent advances on vertex centered finite volume element methods for
elliptic equations. Sci. China Math. 56, 2507–2522 (2013)
Zhao, D.H., Shen, H.W., Lai, J.S., Tabios III, G.Q.: Approximate Riemann solvers in FVM for 2D
hydraulic shock wave modeling. J. Hydraulic Eng. 122, 692–702 (1996)
Zhao, Y., Chen, P., Bu, W., Liu, X., Tang, Y.: Two mixed finite element methods for time-fractional
diffusion equations. J. Sci. Comput. 70, 407–428 (2017)
Zienkiewicz, O.C., Taylor, R.L., Zhu, J.Z.: The finite element method: its basis and fundamentals.
Elsevier (2005)
Zorzano, M.P., Mais, H., Vazquez, L.: Numerical solution of two dimensional Fokker-Planck equa-
tions. Appl. Math. Comput. 98, 109–117 (1999)
Zubarev, D.N., Morozov, V.G.: Statistical mechanics of nonlinear hydrodynamic fluctuations. Phys-
ica A: Stat. Mech. Appl. 120, 411–467 (1983)
Chapter 9
Solving Integral Equations by LS-SVR

Kourosh Parand, Alireza Afzal Aghaei, Mostafa Jani, and Reza Sahleh
Abstract Integral equations are another important type of problem in science and
engineering, so developing precise numerical algorithms for approximating the
solutions of these problems is one of the main tasks of scientific computing. In
this chapter, the least squares support vector algorithm is utilized to develop a
numerical algorithm for solving various types of integral equations. The robustness
and convergence of the proposed method are demonstrated through several numerical
examples.

Keywords Integral equations · Galerkin LS-SVR · Collocation LS-SVR · Numerical simulation

9.1 Introduction

Any equation with an unknown function under the integral sign is called an inte-
gral equation. These equations frequently appear in science and engineering, for
instance, different mathematical models such as diffraction problems Eswaran
(1990), scattering in quantum mechanics Barlette et al. (2001), plasticity Kanaun and

K. Parand (B)
Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid Beheshti
University, Tehran, Iran
e-mail: [email protected]
Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Canada
A. A. Aghaei · M. Jani · R. Sahleh
Department of Computer and Data Science, Faculty of Mathematical Sciences, Shahid Beheshti
University, Tehran, Iran
e-mail: [email protected]
M. Jani
e-mail: [email protected]
R. Sahleh
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 199
J. A. Rad et al. (eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support
Vector Machines, Industrial and Applied Mathematics,
https://doi.org/10.1007/978-981-19-6553-1_9
200 A. Ahmadzadeh et al.
Martinez (2012), conformal mapping Reichel (1986), water waves Manam (2011),
and Volterra’s population model Wazwaz (2002) are expressed as integral equations
Assari and Dehghan (2019), Bažant and Jirásek (2002), Bremer (2012), Kulish and
Novozhilov (2003), Lu et al. (2020), Parand and Delkhosh (2017), Volterra (1928).
Recently, the applications of integral equations in machine learning problems have
also been discussed by researchers Keller and Dahm (2019), Chen et al. (2018),
Dahm and Keller (2016). Furthermore, integral equations are closely related to
differential equations, and in some cases the two can be converted into each other.
Integro-differential equations are a related type of integral equation, in which not
only does the unknown function appear under the integral operator, but its derivatives
also appear in the equation. In some cases, partial derivatives of the unknown function
appear; the equation is then called a partial integro-differential equation.
Distributed-order fractional differential equations, a class of fractional differential
equations, are also very similar to integral equations: in them, the fractional
derivative of the unknown function appears under the integral sign, with the order of
the fractional derivative acting as the variable of integration.
Due to the importance and wide applications of integral equations, many
researchers have developed efficient methods for solving these types of equations
Abbasbandy (2006), Assari and Dehghan (2019), Fatahi et al. (2016), Golberg (2013),
Mandal and Chakrabarti (2016), Marzban et al. (2011), Nemati et al. (2013), Wazwaz
(2011). This chapter starts with an explanation of integral equations; in what follows,
a new, numerically efficient method based on the least squares support vector
regression approach is proposed for the simulation of several integral equations.

9.2 Integral Equations

Integral equations are divided into different categories based on their properties.
However, there are three main types of integral equations Wazwaz (2011): Fredholm,
Volterra, and Volterra-Fredholm integral equations. While Fredholm integral equations
have constant integration bounds, in Volterra integral equations at least one of the
bounds depends on the independent variable; Volterra-Fredholm integral equations
include both types of integral operators. These types are themselves divided into
subcategories such as linear/nonlinear, homogeneous/inhomogeneous, and first/second
kind. In the following sections, these different classes are presented and discussed.
9.2.1 Fredholm Integral Equations

Fredholm integral equations are an essential class of integral equations used in
image processing Mohammad (2019) and reinforcement learning problems Keller
and Dahm (2019), Dahm and Keller (2016). Since analytical methods for finding
the exact solution are available only in specific cases, various numerical techniques
have been developed to approximate the exact solutions of these equations, as well
as of larger classes of related models such as Hammerstein equations, systems of
equations, multi-dimensional equations, and nonlinear equations.
of some numerical methods for these integral equations is given. The local radial
basis function method was presented in Assari et al. (2019) for solving Fredholm
integral equations. Also, Bahmanpour et al. developed the Müntz wavelets method
Bahmanpour et al. (2019) to simulate these models. On the other hand, Newton-
Raphson and Newton-Krylov methods were used for solving one- and
two-dimensional nonlinear Fredholm integral equations Parand et al. (2019). In
another work, the least squares support vector regression method was proposed to
solve Fredholm integral equations Parand et al. (2021). To see more, the interested
reader can see Amiri et al. (2020), Maleknejad and Nosrati Sahlan (2010), Li and
Wu (2021), and Rad and Parand (2017a, b).
For any a, b, λ ∈ R, the following equation is called a Fredholm integral equation
Wazwaz (2011, 2015), Rahman (2007)
$$u(x) = f(x) + \lambda \int_a^b K(x,t)\,u(t)\,dt,$$
where u(x) is the unknown function and K (x, t) is the kernel of the equation.
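As a concrete baseline before the LS-SVR machinery of this chapter, the second kind Fredholm equation above can be approximated by the classical Nyström scheme: replace the integral with a quadrature rule and solve the resulting linear system at the collocation nodes. The sketch below is illustrative only; the kernel K(x, t) = xt, data f(x) = x, and λ = 1 form a hypothetical test problem whose exact solution is u(x) = 3x/2.

```python
import numpy as np

def solve_fredholm_nystrom(f, K, a, b, lam, n=40):
    """Nystrom method for u(x) = f(x) + lam * int_a^b K(x,t) u(t) dt."""
    # Gauss-Legendre nodes and weights, rescaled from [-1, 1] to [a, b]
    t, w = np.polynomial.legendre.leggauss(n)
    t = 0.5 * (b - a) * t + 0.5 * (b + a)
    w = 0.5 * (b - a) * w
    # Collocating at the nodes gives (I - lam * K(t_i, t_j) * w_j) u = f(t_i)
    A = np.eye(n) - lam * K(t[:, None], t[None, :]) * w[None, :]
    return t, np.linalg.solve(A, f(t))

# Test problem: u(x) = x + int_0^1 x*t*u(t) dt, exact solution u(x) = 1.5*x
x, u = solve_fredholm_nystrom(lambda s: s, lambda s, t: s * t, 0.0, 1.0, 1.0)
print(np.max(np.abs(u - 1.5 * x)))  # error should be near machine precision
```

Because the Gauss rule integrates the quadratic integrand of this test exactly, the discrete solution matches the exact one up to rounding; for general kernels the error is governed by the quadrature accuracy.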

9.2.2 Volterra Integral Equations

Volterra integral equations are another category of integral equations. This kind of
equation appears in many scientific applications, such as population dynamics Jafari
et al. (2021), the spread of epidemics Wang et al. (2006), and semi-conductor devices
Unterreiter (1996). Also, these equations can be obtained from initial value prob-
lems. Different numerical methods have been proposed to solve these types of equations;
for example, a collocation method using Sinc and rational Legendre functions was
proposed to solve Volterra's population model Parand et al. (2011). In another work,
the least squares support vector regression method was proposed to solve Volterra
integral equations Parand et al. (2020). Also, the Runge-Kutta method was imple-
mented to solve the second kind of linear Volterra integral equations Maleknejad and
Shahrezaee (2004). To see more, the interested reader can see Messina and Vecchio
(2017), Tang et al. (2008), and Maleknejad et al. (2011). In this category, the equation
is defined as follows Wazwaz (2011, 2015), Rahman (2007):
$$u(x) = f(x) + \lambda \int_{g(x)}^{h(x)} K(x,t)\,u(t)\,dt, \qquad (9.1)$$
where h(x) and g(x) are known functions.
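As with the Fredholm case, a simple quadrature baseline helps fix ideas (this is a standard trapezoidal marching scheme, not the LS-SVR method of this chapter). Taking g(x) = 0 and h(x) = x, each new value u(x_i) depends only on earlier values, so the equation can be solved step by step; the test problem u(x) = 1 + ∫₀ˣ u(t) dt, with exact solution eˣ, is a hypothetical check.

```python
import numpy as np

def solve_volterra_trapezoid(f, K, a, b, lam, n=400):
    """March u(x) = f(x) + lam * int_a^x K(x,t) u(t) dt with the trapezoidal rule."""
    x = np.linspace(a, b, n + 1)
    h = (b - a) / n
    u = np.empty(n + 1)
    u[0] = f(x[0])
    for i in range(1, n + 1):
        w = np.full(i, h)
        w[0] = h / 2                      # endpoint weight at t_0
        s = np.dot(w, K(x[i], x[:i]) * u[:i])
        # the remaining h/2 * K(x_i, x_i) * u_i term is moved to the left-hand side
        u[i] = (f(x[i]) + lam * s) / (1.0 - lam * h / 2 * K(x[i], x[i]))
    return x, u

# Test problem: u(x) = 1 + int_0^x u(t) dt, exact solution u(x) = exp(x)
x, u = solve_volterra_trapezoid(lambda s: 1.0 + 0 * s, lambda s, t: 1.0 + 0 * t,
                                0.0, 1.0, 1.0)
print(np.max(np.abs(u - np.exp(x))))  # O(h^2) accuracy
```

The triangular structure of the Volterra operator is what makes this forward substitution possible, in contrast to the full linear system required in the Fredholm case.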

9.2.3 Volterra-Fredholm Integral Equations

Volterra-Fredholm integral equations are a combination of the Fredholm and Volterra
integral equations. These equations are obtained from parabolic boundary value
problems Wazwaz (2011, 2015), Rahman (2007) and can also be derived from
spatio-temporal epidemic modeling Brunner (1990), Maleknejad and Hadizadeh (1999).
Several numerical methods have been proposed for solving the Volterra-Fredholm
integral model. For example, Brunner developed a spline collocation method for
solving the nonlinear Volterra-Fredholm integral equations that appear in the
spatio-temporal development of epidemics Brunner (1990), Maleknejad and colleagues
utilized a collocation method based on orthogonal triangular functions for this kind
of problem Maleknejad et al. (2010), and a Legendre wavelet collocation method was
applied to these problems by Yousefi and Razzaghi (2005). To see more, the interested reader
can see Parand and Rad (2012), Ghasemi et al. (2007), Babolian et al. (2009), and
Babolian and Shaerlar (2011).
In a one-dimensional linear case, these equations can be written as follows Wazwaz
(2011, 2015), Rahman (2007):
$$u(x) = f(x) + \lambda_1 \int_a^b K_1(x,t)\,u(t)\,dt + \lambda_2 \int_{g(x)}^{h(x)} K_2(x,t)\,u(t)\,dt,$$
which include Fredholm and Volterra integral operators.
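The interplay of the two operators can be seen in a plain quadrature-collocation sketch (again, not the LS-SVR scheme developed in this chapter). Assuming the Volterra part runs over [a, x], both integrals are discretized with trapezoidal weights on one grid; the Volterra weight matrix is lower triangular, since its row i integrates only over [a, x_i]. The manufactured problem below, with K₁ = K₂ = 1 and data chosen so that u(x) = x, is hypothetical.

```python
import numpy as np

def solve_volterra_fredholm(f, K1, K2, a, b, lam1, lam2, n=200):
    """Collocation for u(x) = f(x) + lam1*int_a^b K1 u dt + lam2*int_a^x K2 u dt."""
    x = np.linspace(a, b, n + 1)
    h = (b - a) / n
    wF = np.full(n + 1, h)                # trapezoid weights over the full interval
    wF[0] = wF[-1] = h / 2
    WV = np.zeros((n + 1, n + 1))         # row i holds trapezoid weights over [a, x_i]
    for i in range(1, n + 1):
        WV[i, :i + 1] = h
        WV[i, 0] = WV[i, i] = h / 2
    A = (np.eye(n + 1)
         - lam1 * K1(x[:, None], x[None, :]) * wF[None, :]
         - lam2 * K2(x[:, None], x[None, :]) * WV)
    return x, np.linalg.solve(A, f(x))

# Manufactured test: with K1 = K2 = 1, u(x) = x requires f(x) = x - 1/2 - x**2/2
one = lambda s, t: np.ones_like(s * t)
x, u = solve_volterra_fredholm(lambda s: s - 0.5 - s**2 / 2, one, one,
                               0.0, 1.0, 1.0, 1.0)
print(np.max(np.abs(u - x)))  # exact up to rounding for this linear test
```

Since the trapezoidal rule integrates the linear integrand exactly, the discrete system reproduces u(x) = x to machine precision here; for general data the error is second order in the grid spacing.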

Remark 9.1 (First and second kind integral equations) It should be noted that if
u(x) only appears under the integral sign, it is called the first kind; otherwise, if
the unknown function appears inside and outside the integral sign, it is named the
second kind. For instance, Eq. 9.1 is a second kind Volterra integral equation, while
the equation $f(x) = \lambda \int_{g(x)}^{h(x)} K(x,t)\,u(t)\,dt$ is of the first kind.

Remark 9.2 (Linear and nonlinear integral equations) Suppose we have the integral
equation $\psi(u(x)) = f(x) + \lambda \int_{g(x)}^{h(x)} K(x,t)\,\phi(u(t))\,dt$. If
either $\psi$ or $\phi$ is a nonlinear function, the equation is called nonlinear. In
the case of the first kind Volterra integral equation, we have
$f(x) = \lambda \int_{g(x)}^{h(x)} K(x,t)\,\phi(u(t))\,dt$.

Remark 9.3 (Homogeneous and inhomogeneous integral equations) In the field
of integral equations, the function f(x) that appeared in the previous equations is
called the data function. This function plays an important role in determining the
solution of the integral equation. It is known that if the function f(x) lies in the
range of the integral operator with kernel K(x,t), then the equation has a solution.
For instance, if the kernel has the form K(x,t) = sin(x) sin(t), the function f(x)
should be a multiple of sin(x) Golberg (2013); otherwise, the equation has no solution
Wazwaz (2011). Due to the importance of the function f(x) in integral equations,
scientists have categorized equations based on the existence or non-existence of this
function. If a function f(x) is present in the integral equation, it is called
inhomogeneous; otherwise, it is identified as homogeneous. In other words, the
Fredholm integral equation of the second kind
$u(x) = f(x) + \lambda \int_a^b K(x,t)\,u(t)\,dt$ is inhomogeneous, and the equation
$u(x) = \lambda \int_a^b K(x,t)\,u(t)\,dt$ is homogeneous.

9.2.4 Integro-Differential Equations
As mentioned at the beginning of this chapter, integro-differential equations are
integral equations in which the derivatives of the unknown function also appear in
the equation. For instance, the equation

d^n u(x)/dx^n = f(x) + λ ∫_a^b K(x, t) u(t) dt,

d^k u/dx^k |_{x = x_0} = b_k,  0 ≤ k ≤ n − 1,
is the second kind Fredholm integro-differential equation, in which n is a positive
integer and the b_k are the initial values needed to determine the unknown function.
Several numerical methods have been proposed for solving the model discussed here.
For example, in 2006, several finite difference schemes, including the forward Euler
explicit, the backward Euler implicit, the Crank-Nicolson implicit, and Crandall's
implicit schemes, were developed for solving partial integro-differential equations
by Dehghan (2006). In another study, Dehghan and Saadatmandi (2008) utilized a
Chebyshev finite difference algorithm for both linear and nonlinear Fredholm
integro-differential equations. To see more, the interested reader can see
El-Shahed (2005), Wang and He (2007), and Parand et al. (2011, 2014, 2016).
9.2.5 Multi-dimensional Integral Equations

Due to the existence of different variables, the study and modeling of physical prob-
lems usually lead to the creation of multi-dimensional cases Mikhlin (2014). Partial
differential equations and multi-dimensional integral equations are the most famous
examples of modeling these problems. The general form of these equations is defined
as follows:
204 A. Ahmadzadeh et al.
μ u(x) = f(x) + λ ∫_S K(x, t) σ(u(t)) dt,  x, t ∈ S ⊂ R^n,  (9.2)
where x = (x_1, x_2, ..., x_n), t = (t_1, t_2, ..., t_n), and λ is the eigenvalue of
the integral equation. Also, for convenience in defining the first and second kind
equations, the constant μ ∈ R has been added to the left side of the equation: if
this constant is zero, the equation is of the first kind; otherwise, it is of the
second kind. In the general case, this type of integral equation usually cannot be
solved in exact closed form, and powerful computational algorithms are required.
However, some works are available on numerical simulations of this model; for
example, the moving least squares method of Mirzaei and Dehghan (2010) for one- and
two-dimensional Fredholm integral equations, the Legendre collocation method for
Volterra-Fredholm integral equations Zaky et al. (2021), and the Jacobi collocation
method for multi-dimensional Volterra integral equations Abdelkawy et al. (2017). To
see more, the interested reader can see Bhrawy et al. (2016), Esmaeilbeigi et al.
(2017), and Mirzaee and Alipour (2020).
9.2.6 System of Integral Equations
Another type of problem in integral equations is systems of equations. These systems
appear as several coupled integral equations in several unknowns. The general form
of such systems is defined as follows:

u_i(x) = f_i(x) + ∫_a^b Σ_{j=1}^n K_{ij}(x, t) v_{ij}(t) dt,  i = 1, ..., n.

Here, the functions v_{ij} are defined as g_i(u_j(t)) for some functions g_i.
For example, the following is a system of two Fredholm integral equations in two
unknowns:

u_1(x) = f_1(x) + ∫_a^b (K_{11}(x, t) v_{11}(t) + K_{12}(x, t) v_{12}(t)) dt,
u_2(x) = f_2(x) + ∫_a^b (K_{21}(x, t) v_{21}(t) + K_{22}(x, t) v_{22}(t)) dt.
Remark 9.4 (Relationship between differential equations and integral equations)
Differential equations and integral equations are closely related, and some of them
can be converted into one another. Sometimes it is beneficial to convert a
differential equation into an integral equation, because this approach can prevent
the instability of numerical solvers for differential equations Wazwaz (2011, 2015),
Rahman (2007). Fredholm integral equations can be converted to differential equations
with boundary values, and Volterra integral equations can be converted to
differential equations with initial values Parand et al. (2020).
9.3 LS-SVR for Solving IEs
In this section, an efficient method for the numerical simulation of integral equations
is proposed. Using the ideas behind weighted residual methods, the proposed tech-
nique introduces two different algorithms named collocation LS-SVR (CLS-SVR)
and Galerkin LS-SVR (GLS-SVR). For the sake of simplicity, here we denote an
N(u) = f,  (9.3)

in which
N(u) = μ u − 𝒦_1(u) − 𝒦_2(u).

Here, μ ∈ R is a constant which specifies a first kind (μ = 0) or second kind
(μ ≠ 0) integral equation. The operators 𝒦_1 and 𝒦_2 are the Fredholm and Volterra
integral operators, respectively. These operators are defined as

𝒦_1(u) = λ_1 ∫_{Ω_1} K_1(x, t) u(t) dt,

and

𝒦_2(u) = λ_2 ∫_{Ω_2} K_2(x, t) u(t) dt,

where Ω_1 and Ω_2 denote the corresponding integration domains.
The proposed method can solve a wide range of integral equations, so we split the
subject into three sections, based on the structure of the unknown function which
should be approximated.

9.3.1 One-Dimensional Case

In order to approximate the solution of Eq. 9.3 using the LS-SVR formulation, some
training data are needed. In contrast to the standard LS-SVR, in which there is a set
of labeled training data, there are no labels for an arbitrary set of training points
when solving Eq. 9.3. To handle this problem, the approximate solution is expanded as
a linear combination of unknown coefficients and some basis functions ϕ_i,
i = 1, ..., d:

u(x) ≈ ũ(x) = w^T ϕ(x) + b = Σ_{i=1}^d w_i ϕ_i(x) + b.  (9.4)
In a general form, the LS-SVR primal problem can be formulated as follows:

min_{w,e}  (1/2) w^T w + (γ/2) e^T e
s.t.  ⟨N(ũ) − f, ψ_k⟩ = e_k,  k = 1, ..., n,  (9.5)

in which n is the number of training points, {ψ_k}_{k=1}^n is a set of test functions
in the test space, and ⟨·, ·⟩ is the inner product of two functions. In order to take
advantage of the kernel trick, the dual form of this optimization problem is
constructed. If the operator N is linear, the optimization problem of Eq. 9.5 is
convex, and the dual form can be derived easily. To do so, we denote the linear
operator by L and construct the Lagrangian function

L(w, e, α) = (1/2) w^T w + (γ/2) e^T e − Σ_{k=1}^n α_k (⟨L ũ − f, ψ_k⟩ − e_k),  (9.6)

in which α_k ∈ R are the Lagrangian multipliers. The conditions for the optimality of
Eq. 9.6 yield
∂L/∂w_k = 0 → w_k = Σ_{i=1}^n α_i ⟨L ϕ_k, ψ_i⟩,  k = 1, ..., d,

∂L/∂e_k = 0 → γ e_k + α_k = 0,  k = 1, ..., n,

∂L/∂b = 0 → Σ_{i=1}^n α_i ⟨L 1, ψ_i⟩ = 0,  (9.7)

∂L/∂α_k = 0 → ⟨Σ_{i=1}^d w_i L ϕ_i − f, ψ_k⟩ + b ⟨L 1, ψ_k⟩ = e_k,  k = 1, ..., n.

By eliminating w and e, the following linear system is obtained:

[ 0   L̃^T       ] [ b ]   [ 0 ]
[ L̃   Ω + I/γ   ] [ α ] = [ y ],  (9.8)

in which
α = [α_1, ..., α_n]^T,
y = [⟨f, ψ_1⟩, ⟨f, ψ_2⟩, ..., ⟨f, ψ_n⟩]^T,  (9.9)
L̃_i = ⟨L 1, ψ_i⟩.
The kernel trick is also applied within the matrix Ω:

Ω_{i,j} = ⟨L ϕ, ψ_i⟩^T ⟨L ϕ, ψ_j⟩ = ⟨⟨L L K(x, t), ψ_i⟩, ψ_j⟩,  i, j = 1, 2, ..., n,

with any valid Mercer kernel K(x, t). The approximate solution in the dual form takes
the form

ũ(x) = Σ_{i=1}^n α_i K̃(x, x_i) + b,  (9.10)

where
K̃(x, x_i) = ⟨L ϕ, ψ_i⟩^T ϕ(x) = ⟨L K(x, t), ψ_i⟩.
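Assembled concretely, Eq. 9.8 is a small bordered linear system. The sketch below is a hedged illustration only: Ω and L̃ are replaced by a generic symmetric positive semi-definite matrix and a generic vector rather than the operator inner products above, and all variable names are ours, not the chapter's.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 6, 1e4

# Stand-ins for the operator-dependent blocks of Eq. 9.8 (illustrative only):
M = rng.standard_normal((n, n))
Omega = M @ M.T                 # plays the role of the kernel matrix Omega (PSD)
Ltil = rng.standard_normal(n)   # plays the role of Ltil_i = <L 1, psi_i>
y = rng.standard_normal(n)      # plays the role of y_k = <f, psi_k>

# Bordered KKT matrix [[0, Ltil^T], [Ltil, Omega + I/gamma]]
A = np.zeros((n + 1, n + 1))
A[0, 1:] = Ltil
A[1:, 0] = Ltil
A[1:, 1:] = Omega + np.eye(n) / gamma

rhs = np.concatenate(([0.0], y))
sol = np.linalg.solve(A, rhs)
b, alpha = sol[0], sol[1:]

# The solution satisfies both block rows of the system:
print(np.allclose(Ltil @ alpha, 0.0))                                  # True
print(np.allclose(b * Ltil + (Omega + np.eye(n) / gamma) @ alpha, y))  # True
```

Since Ω + I/γ is positive definite, the bordered matrix is nonsingular whenever L̃ is nonzero, so a direct solve is enough at this scale.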
9.3.2 Multi-dimensional Case

One of the most used techniques for solving multi-dimensional integral equations is
to approximate the solution using a nested summation with a tensor of unknown
coefficients and some basis functions. For instance, in order to solve a
two-dimensional integral equation, we can use this function:

u(x, y) ≈ ũ(x, y) = Σ_{i=1}^d Σ_{j=1}^d w_{i,j} ϕ_i(x) ϕ_j(y) + b.
Note that the upper summation bound d and the basis functions ϕ can vary in each
dimension. Fortunately, there is no need to reconstruct the proposed model. In order
to use LS-SVR for solving multi-dimensional equations, we can vectorize the unknown
tensor w, the basis functions ϕ_i, ϕ_j, and the training points X. For example, in
the case of 2D integral equations, we first vectorize the d × d matrix w,

w = [w_{1,1}, w_{1,2}, ..., w_{1,d}, w_{2,1}, w_{2,2}, ..., w_{2,d}, ..., w_{d,1}, w_{d,2}, ..., w_{d,d}],

and then use the new indexing function

w_{i,j} = w_{(i−1)d + j}.

For the three-dimensional case, the indexing function

w_{i,j,k} = w_{(i−1)d^2 + (j−1)d + k}

can be used. Also, this indexing should be applied to the basis function and training
data tensors. After using this technique, the proposed dual form Eq. 9.8 can be
utilized as in the one-dimensional case. Solving the dual form returns the vector α,
which can be converted back to a tensor using the inverse of the indexing function.
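This index map is exactly the row-major flattening provided by array libraries; the following is a quick sanity check (NumPy here, purely illustrative) of the correspondence and its inverse.

```python
import numpy as np

d = 4
w = np.arange(d * d).reshape(d, d)   # a d x d coefficient matrix w_{i,j}
w_flat = w.ravel()                   # row-major vectorization

# The 1-based pair (i, j) maps to flat position (i-1)*d + j in 1-based terms,
# i.e. index (i-1)*d + (j-1) in 0-based terms:
i, j = 2, 3
assert w[i - 1, j - 1] == w_flat[(i - 1) * d + (j - 1)]

# Same pattern in 3D: w_{i,j,k} sits at (i-1)*d^2 + (j-1)*d + (k-1)
w3 = np.arange(d ** 3).reshape(d, d, d)
i, j, k = 3, 1, 4
assert w3[i - 1, j - 1, k - 1] == w3.ravel()[(i - 1) * d**2 + (j - 1) * d + (k - 1)]

# The inverse of the indexing function is just a reshape:
assert np.array_equal(w_flat.reshape(d, d), w)
```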

9.3.3 System of Integral Equations
In a system of integral equations, there are k equations and k unknown functions:

N_i(u_1, u_2, ..., u_k) = f_i,  i = 1, 2, ..., k.  (9.11)
For solving these types of equations, the approximate solution can be formulated as

ũ_i(x) = w_i^T ϕ(x) + b_i,  i = 1, 2, ..., k,

where ϕ(x) is the feature map vector, and w_i and b_i are the unknown coefficients.
In the next step, the unknown coefficients are set side by side in a vector; thus, we
have

w = [w_{1,1}, w_{1,2}, ..., w_{1,d}, w_{2,1}, w_{2,2}, ..., w_{2,d}, ..., w_{k,1}, w_{k,2}, ..., w_{k,d}].

As in the previous formulation for solving high-dimensional equations, the basis
functions can vary for each approximated function, but for simplicity the same
functions are used. Since these functions are shared, they can be collected in a
d-dimensional vector. Using this formulation, the optimization problem for solving
Eq. 9.11 can be constructed as

min_{w,e}  (1/2) w^T w + (γ/2) e^T e
s.t.  ⟨N_i(ũ_1, ũ_2, ..., ũ_k) − f_i, ψ_j⟩ = e_{i,j},  j = 1, ..., n,  (9.12)
where i = 1, 2, ..., k. Also, the matrix e_{i,j} should be vectorized in the same way
as the unknown coefficients w. For any linear operator N, denoted by L, the dual form
of the optimization problem Eq. 9.12 can be derived. Here, we obtain the dual form
for a system of two equations and two unknown functions; this process can be
generalized to an arbitrary number of equations.
Suppose the following system of equations is given:

L_1(u_1, u_2) = f_1,
L_2(u_1, u_2) = f_2.

By approximating the solutions using

ũ_1(x) = w_1^T ϕ(x) + b_1,
ũ_2(x) = w_2^T ϕ(x) + b_2,
the optimization problem Eq. 9.12 takes the form

min_{w,e}  (1/2) w^T w + (γ/2) e^T e
s.t.  ⟨L_1(ũ_1, ũ_2) − f_1, ψ_j⟩ = e_j,  j = 1, ..., n,
      ⟨L_2(ũ_1, ũ_2) − f_2, ψ_j⟩ = e_j,  j = n + 1, ..., 2n,

where
w = [w_1, w_2] = [w_{1,1}, w_{1,2}, ..., w_{1,d}, w_{2,1}, w_{2,2}, ..., w_{2,d}],
e = [e_1, e_2] = [e_{1,1}, e_{1,2}, ..., e_{1,n}, e_{2,1}, e_{2,2}, ..., e_{2,n}].
For the dual solution, the Lagrangian function is constructed as follows:

L(w, e, α) = (1/2) w^T w + (γ/2) e^T e
             − Σ_{j=1}^n α_j (⟨L_1(ũ_1, ũ_2) − f_1, ψ_j⟩ − e_j)
             − Σ_{j=1}^n α_{n+j} (⟨L_2(ũ_1, ũ_2) − f_2, ψ_j⟩ − e_{n+j}),
then the conditions for optimality of the Lagrangian function are given by

∂L/∂w_k = 0 → w_k = Σ_{j=1}^n α_j ⟨L_1(ϕ_k, 0), ψ_j⟩ + Σ_{j=1}^n α_{n+j} ⟨L_2(ϕ_k, 0), ψ_j⟩,  k = 1, 2, ..., d,

∂L/∂w_k = 0 → w_k = Σ_{j=1}^n α_j ⟨L_1(0, ϕ_{k−d}), ψ_j⟩ + Σ_{j=1}^n α_{n+j} ⟨L_2(0, ϕ_{k−d}), ψ_j⟩,  k = d + 1, ..., 2d,  (9.13)
∂L/∂e_k = 0 → γ e_k + α_k = 0,  k = 1, ..., 2n,

∂L/∂b_1 = 0 → Σ_{i=1}^n α_i ⟨L_1(1, 0), ψ_i⟩ + Σ_{i=1}^n α_{n+i} ⟨L_2(1, 0), ψ_i⟩ = 0,

∂L/∂b_2 = 0 → Σ_{i=1}^n α_i ⟨L_1(0, 1), ψ_i⟩ + Σ_{i=1}^n α_{n+i} ⟨L_2(0, 1), ψ_i⟩ = 0,
∂L/∂α_k = 0 → ⟨Σ_{j=1}^d w_j L_1(ϕ_j, 0) + Σ_{j=1}^d w_{d+j} L_1(0, ϕ_j) + b_1 L_1(1, 0) + b_2 L_1(0, 1) − f_1, ψ_k⟩ = e_k,  k = 1, 2, ..., n,

∂L/∂α_k = 0 → ⟨Σ_{j=1}^d w_j L_2(ϕ_j, 0) + Σ_{j=1}^d w_{d+j} L_2(0, ϕ_j) + b_1 L_2(1, 0) + b_2 L_2(0, 1) − f_2, ψ_{k−n}⟩ = e_k,  k = n + 1, ..., 2n.  (9.14)
By defining
A_{i,j} = ⟨L_1(ϕ_i, 0), ψ_j⟩,
B_{i,j} = ⟨L_2(ϕ_i, 0), ψ_j⟩,
C_{i,j} = ⟨L_1(0, ϕ_i), ψ_j⟩,
D_{i,j} = ⟨L_2(0, ϕ_i), ψ_j⟩,
E_j = ⟨L_1(1, 0), ψ_j⟩,
F_j = ⟨L_2(1, 0), ψ_j⟩,
G_j = ⟨L_1(0, 1), ψ_j⟩,
H_j = ⟨L_2(0, 1), ψ_j⟩,

and
Z = [ A  B ]
    [ C  D ],

V = [ E  F ]
    [ G  H ],
the relations in Eq. 9.14 can be reformulated as

Z α = w,
e = −α/γ,
V^T α = 0,
Z^T w + V b − e = y,

where b = [b_1, b_2]^T.
Eliminating w and e yields

[ 0   V^T       ] [ b ]   [ 0 ]
[ V   Ω + I/γ   ] [ α ] = [ y ],  (9.15)

where
α = [α_1, ..., α_{2n}]^T,
y = [⟨f_1, ψ_1⟩, ⟨f_1, ψ_2⟩, ..., ⟨f_1, ψ_n⟩, ⟨f_2, ψ_1⟩, ⟨f_2, ψ_2⟩, ..., ⟨f_2, ψ_n⟩]^T,

and
Ω = Z^T Z = [ A^T  C^T ] [ A  B ]
            [ B^T  D^T ] [ C  D ].  (9.16)
The kernel trick also appears in each block of the matrix Ω. The approximate solution
in the dual form can be computed using

ũ_1(x) = Σ_{i=1}^n α_i K̃_1(x, x_i) + Σ_{i=1}^n α_{n+i} K̃_2(x, x_i) + b_1,
ũ_2(x) = Σ_{i=1}^n α_i K̃_3(x, x_i) + Σ_{i=1}^n α_{n+i} K̃_4(x, x_i) + b_2,

where
K̃_1(x, x_i) = ⟨L_1(ϕ, 0), ψ_i⟩^T ϕ(x) = ⟨L_1(K(x, t), 0), ψ_i⟩,
K̃_2(x, x_i) = ⟨L_2(ϕ, 0), ψ_i⟩^T ϕ(x) = ⟨L_2(K(x, t), 0), ψ_i⟩,
K̃_3(x, x_i) = ⟨L_1(0, ϕ), ψ_i⟩^T ϕ(x) = ⟨L_1(0, K(x, t)), ψ_i⟩,
K̃_4(x, x_i) = ⟨L_2(0, ϕ), ψ_i⟩^T ϕ(x) = ⟨L_2(0, K(x, t)), ψ_i⟩.
9.3.4 CLS-SVR Method
In this section, the collocation form of the LS-SVR model is proposed for solving
integral equations. Similar to the weighted residual methods, by using the Dirac
delta function as the test function in the test space, we can construct the
collocation LS-SVR model, abbreviated as CLS-SVR. In this case, the primal form of
the optimization problem Eq. 9.5 can be expressed as

min_{w,e}  (1/2) w^T w + (γ/2) e^T e
s.t.  N(ũ)(x_k) − f(x_k) = e_k,  k = 1, ..., n,  (9.17)
where n is the number of training points and N is a linear or nonlinear functional
operator. If the operator is linear, then the dual form of the optimization problem
Eq. 9.17 can be computed using the following linear system of equations:

[ 0   L̃^T       ] [ b ]   [ 0 ]
[ L̃   Ω + I/γ   ] [ α ] = [ y ],  (9.18)

in which
α = [α_1, ..., α_n]^T,
y = [f(x_1), f(x_2), ..., f(x_n)]^T,
L̃ = [L1(x_1), L1(x_2), ..., L1(x_n)]^T,  (9.19)
Ω_{i,j} = (L ϕ(x_i))^T (L ϕ(x_j)) = L L K(x_i, x_j),  i, j = 1, 2, ..., n.
The approximate solution in the dual form takes the following form:

ũ(x) = Σ_{i=1}^n α_i K̃(x, x_i) + b,

where
K̃(x, x_i) = (L ϕ(x_i))^T ϕ(x) = L K(x, x_i).
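As a concrete, hedged sketch of the CLS-SVR pipeline (Python/NumPy rather than the authors' Maple implementation), consider the toy second kind Fredholm equation u(x) = f(x) + ∫_0^1 x t u(t) dt with f(x) = 2x/3, whose exact solution is u(x) = x. The problem, the basis size, and all names below are our own illustrative choices; the bias term is dropped, as in the numerical section of this chapter.

```python
import numpy as np
from numpy.polynomial.legendre import legvander, leggauss, legroots

# Toy problem (illustrative, not from the chapter):
#   u(x) = f(x) + int_0^1 x*t*u(t) dt,  f(x) = 2x/3,  exact solution u(x) = x
f = lambda x: 2.0 * x / 3.0
d, n, gamma = 8, 8, 1e12           # basis size, collocation points, regularization

# Shifted Legendre feature map on [0, 1]: phi_i(x) = P_i(2x - 1), i = 0..d-1
def phi(x):
    return legvander(2.0 * np.asarray(x) - 1.0, d - 1)   # shape (len(x), d)

# Gauss-Legendre quadrature mapped from [-1, 1] to [0, 1]
q, wq = leggauss(30)
tq, wq = (q + 1.0) / 2.0, wq / 2.0

# Collocation points: roots of the shifted Legendre polynomial of degree n
xc = (legroots([0.0] * n + [1.0]) + 1.0) / 2.0

# Apply L(u)(x) = u(x) - int_0^1 x*t*u(t) dt to each basis function:
#   (L phi_i)(x) = phi_i(x) - x * c_i  with  c_i = int_0^1 t*phi_i(t) dt
c = (wq * tq) @ phi(tq)
M = phi(xc) - np.outer(xc, c)      # M[k, i] = (L phi_i)(x_k)

# Dual system of the CLS-SVR with the bias dropped: Omega = M M^T
Omega = M @ M.T
alpha = np.linalg.solve(Omega + np.eye(n) / gamma, f(xc))
w = M.T @ alpha                    # primal weights recovered from the dual

# Compare the approximation phi(x) w with the exact u(x) = x on a test grid
xt = np.linspace(0.0, 1.0, 101)
err = np.max(np.abs(phi(xt) @ w - xt))
print(err)                         # maximum absolute error on the test grid
```

Because the exact solution is a polynomial inside the feature space, the error here is limited only by the ridge term 1/γ and floating-point arithmetic.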

9.3.5 GLS-SVR Method
The Galerkin approach is a famous method for solving a wide range of problems. In
this approach, the test functions ψ are chosen equal to the basis functions. If the
basis functions are mutually orthogonal, this approach leads to a sparse system of
algebraic equations; some examples of this feature are provided in the next section.
For now, let us define the model. In the primal space, the model can be constructed
as follows:

min_{w,e}  (1/2) w^T w + (γ/2) e^T e
s.t.  ∫ [L ũ(x) − f(x)] ϕ_k(x) dx = e_k,  k = 0, ..., d,  (9.20)

where d is the number of basis functions and N is a linear or nonlinear functional
operator. If the operator is linear, then the dual form of the optimization problem
Eq. 9.20 can be computed using the following linear system:
[ 0   L̃^T       ] [ b ]   [ 0 ]
[ L̃   Ω + I/γ   ] [ α ] = [ y ],  (9.21)

in which

α = [α_1, ..., α_n]^T,
L̃ = [∫ L1(x) ϕ_1(x) dx, ∫ L1(x) ϕ_2(x) dx, ..., ∫ L1(x) ϕ_d(x) dx]^T,
y = [∫ f(x) ϕ_1(x) dx, ∫ f(x) ϕ_2(x) dx, ..., ∫ f(x) ϕ_d(x) dx]^T,
Ω_{i,j} = (∫ L ϕ(x) ϕ_i(x) dx)^T (∫ L ϕ(x) ϕ_j(x) dx)
        = ∫∫ L L K(s, t) ϕ_i(s) ϕ_j(t) ds dt,  i, j = 1, 2, ..., d.  (9.22)
The approximate solution in the dual form takes the form

ũ(x) = Σ_{i=1}^n α_i K̃(x, x_i) + b,

where
K̃(x, x_i) = (∫ L ϕ(s) ϕ_i(s) ds)^T ϕ(x) = ∫ L K(x, s) ϕ_i(s) ds.
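The sparsity that Galerkin testing produces can be seen on a small example. The sketch below (illustrative Python/NumPy, not the authors' code) applies the GLS-SVR construction with a shifted Legendre basis to the toy second kind Fredholm equation u(x) = f(x) + ∫_0^1 x t u(t) dt, f(x) = 2x/3, exact solution u(x) = x; orthogonality makes the Galerkin matrix diagonal plus a rank-one term, so most entries of Ω fall below a fuzzy zero.

```python
import numpy as np
from numpy.polynomial.legendre import legvander, leggauss

# Toy problem (illustrative): u(x) = f(x) + int_0^1 x*t*u(t) dt, f(x) = 2x/3,
# exact solution u(x) = x
f = lambda x: 2.0 * x / 3.0
d, gamma = 8, 1e12

def phi(x):                        # shifted Legendre basis P_i(2x - 1), i = 0..d-1
    return legvander(2.0 * np.asarray(x) - 1.0, d - 1)

q, wq = leggauss(30)               # Gauss-Legendre quadrature mapped to [0, 1]
tq, wq = (q + 1.0) / 2.0, wq / 2.0
P = phi(tq)

# Galerkin entries M[k, i] = int (L phi_i)(x) phi_k(x) dx with
# (L phi_i)(x) = phi_i(x) - x*c_i and c_i = int_0^1 t*phi_i(t) dt; by
# orthogonality this is a diagonal mass matrix minus the rank-one term c c^T
c = (wq * tq) @ P
M = P.T @ (wq[:, None] * P) - np.outer(c, c)

y = P.T @ (wq * f(tq))             # y_k = int_0^1 f(x) phi_k(x) dx
Omega = M @ M.T
alpha = np.linalg.solve(Omega + np.eye(d) / gamma, y)
w = M.T @ alpha

xt = np.linspace(0.0, 1.0, 101)
err = np.max(np.abs(phi(xt) @ w - xt))
frac = np.mean(np.abs(Omega) > 1e-8)   # fraction of entries above a fuzzy zero
print(err, frac)
```

On this problem only the leading rows and the diagonal of Ω survive the fuzzy-zero threshold, which mirrors the sparsity plots shown in the next section.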

9.4 Numerical Simulations
In this section, some integral equations are considered as test problems, and the
efficiency of the proposed method is shown by approximating the solutions of these
test problems. Also, the shifted Legendre polynomials are used as the kernel of the
LS-SVR. Since the Legendre polynomial of degree 0 is a constant, the bias term b can
be removed, and the approximate solution is defined as Σ_{i=0}^d w_i P_i(x). As a
result, the system Eq. 9.8 reduces to Ω α = y, which can be efficiently solved using
the Cholesky decomposition or the conjugate gradient method. The training points for
the following examples are the roots of the shifted Legendre polynomials, and the
test data are equidistant points in the problem domain.
This method has been implemented in Maple 2019 with 15 digits of accuracy. The
results are obtained on an Intel Core i5 CPU with 8 GB of RAM. In all of the
presented numerical tables, the efficiency of the method is computed using the mean
absolute error function:

Table 9.1 The convergence of the CLS-SVR and GLS-SVR methods for Example 9.1 by different
d values. The number of training points for each approximation is d + 1. The CPU time is also
reported in seconds
d  | CLS-SVR train | CLS-SVR test | CLS-SVR time (s) | GLS-SVR train | GLS-SVR test | GLS-SVR time (s)
4  | 7.34E-05      | 7.65E-05     | 0.01             | 3.26E-05      | 5.58E-05     | 0.08
6  | 1.96E-06      | 2.32E-06     | 0.02             | 8.80E-07      | 1.68E-06     | 0.17
8  | 5.35E-08      | 7.50E-08     | 0.03             | 2.41E-08      | 4.61E-08     | 0.21
10 | 1.47E-09      | 2.14E-09     | 0.03             | 6.65E-10      | 1.48E-09     | 0.26
12 | 5.12E-11      | 8.92E-11     | 0.07             | 1.90E-11      | 4.26E-11     | 0.38

L(u, ũ) = (1/n) Σ_{i=1}^n |u_i − ũ_i|,

where u_i and ũ_i are the exact and the predicted values at x_i, respectively.
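In code, this metric is a one-line helper; the function below is illustrative, not the chapter's Maple implementation.

```python
import numpy as np

def mean_absolute_error(u, u_tilde):
    """Mean absolute error between exact and predicted values."""
    u, u_tilde = np.asarray(u, dtype=float), np.asarray(u_tilde, dtype=float)
    return np.mean(np.abs(u - u_tilde))

print(mean_absolute_error([1.0, 2.0, 3.0], [1.1, 1.9, 3.0]))  # about 0.0667
```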
Example 9.1 Suppose the following Volterra integral equation of the second kind with
the exact solution ln(1 + x) Wazwaz (2011):

x − (1/2) x^2 − ln(1 + x) + x^2 ln(1 + x) = ∫_0^x 2 t u(t) dt.

In Table 9.1, the errors of the approximate solutions for the CLS-SVR and GLS-SVR
methods are given. Figure 9.1 shows the approximate solution and the residual
function of the approximate solution. Also, the non-zero elements of the matrix Ω of
the GLS-SVR method are drawn; it is observed that most of the matrix elements are
approximately zero.

Example 9.2 Consider the following Fredholm integral equation of the first kind. As
stated before, these equations are ill-posed, and their solution may not be unique.
A solution of this equation is exp(x) Wazwaz (2011):
(1/4) e^x = ∫_0^{1/4} e^{x−t} u(t) dt.

In Table 9.2, the results obtained by solving this equation with different values of
γ, with d = 6 and n = 7, are reported.

Example 9.3 Suppose the Volterra-Fredholm integral equation is as follows:

u(x) = 2 e^x − 2x − 2 + ∫_0^x (x − t) u(t) dt + ∫_0^1 x u(t) dt.


Fig. 9.1 The plots of Example 9.1. a Exact versus LS-SVR. b Absolute value of the
residual function. c, d Sparsity of the matrix Ω in the GLS-SVR with fuzzy zeros
10^−3 and 10^−4

Table 9.2 The obtained solution norm and training error of the CLS-SVR method for Example 9.2
γ     | ‖w‖_2    | ‖e‖_2
1E-01 | 0.041653 | 0.673060
1E+00 | 0.312717 | 0.505308
1E+01 | 0.895428 | 0.144689
1E+02 | 1.100491 | 0.017782
1E+03 | 1.126284 | 0.001820
1E+04 | 1.128930 | 0.000182
1E+05 | 1.129195 | 0.000018

Table 9.3 The convergence of the CLS-SVR and GLS-SVR method for Example 9.3 by different
d values. The number of training points for each approximation is d + 1. The CPU time is reported
in seconds
d  | CLS-SVR train | CLS-SVR test | CLS-SVR time (s) | GLS-SVR train | GLS-SVR test | GLS-SVR time (s)
4  | 1.53E-07      | 1.18E-04     | 0.02             | 4.22E-06      | 1.18E-04     | 0.13
6  | 1.16E-10      | 2.46E-07     | 0.04             | 6.04E-09      | 2.46E-07     | 0.18
8  | 5.90E-14      | 2.71E-10     | 0.04             | 5.13E-12      | 2.72E-10     | 0.28
10 | 4.12E-15      | 2.44E-13     | 0.07             | 2.21E-14      | 2.40E-13     | 0.38
12 | 6.84E-15      | 8.80E-15     | 0.12             | 2.12E-14      | 1.55E-14     | 0.60

The exact solution of this equation is exp(x) Malaikah (2020). Table 9.3 shows the
convergence of the proposed method for this equation. In Fig. 9.2, the exact solution
and the residual function of the approximate solution are plotted. Also, it is seen
that the matrix Ω of the GLS-SVR method has good sparsity.

Example 9.4 For a multi-dimensional case, consider the following 2D Fredholm
integral equation Derakhshan and Zarebnia (2020):

u(x, y) = x cos(y) − (1/6)(sin(1) + 3) sin(1) + ∫_0^1 ∫_0^1 (s sin(t) + 1) u(s, t) ds dt.

The exact solution of this equation is u(x, y) = x cos(y). The numerical results of
the CLS-SVR and GLS-SVR methods are given in Table 9.4. Figure 9.3 shows the plot of
the exact solution and the residual function; the sparsity of the matrix in the
two-dimensional case is also shown. It can be seen that the sparsity dominates in
higher dimensional problems Parand et al. (2021).

Example 9.5 Consider the following system of integral equations with the exact
solution u_1(x) = sin(x), u_2(x) = cos(x) Wazwaz (2011).


Fig. 9.2 The plots of Example 9.3. a Exact versus LS-SVR. b Absolute value of the
residual function. c, d Sparsity of the matrix Ω in the GLS-SVR with fuzzy zeros
10^−3 and 10^−4

Table 9.4 The convergence of the CLS-SVR and GLS-SVR methods for Example 9.4 by different
d values. The CPU time is also reported in seconds
d | n  | CLS-SVR train | CLS-SVR test | CLS-SVR time (s) | GLS-SVR train | GLS-SVR test | GLS-SVR time (s)
1 | 4  | 1.93E-03      | 1.93E-03     | 0.01             | 8.54E-04      | 8.12E-04     | 0.06
2 | 9  | 1.33E-05      | 1.33E-05     | 0.05             | 1.69E-05      | 5.86E-06     | 0.16
3 | 16 | 5.56E-08      | 5.56E-08     | 0.11             | 3.12E-07      | 2.34E-08     | 0.38
4 | 25 | 1.73E-10      | 1.73E-10     | 0.26             | 1.59E-08      | 8.16E-11     | 0.80
5 | 36 | 2.19E-11      | 2.19E-11     | 0.73             | 2.19E-11      | 2.19E-11     | 1.57


Fig. 9.3 The plots of Example 9.4. a Exact solution. b Absolute value of the residual
function. c, d Sparsity of the matrix Ω in the GLS-SVR with fuzzy zeros 10^−3 and
10^−4

Table 9.5 The convergence of mean squared error value for the CLS-SVR and GLS-SVR methods
in Example 9.5 by different d values. The number of training points for each approximation is
d + 1. The CPU time is reported in seconds
d  | u_1 train | u_1 test | u_2 train | u_2 test | Time (s)
4  | 9.02E-13  | 1.20E-06 | 2.28E-13  | 2.00E-05 | 0.34
7  | 5.43E-28  | 6.04E-11 | 2.22E-27  | 1.67E-12 | 0.48
10 | 3.07E-26  | 3.35E-19 | 3.29E-27  | 2.18E-17 | 0.66
13 | 8.19E-27  | 1.36E-24 | 2.09E-27  | 3.86E-26 | 0.91
16 | 2.38E-27  | 2.31E-27 | 1.87E-28  | 1.85E-28 | 1.38


Fig. 9.4 The plots of Example 9.5. a Exact versus LS-SVR. b Absolute values of the
residual functions. c, d Sparsity of the matrix Ω in the GLS-SVR with fuzzy zeros
10^−3 and 10^−4
u_1(x) = sin x − 2 − 2x − π x + ∫_0^π ((1 + x t) u_1(t) + (1 − x t) u_2(t)) dt,
u_2(x) = cos x − 2 − 2x + π x + ∫_0^π ((1 − x t) u_1(t) − (1 + x t) u_2(t)) dt.

The numerical simulation results of this example are given in Table 9.5. Also,
Fig. 9.4 plots the exact and approximate solutions of the methods. Since the matrix
in this case is a symmetric block matrix, the resulting matrix of the GLS-SVR has an
interesting structure. In Fig. 9.4a, the exact solutions and the approximations are
not distinguishable.

Example 9.6 For a nonlinear example, consider the following Volterra-Fredholm
integral equation of the second kind Amiri et al. (2020):

Table 9.6 The convergence of the CLS-SVR and GLS-SVR methods for Example 9.6 by different
d values. The number of training points for each approximation is d + 1. The CPU time is also
reported in seconds
d  | CLS-SVR train | CLS-SVR test | CLS-SVR time (s) | GLS-SVR train | GLS-SVR test | GLS-SVR time (s)
2  | 9.23E-05      | 8.66E-03     | 0.08             | 8.55E-04      | 8.76E-03     | 0.72
4  | 4.24E-07      | 8.45E-05     | 0.12             | 4.28E-06      | 8.5E-05      | 1.54
6  | 1.12E-09      | 3.22E-07     | 0.15             | 1.16E-08      | 3.23E-07     | 3.84
8  | 1.99E-12      | 6.97E-10     | 0.23             | 7.82E-11      | 7.39E-10     | 12.04
10 | 3.07E-13      | 1.22E-12     | 0.33             | 1.31E-10      | 1.71E-10     | 38.37

Fig. 9.5 The plots of Example 9.6. a Exact versus LS-SVR. b Absolute value of the residual function
u(x) = (1/36)(35 cos(x) − 1) + (1/12) ∫_0^x sin(t) u^2(t) dt + (1/36)(cos^3(x) − cos(x)) ∫_0^{π/2} u(t) dt.

The exact solution of this equation is u(x) = cos(x). Since the equation is
nonlinear, the corresponding optimization problem leads to a nonlinear programming
problem; the dual form also yields a system of nonlinear algebraic equations. In
Table 9.6 and Fig. 9.5, the numerical results and the convergence of the method can
be seen.

9.5 Conclusion

In this chapter, a new computational method for solving different types of integral
equations, including multi-dimensional cases and systems of integral equations, is
proposed. For linear equations, learning the solution reduces to solving a positive
definite system of linear equations; this formulation is similar to the LS-SVR method
for solving regression problems. By using the ideas behind spectral methods, we have
presented the CLS-SVR and GLS-SVR methods. Although the CLS-SVR method is more
computationally efficient, the resulting matrix in the GLS-SVR method has a sparsity
property. In the last section, several integral equations have been solved using the
CLS-SVR and GLS-SVR methods. The numerical results show that these methods have high
efficiency and an exponential convergence rate for integral equations.

References

Abbasbandy, S.: Numerical solutions of the integral equations: homotopy perturbation method and
Adomian’s decomposition method. Appl. Math. Comput. 173, 493–500 (2006)
Abdelkawy, M.A., Amin, A.Z., Bhrawy, A.H., Machado, J.A.T., Lopes, A.M.: Jacobi collocation
approximation for solving multi-dimensional Volterra integral equations. Int. J. Nonlinear Sci.
Numer. Simul. 18, 411–425 (2017)
Amiri, S., Hajipour, M., Baleanu, D.: On accurate solution of the Fredholm integral equations of
the second kind. Appl. Numer. Math. 150, 478–490 (2020)
Amiri, S., Hajipour, M., Baleanu, D.: A spectral collocation method with piecewise trigonometric
basis functions for nonlinear Volterra-Fredholm integral equations. Appl. Math. Comput. 370,
124915 (2020)
Assari, P., Dehghan, M.: A meshless local discrete Galerkin (MLDG) scheme for numerically
solving two-dimensional nonlinear Volterra integral equations. Appl. Math. Comput. 350, 249–
265 (2019)
Assari, P., Dehghan, M.: On the numerical solution of logarithmic boundary integral
equations arising in Laplace's equation based on the meshless local discrete
collocation method. Adv. Appl. Math. Mech. 11, 807–837 (2019)
Assari, P., Asadi-Mehregan, F., Dehghan, M.: On the numerical solution of Fredholm integral
equations utilizing the local radial basis function method. Int. J. Comput. Math. 96, 1416–1443
(2019)
Babolian, E., Shaerlar, A.J.: Two dimensional block pulse functions and application to solve
Volterra-Fredholm integral equations with Galerkin method. Int. J. Contemp. Math. Sci. 6, 763–
770 (2011)

Babolian, E., Masouri, Z., Hatamzadeh-Varmazyar, S.: Numerical solution of nonlinear Volterra-
Fredholm integro-differential equations via direct method using triangular functions. Comput.
Math. Appl. 58, 239–247 (2009)
Bahmanpour, M., Kajani, M.T., Maleki, M.: Solving Fredholm integral equations of the first kind
using Müntz wavelets. Appl. Numer. Math. 143, 159–171 (2019)
Barlette, V.E., Leite, M.M., Adhikari, S.K.: Integral equations of scattering in one dimension. Am.
J. Phys. 69, 1010–1013 (2001)
Bažant, Z.P., Jirásek, M.: Nonlocal integral formulations of plasticity and damage: survey of
progress. J. Eng. Mech. 128, 1119–1149 (2002)
Bhrawy, A.H., Abdelkawy, M.A., Machado, J.T., Amin, A.Z.M.: Legendre-Gauss-Lobatto collo-
cation method for solving multi-dimensional Fredholm integral equations. Comput. Math. Appl.
4, 1–13 (2016)
Bremer, J.: A fast direct solver for the integral equations of scattering theory on planar curves with
corners. J. Comput. Phys. 231, 1879–1899 (2012)
Brunner, H.: On the numerical solution of nonlinear Volterra-Fredholm integral equations by col-
location methods. SIAM J. Numer. Anal. 27, 987–1000 (1990)
Chen, R.T., Rubanova, Y., Bettencourt, J., Duvenaud, D.K.: Neural ordinary differential equations.
Adv. Neural Inf. Process. Syst. 31 (2018)
Dahm, K., Keller, A.: Learning light transport the reinforced way. In: International Conference on
Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing, pp. 181–195 (2016)
Dehghan, M.: Solution of a partial integro-differential equation arising from viscoelasticity. Int. J.
Comput. Math. 83, 123–129 (2006)
Dehghan, M., Saadatmandi, A.: Chebyshev finite difference method for Fredholm integro-
differential equation. Int. J. Comput. Math. 85, 123–130 (2008)
Derakhshan, M., Zarebnia, M.: On the numerical treatment and analysis of two-dimensional Fred-
holm integral equations using quasi-interpolant. Comput. Appl. Math. 39, 1–20 (2020)
El-Shahed, M.: Application of He’s homotopy perturbation method to Volterra’s integro-differential
equation. Int. J. Nonlinear Sci. Numer. Simul. 6, 163–168 (2005)
Esmaeilbeigi, M., Mirzaee, F., Moazami, D.: A meshfree method for solving multidimensional
linear Fredholm integral equations on the hypercube domains. Appl. Math. Comput. 298, 236–
246 (2017)
Eswaran, K.: On the solutions of a class of dual integral equations occurring in diffraction problems.
Proc. Math. Phys. Eng. Sci. 429, 399–427 (1990)
Fatahi, H., Saberi-Nadjafi, J., Shivanian, E.: A new spectral meshless radial point interpolation
(SMRPI) method for the two-dimensional Fredholm integral equations on general domains with
error analysis. J. Comput. Appl. 294, 196–209 (2016)
Ghasemi, M., Kajani, M.T., Babolian, E.: Numerical solutions of the nonlinear Volterra-Fredholm
integral equations by using homotopy perturbation method. Appl. Math. Comput. 188, 446–449
(2007)
Golberg, M.A.: Numerical Solution of Integral Equations. Springer, Berlin (2013)
Jafari, H., Ganji, R.M., Nkomo, N.S., Lv, Y.P.: A numerical study of fractional order population
dynamics model. Results Phys. 27, 104456 (2021)
Kanaun, S., Martinez, R.: Numerical solution of the integral equations of elasto-plasticity for a
homogeneous medium with several heterogeneous inclusions. Comput. Mater. Sci. 55, 147–156
(2012)
Keller, A., Dahm, K.: Integral equations and machine learning. Math. Comput. Simul. 161, 2–12
(2019)
Kulish, V.V., Novozhilov, V.B.: Integral equation for the heat transfer with the moving boundary. J.
Thermophys. Heat Trans. 17, 538–540 (2003)
Li, X.Y., Wu, B.Y.: Superconvergent kernel functions approaches for the second kind Fredholm
integral equations. Appl. Numer. Math. 167, 202–210 (2021)

Lu, Y., Yin, Q., Li, H., Sun, H., Yang, Y., Hou, M.: Solving higher order nonlinear ordinary differen-
tial equations with least squares support vector machines. J. Ind. Manag. Optim. 16, 1481–1502
(2020)
Malaikah, H.M.: The adomian decomposition method for solving Volterra-Fredholm integral equa-
tion using maple. Appl. Math. 11, 779–787 (2020)
Maleknejad, K., Hashemizadeh, E., Ezzati, R.: A new approach to the numerical solution of Volterra
integral equations by using Bernstein’s approximation. Commun. Nonlinear Sci. Numer. Simul.
16, 647–655 (2011)
Maleknejad, K., Hadizadeh, M.: A new computational method for Volterra-Fredholm integral equa-
tions. Comput. Math. Appl. 37, 1–8 (1999)
Maleknejad, K., Nosrati Sahlan, M.: The method of moments for solution of second kind Fredholm
integral equations based on B-spline wavelets. Int. J. Comput. Math. 87, 1602–1616 (2010)
Maleknejad, K., Shahrezaee, M.: Using Runge-Kutta method for numerical solution of the system
of Volterra integral equation. Appl. Math. Comput. 149, 399–410 (2004)
Maleknejad, K., Almasieh, H., Roodaki, M.: Triangular functions (TF) method for the solution
of nonlinear Volterra-Fredholm integral equations. Commun. Nonlinear Sci. Numer. Simul. 15,
3293–3298 (2010)
Manam, S.R.: Multiple integral equations arising in the theory of water waves. Appl. Math. Lett.
24, 1369–1373 (2011)
Mandal, B.N., Chakrabarti, A.: Applied Singular Integral Equations. CRC Press, FL (2016)
Marzban, H.R., Tabrizidooz, H.R., Razzaghi, M.: A composite collocation method for the nonlin-
ear mixed Volterra-Fredholm-Hammerstein integral equations. Commun. Nonlinear Sci. Numer.
Simul. 16, 1186–1194 (2011)
Messina, E., Vecchio, A.: Stability and boundedness of numerical approximations to Volterra inte-
gral equations. Appl. Numer. Math. 116, 230–237 (2017)
Mikhlin, S.G.: Multidimensional Singular Integrals and Integral Equations. Elsevier (2014)
Miller, K.S., Ross, B.: An Introduction to the Fractional Calculus and Fractional Differential Equa-
tions. Wiley, New York (1993)
Mirzaee, F., Alipour, S.: An efficient cubic B?spline and bicubic B?spline collocation method for
numerical solutions of multidimensional nonlinear stochastic quadratic integral equations. Math.
Methods Appl. Sci. 43, 384–397 (2020)
Mirzaei, D., Dehghan, M.: A meshless based method for solution of integral equations. Appl. Numer.
Math. 60, 245–262 (2010)
Mohammad, M.: A numerical solution of Fredholm integral equations of the second kind based on
tight framelets generated by the oblique extension principle. Symmetry 11, 854–869 (2019)
Nemati, S., Lima, P.M., Ordokhani, Y.: Numerical solution of a class of two-dimensional nonlinear
Volterra integral equations using Legendre polynomials. J. Comput. Appl. 242, 53–69 (2013)
Oldham, K., Spanier, J.: The Fractional Calculus theory and Applications of Differentiation and
Integration to Arbitrary Order. Elsevier, Amsterdam (1974)
Parand, K., Delkhosh, M.: Solving the nonlinear Schlomilch’s integral equation arising in iono-
spheric problems. Afr. Mat. 28, 459–480 (2017)
Parand, K., Rad, J.A.: An approximation algorithm for the solution of the singularly perturbed
Volterra integro-differential and Volterra integral equations. Int. J. Nonlinear Sci. 12, 430–441
(2011)
Parand, K., Rad, J.A.: Numerical solution of nonlinear Volterra-Fredholm-Hammerstein integral
equations via collocation method based on radial basis functions. Appl. Math. Comput. 218,
5292–5309 (2012)
Parand, K., Abbasbandy, S., Kazem, S., Rad, J.A.: A novel application of radial basis functions for
solving a model of first-order integro-ordinary differential equation. Commun. Nonlinear Sci.
Numer. Simul. 16, 4250–4258 (2011)
Parand, K., Delafkar, Z., Pakniat, N., Pirkhedri, A., Haji, M.K.: Collocation method using sinc and
Rational Legendre functions for solving Volterra’s population model. Commun. Nonlinear Sci.
Numer. Simul. 16, 1811–1819 (2011)
224 A. Ahmadzadeh et al.

Parand, K., Rad, J.A., Nikarya, M.: A new numerical algorithm based on the first kind of modified
Bessel function to solve population growth in a closed system. Int. J. Comput. Math. 91, 1239–
1254 (2014)
Parand, K., Hossayni, S.A., Rad, J.A.: An operation matrix method based on Bernstein polynomials
for Riccati differential equation and Volterra population model. Appl. Math. Model. 40, 993–1011
(2016)
Parand, K., Yari, H., Taheri, R., Shekarpaz, S.: A comparison of Newton-Raphson method with
Newton-Krylov generalized minimal residual (GMRes) method for solving one and two dimen-
sional nonlinear Fredholm integral equations. Sema. 76, 615–624 (2019)
Parand, K., Aghaei, A.A., Jani, M., Ghodsi, A.: A new approach to the numerical solution of
Fredholm integral equations using least squares-support vector regression. Math. Comput. Simul.
180, 114–128 (2021)
Parand, K., Razzaghi, M., Sahleh, R., Jani, M.: Least squares support vector regression for solving
Volterra integral equations. Eng. Comput. 38(38), 789–796 (2022)
Rad, J.A., Parand, K.: Numerical pricing of American options under two stochastic factor models
with jumps using a meshless local Petrov-Galerkin method. Appl. Numer. Math. 115, 252–274
(2017)
Rad, J.A., Parand, K.: Pricing American options under jump-diffusion models using local weak
form meshless techniques. Int. J. Comput. Math. 94, 1694–1718 (2017)
Rahman, M.: Integral Equations and their Applications. WIT Press (2007)
Reichel, L.: A fast method for solving certain integral equations of the first kind with application
to conformal mapping. J. Comput. Appl. Math. 14, 125–142 (1986)
Tang, T., Xu, X., Cheng, J.: On spectral methods for Volterra integral equations and the convergence
analysis. J. Comput. Math. 26, 825–837 (2008)
Unterreiter, A.: Volterra integral equation models for semiconductor devices. Math. Methods Appl.
Sci. 19, 425–450 (1996)
Volterra, V.: Variations and fluctuations of the number of individuals in animal species living
together. ICES Mar. Sci. Symp. 3, 3–51 (1928)
Wang, G.Q., Cheng, S.S.: Nonnegative periodic solutions for an integral equation modeling infec-
tious disease with latency periods. In Intern. Math. Forum 1, 421–427 (2006)
Wang, S.Q., He, J.H.: Variational iteration method for solving integro-differential equations. Phys.
Lett. A 367, 188–191 (2007)
Wazwaz, A.M.: First Course in Integral Equations. A World Scientific Publishing Company (2015)
Wazwaz, A.M.: A reliable treatment for mixed Volterra-Fredholm integral equations. Appl. Math.
Comput. 127, 405–414 (2002)
Wazwaz, A.M.: Linear and Nonlinear Integral Equations. Springer, Berlin (2011)
Yousefi, S., Razzaghi, M.: Legendre wavelets method for the nonlinear Volterra-Fredholm integral
equations. Math. Comput. Simul. 70, 1–8 (2005)
Zaky, M.A., Ameen, I.G., Elkot, N.A., Doha, E.H.: A unified spectral collocation method for non-
linear systems of multi-dimensional integral equations with convergence analysis. Appl. Numer.
Math. 161, 27–45 (2021)
Chapter 10
Solving Distributed-Order Fractional
Equations by LS-SVR

Amir Hosein Hadian Rasanan, Arsham Gholamzadeh Khoee, and Mostafa Jani

Abstract Over the past years, several artificial intelligence methods have been
developed for solving various types of differential equations. Owing to their
capacity to deal with very different problems, many paradigms of artificial
intelligence, such as evolutionary algorithms, neural networks, deep learning
methods, and support vector machine algorithms, have been applied to them. In this
chapter, an artificial intelligence method is employed to approximate the solution
of distributed-order fractional differential equations; it is based on combining
the least squares support vector regression algorithm with a well-established
spectral technique, the collocation method. Solving distributed-order fractional
differential equations is important because they model significant natural
phenomena, such as those arising in viscoelasticity, diffusion, sub-diffusion,
and wave propagation. As the kernel basis of the least squares support vector
regression algorithm, we use the modal Legendre functions introduced in Chap. 4
and already applied several times in the previous chapters. The uniqueness of the
obtained solution is proved for the resulting linear systems. Finally, the
efficiency and applicability of the proposed algorithm are demonstrated by
applying it to different linear and nonlinear test problems.

Keywords Fractional calculus · Distributed order · Caputo operator · Numerical approximation

A. H. Hadian Rasanan (B)


Department of Cognitive Modeling, Institute for Cognitive and Brain Sciences,
Shahid Beheshti University, Tehran, Iran
e-mail: [email protected]
A. G. Khoee
Department of Computer Science, School of Mathematics, Statistics, and Computer Science,
University of Tehran, Tehran, Iran
e-mail: [email protected]
M. Jani
Department of Computer and Data Science, Faculty of Mathematical Sciences,
Shahid Beheshti University, Tehran, Iran

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 225
J. A. Rad et al. (eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support
Vector Machines, Industrial and Applied Mathematics,
https://doi.org/10.1007/978-981-19-6553-1_10

10.1 Introduction

Fractional differential equations are frequently used for modeling real-world
problems in science and engineering, such as fractional time evolution, polymer
physics, rheology, and thermodynamics Hilfer (2000), cognitive science Cao et al.
(2021), Hadian-Rasanan et al. (2021), and neuroscience Datsko et al. (2015). Since
analytical methods can solve only some special simple cases of fractional
differential equations, numerical methods have been developed for solving these
equations in both the linear and the nonlinear case. On the other hand, since the
foundation of fractional calculus, several definitions of the fractional
derivative have been presented, such as those of Liouville Hilfer (2000), Caputo
(1967), Rad et al. (2014), Riesz Podlubny (1998), and Atangana-Baleanu-Caputo
(2017). Solving fractional differential equations therefore remains a challenging
problem.
In modeling some phenomena, the order of the derivative depends on time (i.e., the
order of the derivative operator is a function of t); this comes from the memory
property of variable-order fractional operators Heydari et al. (2020). A concept
similar to variable-order differentiation is distributed-order differentiation,
which is a continuous form of a weighted summation of various orders chosen in an
interval. In many problems, differential equations can involve several orders at
once. For instance, a first-order differential equation can be written as a
summation of various orders along with their proper weight functions as follows:

$\sum_{j=0}^{n} \omega_j\, D_t^{\alpha_j} u,$    (10.1)

in which the $\alpha_j$ are descending and equally spaced, and the $\omega_j$ can
be determined from the data. It is worth mentioning that, by taking the limit of
the above expression as the spacing between the orders goes to zero, it converts
to the following convergent integral form:

$\int_0^1 \omega(\alpha)\, D^{\alpha} u \, d\alpha.$    (10.2)
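The passage from the weighted sum (10.1) to the integral (10.2) can be illustrated
numerically. The sketch below uses our own toy choices (not from the chapter):
u(t) = t², the weight ω(α) = Γ(3 − α) so that the integral has the closed form
2(t² − t)/ln t, and the closed-form Caputo derivative of a monomial.

```python
import math

def caputo_t2(a, t):
    # Caputo derivative of u(t) = t^2 of order a in (0, 1):
    # D^a t^2 = Gamma(3)/Gamma(3 - a) * t^(2 - a)
    return math.gamma(3) / math.gamma(3 - a) * t ** (2 - a)

def w(a):
    # weight chosen so that int_0^1 w(a) D^a t^2 da = 2 (t^2 - t)/ln(t)
    return math.gamma(3 - a)

t = 0.7
exact = 2 * (t ** 2 - t) / math.log(t)

# Midpoint Riemann sums over the order variable: the discrete form (10.1)
# tends to the integral form (10.2) as the number of orders n grows.
for n in (10, 100, 1000):
    h = 1.0 / n
    approx = h * sum(w((j + 0.5) * h) * caputo_t2((j + 0.5) * h, t)
                     for j in range(n))
    print(n, abs(approx - exact))   # error shrinks as n grows
```

The error decays like O(1/n²) for the midpoint rule, so the discrete weighted sum
rapidly approaches the distributed-order integral.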

Distributed-order fractional differential equations (DOFDEs) have been considered
since 2000 Ding et al. (2021). DOFDEs have many applications in different areas of
physics and engineering. For example, DOFDEs can describe various phenomena in
viscoelastic materials Bagley and Torvik (1985), Umarov and Gorenflo (2005), and
in statistical and solid mechanics Carpinteri and Mainardi (2014), Rossikhin and
Shitikova (1997). System identification is also a significant application of
DOFDEs Hartley (1999). Moreover, in the study of anomalous diffusion processes,
DOFDEs provide models that are more compatible with experimental data Caputo
(2003), Sokolov et al. (2004). DOFDEs have also been used frequently in signal
processing and system control Parodi and Gómez (2014). In this chapter, an
algorithm based on a combination of the least squares support vector regression
(LS-SVR) algorithm and the modal Legendre collocation method is presented for
solving a DOFDE given by Mashayekhi and Razzaghi (2016), Yuttanan and Razzaghi
(2019):

$\int_a^b G_1\big(p, D^p u(t)\big)\,dp + G_2\big(t, u(t), D^{\alpha_i} u(t)\big) = F(t), \quad t > 0,$    (10.3)

with the initial conditions:

$u^{(k)}(0) = 0,$    (10.4)

where $i \in \mathbb{N}$, $\alpha_i > 0$, $k = 0, 1, \ldots, \max\{b, \alpha_i\}$,
and $G_1, G_2$ can be linear or nonlinear functions. In the proposed algorithm,
the unknown function is expanded using modal Legendre polynomials; then, by
applying the support vector regression algorithm and collocating the training
points in the residual function, the problem is converted to a constrained
optimization problem. Solving this optimization problem yields the solution of
the DOFDE. Since DOFDEs play a significant role in various fields of science,
many researchers have developed numerical algorithms for solving them. In the
following section, we review the existing methods used to solve DOFDEs, as well
as some artificial intelligence algorithms that have been developed to solve
various types of differential equations.

10.1.1 A Brief Review of Other Methods Existing in the Literature

Najafi et al. analyzed the stability of three classes of DOFDEs with respect to
nonnegative density functions Najafi et al. (2011). Atanacković et al. studied the
existence and uniqueness of mild and classical solutions for a general form that
arises in distributed-derivative models of viscoelasticity and identification
theory Atanacković et al. (2007). He and his collaborators also studied some
properties of the distributed-order fractional derivative in viscoelastic rod
solutions Atanackovic et al. (2005). Refahi et al. used DOFDEs to generalize the
concepts of inertia and characteristic polynomials with respect to nonnegative
density functions Refahi et al. (2012). Aminikhah et al. employed a combined
Laplace transform and a new homotopy perturbation method for solving a particular
class of distributed-order fractional Riccati equations Aminikhah et al. (2018).
For a time-distributed-order multi-dimensional diffusion-wave equation containing
a forcing term, Atanacković et al. reinterpreted a Cauchy problem Atanackovic et
al. (2009).
Katsikadelis proposed a finite-difference-based method to solve both linear and
nonlinear DOFDEs numerically Katsikadelis (2014). Diethelm and Ford proposed a
method for DOFDEs of the general form $\int_0^m A\big(r, D_*^r u(t)\big)\,dr = f(t)$,
where $m \in \mathbb{R}^+$ and $D_*^r$ is the fractional derivative of Caputo type
of order $r$, and introduced its analysis Diethelm and Ford (2009). Zaky and Machado first derived
the generalized necessary conditions for optimal control with dynamics described
by DOFDEs and then proposed a practical numerical scheme for solving these
equations Zaky and Machado (2017). Hadian-Rasanan et al. provided an artificial
neural network framework for approximating various types of Lane-Emden equations,
such as fractional-order equations and systems of Lane-Emden equations
Hadian-Rasanan et al. (2020). Razzaghi and Mashayekhi presented a numerical
method to solve DOFDEs based on hybrid function approximation Mashayekhi and
Razzaghi (2016). Mashoof and Refahi proposed methods based on the operational
matrix of fractional-order integration with an initial value point for solving
DOFDEs Mashoof and Sheikhani (2017). Li et al. presented a high-order numerical
scheme for solving diverse DOFDEs by applying reproducing kernels Li et al.
(2017). Gao and Sun derived two implicit difference schemes for two-dimensional
DOFDEs Gao et al. (2016).
In the first chapter, SVM was discussed fully and comprehensively. Mehrkanoon et
al. introduced a novel method based on LS-SVMs to solve ODEs Mehrkanoon et al.
(2012). Mehrkanoon and Suykens also presented another new approach for solving
delay differential equations Mehrkanoon et al. (2013). Ye et al. presented an
orthogonal Chebyshev kernel for SVMs Ye et al. (2006). Leake et al. compared the
application of the theory of connections by employing LS-SVMs Leake et al.
(2019). Baymani et al. developed a new technique utilizing LS-SVMs to obtain the
solution of ODEs in analytical form Baymani et al. (2016). Chu et al. presented
an improved method for the numerical solution of LS-SVMs, showing that the
problem can be solved using a reduced system of linear equations; they reported
that their approach is about twice as effective as previous algorithms Chu et al.
(2005). Pan et al. proposed an orthogonal Legendre kernel function for SVMs,
using the properties of kernel functions and comparing it to previous kernels
Pan et al. (2012). Ozer et al. introduced a new set of functions with the help of
generalized Chebyshev polynomials, which also increased the generalization
capability of their previous work Ozer et al. (2011).
Lagaris et al. proposed a method based on artificial neural networks that can
solve some categories of ODEs and PDEs, and later compared their method with the
Galerkin finite element method Lagaris et al. (1998). Meade and Fernandez
illustrated theoretically how a feedforward neural network can be constructed to
approximate arbitrary linear ODEs Meade et al. (1994). They also indicated a way
of directly constructing a feedforward neural network to approximate nonlinear
ordinary differential equations without training Meade et al. (1994). Dissanayake
and Phan-Thien presented a numerical method for solving PDEs based on
neural-network-based functions Dissanayake and Phan-Thien (1994). To solve ODEs
and elliptic PDEs, Mai-Duy and Tran-Cong proposed mesh-free procedures relying on
multiquadric radial basis function networks Mai-Duy and Tran-Cong (2001). Effati
and Pakdaman proposed a new algorithm utilizing feedforward neural networks for
solving fuzzy differential equations Effati and Pakdaman (2010). To solve linear
second-kind integral equations of Volterra and Fredholm types, Golbabai and
Seifollahi presented a novel method based on radial basis function networks that
applies a neural network as the approximate solution of the integral
equations Golbabai and Seifollahi (2006). Jianyu et al. illustrated a neural network
to solve PDEs using the radial basis functions as the activation function of the hidden
nodes Jianyu et al. (2003).
The rest of the chapter is organized as follows. Some preliminaries are presented
in Sect. 10.2, including the fractional derivatives and the numerical integration. We
present the proposed method for simulating distributed-order fractional dynamics,
and we depict LS-SVR details in Sect. 10.3. Numerical results are given in Sect. 10.4
to show the validity and efficiency of the proposed method. In Sect. 10.5, we draw
concluding remarks.

10.2 Preliminaries

As we pointed out earlier, modal Legendre functions are used in the proposed
algorithm. We skip the properties and the construction of modal Legendre
functions in this chapter, as they were discussed thoroughly in Chap. 4. Instead,
the fractional derivative, focusing on the Caputo definition, and the numerical
integration rule are described in this section.

10.2.1 Fractional Derivative

Let $f(x)$ be a function. The Cauchy formula for the $n$-fold repeated integral
can be written down, and by generalizing it to non-integer orders — with the
well-known Gamma function playing the role of the factorial for non-integer
numbers — one arrives at the Riemann-Liouville definition of the fractional
integral. Denoting the Riemann-Liouville fractional integral of order $\beta$ by
${}^{RL}_{\ a}I_x^{\beta} f(x)$, it is defined as follows Hadian et al. (2020):

${}^{RL}_{\ a}I_x^{\beta} f(x) = \dfrac{1}{\Gamma(\beta)} \displaystyle\int_a^x (x - t)^{\beta - 1} f(t)\,dt,$    (10.5)

in which $\beta$ is a real number indicating the order of integration. Moreover,
there is another definition of the fractional derivative, namely the Caputo
definition; denoting it by ${}^{C}_{a}D_x^{\beta} f(x)$, it is defined as follows
Hadian et al. (2020):

${}^{C}_{a}D_x^{\beta} f(x) = {}^{RL}_{\ a}I_x^{(k-\beta)} f^{(k)}(x) = \begin{cases} \dfrac{1}{\Gamma(k-\beta)} \displaystyle\int_a^x \dfrac{f^{(k)}(t)\,dt}{(x-t)^{\beta+1-k}}, & \beta \notin \mathbb{N}, \\[2mm] \dfrac{d^{\beta} f(x)}{dx^{\beta}}, & \beta \in \mathbb{N}, \end{cases}$    (10.6)

where $k = \lceil \beta \rceil$. It is worth mentioning that Eqs. (10.7) and
(10.8) are the most useful properties of the Caputo derivative Hadian et al.
(2020):

${}^{C}_{0}D_x^{\alpha}\, x^{\gamma} = \begin{cases} \dfrac{\Gamma(\gamma+1)}{\Gamma(\gamma-\alpha+1)}\, x^{\gamma-\alpha}, & 0 \le \alpha \le \gamma, \\[1mm] 0, & \alpha > \gamma, \end{cases}$    (10.7)

and, since the Caputo derivative is a linear operator Hadian et al. (2020),

${}^{C}_{a}D_x^{\alpha}\big(\lambda f(x) + \mu g(x)\big) = \lambda\,{}^{C}_{a}D_x^{\alpha} f(x) + \mu\,{}^{C}_{a}D_x^{\alpha} g(x), \qquad \lambda, \mu \in \mathbb{R}.$    (10.8)
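Property (10.7) can be cross-checked against the definition (10.6). The sketch
below (with our own choices f(t) = t² and β = 1/2) evaluates the Caputo integral
numerically after the substitution t = x − s², which removes the endpoint
singularity, and compares the result with the closed form.

```python
import math

def caputo_half_t2(x, n=10000):
    # Caputo derivative of f(t) = t^2 at order beta = 1/2 via definition (10.6):
    #   (1/Gamma(1/2)) * int_0^x f'(t) / sqrt(x - t) dt,  with f'(t) = 2t.
    # Substituting t = x - s^2 turns the integrand into 4*(x - s^2),
    # which a plain midpoint rule handles easily.
    b = math.sqrt(x)
    h = b / n
    integral = h * sum(4 * (x - ((j + 0.5) * h) ** 2) for j in range(n))
    return integral / math.gamma(0.5)

x = 0.8
numeric = caputo_half_t2(x)
closed_form = math.gamma(3) / math.gamma(2.5) * x ** 1.5   # property (10.7)
print(abs(numeric - closed_form))   # small: the two expressions agree
```

The agreement (to roughly eight decimal places here) confirms that the monomial
rule (10.7) is just the definition (10.6) evaluated in closed form.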

10.2.2 Numerical Integration

In numerical analysis, a quadrature rule approximates the definite integral of a
function, usually as a weighted sum of function values at specified points within
the domain of integration. An $n$-point Gaussian quadrature rule, named after the
German mathematician Carl Friedrich Gauss, is a quadrature rule constructed to
yield an exact result for polynomials of degree $2n - 1$ or less by a suitable
choice of the nodes $x_i$ and weights $\omega_i$. The most common domain of
integration for such a rule is $[-1, 1]$, but we assume the interval $[a, b]$ for
the integral, so the rule is stated as

$\displaystyle\int_a^b f(x)\,dx \approx \sum_{i=1}^{n} \omega_i\, f(x_i).$    (10.9)

Now we consider the following theorem.

Theorem 10.1 (Mastroianni and Milovanovic (2008)) Let $x_0, x_1, \ldots, x_N$ be
the roots of the orthogonal polynomial $p_{N+1}$ of degree $N + 1$. Then there
exists a unique set of quadrature weights $\omega_0, \omega_1, \ldots, \omega_N$,
given by $\omega_j = \int_a^b h_j(x)\,\omega(x)\,dx$ (where $h_j$ denotes the
corresponding Lagrange basis polynomial), such that

$\displaystyle\int_a^b \omega(x)\, p(x)\,dx = \sum_{i=0}^{N} \omega_i\, p(x_i) \qquad \forall\, p \in \mathbb{P}_{2N+1}.$    (10.10)

The weights are positive and can be calculated as

$\omega_i = \dfrac{k_{N+1}}{k_N}\cdot\dfrac{\|p_N\|_{\omega}^2}{p_N'(x_i)\, p_{N+1}(x_i)}, \qquad 0 \le i \le N,$    (10.11)

where $k_j$ is the leading coefficient of $p_j$.

In the next section, the proposed method is described using the least squares
support vector regression formulation.
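A minimal illustration of the rule (10.9) on a general interval [a, b], using
NumPy's Gauss-Legendre nodes (the integrand and interval below are our own
choices):

```python
import numpy as np

def gauss_on_interval(f, a, b, n):
    # n-point Gauss-Legendre rule mapped from [-1, 1] to [a, b], as in Eq. (10.9)
    x, w = np.polynomial.legendre.leggauss(n)
    xm = 0.5 * (b - a) * x + 0.5 * (a + b)   # relocated nodes
    return 0.5 * (b - a) * float(np.sum(w * f(xm)))

# A 3-point rule integrates polynomials up to degree 2*3 - 1 = 5 exactly:
# int_0^2 x^5 dx = 64/6.
approx = gauss_on_interval(lambda x: x ** 5, 0.0, 2.0, 3)
print(approx, 64.0 / 6.0)
```

The same mapped rule is reused later to integrate over the order variable p when
assembling the collocation matrix.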

10.3 LS-SVR Method for Solving Distributed-Order Fractional Differential Equations

In this section, the LS-SVR formulation is first introduced and some of its
characteristics are discussed; then the LS-SVR-based method for solving DOFDEs is
presented. Building on the concepts of LS-SVM, we base our solution on LS-SVM
regression. To this end, consider the following equation

$\displaystyle\int_a^b G_1\big(p, D^p u(t)\big)\,dp + G_2\big(t, u(t), D^{\alpha_i} u(t)\big) = F(t), \quad t \in [0, \eta],$    (10.12)

with the following initial conditions:

$u^{(k)}(0) = 0,$    (10.13)

where $k = 0, 1, \ldots, \max\{b, \alpha_i\}$. We now discuss the analysis of the
method in the linear situation

$\displaystyle\int_a^b g(p)\, D^p u(t)\,dp + A(u(t)) = F(t);$    (10.14)

since $A$ is a linear operator, we can rewrite Eq. (10.14) as

$L(u(t)) = F(t).$    (10.15)

We now consider the LS-SVR ansatz for solving Eq. (10.14):

$u(x) \approx u_N(x) = \omega^T \phi(x),$    (10.16)

in which $\omega = [\omega_0, \ldots, \omega_N]^T$ are the weight coefficients
and $\phi = [\phi_0, \ldots, \phi_N]^T$ are the basis functions. To determine the
unknown coefficients, we consider the following optimization problem

$\min_{\omega, \varepsilon}\ \dfrac{1}{2}\,\omega^T\omega + \dfrac{\gamma}{2}\,\varepsilon^T\varepsilon,$    (10.17)

such that for every $i = 0, \ldots, N$

$\displaystyle\int_a^b g(p)\, D^p u(t_i)\,dp + A u(t_i) - F(t_i) = \varepsilon_i,$    (10.18)

hence we have

$L(u(t_i)) - F(t_i) = \varepsilon_i.$    (10.19)
Considering $u(t_i) = \sum_{j=0}^{N} \omega_j \phi_j(t_i)$, the Lagrangian reads

$\mathcal{L} = \dfrac{1}{2}\,\omega^T\omega + \dfrac{\gamma}{2}\,\varepsilon^T\varepsilon - \displaystyle\sum_{i=0}^{N} \lambda_i \big(L(u(t_i)) - F(t_i) - \varepsilon_i\big),$    (10.20)

that is,

$\mathcal{L} = \dfrac{1}{2}\displaystyle\sum_{j=0}^{N} \omega_j^2 + \dfrac{\gamma}{2}\sum_{i=0}^{N} \varepsilon_i^2 - \sum_{i=0}^{N} \lambda_i \Big(\sum_{j=0}^{N} \omega_j\, L(\phi_j(t_i)) - F_i - \varepsilon_i\Big).$    (10.21)

We can compute the stationarity conditions of Eq. (10.21) as follows:

$\dfrac{\partial \mathcal{L}}{\partial \omega_k} = 0 \ \Rightarrow\ \omega_k - \displaystyle\sum_{i=0}^{N} \lambda_i\, \underbrace{L(\phi_k(t_i))}_{S_{ki}} = 0, \qquad k = 0, 1, \ldots, N,$

$\dfrac{\partial \mathcal{L}}{\partial \varepsilon_k} = 0 \ \Rightarrow\ 2\gamma\,\varepsilon_k + \lambda_k = 0, \qquad k = 0, 1, \ldots, N,$    (10.22)

$\dfrac{\partial \mathcal{L}}{\partial \lambda_k} = 0 \ \Rightarrow\ \displaystyle\sum_{j=0}^{N} \omega_j\, L(\phi_j(t_k)) = F_k + \varepsilon_k, \qquad k = 0, 1, \ldots, N,$

and these can be summarized in vector form as follows:

$\begin{cases} \omega - S\lambda = 0 & \text{(I)} \\ 2\gamma\varepsilon = -\lambda \ \Rightarrow\ \varepsilon = -\dfrac{1}{2\gamma}\,\lambda & \text{(II)} \\ S^T\omega - f - \varepsilon = 0 & \text{(III)} \end{cases}$    (10.23)

The matrix $S$ is given by

$S_{ij} = L(\phi_i(t_j)) = \displaystyle\int_a^b g(p)\, D^p \phi_i(t_j)\,dp + A\phi_i(t_j),$    (10.24)

and, accordingly, $M_{ij}^{(p)} := D^p \phi_i(t_j)$ can be defined. We use
Gaussian numerical integration to calculate $\int_a^b g(p)\, M_{ij}^{(p)}\,dp$.
There are two ways to set up such a numerical integration rule

$\displaystyle\int_a^b f(x)\,dx \approx \sum_{i=1}^{N} \omega_i\, f(x_i).$    (10.25)

Ordinarily, one can apply the Newton-Cotes approach, fixing the nodes $x_i$ and
solving for the weights $\omega_i$. Alternatively, both the $x_i$ and the
$\omega_i$ can be optimized; in that case the $x_i$ are the roots of the Legendre
polynomials relocated to the interval $[a, b]$. Note that evaluating $Au(t_i)$,
where $A$ is a (possibly fractional) differential operator, has already been
explained. With the help of Eq. (10.23)-(II), and substituting Eq. (10.23)-(I)
into Eq. (10.23)-(III), we obtain

$S^T S \lambda + \dfrac{1}{2\gamma}\,\lambda = f.$    (10.26)

By defining $A := S^T S + \dfrac{1}{2\gamma}\, I$, the following system is obtained:

$A\lambda = f.$    (10.27)

Remark 10.1 The matrix $A$ is positive definite if and only if $x^T A x > 0$ for
every nonzero vector $x \in \mathbb{R}^{n+1}$. For every such $x$ we can write

$x^T\Big(S^T S + \dfrac{1}{2\gamma}\, I\Big)x = x^T S^T S\, x + \dfrac{1}{2\gamma}\,x^T x = \|Sx\|^2 + \dfrac{1}{2\gamma}\,\|x\|^2,$    (10.28)

which is strictly positive whenever $\gamma > 0$; hence the matrix $A$ is
positive definite.
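Remark 10.1 can be sanity-checked numerically: for any real S and γ > 0, a
Cholesky factorization of S^T S + (1/(2γ)) I succeeds, which is equivalent to
positive definiteness. The matrix size, seed, and γ below are arbitrary
illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((6, 6))          # an arbitrary real matrix S
gam = 10.0                               # any gamma > 0
A = S.T @ S + np.eye(6) / (2 * gam)      # A := S^T S + (1/(2*gamma)) I

# Cholesky succeeds iff A is symmetric positive definite
L = np.linalg.cholesky(A)
print(np.allclose(L @ L.T, A))           # True
print(np.linalg.eigvalsh(A).min() > 0)   # all eigenvalues positive
```

Since S^T S is positive semi-definite, adding the positive multiple of the
identity shifts every eigenvalue above zero, exactly as the remark argues.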

Theorem 10.2 The linear system of Eq. (10.23) has a unique solution.

Proof By Remark 10.1, the coefficient matrix $A$ of the reduced system (10.27) is
positive definite and hence invertible, so $\lambda$ — and, through criteria (II)
and (I) of Eq. (10.23), $\varepsilon$ and $\omega$ — are uniquely determined. □

Since the coefficient matrix of Eq. (10.27) is positive definite, solving Eq.
(10.27) yields $\lambda$, and criterion (I) of Eq. (10.23) then gives $\omega$.

In the nonlinear case, the nonlinear equation can be converted into a sequence of
linear equations using the Quasi-Linearization Method (QLM); the resulting linear
equations are then solved one by one as above.
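The pipeline above can be sketched end to end. The code below solves the linear
model problem of Sect. 10.4.1 (Test Problem 1), whose exact solution is
u(t) = t². For simplicity it uses plain monomials t^j instead of the modal
Legendre basis, evaluates D^p t^j in closed form via property (10.7), and picks
arbitrary collocation points and γ; all of these are our own illustrative
choices, not the chapter's exact implementation.

```python
import math
import numpy as np

def caputo_monomial(p, j, t):
    # Caputo derivative C_0 D_t^p of t^j via property (10.7); zero when p > j
    if p > j:
        return 0.0
    return math.gamma(j + 1) / math.gamma(j + 1 - p) * t ** (j - p)

def L_phi(j, t, a=0.2, b=1.5, nq=30):
    # L(phi_j)(t) = int_a^b Gamma(3 - p) D^p t^j dp, by Gauss-Legendre in p
    x, w = np.polynomial.legendre.leggauss(nq)
    p = 0.5 * (b - a) * x + 0.5 * (a + b)
    return 0.5 * (b - a) * sum(
        wi * math.gamma(3 - pi) * caputo_monomial(pi, j, t)
        for wi, pi in zip(w, p))

N = 2                                  # basis {1, t, t^2}
ts = np.array([0.25, 0.5, 0.75])       # collocation (training) points
f = 2 * (ts ** 1.8 - ts ** 0.5) / np.log(ts)   # right-hand side of (10.30)

S = np.array([[L_phi(k, t) for t in ts] for k in range(N + 1)])  # S[k, i]
gam = 1e10                             # regularization parameter gamma
A = S.T @ S + np.eye(N + 1) / (2 * gam)        # Eqs. (10.26)-(10.27)
lam = np.linalg.solve(A, f)
omega = S @ lam                                # criterion (I) of Eq. (10.23)

u_app = lambda t: sum(omega[j] * t ** j for j in range(N + 1))
print(omega)                 # close to [0, 0, 1], i.e. u(t) = t^2
print(abs(u_app(0.5) - 0.25))
```

With only three basis functions the recovered coefficient vector is already very
close to (0, 0, 1), reproducing the exact solution; the modal Legendre basis of
the chapter plays the same role with better conditioning for larger N.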
In the next section, numerical examples are provided to indicate the efficiency
and accuracy of the presented method and to illustrate the convergence of the
proposed approach.

10.4 Numerical Results and Discussion

In this section, various numerical examples — both linear and nonlinear — are
presented to demonstrate the accuracy and efficiency of the method described
above, together with a comparison of our results with other related works.
Throughout, the interval $[0, T]$ is taken as the computational domain.
Additionally, the norm used for comparing the exact and approximate solutions,
denoted by $\|e\|_2$, is defined as follows:

$\|e\|_2 = \left( \displaystyle\int_0^T \big(u(t) - u_{\mathrm{app}}(t)\big)^2\,dt \right)^{\frac{1}{2}},$    (10.29)

in which $u(t)$ is the exact solution and $u_{\mathrm{app}}(t)$ is the
approximate solution. All numerical experiments are computed in Maple on a 3.5
GHz Intel Core i5 CPU machine with 8 GB of RAM.

10.4.1 Test Problem 1

As the first example, consider the following equation Yuttanan and Razzaghi
(2019):

$\displaystyle\int_{0.2}^{1.5} \Gamma(3-p)\, D^p u(t)\,dp = 2\,\dfrac{t^{1.8} - t^{0.5}}{\ln(t)},$    (10.30)

with the following initial conditions:

$u(0) = u'(0) = 0,$    (10.31)

for which $u(t) = t^2$ is the exact solution. The proposed method is then
applied. Figure 10.1 shows the errors for Example 10.4.1 from three different
aspects: the number of Gaussian points, the value of Gamma, and the number of
basis functions.

From Fig. 10.1a it can be concluded that, as the number of Gaussian points
increases, the error converges to zero; Fig. 10.1b shows that the error decreases
exponentially; and Fig. 10.1c shows that increasing the number of basis functions
steadily reduces the error.

10.4.2 Test Problem 2

Consider the following nonlinear equation Xu et al. (2019):

$\displaystyle\int_0^1 \Gamma(5-p)\, D^p u(t)\,dp = \sin(u(t)) + 24\,\dfrac{t^4 - t^3}{\ln(t)} - \sin(t^4),$    (10.32)

with the initial condition

$u(0) = 0,$    (10.33)

for which $u(t) = t^4$ is the exact solution. Figure 10.2 shows the error
behavior for different numbers of Gaussian points, different values of Gamma, and
different numbers of basis functions.
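One can verify numerically that u(t) = t⁴ indeed satisfies Eq. (10.32): by
property (10.7) the integrand reduces to 24 t^(4−p), and a simple midpoint rule
in p reproduces the right-hand side. The evaluation point t = 0.6 and the
resolution are our own choices:

```python
import math

def dp_t4(p, t):
    # Caputo derivative of t^4 of order p, from property (10.7)
    return math.gamma(5) / math.gamma(5 - p) * t ** (4 - p)

def lhs(t, n=2000):
    # int_0^1 Gamma(5 - p) D^p u(t) dp by the midpoint rule
    # (the integrand simplifies to 24 * t^(4 - p))
    h = 1.0 / n
    return h * sum(math.gamma(5 - (j + 0.5) * h) * dp_t4((j + 0.5) * h, t)
                   for j in range(n))

t = 0.6
u = t ** 4
rhs = math.sin(u) + 24 * (t ** 4 - t ** 3) / math.log(t) - math.sin(t ** 4)
print(abs(lhs(t) - rhs))   # small residual: u(t) = t^4 satisfies Eq. (10.32)
```

Note that sin(u(t)) and sin(t⁴) cancel exactly for the true solution, so the
residual reduces to the quadrature error of the distributed-order integral.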
Figure 10.2a shows that increasing the number of Gaussian points drives the error
closer to zero, Fig. 10.2b shows that the error decreases roughly exponentially,
and Fig. 10.2c shows that the error decreases as the number of basis functions
grows.
Fig. 10.1 Calculated error for Example 10.4.1 from three different aspects: (a) number of Gaussian points, (b) Gamma values, (c) number of basis functions

This example has also been solved in Xu et al. (2019). Table 10.1 compares the
method mentioned there with the one presented here; for all tested numbers of
Gaussian points, the proposed method achieves better results.

10.4.3 Test Problem 3

Consider the following example Xu et al. (2019):

$\displaystyle\int_0^1 \Gamma(7-p)\, D^p u(t)\,dp = -u^3(t) - u(t) + 720\,\dfrac{t^5(t-1)}{\ln(t)} + t^{18} + t^6,$    (10.34)
Fig. 10.2 Calculated error for Example 10.4.2 from three different aspects: (a) number of Gaussian points, (b) Gamma values, (c) number of basis functions

Table 10.1 Results for Example 10.4.2 in comparison with the method of Xu et al. (2019)

M    Method of Xu et al. (2019), N = 4    Presented method, N = 4
2    2.4768E-008                          6.9035E-010
3    9.0026E-011                          2.3935E-014
4    2.3632E-013                          8.4114E-018
5    2.7741E-015                          3.0926E-023

by considering the following initial condition

$u(0) = 0,$    (10.35)

for which $u(t) = t^6$ is the exact solution. Figure 10.3 displays the error from
the same three aspects. From Fig. 10.3a it is clear that the error decreases as
the number of Gaussian points increases; likewise, Fig. 10.3b, c show that the
error decreases as Gamma and the number of basis functions increase,
respectively.
Table 10.2 compares the results of the proposed technique with those of the
method presented in Xu et al. (2019); it is clear from the table that the
suggested method performs considerably better than Xu's method.

Fig. 10.3 Calculated error for Example 10.4.3 from three different aspects: (a) number of Gaussian points, (b) Gamma values, (c) number of basis functions

Table 10.2 Comparison of our results for Example 10.4.3 with the method of Xu et al. (2019)

M    Xu et al. (2019), N = 7    Presented, N = 7    Xu et al. (2019), N = 9    Presented, N = 9
2    4.3849E-009                1.1538E-010         1.3008E-008                1.1550E-010
3    1.5375E-011                1.8923E-015         3.8919E-011                1.8932E-015
4    3.7841E-014                3.0791E-019         8.0040E-014                2.4229E-019
5    3.6915E-016                3.1181E-019         1.2812E-015                2.4578E-019

10.4.4 Test Problem 4

Consider the following example Xu et al. (2019):

$\displaystyle\int_0^1 \Gamma(4.1-p)\,e^p\, D^p u(t)\,dp = -u^3(t) - u(t) + \Gamma(4.1)\,\dfrac{t^{2.1}(e-t)}{1-\ln(t)} + t^{9.3} + t^{3.1},$    (10.36)

with the initial condition

$u(0) = 0.$    (10.37)

The exact solution is $u(t) = t^{3.1}$. Figure 10.4a-c confirms that the error
converges to zero as the number of Gaussian points, the value of Gamma, and the
number of basis functions increase, respectively. Table 10.3 lists the results
obtained in Xu et al. (2019) alongside those obtained with the presented method.

10.4.5 Test Problem 5

For the final example, consider the following equation Xu et al. (2019):

$\displaystyle\int_0^2 \Gamma(6-p)\, D^p u(t)\,dp = 120\,\dfrac{t^5 - t^3}{\ln(t)},$    (10.38)

with the following initial condition

$u(0) = 0,$    (10.39)

where $u(t) = t^5$ is the exact solution. Figure 10.5 shows the behavior of the
error from the same three aspects.
Fig. 10.4 Calculated error for Example 10.4.4 from three different aspects: (a) number of Gaussian points, (b) Gamma values, (c) number of basis functions

Table 10.3 Comparison of our results for Example 10.4.4 with the method of Xu et al. (2019)

M    Xu et al. (2019), N = 6    Presented, N = 6    Xu et al. (2019), N = 14    Presented, N = 14
2    4.4119E-005                1.5437E-005         2.4317E-006                 2.5431E-007
3    4.3042E-005                1.5435E-005         2.4984E-007                 2.5221E-007
4    4.3034E-005                1.5435E-005         2.4195E-007                 2.5221E-007
5    4.3034E-005                1.5435E-005         2.4190E-007                 2.5221E-007

Figure 10.5a-c shows that the error defined in Eq. (10.29) converges to zero; in
Fig. 10.5a the convergence is exponential, meaning the approximate results
converge to the exact solution.
Fig. 10.5 Calculated error for Example 10.4.5 from three different aspects: (a) number of Gaussian points, (b) Gamma values, (c) number of basis functions

Table 10.4 Comparison of our results for Example 10.4.5 with the method of Xu et al. (2019)

N    Xu et al. (2019), M = 4    Proposed method, M = 4
5    4.2915E-010                4.8113E-002
6    3.3273E-010                1.6388E-013
7    1.9068E-010                1.4267E-013
8    9.4633E-011                1.2609E-013
9    6.7523E-011                1.1283E-013

Table 10.4 compares our proposed method with the method proposed in Xu et al.
(2019); except for N = 5, our method achieves more accurate results.

10.5 Conclusion

In this chapter, the LS-SVR algorithm, using modal Legendre functions as the
basis, was utilized for solving distributed-order fractional differential equations.
The algorithm achieves better accuracy than other methods proposed for DOFDEs,
and the uniqueness of the solution is guaranteed. One of the important parameters
for obtaining high precision in solving this kind of equation is the Gamma parameter;
its impact on the accuracy of the numerical algorithm is demonstrated in the
numerical examples. Moreover, the Gamma value cannot be increased arbitrarily,
because the resulting quantities may exceed machine accuracy; this point must be
kept in mind even though all of the computations are performed in MAPLE, a
symbolic computation environment. To demonstrate the applicability of the proposed
algorithm, five examples are presented, and the accuracy obtained for these
examples is compared with other methods.

References

Aminikhah, H., Sheikhani, A.H.R., Rezazadeh, H.: Approximate analytical solutions of distributed
order fractional Riccati differential equation. Ain Shams Eng. J. 9, 581–588 (2018)
Atanackovic, T.M., Pilipovic, S., Zorica, D.: Time distributed-order diffusion-wave equation. II.
Applications of Laplace and Fourier transformations. Proc. R. Soc. A: Math. Phys. Eng. Sci. 465,
1893–1917 (2009)
Atanackovic, T.M., Budincevic, M., Pilipovic, S.: On a fractional distributed-order oscillator. J.
Phys. A: Math. Gen. 38, 6703 (2005)
Atanacković, T.M., Oparnica, L., Pilipović, S.: On a nonlinear distributed order fractional differential
equation. J. Math. Anal. 328, 590–608 (2007)
Atangana, A., Gómez-Aguilar, J.F.: A new derivative with normal distribution kernel: theory, meth-
ods and applications. Phys. A: Stat. Mech. Appl. 476, 1–14 (2017)
Bagley, R.L., Torvik, P.J.: Fractional calculus in the transient analysis of viscoelastically damped
structures. AIAA J. 23, 918–925 (1985)
Baymani, M., Teymoori, O., Razavi, S.G.: Method for solving differential equations. Am. J. Comput.
Sci. Inf. Eng. 3, 1–6 (2016)
Cao, K.C., Zeng, C., Chen, Y., Yue, D.: Fractional decision making model for crowds of pedestrians
in two-alternative choice evacuation. IFAC-PapersOnLine 50, 11764–11769 (2017)
Caputo, M.: Linear models of dissipation whose Q is almost frequency independent-II. Geophys.
J. Int. 13, 529–539 (1967)
Caputo, M.: Diffusion with space memory modelled with distributed order space fractional differ-
ential equations. Ann. Geophys. 46, 223–234 (2003)
Carpinteri, A., Mainardi, F.: Fractals and Fractional Calculus in Continuum Mechanics. Springer
(2014)
Chu, W., Ong, C.J., Keerthi, S.S.: An improved conjugate gradient scheme to the solution of least
squares SVM. IEEE Trans. Neural Netw. 16, 498–501 (2005)
Datsko, B., Gafiychuk, V., Podlubny, I.: Solitary travelling auto-waves in fractional reaction-
diffusion systems. Commun. Nonlinear Sci. Numer. Simul. 23, 378–387 (2015)
Diethelm, K., Ford, N.J.: Numerical analysis for distributed-order differential equations. J. Comput.
Appl. Math. 225, 96–104 (2009)

Ding, W., Patnaik, S., Sidhardh, S., Semperlotti, F.: Applications of distributed-order fractional
operators: a review. Entropy 23, 110 (2021)
Dissanayake, M.W.M.G., Phan-Thien, N.: Neural-network-based approximations for solving partial
differential equations. Commun. Numer. Methods Eng. 10, 195–201 (1994)
Effati, S., Pakdaman, M.: Artificial neural network approach for solving fuzzy differential equations.
Inf. Sci. 180, 1434–1457 (2010)
Gao, G.H., Sun, Z.Z.: Two alternating direction implicit difference schemes for two-dimensional
distributed-order fractional diffusion equations. J. Sci. Comput. 66, 1281–1312 (2016)
Golbabai, A., Seifollahi, S.: Numerical solution of the second kind integral equations using radial
basis function networks. Appl. Math. Comput. 174, 877–883 (2006)
Hadian Rasanan, A.H., Bajalan, N., Parand, K., Rad, J.A.: Simulation of nonlinear fractional dynam-
ics arising in the modeling of cognitive decision making using a new fractional neural network.
Math. Methods Appl. Sci. 43, 1437–1466 (2020)
Hadian-Rasanan, A.H., Rad, J.A., Sewell. D. K.: Are there jumps in evidence accumulation, and
what, if anything, do they reflect psychologically? An analysis of Lévy-Flights models of decision-
making. PsyArXiv (2021). https://doi.org/10.31234/osf.io/vy2mh
Hadian-Rasanan, A.H., Rahmati, D., Gorgin, S., Parand, K.: A single layer fractional orthogonal
neural network for solving various types of Lane-Emden equation. New Astron. 75, 101307
(2020)
Hartley, T.T.: Fractional system identification: an approach using continuous order-distributions.
NASA Glenn Research Center (1999)
Heydari, M.H., Atangana, A., Avazzadeh, Z., Mahmoudi, M.R.: An operational matrix method
for nonlinear variable-order time fractional reaction-diffusion equation involving Mittag-Leffler
kernel. Eur. Phys. J. Plus 135, 1–19 (2020)
Hilfer, R.: Applications of Fractional Calculus in Physics. World Scientific (2000)
Jianyu, L., Siwei, L., Yingjian, Q., Yaping, H.: Numerical solution of elliptic partial differential
equation using radial basis function neural networks. Neural Netw. 16, 729–734 (2003)
Katsikadelis, J.T.: Numerical solution of distributed order fractional differential equations. J. Com-
put. Phys. 259, 11–22 (2014)
Lagaris, I.E., Likas, A., Fotiadis, D.I.: Artificial neural networks for solving ordinary and partial
differential equations. IEEE Trans. Neural Netw. Learn. Syst. 9, 987–1000 (1998)
Leake, C., Johnston, H., Smith, L., Mortari, D.: Analytically embedding differential equation con-
straints into least squares support vector machines using the theory of functional connections.
Mach. Learn. Knowl. Extr. 1, 1058–1083 (2019)
Li, X., Li, H., Wu, B.: A new numerical method for variable order fractional functional differential
equations. Appl. Math. Lett. 68, 80–86 (2017)
Mai-Duy, N., Tran-Cong, T.: Numerical solution of differential equations using multiquadric radial
basis function networks. Neural Netw. 14, 185–199 (2001)
Mashayekhi, S., Razzaghi, M.: Numerical solution of distributed order fractional differential equa-
tions by hybrid functions. J. Comput. Phys. 315, 169–181 (2016)
Mashoof, M., Sheikhani, A.R.: Simulating the solution of the distributed order fractional differential
equations by block-pulse wavelets. UPB Sci. Bull. Ser. A: Appl. Math. Phys 79, 193–206 (2017)
Mastroianni, G., Milovanovic, G.: Interpolation Processes: Basic Theory and Applications. Springer
Science & Business Media, Berlin (2008)
Meade, A.J., Jr., Fernandez, A.A.: The numerical solution of linear ordinary differential equations
by feedforward neural networks. Math. Comput. Model. 19, 1–25 (1994)
Meade, A.J., Jr., Fernandez, A.A.: Solution of nonlinear ordinary differential equations by feedfor-
ward neural networks. Math. Comput. Model. 20, 19–44 (1994)
Mehrkanoon, S., Suykens, J.A.: LS-SVM based solution for delay differential equations. J. Phys.:
Conf. Ser. 410, 012041 (2013)
Mehrkanoon, S., Falck, T., Suykens, J.A.: Approximate solutions to ordinary differential equations
using least squares support vector machines. IEEE Trans. Neural Netw. Learn. Syst. 23, 1356–
1367 (2012)

Najafi, H.S., Sheikhani, A.R., Ansari, A.: Stability analysis of distributed order fractional differential
equations. Abst. Appl. Anal. 2011, 175323 (2011)
Ozer, S., Chen, C.H., Cirpan, H.A.: A set of new Chebyshev kernel functions for support vector
machine pattern classification. Pattern Recognit. 44, 1435–1447 (2011)
Pan, Z.B., Chen, H., You, X.H.: Support vector machine with orthogonal Legendre kernel. In: 2012
International Conference on Wavelet Analysis and Pattern Recognition, pp. 125–130 (2012)
Parodi, M., Gómez, J.C.: Legendre polynomials based feature extraction for online signature veri-
fication. Consistency analysis of feature combinations. Pattern Recognit. 47, 128–140 (2014)
Podlubny, I.: Fractional differential equations: an introduction to fractional derivatives, fractional
differential equations, to methods of their solution and some of their applications. Elsevier (1998)
Rad, J.A., Kazem, S., Shaban, M., Parand, K., Yildirim, A.: Numerical solution of fractional differ-
ential equations with a Tau method based on Legendre and Bernstein polynomials. Math. Methods
Appl. Sci. 37, 329–342 (2014)
Refahi, A., Ansari, A., Najafi, H.S., Merhdoust, F.: Analytic study on linear systems of distributed
order fractional differential equations. Matematiche 67, 3–13 (2012)
Rossikhin, Y.A., Shitikova, M.V.: Applications of fractional calculus to dynamic problems of linear
and nonlinear hereditary mechanics of solids. Appl. Mech. Rev. 50, 15–67 (1997)
Sokolov, I. M., Chechkin, A.V., Klafter, J.: Distributed-order fractional kinetics (2004).
arXiv:0401146
Umarov, S., Gorenflo, R.: Cauchy and nonlocal multi-point problems for distributed order pseudo-
differential equations: part one. J. Anal. Appl. 245, 449–466 (2005)
Xu, Y., Zhang, Y., Zhao, J.: Error analysis of the Legendre-Gauss collocation methods for the
nonlinear distributed-order fractional differential equation. Appl. Numer. Math. 142, 122–138
(2019)
Ye, N., Sun, R., Liu, Y., Cao, L.: Support vector machine with orthogonal Chebyshev kernel. In:
18th International Conference on Pattern Recognition (ICPR’06), vol. 2, pp. 752–755 (2006)
Yuttanan, B., Razzaghi, M.: Legendre wavelets approach for numerical solutions of distributed
order fractional differential equations. Appl. Math. Model. 70, 350–364 (2019)
Zaky, M.A., Machado, J.T.: On the formulation and numerical simulation of distributed-order frac-
tional optimal control problems. Commun. Nonlinear Sci. Numer. Simul. 52, 177–189 (2017)
Part IV
Orthogonal Kernels in Action
Chapter 11
GPU Acceleration of LS-SVM, Based on
Fractional Orthogonal Functions

Armin Ahmadzadeh, Mohsen Asghari, Dara Rahmati, Saeid Gorgin,


and Behzad Salami

Abstract SVM classifiers are employed widely in classification problems. However,
their computational complexity prevents them from being applicable without
acceleration approaches. General-Purpose Computing on Graphics Processing
Units (GPGPU) is one of the most widely used techniques for accelerating array-based
operations. In this chapter, a method for accelerating the SVM with a kernel of
fractional orthogonal functions, implemented on GPU devices, is introduced. The experimental
result for the first kind of Chebyshev function, used as an SVM kernel in this
book, is a 2.2X speedup of the GPU-accelerated implementation over CPU execution;
in the fit function and training part of the code, a 58X speedup on the Google Colab
GPU devices is obtained. This chapter presents more details of the GPU architecture
and how it can be used as a co-processor alongside the CPU to accelerate SVM
classifiers.

Keywords Software accelerator · Kernel functions · Graphical processing unit

A. Ahmadzadeh (B) · M. Asghari


School of Computer Science, Institute for Research in Fundamental Sciences,
Farmanieh Campus, Tehran, Iran
e-mail: [email protected]
M. Asghari
e-mail: [email protected]
D. Rahmati
Computer Science and Engineering (CSE) Department, Shahid Beheshti University,
Tehran, Iran
e-mail: [email protected]
S. Gorgin
Department of Electrical Engineering and Information Technology, Iranian Research
Organization for Science and Technology (IROST), Tehran, Iran
e-mail: [email protected]
B. Salami
Barcelona Supercomputing Center (BSC), Barcelona, Spain
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 247
J. A. Rad et al. (eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support
Vector Machines, Industrial and Applied Mathematics,
https://doi.org/10.1007/978-981-19-6553-1_11
248 A. Ahmadzadeh et al.

11.1 Parallel Processing

Nowadays, computers and their benefits have revolutionized human life. Due to their
vast applications, the demand for accuracy and for multi-feature applications has
made them more complex. These complexities stem from data accessibility and processing
complexity. Chip vendors continually try to produce memories with lower latency
and wider bandwidth to address data accessibility; in addition, the parallel processing
approach addresses processing complexity Asghari et al. (2022). Parallel
processing divides large and complex tasks into several small tasks, allowing the
processing parts to execute simultaneously. GPUs are
special-purpose processors mainly designed for graphical workloads. Using GPUs
for general-purpose computing paved the way for a new era in
high-performance computing Owens et al. (2007), Ahmadzadeh et al. (2018), including
accelerators for cryptosystems Ahmadzadeh et al. (2018), Gavahi et al. (2015),
Luo et al. (2015), multimedia compression standards Xiao et al. (2019), scientific
computing Moayeri et al. (2020), Parand et al. (2021), machine learning and clustering
accelerators Rahmani et al. (2016), simulation of molecular dynamics Allec et al.
(2019), and quantum computing simulation Moayeri et al. (2020). Designers are
gradually improving the GPU architecture to accelerate it further, resulting in
real-time processing in many applications. Within a GPU, many small computational
units handle simple parallel tasks that CPUs were traditionally supposed to
handle. This usage of GPUs instead of CPUs is called General-Purpose Computing on
Graphics Processing Units (GPGPU). One notable milestone for GPGPU
was the release of CUDA (Compute Unified Device Architecture), NVIDIA's parallel
computing platform and programming model. CUDA was introduced in 2007,
allowing many researchers and scientists to deploy their compute-intensive tasks on
GPUs. CUDA provides libraries and APIs that can be used from multiple programming
languages, such as C/C++ and Python, for general-purpose applications such
as multimedia compression, cryptosystems, machine learning, etc. Moreover, GPUs
are generally faster than CPUs at such workloads, performing more instructions in a given time. Therefore,
together with CPUs, GPUs provide heterogeneous and scalable computing,
acting as co-processors while reducing the CPU's workload.
The rest of this chapter is organized as follows. In Sect. 11.2, the NVIDIA GPU
architecture and how it helps us as an accelerator is presented. PyCUDA is also
described in Sect. 11.2. In Sect. 11.3, the proposed programming model is analyzed
before acceleration. Section 11.4 defines the hardware and software requirements
for implementing our GPU acceleration platform. The method of accelerating the
Chebyshev kernel is described in Sect. 11.5 in detail. In Sect. 11.6, we propose critical
optimizations that should be considered for more speedup. Then, we recommend a
Quadratic Problem Solver (QPS) based on GPU in Sect. 11.7 for more speedup in
our platform. This chapter is concluded in Sect. 11.8.
11 GPU Acceleration of LS-SVM, Based on Fractional … 249

11.2 GPU Architecture

GPUs are specifically well-suited for three-dimensional graphical processing.
They are massively parallel processing units that also find use in general-purpose
applications. Because of the power of parallelism and the flexible programming
model of modern GPUs, GPGPU computation has become an attractive platform
for both research and industrial applications. For example, developers can
use GPUs for physics-based simulation, linear algebra operations, Fast Fourier
Transform (FFT) computation and reconstruction, flow visualization, database operations, etc.,
to achieve high performance while maintaining accurate results.
A GPU consists of many streaming multi-processor (SM) modules, and each SM
contains several processors that are referred to as CUDA cores in NVIDIA products
(Fig. 11.1) Cheng et al. (2014). These multi-processors are utilized by scheduling
thousands of threads in thread blocks and dispatching the blocks onto SMs. The
latency of data load/store operations is high in GPUs; however, unlike CPUs, which have
low latency in data loading/storing but lower processing throughput, GPUs are high-throughput
processors. Threads of a block are arranged in WARPs to be scheduled for
execution. Every 32 consecutive threads make a WARP, and all threads of a WARP are
executed simultaneously in the SM as a Single Instruction Multiple Thread (SIMT)
operation. In a SIMT operation, all threads of a WARP execute the same instruction,
and each thread applies the operation to its own private data. By using WARP
scheduling and switching between stalled WARPs and ready-to-execute WARPs,
GPUs can hide the memory latency. In GPUs, code is executed in the form
of kernels. Kernels are blocks of code that are separated from the main code,
called from the CPU (known as the host), and executed on the GPU (known as the device)
(Fig. 11.2) Cheng et al. (2014).
The kernel code should fully exploit the computational task’s data parallelism in
order to benefit fruitfully from the GPU’s massively parallel architecture. A group
of threads that can be launched together is called a thread block. There are different
CUDA memory types in a GPU. The most important memory types include global
memory, shared memory, and register memory Dalrymple (2014). Global memory
is the slowest of the three types, but is accessible to all of the threads and has the
largest capacity. Shared memory is accessible to all of the threads that share the same
thread block, and register memory is accessible only to a single thread, but it is the
fastest of the three memory types.
The host can read from (and write to) the GPU's global memory, as shown in Fig. 11.3.
These accesses are made possible by the CUDA API. The global memory and
constant memory are the only memories whose content can be accessed
and manipulated by the host Cheng et al. (2014). The CUDA programming model
uses fine-grained data parallelism and thread-level parallelism, nested within coarse-
grained data parallelism and task parallelism. The task can be partitioned into sub-
tasks executed independently in parallel by blocks of threads Cheng et al. (2014).
The GPU runs a program function called kernels using thousands of threads. By con-
sidering the GPU limitations and parallel concepts, a CUDA application can achieve

Fig. 11.1 GPU Architecture (CUDA Cores, Shared Memory, Special Function Units, WARP
Scheduler, and Register Files) Cheng et al. (2014)

higher performance by maximizing the utilization of WARP and processor resources and the memory hierarchy resources.
There are two main limitations to GPGPU computations. Indeed, to increase the
number of cores and SMs and achieve their energy efficiency, GPU cores are designed
very simply and with low clock rates. They only provide performance gains when
a program is carefully parallelized into a large number of threads. Hence, trying to
make a general code work on GPUs is a difficult task and often results in inefficient

Fig. 11.2 GPU kernel and thread hierarchy (blocks of threads and grid of blocks) Cheng et al.
(2014)

running. Instead, it is more efficient to execute on the GPU only those pieces of a
program suited to GPU-style parallelism, such as linear algebra code, and leave the
other sections of the program to the CPU. Therefore, writing programs that utilize
the GPU cores well is intricate even for algorithms well-suited to being parallelized.
A generation of libraries has emerged that provide GPU implementations of standard
linear algebra kernels (BLAS); they help separate code into sub-tasks handled by
these libraries and achieve higher performance Cheng et al. (2014). This describes
the first limitation. The second limitation of GPUs for GPGPU computation is that
GPUs use a memory separate from the host memory. In other words, the GPU has its
own memory and access hierarchy and uses a different address space from the host memory.
The host (the CPU) and the GPU cannot share data easily. This limited
communication between the host and the GPU is especially problematic when both
are working on shared data simultaneously. In this case, the data must be transferred
back, and the host must wait for the GPU and vice versa. The worst part of this
scenario is transferring data between the host and the GPU, as it is slow,

Fig. 11.3 GPU memory hierarchy (Global, Constant, Local, Shared, Registers) Cheng et al. (2014)

especially compared to the speed of the host memory or the GPU's dedicated memory.
The maximum speed of data transfer is limited by the PCI Express bus speed,
as the GPU is connected to the host via the PCI Express bus. Consequently, data
transmission often carries the highest cost in GPU computation. Several methods have
been tried to avoid data transfers when their cost exceeds the gain of GPU computation
AlSaber et al. (2013). Therefore, GPU programming has many limitations that need
to be considered to achieve higher performance: for example, the limited shared
memory capacity, the sequence of process executions, branch divergence, and other
bottlenecks, all of which need to be addressed with the GPU architecture in mind.
NVIDIA's different processor architectures, such as Tesla, Kepler, Maxwell, Pascal,
Volta, and Turing, are introduced here for better exploration.

• The Tesla architecture emerged with the introduction of the GeForce 8800 product line,
which unified the vertex and pixel processor units. This architecture is based on
scalable array processors, facilitating an efficient parallel processor. In 2007, the
C2050, another GPU with this architecture, found its way to the market Pienaar
et al. (2011). The performance of the Tesla C2050 reaches 515 GFLOPS in double-precision
arithmetic. It benefits from 3 GB of GDDR5 memory with 144 GB/s
bandwidth. This GPU has 448 CUDA-enabled cores at a frequency of 1.15 GHz.
• Another NVIDIA GPU architecture is codenamed Kepler. This architecture was
introduced in April 2012 Corporation (2019), Wang et al. (2015). Its VLSI technology
feature size is 28 nm, smaller than that of previous architectures. It was
the first NVIDIA GPU architecture designed with energy efficiency
in mind. Several NVIDIA GPUs are based on the Kepler architecture, such as the K20
and K20x, which are Tesla computing devices with double precision. The performance
of the Tesla K20x is 1312 GFLOPS in double-precision computations and
3935 GFLOPS in single-precision computations. This GPU has 6 GB of GDDR5
memory with 250 GB/s bandwidth. The K20x family has 2688 CUDA cores that
work at a frequency of 732 MHz. This architecture was followed by the newer
NVIDIA GPUs of the Maxwell generation (the GTX 980) and also the GeForce 800M series.
• The Maxwell architecture was introduced in 2014, and the GeForce 900 series is a
member of this architecture. It was also manufactured in TSMC's 28 nm process,
the same as the previous model (Kepler) NVIDIA (2014). The Maxwell
architecture has a new generation of streaming multi-processors (SM) that
decreased the power consumption of the SMs. The Maxwell line of products
came with a 2 MB L2 cache. The GTX 980 GPU has the Maxwell internal
architecture, which delivers five TFLOPS of single-precision performance Corpo-
ration (2019). In this architecture, one CUDA core includes an integer and also a
floating-point logic that can work simultaneously. Each SM has a WARP sched-
uler, dispatch unit, instruction cache, and register file. Besides the 128 cores in
each SM, there are eight load/store units (LD/ST) and eight special function units
(SFU) Wang et al. (2015). The GTX 980 has four WARP schedulers in each SM,
enabling four concurrent WARPs to be issued and executed. LD/ST units calculate
source and destination addresses in parallel, letting eight threads load and store
data at each address inside the on-chip cache or DRAM Mei et al. (2016). SFUs
execute transcendental mathematical functions such as Cosine, Sine, and Square
root. At each clock, one thread can execute a single instruction in SFU, so each
WARP has to be executed four times if needed.
• Another NVIDIA GPU architecture, codenamed Pascal, was introduced in April
2016 as the successor of the Maxwell architecture. As members of this architecture,
we can refer to GPU devices such as Tesla P100 and GTX 1080, manufactured in
TSMC’s 16 nm FinFET process technology. The GTX 1080 Ti GPU has the Pascal
internal architecture, which delivers 10 TFLOPS of single-precision performance.
This product has a 3584 CUDA core, which works at 1.5 GHz frequency, and the
memory size is 11 GB with 484 GB/s bandwidth. GTX 1080 Ti has 28 SMs, and
each SM has 128 cores, 256 KB of register file capacity, 96 KB shared memory,
and 48 KB of total L1 cache with CUDA capability 6.1 NVIDIA (2016). This
computation capability has features such as dynamic parallelism and atomic addi-
tion operating on 64-bit float values in GPU global memory and shared memory
NVIDIA (2016).
• The next-generation NVIDIA GPU architecture is Volta, which succeeds the Pascal
GPUs. Introduced in 2017, it is the first chip with Tensor
cores, specially designed for deep learning to achieve more performance than the
regular CUDA cores. NVIDIA GPUs with this architecture, such as the Tesla V100,
are manufactured in TSMC's 12 nm FinFET process. The Tesla V100's performance is 7.8
TFLOPS in double-precision and 15.7 TFLOPS in single-precision floating-point
computations. This GPU card is designed with a CUDA compute capability of 7.0; each
SM has 64 CUDA cores, for 5120 cores in total, and the memory size is
16 GB with 900 GB/s bandwidth, employing the HBM2 memory features
Mei et al. (2016). The maximum power consumption is 250 W; the design is very
efficient in terms of power consumption and performance per watt NVIDIA (2017).
• Another NVIDIA GPU architecture, introduced in 2018, is the Turing
architecture; the famous RTX 2080 Ti product is based on it NVIDIA
(2018). The Turing architecture enjoys the capability of real-time ray tracing, with
dedicated ray-tracing processors and dedicated artificial-intelligence processors
(Tensor Cores). The performance of its 4352 CUDA cores at a 1.35 GHz frequency is
11.7 TFLOPS in single-precision computation, and it uses a GDDR6/HBM2 memory
controller NVIDIA (2018). Its global memory size is 11 GB, its bandwidth is
616 GB/s, and the maximum amount of shared memory per thread block is 64 KB.
The maximum power consumption is 250 W; it is manufactured with TSMC's
12 nm FinFET process, and this GPU supports compute capability 7.5
Ahmadzadeh et al. (2018), NVIDIA (2018), Kalaiselvi et al. (2017),
Choquette et al. (2021).

11.2.1 CUDA Programming with Python

CUDA provides APIs and libraries for the C/C++ programming languages. However,
Python also has its own libraries for accessing the CUDA APIs and GPGPU capabilities.
PyCUDA (2021) provides Pythonic access to the CUDA API for parallel computation
and claims the following:
1. PyCUDA performs automatic object clean-up at the end of an object's lifetime.
2. With some abstractions, it is more convenient than programming with NVIDIA's
C-based runtime.
3. PyCUDA exposes all of CUDA's driver API.
4. It supports automatic error checking.
5. PyCUDA's base layer is written in C++; therefore, it is fast.

11.3 Analyzing Codes and Functions

In order to make an existing CPU-based application faster, it is necessary to
analyze it and find the parts of the code that can be made parallel. One
approach is to find the loops in the SVM code. In particular, in this chapter we focus
on SVM with Chebyshev's first-kind kernel function. The proposed work is divided
into two parts: the training (fit) function and the test (project) function.

11.3.1 Analyzing the Training Function

At the beginning of the fit function, a two-nested for-loop calculates the K matrix.
The matrix K is two-dimensional, with the training-set size as the number of elements
in each dimension. Inside the nested loops, a Chebyshev function is evaluated in each
iteration. The recursive form of the Chebyshev calculation is not appropriate here: in
recursive functions, handling the stack and returning the result of every function call
is problematic, and GPU-based implementations of such repeated-memory-access tasks
are not efficient. However, the polynomial Tn has an explicit (closed) form. For more
details of the Chebyshev function and its explicit form, refer to Chap. 3. The matrix P
is obtained by multiplying the matrix K (element-wise) with the outer product of the
target array. The next complex part of the fit function is the quadratic programming (QP)
solver; for the case of GPU acceleration, CUDA has its own QP solver APIs.
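The structure just described can be sketched on the CPU side in plain Python. This is illustrative only: the function names and sample data are invented for the sketch, and the kernel body (including the small damping constant 0.0002 inside the weight term) follows the shape of Program 2 later in this chapter rather than the book's exact implementation.

```python
import math

def chebyshev_kernel(x, y, n=3):
    # Generalized Chebyshev kernel over the feature pairs of two samples.
    # Uses the explicit form T_i(t) = cos(i * arccos(t)), valid for t in [-1, 1],
    # instead of the recursion, which is unfriendly to GPU execution.
    p_cheb = 1.0
    for xj, yj in zip(x, y):
        s = sum(math.cos(i * math.acos(xj)) * math.cos(i * math.acos(yj))
                for i in range(n + 1))
        weight = math.sqrt(1.0 - xj * yj + 0.0002)
        p_cheb *= s / weight
    return p_cheb

def gram_matrix(X, kernel):
    # The two-nested for-loop of the fit function: K[i][j] = kernel(X[i], X[j]).
    m = len(X)
    return [[kernel(X[i], X[j]) for j in range(m)] for i in range(m)]

X = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.0]]  # toy data, features scaled to [-1, 1]
K = gram_matrix(X, chebyshev_kernel)
# K is a symmetric m-by-m matrix, since the kernel is symmetric in its arguments.
```

The matrix P described above would then combine K with the outer product of the target vector before the result is handed to the QP solver.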

11.3.2 Analyzing the Test Function

In the project function (test or inference), the complex part is the iterations inside the
nested loop, which in turn contains a call to the Chebyshev function. As mentioned in
Sect. 11.3.1, we avoid recursive functions due to the lack of efficient stack handling
in GPUs, and thus using them is not helpful. Therefore, for the test function, we do
the same as we did in the training function for calling the Chebyshev function.

11.4 Hardware and Software Requirements

So far, we have elaborated on choosing the proper GPU architecture for an SVM
application. In our application (accelerating the LS-SVM programs), a single-precision
GPU device is computationally sufficient while complying with performance-cost
limitations; this is in contrast to the training phase, in which a double-precision GPU
is preferred. This work has been tested on an NVIDIA Tesla T4 GPU device based on
the Turing architecture, a power-efficient GPU device with 16 GB of memory, 2560
CUDA cores, and 8.1 TFLOPS of performance; however, any other CUDA-supported
GPU device may be used for this purpose. As a first step to replicate this platform,
the CUDA Toolkit and the appropriate driver for your device must be installed. Then,
Python 3 must be installed in the operating system. Python may be installed
independently or with the Anaconda platform and its libraries. As an alternative,
we recommend using Google Colab Welcome To Colaboratory (2021) to get all the
hardware requirements and software drivers available with only a registration and a
few clicks on a web page.

Fig. 11.4 The output of the nvidia-smi command

Install PyCUDA with the following command

pip3 install pycuda

To check the CUDA toolkit and its functionality, execute this command

nvidia-smi

The above command produces the output shown in Fig. 11.4, which lists the GPU card
information along with its running processes. In order to test PyCUDA, Program 1,
which is implemented in Python 3, can be executed.

Program 1: First PyCUDA program to test the library and driver

1: import pycuda.driver as cuda
2: import pycuda.autoinit
3: from pycuda.compiler import SourceModule
4:
5: if __name__ == "__main__":
6:
7:     mod = SourceModule("""
8:     #include <stdio.h>
9:     __global__ void myfirst_kernel(){
10:        printf("Thread[%d, %d]: Hello PyCUDA!!!\\n",
           threadIdx.x, threadIdx.y);
11:    }
12:    """)
13:
14:    function = mod.get_function("myfirst_kernel")
15:    function(block=(4,4,1))

In this program, the PyCUDA library and its compiler are loaded. Then a source
module is created, which contains C-based CUDA kernel code. At line 15, the function
is called with a 4×4 block of threads inside a single grid.
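The launch configuration can also be reasoned about numerically: a block=(4,4,1) launch creates 16 threads, which fit into a single 32-thread WARP (the hardware pads the WARP with inactive lanes). The small helper below is purely illustrative and is not part of PyCUDA:

```python
import math

WARP_SIZE = 32  # WARP width on all current NVIDIA architectures

def launch_stats(block, grid=(1, 1, 1)):
    # Return (threads per block, total threads, WARPs per block) for a launch.
    threads_per_block = block[0] * block[1] * block[2]
    total_threads = threads_per_block * grid[0] * grid[1] * grid[2]
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    return threads_per_block, total_threads, warps_per_block

print(launch_stats((4, 4, 1)))  # (16, 16, 1): Program 1 launches one padded WARP
```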

11.5 Accelerating the Chebyshev Kernel

To accelerate the Chebyshev kernel, its explicit form (refer to Chap. 3) is more
suitable. As mentioned before, handling recursive functions on GPUs is complex
due to their lack of stack support. Therefore, the recursive Tn statements1 in the
Chebyshev_Tn function should be replaced with their explicit form, as listed in
Program 2.

1 The statement of one iteration in the recursive form of the first-kind Chebyshev polynomial (refer to Chap. 3).
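Why the explicit form is a safe substitute can be checked on the CPU: the closed form Tn(x) = cos(n · arccos(x)) agrees with the recursion T0 = 1, T1 = x, Tn = 2xTn−1 − Tn−2 on [−1, 1]. The following quick sanity check is our own illustration, not part of the book's code:

```python
import math

def cheb_recursive(n, x):
    # Recursion: T_0 = 1, T_1 = x, T_n = 2*x*T_{n-1} - T_{n-2}
    if n == 0:
        return 1.0
    if n == 1:
        return x
    return 2 * x * cheb_recursive(n - 1, x) - cheb_recursive(n - 2, x)

def cheb_explicit(n, x):
    # Closed form on [-1, 1]: T_n(x) = cos(n * arccos(x))
    return math.cos(n * math.acos(x))

# The two forms agree to floating-point accuracy on [-1, 1].
for n in range(6):
    for x in (-0.9, -0.3, 0.0, 0.5, 0.8):
        assert abs(cheb_recursive(n, x) - cheb_explicit(n, x)) < 1e-9
```

The explicit form trades two recursive calls per level for one cosine and one arccosine, which is exactly the flat, stack-free structure a GPU thread can execute.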

Program 2: The explicit form of the Chebyshev_Tn function.

1:  import numpy as np
2:
3:  def Chebyshev(x, y, n=3, f='r'):
4:      m = len(x)
5:      chebyshev_result = 0
6:      p_cheb = 1
7:      for j in range(0, m):
8:          for i in range(0, n+1):
9:              a = np.cos(i * np.arccos(x[j]))
10:             b = np.cos(i * np.arccos(y[j]))
11:             chebyshev_result += (a * b)
12:         weight = np.sqrt(1 - (x[j] * y[j]) + 0.0002)
13:         p_cheb *= chebyshev_result / weight
14:     return p_cheb

In the next step, let us look at the statement that calls the function
Chebyshev. The third line of Program 3 calls Chebyshev(X[i], X[j], n=3,
f='r'). This function is called n_samples^2 times; since n_samples equals 60,
it is called 3600 times.

Program 3: The statement which calls the function Chebyshev.

1: for i in range(n_samples):
2:     for j in range(n_samples):
3:         K[i, j] = self.kernel(X[i], X[j])

Accordingly, the Chebyshev function is first changed into a CUDA kernel as a
single-thread GPU function, and then into a multi-thread GPU function to remove
the nested for-loop shown in Program 3. The C-based CUDA kernel is shown in
Program 4. In this code, the instruction np.cos(i * np.arccos(x[j])) has changed
to cos(i * acos(x[j])) because of the translation from Python to the C
programming language. A single thread executes the Chebyshev function;
therefore, if you put it inside the nested for-loop shown in Program 3, this
single thread will be called 3600 times. As mentioned in the earlier discussion
of GPU architecture, the GPU's global memory is located on the device. Hence,
the x and y variables must be copied from the host side (the CPU's main memory)
to the device memory. For small and sequential applications, this process
therefore takes longer than CPU execution without a GPU.

Program 4: C-based CUDA kernel for the Chebyshev_GPU function

1:  __global__ void Chebyshev(float *x,
                              float *y,
                              float *p_cheb){
2:      int i, j;
3:      float a, b, chebyshev_result, weight;
4:      chebyshev_result = 0;
5:      int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int idx = tid * 4;   /* starting index for this thread */
6:      p_cheb[tid] = 1;
7:      for( j=0; j<4; j++){
8:          for( i=0; i<4; i++){
9:              a = cos(i * acos(x[j+idx]));
10:             b = cos(i * acos(y[j+idx]));
11:             chebyshev_result += (a * b);
12:         }
13:         weight = sqrt(1 - (x[j+idx] * y[j+idx]) + 0.0002);
14:         p_cheb[tid] *= chebyshev_result / weight;
15:     }
16: }

This program should transfer a massive block of data in a single access to reduce
the number of accesses and minimize the memory access latency. Moreover, when a
GPU is active, all its functional units consume energy; hence, utilizing more
threads is more efficient. The program listed in Program 5 breaks the inner
for-loop (second line of Program 3) and then calls the Chebyshev function only
60 times. In Program 5, threadIdx.x is used to break the mentioned for-loop. The
matrix data is treated as a linear, row-major array: the rows of the matrix y
are stored one after another, so the starting index of each thread's row is
calculated as threadIdx.x * 4.
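The row-major indexing arithmetic used above can be checked in plain Python: element (t, j) of an m × k matrix stored as a flat array sits at index t*k + j, which for k = 4 features is exactly the idx = threadIdx.x * 4 starting offset of thread t. A small self-check (our own illustration):

```python
# Row-major flattening: element (t, j) of an m-by-k matrix stored in a flat
# list sits at index t*k + j. With k = 4 features, row t starts at t*4,
# mirroring idx = threadIdx.x * 4 in the kernel.
m, k = 60, 4
M = [[10 * t + j for j in range(k)] for t in range(m)]
flat = [v for row in M for v in row]
for t in range(m):
    for j in range(k):
        assert flat[t * k + j] == M[t][j]
```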
Before calling the CUDA kernel function, we have to load all the needed variables
and array elements into the GPU device's global memory. The mem_alloc instruction
allocates the required memory on the device, and memcpy_htod then copies the data
from the host to the device. As shown in lines 8 and 12, x_gpu and y_gpu are the
allocated memory regions on the device. The x variable is a single row of the X
matrix, which contains 60 rows with four features each; the whole X matrix is
sent as the y input. In line 34, the block size for the threads is defined: the
block contains 60 (n_samples) threads in the x dimension, and its y dimension is
set to 1, i.e., a single-dimension block is used. In order to break the other
for-loop (the first line of Program 3) as well, a two-dimensional block with
60 * 60 threads could be used. In that case, besides reducing host-side memory
accesses for reading the p_cheb outputs and writing the x and y variables to the
device, the amount of transferred data would also be reduced, since the y
variable would not have to be copied to y_gpu in every iteration.

Program 5: The C-based CUDA kernel called with multiple threads

1:  for i in range(n_samples):  # n_samples = 60
2:      p_cheb = np.random.randn(1, n_samples)
3:      p_cheb = p_cheb.astype(np.float32)
4:      p_cheb_gpu = cuda.mem_alloc(p_cheb.nbytes)
5:      cuda.memcpy_htod(p_cheb_gpu, p_cheb)
6:      t1 = X[i].astype(np.float32)
7:      x_gpu = cuda.mem_alloc(t1.nbytes)
8:      cuda.memcpy_htod(x_gpu, t1)
9:
10:     t2 = X.astype(np.float32)
11:     y_gpu = cuda.mem_alloc(t2.nbytes)
12:     cuda.memcpy_htod(y_gpu, t2)
13:
14:     mod = SourceModule("""
15:     #include <stdio.h>
16:     __global__ void Chebyshev(float *x,
                                  float *y,
                                  float *p_cheb){
17:         int i, j;
18:         float a, b, chebyshev_result, weight;
19:         chebyshev_result = 0;
20:         int tid = blockIdx.x * blockDim.x + threadIdx.x;
            int idx = tid * 4;  /* start of thread tid's row of y */
21:         p_cheb[tid] = 1;
22:         for( j=0; j<4; j++){
23:             for( i=0; i<4; i++){
24:                 a = cos(i * acos(x[j]));
25:                 b = cos(i * acos(y[idx+j]));
26:                 chebyshev_result += (a * b);
27:             }
28:             weight = sqrt(1 - (x[j] * y[idx+j]) + 0.0002);
29:             p_cheb[tid] *= chebyshev_result / weight;
30:         }
31:     }
32:     """)
33:     func = mod.get_function("Chebyshev")
34:     func(x_gpu, y_gpu, p_cheb_gpu, block=(n_samples,1,1))
35:     p_cheb = np.empty_like(p_cheb)
36:     cuda.memcpy_dtoh(p_cheb, p_cheb_gpu)
37:     K[i] = p_cheb

At the end of the code (Program 5), in line 36, the CPU reads the whole result
back from the p_cheb_gpu location with the command memcpy_dtoh, which copies the
memory from the device to the host. Therefore, each GPU execution produces one
row of 60 elements of the K matrix.
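Before trusting the GPU results (and the speedup), it is useful to compare them against a CPU reference. A direct Python port of the per-thread computation in Program 5 might look as follows; this is a hypothetical check of our own, not part of the book's code:

```python
import math

# CPU reference for what thread t of the kernel computes: one entry of a row
# of the K matrix. x is one sample (list of features), Y a list of samples.
def chebyshev_row_cpu(x, Y, n=3, eps=0.0002):
    row = []
    for yrow in Y:
        acc, p = 0.0, 1.0          # per-thread chebyshev_result and p_cheb
        for j in range(len(x)):
            for i in range(n + 1):
                acc += math.cos(i * math.acos(x[j])) * math.cos(i * math.acos(yrow[j]))
            w = math.sqrt(1 - x[j] * yrow[j] + eps)
            p *= acc / w
        row.append(p)
    return row
```

Because every term is symmetric in the two samples, chebyshev_row_cpu(a, [a, b])[1] must equal chebyshev_row_cpu(b, [a, b])[0]; asserting this against the GPU output is a cheap correctness test.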

11.6 More Optimizations

In this section, we focus on Program 5. Although it reduces the 3600 calls of the
Chebyshev function to only 60 calls, it is still not efficient: the execution
time on the CPU remains lower than on the GPU. Therefore, for further
acceleration, it is worth considering these hints:
1. The SourceModule object (line 14 of Program 5) is not compiled source code.
Hence, having this statement inside a for-loop is inefficient because of the
extra compilation time in each iteration. We may move lines 14 to 33 of
Program 5 outside the for-loop, before line 1.
2. Memory allocation in each iteration of a for-loop is time-consuming.
Therefore, we should allocate the needed memory once, and then reuse it in
every iteration.
3. It is more efficient to increase the utilization of the CUDA cores. This is
achieved by employing more threads in the implementation, which increases
parallelism and also reduces memory communication.
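Hints 1 and 2 amount to restructuring Program 5 so that compilation and allocation happen once. The following pseudocode-level sketch (using the names from Program 5; it assumes a CUDA device and is therefore not runnable here) shows the hoisted loop:

```python
# Compile once (hint 1) and allocate device buffers once (hint 2), outside
# the loop; only the per-row copy and the kernel launch stay inside.
mod = SourceModule(kernel_source)   # kernel_source: the CUDA string from Program 5
func = mod.get_function("Chebyshev")
X32 = X.astype(np.float32)
p_cheb = np.empty((1, n_samples), dtype=np.float32)
x_gpu = cuda.mem_alloc(X32[0].nbytes)
y_gpu = cuda.mem_alloc(X32.nbytes)
p_cheb_gpu = cuda.mem_alloc(p_cheb.nbytes)
cuda.memcpy_htod(y_gpu, X32)            # the y input never changes
for i in range(n_samples):
    cuda.memcpy_htod(x_gpu, X32[i])     # only one row changes per iteration
    func(x_gpu, y_gpu, p_cheb_gpu, block=(n_samples, 1, 1))
    cuda.memcpy_dtoh(p_cheb, p_cheb_gpu)
    K[i] = p_cheb
```

Only PyCUDA calls already used in Programs 1 and 5 appear here; the loop body is now a single upload, one launch, and one download per Gram-matrix row.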
After these optimizations, a 2.2X speedup is obtained on the Tesla T4 GPU over
the CPU of a Colab machine containing two vCPUs of an Intel Xeon processor
working at 2.20 GHz with 13 GB of memory. You may download the codes from the
footnote link.2 The chosen SVM classification code with the first-kind Chebyshev
kernel has two parts, the fit function and the predict function. Notably, the
speedup for the fit function alone is 58X over the CPU. The nested loops inside
the fit function result in more array-based computations; therefore, the nature
of the algorithm directly affects the rate of acceleration.

2 https://github.com/sampp098/SVM-Kernel-GPU-acceleration-.

11.7 Accelerating the QP Solver

Quadratic Programming (QP) is the process of solving a particular class of
mathematical optimization problems. As mentioned in previous chapters, creating
the best separating hyperplane requires solving a quadratic problem and finding
its minimizer. Therefore, we calculate the K matrix and prepare all the
constraint matrices to form a quadratic problem. Previously, the CVXOPT library
was used to solve this quadratic problem, and the QP solver function is one of
the most complex parts of the code. However, GPU-backed alternatives exist: it
is possible to use the PyTorch and qpth (Amos and Kolter 2017) libraries inside
our Python implementation. As a prerequisite, the following libraries should be
installed.

Installing torchvision and qpth packages

pip3 install torchvision


pip3 install qpth

After the installation, we have to import the solver

Import the solver in your code

from qpth.qp import QPFunction

The following shows the problem formulation. The standard form of the QP is

    min (1/2) x^T P x + q^T x
    subject to: Gx <= h and Ax = b.

However, QPTH defines a quadratic program layer as

    min (1/2) z^T Q z + p^T z
    subject to: Gz <= h and Az = b.

The differences are only in the names of the variables. Therefore, we replace the
following statement:

solution = cvxopt.solvers.qp(P, q, G, h, A, b)

to:

solution = QPFunction(verbose=False)(P, q, G, h, A, b)

It is also necessary to convert all the CVXOPT matrices to NumPy form, for example

P = np.outer(y, y) * K

This library automatically dispatches the computation to the GPU device. The
PyTorch package can also be used for matrix multiplication and other matrix
operations on the GPU device.
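To build intuition for what such a solver computes, the equality-constrained part of the standard form above can be solved by hand: for min (1/2) x^T P x + q^T x subject to Ax = b, the optimum satisfies the KKT linear system [[P, A^T], [A, 0]] [x; λ] = [−q; b]. A small NumPy sketch with illustrative values (our own, not from the book):

```python
import numpy as np

# Equality-constrained QP: minimize f(x) = x1^2 + x2^2 - 2*x1 - 4*x2
# subject to x1 + x2 = 1, i.e. P = 2*I, q = (-2, -4), A = [1, 1], b = 1.
# (G and h are omitted since there are no inequality constraints here.)
P = np.array([[2.0, 0.0], [0.0, 2.0]])
q = np.array([-2.0, -4.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

n, m = P.shape[0], A.shape[0]
kkt = np.block([[P, A.T], [A, np.zeros((m, m))]])   # KKT matrix
rhs = np.concatenate([-q, b])
sol = np.linalg.solve(kkt, rhs)
x, lam = sol[:n], sol[n:]                           # optimum: x = (0, 1)
```

General-purpose solvers such as CVXOPT and qpth additionally handle the inequality block Gx <= h, which is what makes them necessary for the SVM's box constraints.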

11.8 Conclusion

In this chapter, the internal architecture of some NVIDIA GPUs was explained
briefly. GPUs are many-core processors with their own memories. They were
traditionally used only for graphical processing, for example, rendering videos
or enhancing the graphics in computer games. Due to the structure of these
devices, their usage has recently shifted significantly toward general-purpose
applications, i.e., GPGPU programming. Machine learning applications, with
massive datasets and deep neural networks, are among the most common
applications running on GPU devices nowadays. As one such specific application,
the SVM kernel trick on classification problems performs better when GPU devices
are coupled with CPUs.
Moreover, some methods for GPGPU acceleration of the previous LS-SVM
implementations (especially the first-kind Chebyshev kernel) were proposed.
Instead of plain CUDA, our Python implementation uses the PyCUDA package, which
results in acceptable performance. An important issue is the gap between the
main memory and the device memory on GPU devices: more accesses to the device
memory result in lower execution performance. It should also be noted that
GPGPU shows its superiority when there is a large dataset; however, the
structure of a GPGPU program plays a significant role in obtaining speedups. The
first part of this chapter explained the architecture of GPU devices, which must
be known in order to optimize GPU kernels. There also exist many libraries that
use GPUs as their processors in the background, far from the user side; if a
developer does not have enough knowledge of parallel programming, these
libraries are the best choice to employ.
We have applied the mentioned GPGPU-based acceleration methods and hints to the
SVM application with the first-kind Chebyshev function as its kernel. The
experiments show that a 2.2X speedup (for both the fit and predict functions) is
gained over the CPU on Colab's Tesla T4 GPU. The partial optimization of the fit
function alone resulted in a better speedup of about 58X due to the structure of
this function. The main levers for performance are reducing memory accesses
together with increasing CUDA core utilization. In the optimization process,
unwanted extra instructions inside the loops, such as memory allocations and
compilation, were omitted.

References

Ahmadzadeh, A., Hajihassani, O., Gorgin, S.: A high-performance and energy-efficient exhaustive
key search approach via GPU on DES-like cryptosystems. J. Supercomput. 74, 160–182 (2018)
Allec, S.I., Sun, Y., Sun, J., Chang, C.E.A., Wong, B.M.: Heterogeneous CPU+ GPU-enabled
simulations for DFTB molecular dynamics of large chemical and biological systems. J. Chem.
Theory Comput. 15, 2807–2815 (2019)
AlSaber, N., Kulkarni, M.: Semcache: Semantics-aware caching for efficient gpu offloading. In:
Proceedings of the 27th International ACM Conference on International Conference on Super-
computing, pp. 421–432 (2013)
Amos, B., Kolter, J.Z.: OptNet: differentiable optimization as a layer in neural networks. In: Inter-
national Conference on Machine Learning, pp. 136–145. PMLR (2017)
Asghari, M., Hadian Rasanan, A.H., Gorgin, S., Rahmati, D., Parand, K.: FPGA-orthopoly: a
hardware implementation of orthogonal polynomials. Eng. Comput. (2022). https://doi.org/10.1007/s00366-022-01612-x
Cheng, J., Grossman, M., McKercher, T.: Professional CUDA C Programming. Wiley, Amsterdam
(2014)
Choquette, J., Gandhi, W., Giroux, O., Stam, N., Krashinsky, R.: Nvidia a100 tensor core gpu:
Performance and innovation. IEEE Micro. 41, 29–35 (2021)
Corporation, N.: CUDA Zone (2019). https://developer.nvidia.com/cuda-zone
Dalrymple R.A.: GPU/CPU Programming for Engineers Course, Class 13 (2014)
Doi, J., Takahashi, H., Raymond, R., Imamichi, T., Horii, H.: Quantum computing simulator on a
heterogenous hpc system. In: Proceedings of the 16th ACM International Conference on Com-
puting Frontiers, pp. 85–93 (2019)
Gavahi, M., Mirzaei, R., Nazarbeygi, A., Ahmadzadeh, A., Gorgin, S.: High performance GPU
implementation of k-NN based on Mahalanobis distance. In: 2015 International Symposium on
Computer Science and Software Engineering (CSSE), pp. 1–6 (2015)
Kalaiselvi, T., Sriramakrishnan, P., Somasundaram, K.: Survey of using GPU CUDA programming
model in medical image analysis. Inform. Med. Unlocked 9, 133–144 (2017)
Luo, C., Fei, Y., Luo, P., Mukherjee, S., Kaeli, D.: Side-channel power analysis of a GPU AES
implementation. In: 2015 33rd IEEE International Conference on Computer Design (ICCD), pp.
281–288 (2015)
Mei, X., Chu, X.: Dissecting GPU memory hierarchy through microbenchmarking. IEEE Trans.
Parallel Distrib. Syst. 28, 72–86 (2016)
Moayeri, M.M., Hadian Rasanan, A.H., Latifi, S., Parand, K., Rad, J.A.: An efficient space-splitting
method for simulating brain neurons by neuronal synchronization to control epileptic activity.
Eng. Comput. 1–28 (2020)
Nvidia, T.: NVIDIA GeForce GTX 750 Ti: Featuring First-Generation Maxwell GPU Technology,
Designed for Extreme Performance per Watt (2014)
Nvidia, T.: NVIDIA Turing GPU Architecture: Graphics Reinvented (2018)
Nvidia, T.: P100. The most advanced data center accelerator ever built. Featuring Pascal GP100,
the world’s fastest GPU (2016)
Nvidia, T.: V100 GPU architecture. The world’s most advanced data center GPU. Version WP-
08608-001_v1 (2017)

Owens, J.D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A.E., Purcell, T.J.: A
survey of general-purpose computation on graphics hardware. Comput. Graph. Forum 26, 80–113
(2007)
Parand, K., Aghaei, A.A., Jani, M., Ghodsi, A.: Parallel LS-SVM for the numerical simulation of
fractional Volterra’s population model. Alex. Eng. J. 60, 5637–5647 (2021)
Pienaar, J.A., Raghunathan, A., Chakradhar, S.: MDR: performance model driven runtime for
heterogeneous parallel platforms. In: Proceedings of the International Conference on Supercom-
puting, pp. 225–234 (2011)
PyCUDA 2021, documentation (2021). http://documen.tician.de/pycuda/
Rahmani, S., Ahmadzadeh, A., Hajihassani, O., Mirhosseini, S., Gorgin, S.: An efficient multi-core
and many-core implementation of k-means clustering. In: ACM-IEEE International Conference
on Formal Methods and Models for System Design (MEMOCODE), pp. 128–131 (2016)
Wang, C., Jia, Z., Chen, K.: Tuning performance on Kepler GPUs: an introduction to Kepler assem-
bler and its usage in CNN optimization. In: GPU Technology Conference Presentation (2015)
Welcome To Colaboratory (2021). https://colab.research.google.com
Xiao, B., Wang, H., Wu, J., Kwong, S., Kuo, C.C.J.: A multi-grained parallel solution for HEVC
encoding on heterogeneous platforms. IEEE Trans. Multimedia 21, 2997–3009 (2019)
Chapter 12
Classification Using Orthogonal Kernel
Functions: Tutorial on ORSVM Package

Amir Hosein Hadian Rasanan, Sherwin Nedaei Janbesaraei,


Amirreza Azmoon, Mohammad Akhavan, and Jamal Amani Rad

Abstract Classical and fractional orthogonal functions, and their properties as
kernel functions for the SVM algorithm, are discussed throughout this book. In
Chaps. 3, 4, 5, and 6, the four classical families of orthogonal polynomials
(Chebyshev, Legendre, Gegenbauer, and Jacobi) were considered, their fractional
forms were presented as kernel functions, and their performance was
demonstrated. However, implementing these kernels takes considerable effort. To
make it easy for anyone who wants to try and use these kernels, a Python package
is provided here. In this chapter, the ORSVM package is introduced as an SVM
classification package with orthogonal kernel functions.
Keywords Python · ORSVM · Orthogonal Kernel · Classification

12.1 Introduction

The Python programming language is one of the most popular programming languages
among developers and researchers. While there are many reasons for this
popularity, a large share of it comes from the hundreds of open-source Python
packages that one can import into a program to extend its functionality without
any further coding. This makes it possible for

A. H. Hadian Rasanan (B) · J. Amani Rad


Department of Cognitive Modeling, Institute for Cognitive and Brain Sciences, Shahid Beheshti
University, G.C. Tehran, Iran
e-mail: [email protected]
S. Nedaei Janbesaraei · M. Akhavan
School of Computer Science, Institute for Research in Fundamental Sciences, Tajrish, Iran
e-mail: [email protected]
A. Azmoon
Department of Computer Science, The Institute for Advance Studies in Basic Sciences (IASBS),
Zanjan, Iran
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 267
J. A. Rad et al. (eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support
Vector Machines, Industrial and Applied Mathematics,
https://doi.org/10.1007/978-981-19-6553-1_12
268 A. H. Hadian Rasanan et al.

one to develop reusable code. This feature makes the development process easier
and faster. Suppose you have written code for a customized SVM algorithm of your
own, consisting of some classes with relevant functions. Instead of rewriting
the same classes and functions every time and everywhere you need them, you can
save them in a separate file, say custom_svm.py (note that an importable module
name cannot contain a hyphen); then a single line of code, import custom_svm,
anywhere in your program gives access to all the functions of your custom SVM
algorithm. Using reusable code in this way is a widely accepted paradigm,
especially in the Python programming language. There exist
using reusable codes specifically in the Python programming language. There exist
multiple Python packages available for SVM classification, such as the SVM which
Scikit learn package provides. Although it is easy to use, it lacks kernel diversity,
more specifically orthogonal kernels and definitely the novel fractional orthogonal
kernels. Kernels in SVM play a critical role, without a doubt there does not exist a
kernel that suits every problem of the real world, in precise one kernel function is not
suitable for all datasets because each kernel function produces a higher dimension
in SVM with the same data scattering pattern. Suppose the linear kernel function,
K (x, y) = x.y, the decision boundary of such kernel is a straight line in 2D space
or a flat plane in 3D space. No argument that a flat plane cannot classify all datasets.
Hence, there is a perpetual search for the most suitable kernel function for
each dataset, and a new kernel function is always welcome if it can improve the
success metrics of a classification task. To make it easier for anyone to use
orthogonal kernel functions, including their fractional forms, we decided to
create the Python package ORSVM (consisting of multiple modules). Throughout
this chapter, this package is introduced: the basics of ORSVM, how to install
it, and how to use it are all discussed through examples.
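As a toy illustration (our own, not from the package) of the claim that a flat decision boundary cannot handle every dataset, consider the XOR pattern: no line w1*x + w2*y + b = 0 separates its two classes, so even a brute search over candidate lines finds none.

```python
import itertools

# XOR-like pattern: opposite corners share a class. It is provably not
# linearly separable, so no candidate line in the grid below separates it.
points = {(-1, -1): 1, (1, 1): 1, (-1, 1): -1, (1, -1): -1}

def separates(w1, w2, b):
    return all((w1 * x + w2 * y + b > 0) == (label > 0)
               for (x, y), label in points.items())

grid = [-2, -1, -0.5, 0, 0.5, 1, 2]
found = any(separates(w1, w2, b)
            for w1, w2, b in itertools.product(grid, repeat=3))
assert not found  # a linear kernel cannot classify this dataset
```

A kernel that maps the points into a higher-dimensional space, such as the orthogonal kernels of this book, is exactly what makes such a dataset separable.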

12.1.1 ORSVM

ORSVM is a free and open-source Python package that provides an SVM classifier
with some novel orthogonal kernel functions. This library provides a complete
chain for using the SVM classifier, from normalization to the calculation of the
SVM equation and the final evaluation. Note, however, that some necessary steps
before normalization must be handled for every dataset, such as duplicate
checking, handling null values and outliers, or even dimensionality reduction
and whatever other enhancements may apply; these steps are out of the scope of
the SVM algorithm and, thereupon, of the ORSVM package. The normalization step,
which is a must before feeding data points into orthogonal kernels, is instead
handled directly in ORSVM by calling the relevant function. As already
discussed, the fractional form of all kernels is also achievable during the
normalization process. The ORSVM package includes multiple classes and
functions, all of which are introduced in this chapter.
The ORSVM package depends heavily on the NumPy and CVXOPT packages. Arrays,
matrices, and linear-algebra functions from NumPy are used throughout, and the
12 Classification Using Orthogonal Kernel Functions: Tutorial on ORSVM Package 269

heart of the SVM algorithm, solving the convex SVM problem and finding the
support vectors, is handled by a convex quadratic solver from the CVXOPT
library, which is in turn a free Python package for convex optimization.1
ORSVM consists of two modules: one contains the kernel classes, and the other
contains the classes and functions supporting initialization, normalization, the
fitting process, capturing the fitted model, and the classification report.
ORSVM is structured as follows:
• orsvm module
This is the main module, consisting of the fitting procedure, prediction, and
reporting. It includes the Model and SVM classes and the transformation
(normalization) function. We opted for the name transformation instead of
normalization because transforming to fractional space is achievable through
this function too.
– Model Class
The Model class creates the model; initialization starts in this class.
Calling the ModelFit function instantiates an object of the SVM class, and
after the Transformation function is called, the train set is ready to be fed
into that SVM object, which then starts the fitting process.
1. ModelFit function
Initiates an object from the SVM class, transforms/normalizes the train set,
and calls the fit function of the SVM object. Finally captures the fitted model
and parameters.
2. ModelPredict function
Transforms/normalizes the test set, calls the predict function of the SVM
object with the proper parameters, and finally calls accuracy_score,
confusion_matrix, and classification_report of scikit-learn (sklearn.metrics)
with the previously captured result.
– SVM class
Here the SVM equation is formed, and the matrices required by cvxopt are
created in the fit function of the SVM class. It invokes cvxopt to solve the
SVM equation, determining the support vectors and calculating the
hyper-plane's equation. The prediction procedure is implemented in the
predict function of the SVM class.
1. fit function creates the proper matrices by directly calling the kernels:
the Gram matrix and the other matrices of the SVM equation required by
cvxopt. As the result, cvxopt returns the Lagrange multipliers; support
vectors are then selected from them according to criteria of the user's
choice, and the SVM.fit function returns the weights and bias of the
hyper-plane's equation together with the array of support vectors.

1 A suitable guide to this package, covering installation and usage, is available at http://cvxopt.org.

2. predict function maps the data points of the test set against the decision
boundary (the hyper-plane equation) and determines to which class each data
point belongs.
– Transformation function
This is the function that normalizes the input dataset or, in the case of the
fractional form, transforms the input dataset into fractional space; in other
words, it normalizes the input in fractional space.
• Kernels module includes one class per kernel. So currently there exist four classes
for each orthogonal kernel. The following classes are available in the kernels
module:
– Chebyshev class contains the relevant functions to calculate the orthogonal
polynomials and fractional form of the Chebyshev family of the first kind.
– Legendre class holds the relevant functions to calculate the orthogonal
polynomials and fractional form of the Legendre family.
– Gegenbauer class consists of the relevant functions to calculate the
orthogonal polynomials and fractional form of the Gegenbauer family.
– Jacobi class includes the relevant functions to calculate the orthogonal
polynomials and fractional form of the Jacobi family.

ORSVM is written in the Python programming language and requires Python
version 3 or higher.
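The Transformation function described above normalizes the data and, for 0 < T < 1, maps it into fractional space. The exact formula ORSVM uses is in the package source; as a rough, hypothetical sketch of the idea (min–max scaling into the kernel's domain followed by a fractional power map), one could write:

```python
import numpy as np

def transform(X, T=1.0, lo=-1.0, hi=1.0):
    # Hypothetical sketch, NOT ORSVM's actual code: min-max scale each
    # feature into [lo, hi] (orthogonal kernels need bounded inputs), then,
    # for 0 < T < 1, apply a fractional map sign(u) * |u|**T.
    Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # to [0, 1]
    Xn = lo + (hi - lo) * Xn                                     # to [lo, hi]
    if T == 1:
        return Xn
    return np.sign(Xn) * np.abs(Xn) ** T
```

Whatever the exact map, the key invariant is the one the book stresses: the transformed values must stay inside the orthogonality interval of the chosen polynomial family.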

12.2 How to Install

To use the package, you can install it using pip as follows:

pip install orsvm

or you can clone the package from our GitHub repository and use the setup.py
file to install it:

python setup.py install

12.3 Model Class

This is the interface for using the ORSVM package. Using ORSVM starts with
creating an object of the Model class. This class includes a ModelFit function
which itself creates an ORSVM model as an instance of the SVM class and then
normalizes the input data, in the case of the normal form of a kernel, or
transforms it, in the case of the fractional form. The choice between normal and
fractional is determined by the value of the argument T: if T = 1, the kernel is
in its normal form; if 0 < T < 1, it is in fractional form. The ModelFit
function receives the train and test sets separately; moreover, each of them
should be divided into two matrices, x and y. The matrix x is the whole dataset
with the output column, a.k.a. the class or label, omitted, and the matrix y is
that specific class or label column. Splitting the dataset can be achieved
through multiple methods; a widely used one is StratifiedShuffleSplit from
sklearn.model_selection.
Creating an object of the Model class requires some input parameters, because it
needs to create an SVM object with these passed parameters. The following
example code creates an ORSVM classifier with the Jacobi kernel function of
order 3.
obj = orsvm.Model(kernel="jacobi", order=3, T=0.5,
                  k_param1=-0.8, k_param2=0.2,
                  sv_determiner='a', form='r', C=100)

Here, T = 0.5 means the kernel is in fractional form of order 0.5. “k_param1” is
equivalent to ψ, and “k_param2” is equivalent to ω in case the “Jacobi” kernel
is chosen; if it is Gegenbauer, only “k_param1” applies to the kernel. Because
Chebyshev and Legendre have no hyper-parameters, there is no need to pass any
value for “k_param1” and “k_param2”; even if one passes values for these
parameters, they will be ignored. The parameter sv_determiner is the user’s
choice of how support vectors are selected from the Lagrange multipliers (the
data points computed by the convex optimization solver and chosen as candidates
for drawing the SVM’s hyper-plane). This is one of the important parameters that
can affect the final classification performance metrics. Three options are
considered for choosing support vectors. The first, widely used, is a number of
type int that represents how many support vectors should be selected from the
Lagrange multipliers; if this number is greater than the number of Lagrange
multipliers, all of them will be selected as support vectors. The second is a
number in scientific notation (often a negative power of 10) used as a minimum
threshold. The third method, specific to ORSVM, is the flag “a”, which
represents Average. Most of the time, we have no clear conjecture about the
values of the Lagrange multipliers, how small they are, or how many of them will
be available; therefore, we do not know what the threshold should be, or how
many support vectors should be chosen. Sometimes this leads to zero support
vectors, and then an error occurs. Our solution to this situation, which
regularly happens with a new dataset, is the average method: ORSVM uses the
average of the Lagrange multipliers as the threshold, so this error never
occurs. This method does not guarantee the best generalization accuracy, but it
gives a factual estimate of the support vectors; after all, choosing the best
number of support vectors is itself an important task in SVM. The parameter form
is only applicable to the Chebyshev kernel, because two implementations of the
Chebyshev kernel are available, an explicit equation and a recursive one: “r”
refers to the recursive form and “e” to the explicit one. The parameter “noise”
is only applicable to the Jacobi kernel, to the Jacobi weight function indeed;
its purpose is to avoid the errors that occur at the boundaries of the weight
function, as has already been explained. Finally, the parameter “C” is the
regularization

parameter of the SVM algorithm, which controls to what degree misclassified data
points matter. Setting “C” to the best value leads to a better generalization of
the classifier. Table 12.1 summarizes the parameters of the Model class.

Table 12.1 The Model class of ORSVM and its parameter list

#  Parameter  Default     Values
1  kernel     Chebyshev   Jacobi, Gegenbauer, Chebyshev, Legendre
2  order      3           3, 4, 5, ...
3  T          1           0 < T <= 1
4  param1     None        param1 > -1 for the Jacobi kernel;
                          param1 > -0.5 for the Gegenbauer kernel
5  param2     None        param2 > -1 for the Jacobi kernel
6  svd        “a”         int, scientific notation, or “a”
7  form       “r”         “r”, “e”
8  noise      0.01        often 0.01, 0.001, ...
9  C          None        often 10, 100, 1000, ...
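The “a” (Average) option described above can be sketched in a few lines. The names here are illustrative, not ORSVM internals; the point is that the mean of the Lagrange multipliers is a threshold that is never above the maximum, so at least one support vector is always selected:

```python
from statistics import mean

# Sketch of the "a" (Average) support-vector selection strategy: use the
# mean of the Lagrange multipliers as the threshold. Since the maximum is
# never below the mean, at least one support vector is always chosen.
def select_support_vectors(alphas):
    threshold = mean(alphas)
    return [i for i, a in enumerate(alphas) if a >= threshold]

alphas = [1e-9, 2e-9, 0.5, 0.7, 3e-8]
chosen = select_support_vectors(alphas)  # -> [2, 3]
```

This mirrors the behavior the text describes: tiny multipliers fall below the average and are discarded, while the dominant ones survive, regardless of the dataset's scale.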
At this point, ORSVM has only initiated the model and has not done any computation yet; to fit the model, we have to call the ModelFit function. Clearly, fitting requires an input dataset. As already discussed, ModelFit receives train and test datasets which are divided into x (data without labels) and y (labels). The following code snippet shows how one can divide the dataset. The function LoadDataSet reads and loads the dataset into a pandas DataFrame, then maps the classes to a binary classification. As "Clnum" is the label column, we select and convert that column into a NumPy array, and then remove the label column from the pandas DataFrame to obtain the data without the label. The snippet converts the data into a NumPy array as well, although this is not strictly necessary. Calling the LoadDataSet function gives us x and y. In the next lines, using StratifiedShuffleSplit, we create an object of stratified shuffle split with the required parameters, and then, using the split function of the created object, we obtain the train and test sets divided into X and y.
Now that the train and test sets are ready, we can call ModelFit with the proper arguments; as a result, the function prints status messages and returns the weights and bias of the SVM's hyper-plane equation, together with an array of support vectors and the kernel instance. Therefore, we have the fitted model and the corresponding parameters that we can use for prediction. By calling the ModelPredict function, the final step of classification with ORSVM is achieved. ModelPredict requires the test sets and also the bias and the kernel instance to calculate the accuracy. It should be noted that the test set is transformed into the proper space inside the ModelPredict function. As a result,
12 Classification Using Orthogonal Kernel Functions: Tutorial on ORSVM Package 273

the accuracy score is returned. Moreover, ModelPredict prints the confusion matrix, the classification report, and the accuracy score.
import pandas
from sklearn.model_selection import StratifiedShuffleSplit

def LoadDataSet():
    # load dataset
    df = pandas.read_csv('/home/data/spiral.csv',
                         names = ['Chem1', 'Chem2', 'Clnum'],
                         index_col = False)

    # convert to binary classification
    df.loc[df.Clnum != 1, ['Clnum']] = -1
    df.loc[df.Clnum == 1, ['Clnum']] = 1
    y_np = df['Clnum'].to_numpy()
    df.drop('Clnum', axis = 1, inplace = True)
    df_np = df.to_numpy()

    return df_np, y_np

X, y = LoadDataSet()
sss = StratifiedShuffleSplit(n_splits = 5,
                             random_state = 30,
                             test_size = 0.9)

for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# obj is an initiated orsvm.Model instance (see Sect. 12.6)
Weights, S_Vectors, Bias, K_Instance = obj.ModelFit(X_train,
                                                    y_train)

Accuracy_score = obj.ModelPredict(X_test,
                                  y_test,
                                  Bias,
                                  K_Instance)

12.4 SVM Class

The SVM class is the heart of this package: it is where the fitting process happens, where the SVM equation is evaluated to find the support vectors, and where, finally, the equation of the hyper-plane is constituted. Before delving into the SVM equation and finding the support vectors, we have to apply the kernel trick, so we need a "Gram matrix": the matrix of all possible inner products of x_train under the selected kernel. Therefore, for an input X_train of shape m × n, "SVM.fit" will create a square matrix of shape m × m:
$$
K = \begin{bmatrix}
k(x_1, x_1) & \cdots & k(x_1, x_m)\\
\vdots & k(x_i, x_j) & \vdots\\
k(x_m, x_1) & \cdots & k(x_m, x_m)
\end{bmatrix}.
$$

This is also known as the kernel matrix, as it represents the kernel trick of SVM. The kernel k here can be any of the Legendre, Gegenbauer, Chebyshev, or Jacobi kernels.
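For illustration, such a Gram matrix can be assembled with a double loop over the training points; the `poly_kernel` below is a simple stand-in, not one of the orthogonal kernels of the package:

```python
import numpy as np

def gram_matrix(X, kernel):
    """Build the m x m Gram matrix K with K[i, j] = kernel(x_i, x_j)."""
    m = X.shape[0]
    K = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(X[i], X[j])
    return K

# stand-in kernel: a simple polynomial kernel, not one of the orthogonal kernels
poly_kernel = lambda u, v: (1.0 + np.dot(u, v)) ** 2

X = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
K = gram_matrix(X, poly_kernel)
print(K.shape)               # (3, 3)
print(np.allclose(K, K.T))   # True: a Gram matrix is symmetric
```

Since every kernel is evaluated on all pairs of training points, this construction is O(m²) kernel evaluations.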
The convex optimization problem, more precisely the minimization of the dual form of the SVM equation, is then solved using the qp function from the cvxopt library:

cvxopt.solvers.qp(P, q, G, h, A, b)

This function solves a quadratic program of the form

$$
\begin{aligned}
\underset{x}{\text{minimize}}\quad & \tfrac{1}{2}\,x^{T} P x + q^{T} x, \qquad (12.1)\\
\text{subject to}\quad & Gx \preceq h,\\
& Ax = b.
\end{aligned}
$$

According to the documentation provided by the developers of this library, the arguments of the qp function are as follows2:
• P is an n × n dense or sparse real ('d') matrix with the lower triangular part of P stored in the lower triangle. It should be positive semi-definite.
• q is an n × 1 dense real matrix.
• G is an m × n dense or sparse real matrix.
• h is an m × 1 dense real matrix.
• A is a p × n dense or sparse real matrix. In particular, here A is the y matrix (containing the y values of the training set), with one row and as many columns as there are rows in the x matrix.
• b is a p × 1 dense real matrix, or None.
The output of qp contains the Lagrange multipliers, among which the support vectors lie, and we need a policy for choosing some of them as support vectors. Here, the trade-off is between over-fitting and under-fitting: choosing too many of them as support vectors causes the model to over-fit, whereas selecting too few leads to under-fitting.
The SVM class includes a fit function that constructs the Gram matrix using the specified kernel function. Thereafter, solving the SVM equation returns the support vectors. Moreover, the bias is calculated in the Svm.fit function. Svm.predict is another function of the SVM class, which predicts the labels of a given array. The user never needs to call the functions of the SVM class directly; all usage of ORSVM is achievable through the Model class and its functions.

2 For comprehensive documentation and examples of cvxopt, please visit https://cvxopt.org.



12.4.1 Chebyshev Class

Another kernel introduced in this package is the Chebyshev kernel, implemented in the vectorial approach that has already been introduced as the generalized Chebyshev kernel. Similar to the Legendre kernel, the Chebyshev kernel is available when initiating the Model class object by passing "Chebyshev" as the kernel name. Another parameter applicable to the Chebyshev kernel is form: two different types of Chebyshev kernels are implemented in the Chebyshev class, one using an explicit equation and the other a recursive function, and the type to use is selected through the form parameter. The parameters of the Chebyshev class are as follows:
• order := order of the polynomial.
• form := "e" for the explicit form and "r" for the recursive form.
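The recursive form relies on the Chebyshev three-term recurrence T_0(t) = 1, T_1(t) = t, T_n(t) = 2t·T_{n−1}(t) − T_{n−2}(t). The following is a minimal sketch of how a vectorial Chebyshev kernel of this kind can be evaluated; the exact generalized-kernel formula and weight used by the package may differ, and the small 0.05 shift in the weight is an assumption to avoid division by zero at the boundaries:

```python
import numpy as np

def chebyshev_terms(t, order):
    """Chebyshev polynomials T_0(t), ..., T_order(t) via the three-term recurrence."""
    T = [np.ones_like(t), t]
    for n in range(2, order + 1):
        T.append(2 * t * T[n - 1] - T[n - 2])
    return T[:order + 1]

def chebyshev_kernel(x, z, order=3):
    """Sketch of a vectorial Chebyshev kernel: per-feature products of
    Chebyshev terms, summed over orders, times a weight term."""
    acc = sum(np.prod(Tx * Tz)       # product over the feature dimensions
              for Tx, Tz in zip(chebyshev_terms(x, order),
                                chebyshev_terms(z, order)))
    # weight inspired by 1/sqrt(1 - x*z); the 0.05 shift (an assumption)
    # keeps the denominator positive near the boundaries
    weight = 1.0 / np.sqrt(np.prod(1.0 - x * z) + 0.05)
    return acc * weight

x = np.array([0.1, -0.3])
z = np.array([0.5, 0.2])
print(chebyshev_kernel(x, z))   # equals chebyshev_kernel(z, x): the kernel is symmetric
```

The recursion needs only the two previous polynomials, which is why the recursive form is cheap to extend to higher orders.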

12.4.2 Legendre Class

The Legendre kernel class is generally given as an object to the kernel parameter of the Model class in the initialization step of an object, and its Legendre-kernel function is explicitly used in constructing the Gram matrix K. The Legendre class benefits from a recursive implementation of the Legendre-kernel function. To use this kernel, the user only needs to initiate the Model object with "Legendre" as the kernel parameter of the Model class. The Legendre class needs the following parameter:

• order := order of the Legendre-kernel function.

12.4.3 Gegenbauer Class

The Gegenbauer kernel is also available in the ORSVM package. For the implementation of the Gegenbauer kernel function, the fractional kernel function introduced in Chap. 5 is used. Therefore, in addition to the product of the input vectors, the values obtained from the two other equations are used. The Gegenbauer class has the following parameters:
• order := order of the Gegenbauer kernel function.
• lambda := the λ parameter of the Gegenbauer kernel function.

12.4.4 Jacobi Class

The Jacobi kernel is the last orthogonal kernel currently available in the ORSVM package, and it can be chosen as soon as the package is imported: simply pass "Jacobi" as the kernel name during the initiation of an object from the Model class. The Jacobi class needs the following parameters:

• order := order of the Jacobi kernel function.
• psi := the ψ parameter of the Jacobi kernel function.
• omega := the ω parameter of the Jacobi kernel function.
• noise := a small noise added to the weight function to avoid errors at the boundaries.

12.5 Transformation Function

The transformation function is the composition of the well-known min-max feature scaling

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$

and the mapping equation already introduced for the fractional form of all kernels,

$$x'' = 2\,(x')^{\alpha} - 1,$$

so we have

$$x'' = 2\left(\frac{x - x_{\min}}{x_{\max} - x_{\min}}\right)^{\alpha} - 1,$$

which transforms the input x into the kernel space related to the fractional order α. Therefore, the transformation function requires x (the input data) and alpha (the T parameter of the Model class, where T represents the transformation step), which is 1 by default and then simply normalizes the input data, just as the min-max feature scaling function does. The user never needs to call the transformation function directly, but in case one needs to:

import orsvm
orsvm.transformation(x, T=1)
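A minimal sketch of such a transformation (an assumed reimplementation, scaling over the whole input array; the package may scale each feature column separately):

```python
import numpy as np

def transformation(x, T=1.0):
    """Sketch of the transformation: min-max scale into [0, 1],
    then map through 2*u**T - 1 into [-1, 1] (fractional order T)."""
    u = (x - x.min()) / (x.max() - x.min())   # min-max feature scaling
    return 2 * u ** T - 1                     # fractional mapping

x = np.array([3.0, 5.0, 7.0, 11.0])
print(transformation(x))         # T=1 reduces to plain rescaling: [-1, -0.5, 0, 1]
print(transformation(x, T=0.5))  # still spans [-1, 1], but non-linearly
```

With T = 1 the mapping is affine, which is why the default behaves like ordinary normalization; fractional T < 1 warps the spacing of the points inside [−1, 1].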

12.6 How to Use

Using the ORSVM library is easy and straightforward; this simplicity of use has been an important motivation in developing the ORSVM package. The user only needs to provide the dataset as matrices and select a kernel, and by setting T, the user can choose between the normal and fractional forms of the kernel functions. Other required parameters related to the chosen kernel should be provided in the object initiation step; those parameters have already been discussed in full. In this section, we only demonstrate a sample classification of a dataset using ORSVM.
As an example, the Three Monks' problem dataset is considered, which was introduced in Chap. 3. The Three Monks' problem comes with the train set and test set already separated; in case one needs to do this separation, please refer to the code snippet in Sect. 12.3. In order to use ORSVM, we first have to divide the train and test sets into their x and y parts. This can be done by importing the dataset into a pandas DataFrame first; then we map the class values in the Monks dataset to −1 and 1, which are suitable for the SVM algorithm, instead of 0 and 1.
import numpy as np
import pandas as pd
import orsvm

# load train-set
df = pd.read_csv('/home/datasets/1_monks.train',
                 names = ['Class', 'col1', 'col2', 'col3',
                          'col4', 'col5', 'col6'],
                 index_col = False)

df.loc[df.Class == 0, ['Class']] = -1  # map "0" to "-1"
y_train = df['Class'].to_numpy()       # convert y_train to numpy array
df.drop('Class', axis = 1, inplace = True)  # drop the class label
X_train = df.to_numpy()                # convert x_train to numpy array

# load test-set
df = pd.read_csv('/home/datasets/1_monks.test',
                 names = ['Class', 'col1', 'col2', 'col3',
                          'col4', 'col5', 'col6'],
                 index_col = False)
df.loc[df.Class == 0, ['Class']] = -1
y_test = df['Class'].to_numpy()
df.drop('Class', axis = 1, inplace = True)
X_test = df.to_numpy()

Now that we have the train set and test set ready, we need an instance of the Model class in order to call the ModelFit function with the proper arguments. For example, here we choose the Chebyshev kernel with T = 0.5, let the SVM's regularization parameter keep its default value "None", and prefer the recursive implementation. In the second line, by calling the ModelFit function, ORSVM fits the model and returns the fitted parameters. We can capture these parameters for later use, for example for prediction.

# Create an object from Model class of ORSVM
obj = orsvm.Model(kernel="Chebyshev", order=3, T=0.5, form='r')

# fit the model and Capture parameters


278 A. H. Hadian Rasanan et al.

Fig. 12.1 Jacobi Polynomials of order 5, fixed ψ and different ω

Weights, S_Vectors, Bias, K_Instance = obj.ModelFit(X_train,


y_train)

The printed status messages are only for logging purposes. Then, in case one needs the prediction, one may call the ModelPredict function, which requires the test set divided into x and y, and also the bias and the kernel instance from the previous step.

accuracy_score = obj.ModelPredict(X_test,
                                  y_test,
                                  Bias,
                                  K_Instance)

ModelPredict returns the accuracy score, which we can capture. Moreover, this function prints much more information on the classification; an example of the output is shown in Fig. 12.1. Using the log information, enabled by setting the log parameter of the fit function to True, may help with debugging or with gaining a better understanding of how the fitting is done for a dataset.
After covering the basics of SVM in part one, introducing some fractional orthogonal kernel functions in part two, and reviewing some applications of SVM algorithms, the aim of this chapter was to present a Python package that enables us to apply the introduced fractional orthogonal kernel functions in real-world situations. The architecture of this package and a brief tutorial on its usage have been presented here. For more information and a detailed, up-to-date tutorial on this package, you can visit its online page at https://orsvm.readthedocs.io.
Appendix: Python Programming Prerequisite

A.1 Introduction

Python is an object-oriented, multi-paradigm, high-level programming language created by Guido van Rossum and first released in 1991. Its high-level built-in data structures and dynamic binding encourage developers to use it for application development as well as as a scripting language. According to the TIOBE Programming Community Index,1 Python is one of the top three most popular programming languages of 2020 and was the winner of the 2018 "programming language of the year" title.
Python 2.0 was released on October 16, 2000 with minor fixes, optimizations, better error-handling messages, a cycle-detecting garbage collector, and support for Unicode. Python 3.0 was released on December 3, 2008 and is not completely backward-compatible. Python 2.7's end-of-life date was originally set for 2015 but was postponed to 2020; the main reason for the postponement was the concern that a large body of existing code could not easily be forward-ported to Python 3.
Some of the most popular features of this language are its interpreter and the extensive standard library, available in source or binary form without charge for all major platforms, which allows users to run the same code on multiple platforms without modification. Another notable feature of Python is its big community and the many open-source frameworks, tools, and libraries that decrease the cost and increase the speed of development.

1 https://www.tiobe.com/tiobe-index/.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 279
J. A. Rad et al. (eds.), Learning with Fractional Orthogonal Kernel Classifiers in Support
Vector Machines, Industrial and Applied Mathematics,
https://doi.org/10.1007/978-981-19-6553-1

A.2 Basics of Python

In this section, we introduce the fundamentals of the Python language: how to install Python, how to use it, and how to define variables, loops, conditions, and functions. Pandas, NumPy, and Matplotlib are three libraries used in virtually all data-science tasks, and we will discuss setting them up.

How to Use Python?

There are many ways to install Python and work with it; here, the most appropriate ones are discussed. The basic way is to go to the Python website2 and find the version of Python suitable for your work and operating system. After downloading and installing Python, we also need a code editor for writing our code, such as Notepad++,3 Atom,4 Sublime,5 or any other code editor. These code editors are open source, and we can easily download and use them for free. There are also some commercial programs like PyCharm, Spyder, PyDev, etc.; their value shows most clearly when debugging big Python programs.
Another way to install Python is through Anaconda,6 a Python distribution whose objective is scientific computing (data science, machine-learning applications, large-scale data processing, predictive analytics, etc.). By installing Anaconda, all the important and famous packages are effortlessly installed. Additionally, Anaconda installs some programs such as Jupyter Notebook,7 a well-known open-source platform for developing code. One of the advantages of Jupyter Notebook over the others is that we can run Python code cell by cell, which can be very helpful. To install Anaconda on Windows or macOS, you can download the graphical installer; for Linux, a .sh file should be downloaded and run in the terminal. This is all it takes to install Python and its packages; now the basics of the language can be explained.

Python Basics

The basics of Python are very similar to those of other programming languages like C, C++, C#, and Java, so anyone who is familiar with these languages can easily learn Python.

2 https://www.python.org/downloads/.
3 https://notepad-plus-plus.org/.
4 https://atom.io/.
5 https://www.sublimetext.com/.
6 https://www.anaconda.com.
7 https://jupyter.org/.

Basic Syntax

Python syntax can be executed by writing directly in the command line. For example:

print("Hello, World!")

Moreover, we can create a Python file, write the code in it, save the file with the .py extension, and run it from the command line. For instance,

> python test.py

Python uses white-space indentation for delimiting blocks instead of curly brackets or keywords. The number of spaces is up to the programmer, but it has to be at least one.

if 2 > 1:
 print("Two is greater than one!")
if 2 > 1:
        print("Two is greater than one!")

> Two is greater than one!
> Two is greater than one!

Comments

Comments start with a “#” for the purpose of in-code documentation, and Python
will render the rest of the line as a comment. Comments also can be placed at the
end of a line. In order to prevent Python from executing commands, we can make
them a comment.
# This is a comment.
print("Hello, World!")
print("Hello, World!") # This is a comment.
# print("Hello, World!")

Variable Types

A Python variable, unlike in some other programming languages, does not need an explicit declaration to reserve memory space. The declaration happens automatically when a variable is created by assigning a value to it. The equal sign (=) is used to assign values to variables, and the type of a variable can be changed by assigning a new value to it.

x = 5 # x is integer
y = "Adam" # y is string
y = 1 # y is integer

String variables can be declared using either single or double quotes.

y = "Adam" # y is string
y = 'Adam' # y is string
The rules for Python variable names are:
• A variable name must start with a letter or an underscore character and cannot start with a number.
• A variable name can only contain alpha-numeric characters and underscores.
• Variable names are case-sensitive (yy, Yy, and YY are three different variables).
Python allows multiple variables to be assigned in one line, and the same value can be assigned to multiple variables in one line:
x, y, z = "Green", "Blue", "Red"
x = y = z = 2
In Python, a print statement can combine text and variable(s) or a mathematical operation with the + character, but a string and a numeric variable cannot be concatenated:
x = "World!"
print("Hello " + x)
x = 2
y = 3
print(x + y)
x = 5
y = "Python"
print(x + y) # raises an error

> Hello World!
> 5
> TypeError: unsupported operand type(s) for +: 'int' and 'str'

Variables that are created outside of a function or class are known as global variables
and can be used everywhere, both inside and outside of functions.
x = "World!"
def myfunc():
    print("Hello " + x)
myfunc()

> Hello World!

If a variable is defined globally and then redefined inside a function with the same name, the local value is only usable in that function and the global value remains as it was:

x = "World!"
def myfunc():
    x = "Python."
    print("Hello " + x)
myfunc()

> Hello Python.

To create or change a global variable locally in a function, the global keyword must be used before the variable:
x = "2.7"
def myfunc():
    global x
    x = "3.7"
myfunc()
print("Python version is " + x)

> Python version is 3.7

Numbers and Casting

There are three types of numbers in Python:

• int,
• float, and
• complex,
which can be represented as follows:
a = 10 # int
b = 3.3 # float
c = 2j # complex
The type() command is used to identify the type of any object in Python. Casting is the conversion of a variable from one type to another. This can be done by using the target type as a function and giving the variable as input:
• int() constructs an integer number from a float (by truncating the decimal part) or from a string (the string should represent an integer).
• float() constructs a float number from an integer or from a string (which can represent a float or an int).
• str() constructs a string from a wide variety of data types.

a = int(2) # a will be 2
b = int(2.8) # b will be 2
c = int("4") # c will be 4

x = float(8)     # x will be 8.0
y = float(4.8)   # y will be 4.8
z = float("3")   # z will be 3.0
w = float("4.2") # w will be 4.2
i = str("aa1")   # i will be 'aa1'
j = str(45)      # j will be '45'
k = str(3.1)     # k will be '3.1'
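For example, type() can be used to verify the result of a cast:

```python
a = 10
b = float(a)   # cast the integer to a float
c = str(b)     # cast the float to a string
print(type(a))   # <class 'int'>
print(type(b))   # <class 'float'>
print(type(c))   # <class 'str'>
```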

Strings

Strings in Python can be surrounded by either single or double quotation marks. A string can be assigned to a variable, and it can be a multi-line string when surrounded by three double quotes or three single quotes:
a = "Hello"
print(a)

> Hello

a = """Strings in Python can be
surrounded by either single quotation marks,
or double quotation marks."""
print(a)

> Strings in Python can be
surrounded by either single quotation marks,
or double quotation marks.

a = '''Strings in Python can be
surrounded by either single quotation marks,
or double quotation marks.'''
print(a)

> Strings in Python can be
surrounded by either single quotation marks,
or double quotation marks.
In Python, there is no dedicated character data type; a single character is simply a string of length 1. Strings in Python, as in many other programming languages, are sequences of Unicode characters.
a = "Hello, World!"
print(a[3])

> l

Slicing is a feature in Python that returns a range of values from an array or list; here, slicing returns a range of characters. To specify the start and end indices of the range, separate them with a colon. Negative indexes can be used to start the slice from the end of a list or array.
a = "Hello, World!"
print(a[3:8])
print(a[-6:-1])

> lo, W
> World

To find the length of a string, use the len() function.

a = "Hello, World!"
print(len(a))

> 13

There are many built-in methods applicable to strings; to name a few:
• strip() removes any white-space from the beginning or the end:
a = "Hello, World!"
print(a.strip())

> Hello, World!

• lower() returns the string in lower case:

a = "Hello, World!"
print(a.lower())

> hello, world!

• upper() returns the string in upper case:

a = "Hello, World!"
print(a.upper())

> HELLO, WORLD!

• replace() replaces a string with another one:

a = "Hello, World!"
print(a.replace("W", "K"))

> Hello, Korld!



• The split() method splits the string into substrings if it finds instances of the separator:
a = "Hello, World!"
print(a.split("o"))

> ['Hell', ', W', 'rld!']

Use the in keyword to check whether a certain phrase or character is present in a string:
txt = "Hello, World! My name is Python."
if "Python" in txt:
    print("The word Python exists in txt.")

> The word Python exists in txt.

To concatenate, or combine, two strings, the + operator can be used.
x = "Hello"
y = "World"
z = x + " " + y
print(z)

> Hello World

Numbers cannot be combined with strings directly, but we can apply casting with str() or use the format() method, which takes the passed arguments and places them in the string where the placeholders {} are. It is worth mentioning that the format() method can take an unlimited number of arguments:
version = 3.7
age = 29
txt1 = "My name is Python, and my version is {} and my age is {}"
txt2 = "My name is Python, and my version is " + str(version) \
       + " and my age is " + str(age)
print(txt1.format(version, age))
print(txt2)

> My name is Python, and my version is 3.7 and my age is 29
> My name is Python, and my version is 3.7 and my age is 29

An escape character is a backslash \ used to write characters that are illegal in a string, for example a double quote inside a string that is surrounded by double quotes.
txt1 = "This is an "error" example."
txt2 = "This is an \"error\" example."
print(txt2)

> SyntaxError: invalid syntax
> This is an "error" example.

• \' single quote;
• \\ backslash;
• \n new line;
• \t tab;
• \r carriage return;
• \b backspace;
• \f form feed;
• \ooo octal value;
• \xhh hex value.

Lists

Lists are one of Python's four collection data types. When we want to save several items in one variable, a list container can be used. In Python, list elements are written inside square brackets, and an item is accessed by referring to its index number.
mylist = ["first", "second"]
print(mylist)
print(mylist[1])

> ['first', 'second']
> second

As mentioned in the Strings section, a string behaves like a list, so negative indexing and ranges of indexes can be used for lists as well; also remember that the first item in Python has index 0. If one side of the colon is left empty, the range extends to the beginning or to the end of the list.
mylist = ["first", "second", "Third", "fourth", "fifth"]
print(mylist[-1])
print(mylist[1:3])
print(mylist[2:])

> fifth
> ['second', 'Third']
> ['Third', 'fourth', 'fifth']

To change an item's value in a list, we refer to its index number. Iterating through the list items is possible with a for loop. To determine whether an item is present in a list, we can use the in keyword, and with the len() function we can find how many items a list contains.

mylist = ["first", "second"]
mylist[0] = "Third"
for x in mylist:
    print(x)
if "first" not in mylist:
    print("No, first is not in mylist")
print(len(mylist))

> Third
> second
> No, first is not in mylist
> 2

There are various methods for lists; some of them are discussed here:
• append() adds an item to the end of the list.
• insert() inserts an item at a given position; the first argument is the index and the second is the value.
• remove() removes the first item with the specified value from the list.
• pop() removes the item at the given position in the list and returns it; if no index is specified, pop() removes and returns the last item.
• del removes the variable and frees it for a new value assignment.
• clear() removes all items from the list.
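A short example combining these list methods:

```python
mylist = ["first", "second"]
mylist.append("third")     # ["first", "second", "third"]
mylist.insert(1, "new")    # ["first", "new", "second", "third"]
mylist.remove("new")       # back to ["first", "second", "third"]
last = mylist.pop()        # removes and returns "third"
print(mylist, last)        # ['first', 'second'] third
mylist.clear()
print(mylist)              # []
```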
To make a copy of a list, the syntax list1 = list2 cannot be used: with this command, any change in one of the two lists is applied to the other one, because list2 is only a reference to list1. Instead, the copy() or list() methods can be utilized to make a copy of a list.
mylist1 = ["first", "second"]
mylist2 = mylist1.copy()
mylist2 = list(mylist1)
To join or concatenate two lists, we can simply use "+" between the two lists or use the extend() method.
mylist1 = ["first", "second"]
mylist2 = ["third", "fourth"]
mylist3 = mylist1 + mylist2
print(mylist3)
mylist1.extend(mylist2)
print(mylist1)

> ['first', 'second', 'third', 'fourth']
> ['first', 'second', 'third', 'fourth']

Dictionary

A dictionary is another useful collection data type. Its main difference from the other collections is that a dictionary is unordered and, unlike sequences, which are indexed by a range of numbers, dictionaries are indexed by keys. The keys can be of any immutable type, such as strings and numbers. In addition, one can create an empty dictionary with a pair of braces. A dictionary with its keys and values is defined as follows:
mydict = {
"Age": 23,
"Gender": "Male",
"Education": "Student"
}
print(mydict)

> {’Age’: 23, ’Gender’: ’Male’, ’Education’: ’Student’}

To access the items of a dictionary, we can refer to the key name inside square brackets or call the get() method. Moreover, the value corresponding to a key can be changed by assigning a new value to it. The keys() and values() methods can be used to get all the keys and values of the dictionary.
print(mydict["Age"])
print(mydict.get("Age"))
mydict["Age"] = 24
print(mydict["Age"])
print(mydict.keys())
print(mydict.values())

> 23
> 23
> 24
> dict_keys(['Age', 'Gender', 'Education'])
> dict_values([24, 'Male', 'Student'])

Iterating through dictionaries is a crucial technique for accessing their items. The loop can return the keys of the dictionary, or both each key and its corresponding value using items().
for x in mydict:
    print(x)            # printing keys
for x in mydict:
    print(mydict[x])    # printing values
for x, y in mydict.items():
    print(x, y)         # printing keys and values

> Age
> Gender
> Education
> 24
> Male
> Student
> Age 24
> Gender Male
> Education Student
Checking the existence of a key in a dictionary can be done with the in keyword, the same as with lists. Moreover, to determine how many items are in a dictionary, we can use the len() method. Adding an item to a dictionary can easily be done by using a new key and its value.
if "Name" not in mydict:
    print("No, there is no Name in mydict.")
print(len(mydict))
mydict["weight"] = 72
print(mydict)

> No, there is no Name in mydict.
> 3
> {'Age': 24, 'Gender': 'Male', 'Education': 'Student', 'weight': 72}

If ... Else

Logical conditions from mathematics can be used in "if statements" and loops. An "if statement" is written using the if keyword. The scope of an "if statement" is defined with white-space, while other programming languages use curly brackets. We can also use an if statement inside another if statement, called a nested if/else, by observing the indentation.
• Equals: a == b.
• Not equals: a != b.
• Less than: a < b.
• Less than or equal to: a <= b.
• Greater than: a > b.
• Greater than or equal to: a >= b.

x = 10
y = 15
if y > x:
    print("y is greater than x")

> y is greater than x

elif is a Python keyword meaning "if the previous conditions were not true, then try this condition", and the else keyword catches anything not caught by the previous conditions.
x = 10
y = 9
if x > y:
    print("x is greater than y")
elif x == y:
    print("x and y are equal")
else:
    print("y is greater than x")

> x is greater than y

There are logical keywords, such as and and or, that can be used to combine conditional statements.
a = 10
b = 5
c = 20
if a > b and c > a:
    print("Both conditions are True")
if a > b or a > c:
    print("At least one of the conditions is True")

> Both conditions are True
> At least one of the conditions is True

You can have an if statement inside another if statement.

a = 10
b = 5
c = 2
if a > b:
    if a > c:
        print("a is greater than both b and c.")

> a is greater than both b and c.

While Loops

The while loop is one of the two types of iteration in Python; it can execute specific statements as long as a condition is true.

i = 1
while i < 3:
    print(i)
    i += 1

> 1
> 2
Just like with an if statement, we can use an else block for when the while-loop condition is no longer true.
i = 1
while i < 3:
    print(i)
    i += 1
else:
    print("i is greater than or equal to 3.")

> 1
> 2
> i is greater than or equal to 3.

For Loops

The for statement in Python differs from its counterparts in other programming languages like C and Java. Unlike languages in which the for statement always iterates over an arithmetic progression of numbers, or lets the user define both the iteration step and the halting condition, in Python we can iterate over the items of any sequence (such as a string).
nums = ["One", "Two", "Three"]
for n in nums:
    print(n)
for x in "One.":
    print(x)

> One
> Two
> Three
> O
> n
> e
> .
Some statements and a function that are very useful inside for loops are discussed in the next section.

Range, Break, and Continue

If we need to iterate over a sequence of numbers, there is a built-in function range() which generates an arithmetic progression of numbers. By default, range() starts from 0 and stops just before the specified end value (the end value is excluded). To iterate over the indices of a list or any sequence, the combination of the range() and len() functions is applicable. The range() function by default moves through the sequence in steps of one; we can specify a different increment by adding a third parameter to the function.

for x in range(3):
    print(x)

> 0
> 1
> 2

for x in range(2, 4):
    print(x)

> 2
> 3

nums = ["One", "Two", "Three"]
for i in range(len(nums)):
    print(i, nums[i])

> 0 One
> 1 Two
> 2 Three

for x in range(2, 10, 3):
    print(x)

> 2
> 5
> 8

As with the while loop, an else statement can be used after a for loop. In Python, break and continue are used as in other programming languages, to terminate the loop or to skip the rest of the current iteration, respectively.

for x in range(10):
    if x > 2:
        break
    print(x)

> 0
> 1
> 2

for x in range(3):
    if x == 1:
        continue
    print(x)

> 0
> 2
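The else clause mentioned above runs only when the for loop finishes without hitting break. A minimal sketch (the completed flag is our own):

```python
# The else block runs because the loop completes without break.
completed = False
for x in range(3):
    if x == 10:
        break
else:
    completed = True
print(completed)  # True
```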

Try, Except

Like any programming language, Python has its own error handling statements named
try and except to test a block of code for errors and handle the error, respectively.

try:
    print(x)
except:
    print("x is not defined!")

> x is not defined!

A developer should handle the error completely so that the user can see where the error occurred. For this purpose, we can use a raise Exception statement to raise the error.
try:
    print(x)
except:
    raise Exception("x is not defined!")

> Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
Exception: x is not defined!
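An except clause can also name a specific exception type, so unrelated errors are not silently swallowed. A sketch, catching only NameError (the message variable is our own):

```python
# Only a NameError is caught here; any other error would propagate normally.
try:
    print(undefined_variable)
except NameError as err:
    message = "caught a NameError: " + str(err)
print(message)
```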

Functions

A function is a block of code that may take inputs and perform some specific statements or computations which may produce output. The purpose of a function is to reuse a block of code that performs a single, related activity, which enhances the modularity of the code. A function is defined with the def keyword and is called by its name.
def func():
    print("Inside the function")

func()

> Inside the function

Information is passed into the function by adding arguments after the function name, inside the parentheses. We can add as many arguments as needed.
def func(arg):
    print("Passed argument is", arg)

func("name")

> Passed argument is name

def func(arg1, arg2):
    print("first argument is", arg1, "and the second is", arg2)

func("name", "family")

> first argument is name and the second is family

An argument defined in a function must normally be passed when the function is called. Default parameter values are used to avoid an error in such calls: even when the user does not pass a value to the function, the code works properly with the default values.
def func(arg="Bob"):
    print("My name is", arg)

func("john")
func()

> My name is john


> My name is Bob

To return a value from a function, we can use the return statement.

def func(x):
    return 5 * x

print(func(3))

> 15

It should be mentioned that a list is passed to a function by reference.


def func(x):
    x[0] = 1

A = [10, 11, 12]
func(A)
print(A)

> [1, 11, 12]
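If the caller's list must stay unchanged, a copy can be passed instead; the copy() call below is one option, and slicing with x[:] works as well. A minimal sketch:

```python
# Passing a copy leaves the caller's list unchanged.
def func(x):
    x[0] = 1

A = [10, 11, 12]
func(A.copy())  # func mutates the copy, not A
print(A)  # [10, 11, 12]
```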

Libraries

Sometimes a feature is needed that is not provided by Python's standard library. A Python library is a collection of functions and methods that we can use without extra effort. Many libraries are written in Python and can easily be installed and then imported into our code.
In the following sections, some basic libraries that are used in the next chapter for data gathering, preprocessing, and creating arrays are presented.

A.3 Pandas

Pandas is an open-source Python library that delivers data structures and data analysis
tools for the Python programmer. Pandas is used in various sciences including finance,
statistics, analytics, data science, machine learning, etc. Pandas is installed easily by
using conda or pip:
> conda install pandas
or
> pip install pandas
For importing it, we usually use a shorter name as follows:
import pandas as pd

The two major components of pandas are Series and DataFrame. A Series is essentially a single column of data, and a DataFrame is a two-dimensional table, that is, a collection of Series. A pandas Series can be created using the following constructor:
pandas.Series(data, index, dtype, copy)
where the parameters of the constructor are described as follows:

Data   Data takes various forms like ndarray, list, constants
Index  Index values must be unique and hashable, with the same length as the data. Defaults to np.arange(n) if no index is passed
Dtype  Dtype is the data type. If None, the data type will be inferred
Copy   Copy data. Default False

When we want to create a Series from an ndarray, the index should have the same length as the ndarray. By default, the index is equal to range(n), where n is the array length.
import pandas as pd
import numpy as np
mydata = np.array(['a', 'b', np.nan, 0.2])
s = pd.Series(mydata)
print(s)

> 0 a
1 b
2 NaN
3 0.2
dtype: object
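An explicit index can also be passed to the Series constructor. A small sketch with labels of our own:

```python
import pandas as pd

# A Series with an explicit string index; values can then be
# accessed by label rather than by integer position.
s = pd.Series([10, 12, 13], index=["Alex", "Bob", "Clarke"])
print(s["Bob"])  # 12
```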

As we can see, a Series can hold data of mixed types. A DataFrame is a two-dimensional data structure which can be created using the following constructor:

pandas.DataFrame(data, index, columns, dtype, copy)


The parameters of the constructor are as follows:

Data     Takes various forms like ndarray, series, map, lists, dict, constants, etc.
Index    Labels of the rows. Defaults to np.arange(n) if no index is passed
Columns  Labels of the columns. Defaults to np.arange(n) if no column labels are passed
Dtype    Data type of each column
Copy     Copy data

The DataFrame can be created using a single list or a list of lists or a dictionary.

import pandas as pd
import numpy as np
data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
df2 = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20130102'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo'})

print(df)
print(df2)

> Name Age


0 Alex 10
1 Bob 12
2 Clarke 13

> A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
Examples are taken from the tutorials cited in footnotes 8 and 9. Here we mention some attributes that are useful for a better understanding of the data frame:
df2.dtypes # show columns data type
df.head() # show head of data frame
df.tail(1) # show tail of data frame
df.index # show index
df.columns # show columns
df.describe() # shows a quick statistic summary of your data

> A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object

> Name Age

8 https://www.tutorialspoint.com/Python_pandas/Python_pandas_dataframe.html.
9 https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html.

0 Alex 10
1 Bob 12
2 Clarke 13

> Name Age


2 Clarke 13

> RangeIndex(start=0, stop=3, step=1)

> Index(['Name', 'Age'], dtype='object')

> Age
count 3.000000
mean 11.666667
std 1.527525
min 10.000000
25%  11.000000
50%  12.000000
75%  12.500000
max 13.000000

You can use df.T to transpose your data:

df.T

> 0 1 2
Name Alex Bob Clarke
Age 10 12 13
In pandas, creating a CSV file from a data frame, or reading a CSV file back into a data frame, is done as follows:10
df.to_csv('data.csv')
pd.read_csv('data.csv')
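Columns and rows of a DataFrame can be selected with plain indexing, loc (label-based), and iloc (position-based). A minimal sketch, rebuilding the small frame from above so the example is self-contained:

```python
import pandas as pd

df = pd.DataFrame([["Alex", 10], ["Bob", 12], ["Clarke", 13]],
                  columns=["Name", "Age"])
print(df["Age"])          # selecting one column yields a Series
print(df.loc[1, "Name"])  # label-based selection: row 1, column "Name"
print(df.iloc[2, 1])      # position-based selection: row 2, column 1
```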

A.4 Numpy

Numerical Python, or NumPy, is a library providing multi-dimensional array objects and the ability to perform mathematical and logical operations on arrays. In NumPy, dimensions are called axes. A list like
[1, 2, 3]
has one axis which has three elements. Lists inside a list like

10 https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#min.

[[1, 2, 3], [1, 2, 3]]


has two axes, each with a length of three.
NumPy's array class is called ndarray, also known by the alias array. Note that numpy.array is not the same as the Standard Python Library class array.array, which only handles one-dimensional arrays and offers less functionality. Some important attributes of an ndarray object are
• ndarray.reshape returns the array reshaped to the given dimensions.
• ndarray.ndim contains the number of axes (dimensions) of the array.
• ndarray.shape is a tuple of integers indicating the size of the array in each dimen-
sion.
• ndarray.size consists of the total number of elements in the array (product of the
elements of shape).
• ndarray.dtype describes the type of the elements in the array.
• ndarray.itemsize shows the size of each element of the array in bytes.

import numpy as np
a = np.arange(15)
a = a.reshape(3, 5)
print(a)

> array([[ 0, 1, 2, 3, 4],
         [ 5, 6, 7, 8, 9],
         [10, 11, 12, 13, 14]])

a.shape

> (3, 5)

a.ndim

> 2

a.dtype.name

> 'int64'

a.size

> 15

A user-defined array can be constructed by using the np.array function or by converting an existing list into an ndarray with the np.asarray function:11
import numpy as np
a = np.array([2, 3, 5])
b = [3, 5, 7]
c = np.asarray(b)
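Arithmetic on ndarrays is elementwise, and a scalar operand is broadcast to every element. A minimal sketch:

```python
import numpy as np

# Elementwise arithmetic on two arrays of the same shape,
# plus scalar broadcasting and a reduction.
a = np.array([2, 3, 5])
b = np.array([3, 5, 7])
print(a + b)           # [ 5  8 12]
print(a * 2)           # [ 4  6 10]
print((a ** 2).sum())  # 4 + 9 + 25 = 38
```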

A.5 Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive 2D visualizations of arrays in Python. We take a deeper look into the Matplotlib library in order to use it for visualizing our data, so that we can understand the data structure better. Matplotlib and its dependencies can be installed using conda or pip:
> conda install -c conda-forge matplotlib
or
> pip install matplotlib
matplotlib.pyplot is a collection of functions that make Matplotlib work like MATLAB.

Pyplot

To import this library, it is common to use a shorter name as follows:

import matplotlib.pyplot as plt

To plot x versus y, the plot() function can easily be used in the following way:

plt.plot([1, 2, 3, 4], [1, 4, 9, 16])

11 https://numpy.org/devdocs/user/quickstart.html.

Formatting the Style of Your Plot

To distinguish each plot, there is an optional third argument that indicates the color and the marker shape of the plot. The letters and symbols of the format string follow MATLAB's conventions.
plt.plot([1, 2, 3, 4], [1, 4, 9, 16], 'ro')

NumPy arrays can also be used to plot any sequenced data.

import numpy as np

# evenly sampled time at 200 ms intervals
t = np.arange(0., 5., 0.2)

# red dashes, blue squares and green triangles
plt.plot(t, t, 'r--', t, t**2, 'bs', t, np.sqrt(t), 'g^')
plt.show()

Plotting with Keyword Strings

There are some cases in which the format of the data allows accessing particular variables with strings. With the Matplotlib library, you can plot such data directly. The following sample demonstrates how it is carried out.

data = {'a': np.arange(50),
        'c': np.random.randint(0, 50, 50),
        'd': np.random.randn(50)}
data['b'] = data['a'] + 10 * np.random.randn(50)
data['d'] = np.abs(data['d']) * 100

plt.scatter('a', 'b', c='c', s='d', data=data)
plt.xlabel('entry a')
plt.ylabel('entry b')
plt.show()

Plotting with Categorical Variables

It is also possible to create a plot using categorical variables. Matplotlib allows us to pass categorical variables directly to many plotting functions.

names = ['group_a', 'group_b', 'group_c']
values = [1, 10, 100]

plt.figure(figsize=(9, 3))

plt.subplot(131)
plt.bar(names, values)
plt.subplot(132)
plt.scatter(names, values)
plt.subplot(133)
plt.plot(names, values)
plt.suptitle('Categorical Plotting')
plt.show()
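Instead of showing a figure interactively, it can be written to disk with savefig(). A minimal sketch (the Agg backend choice and the file name squares.png are our own):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display window needed
import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [1, 4, 9, 16], "r--", label="squares")
plt.xlabel("x")
plt.ylabel("y")
plt.title("A labelled plot")
plt.legend()
plt.savefig("squares.png")  # write the figure to a file instead of showing it
```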