An Overview of Supervised Machine Learning Paradigms and Their Classifiers
International Journal of Advanced Engineering, Management and Science (IJAEMS)
Peer-Reviewed Journal
ISSN: 2454-1311 | Vol-10, Issue-3; Mar-Apr, 2024
Journal Home Page: https://ijaems.com/
DOI: https://dx.doi.org/10.22161/ijaems.103.4
Received: 30 Jan 2024; Received in revised form: 14 Feb 2024; Accepted: 22 Mar 2024; Available online: 1 Apr 2024
Abstract— Artificial Intelligence (AI) is the theory and development of computer systems capable of performing complex tasks that historically required human intelligence, such as recognizing speech, making decisions and identifying patterns. These tasks cannot be accomplished without the ability of the systems to learn. Machine learning is the ability of machines to learn from their past experiences. Just as with humans, when machines learn under supervision, it is termed supervised learning. In this work, an in-depth treatment of machine learning is expounded. Relevant literature was reviewed with the aim of presenting the different types of supervised machine learning paradigms, their categories and classifiers.
Keywords— Artificial intelligence, Machine learning, Supervised learning paradigms
The information supplied to the machine by the environment is usually imperfect, with the result that the learning element does not know in advance how to fill in missing details or ignore details that are unimportant. The machine therefore operates by guessing, and then receives feedback from the performance element. The feedback mechanism enables the machine to evaluate its hypotheses and revise them if necessary.
Two different kinds of information processing are involved in machine learning: inductive and deductive. In inductive information processing, general patterns and rules are determined from raw data and experience; it is used in similarity-based learning. In deductive processing, general rules are used to determine specific facts; it is used in the proof of a theorem, where deductions are made from known axioms to other existing axioms (Haykin, 1994).
In comparison with traditional programming, ML runs data and output on the computer to generate a program, which can then be used in traditional programming, whereas traditional programming runs data and a program on the computer to produce output (Brownlee, 2020).
Fig. 1: Typical simple model of machine learning. (a) Traditional Programming: data and a program are fed to the computer to produce output; (b) Machine Learning: data and output are fed to the computer to produce a program.
Machine Learning Classifiers
The technique for determining which class a dependent variable belongs to, based on one or more independent variables, is termed Classification. The type of machine learning algorithm that assigns a label to a data input is known as a Classifier.
Supervised Machine Learning Paradigm and their Classifiers
As the name implies, this is when a machine learns under supervision. It is the learning paradigm for acquiring the input-output relationship information of a system based on a given set of paired input-output training samples. The model is provided with a correct answer (output) for every input pattern (Samarasinghe, 2006) and as such is referred to as "learning with a teacher" (Jain, 1996); that is, the available data comprise feature vectors together with the target values. The learner (computer program) is provided with two sets of data, a training set and a test set. The training set has labelled dataset examples (a solution to each problem dataset) which the learner can use to identify unlabeled examples in the test set with the highest possible accuracy, as depicted in Fig. 2. The data is analyzed in order to tune the parameters of the model that were not in the training set so as to predict the target value for the new set of data (test data); a minimal sketch of this train-and-test workflow follows the list below.
The major tasks of supervised learning paradigms are:
i. Classification: Labeled data and classifiers are used to produce predictions about the classification of input data. The function is discrete and of a categorical type.
ii. Regression: The function is continuous. The target variable is numeric.
iii. Forecasting (Probability Estimation): The function is a probability.
iv. The supervised learning paradigm classifiers are Decision Trees, Naïve Bayes, Regression, Logistic Regression, Support Vector Machine (SVM), K-Nearest Neighbor (K-NN), Discriminant Analysis, Ensemble Methods and Neural Networks.
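As that sketch, assuming Python with scikit-learn and its bundled iris dataset (illustrative choices, not prescribed by the text): the labelled training set tunes the model, and the held-out test set measures accuracy on unseen examples.

# Hypothetical illustration of the supervised workflow described above.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)          # feature vectors + target values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)  # training set vs. test set

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)                # learn from labelled examples
print(model.score(X_test, y_test))         # accuracy on unseen (test) data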
Decision Trees
This is a statistical classifier used for both classification and regression problems. It incorporates nominal and numerical values and is expressed as a recursive partition of the instance space. A decision tree is a graphical representation of a well-defined decision problem (Fig. 3). It consists of nodes that are concerned with decision making and arcs which connect the nodes (decision rules). The decision tree forms a rooted (directed) tree with basically three types of nodes: the root node, the internal nodes and the terminal nodes. The root node originates the tree and is in turn called the parent node. It has no incoming edges and zero or more outgoing edges. Every other node has one incoming edge and is called a child node. A node with outgoing edges is termed an internal node; it is also referred to as the test node. It represents the features of the dataset. Each internal node has exactly one incoming edge and two or more outgoing edges, and splits the instance space into two or more sub-spaces based on a discrete function of the input attribute values (the attribute test condition) to separate records that have different characteristics. This latter process is called Splitting: the process of dividing a node into two or more nodes, where the decision branches off into variables. For numeric attributes, the range is considered as the partition criterion, and the decision tree can be geometrically interpreted as a collection of hyperplanes, each orthogonal to one of the axes. For classification problems, the entropy, Gini index and information gain (IG) are the splitting metrics used, while for regression the residual sum of squares is applied. All nodes other than the root and internal nodes are termed the leaf/terminal/decision nodes. Each leaf has exactly one incoming edge and no outgoing edges, because it represents the outcome. The leaf node is assigned to the class label describing the most appropriate target value. Instances are classified by navigating from the root down through the arcs to a leaf (Fig. 4). Pruning in a decision tree classifier is the opposite of splitting: it is the process of going through the tree and reducing it to only the most important nodes or outcomes.
Decision Tree Pseudocode (a runnable sketch follows Fig. 3):
1. Start the decision tree with a root node, P, that contains the complete dataset.
2. Using the Attribute Selection Measure (ASM), determine the best attribute in the dataset P to split it.
3. Divide P into subsets containing possible values for the best attribute.
4. Generate a tree node that contains the best attribute.
5. Make new decision trees recursively by using the subsets of the dataset P created in Step 3. Continue the process until a point is reached where the nodes cannot be further classified.
Fig. 3: Decision tree showing the root, internal and leaf nodes
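The pseudocode can be turned into a small runnable sketch. The version below assumes categorical attributes and uses entropy-based information gain as the ASM; the toy weather data and helper names are hypothetical, not from the paper.

# A minimal sketch of the pseudocode above (categorical attributes, ID3-style ASM).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    # ASM: pick the attribute whose split maximises information gain (Step 2)
    def gain(a):
        split = {}
        for row, lab in zip(rows, labels):
            split.setdefault(row[a], []).append(lab)
        rem = sum(len(s) / len(labels) * entropy(s) for s in split.values())
        return entropy(labels) - rem
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    # Stopping rule from Step 5: a pure node (or no attributes left) becomes a leaf
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)              # Steps 1-2
    node = {a: {}}                                            # Step 4
    rest = [x for x in attributes if x != a]
    partitions = {}
    for row, lab in zip(rows, labels):                        # Step 3: split on a's values
        partitions.setdefault(row[a], ([], []))
        partitions[row[a]][0].append(row)
        partitions[row[a]][1].append(lab)
    for value, (sub_rows, sub_labels) in partitions.items():
        node[a][value] = build_tree(sub_rows, sub_labels, rest)  # Step 5: recurse
    return node

# Toy data (hypothetical): attribute index 0 = outlook, 1 = windy
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["play", "play", "play", "stay"]
print(build_tree(rows, labels, attributes=[0, 1]))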
Naive Bayes
This is a probabilistic classifier and a generative learning algorithm based on Bayes' theorem. It is used for text classification tasks. Given the data and some prior knowledge, the theorem is based on the probability of a hypothesis. The classifier assumes that all features in the input data are conditionally independent of each other, given the class label (note: this assumption is not true for all real-world cases), thereby permitting the algorithm to make predictions quickly. The dataset is divided into two parts: the feature matrix and the response vector. The feature matrix contains all the vectors of the dataset, in which each vector consists of the values of the dependent features. The response vector contains the value of the class variable (prediction) for each row of the feature matrix. The classifier rests on the following assumptions:
i. Feature independence: The features of the data are conditionally independent of each other, given the class label.
ii. Continuous features are normally distributed: If a feature is continuous, then it is assumed to be normally distributed within each class.
iii. Discrete features have multinomial distributions: If a feature is discrete, then it is assumed to have a multinomial distribution within each class.
iv. Features are equally important: All features are assumed to contribute equally to the prediction of the class label.
v. No missing data: The data should not contain any missing values.
For the mathematical analysis from Bayes' theorem, if A and B are events and P(B) ≠ 0, the probability of event A is

$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$   …(1.1)

where event B is the evidence (true), P(A) is the prior probability of A, P(B) is the marginal probability, P(A|B) is the posterior probability of A given the evidence B, and P(B|A) is the likelihood that the hypothesis will come true based on the evidence. Applying Bayes' theorem,

$P(y|X) = \frac{P(X|y)P(y)}{P(X)}$   …(1.2)

where y is the class variable and X is the dependent feature vector (of size n),

$X = (x_1, x_2, \ldots, x_n)$   …(1.3)

Putting the naïve assumption (independence among the features) into Bayes' theorem, we split the evidence into independent parts. If A and B are independent, then

$P(A, B) = P(A)P(B)$   …(1.4)

Hence,

$P(y|x_1, x_2, \ldots, x_n) = \frac{P(x_1|y)P(x_2|y)\cdots P(x_n|y)\,P(y)}{P(x_1)P(x_2)\cdots P(x_n)}$   …(1.5)

which can be expressed as

$P(y|x_1, x_2, \ldots, x_n) = \frac{P(y)\prod_{i=1}^{n} P(x_i|y)}{P(x_1)P(x_2)\cdots P(x_n)}$   …(1.6)

As the denominator remains constant for any given input, we can remove it:

$P(y|x_1, x_2, \ldots, x_n) \propto P(y)\prod_{i=1}^{n} P(x_i|y)$

In order to create the classifier model, we find the probability of the given set of inputs for all possible values of the class variable y and pick the value with maximum probability:

$y = \arg\max_{y} P(y)\prod_{i=1}^{n} P(x_i|y)$   …(1.7)
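A compact sketch of equation (1.7) for text classification follows. The toy documents, the add-one (Laplace) smoothing constant alpha and all names are illustrative assumptions, not part of the original text.

# A small sketch of equation (1.7): choose y maximising P(y) * prod_i P(x_i | y).
from collections import Counter, defaultdict

docs = [("buy cheap pills", "spam"), ("cheap pills now", "spam"),
        ("meeting at noon", "ham"), ("lunch meeting today", "ham")]

class_counts = Counter(label for _, label in docs)             # for the prior P(y)
word_counts = defaultdict(Counter)                             # for P(x_i | y)
for text, label in docs:
    word_counts[label].update(text.split())

def predict(text, alpha=1.0):
    best, best_p = None, 0.0
    for y, cy in class_counts.items():
        prior = cy / len(docs)                                 # P(y)
        total = sum(word_counts[y].values())
        vocab = len(set(w for c in word_counts.values() for w in c))
        likelihood = 1.0
        for w in text.split():                                 # naive independence
            likelihood *= (word_counts[y][w] + alpha) / (total + alpha * vocab)
        p = prior * likelihood                                 # numerator of (1.6)
        if p > best_p:
            best, best_p = y, p
    return best

print(predict("cheap pills today"))   # -> 'spam'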
Regression
The goal of this statistical classifier is to plot the best-fit line or curve through the data (Kurama, 2023). A continuous outcome (y) is predicted based on the value of the predictor variables (x). Linear regression is the most common regression model due to its ease of use (Fig. 4). It finds the linear relationship between the dependent variable (continuous) and one or more independent variables (continuous or discrete).
Steps in determining the best-fit line:
1. Consider the linear problem y = mx + c, where y is the dependent data, x is the independent data within the dataset, m is the coefficient (the contribution of the input value in determining the best-fit line) and c is the bias or intercept (the deviation added to the line equation for the predictions made).
2. Adjust the line by varying m and c.
3. Randomly determine initial values for m and c and plot the line.
4. If the line does not fit well, adjust m and c using the gradient descent algorithm or the least-squares method (see the code sketch below).

$y = mx + c$   …(1.8)

y = the dependent variable, plotted along the y-axis
x = the independent variable, plotted along the x-axis
m = slope of the line
c = the intercept (the value of y when x = 0)
Line of regression = best-fit line for a model
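The four steps can be sketched as follows, assuming a small illustrative dataset, a learning rate of 0.01 and gradient descent on the mean squared error (all assumptions for illustration, not prescriptions from the text):

# A minimal sketch of steps 1-4: fit y = m*x + c by gradient descent.
import random

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.1]          # roughly y = 2x + 1 with noise

random.seed(0)
m, c = random.random(), random.random()   # Step 3: random initial m and c
lr, n = 0.01, len(xs)

for _ in range(5000):                     # Step 4: adjust m and c iteratively
    grad_m = (2 / n) * sum((m * x + c - y) * x for x, y in zip(xs, ys))
    grad_c = (2 / n) * sum((m * x + c - y) for x, y in zip(xs, ys))
    m -= lr * grad_m                      # move against the gradient
    c -= lr * grad_c

print(round(m, 2), round(c, 2))           # ~2.0 and ~1.0: the best-fit line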
Discriminant Analysis
This classifier rests on the following assumptions:
1. Every feature, such as a variable, dimension or attribute in the dataset, has a Gaussian distribution.
2. Each feature holds the same variance, with values varying around the mean by the same amount on average.
3. Each feature is assumed to be sampled randomly.
4. Lack of multicollinearity in the independent features: as correlations between independent features increase, the power of prediction decreases.
In reducing the features from a higher-dimensional space to a lower-dimensional space, the following steps should be considered (a code sketch follows the list):
1. Compute the separability between the various classes. This is to determine the between-class variance of the different classes (the distance between the means of the different classes).
2. Compute the distance between the mean and the samples of each class (the within-class variance).
3. Determine the lower-dimensional space that maximizes the between-class variance and minimizes the within-class variance.
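A sketch of the three steps, assuming two Gaussian classes in two dimensions (synthetic data; all names are illustrative): the between-class scatter S_b captures Step 1, the within-class scatter S_w captures Step 2, and the leading eigenvector of inv(S_w) @ S_b spans the one-dimensional space sought in Step 3.

# Linear discriminant analysis reduction steps on synthetic two-class data.
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal([0, 0], 1.0, size=(50, 2))       # class 0 samples (hypothetical)
X1 = rng.normal([3, 2], 1.0, size=(50, 2))       # class 1 samples

mean0, mean1 = X0.mean(axis=0), X1.mean(axis=0)
overall = np.vstack([X0, X1]).mean(axis=0)

# Step 1: between-class scatter from class means vs. the overall mean
S_b = sum(len(Xc) * np.outer(mc - overall, mc - overall)
          for Xc, mc in [(X0, mean0), (X1, mean1)])

# Step 2: within-class scatter of samples around their own class mean
S_w = sum((Xc - mc).T @ (Xc - mc) for Xc, mc in [(X0, mean0), (X1, mean1)])

# Step 3: direction maximising the between/within variance ratio
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
w = eigvecs[:, np.argmax(eigvals.real)].real
print(w / np.linalg.norm(w))                      # the discriminant direction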
Ensemble Methods
This classifier encapsulates multiple learning algorithms to obtain better predictive results. It aims to mitigate the errors or biases that may exist in individual models by leveraging the collective intelligence of the ensemble (Singh, 2023). The outputs of many models are combined, thereby utilizing the strengths of these models to improve accuracy and handle uncertainties in the data in its learning system. The various ensemble techniques are Max Voting, Averaging, Weighted Average, Stacking, Blending, Bagging and Boosting.
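As a small illustration of two of the named techniques, max voting and (weighted) averaging, the sketch below uses made-up predictions from three hypothetical base models:

# Illustrative combination rules; the base-model predictions are invented.
from collections import Counter

# Max voting (classification): the modal class label wins
votes = ["cat", "dog", "cat"]                 # one prediction per base model
print(Counter(votes).most_common(1)[0][0])    # -> 'cat'

# Averaging / weighted average (regression or probabilities)
preds = [0.72, 0.80, 0.64]                    # base-model outputs for one input
weights = [0.5, 0.3, 0.2]                     # e.g. proportional to validation skill
print(sum(preds) / len(preds))                            # simple average
print(sum(w * p for w, p in zip(weights, preds)))         # weighted average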
Artificial Neural Network (ANN)
It is designed to mimic the function and structure of the human brain. An ANN is an intricate network of interconnected nodes, or neurons, that collaborate to tackle complicated tasks. The main characteristic of an ANN is its ability to learn in classification tasks. It learns by example and through experience. In high-dimensionality data, learning is needful for modeling non-linear relationships or recognizing not-well-established relationships amongst the input variables. The learning process is achieved by adjusting the weights of the interconnections according to the applied learning algorithm. The basic attributes of ANNs can be classified into architectural attributes and neuro-dynamic attributes (Kartalopoulos, 1996). The architectural attributes define the number and topology of neurons and their interconnectivity, while the neuro-dynamic attributes define the functionality of the ANN. Based on this, an ANN is also referred to as Deep Learning (DL) when it has more than three layers (the depth of the layers is considered) to handle complex non-linear tasks. The feed forward neural network comprises the single-layer network (Hopfield net architecture), the multilayer perceptron (MLP), which uses back-propagation learning (Levenberg-Marquardt), and the radial basis neural network; all are supervised learning networks.
Feed Forward Neural Networks (FFNN): This is a layered neural network in which an input layer of source nodes projects onto an output layer of neurons, but not vice versa.
a. Single-layer Feed Forward Network: This is the simplest kind of neural network; it is flat and consists of a single layer of output nodes (Fig. 6). It is also called the single perceptron. The inputs are fed directly to the outputs through a series of weights. The sum of the products of the weights and the inputs is calculated in each node, and if the value is above some threshold (typically 0), the neuron fires and takes the activated value (typically 1); otherwise it takes the deactivated value (-1). A single perceptron is only capable of learning linearly separable patterns.
Fig. 6: A Single layer Feed Forward Network
The mapping of the single-unit perceptron is expressed as:

$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$   …(1.11)

where $w_i$ are the individual weights, $x_i$ are the inputs and $b$ is the bias.
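Equation (1.11) and the threshold firing rule can be sketched directly. The AND-gate data, the ±1 coding and the classic perceptron update rule used for training are illustrative assumptions consistent with the description above, not details taken from the paper.

# Equation (1.11) with the threshold activation described above:
# fire +1 when the weighted sum plus bias exceeds 0, else -1.
def fire(weights, bias, x):
    s = sum(w * xi for w, xi in zip(weights, x)) + bias    # sum of products + b
    return 1 if s > 0 else -1                              # activated / deactivated

# AND gate with inputs/outputs coded as -1 / +1 (linearly separable)
samples = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]

w, b, lr = [0.0, 0.0], 0.0, 0.1
for _ in range(20):                                        # a few epochs suffice
    for x, target in samples:
        y = fire(w, b, x)
        if y != target:                                    # mistake-driven update
            w = [wi + lr * target * xi for wi, xi in zip(w, x)]
            b += lr * target

print([fire(w, b, x) for x, _ in samples])                 # -> [-1, -1, -1, 1]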
b. Multilayer Feed Forward Network (MLP): This distinguishes itself by the presence of one or more hidden layers, whose computation nodes are called hidden neurons, between the input units and the output units (Fig. 7). This aids the network in dealing with more complex non-linear problems. The MLP is structured in a feed forward topology whereby each unit gets its input from the previous one, and is trained by back propagation.
Fig. 7: Multiple Layer Perceptron
The mapping of the inputs to the outputs using an MLP neural network can be expressed as:

$y_k = f\left(\sum_{j=1}^{m} w_{kj}^{(2)}\left(\sum_{i=1}^{n} w_{ji}^{(1)} x_i + w_{j0}^{(1)}\right) + w_{k0}^{(2)}\right)$   …(1.12)

where $w_{ji}^{(1)}$ and $w_{kj}^{(2)}$ indicate the weights in the first and second layers respectively, going from input $i$ to hidden unit $j$ (hidden layer 1), $m$ is the number of hidden units, $y_k$ is the output unit, and $w_{j0}^{(1)}$ and $w_{k0}^{(2)}$ are the biases for hidden unit $j$ and output unit $k$ respectively. For simplicity, the biases have been omitted from the diagram.
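Equation (1.12) transcribes to a forward pass as below. The layer sizes, random weights and the sigmoid choice for f are assumptions for illustration; the equation as printed applies f only at the output layer, and the sketch follows it.

# A direct transcription of equation (1.12) as a forward pass.
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 3, 4, 2                      # inputs, hidden units, outputs
W1 = rng.normal(size=(m, n))           # first-layer weights  w_ji^(1)
b1 = rng.normal(size=m)                # hidden biases        w_j0^(1)
W2 = rng.normal(size=(k, m))           # second-layer weights w_kj^(2)
b2 = rng.normal(size=k)                # output biases        w_k0^(2)

def f(a):                              # output activation (sigmoid here)
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.5, -1.0, 2.0])         # one input vector
hidden = W1 @ x + b1                   # inner sums over i for each hidden j
y = f(W2 @ hidden + b2)                # outer sum over j, then f, per output k
print(y)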
c. Radial Basis Neural Network (RBNN): This is also called the Radial Basis Function (RBF) network. It is a two-layer feed forward type of network in which the input is transformed by the basis functions at the hidden layer (Fig. 8). At the output layer, linear combinations of the hidden-layer node responses are added to form the output. The name RBF comes from the fact that the basis functions in the hidden-layer nodes are radially symmetric; that is, the neurons in the hidden layer contain Gaussian transfer functions whose outputs are inversely proportional to the distance from the center of the neuron.
Fig. 8: Radial Basis Neural Network
Mathematically, it can be expressed as:

$y(x) = \sum_{i=1}^{N} w_i\,\phi\left(\lVert x - c_i \rVert\right)$   …(1.13)

where $x$ is the input vector, $N$ is the number of neurons in the hidden layer, $w_i$ are the weights of the connections from the hidden layer to the output layer, and $c_i$ are the centers of the hidden-layer neurons.
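Equation (1.13) can be sketched with Gaussian basis functions: fix the centers c_i, build the matrix of φ(‖x − c_i‖) responses, and fit the linear output-layer weights w_i by least squares. The sine-curve data, the number of centers and the basis width below are illustrative assumptions.

# A sketch of equation (1.13) with Gaussian basis functions.
import numpy as np

def phi(r, width=1.0):
    return np.exp(-(r / width) ** 2)           # Gaussian, decays with distance

xs = np.linspace(0.0, 2.0 * np.pi, 40)         # 1-D inputs (hypothetical)
ys = np.sin(xs)                                # target function to approximate

centers = np.linspace(0.0, 2.0 * np.pi, 8)     # the c_i of the hidden neurons
Phi = phi(np.abs(xs[:, None] - centers[None, :]))   # Phi[j, i] = phi(||x_j - c_i||)
w, *_ = np.linalg.lstsq(Phi, ys, rcond=None)   # linear output-layer weights

y_hat = Phi @ w                                # y(x) = sum_i w_i phi(||x - c_i||)
print(float(np.max(np.abs(y_hat - ys))))       # small approximation error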
III. CONCLUSION
As the present world revolves around AI for its benefits, machine learning has been of immense importance to the building of such intelligent systems and to improving their performance. Learning under supervision to predict the output of a system when given new inputs has proved more accurate and easier when the decision boundary is not overstrained. This overview of supervised machine learning paradigms gives a detailed insight into the various statistical and scientific classifiers used in building functions that map new data onto the expected output values in tasks that require classification, regression or both.

REFERENCES
[1] Ambrose, S.A., Bridges, M.N., Dipietro, M., Lovett, M.C. and Norman, M.K. (2010). How Learning Works: Seven Research-Based Principles for Smart Teaching, Jossey-Bass, A Wiley Imprint, San Francisco, pp. 1-301.
[2] Bansal, R., Singh, J. and Kaur, R. (2019). Machine Learning and its Applications: A Review, Journal of Applied Science and Computations, Vol. VI, Issue VI, pp. 1392-1398.
[3] Brownlee, J. (2020). Basic Concepts in Machine Learning. Retrieved from https://machinelearningmastery.com/basic-concepts-in-machine-learning/
[4] Falade, K.I. (2021). Introduction to Computational Algorithm, Numerical and Computational Research Laboratory, pp. 1-50.
[5] Ghahremani-Nahr, J., Hamed, N. and Sadeghi, M.E. (2021). Artificial Intelligence and Machine Learning for Real-World Problems (A Survey), International Journal of Innovation in Engineering, 1(3), pp. 38-47.
[6] Haykin, S. (1998). Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company, Inc., USA, pp. 1-696.
[7] Jain, A.K. (1996). Artificial Neural Networks: A Tutorial, pp. 1-14. Retrieved from www.cogsci.ucsd.edu/ajyu/Teaching/cogs202_sp12/Readings/jain_ann96.pdf
[8] Kartalopoulos, S.V. (1996). Understanding Neural Networks and Fuzzy Logic: Basic Concepts and Applications, IEEE Press, NY, pp. 1-232.
[9] Kurama, V. (2023). Regression in Machine Learning: What It Is and Examples of Different Models. Retrieved from https://builtin.com/data-science/regression-machine-learning
[10] NetApp (2023). What is Machine Learning? Retrieved from https://www.netapp.com/artificial-intelligence/what-is-machine-learning/