Data Algo Metrics

1. Data Mining
We used classification models to predict the opinions from the dataset. A classification model draws conclusions from observed data points and predicts a label value for each record in the dataset. Its inputs are taken from the various features of the dataset and its output is a label, or class, for the predicted variable. Depending on the number of classes to be predicted, there are two types of classification: binary and multi-class. If two classes are to be predicted, the task is binary classification; if more than two classes are to be predicted, it becomes multi-class classification. To learn patterns from the qualitative features of the data, classifiers such as Naive Bayes, Random Forest, KNN and Artificial Neural Networks generally give better performance. We have therefore used these models to predict suppliers.
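As a minimal illustration of the binary/multi-class distinction above, the task type follows directly from the number of distinct labels in the data (the label values below are invented toy examples):

```python
def classification_type(labels):
    """Return the task type implied by the distinct label values."""
    n_classes = len(set(labels))
    if n_classes < 2:
        raise ValueError("need at least two classes to classify")
    return "binary" if n_classes == 2 else "multi-class"

print(classification_type(["approve", "reject", "approve"]))      # binary
print(classification_type(["gold", "silver", "bronze", "gold"]))  # multi-class
```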

1.1 Naive Bayes Classifier


Naive Bayes is a simple, effective and widely used machine learning classifier. It classifies using the Maximum A Posteriori decision rule derived from Bayes' conditional probability theorem. It is typically used when the features in the dataset are of a contrasting nature, and it uses available past information to calculate the probability of future events (Soni D. 2018).
To perform this operation we use Bayes' rule as follows:

Equation 3.1:   P(A|B) = P(B|A) · P(A) / P(B)

where P(A|B) is the posterior probability of class A given evidence B, P(B|A) is the likelihood, P(A) is the prior and P(B) is the evidence.
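As a worked sketch of Equation 3.1, the posterior for one class given a single binary feature can be computed directly, expanding the evidence P(B) by the law of total probability. The prior and likelihoods below are invented toy numbers, not values estimated from the dataset:

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B),
    with P(B) = P(B|A)P(A) + P(B|not A)P(not A)."""
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1 - prior_a))
    return likelihood_b_given_a * prior_a / evidence

# Toy numbers: P(positive) = 0.3, P(keyword|positive) = 0.8,
# P(keyword|negative) = 0.2
p = posterior(0.3, 0.8, 0.2)
print(round(p, 4))  # 0.24 / 0.38 ≈ 0.6316
```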

1.2 Random Forest Classifier


Random Forest, also known as Random Decision Forest, is a supervised machine learning classifier which analyses the given data features and finds an effective way to split the data into trees and sub-trees in order to predict new values. The output is obtained by combining the collection of decision tree classifiers into a single result: the winning class is the one that appears most often in the list of outputs from all the decision trees used (Borcan M. 2020).
To perform this operation, each decision tree calculates the importance of each node using Gini importance (Ronaghan S. 2018):

Equation 3.2:   ni_j = w_j · C_j − w_left(j) · C_left(j) − w_right(j) · C_right(j)

 ni_j = the importance of node j
 w_j = weighted number of samples reaching node j
 C_j = the impurity value of node j
 left(j) = child node from the left split of node j
 right(j) = child node from the right split of node j
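A minimal sketch of Equation 3.2, using invented sample weights and Gini impurities for a single split:

```python
def node_importance(w_j, c_j, w_left, c_left, w_right, c_right):
    """Gini importance of node j (Equation 3.2): the weighted impurity
    of the node minus the weighted impurity of its two children."""
    return w_j * c_j - w_left * c_left - w_right * c_right

# Toy split: the node holds all samples (weight 1.0) with Gini 0.5;
# its children receive 60% / 40% of the samples at lower impurity.
ni = node_importance(1.0, 0.5, 0.6, 0.3, 0.4, 0.2)
print(ni)  # 0.5 - 0.18 - 0.08 ≈ 0.24
```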

The importance for each feature on a decision tree is then calculated as:

Equation 3.3:   fi_i = ( Σ over nodes j that split on feature i of ni_j ) / ( Σ over all nodes k of ni_k )

 fi_i = the importance of feature i
 ni_j = the importance of node j

These can then be normalized to a value between 0 and 1 by dividing by the sum of all feature
importance values:

Equation 3.4:   normfi_i = fi_i / ( Σ over all features j of fi_j )

The final feature importance, at the Random Forest level, is its average over all the trees: the sum of the feature's normalized importance values over all trees is divided by the total number of trees:

Equation 3.5:   RFfi_i = ( Σ over all trees j of normfi_ij ) / T

 RFfi_i = the importance of feature i calculated from all trees in the Random Forest model
 normfi_ij = the normalized importance of feature i in tree j
 T = total number of trees
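Equations 3.3–3.5 can be sketched end to end: node importances are attributed to the feature each node splits on, normalized within each tree, and averaged across trees. The feature names and node importances below are invented toy values:

```python
def tree_feature_importances(node_importances):
    """Equations 3.3 and 3.4: sum node importances per splitting feature,
    then normalize so the tree's importances sum to 1."""
    totals = {}
    for feature, ni in node_importances:
        totals[feature] = totals.get(feature, 0.0) + ni
    whole = sum(totals.values())
    return {f: v / whole for f, v in totals.items()}

def forest_feature_importances(trees):
    """Equation 3.5: average the normalized importances over all T trees."""
    T = len(trees)
    features = {f for tree in trees for f in tree}
    return {f: sum(tree.get(f, 0.0) for tree in trees) / T for f in features}

# Two toy trees; nodes given as (splitting feature, node importance)
tree1 = tree_feature_importances([("price", 0.6), ("delay", 0.2), ("price", 0.2)])
tree2 = tree_feature_importances([("price", 0.5), ("delay", 0.5)])
print(forest_feature_importances([tree1, tree2]))  # price 0.65, delay 0.35
```

In scikit-learn, the same quantity is exposed as the fitted model's `feature_importances_` attribute.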

1.3 KNN Classifier


KNN stands for k-Nearest Neighbours and is a supervised machine learning classifier. It is a simple algorithm which stores all available cases and classifies a new data point based on how its neighbours are classified. The 'k' in KNN represents the number of nearest neighbours to include in order to obtain the desired outcome, and the process of choosing the right value of k is called parameter tuning (Subramanian 2019).
Similarity here is defined by a distance metric between two data points, using the Euclidean distance method:

Equation 3.6:   d(x, y) = sqrt( Σ from i=1 to n of (x_i − y_i)² )
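A minimal KNN sketch built on Equation 3.6, with majority voting over the k nearest points (the training points and labels below are invented toy values):

```python
import math
from collections import Counter

def euclidean(x, y):
    """Equation 3.6: straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours.
    `train` is a list of (point, label) pairs."""
    neighbours = sorted(train, key=lambda pl: euclidean(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((8, 8), "B"), ((9, 8), "B"), ((2, 1), "A")]
print(knn_predict(train, (1.5, 1.5), k=3))  # "A"
```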

1.4 ANN Classifier


The Artificial Neural Network classifier imitates the processing of human brain neurons as a basis for algorithms used to model complex patterns. It consists of multiple interconnected nodes (neurons) divided into three layers, i.e. an input layer, a hidden layer and an output layer. ANNs are mostly used to solve multi-class classification problems, as they capture interdependencies between the output classes (Jahnavi 2017).

Figure 3.2 – Single Layer Neural Network (Perceptron)

 x0, x1, x2, x3, ..., xn – input nodes (independent variables)
 w0, w1, w2, w3, ..., wn – weights representing the strengths of the individual nodes
 b – bias value to shift the activation function up or down
 f – activation function
 o – output
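The single-layer perceptron in the figure can be sketched directly from those symbols, here with a step activation for f. The weights, bias and inputs are invented toy values (they happen to implement a logical AND):

```python
def perceptron(x, w, b, f):
    """o = f(w · x + b): weighted sum of inputs plus bias,
    passed through the activation function f."""
    return f(sum(wi * xi for wi, xi in zip(w, x)) + b)

def step(z):
    """Threshold activation: fire when the weighted sum is non-negative."""
    return 1 if z >= 0 else 0

w, b = [1.0, 1.0], -1.5
print(perceptron([1, 1], w, b, step))  # 1: fires only when both inputs are 1
print(perceptron([1, 0], w, b, step))  # 0
```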

1.5 Evaluation
The results generated by the various data mining classification models are evaluated using the metrics below.
1.5.1 Confusion Matrix
It represents overall model performance in a tabular format, where the class predictions are summarised as follows:
 True Positives (TP): correctly predicted class 1 as 1
 True Negatives (TN): correctly predicted class 2 as 2
 False Positives (FP): incorrectly predicted class 2 as 1
 False Negatives (FN): incorrectly predicted class 1 as 2
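The four cells above can be counted directly from the true and predicted labels, here treating class 1 as the positive class (the label lists are invented toy values):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary task, with `positive`
    marking which class counts as the positive one."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

y_true = [1, 1, 2, 2, 1, 2]
y_pred = [1, 2, 2, 1, 1, 2]
print(confusion_counts(y_true, y_pred))  # (2, 2, 1, 1)
```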

1.5.2 Precision
It is the ratio of true positives to the total number of predicted positive classes and is formulated as follows:
 Precision = TP/(TP + FP)

1.5.3 F1 Score

The harmonic mean of recall and precision is termed the F1 Score. It indicates the classifier's overall performance, since it takes into account both false positives and false negatives, as follows:
 F1 Score = 2*(Recall * Precision) / (Recall + Precision)

1.5.4 Area under the Curve (AUC)


It is the probability that the model ranks a positive observation higher than a negative one, and its value ranges from 0 to 1.
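AUC under this rank interpretation can be sketched by comparing every positive score against every negative score, counting ties as half a win (the scores below are invented toy values):

```python
def auc(pos_scores, neg_scores):
    """Probability that a randomly chosen positive is scored above a
    randomly chosen negative; ties contribute 0.5."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

print(auc([0.9, 0.4], [0.8, 0.2]))  # 3 of 4 pairs ranked correctly -> 0.75
```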

1.5.5 Classification Accuracy (CA)


It is the fraction of correctly predicted observations, calculated as follows:
 CA = (TP+TN)/(TP + TN + FP + FN)
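Precision, F1 and CA follow directly from the four confusion-matrix counts; the counts below are invented toy values:

```python
def precision(tp, fp):
    """TP / (TP + FP): fraction of predicted positives that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """TP / (TP + FN): fraction of actual positives that are found."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    """(TP + TN) / all observations."""
    return (tp + tn) / (tp + tn + fp + fn)

tp, tn, fp, fn = 40, 45, 10, 5
print(precision(tp, fp))         # 40/50 = 0.8
print(accuracy(tp, tn, fp, fn))  # 85/100 = 0.85
```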

1.6 Environment Setup

The technical specifications of the expected system configuration for the proposed architecture are shown below:

Memory     Processor        Speed
8 GB RAM   Intel i5 8250U   1.80 GHz
Table 3.1 - Hardware Specification

Storage     Software and Libraries
Local HDD   Anaconda (Spyder IDE); libraries: numpy, pandas, scikit-learn, matplotlib, dlib, keras, tensorflow
Table 3.2 - Software Specification
