Data Algo Metrics
Data Mining
We used classification models to predict the opinions from the dataset. A classification model
learns from observed data points and then predicts a label value for each record in the
dataset. Its inputs are the features of the dataset, and its output is a label, or class, for the
predicted variable. Depending on the number of classes to be predicted, there are two types
of classification: binary classification, in which two classes are predicted, and multi-class
classification, in which more than two classes are predicted. To learn patterns from the
qualitative features of data, classifiers such as Naive Bayes, Random Forest, KNN, and
Artificial Neural Networks are commonly used. Thus, we have used these models to predict
suppliers.
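As an illustration of how a classifier maps feature values to a class label, the following is a minimal sketch of k-nearest neighbours (KNN) in plain Python. It is not the implementation used in this work: the toy points, the "negative"/"positive" labels, and the function name knn_predict are hypothetical and chosen only for the example.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    # Count the labels of the k closest points and return the most frequent one.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training data with two classes (binary classification).
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = ["negative", "negative", "negative", "positive", "positive", "positive"]

label_a = knn_predict(train_x, train_y, (1.1, 1.0))
label_b = knn_predict(train_x, train_y, (4.0, 4.0))
```

Because the vote is counted per label, the same function also handles the multi-class case when train_y contains more than two distinct labels.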
Equation 3.1
Equation 3.2
To perform the above operation, each decision tree calculates the importance of each node
using Gini importance (Ronaghan S. 2018), with the following notation:
ni_j = the importance of node j
w_j = weighted number of samples reaching node j
C_j = the impurity value of node j
left(j) = child node from left split on node j
right(j) = child node from right split on node j
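Using the symbols defined above, and assuming the standard Gini-importance formulation given in the cited source, the importance of node j can be written as its weighted impurity minus the weighted impurities of its two children:

```latex
\[
ni_j = w_j C_j - w_{left(j)} C_{left(j)} - w_{right(j)} C_{right(j)}
\]
```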
The importance for each feature on a decision tree is then calculated as:
Equation 3.3
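Assuming the same formulation from the cited source, the importance of feature i on a single tree, written here as fi_i, is the sum of the importances of the nodes that split on feature i, divided by the sum of the importances of all nodes:

```latex
\[
fi_i = \frac{\sum_{j \,:\, \text{node } j \text{ splits on feature } i} ni_j}
            {\sum_{k \,\in\, \text{all nodes}} ni_k}
\]
```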
These can then be normalized to a value between 0 and 1 by dividing by the sum of all feature
importance values:
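Equation 3.4
\[ normfi_i = \frac{fi_i}{\sum_{k \in \text{all features}} fi_k} \]
where fi_i is the importance of feature i on the tree.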
The final feature importance, at the Random Forest level, is its average over all the trees. The
sum of the feature’s importance values on each tree is calculated and divided by the total
number of trees:
Equation 3.5
\[ RFfi_i = \frac{\sum_{j \in \text{all trees}} normfi_{ij}}{T} \]
RFfi_i = the importance of feature i calculated from all trees in the Random Forest model
normfi_ij = the normalized feature importance for i in tree j
T = total number of trees
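The steps described above (node importance, per-tree feature importance, normalization, and averaging over trees) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the library code used in this work: the function names, the feature names f0 and f1, and the node importance values are hypothetical, and the node-importance formula is the standard one assumed from the cited source.

```python
from collections import defaultdict


def node_importance(w, c, w_left, c_left, w_right, c_right):
    """Importance of one node: its weighted impurity minus that of its two children."""
    return w * c - w_left * c_left - w_right * c_right


def tree_feature_importance(split_nodes):
    """Normalized per-feature importance for a single tree.

    split_nodes is a list of (feature_name, node_importance) pairs, one per
    splitting node. Assumes the node importances sum to a positive value.
    """
    total = sum(ni for _, ni in split_nodes)
    fi = defaultdict(float)
    for feature, ni in split_nodes:
        fi[feature] += ni / total  # share of the total node importance
    # Normalize so that the importances of all features sum to 1.
    fi_sum = sum(fi.values())
    return {feature: value / fi_sum for feature, value in fi.items()}


def forest_feature_importance(trees):
    """Average the normalized importances over all trees.

    A feature that a tree never splits on contributes 0 for that tree.
    """
    features = sorted({feature for tree in trees for feature in tree})
    return {f: sum(tree.get(f, 0.0) for tree in trees) / len(trees) for f in features}


# Hypothetical node importances for the splitting nodes of two small trees.
tree_1 = tree_feature_importance([("f0", 0.3), ("f1", 0.1)])
tree_2 = tree_feature_importance([("f0", 0.2), ("f1", 0.2)])
forest = forest_feature_importance([tree_1, tree_2])
```

In this toy case f0 receives the larger share in the first tree and an equal share in the second, so its forest-level importance is higher than that of f1, and the forest-level values again sum to 1.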
Equation 3.6
Figure 3.2 – Single Layer Neural Network (Perceptron)
1.5 Evaluation
The results generated by the data mining classification models will be evaluated using the
metrics described below.
1.5.1 Confusion Matrix
It represents overall model performance in tabular form, where the class predictions are
summarised as follows:
True Positives (TP): correctly predicted class 1 as 1
True Negatives (TN): correctly predicted class 2 as 2
False Positives (FP): incorrectly predicted class 2 as 1
False Negatives (FN): incorrectly predicted class 1 as 2
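The four counts listed above can be tallied directly from paired actual and predicted labels. The sketch below assumes class 1 is treated as the positive class and class 2 as the negative class, as in the list; the function name confusion_counts and the label lists are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN, treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # class 1 correctly predicted as 1
        elif p != positive and a != positive:
            tn += 1  # class 2 correctly predicted as 2
        elif p == positive:
            fp += 1  # class 2 incorrectly predicted as 1
        else:
            fn += 1  # class 1 incorrectly predicted as 2
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical labels for eight records.
actual = [1, 1, 1, 2, 2, 2, 2, 1]
predicted = [1, 2, 1, 2, 1, 2, 2, 1]
counts = confusion_counts(actual, predicted)
```

For these labels the counts are TP = 3, TN = 3, FP = 1 and FN = 1.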
1.5.2 Precision
It is the ratio of true positives to the total number of predicted positives (true positives plus
false positives) and is formulated as follows:
Precision = TP/(TP + FP)
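A minimal sketch of this formula in Python follows. Returning 0.0 when nothing is predicted as positive is a convention assumed here to avoid division by zero; the text does not specify that case.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        return 0.0  # assumed convention when no record is predicted positive
    return tp / predicted_positive


# Hypothetical counts: 3 true positives and 1 false positive.
example_precision = precision(tp=3, fp=1)
```

With 3 true positives and 1 false positive, three of the four positive predictions are correct, giving a precision of 0.75.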
1.5.3 F1 Score
The harmonic mean of recall and precision is termed the F1 Score. It indicates the classifier’s
overall performance, since it takes both false positives and false negatives into account, and is
formulated as follows:
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
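The formula can be sketched in Python as below. Recall is taken here as TP/(TP + FN), the standard definition, which is an assumption since the text does not state it; returning 0.0 for empty denominators is likewise an assumed convention.

```python
def f1_score(tp, fp, fn):
    """F1 Score = 2 * (Recall * Precision) / (Recall + Precision)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # assumed standard definition
    if precision + recall == 0:
        return 0.0  # assumed convention when both terms are zero
    return 2 * (recall * precision) / (recall + precision)


# Hypothetical counts: precision is 3/4 and recall is 3/5.
example_f1 = f1_score(tp=3, fp=1, fn=2)
```

For these counts the score is 2/3 (about 0.667), which lies between the recall of 0.6 and the precision of 0.75, closer to the lower of the two, as expected of a harmonic mean.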