CG DADL - 2024 June - Lecture 05
Introduction to Classification
Corporate Gurukul – Data Analytics using Deep Learning
June 2024
[Figure: a classification algorithm f(x)]
CG DADL (June 2024) Lecture 5 – Introduction to Classification
Development of a Classification Model
Development of a classification model consists of three
main phases.
Training phase:
The classification algorithm is applied to the instances
belonging to a subset T of the dataset D .
T is called the training data set.
Classification rules are derived that allow a class to be
assigned to each observation x.
Test phase:
The rules generated in the training phase are used to classify
observations in D but not in T .
Accuracy is checked by comparing the actual target class with
the predicted class for all instances in V = D − T .
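
The training and test phases above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the toy dataset, the 70/30 split, and the helper names (split_train_test, train_threshold_rule, accuracy) are hypothetical and do not come from the lecture, and the "classification rule" is a deliberately simple one-feature threshold.

```python
import random


def split_train_test(dataset, train_fraction=0.7, seed=0):
    """Split dataset D into a training set T and a test set V = D - T."""
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    cut = int(len(dataset) * train_fraction)
    train = [dataset[i] for i in indices[:cut]]
    test = [dataset[i] for i in indices[cut:]]
    return train, test


def train_threshold_rule(train):
    """Training phase: derive a toy rule 'predict +1 if x >= threshold'.

    The threshold is the midpoint between the two class means
    (an assumption made purely for illustration).
    """
    neg = [x for x, y in train if y == -1]
    pos = [x for x, y in train if y == +1]
    threshold = (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2
    return lambda x: +1 if x >= threshold else -1


def accuracy(rule, test):
    """Test phase: fraction of instances in V whose predicted class is correct."""
    correct = sum(1 for x, y in test if rule(x) == y)
    return correct / len(test)


# Hypothetical dataset D of (feature, class) pairs with classes -1 and +1.
D = [(float(x), -1) for x in range(0, 10)] + [(float(x), +1) for x in range(10, 20)]

T, V = split_train_test(D)       # T: training set, V = D - T: test set
rule = train_threshold_rule(T)   # rules derived from T only
acc = accuracy(rule, V)          # accuracy measured on V only
print(f"|T| = {len(T)}, |V| = {len(V)}, accuracy on V = {acc:.2f}")
```

The point of the sketch is that the rule never sees V during training, so the accuracy on V estimates how the rule behaves on unseen observations.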
[Figure: phases of classification model development. Labels: Training, Test Data, Test, Rules, Knowledge]
[Figure: the dataset partitioned into ten subsets L1 to L10, repeated over several rows. Caption fragment: "observations belonging to each target class". In this example, purple/blue marks Class 0 and red marks Class 1.]
                            Predictions
                            -1 (Negative)   +1 (Positive)   Total
Instances  -1 (Negative)    p               q               p+q
Instances  +1 (Positive)    u               v               u+v
Total                       p+u             q+v             m
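
As a rough illustration, the four counts p, q, u and v of the confusion matrix can be tallied from paired lists of actual and predicted classes. The function name confusion_counts and the sample labels below are hypothetical; the class coding (-1 for negative, +1 for positive) follows the table.

```python
def confusion_counts(actual, predicted):
    """Return (p, q, u, v) for classes coded -1 (negative) and +1 (positive).

    p: negative instances predicted negative
    q: negative instances predicted positive
    u: positive instances predicted negative
    v: positive instances predicted positive
    """
    p = q = u = v = 0
    for y, y_hat in zip(actual, predicted):
        if y == -1 and y_hat == -1:
            p += 1
        elif y == -1 and y_hat == +1:
            q += 1
        elif y == +1 and y_hat == -1:
            u += 1
        elif y == +1 and y_hat == +1:
            v += 1
        else:
            raise ValueError(f"unexpected class pair: {(y, y_hat)}")
    return p, q, u, v


# Hypothetical labels for eight test instances.
actual = [-1, -1, -1, -1, +1, +1, +1, -1]
predicted = [-1, +1, -1, -1, +1, -1, +1, -1]

p, q, u, v = confusion_counts(actual, predicted)
m = p + q + u + v
print(p, q, u, v, m)  # prints: 4 1 1 2 8
```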
Confusion Matrices (cont.)
True negative rate – Among all negative instances, the
proportion of correct predictions:
$tn = \frac{p}{p+q}$
False negative rate – Among all positive instances, the
proportion of incorrect predictions:
$fn = \frac{u}{u+v}$
False positive rate – Among all negative instances, the
proportion of incorrect predictions:
$fp = \frac{q}{p+q}$
True positive rate – Among all positive instances, the proportion
of correct predictions (also known as recall):
$tp = \frac{v}{u+v}$
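
A small sketch of the four rates, computed directly from the confusion-matrix counts p, q, u and v. The function name rates and the example counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def rates(p, q, u, v):
    """Compute the four rates from confusion-matrix counts.

    p, q: negative instances predicted negative / positive
    u, v: positive instances predicted negative / positive
    """
    return {
        "tn": p / (p + q),  # true negative rate
        "fp": q / (p + q),  # false positive rate
        "fn": u / (u + v),  # false negative rate
        "tp": v / (u + v),  # true positive rate (recall)
    }


# Hypothetical counts: 60 negative and 40 positive instances.
r = rates(p=50, q=10, u=5, v=35)
for name, value in r.items():
    print(f"{name} = {value:.3f}")
```

Note that tn + fp = 1 and fn + tp = 1, because each pair of rates shares the same denominator (all negative instances, or all positive instances).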
$F = \frac{(\beta^2 + 1)\, tp \times prc}{\beta^2\, prc + tp}$
where prc denotes the precision, tp the true positive rate, and
β ∈ [0, ∞) regulates the relative importance of the
precision w.r.t. the true positive rate. The F-measure is also
equal to 0 if all the predictions are incorrect.
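
The F-measure formula can be sketched as a short function. Two assumptions are made here that are not stated in this section: precision is taken in its usual form prc = v / (q + v), and the function returns 0 when both tp and prc are 0 (so the formula's denominator would vanish), matching the remark that F equals 0 when all predictions are incorrect. The function name and example counts are hypothetical.

```python
def f_measure(tp, prc, beta=1.0):
    """F = (beta^2 + 1) * tp * prc / (beta^2 * prc + tp).

    tp:   true positive rate (recall)
    prc:  precision
    beta: relative importance of precision w.r.t. the true positive rate
    """
    denominator = beta ** 2 * prc + tp
    if denominator == 0:
        # Assumption: define F as 0 when tp and prc are both 0.
        return 0.0
    return (beta ** 2 + 1) * tp * prc / denominator


# Hypothetical counts: q = 10 false positives, u = 5 false negatives,
# v = 35 true positives.
q, u, v = 10, 5, 35
tp = v / (u + v)    # true positive rate
prc = v / (q + v)   # precision (assumed standard definition)

f1 = f_measure(tp, prc, beta=1.0)
print(f"tp = {tp:.3f}, prc = {prc:.3f}, F (beta = 1) = {f1:.3f}")
```

With β = 1 the expression reduces to the harmonic mean of precision and the true positive rate.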
ROC Curve Charts
Receiver operating characteristic (ROC) curve
charts:
Allow the user to visually evaluate the accuracy of a classifier
and to compare different classification models.
Visually express the information content of a sequence of
confusion matrices.
Allow the ideal trade-off to be assessed between:
Number of correctly classified positive observations – True Positive
Rate on the y-axis.
Number of incorrectly classified negative observations
– False Positive Rate on the x-axis.
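
One way to picture the "sequence of confusion matrices" behind a ROC chart: if the classifier outputs a score per observation, each decision threshold yields its own confusion matrix, and hence one (False Positive Rate, True Positive Rate) point. The sketch below assumes such scores are available; the function name roc_points and the sample data are hypothetical.

```python
def roc_points(actual, scores):
    """Return a list of (fp_rate, tp_rate) points, one per threshold.

    An observation is predicted positive (+1) when its score is at least
    the threshold. Thresholds are swept from the highest score down.
    """
    n_pos = sum(1 for y in actual if y == +1)
    n_neg = len(actual) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        # v: positives predicted positive; q: negatives predicted positive
        v = sum(1 for y, s in zip(actual, scores) if s >= threshold and y == +1)
        q = sum(1 for y, s in zip(actual, scores) if s >= threshold and y == -1)
        points.append((q / n_neg, v / n_pos))
    return points


# Hypothetical actual classes and classifier scores for six observations.
actual = [+1, +1, -1, +1, -1, -1]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(actual, scores)
for fp_rate, tp_rate in points:
    print(f"fp = {fp_rate:.2f}, tp = {tp_rate:.2f}")
```

Plotting these points with fp on the x-axis and tp on the y-axis gives the ROC curve; a curve that rises quickly towards the top-left corner indicates a classifier that identifies most positives while misclassifying few negatives.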