
Lecture 5

Introduction to Classification
Corporate Gurukul – Data Analytics using Deep Learning
June 2024

Lecturer: A/P TAN Wee Kek


Email: [email protected] :: Tel: 6516 6731 :: Office: COM3-02-35
Learning Objectives
 At the end of this lecture, you should understand:
 Limitations of linear regression models.
 Definitions of classification problem and classification models.
 Evaluation of classification models.
 Usefulness of classification and clustering.



Overview of Classification



Limitations of Linear Regression Models
 Regression analysis is useful but suffers from an important
limitation.
 In linear regression models, the numerical dependent
variable must be continuous:
 The dependent variable can take on any value, or at least values over a near-continuous range.
 In some data analytics scenarios, the dependent variable may
not be continuous.
 In other scenarios, it may be unnecessary to make a point
prediction.
 It is possible to convert a regression problem into a
classification problem.



Limitations of Linear Regression Models
(cont.)
 Linear regression requires a linear relationship between the dependent and independent variables:
 The assumption that there is a straight-line relationship
between them does not always hold.
 Linear regression models only look at the mean of the
dependent variable:
 E.g., in the relationship between the birth weight of infants and
maternal characteristics such as age:
 Linear regression will look at the average weight of babies born to
mothers of different ages.
 But sometimes we need to look at the extremes of the dependent
variable, e.g., babies are at risk when their weights are low.



Limitations of Linear Regression Models
(cont.)
 Linear regression is sensitive to outliers:
 Outliers can have huge effects on the regression.
 Data must be independent:
 Linear regression assumes that the data are independent:
 I.e., the scores of one subject (such as a person) have nothing to do
with those of another.
 This assumption does not always make sense:
 E.g., students in the same class tend to be similar in many ways, such as coming from the same neighborhoods and being taught by the same teachers.
 In the above example, the students are not independent.



Parametric versus Non-parametric
 Linear regression is parametric:
 Assumes that sample data comes from a population that can
be adequately modelled by a probability distribution that has a
fixed set of parameters.
 Assumptions can greatly simplify the learning process, but can
also limit what can be learned.
 Parametric ML algorithms:
 Algorithms that simplify the function to a known form.
 Non-parametric ML algorithms:
 Algorithms that do not make strong assumptions about the
form of the mapping function.
 Free to learn any functional form from the training data.
Parametric versus Non-parametric
(cont.)
 Non-parametric ML methods are good when:
 You have a lot of data and no prior knowledge.
 You do not want to worry too much about choosing just the
right features.
 Classification algorithms include both parametric and
non-parametric:
 Parametric – Logistic Regression, Linear Discriminant Analysis,
Perceptron, Naive Bayes, Simple Neural Networks
 Non-parametric – k-Nearest Neighbors, Decision Trees,
Support Vector Machines



Data Mining Goes to Hollywood
 Data mining scenario – Predicting the box-office receipt
(i.e., financial success) of a particular movie.
 Problem:
 Traditional approach:
 Frames it as a forecasting (or regression) problem.
 Attempts to predict the point estimate of a movie’s box-office receipt.
 Sharda and Delen’s (2006) approach:
 Convert the regression problem into a multinomial classification
problem.
 Classify a movie based on its box-office receipts into one of nine
categories, ranging from “flop” to “blockbuster”.
 Use variables representing different characteristics of a movie to train
various classification models.



Overview of Classification
 Classification models:
 Supervised learning methods for predicting the value of a categorical target variable.
 In contrast, regression models deal with a numerical (or continuous) target variable.
 Aim of classification models:
 Generate a set of rules from past observations with known
target class.
 Rules are used to predict the target class of future
observations.
 Classification holds a prominent position in learning
theory.
Overview of Classification (cont.)
 From a theoretical viewpoint:
 Classification algorithm development represents a fundamental
step in emulating inductive capabilities of the human brain.
 From a practical viewpoint:
 Classification is applicable in many different domains.
 Examples:
 Selection of target customers for a marketing campaign.
 Fraud detection.
 Image recognition.
 Early diagnosis of disease.
 Text cataloguing.
 Spam email recognition.



Classification Problems
 We have a dataset D containing m observations
described in terms of n explanatory variables and a
categorical target variable (a class or label).
 The observations are also termed examples or instances.
 The target variable takes a finite number of values:
 Binary classification – The instances belong to two classes
only.
 Multiclass or multicategory classification – There are
more than two classes in the dataset.



Classification Problems (cont.)
 A classification problem consists of defining an appropriate hypothesis space $F$ and an algorithm $A_F$ that identifies a function $f^* \in F$ that can optimally describe the relationship between the predictive variables and the target class.
 $F$ is a class of functions $f(\mathbf{x}): \mathbb{R}^n \to H$, called hypotheses, that represent hypothetical relationships of dependence between $y_i$ and $\mathbf{x}_i$:
 $\mathbf{x} \in \mathbb{R}^n$ is the vector of values taken by the predictive variables for an instance.
 $H$ could be $\{0, 1\}$ or $\{-1, +1\}$ for a binary classification problem.
Components of a Classification Problem
 Generator – Extracts random vectors $\mathbf{x}$ of data instances.
 Supervisor – For each vector $\mathbf{x}$, returns the value $y$ of the target class.
 Classification algorithm (or classifier) – Chooses a function $f^* \in F$ in the hypothesis space so as to minimize a suitably defined loss function.

[Diagram: the Generator produces instance vectors x, the Supervisor returns the target class y, and the Classification Algorithm learns a prediction function f(x).]
Development of a Classification Model
 Development of a classification model consists of three
main phases.
 Training phase:
 The classification algorithm is applied to the instances
belonging to a subset T of the dataset D .
 T is called the training data set.
 Classification rules are derived that allow users to assign a predicted class to each observation $\mathbf{x}$.
 Test phase:
 The rules generated in the training phase are used to classify
observations in D but not in T .
 Accuracy is checked by comparing the actual target class with
the predicted class for all instances in V = D − T .



Development of a Classification Model
(cont.)
 Observations in V form the test set.
 The training and test sets are disjoint: $V \cap T = \emptyset$.
 Prediction phase:
 The actual use of the classification model to assign target class
to completely new observations.
 This is done by applying the rules generated during the training
phase to the variables of the new instances.
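
To make the three phases concrete, below is a minimal sketch, assuming scikit-learn, a decision tree classifier, and the bundled Iris dataset purely for illustration:

```python
# Minimal sketch of the training, test and prediction phases.
# Assumptions: scikit-learn is available; the Iris dataset and the
# decision tree are illustrative choices only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Training phase: derive classification rules from the training set T.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Test phase: check accuracy on the disjoint test set V = D - T.
print("Test accuracy:", clf.score(X_test, y_test))

# Prediction phase: assign a target class to a completely new observation.
new_instance = [[5.1, 3.5, 1.4, 0.2]]  # hypothetical new instance
print("Predicted class:", clf.predict(new_instance))
```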



Development of a Classification Model
(cont.)

[Figure: model development workflow – the training data yield classification rules through training and tuning; the test data are used for accuracy assessment of those rules; the validated rules (knowledge) are then applied to new data in the prediction phase.]


Taxonomy of Classification Models
 Heuristic models:
 Classification is achieved by applying simple and intuitive
algorithms.
 Examples:
 Classification trees – Apply a divide-and-conquer technique to obtain groups of observations that are as homogeneous as possible with respect to the target variable.
 Also known as decision trees.
 Nearest neighbor methods – Based on the concept of distance
between observations.
 Separation models:
 Divide the variable space into H distinct regions.
 All observations in a region are assigned the same class.
Taxonomy of Classification Models
(cont.)
 How to determine these regions? They should be neither too complex or numerous, nor too simple or few.
 Define a loss function to take into account the misclassified
observations and apply an optimization algorithm to derive a
subdivision into regions that minimizes the total loss.
 Examples – Discriminant analysis, perceptron methods, neural
networks (multi-layer perceptron) and support vector
machines (SVM).
 Regression models:
 Logistic regression is an extension of linear regression suited to handling binary classification problems.
 Main idea – Convert a binary classification problem, via a proper transformation, into a linear regression problem.
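
As a sketch of this idea, a logistic regression classifier fitted on synthetic data (the dataset and parameters below are illustrative assumptions, not part of the lecture):

```python
# Logistic regression for binary classification (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
model = LogisticRegression().fit(X, y)

# The model outputs class probabilities; the class with the higher
# probability becomes the predicted label.
print(model.predict_proba(X[:3]))
print(model.predict(X[:3]))
```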
Taxonomy of Classification Models
(cont.)
 Probabilistic models:
 A hypothesis is formulated regarding the functional form of the conditional probabilities $P_{\mathbf{x}|y}(\mathbf{x} \mid y)$ of the observations given the target class:
 These are known as class-conditional probabilities.
 Based on an estimate of the prior probabilities $P_y(y)$ and using Bayes' theorem, calculate the posterior probabilities $P_{y|\mathbf{x}}(y \mid \mathbf{x})$ of the target class.
 Examples – Naive Bayes classifiers and Bayesian networks.
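
A minimal Gaussian Naive Bayes sketch of this idea (the data below are made up for illustration):

```python
# Gaussian Naive Bayes: assumes Gaussian class-conditional probabilities
# P(x | y), estimates the priors P(y) from the data, and applies Bayes'
# theorem to obtain the posteriors P(y | x). Illustrative sketch only.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [1.2, 1.9], [7.8, 8.2], [8.1, 7.9]])
y = np.array([0, 0, 1, 1])

nb = GaussianNB().fit(X, y)
print(nb.class_prior_)                 # estimated priors P(y)
print(nb.predict_proba([[7.5, 8.0]]))  # posteriors P(y | x) for a new point
```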



Evaluation of Classification Models
 In a classification analysis:
 It is advisable to develop alternative classification models.
 The model that affords the best prediction accuracy is then
selected.
 To obtain alternative models:
 Different classification methods may be used.
 The values of the parameters may also be modified.
 Accuracy:
 The proportion of the observations that are correctly
classified by the model.
 Usually, one is more interested in the accuracy of the model on
the test data set V .



Evaluation of Classification Models
(cont.)
 If $y_i$ denotes the class of the generic observation $\mathbf{x}_i \in V$ and $f(\mathbf{x}_i)$ the class predicted through the function $f \in F$ identified by the learning algorithm $A = A_F$, the following loss function can be defined:

$$L(y_i, f(\mathbf{x}_i)) = \begin{cases} 0, & \text{if } y_i = f(\mathbf{x}_i) \\ 1, & \text{if } y_i \neq f(\mathbf{x}_i) \end{cases}$$

 The accuracy of model $A$ can be evaluated as:

$$\mathrm{acc}_A(V) = \mathrm{acc}_{A_F}(V) = 1 - \frac{1}{v} \sum_{i=1}^{v} L(y_i, f(\mathbf{x}_i))$$

where $v$ is the number of observations in $V$.
 The proportion of errors made is defined as:

$$\mathrm{err}_A(V) = \mathrm{err}_{A_F}(V) = 1 - \mathrm{acc}_{A_F}(V) = \frac{1}{v} \sum_{i=1}^{v} L(y_i, f(\mathbf{x}_i))$$
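
The same quantities can be computed directly; a small sketch with NumPy (the labels below are illustrative):

```python
# Accuracy as one minus the average 0-1 loss over the test set V.
import numpy as np

y_true = np.array([1, -1, 1, 1, -1, -1])   # actual classes y_i
y_pred = np.array([1, -1, -1, 1, -1, 1])   # predicted classes f(x_i)

loss = (y_true != y_pred).astype(int)  # L(y_i, f(x_i)) is 0 or 1
accuracy = 1 - loss.mean()             # acc_A(V)
error = loss.mean()                    # err_A(V) = 1 - acc_A(V)
print(accuracy, error)                 # 0.666..., 0.333...
```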



Evaluation of Classification Models
(cont.)
 Speed:
 Long computation times on large datasets can be reduced by means of a random sampling scheme.
 Robustness:
 The method is robust if the classification rules generated and
the corresponding accuracy do not vary significantly as the
choice of training data and test datasets varies.
 It must also be able to handle missing data and outliers well.
 Scalability:
 Able to learn from large datasets.
 Interpretability:
 Generated rules should be simple and easily understood by
knowledge workers and domain experts.
Holdout Method
 Divide the available m observations in the dataset D into
training dataset T and test dataset V .
 The $t$ observations in $T$ are usually obtained by random selection.
 It is suggested that $T$ contain between one half and two thirds of the total number of observations in $D$.
 The accuracy of the classification algorithm via the
holdout method depends on the test set V .
 In order to better estimate accuracy, different strategies
have been recommended.
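
A holdout split sketch (assuming scikit-learn; reserving one third of the observations for $V$, in line with the guideline above):

```python
# Holdout method: randomly select two thirds of D as the training set T,
# keeping the remaining third as the test set V (illustrative sketch).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0)
print(len(X_train), len(X_test))  # 100 training, 50 test observations
```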



Repeated Random Sampling
 Simply replicate the holdout method r times.
 For each repetition $k = 1, 2, \ldots, r$:
 A random training dataset $T_k$ having $t$ observations is generated.
 Compute $\mathrm{acc}_{A_F}(V_k)$, the accuracy of the classifier on the corresponding test set $V_k$, where $V_k = D - T_k$.
 Compute the average accuracy as:

$$\mathrm{acc}_A = \mathrm{acc}_{A_F} = \frac{1}{r} \sum_{k=1}^{r} \mathrm{acc}_{A_F}(V_k)$$
 Drawback – There is no control over the number of times each observation may appear; outliers may cause undesired effects on the rules generated and on the resulting accuracy.
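
A sketch of repeated random sampling (assuming scikit-learn; the classifier and the choice of $r$ are illustrative):

```python
# Repeated random sampling: replicate the holdout method r times and
# average the accuracies on the test sets V_k (illustrative sketch).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
r = 20
scores = []
for k in range(r):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1 / 3, random_state=k)  # random T_k and V_k
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))        # accuracy on V_k
print("Average accuracy:", np.mean(scores))
```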
Cross-validation
 Divide the data into $r$ disjoint subsets $L_1, L_2, \ldots, L_r$ of (almost) equal size.
 For iterations $k = 1, 2, \ldots, r$:
 Let the test set be $V_k = L_k$,
 and the training set be $T_k = D - L_k$.
 Compute $\mathrm{acc}_{A_F}(V_k)$.
 Compute the average accuracy:

$$\mathrm{acc}_A = \mathrm{acc}_{A_F} = \frac{1}{r} \sum_{k=1}^{r} \mathrm{acc}_{A_F}(V_k)$$
 Usual value for r is r = 10 (i.e., ten-fold cross-validation).
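
A ten-fold cross-validation sketch (assuming scikit-learn; the classifier is an illustrative choice):

```python
# Ten-fold cross-validation: each fold L_k serves as the test set V_k
# exactly once; report the average accuracy (illustrative sketch).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("Per-fold accuracy:", scores)
print("Average accuracy:", scores.mean())
```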



Cross-validation (cont.)
 Also known as k-fold cross-validation or rotation
estimation.
[Figure: illustration of ten-fold cross-validation – the data are divided into ten folds L1 to L10; in each of the ten iterations a different fold serves as the test set while the remaining nine form the training set.]



Leave-One-Out
 Cross-validation method with the number of iterations $r$ set to $m$.
 This means each of the $m$ test sets consists of only one sample, and the corresponding training dataset consists of $m - 1$ samples.
 Intuitively, every observation is used for testing exactly once, and as many models are developed as there are data points.
 A time-consuming methodology, but a viable option for small datasets.
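
A leave-one-out sketch (assuming scikit-learn; k-nearest neighbors is an illustrative choice):

```python
# Leave-one-out: cross-validation with r = m, so each test set V_k is a
# single observation and each model trains on m - 1 (illustrative sketch).
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
print("Average accuracy over m models:", scores.mean())
```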



Stratified Random Sampling
 Instead of random sampling to partition the dataset $D$ into training set $T$ and test set $V$, stratified random sampling can be used to ensure that the proportion of observations belonging to each target class is the same in both $T$ and $V$.
 In cross-validation, each subset $L_k$ should also contain the same proportion of observations belonging to each target class.

[Figure: illustration of stratified ten-fold cross-validation – each fold L1 to L10 preserves the class proportions; in this example, purple/blue marks class 0 and red marks class 1.]
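
A sketch of both stratified variants (assuming scikit-learn):

```python
# Stratified partitioning preserves the class proportions of D in each
# subset (illustrative sketch using scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = load_iris(return_X_y=True)

# Stratified holdout: class proportions are the same in T and V.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0)

# Stratified ten-fold cross-validation: every fold L_k preserves the
# class proportions of the full dataset.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    pass  # train on X[train_idx], evaluate on X[test_idx]
```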



Confusion Matrices
 In many situations, just computing the accuracy of the
classifier may not be sufficient:
 Example 1 – Medical Domain:
 The value of 1 means the patient has a given medical condition and -1
means the patient does not.
 If only 2% of all patients in the database have the condition, then we
achieve an accuracy rate of 98% by having the trivial rule that “the
patient does not have the condition”.
 Example 2 – Customer Retention:
 The value of 1 means the customer has cancelled the service, 0 means
the customer is still active.
 If only 2% of the available data correspond to customers who have
cancelled the service, the trivial rule “the customer is still active” has
an accuracy rate of 98%.
Confusion Matrices (cont.)
 Confusion matrix for a binary target variable encoded with the class values $\{-1, +1\}$:
 Accuracy – Among all instances, what is the proportion that are correctly predicted?

$$\mathrm{acc} = \frac{p + v}{p + q + u + v} = \frac{p + v}{m}$$

                           Predictions
                    -1 (Negative)   +1 (Positive)   Total
   Instances
   -1 (Negative)          p               q          p + q
   +1 (Positive)          u               v          u + v
   Total                p + u           q + v          m
Confusion Matrices (cont.)
 True negative rate – Among all negative instances, the proportion of correct predictions:

$$tn = \frac{p}{p + q}$$

 False negative rate – Among all positive instances, the proportion of incorrect predictions:

$$fn = \frac{u}{u + v}$$

 False positive rate – Among all negative instances, the proportion of incorrect predictions:

$$fp = \frac{q}{p + q}$$

 True positive rate – Among all positive instances, the proportion of correct predictions (also known as recall):

$$tp = \frac{v}{u + v}$$



Confusion Matrices (cont.)
 Precision – Among all positive predictions, the proportion of actual positive instances:

$$prc = \frac{v}{q + v}$$

 Geometric mean is defined as:

$$gm = \sqrt{tp \times prc}$$

and sometimes also as:

$$gm = \sqrt{tp \times tn}$$

 F-measure is defined as:

$$F = \frac{(\beta^2 + 1)\, tp \times prc}{\beta^2\, prc + tp}$$

where $\beta \in [0, \infty)$ regulates the relative importance of the precision with respect to the true positive rate. The F-measure is also equal to 0 if all the predictions are incorrect.
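
A sketch computing all of these measures directly from the confusion-matrix counts (the values of p, q, u and v below are made up for illustration):

```python
# Confusion-matrix measures from the counts p, q, u, v defined above.
import math

p, q = 50, 10  # actual negatives: correctly (p) / incorrectly (q) classified
u, v = 5, 35   # actual positives: incorrectly (u) / correctly (v) classified
m = p + q + u + v

acc = (p + v) / m      # accuracy
tn = p / (p + q)       # true negative rate
fn = u / (u + v)       # false negative rate
fp = q / (p + q)       # false positive rate
tp = v / (u + v)       # true positive rate (recall)
prc = v / (q + v)      # precision

beta = 1.0             # beta = 1 weights precision and recall equally
f_measure = (beta**2 + 1) * tp * prc / (beta**2 * prc + tp)
gm = math.sqrt(tp * prc)  # geometric mean
print(acc, tp, prc, f_measure, gm)
```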
ROC Curve Charts
 Receiver operating characteristic (ROC) curve
charts:
 Allow the user to visually evaluate the accuracy of a classifier
and to compare different classification models.
 Visually express the information content of a sequence of
confusion matrices.
 Allow the ideal trade-off to be assessed between:
 The number of correctly classified positive observations – the true positive rate on the y-axis.
 The number of incorrectly classified negative observations – the false positive rate on the x-axis.



ROC Curve Charts (cont.)

[Figure: example ROC curve chart.]


ROC Curve Charts (cont.)
 ROC curve chart is a two-dimensional plot:
 $fp$ on the horizontal axis and $tp$ on the vertical axis.
 The point $(0, 1)$ represents the ideal classifier.
 The point $(0, 0)$ corresponds to a classifier that predicts class $-1$ for all samples.
 The point $(1, 1)$ corresponds to a classifier that predicts class $+1$ for all samples.
 Parameters in a classifier may be adjusted to increase $tp$, but at the cost of also increasing $fp$.
 A classifier with no parameters to be (further) tuned yields only one point on the chart.
 The area beneath the ROC curve provides a means to compare the accuracy of various classifiers:
 The ROC curve with the greatest area is preferable.
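
A ROC curve sketch (assuming scikit-learn and matplotlib; the synthetic data and the logistic regression scorer are illustrative):

```python
# ROC curve: plot tp against fp over all decision thresholds and compute
# the area under the curve (AUC). Illustrative sketch.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))

plt.plot(fpr, tpr, label="classifier")                 # the ROC curve
plt.plot([0, 1], [0, 1], "--", label="random guess")   # (0,0) to (1,1)
plt.xlabel("False positive rate (fp)")
plt.ylabel("True positive rate (tp)")
plt.legend()
plt.show()
```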
