
Basics of Classification

(Adapted from various sources. These slides are for teaching purposes only.)
BASIC IDEA
§ In the exams, the student's grade was assigned based on their marks as follows:

Rules:
  Mark ≥ 90 : A
  90 > Mark ≥ 80 : B
  80 > Mark ≥ 70 : C
  70 > Mark ≥ 60 : D
  60 > Mark : F

Here the classification is done based on a simple rule!!!
§ We apply a rule / set of rules to classify the data.
§ Classification is a technique for describing important data classes based on some rules.
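To make the idea concrete, here is a minimal Python sketch of the grading rule above (illustrative only; the code is not part of the original slides):

```python
def grade(mark: float) -> str:
    """Classify a mark into a letter grade using the simple rules above."""
    if mark >= 90:
        return "A"
    elif mark >= 80:
        return "B"
    elif mark >= 70:
        return "C"
    elif mark >= 60:
        return "D"
    else:
        return "F"

print(grade(85))  # -> "B"
```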

Classification!
§ The classes are mutually exclusive and exhaustive.
§ This indicates that each object can be assigned to precisely one class.
Applications
§ Science
§ Finance
§ Medical
§ Security
§ Prediction
§ Entertainment
§ Social media
§ And more….
Classification types
§ Supervised Classification
§ We already know the set of possible classes.

§ Unsupervised Classification
§ It is called clustering.
§ We don't know the classes or the number of possible classes.
§ We try to categorize based on some rule, which may not serve our purpose at all.
Image taken from www.webstockreview.net
Points to remember
§ Classification is a supervised technique.
§ A good classifier depends on the two factors below:
§ We need rules for classification.
§ We need a teacher.
How to proceed?

§ Training set (the teacher)
§ Collection of records with a set of attributes and one class label.

The image is taken from https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html
§ Develop a model for the class in terms of the other attributes, using the training set.
§ Define the rules.
Classifiers
§ Statistics based - Bayesian
§ Distance based - KNN
§ Decision tree based - CART
§ Machine learning based - SVM
§ Neural network based - CNN
§ The ‘idiot’ or ‘simple’ classifier.
§ Based on statistics.
§ Empirically proven to be useful.
§ Scales very well.
Naïve Bayes § Predicts class membership probabilities
§ Based on Bayes’ Theorem.
§ The attributes are independent given the
class.
In this database (the Iris dataset), there are four attributes
A = [Sepal length, Sepal width, Petal length, Petal width]
with 150 samples.

The categories of classes are:
C = [Iris Versicolor, Iris Setosa, Iris Virginica]

Given this knowledge of the data and classes, we are to find the most likely classification for any unseen instance.
Why Statistics?
§ In many applications, an unknown sample cannot be classified to a class label with certainty.
§ In such a situation, the classification can be achieved probabilistically.
§ In a Bayesian classifier, we try to model the probabilistic relationships between the attribute set and the class variable.
§ Bayesian classifiers use Bayes' Theorem of probability for classification.
Another Application: Digit Recognition

[Figure: a digit image fed into a Classifier, which outputs "5"]

§ X1, …, Xn ∈ {0, 1} (blue vs. red pixels)
§ Y ∈ {5, 6} (predict whether a digit is a 5 or a 6)

§ A good strategy is to predict the probability that the image represents a 5 given its pixels.
The Bayes Classifier
§ A good strategy is to predict the probability that the image represents a 5 given its pixels.
§ So … how do we compute that? Using Bayes' Theorem:

P(Y | X1, …, Xn) = P(X1, …, Xn | Y) · P(Y) / P(X1, …, Xn)

where P(X1, …, Xn | Y) is the likelihood, P(Y) is the prior, and P(X1, …, Xn) is the normalization constant.

How?
§ To classify, we simply compute these two probabilities, P(Y = 5 | X1, …, Xn) and P(Y = 6 | X1, …, Xn), and predict based on which one is greater.
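As a hedged sketch of this idea in practice (the slides do not prescribe any library; scikit-learn and the Iris data are assumptions here), a Gaussian Naïve Bayes classifier computes exactly these class posteriors and predicts the larger one:

```python
# A minimal sketch using scikit-learn (assumed; not prescribed by the slides).
# GaussianNB treats attributes as independent Gaussians given the class,
# i.e., the naive Bayes assumption stated earlier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)  # 150 samples, 4 attributes, 3 classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

clf = GaussianNB().fit(X_train, y_train)

# predict_proba returns P(class | attributes) for each class;
# the prediction is simply the class with the greatest posterior.
print(clf.predict_proba(X_test[:1]))
print(clf.predict(X_test[:1]))
```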
CLASSIFIERS

[Diagram: feature values X1, X2, X3, …, Xn are fed into a Classifier, which outputs a category Y. The classifier is built from a DB: a collection of instances with known categories.]
EXAMPLE 1
Determining the decision on a scholarship application based on the following features:

§ Household income (annual income in millions of pesos)
§ Number of siblings in family
§ High school grade (on a QPI scale of 1.0 – 4.0)

Intuition (reflected in the data set): award scholarships to high performers and to those with financial need.
INSTANCE-BASED LEARNING
Ø Instance-based learning is often termed lazy learning, as there is typically no "transformation" of training instances into more general "statements".

Ø Instead, the presented training data is simply stored and, when a new query instance is encountered, a set of similar, related instances is retrieved from memory and used to classify the new query instance.

Ø Hence, instance-based learners never form an explicit general hypothesis regarding the target function. They simply compute the classification of each new query instance as needed.
K-NN APPROACH
The simplest, most used instance-based learning algorithm is the k-NN algorithm.

k-NN assumes that all instances are points in some n-dimensional space and defines neighbors in terms of distance (usually Euclidean distance in n-dimensional real space).

k is the number of neighbors considered.
K-NN APPROACH
Ø Unlike all the previous learning methods, k-NN does not build a model from the training data.

Ø To classify a test instance d, define the k-neighborhood P as the k nearest neighbors of d.

Ø Count the number n of training instances in P that belong to class cj.

Ø Estimate Pr(cj | d) as n/k.

Ø No training is needed. Classification time is linear in the training set size for each test case.
K-NEAREST-NEIGHBORS
WHAT IS THE MOST POSSIBLE LABEL FOR c?
Solution: Look for the k nearest neighbors of c and take the majority label as c's label.
Let's suppose k = 3.
The 3 nearest points to c are: a, a, and o.
Therefore, the most possible label for c is a.
SIMPLE ILLUSTRATION: THE COMPLEXITY

[Figure: a query point q surrounded by + and – training points]

What is the class of q?
q is + under 1-NN, but – under 5-NN.
K-NEAREST-NEIGHBORS ALGORITHM

Get: For a given instance T, get the top k instances from the dataset that are "nearest" to T.
• Select a reasonable distance measure.

Inspect: Inspect the categories of these k instances and choose the category C that represents the most instances.

Conclude: Conclude that T belongs to category C.

A case is classified by a majority vote of its neighbors, with the case being assigned to the class most common among its k nearest neighbors, as measured by a distance function. If k = 1, then the case is simply assigned to the class of its nearest neighbor.
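The algorithm above fits in a few lines of Python with NumPy (a hedged sketch assuming Euclidean distance; the data points are made up for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points, using Euclidean distance."""
    # Get: distances from the query to every training instance
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]  # indices of the k nearest neighbors
    # Inspect: count the labels of those k neighbors
    votes = Counter(y_train[i] for i in nearest)
    # Conclude: the majority label wins
    return votes.most_common(1)[0][0]

# Hypothetical points mirroring the a/a/o example above
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0]])
y = np.array(["a", "a", "o"])
print(knn_predict(X, y, np.array([1.1, 0.9]), k=3))  # -> "a"
```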
Evaluation of Classification Models

Adapted from Václav Hlaváč, Czech Technical University, Prague.
Performance of a learned classifier?

— Classifiers (both supervised and unsupervised) are learned (trained) on a finite training multiset (named simply "training set" in the sequel for simplicity).
— A learned classifier has to be tested experimentally on a different test set.
— In run mode, the classifier performs on different data than that on which it has learned.
— The experimental performance on the test data is a proxy for the performance on unseen data. It checks the classifier's generalization ability.
— There is a need for a criterion function assessing the classifier's performance experimentally, e.g., its error rate, accuracy, or expected Bayesian risk (to be discussed later).
— There is a need for comparing classifiers experimentally.
Evaluation as hypothesis testing

— Evaluation has to be treated as hypothesis testing in statistics.
— The value of the population parameter has to be statistically inferred based on the sample statistics (i.e., a training set in pattern recognition).
Danger of overfitting

— Learning the training data too precisely usually leads to poor classification results on new data.

— A classifier has to have the ability to generalize.


Training vs. test data

— Problem: Only finite data are available, and they have to be used both for training and testing.

— More training data gives better generalization. More test data gives a better estimate of the classification error probability.

— Never evaluate performance on training data.
• The conclusion would be optimistically biased.
Training vs. test data

— Hold out: partitioning of the available finite set of data into training / test sets.

— Bootstrap and cross validation.

— Once evaluation is finished, all the available data can be used to train the final classifier.
Hold out method

— Given data is randomly partitioned into two independent sets.
— Training multiset (e.g., 2/3 of the data) is used for the statistical model construction, i.e., learning the classifier.
— Test set (e.g., 1/3 of the data) is held out for the accuracy estimation of the classifier.
— Random sampling is a variation of the hold out method: repeat the hold out k times; the accuracy is estimated as the average of the accuracies obtained.
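A minimal sketch of the hold out method with scikit-learn (the library, model, and dataset are illustrative assumptions, not from the slides):

```python
# Hold out: 2/3 of the data for training, 1/3 held out for testing.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)  # random partition

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("hold-out accuracy:", clf.score(X_test, y_test))
```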
K-fold cross validation

— The training set is randomly divided into K disjoint sets of equal size, where each part has roughly the same class distribution.

— The classifier is trained K times, each time with a different set held out as a test set.

— The estimated error is the mean of these K errors.
Graphical Example
Leave-one-out

— A special case of K-fold cross validation with K = n, where n is the total number of samples in the training multiset.
— n experiments are performed, using (n − 1) samples for training and the remaining sample for testing.
— It is rather computationally expensive.
— Leave-one-out cross-validation does not guarantee the same class distribution in training and test data!
— The extreme case:
• 50% class A, 50% class B. Predict the majority class label in the training data. True error 50%; leave-one-out error estimate 100%! (Removing one sample makes its class the minority among the remaining n − 1, so every held-out sample is misclassified.)
Bootstrap aggregating

— The bootstrap uses sampling with replacement to form the training set.
— Let the training set T consist of n entries.
— Bootstrap generates m new datasets Ti, each of size n′ < n, by sampling T uniformly with replacement. The consequence is that some entries can be repeated in Ti.
— The m statistical models (e.g., classifiers, regressors) are learned using the above m bootstrap samples.
— The statistical models are combined, e.g., by averaging the output (for regression) or by voting (for classification).
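A short Python sketch of bootstrap aggregating (bagging); the base model, the value of m, and the dataset are assumptions for illustration:

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

m, n = 10, len(X)  # m bootstrap samples drawn from n entries
models = []
for _ in range(m):
    # Sample with replacement, so some entries repeat in Ti
    # (size n here for simplicity; the slide allows n' < n).
    idx = rng.integers(0, n, size=n)
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Combine the m classifiers by voting
def bagged_predict(x):
    votes = Counter(int(mdl.predict([x])[0]) for mdl in models)
    return votes.most_common(1)[0][0]

print(bagged_predict(X[0]), "true:", y[0])
```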
Recommended experimental validation procedure

— Use K-fold cross-validation (K = 5 or K = 10) for estimating performance (accuracy, etc.).

— Compute the mean value of the performance estimate, its standard deviation, and confidence intervals.

— Report mean values of performance estimates and their standard deviations or 95% confidence intervals around the mean.
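A sketch of this recommended procedure with scikit-learn (the model and dataset are illustrative assumptions):

```python
# 10-fold cross-validation; report the mean accuracy with a rough 95% CI.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=10)

mean, std = scores.mean(), scores.std()
# Normal approximation over the K fold scores (a simplification)
ci95 = 1.96 * std / np.sqrt(len(scores))
print(f"accuracy: {mean:.3f} +/- {ci95:.3f} (std {std:.3f})")
```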
Criterion function to assess classifier performance

— Accuracy and error rate
• Accuracy is the percent of correct classifications.
• Error rate is the percent of incorrect classifications.
• Accuracy = 1 − Error rate.

— Problems with the accuracy:
• Assumes equal costs for misclassification.
• Assumes a relatively uniform class distribution.

— Other characteristics derived from the confusion matrix.
— Expected Bayesian risk.
Confusion matrix, two classes only

[Confusion matrix with entries a = TN (true negative), b = FP (false positive), c = FN (false negative), d = TP (true positive)]

— Accuracy = (a + d)/(a + b + c + d) = (TN + TP)/total
— Precision, predicted positive value = d/(b + d) = TP/predicted positive
— True positive rate, recall, sensitivity = d/(c + d) = TP/actual positive
— Specificity, true negative rate = a/(a + b) = TN/actual negative
— False positive rate, false alarm = b/(a + b) = FP/actual negative = 1 − specificity
— False negative rate = c/(c + d) = FN/actual positive
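These formulas can be checked with a few lines of Python (the counts a, b, c, d are made up for illustration):

```python
# a = TN, b = FP, c = FN, d = TP, matching the slide's notation
a, b, c, d = 50, 10, 5, 35  # hypothetical counts

accuracy    = (a + d) / (a + b + c + d)
precision   = d / (b + d)   # predicted positive value
recall      = d / (c + d)   # true positive rate, sensitivity
specificity = a / (a + b)   # true negative rate
fpr         = b / (a + b)   # false alarm rate = 1 - specificity
fnr         = c / (c + d)   # false negative rate

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} specificity={specificity:.2f}")
```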
Confusion matrix, # of classes > 2
Here is the summary!!!

— Any ML/AI model depends on the training data set.

— Class balancing is important.

— Validation is important!

— And we also need some measure for validating the data.
THANK YOU FOR LISTENING

Questions???
