
Introduction to supervised learning

Nistor Grozavu

LIPN - CNRS UMR 7030

2020/2021



What is supervised learning?

Supervised learning is a machine learning task whose aim is to learn a model from labeled training data in order to classify similar unlabeled data. Potential applications of supervised learning include:
Regression (Cf. previous Lecture)
Prediction
Classification
Speech and hand-writing processing
Pattern Recognition
Rule Mining



Supervised Learning workflow
In supervised learning, we distinguish 2 types of data:
The training data, subdivided into 2 subsets:
The Training set, on which the labels are known and which will be used to build the model.
The Validation set, on which the labels are also known and which can be used to rate the model and improve it if necessary (not used by all algorithms).
The test data, or Test set, on which the model will be used to find unknown labels.



Supervised Learning data sets examples

Supervised data are described by several attributes and a target class (or label), which is known in the training and validation sets but unknown in the test set.

Table: Training data
Size  Weight  Shoe size  Sex
176   72      43         M
159   61      37         F
180   66      39         F
185   85      44         M
177   70      41         F
155   88      38         M
210   110     45         M

Table: Test data
Size  Weight  Shoe size  Sex
205   85      47         ?
172   60      40         ?
164   57      38         ?
169   52      36         ?
183   78      42         ?
175   65      44         ?
191   77      41         ?



Supervised Learning Example



The notion of classifier

Apart from regression applications, most supervised algorithms are known as classifiers.
Classifiers
A classifier learns a model in the form of a function, a set of logic rules, the parameters of a probabilistic model, the parameters of a neural network, a set of prototypes, etc.
The classifier will use the model it learned to label new and previously unknown data.

When the target attribute is an integer number, we usually refer to it as a class.
When the target attribute is categorical, we talk about labels.
When the target attribute is a real number, the process involved is a regression.



Formalism
Let X = {x1, · · · , xN}, xi ∈ X, denote the matrix of the N observed examples of the training set.
The xi are vectors containing the attributes of each object.
Let Y = {y1, · · · , yN}, yi ∈ [1..K], be the vector containing the labels/classes associated with the observed examples.
We note L = {(xi, yi), i ∈ [1..N]} the training set.
A classifier induced from the training set L will be denoted ψ(·, L). It is a function that associates a class with any vector xi from X:

ψ(·, L) : X → [1..K]

Applying ψ to a new object x from the test set will therefore return a class prediction:

ŷ = ψ(x, L)
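As a minimal illustration of this formalism, the sketch below (Python, assuming NumPy is available) builds ψ(·, L) as an actual function; the 1-nearest-neighbour rule is used only as a placeholder for the learning step, and the toy data are hypothetical.

```python
# Minimal sketch: a classifier psi(., L) learned from a labeled training set
# L = {(x_i, y_i)} and applied to a new point x. 1-NN is a placeholder for psi.
import numpy as np

def learn_psi(X_train, y_train):
    """Return psi(., L): a function mapping a vector x to a class in [1..K]."""
    def psi(x):
        distances = np.linalg.norm(X_train - x, axis=1)  # distance to every x_i
        return y_train[np.argmin(distances)]             # class of the closest example
    return psi

# Toy training set L (attributes are arbitrary numbers, classes are 1 and 2)
X_train = np.array([[1.0, 2.0], [8.0, 9.0], [1.5, 1.8]])
y_train = np.array([1, 2, 1])

psi = learn_psi(X_train, y_train)
y_hat = psi(np.array([2.0, 2.0]))   # y_hat = psi(x, L)
print(y_hat)                        # -> 1
```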



Types of models and classifiers

How are patterns and models expressed? There are 2 extremes:
Black box representation: The model, structure or function is impossible to grasp for a human unfamiliar with the generating algorithm.
Examples: Deep learning algorithms
White box representation: The model and its construction process are easy to understand and reveal which kind of structure to expect.
Examples: Decision trees, K-Nearest Neighbors
The different types of models come from various fields such as AI, statistics and database research.



The 5 steps of supervised learning

1. Decide on a training set that will be representative of the real-world use of your classifier.
2. Determine your input features and the representations that you want to use in your model.
3. Decide on the structure of your learning function, and choose a supervised algorithm compatible with this model.
4. Run the algorithm on your training set. If your algorithm allows it, do cross-validation checking and adjust your model.
5. Evaluate the accuracy of your algorithm and apply it on a separate test set.
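A possible sketch of these five steps with scikit-learn (assuming it is installed); the dataset, the choice of KNN, and all parameter values are illustrative only.

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)                      # 1. representative data
# 2. input features: here we simply keep all four attributes as-is
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
clf = KNeighborsClassifier(n_neighbors=5)              # 3. choose model/algorithm
scores = cross_val_score(clf, X_train, y_train, cv=5)  # 4. train + cross-validate
clf.fit(X_train, y_train)
print("CV accuracy:", scores.mean())
print("Test accuracy:", clf.score(X_test, y_test))     # 5. evaluate on the test set
```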



The bias-variance trade-off
Supervised learning faces several challenges.
The bias-variance trade-off in supervised learning
The bias-variance trade-off is a problem in which a supervised algorithm has to achieve simultaneously two seemingly incompatible goals:
Building a model that gives good results on the validation set.
Building a model that can generalize beyond the training set.

The bias is the error from erroneous assumptions in the learning algorithm or model. It usually results in the algorithm missing important links between the input variables and the output, thus leading to underfitting.
The variance is the error caused by a too high sensitivity of the model toward small variations in the training set. This results in overfitting and the model being unable to generalize to data outside of the training set.
Function complexity and amount of training data

The second issue is the available amount of training data compared to the complexity of the real model to be learned:
If the model is simple, a learning algorithm with a high bias and a low variance should be able to learn it from a small amount of data.
If the model is complex, it will only be learnable from a very large amount of training data and using a learning algorithm with a low bias and a high variance.

Remark
Good learning algorithms should be able to adjust their bias-
variance trade-off based on the amount of available data and the
apparent complexity of the model to be learned.



Picking the right input variables

The problem with too many input variables
Even if the real learning model depends only on a very small number of variables, the algorithm may never figure it out if it is flooded with a very high number of input variables.
The result may end up being a very complex and overfitting model.
Models with too many variables cannot easily be understood and interpreted.

A good understanding of your data and of the problem that you want to model will help remove irrelevant features.
Scaling your data may have a huge influence on the results.
It is important to check for correlation between the attributes and to remove redundant variables.
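A small sketch of the last two points, assuming pandas and scikit-learn are available; the column names and values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"size": [176, 159, 180, 185],
                   "weight": [72, 61, 66, 85],
                   "shoe_size": [43, 37, 39, 44]})

print(df.corr())                               # highly correlated columns are candidates for removal
X_scaled = StandardScaler().fit_transform(df)  # put all attributes on comparable scales
```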



Similarity and distance

Very much like in clustering, the distance function is a key element for most supervised classifiers.
Creating custom distance functions is sometimes required.

Euclidean distance: ||a − b||2 = sqrt( Σi (ai − bi)² )
Squared Euclidean distance: ||a − b||2² = Σi (ai − bi)²
Manhattan distance: ||a − b||1 = Σi |ai − bi|
Maximum distance: ||a − b||∞ = maxi |ai − bi|
Mahalanobis distance: sqrt( (a − b)ᵀ S⁻¹ (a − b) ), where S is the covariance matrix
Hamming distance: Hamming(a, b) = Σi (1 − δ(ai, bi))

Table: Examples of common distances
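These distances translate directly into NumPy; the sketch below assumes a and b are numeric feature vectors and that S is an invertible covariance matrix.

```python
import numpy as np

def euclidean(a, b):         return np.sqrt(np.sum((a - b) ** 2))
def squared_euclidean(a, b): return np.sum((a - b) ** 2)
def manhattan(a, b):         return np.sum(np.abs(a - b))
def maximum(a, b):           return np.max(np.abs(a - b))
def mahalanobis(a, b, S):    return np.sqrt((a - b) @ np.linalg.inv(S) @ (a - b))
def hamming(a, b):           return np.sum(a != b)   # counts positions where a_i != b_i
```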



1-Nearest Neighbor
The simplest and laziest classifier consists in using the training set itself as a model, without building or computing anything.
1-NN Classifier
“Learning” process: Remember all the observed examples.
Classification process: When a new data point arrives, find the most similar stored example (distance-wise) and assign it to the same class.



1-Nearest Neighbor

The 1-Nearest Neighbor classifier is sensitive to noise and prone to overfitting.

Figure: The 1-NN algorithm would assign this data to the red class. On
the other hand, a majority vote would assign it to the blue class.



K-Nearest Neighbors

The K-Nearest Neighbors algorithm (KNN) considers the K closest observed data points from the training set to decide on a class for an unlabeled data point. K is a parameter chosen by the user.

Figure: For K>1, the KNN algorithm would assign this unlabeled data to
the blue class.
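A from-scratch sketch of the KNN rule just described (majority vote among the K closest training points); the value of K and the toy data are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, K=3):
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:K]        # indices of the K closest examples
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]          # majority class among the K neighbours

X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array(["blue", "blue", "blue", "red", "red", "red"])
print(knn_predict(X_train, y_train, np.array([2, 2]), K=3))   # -> "blue"
```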



K-Nearest Neighbors

When there are only two classes, K is usually chosen as an odd number to avoid the two classes receiving an equal number of votes.



K-Nearest Neighbors: Weaknesses
K is a critical parameter that can quickly make the algorithm unstable:
Depending on K, the predicted class may change completely.



K-Nearest Neighbors: Weaknesses
With more than 2 classes, things can quickly become complicated...



K-Nearest Neighbors: Weaknesses

Because the distance between instances is based on all the attributes, less relevant attributes and even irrelevant ones are used in the classification of a new instance.

Because the algorithm delays all processing until a new classification/prediction is required, significant processing is needed to make each prediction.



Weighted Nearest Neighbors

The Weighted Nearest Neighbors classifier solves 2 of the previous problems by adding a weight wk to each neighbor.
Examples:

wk = 1/k if k ≤ K, and 0 if k > K

wk = 1/dist if k ≤ K, and 0 if k > K

Remark
The real Weighted Nearest Neighbors classifier uses a much more complex weight system that satisfies Σn=1..N wni = 1.
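A sketch of the weighted vote with the 1/dist scheme above; the small eps term is only there to avoid division by zero and is not part of the original formulation.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, K=3, eps=1e-12):
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:K]
    scores = {}
    for idx in nearest:
        w = 1.0 / (distances[idx] + eps)      # w_k = 1/dist for the K nearest, 0 otherwise
        scores[y_train[idx]] = scores.get(y_train[idx], 0.0) + w
    return max(scores, key=scores.get)        # class with the largest total weight
```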



K-Nearest Neighbors: Summary

Pros
Very simple and intuitive
Low Complexity
Great results with well-behaved classes

Cons
No model: No way to properly describe each class, and no possibility to reuse the knowledge.
Does not scale well, because it requires storing the whole training set.
Critical choice of the parameter K
Ill-adapted for categorical data



Learning without remembering all the data
The main issues of the KNN algorithm are that all the data have to be kept in memory and that it is unstable when classes are not well separated:
This is a problem with both large and complex datasets.

Idea
Instead of using all the data, we could use a prototype representing each class (like in the mean-shift and K-Means algorithms).
Can be learned incrementally.
Helps building a model.

Issues
Works only with spherical classes.
Doesn't work with classes that aren't well separated.
Learning without remembering all the data

Figure: A single prototype per class will never work here ...



Learning without remembering all the data

Figure: However, several prototypes per class could work !



Learning Vector Quantization algorithm

The LVQ algorithm (Kohonen) is a primitive neural network classifier that represents the classes from the training set using several prototypes per class.
It is closely related to both the K-Means and the KNN algorithms.
It is an early ancestor of the Self-Organizing Maps (Cf. Lecture 7).

Remark
In many neural network algorithms, prototypes learned from an iterative process are called neurons, due to their adaptive behavior and the fact that they do not represent a cluster or class on their own.



Learning Vector Quantization algorithm

Figure: Example of an LVQ algorithm using 5 prototypes per class (3 classes) – Elements of Statistical Learning, © Hastie et al. 2001



Learning Vector Quantization algorithm
1. Initialization:
Set up the initial M prototypes Z = {z1, · · · , zM}. This can be done randomly or using an initialization with the K-Means algorithm. Then use a majority vote to assign each prototype to a class C(zm).
Choose a learning rate ε ∈ [0, 1].
2. Go through the training set and update Z for each observation:
For each observation xi, find the nearest prototype zm.
If C(xi) = C(zm), move zm towards xi: zm ← zm + ε(xi − zm).
If C(xi) ≠ C(zm), move zm away from xi: zm ← zm − ε(xi − zm).
3. Repeat step 2 until convergence.
Optional: Reduce ε after each pass through step 2 to enhance convergence.

Remark
The learning rate ε is a critical parameter that can drastically change the outcome of the classification.
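A sketch of these steps (an LVQ1-style update), assuming NumPy; prototypes are initialized here by sampling labeled points at random rather than with K-Means, and all parameter values are illustrative.

```python
import numpy as np

def lvq_train(X, y, n_prototypes_per_class=2, eps=0.1, n_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    Z, C = [], []
    for c in np.unique(y):                         # step 1: initialize prototypes per class
        idx = rng.choice(np.where(y == c)[0], n_prototypes_per_class, replace=False)
        Z.append(X[idx]); C.extend([c] * n_prototypes_per_class)
    Z, C = np.vstack(Z).astype(float), np.array(C)

    for _ in range(n_epochs):                      # step 2: sweep the training set
        for xi, yi in zip(X, y):
            m = np.argmin(np.linalg.norm(Z - xi, axis=1))   # nearest prototype z_m
            if C[m] == yi:
                Z[m] += eps * (xi - Z[m])          # move towards x_i
            else:
                Z[m] -= eps * (xi - Z[m])          # move away from x_i
        eps *= 0.95                                # optional: shrink the learning rate
    return Z, C

def lvq_predict(Z, C, x):
    return C[np.argmin(np.linalg.norm(Z - x, axis=1))]      # 1-NN on the prototypes
```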
Learning Vector Quantization: Classification

Once the prototypes have been learned, the LVQ classifier behaves like the 1-NN algorithm, using the prototypes instead of the training data.

The class of each presented unlabeled data point is determined based on the class of the closest prototype.

Remark
Using LVQ, the prototypes can be trained (updated) in real time while being used on unlabeled data. This algorithm is therefore well suited for online learning.



LVQ: Summary

Pros
Low Complexity
Low memory consumption
Can deal with online and incremental data
Can build a good model
Is still easy and intuitive

Cons
Is often less accurate than KNN.
Critical choice of the learning rate parameter ε
Ill-adapted for categorical data



LVQ vs KNN
In the bias-variance trade-off, 1-NN tends toward variance while LVQ tends toward bias. KNN is somewhere in between, depending on the value of K.

(a) 1-NN (b) 15-NN



LVQ vs KNN

Use KNN when: You have a relatively small data set, you
don’t need to build a model, you don’t need to generalize
from your training set.

Use LVQ when: You have a large data set, you need to build a model, you are dealing with a semi-supervised problem, you need to learn data incrementally or online, you can afford a slightly lower accuracy or want a higher bias (lower variance).

Remark
For simple problems, both will work just fine.



Decision trees
Decision trees are very common classifiers that mine rules from the training set:
They are mostly applied to categorical data, but not only.
They decompose the feature space according to the most discriminating variable at each stage.
There is usually more than one possible tree per data set.



Decision trees

A decision tree is a tree-based representation of a discrete-valued function. It can be used as a decision support tool that uses a tree-like graph or model of decisions and their possible consequences. Decision tree learning is among the most commonly used classification methods. The main algorithms are ID3, ID4, C4.5 and C5.0.
Properties
Expressiveness: It can represent disjunctions of conjunctions
Readability: It can be translated as a set of decision rules

Note
Disjunction: A or B
Conjunction: A and B



Decision tree: Classification Process



Types of trees

Decision trees can be categorized according to three criteria:
The type of data: Numerical, Categorical, Mixed
The type of nodes: Binary leaves, multiple leaves
The overall shape of the tree

Figure: (c) Example of a binary tree; (d) Example of a numerical non-binary tree



Types of trees



Types of trees
Deep trees usually overfit their training set, can't generalize much outside of it, and are difficult to interpret.
Setting the right depth for your tree
Most decision tree algorithms will allow you to choose the maximum depth of your tree.
How deep is too deep will depend on the complexity of the problem.
Deeper trees tend towards overfitting, while shallower trees tend towards underfitting.
The best option is to start from a deep tree and to prune it in a way that minimizes the error on a validation set.

While balanced trees are usually the preferable option, bushy trees should not be frowned upon in problems with a lot of classes, or when they can help reduce the depth of the tree.
Link between decision trees and 1-NN classifiers

Remarks
For most decision trees, it is possible to build an equivalent 1-NN classifier.
Each leaf of a decision tree is equivalent to a data point in the learning set of a 1-NN classifier.

This process is much easier with discrete variables.



The Weather Problem

Outlook Temperature Humidity Wind Play


Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild Normal Weak Yes
... ... ... ... ...

The goal of this problem is to determine which weather conditions are the most favorable to let your kids go play outside.

From the resulting tree, we can see that temperature is not among the most relevant parameters here.



The Weather Problem

From this tree we can extract the following rules:
If (Outlook=Sunny) AND (Humidity=High) THEN Play=No
If (Outlook=Sunny) AND (Humidity=Normal) THEN Play=Yes
If (Outlook=Overcast) THEN Play=Yes
If (Outlook=Rain) AND (Wind=Strong) THEN Play=No
If (Outlook=Rain) AND (Wind=Weak) THEN Play=Yes
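Rules like these can also be recovered automatically; the sketch below uses scikit-learn's DecisionTreeClassifier (assumed installed; the sparse_output argument needs a recent version) on a one-hot encoding of the categorical attributes. Only the four rows shown earlier are used, so the resulting tree is purely illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny",  "Overcast", "Rain"],
    "Temperature": ["Hot",   "Hot",    "Hot",      "Mild"],
    "Humidity":    ["High",  "High",   "High",     "Normal"],
    "Wind":        ["Weak",  "Strong", "Weak",     "Weak"],
    "Play":        ["No",    "No",     "Yes",      "Yes"],
})

enc = OneHotEncoder(sparse_output=False)                 # categorical -> numeric indicators
X = enc.fit_transform(data.drop(columns="Play"))
tree = DecisionTreeClassifier(max_depth=3).fit(X, data["Play"])
print(export_text(tree, feature_names=list(enc.get_feature_names_out())))
```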
The Weather Problem

Outlook Temperature Humidity Wind Play


Sunny 29 85 Weak No
Sunny 27 90 Strong No
Overcast 28 86 Weak Yes
Rain 24 80 Weak Yes
... ... ... ... ...

The same problem can be processed with mixed attributes. A similar tree can be found.

Remark
It usually takes more time to compute a decision tree with numerical values, because of the time required to find the optimal cut value.



The Weather Problem: Unlabeled Example

Outlook Temperature Humidity Wind Play


Sunny 25 55 Weak ?

This example would be labeled Yes by the decision tree.



The Contact Lenses Data

Age Prescription Astigmatism Tear production Recommended Lenses


Young Myope No Low None
Young Myope No Normal Soft
Young Myope Yes Low None
Young Myope Yes Normal Hard
Young Hypermetrope No Low None
Young Hypermetrope No Normal Soft
Young Hypermetrope Yes Low None
Young Hypermetrope Yes Normal Hard
Pre-presbyopic Myope No Low None
Pre-presbyopic Myope No Normal Soft
Pre-presbyopic Myope Yes Low None
Pre-presbyopic Myope Yes Normal Hard
Pre-presbyopic Hypermetrope No Low None
Pre-presbyopic Hypermetrope No Normal Soft
Pre-presbyopic Hypermetrope Yes Low None
Pre-presbyopic Hypermetrope Yes Normal None
Presbyopic Myope No Low None
Presbyopic Myope No Normal None
Presbyopic Myope Yes Low None
Presbyopic Myope Yes Normal Hard
Presbyopic Hypermetrope No Low None
Presbyopic Hypermetrope No Normal Soft
Presbyopic Hypermetrope Yes Low None
Presbyopic Hypermetrope Yes Normal None



The Contact Lenses Data



Summary

Pros
Intuitive, easy to understand and to use
Builds comprehensible models
The most commonly used classifier for decision making
Can learn in a single sweep

Cons
The process to build the tree is complex
There are always several possible trees
Choosing the depth of the tree is a complex decision
Does not work well with datasets that have too many attributes.



Naive Bayes Classifiers: Introduction

Naive Bayes classifiers are a family of simple probabilistic classifiers based on Bayes' theorem and a strong independence hypothesis between the features.
Intuition
To find out the probability of a previously unseen instance belonging to each class, simply pick the “most probable” class.
These probabilities are estimated by applying Bayes' theorem to the training data.



Naive Bayes Classifiers: Introduction

Bayes' Theorem

p(cj | x) = p(x | cj) p(cj) / p(x)

p(cj | x): The probability of instance x belonging to class cj.
This is the probability we want to compute.
p(x | cj): The probability of generating instance x knowing class cj.
Knowing the distribution function of class cj and the features of x, what is the probability of x?
p(cj): The occurrence probability of class cj.
How frequent is class cj in the training set?
p(x): The occurrence probability of instance x.
This can usually be ignored because it is independent of cj and the same for all instances.



Naive Bayes Classifiers: Simple example

One year ago, on my way back from Holland, I was arrested by a police officer named “Claude”. I was a bit high and can't remember whether Officer Claude was a male or a female...
Using a Bayesian classifier, and a police database with names and sexes, we can try to guess whether it is more likely that Officer Claude was a male or a female. We have two classes: c1 = male and c2 = female.



Naive Bayes Classifiers: Simple example

Table: Training data (list of police officers in Lille)
Name    Sex
Claude  Male
Laura   Female
Claude  Female
Claude  Female
Arthur  Male
Karima  Female
Rose    Female
Sergio  Male

p(male | Claude) = p(Claude | male) p(male) / p(Claude) = (1/3 × 3/8) / (3/8) = 0.125 / (3/8)

p(female | Claude) = p(Claude | female) p(female) / p(Claude) = (2/5 × 5/8) / (3/8) = 0.250 / (3/8)

Since 0.125 < 0.250, we can conclude that most likely Officer Claude was a female!
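The same computation written out in Python; the counts come directly from the toy table (8 officers, 3 male and 5 female, with “Claude” appearing once among the males and twice among the females).

```python
p_claude_given_male   = 1 / 3
p_claude_given_female = 2 / 5
p_male, p_female      = 3 / 8, 5 / 8
p_claude              = 3 / 8

p_male_given_claude   = p_claude_given_male   * p_male   / p_claude   # = 0.125 / (3/8)
p_female_given_claude = p_claude_given_female * p_female / p_claude   # = 0.250 / (3/8)
print(p_male_given_claude, p_female_given_claude)   # ~0.33 vs ~0.67 -> most likely female
```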



Naive Bayes with several features
In the previous example there was only one feature: the name. What happens when there are more?
To make the problem simpler, naive Bayes classifiers assume that the attributes have independent distributions (which is not always true).
Independence Hypothesis
Let us note x = {x1, ..., xd} a data point with d features.
Under the hypothesis that all attributes are independent, we can write:

p(x | cj) = p(x1 | cj) × p(x2 | cj) × · · · × p(xd | cj)

Therefore, we have:

p(cj | x) ∝ p(cj) Πi=1..d p(xi | cj)
Naive Bayes with several features

Note that Naive Bayes is not sensitive to irrelevant features.
Suppose that we are trying to classify a person's gender based on several features, including eye color (which is irrelevant):

p(Jessica | cj) = p(eye = brown | cj) × p(wears dress = yes | cj) × · · ·
p(Jessica | female) = 9000/10000 × 7500/10000 × · · ·
p(Jessica | male) = 9001/10000 × 3/10000 × · · ·

p(eye = brown | female) and p(eye = brown | male) should be almost identical and won't affect the outcome much. Wearing a dress, however...
Remark
This assumes that the estimates of the probabilities are good enough. Therefore the training set must be as big and as unbiased as possible.



Naive Bayes: Properties
Pros
The only things to store are the probabilities: the training data need not be kept in memory, and a single scan of the data is enough to estimate the probabilities.
The model is quite simple to understand.
One of the fastest prediction models.

Cons
Naive Bayes assumes that the features are fully independent. This is usually not true and can lead to more or less bias when several of them are too correlated.
Naive Bayes tends to be biased toward the training data and can't generalize easily (e.g. it is impossible to classify a new instance with one or more attribute values that never occur in the training set).



Mosquito identification
In this example we consider 3 species of mosquitoes:
Culex Pipiens, the common house mosquito
Anopheles Stephensi, a common mosquito from the Middle East
Aedes Aegypti, the yellow fever mosquito (may also carry Dengue fever, Zika, or Chikungunya)

Figure: (g) Culex Pipiens; (h) Anopheles Stephensi; (i) Aedes Aegypti



Mosquito identification

Mosquitoes have distinct wing beat frequencies.

Culex Pipiens: N(µ = 390, σ = 14)
Anopheles Stephensi: N(µ = 475, σ = 30)
Aedes Aegypti: N(µ = 567, σ = 43)



Mosquito identification

Culex Pipiens: N(µ = 390, σ = 14)
Anopheles Stephensi: N(µ = 475, σ = 30)
Aedes Aegypti: N(µ = 567, σ = 43)

Suppose I see a mosquito with a wing beat frequency of 500 Hz; which one is it?



Mosquito identification

Suppose I see a mosquito with a wing beat frequency of 500 Hz; which one is it?

p(Culex | wingbeat = 500) = exp(−(500 − 390)² / (2 × 14²)) / (14√(2π)) ≈ 0

p(Anopheles | wingbeat = 500) = exp(−(500 − 475)² / (2 × 30²)) / (30√(2π)) ≈ 0.0094

p(Aedes | wingbeat = 500) = exp(−(500 − 567)² / (2 × 43²)) / (43√(2π)) ≈ 0.0028

Most likely it is an Anopheles.



Mosquito identification: Getting the probability

These do not look like probabilities. Getting the probabilities:

p(Anopheles | wingbeat = 500) = 0.0094, so P(Anopheles | wbf = 500) = 0.0094 / (0 + 0.0094 + 0.0028) ≈ 0.77
p(Aedes | wingbeat = 500) = 0.0028, so P(Aedes | wbf = 500) = 0.0028 / (0 + 0.0094 + 0.0028) ≈ 0.23
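A sketch of this computation, assuming NumPy: Gaussian likelihoods for each species evaluated at 500 Hz, then normalization to obtain the probabilities.

```python
import numpy as np

species = {"Culex": (390, 14), "Anopheles": (475, 30), "Aedes": (567, 43)}

def gaussian(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

x = 500
likelihoods = {name: gaussian(x, mu, sigma) for name, (mu, sigma) in species.items()}
total = sum(likelihoods.values())
posteriors = {name: l / total for name, l in likelihoods.items()}
print(posteriors)   # Anopheles ~0.77, Aedes ~0.23, Culex ~0
```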



Mosquito identification: More features
We now have additional information in the form of a chart representing how many mosquitoes are active depending on the time of the day.

Suppose I am savagely attacked by a mosquito with a wingbeat frequency of 420 Hz at 11:30am. Which one is the most likely culprit?
Mosquito identification: More features

The first factor below is the wing beat likelihood at 420 Hz (normalized over the three species), the second the proportion of mosquitoes of each species active at 11:30am, read from the activity chart.

P(Culex | 420Hz, 11:30am) = [2.87×10⁻³ / (2.87×10⁻³ + 2.48×10⁻³ + 2.7×10⁻⁵)] × [3 / (3 + 5 + 6)] = 0.1144
P(Anopheles | 420Hz, 11:30am) = [2.48×10⁻³ / (2.87×10⁻³ + 2.48×10⁻³ + 2.7×10⁻⁵)] × [6 / (3 + 5 + 6)] = 0.1976
P(Aedes | 420Hz, 11:30am) = [2.7×10⁻⁵ / (2.87×10⁻³ + 2.48×10⁻³ + 2.7×10⁻⁵)] × [5 / (3 + 5 + 6)] = 0.0018

Resulting Probabilities
P(Culex|420Hz, 11:30am) = 36.5%
P(Anopheles|420Hz, 11:30am) = 63%
P(Aedes|420Hz, 11:30am) = 0.5%

Anopheles Stephensi is again the most likely culprit.



Remarks

At this point you should be wondering where the training set was in this exercise.
You never saw the training set, you only saw the model: the wing beat frequency distributions and the mosquito activity diagram.
The training set was used to build the wing beat frequency laws and the activity diagram. Once you have them, you don't need the training set anymore.

Important remark
You saw normalization constants pretty much everywhere in the calculations. You don't need them to classify new items. Unless you really want probabilities, you don't have to normalize your results.



Evaluating classifiers

The evaluation of a classifier is usually done using the validation set, the labels of which are known.
There are several ways to validate classifier results, depending on their type and the number of classes.

Accuracy: the simplest evaluation criterion

Accuracy = (Number of correctly classified data) / (Total number of data)

The result is a percentage, 100% being the best.



Evaluating binary classifiers

Binary classifiers (with 2 classes: True and False) have specific validation measures that assess different aspects.
Let us consider the following notations:
TP: True positives (data classified True that really are in this class)
FP: False positives (data classified True but are not)
TN: True negatives (data classified False that really are in this class)
FN: False negatives (data classified False but are actually True)

Remark

Accuracy = (TP + TN) / (TP + TN + FP + FN)



Evaluating binary classifiers

Recall or True Positive Rate (TPR)

Recall = TP / (TP + FN)

It is also called “hit rate” or “sensitivity”.
It is the probability of correctly labeling a Positive case.



Evaluating binary classifiers

Fall-out or False Positive Rate (FPR)

FPR = FP / (TN + FP)

Specificity (SPC) or True Negative Rate

Specificity = TN / (TN + FP) = 1 − FPR

The specificity (or TNR) is a statistical measure of how well a binary classifier correctly identifies the negative cases.



Evaluating binary classifiers

Precision or Positive Predictive Value (PPV)

Precision = TP / (TP + FP)

The precision is the probability that a positive prediction is correct.

F-Measure

F-Measure = (2 × precision × recall) / (precision + recall)

The F-Measure is the harmonic mean of the precision and the recall.
It can be used as a single measure to evaluate the performance of a binary classifier.
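A sketch computing all of these measures from two toy label vectors (1 = positive, 0 = negative), assuming NumPy.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

TP = np.sum((y_pred == 1) & (y_true == 1))
FP = np.sum((y_pred == 1) & (y_true == 0))
TN = np.sum((y_pred == 0) & (y_true == 0))
FN = np.sum((y_pred == 0) & (y_true == 1))

accuracy  = (TP + TN) / (TP + TN + FP + FN)
recall    = TP / (TP + FN)            # true positive rate / sensitivity
fpr       = FP / (TN + FP)            # fall-out; specificity = 1 - fpr
precision = TP / (TP + FP)
f_measure = 2 * precision * recall / (precision + recall)
print(accuracy, recall, 1 - fpr, precision, f_measure)
```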



Evaluating binary classifiers

TP, FP, TN and FN provide relevant information.
No single measure tells the whole story.
A classifier with 90% accuracy can be useless if 90% of the population does not have cancer and the 10% that do are misclassified.
If possible, use multiple measures.
Beware of the obscure terminological confusion in the literature!
Depending on the field, specificity sometimes refers to precision.
Different names exist for the same thing.
Always provide the formula when you use terms such as FP, TP, etc.



Evaluating binary classifiers: ROC space
The ROC space (Receiver Operating Characteristic) is a type of graph based on the fall-out and the sensitivity that can be used to evaluate a classifier.



Evaluating binary classifiers: ROC curves

A ROC curve uses the ROC space to assess the quality of a classifier. It is plotted using different parameter values as reference points to draw the curve.
Useful to find the right parameters
Useful to compare binary classifiers



Generalizing to non-binary classifiers

Generalizing to classifiers that have more than 2 classes is often complicated.
Criteria exist in the literature, but they are often quite complex and restricted to specific cases.

It is possible to do some basic analysis using confusion matrices between the expected classes and the found classes.

Otherwise, indexes such as the accuracy, or vector-comparing measures (e.g. the Rand Index and Adjusted Rand Index), are good solutions.



Building a model and complexity issues

The complexity of a model is an important criterion for evaluating a model: when comparing several models/classifiers that show similar performance in terms of accuracy (or error), and have similar bias and variance, the simplest models are usually considered the best.

Examples of complexity measures
The Bayesian Information Criterion (BIC)
The Akaike Information Criterion (AIC)
For both criteria, the lower the better.

Remark: Models that are too simple tend to have a low accuracy (high error), while models that are too complex tend to overfit.
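For reference, a sketch of the two criteria using their standard formulas (not spelled out on the slide): k is the number of free parameters, n the number of observations, and log_likelihood the maximized log-likelihood of the model on the data.

```python
import numpy as np

def aic(log_likelihood, k):
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    return k * np.log(n) - 2 * log_likelihood

# Lower values indicate a better complexity/fit trade-off when comparing models.
```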



Bias-variance trade-off and complexity

The bias tends to decrease with the complexity of the model.
The variance tends to increase with the complexity of the model.
The mean square error (MSE) on validation data first decreases when the model gets more complex, and then increases again when the model gets too complex and overfits.

Deciding between several models relies on finding the one(s) with the best bias-variance trade-off and the lowest complexity.



Bibliography

Christopher M. Bishop, Pattern Recognition and Machine Learning (2006)
R. O. Duda, P. E. Hart, D. Stork, Pattern Classification, Wiley and Sons (2000)
Tom M. Mitchell, Machine Learning (1997)

