
Introduction to supervised learning

Nistor Grozavu

LIPN - CNRS UMR 7030

2020/2021



What is supervised learning?

Supervised learning is a machine learning task whose aim is to learn a model from labeled training data in order to classify similar unlabeled data. Potential applications of supervised learning include:
Regression (Cf. previous Lecture)
Prediction
Classification
Speech and hand-writing processing
Pattern Recognition
Rule Mining



Supervised Learning workflow
In supervised learning, we distinguish 2 types of data:
The training data, subdivided into 2 subsets:
The Training set, on which the labels are known and which will be used to build the model.
The Validation set, on which the labels are also known and which can be used to rate the model and improve it if necessary (not used by all algorithms).
The test data, or Test set, on which the model will be used to find unknown labels.



Supervised Learning data sets examples

Supervised data are described by several attributes and a target class (or label), which is known in the training and validation sets but unknown in the test set.

Table: Training data
Size  Weight  Shoe size  Sex
176   72      43         M
159   61      37         F
180   66      39         F
185   85      44         M
177   70      41         F
155   88      38         M
210   110     45         M

Table: Test data
Size  Weight  Shoe size  Sex
205   85      47         ?
172   60      40         ?
164   57      38         ?
169   52      36         ?
183   78      42         ?
175   65      44         ?
191   77      41         ?



Supervised Learning Example



The notion of classifier

Apart from regression applications, most supervised algorithms are known as classifiers.
Classifiers
A classifier learns a model in the form of a function, a set of logic rules, the parameters of a probabilistic model, the parameters of a neural network, a set of prototypes, etc.
The classifier will use the model it learned to label new and previously unknown data.

When the target attribute is an integer number, we usually refer to it as a class.
When the target attribute is categorical, we talk about labels.
When the target attribute is a real number, the process involved is a regression.



Formalism
Let X = {x1, · · · , xN}, xi ∈ X, denote the matrix of the N observed examples of the training set.
The xi are vectors containing the attributes of each object.
Let Y = {y1, · · · , yN}, yi ∈ [1..K], be the vector containing the labels/classes associated with the observed examples.
We note L = {(xi, yi), i ∈ [1..N]} the training set.
A classifier induced from the training set L will be denoted ψ(·, L). It is a function that associates a class with any vector xi from X:

ψ(·, L) : X → [1..K]

Applying ψ to a new object x from the test set will therefore return a class prediction:

ŷ = ψ(x, L)
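As a minimal illustration of this formalism, the sketch below (Python, assuming NumPy is available) builds ψ(·, L) as an actual function; the 1-nearest-neighbour rule is used only as a placeholder for the learning step, and the toy data are hypothetical.

```python
# Minimal sketch: a classifier psi(., L) learned from a labeled training set
# L = {(x_i, y_i)} and applied to a new point x. 1-NN is a placeholder for psi.
import numpy as np

def learn_psi(X_train, y_train):
    """Return psi(., L): a function mapping a vector x to a class in [1..K]."""
    def psi(x):
        distances = np.linalg.norm(X_train - x, axis=1)  # distance to every x_i
        return y_train[np.argmin(distances)]             # class of the closest example
    return psi

# Toy training set L (attributes are arbitrary numbers, classes are 1 and 2)
X_train = np.array([[1.0, 2.0], [8.0, 9.0], [1.5, 1.8]])
y_train = np.array([1, 2, 1])

psi = learn_psi(X_train, y_train)
y_hat = psi(np.array([2.0, 2.0]))   # y_hat = psi(x, L)
print(y_hat)                        # -> 1
```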



Types of models and classifiers

How are patterns and models expressed? There are 2 extremes:
Black box representation: The model, structure or function is impossible to grasp for a human unfamiliar with the generating algorithm.
Examples: Deep learning algorithms
White box representation: The model and its construction process are easy to understand and reveal which kind of structure to expect.
Examples: Decision trees, K-Nearest Neighbors
The different types of models come from various fields such as AI, statistics and database research.



The 5 steps of supervised learning

1. Decide on a training set that will be representative of the real-world use of your classifier.
2. Determine your input features and the representations that you want to use in your model.
3. Decide on the structure of your learning function, and choose a supervised algorithm compatible with this model.
4. Run the algorithm on your training set. If your algorithm allows it, do cross-validation checking and adjust your model.
5. Evaluate the accuracy of your algorithm and apply it on a separate test set.
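A possible sketch of these five steps with scikit-learn (assuming it is installed); the dataset, the choice of KNN, and all parameter values are illustrative only.

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)                      # 1. representative data
# 2. input features: here we simply keep all four attributes as-is
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
clf = KNeighborsClassifier(n_neighbors=5)              # 3. choose model/algorithm
scores = cross_val_score(clf, X_train, y_train, cv=5)  # 4. train + cross-validate
clf.fit(X_train, y_train)
print("CV accuracy:", scores.mean())
print("Test accuracy:", clf.score(X_test, y_test))     # 5. evaluate on the test set
```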



The bias-variance trade-off
Supervised learning faces several challenges.
The bias-variance trade-off in supervised learning
The bias-variance trade-off is a problem in which a supervised algorithm has to achieve simultaneously two seemingly incompatible goals:
Building a model that gives good results on the validation set.
Building a model that can generalize beyond the training set.

The bias is the error from erroneous assumptions in the learning algorithm or model. It usually results in the algorithm missing important links between the input variables and the output, thus leading to underfitting.
The variance is the error caused by a too high sensitivity of the model toward small variations in the training set. This results in overfitting and the model being unable to generalize to data outside of the training set.
Function complexity and amount of training data

The second issue is the available amount of training data compared to the complexity of the real model to be learned:
If the model is simple, a learning algorithm with a high bias and a low variance should be able to learn it from a small amount of data.
If the model is complex, it will only be learnable from a very large amount of training data and using a learning algorithm with a low bias and a high variance.

Remark
Good learning algorithms should be able to adjust their bias-
variance trade-off based on the amount of available data and the
apparent complexity of the model to be learned.



Picking the right input variables

The problem with too many input variables
Even if the real learning model depends only on a very small number of variables, the algorithm may never figure it out if it is flooded with a very high number of input variables.
The result may end up being a very complex and overfitting model.
Models with too many variables cannot easily be understood and interpreted.

A good understanding of your data and of the problem that you want to model will help remove irrelevant features.
Scaling your data may have a huge influence on the results.
It is important to check for correlation between the attributes and to remove redundant variables.
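A small sketch of the last two points, assuming pandas and scikit-learn are available; the column names and values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"size": [176, 159, 180, 185],
                   "weight": [72, 61, 66, 85],
                   "shoe_size": [43, 37, 39, 44]})

print(df.corr())                               # highly correlated columns are candidates for removal
X_scaled = StandardScaler().fit_transform(df)  # put all attributes on comparable scales
```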



Similarity and distance

Very much like in clustering, the distance function is a key element for most supervised classifiers.
Creating custom distance functions is sometimes required.

Euclidean distance: ||a − b||2 = sqrt( Σi (ai − bi)² )
Squared Euclidean distance: ||a − b||2² = Σi (ai − bi)²
Manhattan distance: ||a − b||1 = Σi |ai − bi|
Maximum distance: ||a − b||∞ = maxi |ai − bi|
Mahalanobis distance: sqrt( (a − b)ᵀ S⁻¹ (a − b) ), where S is the covariance matrix
Hamming distance: Hamming(a, b) = Σi (1 − δ(ai, bi))

Table: Examples of common distances
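These distances translate directly into NumPy; the sketch below assumes a and b are numeric feature vectors and that S is an invertible covariance matrix.

```python
import numpy as np

def euclidean(a, b):         return np.sqrt(np.sum((a - b) ** 2))
def squared_euclidean(a, b): return np.sum((a - b) ** 2)
def manhattan(a, b):         return np.sum(np.abs(a - b))
def maximum(a, b):           return np.max(np.abs(a - b))
def mahalanobis(a, b, S):    return np.sqrt((a - b) @ np.linalg.inv(S) @ (a - b))
def hamming(a, b):           return np.sum(a != b)   # counts positions where a_i != b_i
```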



1-Nearest Neighbor
The simplest and laziest classifier consists in using the training set itself as a model, without building or computing anything.
1-NN Classifier
“Learning” process: Remember all the observed examples.
Classification process: When a new data point arrives, find the most similar stored example (distance-wise) and assign it to the same class.



1-Nearest Neighbor

The 1-Nearest Neighbor classifier is sensitive to noise and prone to overfitting.

Figure: The 1-NN algorithm would assign this data to the red class. On
the other hand, a majority vote would assign it to the blue class.



K-Nearest Neighbors

The K-Nearest Neighbors algorithm (KNN) considers the K closest observed data points from the training set to decide on a class for an unlabeled data point. K is a parameter chosen by the user.

Figure: For K>1, the KNN algorithm would assign this unlabeled data to
the blue class.
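A from-scratch sketch of the KNN rule just described (majority vote among the K closest training points); the value of K and the toy data are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, K=3):
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:K]        # indices of the K closest examples
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]          # majority class among the K neighbours

X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array(["blue", "blue", "blue", "red", "red", "red"])
print(knn_predict(X_train, y_train, np.array([2, 2]), K=3))   # -> "blue"
```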



K-Nearest Neighbors

When there are only two classes, K is usually chosen as an odd number to avoid the two classes receiving an equal number of votes.



K-Nearest Neighbors: Weaknesses
K is a critical parameter that can quickly make the algorithm unstable:
Depending on K, the predicted class may change completely.



K-Nearest Neighbors: Weaknesses
With more than 2 classes, things can quickly become complicated...



K-Nearest Neighbors: Weaknesses

Because the distance between instances is based on all the attributes, less relevant attributes and even irrelevant ones are used in the classification of a new instance.

Because the algorithm delays all processing until a new classification/prediction is required, significant processing is needed to make each prediction.



Weighted Nearest Neighbors

The Weighted Nearest Neighbors classifier solves 2 of the previous problems by adding a weight wk to each neighbor.
Examples:

wk = 1/k if k ≤ K, and 0 if k > K

wk = 1/dist if k ≤ K, and 0 if k > K

Remark
The real Weighted Nearest Neighbors classifier uses a much more complex weight system that satisfies Σn=1..N wni = 1.
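A sketch of the weighted vote with the 1/dist scheme above; the small eps term is only there to avoid division by zero and is not part of the original formulation.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, K=3, eps=1e-12):
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:K]
    scores = {}
    for idx in nearest:
        w = 1.0 / (distances[idx] + eps)      # w_k = 1/dist for the K nearest, 0 otherwise
        scores[y_train[idx]] = scores.get(y_train[idx], 0.0) + w
    return max(scores, key=scores.get)        # class with the largest total weight
```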



K-Nearest Neighbors: Summary

Pros
Very simple and intuitive
Low Complexity
Great results with well-behaved classes

Cons
No model: No way to properly describe each class, and no possibility to reuse the knowledge.
Does not scale well, because it requires storing the whole training set.
Critical choice of the parameter K
Ill-adapted for categorical data



Learning without remembering all the data
The main issues of the KNN algorithm are that all the data have to be kept in memory and that it is unstable when classes are not well separated:
This is a problem with both large and complex datasets.

Idea
Instead of using all the data, we could use a prototype representing each class (like in the mean-shift and K-Means algorithms).
Can be learned incrementally.
Helps building a model.

Issues
Works only with spherical classes.
Doesn't work with classes that aren't well separated.
Learning without remembering all the data

Figure: A single prototype per class will never work here ...



Learning without remembering all the data

Figure: However, several prototypes per class could work !



Learning Vector Quantization algorithm

The LVQ algorithm (Kohonen) is a primitive neural network classifier that represents the classes from the training set using several prototypes per class.
It is closely related to both the K-Means and the KNN algorithms.
It is an early ancestor of the Self-Organizing Maps (Cf. Lecture 7).

Remark
In many neural network algorithms, prototypes learned from an iterative process are called neurons, due to their adaptive behavior and the fact that they do not represent a cluster or class on their own.



Learning Vector Quantization algorithm

Figure: Example of an LVQ algorithm using 5 prototypes per class (3 classes) – Elements of Statistical Learning, © Hastie et al. 2001



Learning Vector Quantization algorithm
1. Initialization:
Set up the initial M prototypes Z = {z1, · · · , zM}. This can be done randomly or using an initialization with the K-Means algorithm. Then use a majority vote to assign each prototype to a class C(zm).
Choose a learning rate ε ∈ [0, 1].
2. Go through the training set and update Z for each observation:
For each observation xi, find the nearest prototype zm.
If C(xi) = C(zm), move zm towards xi: zm ← zm + ε(xi − zm).
If C(xi) ≠ C(zm), move zm away from xi: zm ← zm − ε(xi − zm).
3. Repeat step 2 until convergence.
Optional: Reduce ε after each pass through step 2 to enhance convergence.

Remark
The learning rate ε is a critical parameter that can drastically change the outcome of the classification.
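A sketch of these steps (an LVQ1-style update), assuming NumPy; prototypes are initialized here by sampling labeled points at random rather than with K-Means, and all parameter values are illustrative.

```python
import numpy as np

def lvq_train(X, y, n_prototypes_per_class=2, eps=0.1, n_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    Z, C = [], []
    for c in np.unique(y):                         # step 1: initialize prototypes per class
        idx = rng.choice(np.where(y == c)[0], n_prototypes_per_class, replace=False)
        Z.append(X[idx]); C.extend([c] * n_prototypes_per_class)
    Z, C = np.vstack(Z).astype(float), np.array(C)

    for _ in range(n_epochs):                      # step 2: sweep the training set
        for xi, yi in zip(X, y):
            m = np.argmin(np.linalg.norm(Z - xi, axis=1))   # nearest prototype z_m
            if C[m] == yi:
                Z[m] += eps * (xi - Z[m])          # move towards x_i
            else:
                Z[m] -= eps * (xi - Z[m])          # move away from x_i
        eps *= 0.95                                # optional: shrink the learning rate
    return Z, C

def lvq_predict(Z, C, x):
    return C[np.argmin(np.linalg.norm(Z - x, axis=1))]      # 1-NN on the prototypes
```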
Learning Vector Quantization: Classification

Once the prototypes have been learned, the LVQ classifier behaves like the 1-NN algorithm, using the prototypes instead of the training data.

The class of each presented unlabeled data point is determined based on the class of the closest prototype.

Remark
Using LVQ, the prototypes can be trained (updated) in real time while being used on unlabeled data. This algorithm is therefore well suited for online learning.



LVQ: Summary

Pros
Low Complexity
Low memory consumption
Can deal with online and incremental data
Can build a good model
Is still easy and intuitive

Cons
Is often less accurate than KNN.
Critical choice of the learning rate parameter ε
Ill-adapted for categorical data



LVQ vs KNN
In the bias-variance trade-off, 1-NN tends toward variance while LVQ tends toward bias. KNN is somewhere in between, depending on the value of K.

(a) 1-NN (b) 15-NN



LVQ vs KNN

Use KNN when: You have a relatively small data set, you
don’t need to build a model, you don’t need to generalize
from your training set.

Use LVQ when: You have a large data set, you need to build a model, you are dealing with a semi-supervised problem, you need to learn data incrementally or online, you can afford a slightly lower accuracy or want a higher bias (lower variance).

Remark
For simple problems, both will work just fine.



Decision trees
Decision trees are very common classifiers that mine rules from the training set:
They are mostly applied to categorical data, but not only.
They decompose the feature space according to the most discriminating variable at each stage.
There is usually more than one possible tree per data set.



Decision trees

A decision tree is a tree-based representation of a discrete-valued function. It can be used as a decision support tool that uses a tree-like graph or model of decisions and their possible consequences. Decision tree learning is among the most commonly used classification methods. The main algorithms are ID3, ID4, C4.5 and C5.0.
Properties
Expressiveness: It can represent disjunctions of conjunctions
Readability: It can be translated as a set of decision rules

Note
Disjunction: A or B
Conjunction: A and B



Decision tree: Classification Process



Types of trees

Decision trees can be categorized according to three criteria:
The type of data: Numerical, Categorical, Mixed
The type of nodes: Binary leaves, multiple leaves
The overall shape of the tree

Figure: (c) Example of a binary tree; (d) Example of a numerical non-binary tree



Types of trees



Types of trees
Deep trees usually overfit their training set, can't generalize much outside of it, and are difficult to interpret.
Setting the right depth for your tree
Most decision tree algorithms will allow you to choose the maximum depth of your tree.
How deep is too deep will depend on the complexity of the problem.
Deeper trees tend towards overfitting, while shallower trees tend towards underfitting.
The best option is to start from a deep tree and to prune it in a way that minimizes the error on a validation set.

While balanced trees are usually the preferable option, bushy trees should not be frowned upon in problems with a lot of classes, or when they can help reduce the depth of the tree.
Link between decision trees and 1-NN classifiers

Remarks
For most decision trees, it is possible to build an equivalent 1-NN classifier.
Each leaf of a decision tree is equivalent to a data point in the learning set of a 1-NN classifier.

This process is much easier with discrete variables.



The Weather Problem

Outlook Temperature Humidity Wind Play


Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild Normal Weak Yes
... ... ... ... ...

The goal of this problem is to determine which weather conditions are the most favorable to let your kids go play outside.

From the resulting tree, we can see that temperature is not among the most relevant parameters here.



The Weather Problem

From this tree we can extract the following rules:
If (Outlook=Sunny) AND (Humidity=High) THEN Play=No
If (Outlook=Sunny) AND (Humidity=Normal) THEN Play=Yes
If (Outlook=Overcast) THEN Play=Yes
If (Outlook=Rain) AND (Wind=Strong) THEN Play=No
If (Outlook=Rain) AND (Wind=Weak) THEN Play=Yes
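Rules like these can also be recovered automatically; the sketch below uses scikit-learn's DecisionTreeClassifier (assumed installed; the sparse_output argument needs a recent version) on a one-hot encoding of the categorical attributes. Only the four rows shown earlier are used, so the resulting tree is purely illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny",  "Overcast", "Rain"],
    "Temperature": ["Hot",   "Hot",    "Hot",      "Mild"],
    "Humidity":    ["High",  "High",   "High",     "Normal"],
    "Wind":        ["Weak",  "Strong", "Weak",     "Weak"],
    "Play":        ["No",    "No",     "Yes",      "Yes"],
})

enc = OneHotEncoder(sparse_output=False)                 # categorical -> numeric indicators
X = enc.fit_transform(data.drop(columns="Play"))
tree = DecisionTreeClassifier(max_depth=3).fit(X, data["Play"])
print(export_text(tree, feature_names=list(enc.get_feature_names_out())))
```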
The Weather Problem

Outlook Temperature Humidity Wind Play


Sunny 29 85 Weak No
Sunny 27 90 Strong No
Overcast 28 86 Weak Yes
Rain 24 80 Weak Yes
... ... ... ... ...

The same problem can be processed with mixed attributes. A similar tree can be found.

Remark
It usually takes more time to compute a decision tree with numerical values, because of the time required to find the optimal cut value.



The Weather Problem: Unlabeled Example

Outlook Temperature Humidity Wind Play


Sunny 25 55 Weak ?

This example would be labeled Yes by the decision tree.



The Contact Lenses Data

Age Prescription Astigmatism Tear production Recommended Lenses


Young Myope No Low None
Young Myope No Normal Soft
Young Myope Yes Low None
Young Myope Yes Normal Hard
Young Hypermetrope No Low None
Young Hypermetrope No Normal Soft
Young Hypermetrope Yes Low None
Young Hypermetrope Yes Normal Hard
Pre-presbyopic Myope No Low None
Pre-presbyopic Myope No Normal Soft
Pre-presbyopic Myope Yes Low None
Pre-presbyopic Myope Yes Normal Hard
Pre-presbyopic Hypermetrope No Low None
Pre-presbyopic Hypermetrope No Normal Soft
Pre-presbyopic Hypermetrope Yes Low None
Pre-presbyopic Hypermetrope Yes Normal None
Presbyopic Myope No Low None
Presbyopic Myope No Normal None
Presbyopic Myope Yes Low None
Presbyopic Myope Yes Normal Hard
Presbyopic Hypermetrope No Low None
Presbyopic Hypermetrope No Normal Soft
Presbyopic Hypermetrope Yes Low None
Presbyopic Hypermetrope Yes Normal None



The Contact Lenses Data



Summary

Pros
Intuitive, easy to understand and to use
Builds comprehensible models
The most commonly used classifier for decision making
Can learn in a single sweep

Cons
The process to build the tree is complex
There are always several possible trees
Choosing the depth of the tree is a complex decision
Does not work well with datasets that have too many attributes.



Naive Bayes Classifiers: Introduction

Naive Bayes classifiers are a family of simple probabilistic classifiers based on Bayes' theorem and a strong independence hypothesis between the features.
Intuition
To find out the probability of a previously unseen instance belonging to each class, simply pick the “most probable” class.
These probabilities are estimated by applying Bayes' theorem to the training data.



Naive Bayes Classifiers: Introduction

Bayes' Theorem

p(cj | x) = p(x | cj) p(cj) / p(x)

p(cj | x): The probability of instance x belonging to class cj.
This is the probability we want to compute.
p(x | cj): The probability of generating instance x knowing class cj.
Knowing the distribution function of class cj and the features of x, what is the probability of x?
p(cj): The occurrence probability of class cj.
How frequent is class cj in the training set?
p(x): The occurrence probability of instance x.
This can usually be ignored because it is independent of cj and the same for all instances.



Naive Bayes Classifiers: Simple example

One year ago, on my way back from Holland, I was arrested by a police officer named “Claude”. I was a bit high and can't remember whether Officer Claude was a male or a female...
Using a Bayesian classifier, and a police database with names and sexes, we can try to guess whether it is more likely that Officer Claude was a male or a female. We have two classes: c1 = male and c2 = female.



Naive Bayes Classifiers: Simple example

Table: Training data (list of police officers in Lille)
Name    Sex
Claude  Male
Laura   Female
Claude  Female
Claude  Female
Arthur  Male
Karima  Female
Rose    Female
Sergio  Male

p(male | Claude) = p(Claude | male) p(male) / p(Claude) = (1/3 × 3/8) / (3/8) = 0.125 / (3/8)

p(female | Claude) = p(Claude | female) p(female) / p(Claude) = (2/5 × 5/8) / (3/8) = 0.250 / (3/8)

Since 0.125 < 0.250, we can conclude that most likely Officer Claude was a female!
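The same computation written out in Python; the counts come directly from the toy table (8 officers, 3 male and 5 female, with “Claude” appearing once among the males and twice among the females).

```python
p_claude_given_male   = 1 / 3
p_claude_given_female = 2 / 5
p_male, p_female      = 3 / 8, 5 / 8
p_claude              = 3 / 8

p_male_given_claude   = p_claude_given_male   * p_male   / p_claude   # = 0.125 / (3/8)
p_female_given_claude = p_claude_given_female * p_female / p_claude   # = 0.250 / (3/8)
print(p_male_given_claude, p_female_given_claude)   # ~0.33 vs ~0.67 -> most likely female
```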



Naive Bayes with several features
In the previous example there was only one feature: the name. What happens when there are more?
To make the problem simpler, naive Bayes classifiers assume that the attributes have independent distributions (which is not always true).
Independence Hypothesis
Let us note x = {x1, ..., xd} a data point with d features.
Under the hypothesis that all attributes are independent, we can write:

p(x | cj) = p(x1 | cj) × p(x2 | cj) × · · · × p(xd | cj)

Therefore, we have:

p(cj | x) ∝ p(cj) Πi=1..d p(xi | cj)
Naive Bayes with several features

Note that Naive Bayes is not sensitive to irrelevant features.
Suppose that we are trying to classify a person's gender based on several features, including eye color (which is irrelevant):

p(Jessica | cj) = p(eye = brown | cj) × p(wears dress = yes | cj) × · · ·
p(Jessica | female) = 9000/10000 × 7500/10000 × · · ·
p(Jessica | male) = 9001/10000 × 3/10000 × · · ·

p(eye = brown | female) and p(eye = brown | male) should be almost identical and won't affect the outcome much. Wearing a dress, however...
Remark
This assumes that the estimates of the probabilities are good enough. Therefore the training set must be as big and as unbiased as possible.



Naive Bayes: Properties
Pros
The only things to store are the probabilities: the training data need not be kept in memory, and a single scan of the data is enough to estimate the probabilities.
The model is quite simple to understand.
One of the fastest prediction models.

Cons
Naive Bayes assumes that the features are fully independent. This is usually not true and can lead to more or less bias when several of them are too correlated.
Naive Bayes tends to be biased toward the training data and can't generalize easily (e.g. it is impossible to classify a new instance with one or more attribute values that never occur in the training set).



Mosquito identification
In this example we consider 3 species of mosquitoes:
Culex Pipiens, the common house mosquito
Anopheles Stephensi, a common mosquito from the Middle East
Aedes Aegypti, the yellow fever mosquito (may also carry Dengue fever, Zika, or Chikungunya)

Figure: (g) Culex Pipiens; (h) Anopheles Stephensi; (i) Aedes Aegypti



Mosquito identification

Mosquitoes have distinct wing beat frequencies.

Culex Pipiens: N(µ = 390, σ = 14)
Anopheles Stephensi: N(µ = 475, σ = 30)
Aedes Aegypti: N(µ = 567, σ = 43)



Mosquito identification

Culex Pipiens: N(µ = 390, σ = 14)
Anopheles Stephensi: N(µ = 475, σ = 30)
Aedes Aegypti: N(µ = 567, σ = 43)

Suppose I see a mosquito with a wing beat frequency of 500 Hz; which one is it?



Mosquito identification

Suppose I see a mosquito with a wing beat frequency of 500 Hz; which one is it?

p(Culex | wingbeat = 500) = exp(−(500 − 390)² / (2 × 14²)) / (14√(2π)) ≈ 0

p(Anopheles | wingbeat = 500) = exp(−(500 − 475)² / (2 × 30²)) / (30√(2π)) ≈ 0.0094

p(Aedes | wingbeat = 500) = exp(−(500 − 567)² / (2 × 43²)) / (43√(2π)) ≈ 0.0028

Most likely it is an Anopheles.



Mosquito identification: Getting the probability

These do not look like probabilities. Getting the probabilities:

p(Anopheles | wingbeat = 500) = 0.0094, so P(Anopheles | wbf = 500) = 0.0094 / (0 + 0.0094 + 0.0028) ≈ 0.77
p(Aedes | wingbeat = 500) = 0.0028, so P(Aedes | wbf = 500) = 0.0028 / (0 + 0.0094 + 0.0028) ≈ 0.23
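A sketch of this computation, assuming NumPy: Gaussian likelihoods for each species evaluated at 500 Hz, then normalization to obtain the probabilities.

```python
import numpy as np

species = {"Culex": (390, 14), "Anopheles": (475, 30), "Aedes": (567, 43)}

def gaussian(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

x = 500
likelihoods = {name: gaussian(x, mu, sigma) for name, (mu, sigma) in species.items()}
total = sum(likelihoods.values())
posteriors = {name: l / total for name, l in likelihoods.items()}
print(posteriors)   # Anopheles ~0.77, Aedes ~0.23, Culex ~0
```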



Mosquito identification: More features
We now have additional information in the form of a chart representing how many mosquitoes are active depending on the time of the day.

Suppose I am savagely attacked by a mosquito with a wingbeat frequency of 420 Hz at 11:30am. Which one is the most likely culprit?
Mosquito identification: More features

The first factor below is the wing beat likelihood at 420 Hz (normalized over the three species), the second the proportion of mosquitoes of each species active at 11:30am, read from the activity chart.

P(Culex | 420Hz, 11:30am) = [2.87×10⁻³ / (2.87×10⁻³ + 2.48×10⁻³ + 2.7×10⁻⁵)] × [3 / (3 + 5 + 6)] = 0.1144
P(Anopheles | 420Hz, 11:30am) = [2.48×10⁻³ / (2.87×10⁻³ + 2.48×10⁻³ + 2.7×10⁻⁵)] × [6 / (3 + 5 + 6)] = 0.1976
P(Aedes | 420Hz, 11:30am) = [2.7×10⁻⁵ / (2.87×10⁻³ + 2.48×10⁻³ + 2.7×10⁻⁵)] × [5 / (3 + 5 + 6)] = 0.0018

Resulting Probabilities
P(Culex|420Hz, 11:30am) = 36.5%
P(Anopheles|420Hz, 11:30am) = 63%
P(Aedes|420Hz, 11:30am) = 0.5%

Anopheles Stephensi is again the most likely culprit.



Remarks

At this point you should be wondering where the training set was in this exercise.
You never saw the training set, you only saw the model: the wing beat frequency distributions and the mosquito activity diagram.
The training set was used to build the wing beat frequency laws and the activity diagram. Once you have them, you don't need the training set anymore.

Important remark
You saw normalization constants pretty much everywhere in the calculations. You don't need them to classify new items. Unless you really want probabilities, you don't have to normalize your results.



Evaluating classifiers

The evaluation of a classifier is usually done using the validation set, the labels of which are known.
There are several ways to validate classifier results, depending on their type and the number of classes.

Accuracy: the simplest evaluation criterion

Accuracy = (Number of correctly classified data) / (Total number of data)

The result is a percentage, 100% being the best.



Evaluating binary classifiers

Binary classifiers (with 2 classes: True and False) have specific validation measures that assess different aspects.
Let us consider the following notations:
TP: True positives (data classified True that really are in this class)
FP: False positives (data classified True but are not)
TN: True negatives (data classified False that really are in this class)
FN: False negatives (data classified False but are actually True)

Remark

Accuracy = (TP + TN) / (TP + TN + FP + FN)



Evaluating binary classifiers

Recall or True Positive Rate (TPR)

Recall = TP / (TP + FN)

It is also called “hit rate” or “sensitivity”.
It is the probability of correctly labeling a Positive case.



Evaluating binary classifiers

Fall-out or False Positive Rate (FPR)

FPR = FP / (TN + FP)

Specificity (SPC) or True Negative Rate

Specificity = TN / (TN + FP) = 1 − FPR

The specificity (or TNR) is a statistical measure of how well a binary classifier correctly identifies the negative cases.



Evaluating binary classifiers

Precision or Positive Predictive Value (PPV)

Precision = TP / (TP + FP)

The precision is the probability that a positive prediction is correct.

F-Measure

F-Measure = (2 × precision × recall) / (precision + recall)

The F-Measure is the harmonic mean of the precision and the recall.
It can be used as a single measure to evaluate the performance of a binary classifier.
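A sketch computing all of these measures from two toy label vectors (1 = positive, 0 = negative), assuming NumPy.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

TP = np.sum((y_pred == 1) & (y_true == 1))
FP = np.sum((y_pred == 1) & (y_true == 0))
TN = np.sum((y_pred == 0) & (y_true == 0))
FN = np.sum((y_pred == 0) & (y_true == 1))

accuracy  = (TP + TN) / (TP + TN + FP + FN)
recall    = TP / (TP + FN)            # true positive rate / sensitivity
fpr       = FP / (TN + FP)            # fall-out; specificity = 1 - fpr
precision = TP / (TP + FP)
f_measure = 2 * precision * recall / (precision + recall)
print(accuracy, recall, 1 - fpr, precision, f_measure)
```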



Evaluating binary classifiers

TP, FP, TN and FN provide relevant information.
No single measure tells the whole story.
A classifier with 90% accuracy can be useless if 90% of the population does not have cancer and the 10% that do are misclassified.
If possible, use multiple measures.
Beware of the obscure terminological confusion in the literature!
Depending on the field, specificity sometimes refers to precision.
Different names exist for the same thing.
Always provide the formula when you use terms such as FP, TP, etc.



Evaluating binary classifiers: ROC space
The ROC space (Receiver Operating Characteristic) is a type of graph based on the fall-out and the sensitivity that can be used to evaluate a classifier.



Evaluating binary classifiers: ROC curves

A ROC curve uses the ROC space to assess the quality of a classifier. It is plotted using different parameter values as reference points to draw the curve.
Useful to find the right parameters
Useful to compare binary classifiers



Generalizing to non-binary classifiers

Generalizing to classifiers that have more than 2 classes is often complicated.
Criteria exist in the literature, but they are often quite complex and restricted to specific cases.

It is possible to do some basic analysis using confusion matrices between the expected classes and the found classes.

Otherwise, indexes such as the accuracy, or vector-comparing measures (e.g. the Rand Index and Adjusted Rand Index), are good solutions.



Building a model and complexity issues

The complexity of a model is an important criterion for evaluating a model: when comparing several models/classifiers that show similar performance in terms of accuracy (or error), and have similar bias and variance, the simplest models are usually considered the best.

Examples of complexity measures
The Bayesian Information Criterion (BIC)
The Akaike Information Criterion (AIC)
For both criteria, the lower the better.

Remark: Models that are too simple tend to have a low accuracy (high error), while models that are too complex tend to overfit.
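For reference, a sketch of the two criteria using their standard formulas (not spelled out on the slide): k is the number of free parameters, n the number of observations, and log_likelihood the maximized log-likelihood of the model on the data.

```python
import numpy as np

def aic(log_likelihood, k):
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    return k * np.log(n) - 2 * log_likelihood

# Lower values indicate a better complexity/fit trade-off when comparing models.
```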



Bias-variance trade-off and complexity

The bias tends to decrease with the complexity of the model.
The variance tends to increase with the complexity of the model.
The mean square error (MSE) on validation data first decreases when the model gets more complex, and then increases again when the model gets too complex and overfits.

Deciding between several models relies on finding the one(s) with the best bias-variance trade-off and the lowest complexity.



Bibliography

Christopher M. Bishop, Pattern Recognition and Machine Learning (2006)
R. O. Duda, P. E. Hart, D. Stork, Pattern Classification, Wiley and Sons (2000)
Tom M. Mitchell, Machine Learning (1997)

