DataScienceIntro-CM-ML_compressed
Nistor Grozavu
2020/2021
Table: Training data
Size  Weight  Shoe size  Sex
176   72      43         M
159   61      37         F
180   66      39         F
185   85      44         M
177   70      41         F
155   88      38         M
210   110     45         M

Table: Test data
Size  Weight  Shoe size  Sex
205   85      47         ?
172   60      40         ?
164   57      38         ?
169   52      36         ?
183   78      42         ?
175   65      44         ?
191   77      41         ?
ψ(·, L) : X → [1..K]
Applying ψ to a new object x from the test set therefore returns a class prediction:
ŷ = ψ(x, L)
Remark
Good learning algorithms should be able to adjust their bias-
variance trade-off based on the amount of available data and the
apparent complexity of the model to be learned.
Euclidean distance: ||a − b||₂ = √(Σᵢ (aᵢ − bᵢ)²)
Squared Euclidean distance: ||a − b||₂² = Σᵢ (aᵢ − bᵢ)²
Manhattan distance: ||a − b||₁ = Σᵢ |aᵢ − bᵢ|
Maximum distance: ||a − b||∞ = maxᵢ |aᵢ − bᵢ|
Mahalanobis distance: √((a − b)ᵀ S⁻¹ (a − b)), where S is the covariance matrix
Hamming distance: Hamming(a, b) = Σᵢ (1 − δ_{aᵢ,bᵢ})
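As a minimal sketch (NumPy is an assumed choice, not part of the slides; a and b are assumed to be NumPy arrays of the same length), these distances can be computed as follows:

```python
import numpy as np

def euclidean(a, b):
    # ||a - b||_2 = sqrt(sum_i (a_i - b_i)^2)
    return np.sqrt(np.sum((a - b) ** 2))

def squared_euclidean(a, b):
    # ||a - b||_2^2 = sum_i (a_i - b_i)^2
    return np.sum((a - b) ** 2)

def manhattan(a, b):
    # ||a - b||_1 = sum_i |a_i - b_i|
    return np.sum(np.abs(a - b))

def maximum(a, b):
    # ||a - b||_inf = max_i |a_i - b_i|
    return np.max(np.abs(a - b))

def mahalanobis(a, b, S):
    # sqrt((a - b)^T S^{-1} (a - b)), where S is the covariance matrix of the data
    d = a - b
    return np.sqrt(d @ np.linalg.inv(S) @ d)

def hamming(a, b):
    # number of positions i where a_i != b_i
    return np.sum(a != b)
```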
Figure: The 1-NN algorithm would assign this data point to the red class. On the other hand, a majority vote would assign it to the blue class.
Figure: For K > 1, the KNN algorithm would assign this unlabeled data point to the blue class.
Remark
The real Weighted Nearest Neighbors classifier uses a much more complex weight system, one that satisfies ∑_{n=1}^{N} w_{ni} = 1.
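That exact weight system is not detailed here; as a purely illustrative sketch, a distance-weighted vote whose weights are normalized to sum to 1 could look like this (the function name and the 1e-12 guard against division by zero are arbitrary choices):

```python
import numpy as np

def weighted_knn_vote(distances, labels):
    """Distance-weighted vote over the K nearest neighbours.
    distances, labels: the distances to the query point and the class
    labels of its K nearest neighbours."""
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    w = 1.0 / (distances + 1e-12)   # closer neighbours weigh more
    w = w / w.sum()                 # normalize so the weights sum to 1
    scores = {c: w[labels == c].sum() for c in np.unique(labels)}
    return max(scores, key=scores.get)
```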
Pros
Very simple and intuitive
Low Complexity
Great results with well-behaved classes
Cons
No model: no way to properly describe each class, no possibility to reuse the knowledge
Does not scale well because it requires storing the whole training set
Critical choice of the parameter K
Ill-adapted for categorical data
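As a minimal sketch with scikit-learn (the library choice and K = 3 are assumptions, not from the slides), the training and test tables from the beginning of the section could be handled like this:

```python
from sklearn.neighbors import KNeighborsClassifier

# Training data: (size, weight, shoe size) -> sex
X_train = [[176, 72, 43], [159, 61, 37], [180, 66, 39], [185, 85, 44],
           [177, 70, 41], [155, 88, 38], [210, 110, 45]]
y_train = ["M", "F", "F", "M", "F", "M", "M"]

# Test data: sex is unknown
X_test = [[205, 85, 47], [172, 60, 40], [164, 57, 38], [169, 52, 36],
          [183, 78, 42], [175, 65, 44], [191, 77, 41]]

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3, Euclidean distance by default
knn.fit(X_train, y_train)
print(knn.predict(X_test))                  # predicted sex for each test row
```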
Idea
Instead of using all the data, we could use a prototype representing each class (as in the mean-shift and K-Means algorithms).
Can be learned incrementally.
Helps build a model.
Issues
Works only with spherical classes
Doesn’t work with classes that aren’t well separated.
Learning without remembering all the data
Figure: A single prototype per class will never work here ...
Remark
In many neural network algorithms, prototypes learned through an iterative process are called neurons due to their adaptive behavior and the fact that they do not represent a cluster or class on their own.
Remark
The learning rate is a critical parameter that can drastically change the outcome of the classification.
Learning Vector Quantization: Classification
Remark
Using LVQ, the prototypes can be trained (updated) in real time while being used on unlabeled data. This algorithm is therefore great for online learning.
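A minimal sketch of the classical LVQ1 update rule, assuming NumPy arrays and an arbitrary learning rate of 0.1 (the exact variant used in the course may differ):

```python
import numpy as np

def lvq1_update(prototypes, proto_labels, x, y, lr=0.1):
    """One online LVQ1 step on a labeled sample (x, y).
    prototypes: float array of shape (n_prototypes, n_features),
    proto_labels: the class label attached to each prototype."""
    i = np.argmin(np.linalg.norm(prototypes - x, axis=1))  # nearest prototype
    if proto_labels[i] == y:
        prototypes[i] += lr * (x - prototypes[i])   # same class: attract
    else:
        prototypes[i] -= lr * (x - prototypes[i])   # different class: repel
    return prototypes
```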
Pros
Low Complexity
Low memory consumption
Can deal with online and incremental data
Can build a good model
Is still easy and intuitive
Cons
Is often less accurate than KNN.
Critical choice of the learning rate parameter
Ill-adapted for categorical data
Use KNN when: You have a relatively small data set, you
don’t need to build a model, you don’t need to generalize
from your training set.
Use LVQ when: You have a large data set, you need to build a
model, you are dealing with a semi-supervised problem, you
need to learn data incrementally or online, you can afford a
slightly lower accuracy or want a lower variance.
Remark
For simple problems, both will work just fine.
Notation
Disjunction : A or B
Conjunction : A and B
Remarks
For most decision trees, it is possible to build an equivalent 1-NN classifier.
Each leaf of a decision tree is equivalent to a data point in the learning set of a 1-NN classifier.
Remark
It usually takes more time to compute a decision tree with numerical values, because of the time required to find the optimal cut value.
Pros
Intuitive, easy to understand and to use
Builds comprehensible models
The most commonly used classifier for decision making
Can learn in a single sweep
Cons
The process to build the tree is complex
There are always several possible trees
Choosing the depth of the tree is a complex decision
Does not work well with datasets that have too many
attributes.
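As an illustrative sketch with scikit-learn (the library choice, the reuse of the earlier size/weight/shoe-size table and max_depth = 2 are all assumptions), limiting the tree depth is the usual way to handle the depth trade-off mentioned above:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Reusing the size/weight/shoe-size table from the KNN example
X_train = [[176, 72, 43], [159, 61, 37], [180, 66, 39], [185, 85, 44],
           [177, 70, 41], [155, 88, 38], [210, 110, 45]]
y_train = ["M", "F", "F", "M", "F", "M", "M"]

tree = DecisionTreeClassifier(max_depth=2)  # the depth is a critical parameter
tree.fit(X_train, y_train)
print(export_text(tree, feature_names=["size", "weight", "shoe size"]))
```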
Bayes Theorem
p(cj|x) = p(x|cj) p(cj) / p(x)
Table: Training data (list of police officers in Lille)
Name    Sex
Claude  Male
Laura   Female
Claude  Female
Claude  Female
Arthur  Male
Karima  Female
Rose    Female
Sergio  Male

p(male|Claude) = p(Claude|male) p(male) / p(Claude) = (1/3 × 3/8) / (3/8) = 0.125 / (3/8)
p(female|Claude) = p(Claude|female) p(female) / p(Claude) = (2/5 × 5/8) / (3/8) = 0.250 / (3/8)

Since 0.125 < 0.250, we can conclude that Officer Claude was most likely a female!
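A minimal sketch reproducing the computation above directly from the counts in the table (plain Python; the use of fractions.Fraction is just for exact arithmetic):

```python
from fractions import Fraction as F

# Counts from the table of 8 police officers: 3 male, 5 female, 3 named Claude
p_male, p_female = F(3, 8), F(5, 8)
p_claude_given_male = F(1, 3)    # 1 of the 3 males is named Claude
p_claude_given_female = F(2, 5)  # 2 of the 5 females are named Claude

# Unnormalized posteriors p(Claude|sex) * p(sex); p(Claude) = 3/8 cancels out
score_male = p_claude_given_male * p_male        # 1/8 = 0.125
score_female = p_claude_given_female * p_female  # 1/4 = 0.250

print(float(score_male), float(score_female))    # 0.125 0.25 -> most likely female
```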
Therefore, we have:
p(cj|x) ∝ p(cj) ∏_{i=1}^{d} p(xi|cj)
Naive Bayes with several features
Cons
Naive Bayes assumes that the features are fully independent. This is usually not true and can introduce bias when several of them are strongly correlated.
Naive Bayes tends to be biased toward the training data and does not generalize easily (e.g. it is impossible to classify a new instance if one or more of its attribute values never occur in the training set).
Culex Pipiens:
N (µ = 390, σ = 14)
Anopheles Stephensis:
N (µ = 475, σ = 30)
Aedes Aegypti:
N (µ = 567, σ = 43)
P(Culex|420Hz, 11:30am) = 2.87×10⁻³/(2.87×10⁻³ + 2.48×10⁻³ + 2.7×10⁻⁵) × 3/(3+5+6) = 0.1144
P(Anopheles|420Hz, 11:30am) = 2.48×10⁻³/(2.87×10⁻³ + 2.48×10⁻³ + 2.7×10⁻⁵) × 6/(3+5+6) = 0.1976
P(Aedes|420Hz, 11:30am) = 2.7×10⁻⁵/(2.87×10⁻³ + 2.48×10⁻³ + 2.7×10⁻⁵) × 5/(3+5+6) = 0.0018

Resulting Probabilities (after normalization)
P(Culex|420Hz, 11:30am) ≈ 36.5%
P(Anopheles|420Hz, 11:30am) ≈ 63%
P(Aedes|420Hz, 11:30am) ≈ 0.5%
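A sketch reproducing these numbers with SciPy (an assumed library choice); the activity counts 3, 6 and 5 at 11:30am are taken from the factors 3/(3+5+6), 6/(3+5+6) and 5/(3+5+6) above, since the activity diagram itself is not reproduced in these notes:

```python
from scipy.stats import norm

# Wing beat frequency models and activity counts at 11:30am, per species
species = ["Culex", "Anopheles", "Aedes"]
freq_pdf = [norm(390, 14).pdf(420), norm(475, 30).pdf(420), norm(567, 43).pdf(420)]
activity = [3, 6, 5]

# Unnormalized scores: relative frequency likelihood * relative activity at 11:30am
scores = [f / sum(freq_pdf) * a / sum(activity) for f, a in zip(freq_pdf, activity)]
total = sum(scores)
for s, sc in zip(species, scores):
    print(f"P({s}|420Hz, 11:30am) = {sc / total:.3f}")  # approx. 0.365, 0.630, 0.006
```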
At this point you should be wondering where the training set was in this exercise.
You never saw the training set, you only saw the model: the wing beat frequency distributions and the mosquito activity diagram.
The training set was used to build the wing beat frequency distributions and the activity diagram. Once you have them, you don't need the training set anymore.
Important remark
Normalization constants appeared pretty much everywhere in the calculations. You don't need them to classify new items: unless you really want probabilities, you don't have to normalize your results.
Remark
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F-Measure
F-Measure = 2 × precision × recall / (precision + recall)
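A minimal sketch computing these metrics from the confusion-matrix counts (plain Python; the definitions precision = TP/(TP+FP) and recall = TP/(TP+FN) are the standard ones, and the example counts are made up):

```python
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Example: 40 true positives, 45 true negatives, 10 false positives, 5 false negatives
print(metrics(40, 45, 10, 5))  # (0.85, 0.8, 0.888..., 0.842...)
```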
Remark: Models that are too simple tend to have a low accuracy (high error), while models that are too complex tend to overfit.