13 Practical Machine Learning

Machine Learning
Verena Kaynig-Fittkau ([email protected])
Unsupervised Learning
- K-means, mean shift
Supervised Learning
- data points, labels, features
[Figure: labeled data points (features x1, x2) divided by a separating hyperplane]
Features are important
- roundness, weight
- shape, color
Google's Self-Driving Car
Car Features:
- Laser scan
- Intensity model
- Elevation model
- Lane model
- Camera vision
- 2D stationary map
So just measure everything? More features = better classification?
- Practical issues: data volume, computation overhead
- Theoretical issues: generalization performance, curse of dimensionality
Supervised Learning (recap)
[Figure: labeled data points (features x1, x2) divided by a separating hyperplane]
Perceptron
- x: data point
- y: label
- w: weight vector
- b: bias
[Figure: inputs x1, x2, x3 weighted by w1, w2, w3 plus bias b; the unit outputs +1 or -1]
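The perceptron above can be sketched in a few lines of plain Python (a toy implementation; the data, learning rate, and epoch count are illustrative, not from the slides):

```python
# Perceptron sketch: learn w and b so that sign(w.x + b) matches the +1/-1 labels.

def perceptron_train(points, labels, epochs=20, lr=1.0):
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            # Update only when the point is misclassified.
            if y * activation <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def perceptron_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Linearly separable toy data: +1 above the line x1 + x2 = 1.
points = [(0.0, 0.0), (1.0, 1.0), (0.2, 0.1), (1.5, 0.8)]
labels = [-1, 1, -1, 1]
w, b = perceptron_train(points, labels)
```

On linearly separable data like this, the update rule is guaranteed to converge; on XOR-style data (next slide) it never will.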
The XOR Problem
[Figure: XOR-labeled points are not linearly separable]
Support Vector Machine
- Widely used for all sorts of classification problems
  www.clopinet.com/isabelle/Projects/SVM/applist.html
[Figure: two candidate separating hyperplanes in (x1, x2); the SVM chooses the one with maximal margin]
What about outliers?
- ξ: slack variables
[Figure: soft-margin SVM in (x1, x2) with slack variables for points violating the margin]
XOR Problem Revisited
[Figure: XOR data made separable by mapping into a higher-dimensional feature space]

Kernel Trick for SVMs
- Polynomial: K(x, x') = (x · x' + c)^d
- Radial basis function (RBF): K(x, x') = exp(-γ ‖x - x'‖²)
- Arbitrarily many dimensions at little computational cost
- Maximal margin helps with the curse of dimensionality
SVM Applet
http://www.ml.inf.ethz.ch/education/lectures_and_seminars/annex_estat/Classifier/JSupportVectorApplet.html
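The two kernels named above, in their standard forms (the hyperparameters c, d, and gamma are assumed defaults, not values from the slides):

```python
import math

def polynomial_kernel(x, y, c=1.0, d=2):
    # (x . y + c)^d: inner product in an implicit polynomial feature space.
    return (sum(a * b for a, b in zip(x, y)) + c) ** d

def rbf_kernel(x, y, gamma=1.0):
    # exp(-gamma * ||x - y||^2): 1 for identical points, decays with distance.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

The RBF kernel corresponds to an infinite-dimensional feature space, yet each evaluation costs only one squared distance, which is the point of the kernel trick.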
Tips and Tricks
- SVMs are not scale invariant
- Check if your library normalizes by default
- Normalize your data: mean 0, std 1, or map to [0, 1] or [-1, 1]
- Normalize the test set in the same way!
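A minimal sketch of the normalization advice: fit mean and std on the training data only, then apply the same statistics to the test set (the feature values are made up):

```python
def fit_standardizer(column):
    n = len(column)
    mean = sum(column) / n
    var = sum((v - mean) ** 2 for v in column) / n
    std = var ** 0.5 or 1.0  # guard against constant features
    return mean, std

train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mean, std = fit_standardizer(train)
train_scaled = [(v - mean) / std for v in train]
test_scaled = [(v - mean) / std for v in test]  # reuse the TRAIN mean/std!
```

Refitting the standardizer on the test set would leak test statistics and silently shift the feature scale the classifier was trained on.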
Tips and Tricks
- RBF kernel is a good default
- For parameters, try exponential sequences
- Read: Chih-Wei Hsu et al., A Practical Guide to Support Vector Classification, Bioinformatics (2010)
Parameter Tuning
Given a classification task:
- Which kernel?
- Which kernel parameter values?
- Which value for C?
Try different combinations and take the best: Grid Search
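Grid search over exponential sequences can be sketched as below (the ranges follow common practice from guides like Hsu et al.; the scoring function is a stand-in for cross-validated accuracy):

```python
def grid_search(score, Cs, gammas):
    # Evaluate every (C, gamma) combination and keep the best-scoring one.
    best = max((score(C, g), C, g) for C in Cs for g in gammas)
    return best[1], best[2]

# Exponential sequences, e.g. C in 2^-5 .. 2^15, gamma in 2^-15 .. 2^3:
Cs = [2.0 ** k for k in range(-5, 16, 2)]
gammas = [2.0 ** k for k in range(-15, 4, 2)]

# Toy score with a known optimum at C = 2, gamma = 2^-3:
def toy_score(C, g):
    return -abs(C - 2.0) - abs(g - 0.125)

best_C, best_gamma = grid_search(toy_score, Cs, gammas)
```

In practice the score would be cross-validated accuracy of an SVM trained with those parameters, and a second, finer grid is often run around the best coarse-grid point.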
[Figure: toy decision tree choosing between "call friend", "read book", "watch tv", and "play computer" via questions such as "got electricity?" and "got new dvd?"]
Decision Trees
- Fast training
- Fast prediction
- Easy to understand
- Easy to interpret
Decision Tree - Idea
[Figure: axis-aligned splits partition the feature space into regions A-E]
(Bishop, Pattern Recognition and Machine Learning, Springer, 2006)
Decision Tree - Prediction
[Figure: a query descends the tree from the root to one of the leaf regions A-E]
Decision Tree - Training
Learn the tree structure:
- which feature to query
- which threshold to choose
[Figure: tree with leaf regions A-E]
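Choosing the feature and threshold for one node can be sketched as an exhaustive scan that minimizes the weighted Gini impurity of the two children (the data and class names are made up for illustration):

```python
def gini(labels):
    # Gini impurity of one node: 1 - sum of squared class fractions.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(points, labels):
    # Try every feature and every observed threshold; keep the split
    # whose children have the lowest weighted impurity.
    best = (None, None, float("inf"))
    for f in range(len(points[0])):
        for threshold in sorted({x[f] for x in points}):
            left = [y for x, y in zip(points, labels) if x[f] <= threshold]
            right = [y for x, y in zip(points, labels) if x[f] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best[2]:
                best = (f, threshold, score)
    return best

points = [(1.0, 5.0), (2.0, 4.0), (7.0, 5.5), (8.0, 4.5)]
labels = ["A", "A", "B", "B"]
feature, threshold, impurity = best_split(points, labels)
```

Real implementations sort each feature once and sweep thresholds incrementally instead of recomputing counts, but the criterion is the same.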
Node Purity
[Figure: candidate splits compared by the class counts they produce in regions A-E]
When to Stop
- node contains only one class
- node contains less than x data points
- max depth is reached
- node purity is sufficient
- you start to overfit => cross-validation
Decision
Trees
-
Disadvantages
Sensi%ve
to
small
changes
in
the
data
Overtng
Only
axis
aligned
splits
Decision Trees vs SVM
(Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2009)
Wisdom of Crowds
"The collective knowledge of a diverse and independent body of people typically exceeds the knowledge of any single individual, and can be harnessed by voting." (James Surowiecki)
http://socialmedia4srm.wordpress.com/
Ensemble Methods
- A single decision tree does not perform well
- But, it is super fast
- What if we learn multiple trees?
[Figure: decision boundaries of multiple trees combined in (x1, x2)]
Bagging
- Reduces overfitting (variance)
- Normally uses one type of classifier
- Decision trees are popular
- Easy to parallelize
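The core of bagging is the bootstrap sample: each ensemble member trains on n points drawn with replacement, so on average only about 63% of the distinct points appear in any one sample. A minimal sketch (sample size and seed are arbitrary):

```python
import random

random.seed(0)

def bootstrap_sample(n):
    # Draw n indices with replacement from 0..n-1.
    return [random.randrange(n) for _ in range(n)]

n = 1000
sample = bootstrap_sample(n)
unique_fraction = len(set(sample)) / n  # close to 1 - 1/e, about 0.632
```

Each tree would be trained on `[data[i] for i in bootstrap_sample(n)]`, and predictions are combined by majority vote.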
Boosting
- Also an ensemble method like bagging
- But: weak learners evolve over time, votes are weighted
[Figure: boosted decision boundary in (x1, x2)]
AdaBoost
- Initialize weights for data points
- For each iteration:
  - Fit classifier to training data
  - Compute weighted classification error
  - Compute weight for classifier from the error
  - Update weights for data points
- Final classifier is weighted sum of all single classifiers
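The steps above can be sketched with 1-D threshold stumps as weak learners (a toy setup, not the slide's example; the data and iteration count are made up):

```python
import math

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1, 1, -1, -1, 1, 1]  # not separable by any single threshold

def make_stump(threshold, sign):
    return lambda x: sign if x < threshold else -sign

stumps = [make_stump(t + 0.5, s) for t in range(6) for s in (1, -1)]

n = len(xs)
weights = [1.0 / n] * n          # initialize weights for data points
ensemble = []                    # (alpha, stump) pairs
for _ in range(5):               # for each iteration:
    def weighted_error(h):       # fit: pick stump with lowest weighted error
        return sum(w for w, x, y in zip(weights, xs, ys) if h(x) != y)
    h = min(stumps, key=weighted_error)
    err = max(weighted_error(h), 1e-10)
    alpha = 0.5 * math.log((1 - err) / err)  # classifier weight from error
    ensemble.append((alpha, h))
    # Update data point weights: up-weight the mistakes, then renormalize.
    weights = [w * math.exp(-alpha * y * h(x))
               for w, x, y in zip(weights, xs, ys)]
    total = sum(weights)
    weights = [w / total for w in weights]

def predict(x):  # final classifier: weighted vote of all stumps
    return 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1
```

After a few rounds the weighted vote of stumps fits this non-separable pattern, even though every individual stump is wrong on at least a third of the points.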
AdaBoost
(Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2009)
AdaBoost
- Introduced by Freund and Schapire in 1995
- Worked great, nobody understood why!
http://www.andrewbuntine.com/articles/about/fun
Random Forest
- All trees are fully grown, no pruning
- Two parameters: number of trees, number of features
Random Forest Error Rate
Error depends on:
- Correlation between trees (higher is worse)
- Strength of single trees (higher is better)
Out of Bag Error
[Figure: each bootstrap sample trains a tree; the unused data points serve as that tree's test set]
- Very similar to cross-validation
- Measured during training
- Can be too optimistic
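The out-of-bag set for one tree is simply whatever the bootstrap sample missed; a minimal sketch (sample size and seed are arbitrary):

```python
import random

random.seed(1)
n = 20
bootstrap = [random.randrange(n) for _ in range(n)]  # indices drawn with replacement
in_bag = set(bootstrap)
out_of_bag = set(range(n)) - in_bag  # roughly 37% of points, free test set for this tree
```

The OOB error averages, over all points, each tree's error on the points it never saw during training, which is why it behaves like built-in cross-validation.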
Variable Importance
- Again use out-of-bag samples
- Predict class for these samples
- Randomly permute values of one feature
- Predict classes again
- Measure decrease in accuracy
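The permutation idea above can be sketched with a stand-in classifier: feature 0 carries the signal and feature 1 is noise, so shuffling feature 0 should hurt accuracy more (the data and model are made up for illustration):

```python
import random

random.seed(2)
points = [(float(i), random.random()) for i in range(10)]
labels = [1 if x[0] >= 5 else -1 for x in points]

def classify(x):
    # Toy model that only looks at feature 0.
    return 1 if x[0] >= 5 else -1

def accuracy(pts):
    return sum(classify(x) == y for x, y in zip(pts, labels)) / len(pts)

def permuted_accuracy(pts, feature):
    # Shuffle one feature column, leave the others untouched.
    column = [x[feature] for x in pts]
    random.shuffle(column)
    shuffled = [tuple(column[i] if f == feature else v
                      for f, v in enumerate(x))
                for i, x in enumerate(pts)]
    return accuracy(shuffled)

baseline = accuracy(points)                    # 1.0 by construction
drop_f0 = baseline - permuted_accuracy(points, 0)
drop_f1 = baseline - permuted_accuracy(points, 1)
```

In a random forest the same measurement is done per tree on its out-of-bag samples and averaged, which avoids reusing training points.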
Tempting Scenario
- Run random forest with all features
- Reduce number of features based on importance weights
- Run again with reduced feature set and report out-of-bag error
[Figure: oversampling and subsampling of the training data]
Random Forest
- Similar to bagging
- Easy to parallelize
- Packaged with some neat functions:
  - Out-of-bag error
  - Feature importance measure
  - Proximity estimation
Cascade Classifier
- Ensemble methods are strong
- But prediction is slow
- Solution: make prediction faster
- Idea: build a cascade
Cascade Classifier
http://en.wikipedia.org/wiki/Viola%E2%80%93Jones_object_detection_framework
Viola Jones Face Detection
http://cvdazzle.com/
Viola Jones Face Detection
- Takes long to train
- Prediction in real time!
TPR and FPR
Confusion matrix:
             predicted 1   predicted -1
true  1          tp            fn (false negative)
true -1          fp            tn
- True Positive Rate: TPR = tp / (tp + fn)
- False Positive Rate: FPR = fp / (fp + tn)
Precision Recall
             predicted 1   predicted -1
true  1          tp            fn
true -1          fp            tn
- Recall: tp / (tp + fn)
- Precision: tp / (tp + fp)

Precision Recall Curve
[Figure: precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1)]
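The four counts and the derived rates can be computed directly from the +1/-1 labels used in the slides (the example vectors are made up):

```python
true_labels = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
predicted   = [1, 1, 1, -1, 1, -1, -1, -1, -1, -1]

# Confusion matrix counts.
tp = sum(t == 1 and p == 1 for t, p in zip(true_labels, predicted))
fn = sum(t == 1 and p == -1 for t, p in zip(true_labels, predicted))
fp = sum(t == -1 and p == 1 for t, p in zip(true_labels, predicted))
tn = sum(t == -1 and p == -1 for t, p in zip(true_labels, predicted))

tpr = recall = tp / (tp + fn)   # true positive rate = recall
fpr = fp / (fp + tn)            # false positive rate
precision = tp / (tp + fp)
```

Note that TPR and recall are the same quantity; the pairings differ only in whether the second axis is FPR (ROC analysis) or precision.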
Comparison
- Usual case: increasing β (in the F_β measure) allocates more weight to recall
Clustering Evaluation Criteria
- Based on expert knowledge
- Debatable for real data
- Hidden/unknown structures could be present
- Do we even want to just reproduce known structure?
Rand Index
- Percentage of correct classifications
- Compare pairs of elements: tp, tn, fn, fp
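A minimal sketch of the pair-counting idea: a pair counts as correct when the clustering and the ground truth agree on whether its two elements belong together (the labels and clusters are made up):

```python
from itertools import combinations

def rand_index(truth, clustering):
    # Fraction of pairs on which truth and clustering agree
    # (both put the pair together, or both keep it apart).
    agree = 0
    pairs = list(combinations(range(len(truth)), 2))
    for i, j in pairs:
        same_truth = truth[i] == truth[j]
        same_cluster = clustering[i] == clustering[j]
        if same_truth == same_cluster:
            agree += 1
    return agree / len(pairs)

truth      = ["red", "red", "green", "green", "blue"]
clustering = [0, 0, 1, 1, 1]
ri = rand_index(truth, clustering)
```

Here the clustering merges "green" and "blue", so the two pairs that mix those classes count against it: 8 of 10 pairs agree, giving a Rand index of 0.8.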
Random sample: red: 4/10, green: 3/10, blue: 3/10
Misclassification: red: 4/10 * (3/10 + 3/10)
[Figure: true vs. wrong class prediction]
Gini Impurity
- Number of classes: m
- Number of data points: n
- Number of data points of class i: n_i
- I_G = sum over i of (n_i / n) * (1 - n_i / n)
http://en.wikipedia.org/wiki/C4.5_algorithm
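The quantities above translate directly into code: with n_i points of class i out of n, the class fractions are f_i = n_i / n and the impurity is the sum of f_i * (1 - f_i):

```python
def gini_impurity(class_counts):
    # class_counts[i] = number of data points of class i in the node.
    n = sum(class_counts)
    fractions = [c / n for c in class_counts]
    return sum(f * (1 - f) for f in fractions)  # equivalently 1 - sum(f^2)

pure = gini_impurity([10, 0])   # all points in one class: impurity 0
mixed = gini_impurity([5, 5])   # 50/50 split: worst case for two classes
```

This is the probability of mislabeling a randomly drawn point if labels were assigned at random according to the node's class fractions, exactly the red/green/blue calculation above.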