An Introduction To Machine Learning
Machine Learning
Pierre Geurts
Outline
Introduction
Supervised Learning
http://www.darpa.mil/grandchallenge/
http://www.netflixprize.com
Data:
Applications
...
Related fields
Problem definition → Data generation → Raw data → Preprocessing (normalization, missing values, feature selection/extraction, ...) → Preprocessed data → Machine learning → Hypothesis → Validation → Knowledge / predictive model
Glossary
            VAR 1  VAR 2  VAR 3  VAR 4  VAR 5  VAR 6  ...  VAR 11  ...
Object 1      0      1      2      0      1      1    ...    0     ...
Object 2      2      1      2      0      1      1    ...    2     ...
Object 3      0      0      1      0      1      1    ...    2     ...
Object 4      1      1      2      2      0      0    ...    1     ...
Object 5      0      1      0      2      1      0    ...    1     ...
Object 6      0      1      2      1      1      1    ...    1     ...
Object 7      2      1      0      1      1      2    ...    1     ...
Object 8      2      2      1      0      0      0    ...    2     ...
Object 9      1      1      0      1      0      0    ...    1     ...
Object 10     1      2      2      0      1      0    ...    1     ...
...
Dimension=number of variables
Size=number of objects
Outline
Introduction
Supervised Learning
Introduction
Supervised learning

Learning sample (inputs A1, A2, A3, A4; output Y):

A1     A2     A3   A4     Y
-0.69  -0.72  Y    0.47   Healthy
-2.3   -1.2   N    0.15   Disease
0.32   -0.9   N    -0.76  Healthy
0.37   -1     Y    -0.59  Disease
-0.67  -0.53  N    0.33   Healthy
0.51   -0.09  Y    -0.05  Disease

From the learning sample, supervised learning builds a model (hypothesis) Y = h(A1,A2,A3,A4).
Predictive: make predictions for a new sample described by its attributes.

A1     A2     A3   A4     Y
0.83   -0.54  T    0.68   Healthy
-2.3   -1.2   F    -0.83  Disease
0.08   0.63   F    0.76   Healthy
0.06   -0.29  T    -0.57  Disease
-0.98  -0.18  F    -0.38  Healthy
-0.68  0.82   T    -0.95  Disease
0.92   -0.33  F    -0.48  ?
Informative: help to understand the relationship between the inputs and the output, e.g.
Y=disease if A3=F and A2<0.3
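A minimal sketch of the informative use: the rule above, taken literally, applied to the rows of the table (the dictionaries and the predict helper are illustrative, not part of the slides):

```python
# Minimal sketch: apply the rule "Y = Disease if A3 = F and A2 < 0.3".
# The samples below are the rows of the table above; the last one is the new sample (Y = ?).
samples = [
    {"A1": 0.83,  "A2": -0.54, "A3": "T", "A4": 0.68},
    {"A1": -2.3,  "A2": -1.2,  "A3": "F", "A4": -0.83},
    {"A1": 0.08,  "A2": 0.63,  "A3": "F", "A4": 0.76},
    {"A1": 0.06,  "A2": -0.29, "A3": "T", "A4": -0.57},
    {"A1": -0.98, "A2": -0.18, "A3": "F", "A4": -0.38},
    {"A1": -0.68, "A2": 0.82,  "A3": "T", "A4": -0.95},
    {"A1": 0.92,  "A2": -0.33, "A3": "F", "A4": -0.48},  # the new sample
]

def predict(sample):
    """Return the class predicted by the rule."""
    if sample["A3"] == "F" and sample["A2"] < 0.3:
        return "Disease"
    return "Healthy"

for i, s in enumerate(samples, start=1):
    print(i, predict(s))
```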
Example of applications

Patients:

A1     A2     ...  A4     Y
-0.61  0.23   ...  0.49   Healthy
-2.3   -1.2   ...  -0.11  Disease
-0.82  -0.41  ...  0.24   Healthy
-0.74  -0.1   ...  -0.15  Disease
-0.14  0.98   ...  -0.13  Healthy
-0.37  0.27   ...  -0.67  Disease
Outline
Introduction
Supervised Learning
Introduction
Illustrative problem
Medical diagnosis from two measurements (e.g., weight and temperature):

M1    M2    Y
0.52  0.18  Healthy
0.44  0.29  Disease
0.89  0.88  Healthy
0.99  0.37  Disease
...   ...   ...
0.95  0.47  Disease
0.29  0.09  Healthy

Figure: the learning sample plotted in the (M1, M2) plane.
Learning algorithm: applies an optimization strategy to the learning sample and returns a model obtained by supervised learning.
Figure: the learning sample in the (G1, G2) plane.
Linear model
h(M1,M2) = Disease if w0 + w1*M1 + w2*M2 > 0, Normal otherwise
Figure: the corresponding linear decision boundary in the (M1, M2) plane.
Quadratic model
h(M1,M2) = Disease if w0 + w1*M1 + w2*M2 + w3*M1² + w4*M2² > 0, Normal otherwise
Figure: the corresponding quadratic decision boundary in the (M1, M2) plane.
Figure: decision boundaries obtained on the same data with a quadratic model and a neural network.
Learning-sample (LS) and test-sample (TS) errors:
quadratic model: LS error = 3.4%, TS error = 3.5%
neural network: LS error = 1.0%, TS error = 1.5%
a third model: LS error = 0%, TS error = 3.5% (perfect fit on the learning sample, but no better generalization)
Upside: very simple, computationally efficient.
Downside:
Upside:
Downside: high variance.
Cross-validation: split the learning sample into several subsets; for each subset, learn the model on the objects that are not in the subset and test it on that subset; average the resulting errors (see the sketch after this list).
Test set: hold part of the data out, learn the model on the remainder, and estimate its error on the held-out objects.
Leave-one-out: cross-validation where each subset contains a single object.
Rule of thumb:
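A minimal sketch of k-fold cross-validation under these definitions; the train and predict arguments are hypothetical placeholders for any supervised learning method:

```python
import numpy as np

def cross_validation_error(X, y, train, predict, k=10, seed=0):
    """Estimate the error of a learning algorithm by k-fold cross-validation.

    train(X, y) must return a model; predict(model, X) must return predictions.
    Both are hypothetical placeholders for any supervised learning method.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))      # shuffle the objects
    folds = np.array_split(indices, k)     # k (almost) equal-sized subsets
    errors = []
    for fold in folds:
        test_mask = np.zeros(len(y), dtype=bool)
        test_mask[fold] = True
        # learn the model on the objects that are NOT in the subset...
        model = train(X[~test_mask], y[~test_mask])
        # ...and test it on the subset
        y_pred = predict(model, X[test_mask])
        errors.append(np.mean(y_pred != y[test_mask]))
    return float(np.mean(errors))
```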
Figure: LS error and CV error as a function of model complexity; the CV error is minimal at the optimal complexity, beyond which the model over-fits.
Figure: CV error as a function of complexity for several algorithms (Algo 1 to Algo 4).
Example  True class  Model 1   Model 2
1        Negative    Positive  Negative
2        Negative    Negative  Negative
3        Negative    Positive  Positive
4        Negative    Positive  Negative
5        Negative    Negative  Negative
6        Negative    Negative  Negative
7        Negative    Negative  Positive
8        Negative    Negative  Negative
9        Negative    Negative  Negative
10       Positive    Positive  Positive
11       Positive    Positive  Negative
12       Positive    Positive  Positive
13       Positive    Positive  Positive
14       Positive    Negative  Negative
15       Positive    Positive  Negative
Various criteria
Error rate = (FP+FN)/(N+P)
Accuracy = (TP+TN)/(N+P) = 1 - Error rate
Sensitivity = TP/P (aka recall)
Specificity = TN/(TN+FP)
Precision = TP/(TP+FP) (aka PPV)
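A small sketch applying these criteria to Model 1 and Model 2 from the table above (plain Python, no library assumed):

```python
# True classes and predictions for the 15 examples of the table above.
truth  = ["Negative"] * 9 + ["Positive"] * 6
model1 = ["Positive", "Negative", "Positive", "Positive", "Negative", "Negative",
          "Negative", "Negative", "Negative", "Positive", "Positive", "Positive",
          "Positive", "Negative", "Positive"]
model2 = ["Negative", "Negative", "Positive", "Negative", "Negative", "Negative",
          "Positive", "Negative", "Negative", "Positive", "Negative", "Positive",
          "Positive", "Negative", "Negative"]

def criteria(truth, pred):
    tp = sum(t == "Positive" and p == "Positive" for t, p in zip(truth, pred))
    tn = sum(t == "Negative" and p == "Negative" for t, p in zip(truth, pred))
    fp = sum(t == "Negative" and p == "Positive" for t, p in zip(truth, pred))
    fn = sum(t == "Positive" and p == "Negative" for t, p in zip(truth, pred))
    p, n = tp + fn, tn + fp
    return {
        "error rate":  (fp + fn) / (p + n),
        "accuracy":    (tp + tn) / (p + n),
        "sensitivity": tp / p,          # aka recall
        "specificity": tn / (tn + fp),
        "precision":   tp / (tp + fp),  # aka PPV
    }

print("Model 1:", criteria(truth, model1))
print("Model 2:", criteria(truth, model2))
```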
Figure: illustration of precision versus recall (sensitivity).
Outline
Introduction
k-NN
Linear methods
Decision trees
Ensemble methods
Accuracy:
Efficiency:
Interpretability:
Object  M1    M2    Y
1       0.32  0.81  Healthy
2       0.15  0.38  Disease
3       0.39  0.34  Healthy
4       0.62  0.11  Disease
5       0.92  0.43  ?
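A minimal 1-nearest-neighbour sketch on the table above, assuming Euclidean distance and k = 1 for the illustration:

```python
import math

# Learning sample (objects 1-4) and the object to classify (object 5).
learning_sample = [
    ((0.32, 0.81), "Healthy"),
    ((0.15, 0.38), "Disease"),
    ((0.39, 0.34), "Healthy"),
    ((0.62, 0.11), "Disease"),
]
query = (0.92, 0.43)  # object 5, class unknown

def one_nn(query, learning_sample):
    """Predict the class of the nearest object (Euclidean distance)."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    _, label = min(learning_sample, key=lambda item: dist(item[0], query))
    return label

# Here the nearest neighbour of object 5 is object 4.
print(one_nn(query, learning_sample))
```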
Figure: LS error and CV error as a function of k; the optimal k is the one minimizing the CV error (too small a k over-fits).
Small exercise
Andrew Moore
k-NN
Advantages:
very simple
Drawbacks:
Linear methods
Fit a linear model to the learning sample by minimizing the regularized squared error Σ_i (y_i − wᵀx_i)² + λ‖w‖².
Closed-form solution: w = (XᵀX + λI)⁻¹ Xᵀy, where X is the input matrix and y is the output vector.
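A minimal NumPy sketch of this closed-form solution, assuming the regularized least-squares reading of the formula above (lam = 0 would give ordinary least squares):

```python
import numpy as np

def fit_linear(X, y, lam=0.1):
    """Closed-form solution w = (X'X + lam*I)^-1 X'y of regularized least squares."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Toy example: y is (approximately) a linear function of two inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=50)
w = fit_linear(X, y)
print(w)        # close to [2, -1]
print(X @ w)    # predictions on the learning sample
```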
Example: perceptron
Linear methods
Advantages:
simple
Drawbacks:
Non-linear extensions
Generalized (basis-function) models: y = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + ... + w_n φ_n(x)
Artificial neural networks: y = g( Σ_j W_j g( Σ_i w_{i,j} x_i ) )
Kernel methods: y = Σ_i w_i φ_i(x), rewritten as y = Σ_j α_j k(x_j, x)
Figure: a single artificial neuron. The inputs A1, ..., AN are multiplied by weights w1, ..., wN, summed together with a bias term w0, and passed through a tanh activation, whose output lies between -1 and +1:
Y = tanh(w1*A1 + w2*A2 + ... + wN*AN + w0)
Hypothesis space: multi-layer perceptron (neurons organized in successive layers, up to the output layer).
Learning:
Choose a structure.
Tune the value of the parameters (connections between neurons) so as to minimize the learning sample error (a minimal sketch follows).
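A minimal sketch of the model such a network computes, with one hidden layer of tanh neurons; the weights below are random placeholders, whereas in practice they are tuned to minimize the learning sample error (typically by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)

# Structure: 4 inputs -> 3 hidden tanh neurons -> 1 tanh output neuron.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)   # hidden-layer connections (placeholders)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # output-layer connections (placeholders)

def mlp(x):
    """Forward pass of a one-hidden-layer perceptron."""
    hidden = np.tanh(W1 @ x + b1)
    return np.tanh(W2 @ hidden + b2)

x = np.array([0.83, -0.54, 1.0, 0.68])  # an input vector (A1, A2, A3, A4)
print(mlp(x))                           # output in [-1, 1]
```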
Illustrative example
Figure: decision boundaries in the (G1, G2) plane obtained with 1 hidden neuron, 2+2 hidden neurons, and 10+10 hidden neurons.
Advantages:
Universal approximators
Drawbacks:
Scalability
Linear classifier
Mathematically
Decision function: y = +1 if wᵀx + b ≥ 0, y = -1 otherwise
Non-linear boundary
Solution: map the inputs into a higher-dimensional feature space, e.g. (x1, x2) → (x1², x2²), where a linear separation becomes possible.
Intuitively:
Mathematically:
Mathematically
Minimize (1/2)‖w‖² subject to y_i (⟨w, x_i⟩ + b) ≥ 1, i = 1, ..., N.
Dual form: minimize (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ − Σ_i α_i subject to α_i ≥ 0 and Σ_i α_i y_i = 0, with w = Σ_i α_i y_i x_i.
Decision function: y = +1 if ⟨w, x⟩ = Σ_i α_i y_i ⟨x_i, x⟩ = Σ_i α_i y_i k(x_i, x) ≥ 0, y = -1 otherwise.
The same SVM algorithm can be applied to any type of data for which a kernel (a pairwise similarity between objects) can be defined: it only uses the kernel matrix and the class labels.

Numerical data:
   G1    G2     Y
1  0.21  0.64   C1
2  0.57  0.26   C2
3  0.21  0.68   C2
4  0.69  0.52   C1
5  0.83  0.96   C1
6  0.48  -0.52  C2

String data:
   G1           Y
1  ACGCTCTATAG  C1
2  ACTCGCTTAGA  C2
3  GTCTCTGAGAG  C2
4  CGCTAGCGTCG  C1
5  CGATCAGCAGC  C1
6  GCTCGCGCTCG  C2

Kernel matrix:
     1     2     3     4     5      6
1    1     0.14  0.96  0.17  0.01   0.24
2    0.14  1     0.02  0.7   0.22   0.67
3    0.96  0.02  1     0.15  0.27   0.07
4    0.17  0.7   0.15  1     0.37   0.55
5    0.01  0.22  0.27  0.37  1     -0.25
6    0.24  0.67  0.07  0.55  -0.25  1

Kernel matrix + class labels → SVM algorithm → classification model.
Examples of kernels
Linear kernel: k(x,x') = ⟨x,x'⟩
Polynomial kernel: k(x,x') = (⟨x,x'⟩ + 1)^d
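A small NumPy sketch computing these two kernels on the numerical objects of the earlier table; note that the kernel used on that slide is not specified, so the values will differ from its kernel matrix:

```python
import numpy as np

# Numerical objects (G1, G2) from the table above.
X = np.array([[0.21, 0.64],
              [0.57, 0.26],
              [0.21, 0.68],
              [0.69, 0.52],
              [0.83, 0.96],
              [0.48, -0.52]])

def linear_kernel(X):
    """k(x, x') = <x, x'> for every pair of objects."""
    return X @ X.T

def polynomial_kernel(X, d=2):
    """k(x, x') = (<x, x'> + 1)^d for every pair of objects."""
    return (X @ X.T + 1.0) ** d

K = polynomial_kernel(X, d=2)
print(K.shape)   # (6, 6) symmetric kernel matrix
```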
h(x1,x2,...,xK) = C1 if w0 + w1*x1 + w2*x2 + ... + wK*xK > 0, C2 otherwise
Figure: |w| plotted for each of the variables.
SVM parameters
Kernel's parameters:
Advantages:
Drawbacks:
Decision trees
Figure: a decision tree. The root tests attribute A1 (branches a11, a12, a13), internal nodes test A2 (a21, a22) and A3 (a31, a32), and each leaf carries a class label (c1 or c2).
Day  Outlook   Temperature  Humidity  Wind    Play Tennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         Normal    Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         High      Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Hot          Normal    Weak    Yes
D10  Rain      Mild         Normal    Strong  Yes
D11  Sunny     Cool         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
The learned decision tree:
Outlook = Sunny:    Humidity = High → No;  Humidity = Normal → Yes
Outlook = Overcast: Yes
Outlook = Rain:     Wind = Strong → No;  Wind = Weak → Yes

Classify a new day with the tree:
Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Weak → Play Tennis = ?
Figure: the learning sample split according to the value of Outlook.

Outlook = Sunny:
Day  Temp.  Humidity  Wind    Play
D1   Hot    High      Weak    No
D2   Hot    High      Strong  No
D8   Mild   High      Weak    No
D9   Hot    Normal    Weak    Yes
D11  Cool   Normal    Strong  Yes

Outlook = Overcast:
Day  Temp.  Humidity  Wind    Play
D3   Hot    High      Weak    Yes
D7   Cool   High      Strong  Yes
D12  Mild   High      Strong  Yes
D13  Hot    Normal    Weak    Yes

Outlook = Rain:
Day  Temp.  Humidity  Wind    Play
D4   Mild   Normal    Weak    Yes
D5   Cool   Normal    Weak    Yes
D6   Cool   Normal    Strong  No
D10  Mild   Normal    Strong  Yes
D14  Mild   High      Strong  No
Which attribute is the best split? Figure: a node containing [29+,35-] objects can be split by one attribute into [21+,5-] and [8+,30-], or by A2 into [18+,33-] and [11+,2-].
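A small sketch scoring the two candidate splits above; the slides do not spell out the splitting criterion at this point, so Shannon entropy / information gain is assumed:

```python
import math

def entropy(pos, neg):
    """Shannon entropy (in bits) of a node containing pos '+' and neg '-' objects."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = sum(p + q for p, q in children)
    remainder = sum((p + q) / n * entropy(p, q) for p, q in children)
    return entropy(*parent) - remainder

parent = (29, 35)
print(information_gain(parent, [(21, 5), (8, 30)]))   # first candidate split
print(information_gain(parent, [(18, 33), (11, 2)]))  # split on A2
```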
Figure: LS error and CV error as a function of the number of nodes in the tree; the CV error is minimal at the optimal complexity, beyond which the tree over-fits.
Post-pruning
Figure: 1. tree growing drives the LS error down (towards over-fitting); 2. tree pruning cuts the tree back to the optimal complexity, where the CV error is minimal (smaller trees under-fit).
Numerical variables are handled with threshold tests of the form X > 65.4? (yes/no branches).
Illustrative example
Figure: a decision tree grown on the two-measurement medical data. The root test is M2 < 0.33? (yes → Healthy); further tests (M1 < 0.91?, M2 < 0.91?, M1 < 0.23?, M2 < 0.75?, M2 < 0.49?, M2 < 0.65?) lead to leaves labeled Healthy or Sick, corresponding to an axis-parallel partition of the (M1, M2) plane.
Regression trees
Trees for regression problems: exactly the same model, but with a number in each leaf instead of a class.
Figure: an example regression tree on the tennis data, with tests on Outlook, Humidity, Wind and Temperature (< 71 / > 71) and numerical predictions in the leaves (45.6, 22.3, 1.2, 64.4, 7.4, 3.4).
Interpretability
Attribute selection / attribute importance.
Figure: importance ranking of the attributes (Outlook, Humidity, Wind, Temperature).
Advantages:
Drawbacks:
Ensemble methods
Figure: several models are learned from the data; their individual predictions (e.g., Sick, Healthy, Sick, Sick) are aggregated, for instance by majority vote, into a single prediction.
Bagging: motivation
Figure: bagging learns several models from perturbed (bootstrap) versions of the learning sample and combines their predictions (Sick, Healthy, Sick, Sick, ...).
Bootstrap sampling: draw N objects from the learning sample of size N, uniformly at random and with replacement.

Original learning sample:
#   G1     G2     Y
1   0.74   0.68   Healthy
2   0.78   0.45   Disease
3   0.86   0.09   Healthy
4   0.2    0.61   Disease
5   0.2    -5.6   Healthy
6   0.32   0.6    Disease
7   -0.34  -0.45  Healthy
8   0.89   -0.34  Disease
9   0.1    0.3    Healthy
10  -0.34  -0.65  Healthy

Drawn indices: 3, 7, 2, 9, 3, 10, 1, 8, 6, 10

Bootstrap sample:
#   G1     G2     Y
3   0.86   0.09   Healthy
7   -0.34  -0.45  Healthy
2   0.78   0.45   Disease
9   0.1    0.3    Healthy
3   0.86   0.09   Healthy
10  -0.34  -0.65  Healthy
1   0.74   0.68   Healthy
8   0.89   -0.34  Disease
6   0.32   0.6    Disease
10  -0.34  -0.65  Healthy
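A minimal NumPy sketch of bootstrap sampling and of bagging built on top of it; train and predict are hypothetical placeholders for any base learner, and the drawn indices depend on the random seed, so they will not match the 3, 7, 2, 9, ... of the slide:

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw N objects with replacement from a learning sample of size N."""
    n = len(y)
    idx = rng.integers(0, n, size=n)   # indices drawn uniformly, with replacement
    return X[idx], y[idx]

def bagging_predict(X, y, x_new, train, predict, T=100, seed=0):
    """Bagging: learn T models on T bootstrap samples and aggregate by majority vote.

    train(X, y) returns a model; predict(model, x) returns its prediction.
    Both are hypothetical placeholders for any base learner.
    """
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(T):
        Xb, yb = bootstrap_sample(X, y, rng)
        model = train(Xb, yb)
        votes.append(predict(model, x_new))
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]   # majority vote
```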
Boosting
Adaboost:
Boosting
Figure: models are built sequentially on re-weighted versions of the learning sample (LS1, LS2, ..., LST); each model receives a weight (w1, w2, ..., wT) and the weighted predictions (Healthy, Sick, Healthy, ...) are combined into a final prediction.
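A minimal Adaboost sketch using scikit-learn (assumed available); the dataset is synthetic, just to keep the snippet self-contained:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Synthetic two-class problem (stand-in for the medical data of the slides).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

# Adaboost re-weights the learning sample at each step so that the next base
# model (by default a depth-1 decision tree) focuses on the objects the
# previous models misclassified; the final prediction is a weighted vote.
model = AdaBoostClassifier(n_estimators=50).fit(X, y)
print("learning-sample accuracy:", model.score(X, y))
```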
Figure: error comparison between a single decision tree and ensemble methods (random forests, k=85, T=500) on a 72-sample dataset; the reported errors are 22.2% (16/72), 9.7% (7/72), 5.5% (4/72) and 1.4% (1/72). A second panel shows the importance of the variables.
Method comparison
kNN, decision trees (DT), linear methods, ensemble methods, ANN and SVM are each rated from + to +++ on accuracy, efficiency and the other criteria above.
Note:
Outline
Introduction
Supervised Learning
Introduction
Graph predictions
Sequence labeling
Image segmentation
Decomposition:
Outline
Introduction
Supervised learning
Semi-supervised learning
Transductive learning
Active learning
Reinforcement learning
Unsupervised learning
Examples:
Biomedical domain
Speech analysis
Image categorization/segmentation
Network measurement
Semi-supervised learning
Learning sample in which only a few objects are labeled:

A1     A2     A3   A4     Y
0.01   0.37   T    0.54   Healthy
-2.3   -1.2   F    0.37   Disease
0.69   -0.78  F    0.63   Healthy
-0.56  -0.89  T    -0.42
-0.85  0.62   F    -0.05
-0.17  0.09   T    0.29
-0.09  0.3    F    0.17
Some approaches
Self-training
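A minimal self-training sketch: learn on the labeled objects, label the unlabeled objects the model is most confident about, and re-learn on the enlarged sample. The train and predict_proba arguments are hypothetical placeholders for any base classifier:

```python
import numpy as np

def self_training(X_lab, y_lab, X_unlab, train, predict_proba,
                  n_iter=10, threshold=0.95):
    """Iteratively add confidently self-labeled objects to the learning sample.

    train(X, y) returns a model; predict_proba(model, X) returns an array of
    class probabilities (both are placeholders for any supervised method).
    """
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    for _ in range(n_iter):
        model = train(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        proba = predict_proba(model, X_unlab)
        conf = proba.max(axis=1)        # confidence of the predicted class
        keep = conf >= threshold        # only the most confident predictions
        if not keep.any():
            break
        X_lab = np.vstack([X_lab, X_unlab[keep]])
        y_lab = np.concatenate([y_lab, proba[keep].argmax(axis=1)])
        X_unlab = X_unlab[~keep]
    return train(X_lab, y_lab)
```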
Some approaches
Graph-based algorithms
Transductive learning
Active learning
Goal:
Reinforcement learning
Learning from interactions: the agent observes state s0, takes action a0 and receives reward r0, then observes s1, takes a1 and receives r1, then s2, a2, r2, and so on.
RL approaches
Examples of applications
Unsupervised learning
The data consist only of input variables (A2, A3, ..., A19, ...) measured on a set of objects; there is no output to predict.
Dimensionality reduction: project the data from a high-dimensional space down to a small number of dimensions.
Clustering
Clustering
Clustering rows: grouping similar objects (clusters of objects).
Clustering columns: grouping similar variables (clusters of variables).
Bi-clustering / two-way clustering: grouping objects that are similar across a subset of variables (a bi-cluster).
Clustering
Clustering algorithms
hierarchical clustering
K-means
Hierarchical clustering
Agglomerative clustering (a sketch follows):
1. Each object is assigned to its own cluster.
2. Iteratively, the two closest clusters are merged, until a single cluster remains.
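A minimal sketch of agglomerative clustering using SciPy (assumed available): linkage performs the iterative merging and fcluster cuts the resulting hierarchy into a chosen number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups of objects.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(5, 2)),
               rng.normal(1.0, 0.1, size=(5, 2))])

# Agglomerative clustering: start with one cluster per object and iteratively
# merge the two closest clusters ("average" linkage = mean pairwise distance).
Z = linkage(X, method="average")

# Cut the hierarchy into 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```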
Hierarchical clustering
(wikipedia)
Dendrogram
Hierarchical clustering
Strengths
Limitations
k-Means clustering
Find k cluster centers c_1, ..., c_k minimizing Σ_{j=1..k} Σ_{o ∈ Cluster_j} d²(o, c_j).
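A minimal NumPy sketch of the k-means algorithm, which alternates between assigning each object to its closest center and moving each center to the mean of its cluster, thereby decreasing the criterion above:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Plain k-means: returns the cluster centers and the assignment of each object."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initial centers
    for _ in range(n_iter):
        # assign every object o to the closest center c_j (squared distance)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # move each center to the mean of the objects assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```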
k-Means clustering
Strengths: simple, understandable.
Limitations: sensitive to outliers.
Figure: k-means can converge to a suboptimal clustering.
Example: a data matrix with many variables (A2, A3, ..., A10, ...) is reduced to two new variables, the scores of each object on the first two principal components PC1 and PC2.
Objectives of PCA
to identify outliers
Basic idea
Figure: the first component is the direction of largest variance of the data; the second component is orthogonal to the first.
Figure: PCA applied to a data matrix. Each sample gets a score on each principal component, e.g.:

PC1    PC2
0.62   -0.33
-2.3   -1.2
0.88   0.31
-0.18  -0.05
-0.39  -0.01
-0.61  0.53
PC1 = 0.2*A1 + 3.4*A2 - 4.5*A3    VAR(PC1) = 4.5 → 45%
PC2 = 0.4*A4 + 5.6*A5 + 2.3*A7    VAR(PC2) = 3.3 → 33%
...
Loading of a variable: gives an idea of its importance in the component; can be used for selecting biomarkers.
...
For each component, we have a measure of the percentage of the variance of the initial data that it contains.
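A minimal NumPy sketch of PCA: center the data, take a singular value decomposition, and read off the scores, the loadings and the percentage of variance carried by each component:

```python
import numpy as np

def pca(X):
    """Return (scores, loadings, explained_variance_ratio) of the data matrix X."""
    Xc = X - X.mean(axis=0)                 # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                          # coordinates of each sample on the PCs
    loadings = Vt                           # weight of each variable in each PC
    explained = s ** 2 / np.sum(s ** 2)     # fraction of the total variance per PC
    return scores, loadings, explained

# Toy data: 6 samples described by 5 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))
scores, loadings, explained = pca(X)
print(scores[:, :2])    # PC1 and PC2 score of each sample
print(loadings[0])      # loadings of the variables on PC1
print(explained[:2])    # percentage of variance of PC1 and PC2
```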
Books
Other textbooks
Books
Software
Pepito: www.pepite.be
WEKA: http://www.cs.waikato.ac.nz/ml/weka/
The Spider (Matlab toolbox): http://www.kyb.mpg.de/bs/people/spider/
BNT, Bayes Net Toolbox (Matlab): http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html
Journals
Machine Learning
Neural Computation
Annals of Statistics
...
Conferences
...