This document provides an overview of key concepts for designing and analyzing machine learning experiments: (1) factors that influence experimental results and strategies of experimentation, such as response surface design; (2) resampling techniques, such as K-fold cross-validation, used to evaluate and compare learning algorithms; and (3) common performance measures and statistical analyses, such as hypothesis tests, confidence intervals, and analysis of variance.

Lecture Slides for

INTRODUCTION
TO
MACHINE
LEARNING
3RD EDITION
ETHEM ALPAYDIN
© The MIT Press, 2014

[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 19: DESIGN AND ANALYSIS OF MACHINE LEARNING EXPERIMENTS
Introduction

- Questions:
  - Assessment of the expected error of a learning algorithm: is the error rate of 1-NN less than 2%?
  - Comparing the expected errors of two algorithms: is k-NN more accurate than MLP?
- Training/validation/test sets
- Resampling methods: K-fold cross-validation
Algorithm Preference

- Criteria (application-dependent):
  - Misclassification error, or risk (loss functions)
  - Training time/space complexity
  - Testing time/space complexity
  - Interpretability
  - Easy programmability
- Cost-sensitive learning
Factors and Response

- The response function is based on the output to be maximized
- It depends on controllable factors
- Uncontrollable factors introduce randomness
- Goal: find the configuration of controllable factors that maximizes the response and is minimally affected by uncontrollable factors
Strategies of Experimentation

- How do we search the factor space?
- Response surface design: approximate and maximize the response function in terms of the controllable factors
Guidelines for ML Experiments

A. Aim of the study
B. Selection of the response variable
C. Choice of factors and levels
D. Choice of experimental design
E. Performing the experiment
F. Statistical analysis of the data
G. Conclusions and recommendations
Resampling and K-Fold Cross-Validation

- The need for multiple training/validation sets:
  $\{X_i, V_i\}_i$: training/validation sets of fold $i$
- K-fold cross-validation: divide $X$ into $K$ parts $X_i$, $i = 1, \ldots, K$:

  $V_1 = X_1, \quad T_1 = X_2 \cup X_3 \cup \cdots \cup X_K$
  $V_2 = X_2, \quad T_2 = X_1 \cup X_3 \cup \cdots \cup X_K$
  $\vdots$
  $V_K = X_K, \quad T_K = X_1 \cup X_2 \cup \cdots \cup X_{K-1}$

- Any two training sets $T_i$ share $K - 2$ of the parts
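As an illustration (not part of the original slides), here is a minimal Python sketch of building the K training/validation index splits described above; the dataset size and K are arbitrary assumptions.

```python
import numpy as np

def k_fold_splits(n_instances, K, seed=0):
    """Split indices 0..n_instances-1 into K validation parts;
    each training set is the union of the other K-1 parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_instances)
    parts = np.array_split(indices, K)          # X_1, ..., X_K
    folds = []
    for i in range(K):
        val = parts[i]                                                    # V_i = X_i
        train = np.concatenate([parts[j] for j in range(K) if j != i])   # T_i = union of the rest
        folds.append((train, val))
    return folds

# Example: 100 instances, 10-fold CV; any two training sets share K-2 parts.
for train_idx, val_idx in k_fold_splits(100, K=10):
    pass  # train on train_idx, evaluate on val_idx
```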
5×2 Cross-Validation

- 5 times 2-fold cross-validation (Dietterich, 1998), where $X_i^{(j)}$ denotes half $j$ of replication $i$:

  $T_1 = X_1^{(1)}, \quad V_1 = X_1^{(2)}$
  $T_2 = X_1^{(2)}, \quad V_2 = X_1^{(1)}$
  $T_3 = X_2^{(1)}, \quad V_3 = X_2^{(2)}$
  $T_4 = X_2^{(2)}, \quad V_4 = X_2^{(1)}$
  $\vdots$
  $T_9 = X_5^{(1)}, \quad V_9 = X_5^{(2)}$
  $T_{10} = X_5^{(2)}, \quad V_{10} = X_5^{(1)}$
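A minimal sketch (my own illustration, not from the slides) of generating the ten training/validation pairs of 5×2 cross-validation: each of the 5 replications shuffles the data and splits it into two halves that swap roles.

```python
import numpy as np

def five_by_two_splits(n_instances, seed=0):
    """Return the 10 (train, val) index pairs of 5x2 cross-validation."""
    rng = np.random.default_rng(seed)
    pairs = []
    for i in range(5):                                     # replications i = 1..5
        perm = rng.permutation(n_instances)
        half1, half2 = perm[: n_instances // 2], perm[n_instances // 2 :]
        pairs.append((half1, half2))                       # T = X_i^(1), V = X_i^(2)
        pairs.append((half2, half1))                       # T = X_i^(2), V = X_i^(1)
    return pairs

splits = five_by_two_splits(200)
assert len(splits) == 10
```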
Bootstrapping

- Draw instances from a dataset with replacement
- The probability that we do not pick a particular instance after $N$ draws is

  $\left(1 - \frac{1}{N}\right)^N \approx e^{-1} \approx 0.368$

  that is, roughly 36.8% of the instances are never drawn ("new" and usable for validation), so a bootstrap sample contains only about 63.2% of the original instances
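A quick numerical check of the 36.8% figure (illustrative only; the dataset size is arbitrary): draw N instances with replacement and count how many of the originals were never picked.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
sample = rng.integers(0, N, size=N)           # N draws with replacement
never_picked = N - np.unique(sample).size     # instances that were never drawn
print(never_picked / N)                       # close to e**-1 ~ 0.368
```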


Performance Measures

- Error rate = # of errors / # of instances = (FN + FP) / N
- Recall = # of found positives / # of positives = TP / (TP + FN) = sensitivity = hit rate
- Precision = # of found positives / # of found = TP / (TP + FP)
- Specificity = TN / (TN + FP)
- False alarm rate = FP / (FP + TN) = 1 − specificity
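These measures follow directly from the confusion counts; a small sketch (the function name and example counts are mine, not from the slides):

```python
def binary_measures(tp, fp, tn, fn):
    """Compute the performance measures above from confusion-matrix counts."""
    n = tp + fp + tn + fn
    return {
        "error rate": (fn + fp) / n,
        "recall (sensitivity, hit rate)": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "false alarm rate": fp / (fp + tn),   # = 1 - specificity
    }

print(binary_measures(tp=40, fp=10, tn=45, fn=5))
```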
ROC Curve

[Figures: example ROC curves]

Precision and Recall

[Figure: precision and recall example]
Interval Estimation

- $X = \{x^t\}_t$ where $x^t \sim \mathcal{N}(\mu, \sigma^2)$
- The sample mean $m \sim \mathcal{N}(\mu, \sigma^2/N)$, so

  $\sqrt{N}\,\frac{(m - \mu)}{\sigma} \sim Z$

  $P\left\{-1.96 < \sqrt{N}\,\frac{(m - \mu)}{\sigma} < 1.96\right\} = 0.95$

  $P\left\{m - 1.96\,\frac{\sigma}{\sqrt{N}} < \mu < m + 1.96\,\frac{\sigma}{\sqrt{N}}\right\} = 0.95$

  $P\left\{m - z_{\alpha/2}\,\frac{\sigma}{\sqrt{N}} < \mu < m + z_{\alpha/2}\,\frac{\sigma}{\sqrt{N}}\right\} = 1 - \alpha$
  (the 100(1 − α) percent two-sided confidence interval)

- 100(1 − α) percent one-sided confidence interval:

  $P\left\{\sqrt{N}\,\frac{(m - \mu)}{\sigma} < 1.64\right\} = 0.95$

  $P\left\{m - 1.64\,\frac{\sigma}{\sqrt{N}} < \mu\right\} = 0.95$

  $P\left\{m - z_{\alpha}\,\frac{\sigma}{\sqrt{N}} < \mu\right\} = 1 - \alpha$

- When $\sigma^2$ is not known, use the sample variance:

  $S^2 = \frac{\sum_t (x^t - m)^2}{N - 1}, \qquad \sqrt{N}\,\frac{(m - \mu)}{S} \sim t_{N-1}$

  $P\left\{m - t_{\alpha/2,\,N-1}\,\frac{S}{\sqrt{N}} < \mu < m + t_{\alpha/2,\,N-1}\,\frac{S}{\sqrt{N}}\right\} = 1 - \alpha$
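A sketch of both intervals with scipy.stats (the sample values and the "known" σ are made up): the z interval when σ is known, and the t interval when it is estimated by S.

```python
import numpy as np
from scipy import stats

x = np.array([0.12, 0.15, 0.11, 0.18, 0.14, 0.16, 0.13, 0.17])   # e.g. validation error rates
N, m, S = len(x), x.mean(), x.std(ddof=1)
alpha = 0.05

# sigma known (assume sigma = 0.02 for illustration): m +/- z_{alpha/2} * sigma / sqrt(N)
sigma = 0.02
z = stats.norm.ppf(1 - alpha / 2)
print("z interval:", (m - z * sigma / np.sqrt(N), m + z * sigma / np.sqrt(N)))

# sigma unknown: m +/- t_{alpha/2, N-1} * S / sqrt(N)
t = stats.t.ppf(1 - alpha / 2, df=N - 1)
print("t interval:", (m - t * S / np.sqrt(N), m + t * S / np.sqrt(N)))
```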
Hypothesis Testing

- Reject a null hypothesis if it is not supported by the sample with enough confidence
- $X = \{x^t\}_t$ where $x^t \sim \mathcal{N}(\mu, \sigma^2)$
- Two-sided test: $H_0: \mu = \mu_0$ vs. $H_1: \mu \ne \mu_0$
  Accept $H_0$ with level of significance $\alpha$ if $\mu_0$ is in the 100(1 − α) percent confidence interval, i.e., if

  $\frac{\sqrt{N}\,(m - \mu_0)}{\sigma} \in \left(-z_{\alpha/2},\; z_{\alpha/2}\right)$

- One-sided test: $H_0: \mu \le \mu_0$ vs. $H_1: \mu > \mu_0$. Accept $H_0$ if

  $\frac{\sqrt{N}\,(m - \mu_0)}{\sigma} \in \left(-\infty,\; z_{\alpha}\right)$

- Variance unknown: use $t$ instead of $z$. Accept $H_0: \mu = \mu_0$ if

  $\frac{\sqrt{N}\,(m - \mu_0)}{S} \in \left(-t_{\alpha/2,\,N-1},\; t_{\alpha/2,\,N-1}\right)$
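A sketch of the two-sided test with unknown variance (all values are placeholders): compute $\sqrt{N}(m - \mu_0)/S$ and compare it with the t critical values.

```python
import numpy as np
from scipy import stats

x = np.array([0.12, 0.15, 0.11, 0.18, 0.14, 0.16, 0.13, 0.17])   # sample, e.g. error rates
mu0, alpha = 0.10, 0.05
N, m, S = len(x), x.mean(), x.std(ddof=1)

t_stat = np.sqrt(N) * (m - mu0) / S
t_crit = stats.t.ppf(1 - alpha / 2, df=N - 1)
accept_h0 = -t_crit < t_stat < t_crit            # H0: mu = mu0
print(t_stat, t_crit, "accept H0:", accept_h0)
```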
Assessing Error: $H_0: p \le p_0$ vs. $H_1: p > p_0$

- Single training/validation set: the binomial test
- If the error probability is $p_0$, the probability that there are $e$ errors or fewer in $N$ validation trials is

  $P\{X \le e\} = \sum_{j=0}^{e} \binom{N}{j} p_0^{\,j} (1 - p_0)^{N-j}$

- Accept $H_0$ if this probability is less than $1 - \alpha$ (i.e., $e$ does not fall in the upper $\alpha$ tail)

[Figure: binomial distribution of the number of errors for N = 100, e = 20, with the 1 − α region marked]
Normal Approximation to the Binomial

- The number of errors $X$ is approximately normal with mean $Np_0$ and variance $Np_0(1 - p_0)$:

  $\frac{X - Np_0}{\sqrt{Np_0(1 - p_0)}} \;\sim\; Z \quad \text{(approximately)}$

- Accept $H_0$ if this value for $X = e$ is less than $z_{1-\alpha}$
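An illustrative sketch of the exact binomial test and its normal approximation. N = 100 and e = 20 echo the example figure above; p0 and α are my assumptions.

```python
import numpy as np
from scipy import stats

N, e, p0, alpha = 100, 20, 0.15, 0.05

# Exact binomial test: accept H0 (p <= p0) if P{X <= e} < 1 - alpha
cdf = stats.binom.cdf(e, N, p0)
print("P{X <= e} =", cdf, "accept H0:", cdf < 1 - alpha)

# Normal approximation: z = (e - N*p0) / sqrt(N*p0*(1-p0)), accept if below the upper critical value
z = (e - N * p0) / np.sqrt(N * p0 * (1 - p0))
print("z =", z, "accept H0:", z < stats.norm.ppf(1 - alpha))
```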
Paired t Test

- Multiple training/validation sets
- $x_i^t = 1$ if instance $t$ is misclassified on fold $i$; the error rate of fold $i$ is

  $p_i = \frac{\sum_{t=1}^{N} x_i^t}{N}$

- With $m$ and $S^2$ the average and variance of the $p_i$, we accept that the error is $p_0$ or less if

  $\frac{\sqrt{K}\,(m - p_0)}{S} \sim t_{K-1}$

  is less than $t_{\alpha,\,K-1}$
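A sketch with made-up fold error rates: test whether the expected error is at most p0 over K folds.

```python
import numpy as np
from scipy import stats

p = np.array([0.18, 0.21, 0.17, 0.22, 0.19, 0.20, 0.18, 0.23, 0.21, 0.19])   # fold error rates
p0, alpha = 0.20, 0.05
K, m, S = len(p), p.mean(), p.std(ddof=1)

t_stat = np.sqrt(K) * (m - p0) / S
accept_h0 = t_stat < stats.t.ppf(1 - alpha, df=K - 1)   # H0: expected error <= p0
print(t_stat, "accept H0:", accept_h0)
```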
Comparing Classifiers: $H_0: \mu_0 = \mu_1$ vs. $H_1: \mu_0 \ne \mu_1$

- Single training/validation set: McNemar's test
- $e_{01}$: number of instances misclassified by 1 but not by 2; $e_{10}$: misclassified by 2 but not by 1
- Under $H_0$, we expect $e_{01} = e_{10} = (e_{01} + e_{10})/2$:

  $\frac{\left(|e_{01} - e_{10}| - 1\right)^2}{e_{01} + e_{10}} \sim \chi_1^2$

- Accept $H_0$ if the statistic is less than $\chi_{\alpha,1}^2$
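A sketch of McNemar's test given the two disagreement counts (the counts are placeholders):

```python
from scipy import stats

e01, e10, alpha = 30, 45, 0.05                     # made-up disagreement counts
chi2_stat = (abs(e01 - e10) - 1) ** 2 / (e01 + e10)
chi2_crit = stats.chi2.ppf(1 - alpha, df=1)        # chi^2_{alpha,1}
print(chi2_stat, chi2_crit, "accept H0:", chi2_stat < chi2_crit)
```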


K-Fold CV Paired t Test

- Use K-fold cross-validation to get K training/validation folds
- $p_i^1, p_i^2$: errors of classifiers 1 and 2 on fold $i$; $p_i = p_i^1 - p_i^2$ is the paired difference on fold $i$
- The null hypothesis is that $p_i$ has mean 0:

  $H_0: \mu = 0$ vs. $H_1: \mu \ne 0$

  $m = \frac{\sum_{i=1}^{K} p_i}{K}, \qquad s^2 = \frac{\sum_{i=1}^{K} (p_i - m)^2}{K - 1}$

  $\frac{\sqrt{K}\,(m - 0)}{s} = \frac{\sqrt{K}\,m}{s} \sim t_{K-1}$

- Accept $H_0$ if the statistic is in $\left(-t_{\alpha/2,\,K-1},\; t_{\alpha/2,\,K-1}\right)$
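A sketch with made-up per-fold errors for two classifiers; this is the standard paired t test applied to the fold differences.

```python
import numpy as np
from scipy import stats

p1 = np.array([0.20, 0.22, 0.19, 0.24, 0.21, 0.23, 0.20, 0.25, 0.22, 0.21])  # classifier 1 fold errors
p2 = np.array([0.22, 0.21, 0.23, 0.25, 0.24, 0.22, 0.23, 0.26, 0.24, 0.23])  # classifier 2 fold errors
alpha = 0.05

d = p1 - p2                                         # paired differences p_i
K, m, s = len(d), d.mean(), d.std(ddof=1)
t_stat = np.sqrt(K) * m / s                         # ~ t_{K-1} under H0
t_crit = stats.t.ppf(1 - alpha / 2, df=K - 1)
print(t_stat, "accept H0 (equal error):", -t_crit < t_stat < t_crit)
# equivalently: stats.ttest_rel(p1, p2)
```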
5×2 cv Paired t Test

- Use 5×2 cv to get 2 folds of 5 training/validation replications (Dietterich, 1998)
- $p_i^{(j)}$: difference between the errors of classifiers 1 and 2 on fold $j = 1, 2$ of replication $i = 1, \ldots, 5$

  $\bar{p}_i = \frac{p_i^{(1)} + p_i^{(2)}}{2}, \qquad s_i^2 = \left(p_i^{(1)} - \bar{p}_i\right)^2 + \left(p_i^{(2)} - \bar{p}_i\right)^2$

  $t = \frac{p_1^{(1)}}{\sqrt{\sum_{i=1}^{5} s_i^2 / 5}} \sim t_5$

- Two-sided test: accept $H_0: \mu_0 = \mu_1$ if $t \in \left(-t_{\alpha/2,\,5},\; t_{\alpha/2,\,5}\right)$
- One-sided test: accept $H_0: \mu_0 \le \mu_1$ if $t < t_{\alpha,\,5}$
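Under the definitions above, a sketch of the 5×2 cv paired t statistic from a 5×2 array of error differences (the numbers are invented):

```python
import numpy as np
from scipy import stats

# p[i, j]: difference in error of the two classifiers on fold j of replication i
p = np.array([[ 0.02, -0.01],
              [ 0.03,  0.01],
              [ 0.01,  0.02],
              [ 0.02,  0.00],
              [ 0.01,  0.03]])
alpha = 0.05

p_bar = p.mean(axis=1)                                 # mean difference per replication
s2 = (p[:, 0] - p_bar) ** 2 + (p[:, 1] - p_bar) ** 2   # s_i^2
t_stat = p[0, 0] / np.sqrt(s2.sum() / 5)               # ~ t_5 under H0
t_crit = stats.t.ppf(1 - alpha / 2, df=5)
print(t_stat, "accept H0:", -t_crit < t_stat < t_crit)
```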
5×2 cv Paired F Test

  $f = \frac{\sum_{i=1}^{5} \sum_{j=1}^{2} \left(p_i^{(j)}\right)^2}{2 \sum_{i=1}^{5} s_i^2} \sim F_{10,5}$

- Two-sided test: accept $H_0: \mu_0 = \mu_1$ if $f < F_{\alpha,\,10,\,5}$
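The corresponding F statistic, computed from the same kind of 5×2 difference array (again with invented numbers):

```python
import numpy as np
from scipy import stats

p = np.array([[ 0.02, -0.01],
              [ 0.03,  0.01],
              [ 0.01,  0.02],
              [ 0.02,  0.00],
              [ 0.01,  0.03]])
alpha = 0.05

p_bar = p.mean(axis=1)
s2 = (p[:, 0] - p_bar) ** 2 + (p[:, 1] - p_bar) ** 2
f_stat = (p ** 2).sum() / (2 * s2.sum())               # ~ F_{10,5} under H0
f_crit = stats.f.ppf(1 - alpha, dfn=10, dfd=5)
print(f_stat, "accept H0:", f_stat < f_crit)
```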


Comparing L > 2 Algorithms: Analysis of Variance (ANOVA)

- $H_0: \mu_1 = \mu_2 = \cdots = \mu_L$
- Errors of $L$ algorithms on $K$ folds:

  $X_{ij} \sim \mathcal{N}(\mu_j, \sigma^2), \quad j = 1, \ldots, L, \quad i = 1, \ldots, K$

- We construct two estimators of $\sigma^2$. One is valid only if $H_0$ is true; the other is always valid. We reject $H_0$ if the two estimators disagree.

If $H_0$ is true:

  $m_j = \sum_{i=1}^{K} \frac{X_{ij}}{K} \sim \mathcal{N}(\mu, \sigma^2 / K)$

  $m = \frac{\sum_{j=1}^{L} m_j}{L}, \qquad S^2 = \frac{\sum_j (m_j - m)^2}{L - 1}$

  Thus an estimator of $\sigma^2$ is $K \cdot S^2$, namely

  $\hat{\sigma}^2 = K \cdot \frac{\sum_{j=1}^{L} (m_j - m)^2}{L - 1}$

  $\sum_j \frac{(m_j - m)^2}{\sigma^2 / K} \sim \chi^2_{L-1}, \qquad \mathrm{SS}_b \equiv K \sum_j (m_j - m)^2$

  So when $H_0$ is true,

  $\frac{\mathrm{SS}_b}{\sigma^2} \sim \chi^2_{L-1}$

Regardless of $H_0$, our second estimator of $\sigma^2$ is the average of the group variances $S_j^2$:

  $S_j^2 = \frac{\sum_{i=1}^{K} (X_{ij} - m_j)^2}{K - 1}, \qquad \hat{\sigma}^2 = \sum_{j=1}^{L} \frac{S_j^2}{L} = \sum_j \sum_i \frac{(X_{ij} - m_j)^2}{L(K - 1)}$

  $\mathrm{SS}_w \equiv \sum_j \sum_i (X_{ij} - m_j)^2$

  $\frac{(K - 1)\,S_j^2}{\sigma^2} \sim \chi^2_{K-1}, \qquad \frac{\mathrm{SS}_w}{\sigma^2} \sim \chi^2_{L(K-1)}$

  $\left(\frac{\mathrm{SS}_b / \sigma^2}{L - 1}\right) \Big/ \left(\frac{\mathrm{SS}_w / \sigma^2}{L(K - 1)}\right) = \frac{\mathrm{SS}_b / (L - 1)}{\mathrm{SS}_w / (L(K - 1))} \sim F_{L-1,\; L(K-1)}$

- Reject $H_0: \mu_1 = \mu_2 = \cdots = \mu_L$ if this statistic is greater than $F_{\alpha,\; L-1,\; L(K-1)}$
ANOVA Table

[Table: the standard ANOVA table of sources of variation, sums of squares, degrees of freedom, and mean squares]

- If ANOVA rejects, we do pairwise posthoc tests:

  $H_0: \mu_i = \mu_j$ vs. $H_1: \mu_i \ne \mu_j$

  $t = \frac{m_i - m_j}{\sqrt{2\,\hat{\sigma}_w^2 / K}} \sim t_{L(K-1)}, \qquad \hat{\sigma}_w^2 = \frac{\mathrm{SS}_w}{L(K - 1)}$
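A sketch of the ANOVA F test and the pairwise posthoc comparisons on made-up errors of L = 3 algorithms over K = 10 folds. The posthoc statistic uses the pooled within-group variance (the form consistent with the $t_{L(K-1)}$ distribution stated above); this is my reconstruction, not copied from the slides.

```python
import numpy as np
from scipy import stats

# errors[j, i]: error of algorithm j on fold i (L = 3 algorithms, K = 10 folds, made-up numbers)
rng = np.random.default_rng(0)
errors = np.vstack([0.20 + 0.02 * rng.standard_normal(10),
                    0.21 + 0.02 * rng.standard_normal(10),
                    0.25 + 0.02 * rng.standard_normal(10)])
L, K = errors.shape
alpha = 0.05

m_j = errors.mean(axis=1)                     # group means
m = m_j.mean()                                # overall mean
SSb = K * ((m_j - m) ** 2).sum()              # between-group sum of squares
SSw = ((errors - m_j[:, None]) ** 2).sum()    # within-group sum of squares
F = (SSb / (L - 1)) / (SSw / (L * (K - 1)))
F_crit = stats.f.ppf(1 - alpha, dfn=L - 1, dfd=L * (K - 1))
print("F =", F, "reject H0:", F > F_crit)     # same F as stats.f_oneway(*errors)

# If ANOVA rejects: pairwise posthoc t tests using the pooled within-group variance
sw2 = SSw / (L * (K - 1))
for i in range(L):
    for j in range(i + 1, L):
        t = (m_j[i] - m_j[j]) / np.sqrt(2 * sw2 / K)
        print(i, j, t, "differ:", abs(t) > stats.t.ppf(1 - alpha / 2, df=L * (K - 1)))
```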
Comparison over Multiple Datasets

- Comparing two algorithms:
  Sign test: count how many times A beats B over N datasets, and check whether this could have occurred by chance if A and B had the same error rate
- Comparing multiple algorithms:
  Kruskal-Wallis test: calculate the average rank of all algorithms over the N datasets, and check whether these could have occurred by chance if they all had equal error
  If KW rejects, we do pairwise posthoc tests to find which pairs have a significant rank difference
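A sketch of both tests on made-up error rates over 12 datasets; the third algorithm for the Kruskal-Wallis example is also invented.

```python
import numpy as np
from scipy import stats

# error rates of algorithms A and B on N = 12 datasets (made-up numbers)
err_a = np.array([0.12, 0.20, 0.31, 0.15, 0.22, 0.18, 0.25, 0.10, 0.28, 0.19, 0.21, 0.16])
err_b = np.array([0.14, 0.23, 0.30, 0.18, 0.25, 0.20, 0.27, 0.12, 0.29, 0.22, 0.20, 0.19])

# Sign test: one-sided p-value for "A beats B" under a fair coin
wins = int((err_a < err_b).sum())
n = int((err_a != err_b).sum())
p_value = stats.binom.sf(wins - 1, n, 0.5)     # P{#wins >= observed | Binomial(n, 0.5)}
print("sign test p-value:", p_value)

# Kruskal-Wallis for L > 2 algorithms (third algorithm invented for illustration)
err_c = err_b + 0.02
print("Kruskal-Wallis:", stats.kruskal(err_a, err_b, err_c))
```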
Multivariate Tests

- Instead of testing with a single performance measure, e.g., error, use multiple measures for better discrimination, e.g., [fp-rate, fn-rate]
- Compare p-dimensional distributions
- Parametric case: assume p-variate Gaussians
Multivariate Pairwise Comparison

- Paired differences of the two classifiers' p-dimensional performance vectors on each fold
- Hotelling's multivariate $T^2$ test
- For p = 1, it reduces to the paired t test
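The slides do not show the formulas here; the following is a sketch of the standard one-sample Hotelling $T^2$ test applied to paired [fp-rate, fn-rate] differences over K = 10 folds, with made-up values.

```python
import numpy as np
from scipy import stats

# d[i]: paired difference of [fp-rate, fn-rate] between two classifiers on fold i (made-up)
d = np.array([[ 0.02, -0.01], [ 0.01,  0.00], [ 0.03, -0.02], [ 0.02,  0.01], [ 0.01, -0.01],
              [ 0.02,  0.00], [ 0.00, -0.01], [ 0.03,  0.01], [ 0.01, -0.02], [ 0.02,  0.00]])
K, p = d.shape
alpha = 0.05

d_bar = d.mean(axis=0)
S = np.cov(d, rowvar=False)                    # sample covariance of the differences
T2 = K * d_bar @ np.linalg.solve(S, d_bar)     # Hotelling's T^2 statistic
F = (K - p) / (p * (K - 1)) * T2               # ~ F_{p, K-p} under H0: mean difference is 0
print(F, "reject H0:", F > stats.f.ppf(1 - alpha, dfn=p, dfd=K - p))
```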


Multivariate ANOVA

- Comparison of L > 2 algorithms
