Mining Process
STIN5084 KNOWLEDGE DISCOVERY AND DATA MINING
TOPIC 4: MINING PROCESS
Issues in Comparing Results
• Statistical reliability of estimated differences in performance (significance tests)
• Choice of performance measure:
  – Number of correct classifications
  – Accuracy of probability estimates
  – Error in numeric predictions
• Costs assigned to different types of errors
  – Many practical applications involve costs
Training and Testing
• Test set: independent instances that have played no part in the formation of the classifier
• Assumption: both training data and test data are representative samples of the underlying problem
• Common holdout splits: 70/30, 80/20, or 90/10 (training/test)
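As a minimal sketch of a simple holdout split, assuming scikit-learn and the Iris data purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the instances as an independent test set (a 70/30 split);
# stratify=y keeps class proportions similar in training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
```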
Cross-validation
• K-fold cross-validation avoids overlapping test sets
  – First step: split data into k subsets of equal size
  – Second step: use each subset in turn for testing, the remainder for training
  – This means the learning algorithm is applied to k different training sets
• Often the subsets are stratified before the cross-validation is performed to yield stratified k-fold cross-validation
• The error estimates are averaged to yield an overall error estimate; also, the standard deviation is often computed
• Alternatively, predictions and actual target values from the k folds are pooled to compute one estimate
  – Does not yield an estimate of standard deviation
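A minimal sketch of stratified 10-fold cross-validation with scikit-learn; the dataset and the decision-tree classifier are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each of the k = 10 stratified folds is used once for testing while the
# remaining nine folds form the training set.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=cv)

# Average the per-fold error estimates and report the standard deviation.
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```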
Leave-one-out cross-validation
• Leave-one-out: a particular form of k-fold cross-validation
  – Set the number of folds to the number of training instances
  – I.e., for n training instances, build the classifier n times
• Makes best use of the data
• Involves no random subsampling
• Very computationally expensive (exception: using lazy classifiers such as the nearest-neighbor classifier)
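A sketch of leave-one-out cross-validation, using the nearest-neighbor classifier mentioned above (dataset again assumed for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# One fold per instance: the classifier is built n times, each time
# leaving out a single instance for testing.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=LeaveOneOut())
print(f"leave-one-out accuracy: {scores.mean():.3f}")
```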
Leave-one-out CV and stratification
Bootstrap
• CV uses sampling without replacement
  – The same instance, once selected, cannot be selected again for a particular training/test set
• The bootstrap uses sampling with replacement to form the training set
  – Sample a dataset of n instances n times with replacement to form a new dataset of n instances
  – Use this data as the training set
  – Use the instances from the original dataset that do not occur in the new training set for testing
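A minimal sketch of one bootstrap iteration, assuming NumPy; the dataset and classifier are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
n = len(X)
rng = np.random.default_rng(0)

# Sample n instances with replacement to form the bootstrap training set.
train_idx = rng.choice(n, size=n, replace=True)

# Instances that were never drawn ("out-of-bag") form the test set.
test_idx = np.setdiff1d(np.arange(n), train_idx)

clf = KNeighborsClassifier().fit(X[train_idx], y[train_idx])
print("out-of-bag accuracy:", clf.score(X[test_idx], y[test_idx]))
```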
Hyperparameter selection
• Hyperparameter: parameter that can be tuned to optimize the performance of a learning algorithm
  – Different from a basic parameter that is part of a model, such as a coefficient in a linear regression model
  – Example hyperparameter: k in the k-nearest neighbour classifier
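As an illustrative sketch, tuning k for a k-nearest-neighbour classifier on a held-out validation set (the dataset and split size are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Try several values of the hyperparameter k and keep the one that performs
# best on the held-out validation set (not on the training set itself).
best_k, best_acc = None, 0.0
for k in (1, 3, 5, 7, 9):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc
print("chosen k:", best_k)
```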
Hyperparameters and cross-validation
• Note that k-fold cross-validation runs k different train-test evaluations
  – The above parameter tuning process using validation sets must be applied separately to each of the k training sets!
• What to do when the training sets are very small, so that performance estimates on a validation set are unreliable?
• We can use nested cross-validation (expensive!)
  – For each training set of the “outer” k-fold cross-validation, run “inner” p-fold cross-validations to choose the best hyperparameter value
  – Outer cross-validation is used to estimate quality of learning process
  – Inner cross-validations are used to choose hyperparameter values
  – Inner cross-validations are part of the learning process!
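A minimal sketch of nested cross-validation with scikit-learn, where an inner grid search chooses k and the outer folds estimate the quality of the whole learning process (dataset, classifier, and fold counts are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Inner p-fold cross-validation (here p = 5) chooses the hyperparameter k.
inner = GridSearchCV(KNeighborsClassifier(),
                     param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
                     cv=5)

# Outer k-fold cross-validation (here k = 10) estimates the quality of the
# learning process; the inner search is rerun on each outer training set.
outer_scores = cross_val_score(inner, X, y, cv=10)
print(f"nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```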
Comparing machine learning schemes
Paired t-test
• In practice, limited data and a limited number of estimates for computing the mean
• Student’s t-test tells us whether the means of two samples are significantly different
• In our case the samples are cross-validation estimates, one for each dataset we have sampled
• We can use a paired t-test because the individual samples are paired
  – The same cross-validation is applied twice, ensuring that all the training and test sets are exactly the same
Performing the test
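In its standard paired form, the test statistic is computed from the k per-fold differences $d_i = x_i - y_i$ between the two schemes’ estimates:

$t = \dfrac{\bar{d}}{\sqrt{\sigma_d^2 / k}}$

with $k - 1$ degrees of freedom, where $\bar{d}$ is the mean and $\sigma_d^2$ the sample variance of the differences.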
Unpaired observations
• If the CV estimates are from different datasets, they are no longer paired (or maybe we have k estimates for one scheme, and j estimates for the other one)
• Then we have to use an unpaired t-test with min(k, j) – 1 degrees of freedom
• The statistic for the t-test becomes:
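In the standard unpaired form, with $\bar{x}$, $\bar{y}$ the two means and $\sigma_x^2$, $\sigma_y^2$ the sample variances over the k and j estimates:

$t = \dfrac{\bar{x} - \bar{y}}{\sqrt{\sigma_x^2 / k + \sigma_y^2 / j}}$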
Counting the cost
• The confusion matrix:

                      Predicted class
                      Yes                    No
  Actual class  Yes   True positive (TP)     False negative (FN)
                No    False positive (FP)    True negative (TN)

• Different misclassification costs can be assigned to false positives and false negatives
Precision and Recall
• In retrieval terms: documents in the entire collection are relevant or irrelevant (actual class), and are retrieved (accepted) or not retrieved (rejected) by the system

TRY!

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
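A short sketch computing these measures from the confusion-matrix counts with scikit-learn; the labels and predictions are made-up illustrations:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # actual classes (1 = "yes")
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]   # predicted classes

# With labels ordered (0, 1) the confusion matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("precision:", tp / (tp + fp))   # scikit-learn's precision_score gives the same value
print("recall:   ", tp / (tp + fn))   # scikit-learn's recall_score gives the same value
```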
Aside: the kappa statistic
• Two confusion matrices for a 3-class problem: actual predictor (left) vs. random predictor (right)
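In its usual form, kappa measures how much better the actual predictor’s observed agreement $P_o$ (the proportion of instances on the diagonal of its confusion matrix) is than the agreement $P_e$ expected from the random predictor:

$\kappa = \dfrac{P_o - P_e}{1 - P_e}$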
ROC curves
• “Receiver operating characteristic”
• Used in signal detection to show the trade-off between hit rate and false alarm rate over a noisy channel
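As an illustrative sketch, an ROC curve can be traced from class-probability estimates with scikit-learn; the dataset and the logistic-regression scorer are assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Score each test instance with the estimated probability of the positive class;
# sweeping a threshold over these scores traces out false-positive rate (x-axis)
# against true-positive rate (y-axis).
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("area under the ROC curve:", roc_auc_score(y_test, scores))
```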
Summary of some measures
Evaluating numeric prediction
Other measures
• The root mean-squared error (RMSE):
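For predicted values $p_1, \dots, p_n$ and actual values $a_1, \dots, a_n$, the usual definition is:

$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (p_i - a_i)^2}$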
Correlation coefficient
• For the four example schemes A, B, C and D:
  – D best
  – C second-best
  – A, B arguable
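One standard definition of the correlation coefficient between predicted values $p_i$ and actual values $a_i$ is:

$r = \dfrac{S_{PA}}{\sqrt{S_P\, S_A}}$, where $S_{PA} = \frac{\sum_i (p_i - \bar{p})(a_i - \bar{a})}{n-1}$, $S_P = \frac{\sum_i (p_i - \bar{p})^2}{n-1}$, and $S_A = \frac{\sum_i (a_i - \bar{a})^2}{n-1}$.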