Mining Process

The document outlines the mining process in knowledge discovery and data mining, focusing on issues related to training and testing, parameter tuning, and performance evaluation. It discusses various methods for estimating model performance, including holdout estimation, cross-validation, and bootstrap techniques, as well as the importance of hyperparameter selection. Additionally, it covers the significance of comparing learning schemes and evaluating numeric predictions using different error measures.


http://www.cs.waikato.ac.nz/ml/weka/book.htm
STIN5084 KNOWLEDGE DISCOVERY AND DATA MINING

TOPIC 4: MINING PROCESS
Outline

1 Issues
2 Training & Testing
3 Parameter Tuning, Holdout, Cross-validation, Bootstrap
4 Comparing Results
Issues

• Statistical reliability of estimated differences in performance (significance tests)
• Choice of performance measure:
  • Number of correct classifications
  • Accuracy of probability estimates
  • Error in numeric predictions
• Costs assigned to different types of errors
  • Many practical applications involve costs
Training and Testing

• Natural performance measure for classification problems: error rate
  • Success: instance's class is predicted correctly
  • Error: instance's class is predicted incorrectly
  • Error rate: proportion of errors made over the whole set of instances
• Resubstitution error: error rate obtained by evaluating the model on the training data
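The gap between resubstitution error and error on an independent test set can be illustrated with a short sketch. The snippet below is illustrative only; it assumes scikit-learn, a synthetic dataset and a decision tree, none of which appear in the slides themselves.

```python
# Minimal sketch: resubstitution error vs. test error (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Resubstitution error: evaluated on the same data used for training (optimistic).
resub_error = 1 - model.score(X_train, y_train)
# Test error: evaluated on held-out instances that played no part in training.
test_error = 1 - model.score(X_test, y_test)
print(f"resubstitution error = {resub_error:.3f}, test error = {test_error:.3f}")
```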
• Test set: independent instances that have played no part in the formation of the classifier
  • Assumption: both training data and test data are representative samples of the underlying problem
  • Common training-testing splits: 70-30%, 80-20% or 90-10%
• Test and training data may differ in nature
  • Example: classifiers built using customer data from two different towns A and B
Note on parameter tuning

• It is important that the test data is not used in any way to create the classifier
• Some learning schemes operate in two stages:
  • Stage 1: build the basic structure
  • Stage 2: optimize parameter settings
• Test data cannot be used for parameter tuning
• Validation data is used to optimize parameters
• The data is therefore divided into training, validation and testing sets, e.g. 70-20-10%, 80-10-10% or 90-5-5%
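A minimal sketch of one such three-way split, assuming scikit-learn and a synthetic dataset; the 70-20-10 proportions are just one of the ratios listed above.

```python
# Illustrative 70-20-10 training/validation/test split after shuffling.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off 10% as the final test set (never used for tuning).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.10, random_state=0)
# Then split the remainder into training and validation (20% of the total = 2/9 of the rest).
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=2/9, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # roughly 700 / 200 / 100
```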
Holdout estimation

• What should we do if we only have a single dataset?
• The holdout method reserves a certain amount for testing and uses the remainder for training, after shuffling
  • Usually one third for testing and the rest for training (or another common ratio such as 90:10, 80:20 or 70:30)
• Problem: the samples might not be representative
  • Example: a class might be missing in the test data
• Advanced version uses stratification
  • Ensures that each class is represented with approximately equal proportions in both subsets
Repeated holdout method

• The holdout estimate can be made more reliable by repeating the process with different subsamples
  • In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
  • The error rates on the different iterations are averaged to yield an overall error rate
• This is called the repeated holdout method
• Still not optimal: the different test sets overlap
• Can we prevent overlapping?
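A sketch of the repeated holdout method, assuming scikit-learn: several stratified holdout splits with different random seeds, with the error rates averaged at the end.

```python
# Repeated stratified holdout, error rates averaged (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
errors = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=seed)  # stratified holdout
    model = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    errors.append(1 - model.score(X_te, y_te))

print(f"repeated-holdout error estimate: {np.mean(errors):.3f}")
```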
Cross-validation

• K-fold cross-validation avoids overlapping test sets
  • First step: split the data into k subsets of equal size
  • Second step: use each subset in turn for testing and the remainder for training
  • This means the learning algorithm is applied to k different training sets
• Often the subsets are stratified before the cross-validation is performed to yield stratified k-fold cross-validation
• The error estimates are averaged to yield an overall error estimate; the standard deviation is also often computed
• Alternatively, predictions and actual target values from the k folds are pooled to compute one estimate
  • This does not yield an estimate of the standard deviation
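A stratified 10-fold cross-validation sketch, assuming scikit-learn: the per-fold error estimates are averaged and their standard deviation reported.

```python
# Stratified 10-fold cross-validation with mean error and standard deviation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracies = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

errors = 1 - accuracies
print(f"mean error = {errors.mean():.3f}, std = {errors.std():.3f}")
```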
Leave-one-out cross-validation

• Leave-one-out is a particular form of k-fold cross-validation:
  • Set the number of folds to the number of training instances
  • I.e., for n training instances, build the classifier n times
• Makes best use of the data
• Involves no random subsampling
• Very computationally expensive (exception: lazy classifiers such as the nearest-neighbor classifier)
Leave-one-out CV and stratification

• Disadvantage of leave-one-out CV: stratification is not possible
  • It guarantees a non-stratified sample because there is only one instance in the test set!
• Extreme example: a random dataset split equally into two classes
  • Best inducer predicts the majority class
  • 50% accuracy on fresh data
  • Leave-one-out CV estimate gives 100% error!
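The extreme example above can be reproduced in a few lines; the sketch assumes scikit-learn, random labels, and a majority-class predictor.

```python
# Leave-one-out pitfall: balanced random labels + majority-class predictor.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # features carry no information
y = np.array([0] * 50 + [1] * 50)             # two equally sized classes

majority = DummyClassifier(strategy="most_frequent")
acc = cross_val_score(majority, X, y, cv=LeaveOneOut())

# Each training fold is missing exactly one instance, so the held-out class is
# always the minority in training -- every prediction is wrong.
print(f"leave-one-out accuracy = {acc.mean():.2f}")   # 0.00, i.e. 100% error
```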
Bootstrap

• CV uses sampling without replacement
  • The same instance, once selected, cannot be selected again for a particular training/test set
• The bootstrap uses sampling with replacement to form the training set
  • Sample a dataset of n instances n times with replacement to form a new dataset of n instances
  • Use this data as the training set
  • Use the instances from the original dataset that do not occur in the new training set for testing
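A bootstrap sketch under the same assumptions (NumPy/scikit-learn, synthetic data): n instances are drawn with replacement for training, and the instances never drawn form the test set.

```python
# Bootstrap sampling with replacement; out-of-bag instances used for testing.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
n = len(X)
rng = np.random.default_rng(0)

train_idx = rng.integers(0, n, size=n)                 # sampling with replacement
test_idx = np.setdiff1d(np.arange(n), train_idx)       # out-of-bag instances

model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
oob_error = 1 - model.score(X[test_idx], y[test_idx])
print(f"out-of-bag error = {oob_error:.3f}  ({len(test_idx)} test instances)")
```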
Hyperparameter selection

• Hyperparameter: a parameter that can be tuned to optimize the performance of a learning algorithm
  – Different from a basic parameter that is part of a model, such as a coefficient in a linear regression model
  – Example hyperparameter: k in the k-nearest-neighbour classifier
• Parameter tuning needs to be viewed as part of the learning algorithm and must be done using the training data only
• But how do we get a useful estimate of performance for different parameter values so that we can choose a value?
  – Answer: split the data into a smaller "training" set and a "validation" set (normally, the data is shuffled first)
  – Build models using different values of k on the new, smaller training set and evaluate them on the validation set
  – Pick the best value of k and rebuild the model on the full original training set
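A minimal sketch of this procedure for k in the k-nearest-neighbour classifier, assuming scikit-learn; the test set plays no part in choosing k.

```python
# Choosing k on a validation set, then rebuilding on the full training set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Split the training data further into a smaller training set and a validation set.
X_tr, X_val, y_tr, y_val = train_test_split(X_train_full, y_train_full, test_size=0.25, random_state=0)

scores = {k: KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_val, y_val)
          for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)

# Rebuild the model with the chosen k on the full original training set.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train_full, y_train_full)
print(f"chosen k = {best_k}, test accuracy = {final_model.score(X_test, y_test):.3f}")
```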
Hyperparameters and cross-validation

• Note that k-fold cross-validation runs k different train-test evaluations
  – The above parameter tuning process using validation sets must be applied separately to each of the k training sets!
• This means that, when hyperparameter tuning is applied, k different hyperparameter values may be selected
  – This is OK: hyperparameter tuning is part of the learning process
  – Cross-validation evaluates the quality of the learning process, not the quality of a particular model
• What to do when the training sets are very small, so that performance estimates on a validation set are unreliable?
• We can use nested cross-validation (expensive!)
  – For each training set of the "outer" k-fold cross-validation, run "inner" p-fold cross-validations to choose the best hyperparameter value
  – Outer cross-validation is used to estimate the quality of the learning process
  – Inner cross-validations are used to choose hyperparameter values
  – Inner cross-validations are part of the learning process!
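A nested cross-validation sketch, assuming scikit-learn: an inner grid search chooses k inside each outer training fold, so the tuning stays part of the learning process being evaluated.

```python
# Nested CV: inner search selects the hyperparameter, outer CV estimates quality.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Inner p-fold cross-validation: selects the hyperparameter value.
inner = GridSearchCV(KNeighborsClassifier(),
                     param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
                     cv=5)

# Outer k-fold cross-validation: estimates the quality of the whole learning process.
outer_scores = cross_val_score(inner, X, y, cv=10)
print(f"nested-CV accuracy estimate = {outer_scores.mean():.3f}")
```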
Comparing machine learning schemes

• Frequent question: which of two learning schemes performs better?
  • The answer is domain dependent
• Obvious way: compare 10-fold cross-validation estimates
• However, what about machine learning research?
  • Need to show convincingly that a particular method works better in a particular domain from which data is taken
Comparing learning schemes

• Want to show that scheme A is better than scheme B in a particular domain
  – For a given amount of training data (i.e., data size)
  – On average, across all possible training sets from that domain
• Assume we have an infinite amount of data from the domain
  – Sample infinitely many datasets of a specified size
  – Obtain a cross-validation estimate on each dataset for each scheme
  – Check if the mean accuracy for scheme A is better than the mean accuracy for scheme B
Paired t-test

• In practice, we have limited data and a limited number of estimates for computing the mean
• Student's t-test tells us whether the means of two samples are significantly different
  • In our case the samples are cross-validation estimates, one for each dataset we have sampled
• We can use a paired t-test because the individual samples are paired
  • The same cross-validation is applied twice, ensuring that all the training and test sets are exactly the same
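A paired t-test sketch, assuming SciPy; the accuracy estimates below are hypothetical, standing in for cross-validation results obtained with the same splits on several datasets.

```python
# Paired t-test on paired cross-validation estimates (hypothetical numbers).
import numpy as np
from scipy import stats

scheme_A = np.array([0.83, 0.79, 0.91, 0.86, 0.88, 0.81, 0.84, 0.90, 0.87, 0.85])
scheme_B = np.array([0.80, 0.77, 0.89, 0.85, 0.86, 0.78, 0.83, 0.88, 0.85, 0.82])

t_stat, p_value = stats.ttest_rel(scheme_A, scheme_B)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# If p is below the chosen significance level, the difference in means is significant.
```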
Performing the test

• Fix a significance level
  • If a difference is significant at the α% level, there is a (100−α)% chance that the true means differ
• Divide the significance level by two because the test is two-tailed
  • I.e., the true difference can be positive or negative
• Look up the value for z that corresponds to α/2
• Compute the value of t based on the observed performance estimates for the schemes being compared
Unpaired observations

• If the CV estimates are from different datasets, they are no longer paired (or maybe we have k estimates for one scheme and j estimates for the other one)
• Then we have to use an unpaired t-test with min(k, j) − 1 degrees of freedom
• The statistic for the t-test becomes the difference in means divided by its standard error (see the sketch below)
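An unpaired-comparison sketch, assuming SciPy/NumPy and hypothetical estimates. The statistic computed here is the difference in means divided by sqrt(s_x²/k + s_y²/j), with the conservative min(k, j) − 1 degrees of freedom mentioned above.

```python
# Unpaired t-test with k estimates for one scheme and j for the other.
import numpy as np
from scipy import stats

x = np.array([0.83, 0.79, 0.91, 0.86, 0.88])          # k = 5 hypothetical estimates
y = np.array([0.80, 0.77, 0.89, 0.85, 0.86, 0.78])    # j = 6 hypothetical estimates

t = (x.mean() - y.mean()) / np.sqrt(x.var(ddof=1)/len(x) + y.var(ddof=1)/len(y))
df = min(len(x), len(y)) - 1                           # conservative degrees of freedom
p = 2 * stats.t.sf(abs(t), df)                         # two-tailed p-value
print(f"t = {t:.2f}, df = {df}, p = {p:.3f}")
```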
Counting the cost

• The confusion matrix:

                             Predicted class
                             Yes                   No
  Actual class   Yes         True positive (TP)    False negative (FN)
                 No          False positive (FP)   True negative (TN)

• Different misclassification costs can be assigned to false positives and false negatives
Precision and Recall

[Diagram: the entire document collection divided into relevant vs. irrelevant and retrieved vs. not retrieved documents, giving four regions: retrieved & relevant, retrieved & irrelevant, not retrieved but relevant, not retrieved & irrelevant]

recall = (number of relevant documents retrieved) / (total number of relevant documents)

precision = (number of relevant documents retrieved) / (total number of documents retrieved)
Determining Recall is Difficult

• Precision vs. Recall:
  – Precision = the ability to retrieve top-ranked documents that are mostly relevant
  – Recall = the ability of the search to find all of the relevant items in the corpus
• The total number of relevant items is sometimes not available:
  – Sample across the database and perform relevance judgments on these items
  – Apply different retrieval algorithms to the same database for the same query; the aggregate of relevant items is taken as the total relevant set
Computing Recall/Precision Points: An Example

Let the total number of relevant docs = 6. Check each new recall point:

  n    doc #   relevant
  1    588     x          R = 1/6 = 0.167;  P = 1/1 = 1
  2    589     x          R = 2/6 = 0.333;  P = 2/2 = 1
  3    576
  4    590     x          R = 3/6 = 0.5;    P = 3/4 = 0.75
  5    986
  6    592     x          R = 4/6 = 0.667;  P = 4/6 = 0.667
  7    984
  8    988
  9    578
  10   985
  11   103
  12   591
  13   772     x          R = 5/6 = 0.833;  P = 5/13 = 0.38
  14   990

One relevant document is missing, so 100% recall is never reached.
Assuming cross-validation of 10 folds has been used, calculate precision and recall for the data depicted in the confusion matrix provided below. TRY!

                              Actual Class
                              Accept    Reject
  Predicted Class   Accept    422       78
                    Reject    104       164

Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
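A minimal sketch of the arithmetic, treating "Accept" as the positive class (rows of the table above are predictions, columns are actual classes):

```python
# Precision and recall from the confusion matrix above.
TP = 422   # predicted Accept, actually Accept
FP = 78    # predicted Accept, actually Reject
FN = 104   # predicted Reject, actually Accept
TN = 164   # predicted Reject, actually Reject

precision = TP / (TP + FP)
recall = TP / (TP + FN)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")  # ~0.844 and ~0.802
```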
Aside: the kappa statistic

• Two confusion matrices for a 3-class problem: actual predictor (left) vs. random predictor (right)
• Number of successes: sum of entries on the diagonal (D)
• Kappa statistic: (success rate of actual predictor − success rate of random predictor) / (1 − success rate of random predictor)
• Measures relative improvement over the random predictor: 1 means perfect accuracy, 0 means we are doing no better than random
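A kappa sketch, assuming NumPy and a hypothetical 3-class confusion matrix (the matrices from the original slide are not reproduced here): the random predictor's success rate is computed from the marginal totals.

```python
# Kappa statistic from a hypothetical 3-class confusion matrix.
import numpy as np

cm = np.array([[88, 10,  2],      # rows = actual classes
               [14, 40,  6],      # columns = predicted classes
               [18, 10, 12]])
total = cm.sum()

observed = np.trace(cm) / total                       # success rate of actual predictor
row = cm.sum(axis=1)                                  # actual class totals
col = cm.sum(axis=0)                                  # predicted class totals
expected = (row * col).sum() / total**2               # success rate of a random predictor

kappa = (observed - expected) / (1 - expected)
print(f"kappa = {kappa:.3f}")
```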
ROC curves

• ROC curves: "receiver operating characteristic"
• Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel
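A ROC sketch, assuming scikit-learn, a synthetic dataset and a logistic regression scorer: hit rate (TP rate) is plotted against false alarm rate (FP rate) as the decision threshold varies.

```python
# ROC curve points and area under the curve for a probabilistic classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)        # FP rate vs. TP rate points
print(f"AUC = {roc_auc_score(y_te, scores):.3f}")
```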
Summary of some measures

  Measure                  Domain                  Plot                       Explanation
  Lift chart               Marketing               TP vs. subset size         TP; (TP+FP)/(TP+FP+TN+FN)
  ROC curve                Communications          TP rate vs. FP rate        TP/(TP+FN); FP/(FP+TN)
  Recall-precision curve   Information retrieval   Recall vs. precision       TP/(TP+FN); TP/(TP+FP)
Evaluating numeric prediction

• Same strategies: independent test set, cross-validation, significance tests, etc.
• Difference: error measures
• Actual target values: a1, a2, …, an
• Predicted target values: p1, p2, …, pn
• Most popular measure: the mean-squared error
    MSE = [(p1 − a1)² + … + (pn − an)²] / n
  • Easy to manipulate mathematically
Other measures

• The root mean-squared error (RMSE):
    RMSE = sqrt([(p1 − a1)² + … + (pn − an)²] / n)
• The mean absolute error (MAE) is less sensitive to outliers than the mean-squared error:
    MAE = (|p1 − a1| + … + |pn − an|) / n
• Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500)
Improvement on the mean

• How much does the scheme improve on simply predicting the average?
• The relative squared error (with ā the mean of the actual values) is:
    RSE = [(p1 − a1)² + … + (pn − an)²] / [(ā − a1)² + … + (ā − an)²]
• The root relative squared error and the relative absolute error are:
    RRSE = sqrt(RSE)
    RAE = (|p1 − a1| + … + |pn − an|) / (|ā − a1| + … + |ā − an|)
Correlation coefficient

• Measures the statistical correlation between the predicted values and the actual values:
    r = Σ(pi − p̄)(ai − ā) / sqrt(Σ(pi − p̄)² · Σ(ai − ā)²)
• Scale independent, between −1 and +1
• Good performance leads to large values!
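The error measures above can be computed in a few lines; the sketch assumes NumPy and uses hypothetical actual and predicted values, with a_bar as the mean actual value.

```python
# Numeric-prediction error measures (hypothetical data).
import numpy as np

a = np.array([500.0, 200.0, 300.0, 400.0])     # actual target values
p = np.array([450.0, 210.0, 330.0, 390.0])     # predicted target values
a_bar = a.mean()

mse  = np.mean((p - a) ** 2)                               # mean-squared error
rmse = np.sqrt(mse)                                        # root mean-squared error
mae  = np.mean(np.abs(p - a))                              # mean absolute error
rse  = np.sum((p - a) ** 2) / np.sum((a_bar - a) ** 2)     # relative squared error
rrse = np.sqrt(rse)                                        # root relative squared error
rae  = np.sum(np.abs(p - a)) / np.sum(np.abs(a_bar - a))   # relative absolute error
corr = np.corrcoef(p, a)[0, 1]                             # correlation coefficient

print(rmse, mae, rrse, rae, corr)
```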
Which measure?

  Measure                        A       B       C       D
  Root mean-squared error        67.8    91.7    63.3    57.4
  Mean absolute error            41.3    38.5    33.4    29.2
  Root relative squared error    42.2%   57.2%   39.4%   35.8%
  Relative absolute error        43.1%   40.1%   34.8%   30.4%
  Correlation coefficient        0.88    0.88    0.89    0.91

• D best
• C second-best
• A, B arguable
