Mining Process

The document outlines the mining process in knowledge discovery and data mining, focusing on issues related to training and testing, parameter tuning, and performance evaluation. It discusses various methods for estimating model performance, including holdout estimation, cross-validation, and bootstrap techniques, as well as the importance of hyperparameter selection. Additionally, it covers the significance of comparing learning schemes and evaluating numeric predictions using different error measures.


http://www.cs.waikato.ac.nz/ml/weka/book.htm
STIN5084 KNOWLEDGE DISCOVERY AND DATA MINING

TOPIC 4: MINING PROCESS
Outline

1 Issues
2 Training & Testing
3 Parameter Tuning, Holdout, Cross-validation, Bootstrap
4 Comparing Results
Issues

• Statistical reliability of estimated differences in performance (significance tests)
• Choice of performance measure:
  • Number of correct classifications
  • Accuracy of probability estimates
  • Error in numeric predictions
• Costs assigned to different types of errors
  • Many practical applications involve costs
Training and Testing

• Natural performance measure for classification problems: error rate
  • Success: instance's class is predicted correctly
  • Error: instance's class is predicted incorrectly
  • Error rate: proportion of errors made over the whole set of instances
• Resubstitution error: error rate obtained by evaluating the model on the training data
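The gap between resubstitution error and error on an independent test set can be illustrated with a short sketch. The snippet below is illustrative only; it assumes scikit-learn, a synthetic dataset and a decision tree, none of which appear in the slides themselves.

```python
# Minimal sketch: resubstitution error vs. test error (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Resubstitution error: evaluated on the same data used for training (optimistic).
resub_error = 1 - model.score(X_train, y_train)
# Test error: evaluated on held-out instances that played no part in training.
test_error = 1 - model.score(X_test, y_test)
print(f"resubstitution error = {resub_error:.3f}, test error = {test_error:.3f}")
```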
• Test set: independent instances that have played no part in the formation of the classifier
  • Assumption: both training data and test data are representative samples of the underlying problem
  • Common training-testing splits: 70-30%, 80-20% or 90-10%
• Test and training data may differ in nature
  • Example: classifiers built using customer data from two different towns A and B
Note on parameter tuning

• It is important that the test data is not used in any way to create the classifier
• Some learning schemes operate in two stages:
  • Stage 1: build the basic structure
  • Stage 2: optimize parameter settings
• Test data cannot be used for parameter tuning
• Validation data is used to optimize parameters
• The data is therefore divided into training, validation and testing sets, e.g. 70-20-10%, 80-10-10% or 90-5-5%
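A minimal sketch of one such three-way split, assuming scikit-learn and a synthetic dataset; the 70-20-10 proportions are just one of the ratios listed above.

```python
# Illustrative 70-20-10 training/validation/test split after shuffling.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off 10% as the final test set (never used for tuning).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.10, random_state=0)
# Then split the remainder into training and validation (20% of the total = 2/9 of the rest).
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=2/9, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # roughly 700 / 200 / 100
```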
Holdout estimation

• What should we do if we only have a single dataset?
• The holdout method reserves a certain amount for testing and uses the remainder for training, after shuffling
  • Usually one third for testing and the rest for training (or another common ratio such as 90:10, 80:20 or 70:30)
• Problem: the samples might not be representative
  • Example: a class might be missing in the test data
• Advanced version uses stratification
  • Ensures that each class is represented with approximately equal proportions in both subsets
Repeated holdout method

• The holdout estimate can be made more reliable by repeating the process with different subsamples
  • In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
  • The error rates on the different iterations are averaged to yield an overall error rate
• This is called the repeated holdout method
• Still not optimal: the different test sets overlap
• Can we prevent overlapping?
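A sketch of the repeated holdout method, assuming scikit-learn: several stratified holdout splits with different random seeds, with the error rates averaged at the end.

```python
# Repeated stratified holdout, error rates averaged (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
errors = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=seed)  # stratified holdout
    model = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    errors.append(1 - model.score(X_te, y_te))

print(f"repeated-holdout error estimate: {np.mean(errors):.3f}")
```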
Cross-validation

• K-fold cross-validation avoids overlapping test sets
  • First step: split the data into k subsets of equal size
  • Second step: use each subset in turn for testing and the remainder for training
  • This means the learning algorithm is applied to k different training sets
• Often the subsets are stratified before the cross-validation is performed to yield stratified k-fold cross-validation
• The error estimates are averaged to yield an overall error estimate; the standard deviation is also often computed
• Alternatively, predictions and actual target values from the k folds are pooled to compute one estimate
  • This does not yield an estimate of the standard deviation
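A stratified 10-fold cross-validation sketch, assuming scikit-learn: the per-fold error estimates are averaged and their standard deviation reported.

```python
# Stratified 10-fold cross-validation with mean error and standard deviation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracies = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

errors = 1 - accuracies
print(f"mean error = {errors.mean():.3f}, std = {errors.std():.3f}")
```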
Leave-one-out cross-validation

• Leave-one-out is a particular form of k-fold cross-validation:
  • Set the number of folds to the number of training instances
  • I.e., for n training instances, build the classifier n times
• Makes best use of the data
• Involves no random subsampling
• Very computationally expensive (exception: lazy classifiers such as the nearest-neighbor classifier)
Leave-one-out CV and stratification

• Disadvantage of leave-one-out CV: stratification is not possible
  • It guarantees a non-stratified sample because there is only one instance in the test set!
• Extreme example: a random dataset split equally into two classes
  • Best inducer predicts the majority class
  • 50% accuracy on fresh data
  • Leave-one-out CV estimate gives 100% error!
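The extreme example above can be reproduced in a few lines; the sketch assumes scikit-learn, random labels, and a majority-class predictor.

```python
# Leave-one-out pitfall: balanced random labels + majority-class predictor.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # features carry no information
y = np.array([0] * 50 + [1] * 50)             # two equally sized classes

majority = DummyClassifier(strategy="most_frequent")
acc = cross_val_score(majority, X, y, cv=LeaveOneOut())

# Each training fold is missing exactly one instance, so the held-out class is
# always the minority in training -- every prediction is wrong.
print(f"leave-one-out accuracy = {acc.mean():.2f}")   # 0.00, i.e. 100% error
```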
Bootstrap

• CV uses sampling without replacement
  • The same instance, once selected, cannot be selected again for a particular training/test set
• The bootstrap uses sampling with replacement to form the training set
  • Sample a dataset of n instances n times with replacement to form a new dataset of n instances
  • Use this data as the training set
  • Use the instances from the original dataset that do not occur in the new training set for testing
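A bootstrap sketch under the same assumptions (NumPy/scikit-learn, synthetic data): n instances are drawn with replacement for training, and the instances never drawn form the test set.

```python
# Bootstrap sampling with replacement; out-of-bag instances used for testing.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
n = len(X)
rng = np.random.default_rng(0)

train_idx = rng.integers(0, n, size=n)                 # sampling with replacement
test_idx = np.setdiff1d(np.arange(n), train_idx)       # out-of-bag instances

model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
oob_error = 1 - model.score(X[test_idx], y[test_idx])
print(f"out-of-bag error = {oob_error:.3f}  ({len(test_idx)} test instances)")
```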
Hyperparameter selection

• Hyperparameter: a parameter that can be tuned to optimize the performance of a learning algorithm
  – Different from a basic parameter that is part of a model, such as a coefficient in a linear regression model
  – Example hyperparameter: k in the k-nearest-neighbour classifier
• Parameter tuning needs to be viewed as part of the learning algorithm and must be done using the training data only
• But how do we get a useful estimate of performance for different parameter values so that we can choose a value?
  – Answer: split the data into a smaller "training" set and a "validation" set (normally, the data is shuffled first)
  – Build models using different values of k on the new, smaller training set and evaluate them on the validation set
  – Pick the best value of k and rebuild the model on the full original training set
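A minimal sketch of this procedure for k in the k-nearest-neighbour classifier, assuming scikit-learn; the test set plays no part in choosing k.

```python
# Choosing k on a validation set, then rebuilding on the full training set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Split the training data further into a smaller training set and a validation set.
X_tr, X_val, y_tr, y_val = train_test_split(X_train_full, y_train_full, test_size=0.25, random_state=0)

scores = {k: KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_val, y_val)
          for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)

# Rebuild the model with the chosen k on the full original training set.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train_full, y_train_full)
print(f"chosen k = {best_k}, test accuracy = {final_model.score(X_test, y_test):.3f}")
```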
Hyperparameters and cross-validation

• Note that k-fold cross-validation runs k different train-test evaluations
  – The above parameter tuning process using validation sets must be applied separately to each of the k training sets!
• This means that, when hyperparameter tuning is applied, k different hyperparameter values may be selected
  – This is OK: hyperparameter tuning is part of the learning process
  – Cross-validation evaluates the quality of the learning process, not the quality of a particular model
• What to do when the training sets are very small, so that performance estimates on a validation set are unreliable?
• We can use nested cross-validation (expensive!)
  – For each training set of the "outer" k-fold cross-validation, run "inner" p-fold cross-validations to choose the best hyperparameter value
  – Outer cross-validation is used to estimate the quality of the learning process
  – Inner cross-validations are used to choose hyperparameter values
  – Inner cross-validations are part of the learning process!
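A nested cross-validation sketch, assuming scikit-learn: an inner grid search chooses k inside each outer training fold, so the tuning stays part of the learning process being evaluated.

```python
# Nested CV: inner search selects the hyperparameter, outer CV estimates quality.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Inner p-fold cross-validation: selects the hyperparameter value.
inner = GridSearchCV(KNeighborsClassifier(),
                     param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
                     cv=5)

# Outer k-fold cross-validation: estimates the quality of the whole learning process.
outer_scores = cross_val_score(inner, X, y, cv=10)
print(f"nested-CV accuracy estimate = {outer_scores.mean():.3f}")
```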
Comparing machine learning schemes

• Frequent question: which of two learning schemes performs better?
  • The answer is domain dependent
• Obvious way: compare 10-fold cross-validation estimates
• However, what about machine learning research?
  • Need to show convincingly that a particular method works better in a particular domain from which data is taken
Comparing learning schemes

• Want to show that scheme A is better than scheme B in a particular domain
  – For a given amount of training data (i.e., data size)
  – On average, across all possible training sets from that domain
• Assume we have an infinite amount of data from the domain
  – Sample infinitely many datasets of a specified size
  – Obtain a cross-validation estimate on each dataset for each scheme
  – Check if the mean accuracy for scheme A is better than the mean accuracy for scheme B
Paired t-test

• In practice, we have limited data and a limited number of estimates for computing the mean
• Student's t-test tells us whether the means of two samples are significantly different
  • In our case the samples are cross-validation estimates, one for each dataset we have sampled
• We can use a paired t-test because the individual samples are paired
  • The same cross-validation is applied twice, ensuring that all the training and test sets are exactly the same
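A paired t-test sketch, assuming SciPy; the accuracy estimates below are hypothetical, standing in for cross-validation results obtained with the same splits on several datasets.

```python
# Paired t-test on paired cross-validation estimates (hypothetical numbers).
import numpy as np
from scipy import stats

scheme_A = np.array([0.83, 0.79, 0.91, 0.86, 0.88, 0.81, 0.84, 0.90, 0.87, 0.85])
scheme_B = np.array([0.80, 0.77, 0.89, 0.85, 0.86, 0.78, 0.83, 0.88, 0.85, 0.82])

t_stat, p_value = stats.ttest_rel(scheme_A, scheme_B)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# If p is below the chosen significance level, the difference in means is significant.
```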
Performing the test

• Fix a significance level
  • If a difference is significant at the α% level, there is a (100−α)% chance that the true means differ
• Divide the significance level by two because the test is two-tailed
  • I.e., the true difference can be positive or negative
• Look up the value for z that corresponds to α/2
• Compute the value of t based on the observed performance estimates for the schemes being compared
Unpaired observations

• If the CV estimates are from different datasets, they are no longer paired (or maybe we have k estimates for one scheme and j estimates for the other one)
• Then we have to use an unpaired t-test with min(k, j) − 1 degrees of freedom
• The statistic for the t-test becomes the difference in means divided by its standard error (see the sketch below)
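An unpaired-comparison sketch, assuming SciPy/NumPy and hypothetical estimates. The statistic computed here is the difference in means divided by sqrt(s_x²/k + s_y²/j), with the conservative min(k, j) − 1 degrees of freedom mentioned above.

```python
# Unpaired t-test with k estimates for one scheme and j for the other.
import numpy as np
from scipy import stats

x = np.array([0.83, 0.79, 0.91, 0.86, 0.88])          # k = 5 hypothetical estimates
y = np.array([0.80, 0.77, 0.89, 0.85, 0.86, 0.78])    # j = 6 hypothetical estimates

t = (x.mean() - y.mean()) / np.sqrt(x.var(ddof=1)/len(x) + y.var(ddof=1)/len(y))
df = min(len(x), len(y)) - 1                           # conservative degrees of freedom
p = 2 * stats.t.sf(abs(t), df)                         # two-tailed p-value
print(f"t = {t:.2f}, df = {df}, p = {p:.3f}")
```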
Counting the cost

• The confusion matrix:

                             Predicted class
                             Yes                   No
  Actual class   Yes         True positive (TP)    False negative (FN)
                 No          False positive (FP)   True negative (TN)

• Different misclassification costs can be assigned to false positives and false negatives
Precision and Recall

[Diagram: the entire document collection divided into relevant vs. irrelevant and retrieved vs. not retrieved documents, giving four regions: retrieved & relevant, retrieved & irrelevant, not retrieved but relevant, not retrieved & irrelevant]

recall = (number of relevant documents retrieved) / (total number of relevant documents)

precision = (number of relevant documents retrieved) / (total number of documents retrieved)
Determining Recall is Difficult

• Precision vs. Recall:
  – Precision = the ability to retrieve top-ranked documents that are mostly relevant
  – Recall = the ability of the search to find all of the relevant items in the corpus
• The total number of relevant items is sometimes not available:
  – Sample across the database and perform relevance judgments on these items
  – Apply different retrieval algorithms to the same database for the same query; the aggregate of relevant items is taken as the total relevant set
Computing Recall/Precision Points: An Example

Let the total number of relevant docs = 6. Check each new recall point:

  n    doc #   relevant
  1    588     x          R = 1/6 = 0.167;  P = 1/1 = 1
  2    589     x          R = 2/6 = 0.333;  P = 2/2 = 1
  3    576
  4    590     x          R = 3/6 = 0.5;    P = 3/4 = 0.75
  5    986
  6    592     x          R = 4/6 = 0.667;  P = 4/6 = 0.667
  7    984
  8    988
  9    578
  10   985
  11   103
  12   591
  13   772     x          R = 5/6 = 0.833;  P = 5/13 = 0.38
  14   990

One relevant document is missing, so 100% recall is never reached.
Assuming cross-validation of 10 folds has been used, calculate precision and recall for the data depicted in the confusion matrix provided below. TRY!

                              Actual Class
                              Accept    Reject
  Predicted Class   Accept    422       78
                    Reject    104       164

Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
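A minimal sketch of the arithmetic, treating "Accept" as the positive class (rows of the table above are predictions, columns are actual classes):

```python
# Precision and recall from the confusion matrix above.
TP = 422   # predicted Accept, actually Accept
FP = 78    # predicted Accept, actually Reject
FN = 104   # predicted Reject, actually Accept
TN = 164   # predicted Reject, actually Reject

precision = TP / (TP + FP)
recall = TP / (TP + FN)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")  # ~0.844 and ~0.802
```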
Aside: the kappa statistic

• Two confusion matrices for a 3-class problem: actual predictor (left) vs. random predictor (right)
• Number of successes: sum of entries on the diagonal (D)
• Kappa statistic: (success rate of actual predictor − success rate of random predictor) / (1 − success rate of random predictor)
• Measures relative improvement over the random predictor: 1 means perfect accuracy, 0 means we are doing no better than random
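A kappa sketch, assuming NumPy and a hypothetical 3-class confusion matrix (the matrices from the original slide are not reproduced here): the random predictor's success rate is computed from the marginal totals.

```python
# Kappa statistic from a hypothetical 3-class confusion matrix.
import numpy as np

cm = np.array([[88, 10,  2],      # rows = actual classes
               [14, 40,  6],      # columns = predicted classes
               [18, 10, 12]])
total = cm.sum()

observed = np.trace(cm) / total                       # success rate of actual predictor
row = cm.sum(axis=1)                                  # actual class totals
col = cm.sum(axis=0)                                  # predicted class totals
expected = (row * col).sum() / total**2               # success rate of a random predictor

kappa = (observed - expected) / (1 - expected)
print(f"kappa = {kappa:.3f}")
```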
ROC curves

• ROC curves: "receiver operating characteristic"
• Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel
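A ROC sketch, assuming scikit-learn, a synthetic dataset and a logistic regression scorer: hit rate (TP rate) is plotted against false alarm rate (FP rate) as the decision threshold varies.

```python
# ROC curve points and area under the curve for a probabilistic classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)        # FP rate vs. TP rate points
print(f"AUC = {roc_auc_score(y_te, scores):.3f}")
```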
Summary of some measures

  Measure                  Domain                  Plot                       Explanation
  Lift chart               Marketing               TP vs. subset size         TP; (TP+FP)/(TP+FP+TN+FN)
  ROC curve                Communications          TP rate vs. FP rate        TP/(TP+FN); FP/(FP+TN)
  Recall-precision curve   Information retrieval   Recall vs. precision       TP/(TP+FN); TP/(TP+FP)
Evaluating numeric prediction

• Same strategies: independent test set, cross-validation, significance tests, etc.
• Difference: error measures
• Actual target values: a1, a2, …, an
• Predicted target values: p1, p2, …, pn
• Most popular measure: the mean-squared error
    MSE = [(p1 − a1)² + … + (pn − an)²] / n
  • Easy to manipulate mathematically
Other measures

• The root mean-squared error (RMSE):
    RMSE = sqrt([(p1 − a1)² + … + (pn − an)²] / n)
• The mean absolute error (MAE) is less sensitive to outliers than the mean-squared error:
    MAE = (|p1 − a1| + … + |pn − an|) / n
• Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500)
Improvement on the mean

• How much does the scheme improve on simply predicting the average?
• The relative squared error (with ā the mean of the actual values) is:
    RSE = [(p1 − a1)² + … + (pn − an)²] / [(ā − a1)² + … + (ā − an)²]
• The root relative squared error and the relative absolute error are:
    RRSE = sqrt(RSE)
    RAE = (|p1 − a1| + … + |pn − an|) / (|ā − a1| + … + |ā − an|)
Correlation coefficient

• Measures the statistical correlation between the predicted values and the actual values:
    r = Σ(pi − p̄)(ai − ā) / sqrt(Σ(pi − p̄)² · Σ(ai − ā)²)
• Scale independent, between −1 and +1
• Good performance leads to large values!
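The error measures above can be computed in a few lines; the sketch assumes NumPy and uses hypothetical actual and predicted values, with a_bar as the mean actual value.

```python
# Numeric-prediction error measures (hypothetical data).
import numpy as np

a = np.array([500.0, 200.0, 300.0, 400.0])     # actual target values
p = np.array([450.0, 210.0, 330.0, 390.0])     # predicted target values
a_bar = a.mean()

mse  = np.mean((p - a) ** 2)                               # mean-squared error
rmse = np.sqrt(mse)                                        # root mean-squared error
mae  = np.mean(np.abs(p - a))                              # mean absolute error
rse  = np.sum((p - a) ** 2) / np.sum((a_bar - a) ** 2)     # relative squared error
rrse = np.sqrt(rse)                                        # root relative squared error
rae  = np.sum(np.abs(p - a)) / np.sum(np.abs(a_bar - a))   # relative absolute error
corr = np.corrcoef(p, a)[0, 1]                             # correlation coefficient

print(rmse, mae, rrse, rae, corr)
```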
Which measure?

  Measure                        A       B       C       D
  Root mean-squared error        67.8    91.7    63.3    57.4
  Mean absolute error            41.3    38.5    33.4    29.2
  Root relative squared error    42.2%   57.2%   39.4%   35.8%
  Relative absolute error        43.1%   40.1%   34.8%   30.4%
  Correlation coefficient        0.88    0.88    0.89    0.91

• D best
• C second-best
• A, B arguable
