
Data Science for Business

Lecture 5b – Model evaluation:


Model Performance Analytics

Assoc. Prof. Pham Quoc Trung


[email protected]
Over-fitting the data

• Finding chance occurrences in data that look like interesting patterns, but which do not generalize, is called over-fitting the data
• We want models to apply not just to the exact training set, but to the general population from which the training data came: we want them to generalize

Over-fitting

• The tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points
• All data mining procedures have the tendency to over-fit to some extent; some more than others
• “If you torture the data long enough, it will confess”
• There is no single choice or procedure that will eliminate over-fitting; instead, recognize over-fitting and manage complexity in a principled way
Fitting Graph
Over-fitting in tree induction
Over-fitting in linear discriminants

• $f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$

• $f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5$

• $f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5 + w_6 x_1^2$

• $f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5 + w_6 x_1^2 + w_7 \, x_2 / x_3$
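
The sequence above adds engineered attributes (a square, a ratio) to make a linear model progressively more flexible. A minimal sketch of that idea in Python with scikit-learn; the synthetic data, the specific engineered columns, and the use of scikit-learn are illustrative assumptions, not part of the lecture:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                        # original attributes x1..x5
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0).astype(int)   # synthetic target

# Engineer the extra attributes from the last equation: x1^2 and x2/x3.
x1_sq = (X[:, 0] ** 2).reshape(-1, 1)
denom = np.where(np.abs(X[:, 2]) < 1e-3, 1e-3, X[:, 2])  # avoid division by ~0
x2_over_x3 = (X[:, 1] / denom).reshape(-1, 1)
X_flex = np.hstack([X, x1_sq, x2_over_x3])

# Both models are linear in their inputs; the second simply has more
# weights to fit, hence more capacity (and more room to over-fit).
simple = LogisticRegression(max_iter=1000).fit(X, y)
flexible = LogisticRegression(max_iter=1000).fit(X_flex, y)
print("training accuracy, 5 attributes:", simple.score(X, y))
print("training accuracy, 7 attributes:", flexible.score(X_flex, y))
```
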
Example: Classifying Flowers (figures)
Need for holdout evaluation

(figure: under-fitting, good fit, over-fitting)

• In-sample evaluation favors “memorizing”
• On the training data, a memorizing model would look best
• But on new data it would perform badly
Over-fitting

(figure: under-fitting, over-fitting, good fit)

• Over-fitting: the model “memorizes” the properties of the particular training set rather than learning the underlying concept or phenomenon
Holdout validation

• We are interested in generalization: the performance on data not used for training
• Given only one data set, we hold out some data for evaluation
• The holdout set used for final evaluation is called the test set
• Accuracy on the training data is sometimes called “in-sample” accuracy, vs. “out-of-sample” accuracy on the test data
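
A minimal holdout split in Python with scikit-learn, contrasting in-sample and out-of-sample accuracy; the Iris data, the 30% holdout fraction, and the unpruned tree are illustrative choices, not prescribed by the slides:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data as the test set; it is never used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("in-sample accuracy:    ", model.score(X_train, y_train))
print("out-of-sample accuracy:", model.score(X_test, y_test))
```
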
Cross-Validation
From Holdout Evaluation to Cross-Validation

• Cross-validation gives not only a simple estimate of the generalization performance, but also some statistics on the estimated performance, such as the mean and variance
• It makes better use of a limited dataset
• Cross-validation computes its estimates over all the data: each observation is used for testing exactly once
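
A sketch of k-fold cross-validation reporting both the mean and the spread of the estimate; the 5 folds, the Iris data, and logistic regression are assumptions made for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each fold is held out once while the model trains on the remaining folds,
# so every observation is used for testing exactly once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold accuracies:", scores)
print("mean = %.3f, std = %.3f" % (scores.mean(), scores.std()))
```
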
Let’s focus back in on actually mining the data…

Which customers should TelCo target with a special offer, prior to contract expiration?

MegaTelCo
Generalization Performance

• Different modeling procedures may have different performance on the same data
• Different training sets may result in different generalization performance
• Different test sets may result in different estimates of the generalization performance
• If the training set size changes, you may also expect different generalization performance from the resultant model
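
A small sketch of the first points: keeping the procedure and the data fixed but changing only the random train/test split already changes the estimated performance. The dataset, the seeds, and the tree classifier are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Same modeling procedure, same data set: only the split changes.
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"split {seed}: out-of-sample accuracy = {acc:.3f}")
```
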
Learning Curves
Logistic Regression vs Tree Induction

• For smaller training-set sizes, logistic regression yields better generalization accuracy than tree induction
• With smaller data, tree induction will tend to over-fit more
• Classification trees are a more flexible model representation than linear logistic regression
• The flexibility of tree induction can be an advantage with larger training sets:
  • Trees can represent substantially nonlinear relationships between the features and the target
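
A sketch of this comparison using scikit-learn's learning_curve helper; the breast-cancer dataset, the scaling step, and the train-size grid are illustrative assumptions, and the printed numbers stand in for the plotted learning curves:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = [
    ("logistic regression", make_pipeline(StandardScaler(),
                                          LogisticRegression(max_iter=1000))),
    ("tree induction", DecisionTreeClassifier(random_state=0)),
]

for name, model in models:
    # Cross-validated accuracy at increasing training-set sizes.
    sizes, _, test_scores = learning_curve(
        model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))
    for n, score in zip(sizes, test_scores.mean(axis=1)):
        print(f"{name:20s} n={n:4d}  accuracy={score:.3f}")
```
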
Learning curves vs Fitting graphs

• A learning curve shows the generalization performance plotted against the amount of training data used
• A fitting graph shows the generalization performance as well as the performance on the training data, but plotted against model complexity
• Fitting graphs are generally shown for a fixed amount of training data
Avoiding Over-fitting

• Tree Induction:
  • Post-pruning: takes a fully-grown decision tree and discards unreliable parts
  • Pre-pruning: stops growing a branch when information becomes unreliable
• Linear Models:
  • Feature selection
  • Regularization: optimize some combination of fit and simplicity
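
A minimal sketch of two of these ideas in scikit-learn: pre-pruning a tree by limiting its growth, and feature selection for a linear model. The limits max_depth=3, min_samples_leaf=5, and k=2 are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growing branches early instead of fitting the data exactly.
pruned_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                     random_state=0)

# Feature selection: keep only the k most informative attributes,
# then fit the linear model on those.
selected_lr = make_pipeline(SelectKBest(f_classif, k=2),
                            LogisticRegression(max_iter=1000))

for label, model in [("pruned tree", pruned_tree), ("selected LR", selected_lr)]:
    print(label, cross_val_score(model, X, y, cv=5).mean())
```
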
Regularization

• Regularized linear model:

• $\arg\max_{\mathbf{w}} \left[ \mathrm{fit}(\mathbf{x}, \mathbf{w}) - \lambda \cdot \mathrm{penalty}(\mathbf{w}) \right]$
• “L2-norm”: the sum of the squares of the weights
  • L2-norm + standard least-squares linear regression = ridge regression
• “L1-norm”: the sum of the absolute values of the weights
  • L1-norm + standard least-squares linear regression = lasso
  • Automatic feature selection
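
A sketch of the two penalties in scikit-learn, where alpha plays the role of λ; note that the lasso drives some weights exactly to zero, which is the automatic feature selection mentioned above. The synthetic regression data and the value alpha=1.0 are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only 3 of the 10 features actually matter.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: sum of squared weights
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: sum of absolute weights

print("ridge weights:", np.round(ridge.coef_, 2))   # all shrunk, none zero
print("lasso weights:", np.round(lasso.coef_, 2))   # many exactly zero
```
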
Nested Cross-Validation
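
The slide title above refers to nested cross-validation. A minimal sketch under the usual setup: an inner cross-validation chooses the complexity parameter, and an outer cross-validation estimates how well the whole tuned procedure generalizes. The parameter grid and fold counts are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Inner loop: pick max_depth using only the training folds of the outer loop.
inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                     param_grid={"max_depth": [2, 3, 4, 5]}, cv=3)

# Outer loop: the held-out folds never influence the choice of max_depth,
# so the resulting estimate is not optimistically biased by the tuning.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy: %.3f +/- %.3f"
      % (outer_scores.mean(), outer_scores.std()))
```
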
Some Possible Remedies*
• Use simpler (less flexible) models.
• Use fewer features in final model (feature selection).
• Enrich data with influential predictors.
• Enrich (training) data with more observations.
• Tip: Also check compatibility of training and validation sets.

*May or may not work – depending on the source of the problem.


Thanks!

Q&A
