L03 Generalization, Train Test Splits and Validation

Model Generalization

Choosing Between Different Complexities

[Figure: three panels fitting the same samples with polynomial degree 1, 4, and 15; each panel plots the model, the true function, and the samples (Y vs X).]
How Well Does the Model Generalize?

[Figure: the same three fits, polynomial degrees 1, 4, and 15.]

Degree 1: poor at training, poor at predicting
Degree 4: just right
Degree 15: good at training, poor at predicting
Underfitting vs Overfitting

[Figure: the same three fits, polynomial degrees 1, 4, and 15.]

Degree 1: underfitting
Degree 4: just right
Degree 15: overfitting
Bias-Variance Tradeoff

[Figure: the same three fits, polynomial degrees 1, 4, and 15.]

Degree 1: high bias, low variance
Degree 4: just right
Degree 15: low bias, high variance
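The three fits in the figures above can be reproduced as a rough sketch with NumPy's polynomial least squares; the data, true function, and noise level here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30))
y_true = np.cos(1.5 * np.pi * X)            # the "true function"
y = y_true + rng.normal(0, 0.1, 30)         # noisy samples

train_errors = {}
for degree in (1, 4, 15):
    coeffs = np.polyfit(X, y, degree)       # least-squares polynomial fit
    y_pred = np.polyval(coeffs, X)
    train_errors[degree] = np.mean((y - y_pred) ** 2)

# Training error shrinks as the degree grows, even though
# degree 15 is mostly fitting the noise.
```

Low training error alone says nothing about generalization; that is what the train/test split in the next slides measures.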
Training and Test Splits

[Diagram: the full dataset divided into Training Data and Test Data.]
Using Training and Test Data

Training Data: fit the model

Test Data: measure performance
- predict labels with the model
- compare with the actual values
- measure the error
Using Training and Test Data

[Figure: two scatter plots of the same variables, Training Data on the left and Test Data on the right.]

- Fit the model on the training data
- Make predictions on the test data
- Measure the error against the actual test values
Fitting Training and Test Data

Training Data (X_train, Y_train):
model = model.fit(X_train, Y_train)

Test Data (X_test, Y_test):
Y_predict = model.predict(X_test)
test_error = error_metric(Y_test, Y_predict)
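A minimal end-to-end sketch of this fit/predict/measure loop, assuming a linear model, toy data, and mean squared error as the error metric:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (100, 1))                # toy feature
y = 3.0 * X.ravel() + rng.normal(0, 1.0, 100)   # linear signal plus noise

X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, Y_train)    # fit on training data only
Y_predict = model.predict(X_test)                   # predict labels for the test set
test_error = mean_squared_error(Y_test, Y_predict)  # compare with actual values
```

The test set is never touched during fitting, so test_error estimates how the model behaves on unseen data.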
Train and Test Splitting: The Syntax

Import the train and test split function:
from sklearn.model_selection import train_test_split

Split the data and put 30% into the test set:
train, test = train_test_split(data, test_size=0.3)

Other method for splitting data:
from sklearn.model_selection import ShuffleSplit
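The split above as a runnable sketch, on a made-up 100-row array:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 100 rows, 2 feature columns (illustrative values only).
data = np.arange(200).reshape(100, 2)

# Hold out 30% of the rows; fix random_state so the split is repeatable.
train, test = train_test_split(data, test_size=0.3, random_state=42)

print(train.shape)  # (70, 2)
print(test.shape)   # (30, 2)
```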
Beyond a Single Test Set: Cross Validation

[Diagram: the dataset divided into Training Data and Validation Data.]
Beyond a Single Test Set: Cross Validation

[Figure: Training Data and Test Data scatter plots; the model chosen this way is only the best model for this particular test set.]
Beyond a Single Test Set: Cross Validation

[Diagram: four splits of the same dataset; in each split a different quarter is held out as Validation Data (splits 1-4) while the remaining three quarters are the Training Data.]
Beyond a Single Test Set: Cross Validation

[Diagram: four rows, each dividing the data into three Training Splits and one Test Split, with the test split rotating across the rows.]

Average the cross validation results across the four folds.
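The rotating folds in the diagram can be sketched with sklearn's KFold on a toy array:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 toy rows
kf = KFold(n_splits=4)

val_counts = np.zeros(10, dtype=int)
for train_idx, val_idx in kf.split(X):
    # each split trains on roughly 3/4 of the rows and validates on the rest
    val_counts[val_idx] += 1

# every row serves as validation data exactly once across the 4 splits
```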


Model Complexity vs Error

[Figure: error vs model complexity, showing a training error curve and a cross validation error curve.]
Model Complexity vs Error

[Figure: the degree 1 fit alongside the error curves, at the low-complexity end.]

Underfitting: training and cross validation error are both high.


Model Complexity vs Error

[Figure: the degree 15 fit alongside the error curves, at the high-complexity end.]

Overfitting: training error is low, cross validation error is high.


Model Complexity vs Error

[Figure: the degree 4 fit alongside the error curves, between the two extremes.]

Just right: training and cross validation errors are both low.
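These error curves can be measured directly; the sketch below assumes the same toy cosine data as the earlier figures and uses a polynomial pipeline per degree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (40, 1))
y = np.cos(1.5 * np.pi * X.ravel()) + rng.normal(0, 0.1, 40)

cv_mse = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y,
                             cv=KFold(4, shuffle=True, random_state=0),
                             scoring='neg_mean_squared_error')
    cv_mse[degree] = -scores.mean()   # flip the sign back to a positive MSE

# degree 1 underfits and degree 15 overfits, so degree 4 should
# come out with the lowest cross validation error on this data
```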


Cross Validation: The Syntax

Import the cross validation score function:
from sklearn.model_selection import cross_val_score

Perform cross validation with a given model:
cross_val = cross_val_score(KNN, X_data, y_data, cv=4,
                            scoring='neg_mean_squared_error')

Other methods for cross validation:
from sklearn.model_selection import KFold, StratifiedKFold
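A runnable version of the call above, assuming KNN is a k-nearest-neighbors regressor and toy X_data/y_data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_data = rng.uniform(0, 10, (60, 1))
y_data = 2.0 * X_data.ravel() + rng.normal(0, 0.5, 60)

KNN = KNeighborsRegressor(n_neighbors=3)
cross_val = cross_val_score(KNN, X_data, y_data, cv=4,
                            scoring='neg_mean_squared_error')

# sklearn negates MSE so that "greater is better" holds for every scorer;
# flip the sign and average to summarize the four folds
mean_mse = -cross_val.mean()
```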
Modelling Best Practice

• Use a cost function to fit the model
• Develop multiple models
• Compare the results and choose the best one
Other Model Metrics

Sum of Squared Errors (SSE):
$\mathrm{SSE} = \sum_{i=1}^{m} \left( y_\beta(x^{(i)}) - y_{obs}^{(i)} \right)^2$

Total Sum of Squares (TSS):
$\mathrm{TSS} = \sum_{i=1}^{m} \left( y_{obs}^{(i)} - \bar{y}_{obs} \right)^2$

Coefficient of Determination (R²):
$R^2 = 1 - \dfrac{\mathrm{SSE}}{\mathrm{TSS}}$
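Worked on a tiny made-up example, these three quantities are a few lines of NumPy:

```python
import numpy as np
from sklearn.metrics import r2_score

y_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.9])    # hypothetical model predictions

sse = np.sum((y_pred - y_obs) ** 2)        # sum of squared errors
tss = np.sum((y_obs - y_obs.mean()) ** 2)  # total sum of squares
r2 = 1.0 - sse / tss                       # coefficient of determination, 0.986 here

# matches sklearn's built-in metric: r2_score(y_obs, y_pred)
```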
Advanced Linear Regression

Scaling is a Type of Feature Transformation

[Figure: Age vs Number of Surgeries plotted twice, once on the raw axis scale and once rescaled.]
Transformation of Data Distributions

• Predictions from linear regression models assume residuals are normally distributed
• Features and predicted data are often skewed
• Data transformations can solve this issue
Transformation of Data Distributions

from numpy import log, log1p
from scipy.stats import boxcox
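A sketch of both transforms on a made-up right-skewed feature:

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed feature

log_t = np.log1p(skewed)     # log(1 + x), safe when x can be 0
bc_t, lam = boxcox(skewed)   # Box-Cox picks lambda by maximum likelihood

# both transforms pull the long right tail in toward a symmetric shape
```

Note that boxcox requires strictly positive input, which is why log1p is a common fallback for data containing zeros.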


Feature Types

Feature Type → Transformation

• Continuous: numerical values → Standard Scaling, Min-Max Scaling
• Nominal: categorical, unordered features (True or False) → One-hot encoding (0, 1)
• Ordinal: categorical, ordered features (movie ratings) → Ordinal encoding (0, 1, 2, 3)

from sklearn.preprocessing import LabelEncoder, LabelBinarizer, OneHotEncoder
from sklearn.feature_extraction import DictVectorizer
from pandas import get_dummies
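A short sketch of these encodings on a made-up nominal column and ordinal column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'],     # nominal
                   'rating': ['bad', 'good', 'great', 'good']})  # ordinal

# one-hot encoding: one 0/1 column per category
dummies = pd.get_dummies(df['color'])

# sklearn equivalent, usable inside pipelines (dense array via .toarray())
onehot = OneHotEncoder().fit_transform(df[['color']]).toarray()

# ordinal encoding: map ordered categories to integers
rating_order = {'bad': 0, 'good': 1, 'great': 2}
ordinal = df['rating'].map(rating_order)
```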
