L03 Generalization, Train Test Splits and Validation
How Well Does the Model Generalize?
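[Figure: three panels fitting polynomials of degree 1, 4, and 15 to the same samples; each panel plots Y against X and shows the Model, the True Function, and the Samples]
Polynomial Degree = 1: Underfitting
Polynomial Degree = 4: Just Right
Polynomial Degree = 15: Overfitting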
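The effect of the three polynomial degrees can be reproduced with a short sketch. The true function, the noise level, and the sample size below are assumptions chosen for illustration; the slides do not state them. The sketch compares the error on the training samples with the error on new points drawn from the same range.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def true_function(x):
    # Assumed for illustration; the slides do not state the true function.
    return np.cos(1.5 * np.pi * x)


rng = np.random.RandomState(0)
x_train = rng.rand(30)
y_train = true_function(x_train) + rng.randn(30) * 0.1  # noisy samples

# Noise-free points inside the sampled range, used to check generalization.
x_new = np.linspace(0.1, 0.9, 50)
y_new = true_function(x_new)

errors = {}
for degree in (1, 4, 15):
    # Polynomial regression: expand x into powers, then fit a linear model.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train.reshape(-1, 1), y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train.reshape(-1, 1)))
    new_mse = mean_squared_error(y_new, model.predict(x_new.reshape(-1, 1)))
    errors[degree] = (train_mse, new_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  new-data MSE={new_mse:.4f}")
```

A low training error combined with a much higher error on new points is the usual sign of overfitting; a high error on both is the usual sign of underfitting.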
Bias – Variance Tradeoff
[Figure: three panels for Polynomial Degree = 1, 4, and 15, each plotting Y against X with the Model, the True Function, and the Samples]
Using Training and Test Data
[Diagram: the data is divided into Training Data and Test Data]
Test Data: measure performance
- predict label with model
- compare with actual value
- measure error
[Figure: side-by-side scatter plots of the Training Data and the Test Data; x-axis 0.0 to 2.0 ×10^8, y-axis 0 to 4.0 ×10^8]
Steps shown on the figure:
1. Fit the model
2. Make predictions
3. Measure error
Fitting Training and Test Data
Training Data (X_train, Y_train) -> model.fit(X_train, Y_train) -> model
Test Data (X_test) -> model.predict(X_test) -> Y_predict
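The fit and predict flow above can be sketched end to end. The data here is synthetic and hypothetical (a noisy line), and LinearRegression is only one possible choice of model; the names X_train, Y_train, X_test, and Y_predict follow the diagram.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical data: y = 3x + 2 plus a little noise.
rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(100, 1))
Y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=100)

# Rows are already in random order, so hold out the last 30 as test data.
X_train, X_test = X[:70], X[70:]
Y_train, Y_test = Y[:70], Y[70:]

model = LinearRegression()
model.fit(X_train, Y_train)        # learn from the training data only
Y_predict = model.predict(X_test)  # predict labels for unseen test data

# Compare predictions with the actual values to measure error.
test_mse = mean_squared_error(Y_test, Y_predict)
print(f"test MSE: {test_mse:.3f}")
```

The test data plays no part in fitting, so the test error is an estimate of how the model behaves on data it has not seen.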
Train and Test Splitting: The Syntax
Import the train and test split function
from sklearn.model_selection import train_test_split
Split the data and put 30% into the test set
train, test = train_test_split(data, test_size=0.3)
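A minimal runnable version of the syntax above, assuming a small hypothetical array in place of `data`. The `random_state` argument is an addition not shown on the slide; it makes the shuffle reproducible. The second call shows the common variant where features and labels are passed separately.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 10 rows, 2 columns.
data = np.arange(20).reshape(10, 2)

# Split the data and put 30% into the test set.
train, test = train_test_split(data, test_size=0.3, random_state=0)
print(train.shape, test.shape)  # (7, 2) (3, 2)

# Features and labels can be split together so the rows stay aligned.
X, y = data[:, :1], data[:, 1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
```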
Training and Test Splits
Training Data: fit the model
Test Data: measure performance
- predict label with model
- compare with actual value
- measure error
Beyond a Single Test Set: Cross Validation
[Diagram: the data is divided into Training Data and Validation Data]
[Figure: side-by-side scatter plots of the Training Data and the Test Data; x-axis 0.0 to 2.0 ×10^8, y-axis 0 to 4.0 ×10^8]
Best model for this test set
[Diagram: four different splits of the same data into training and validation portions]
Split 1: Training Data 1 and Validation Data 1
Split 2: Training Data 2 and Validation Data 2
Split 3: Training Data 3 and Validation Data 3
Split 4: Training Data 4 and Validation Data 4
[Diagram: four rows joined by "+"; each row contains three Training Splits and one Test Split, with a different split held out as the Test Split in each row]
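The four-way split can be sketched with scikit-learn's KFold and cross_val_score. The data and the choice of LinearRegression are hypothetical; averaging the per-fold errors into a single cross validation error is the usual convention and is assumed here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data: a noisy line with 40 samples.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(40, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 1.0, size=40)

# Four folds: each fold trains on three splits and validates on the fourth.
kf = KFold(n_splits=4, shuffle=True, random_state=0)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"fold {fold}: {len(train_idx)} training rows, {len(val_idx)} validation rows")

# scikit-learn reports negated MSE so that larger is better; flip the sign.
scores = cross_val_score(
    LinearRegression(), X, y, cv=kf, scoring="neg_mean_squared_error"
)
cv_error = -scores.mean()
print(f"cross validation MSE: {cv_error:.3f}")
```

Every row is used for validation exactly once, so the averaged error depends less on one lucky or unlucky split than a single test set does.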
Model Complexity vs Error
[Figure: training error and cross validation error plotted against model complexity (y-axis: error). Inset: the Polynomial Degree = 1 fit, plotted against X, showing the Model, the True Function, and the Samples]
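The two curves can be computed by sweeping the polynomial degree and recording both errors. This is a sketch under assumptions: the true function, noise level, sample size, and range of degrees are all hypothetical, and polynomial degree stands in for model complexity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical samples from an assumed true function plus noise.
rng = np.random.RandomState(0)
X = rng.rand(60, 1)
y = np.cos(1.5 * np.pi * X[:, 0]) + rng.randn(60) * 0.1

cv = KFold(n_splits=4, shuffle=True, random_state=0)
degrees = list(range(1, 11))
train_errors, cv_errors = [], []
for degree in degrees:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # Cross validation error: average MSE over the held-out folds.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_errors.append(-scores.mean())
    # Training error: MSE on the same data the model was fit to.
    model.fit(X, y)
    train_errors.append(mean_squared_error(y, model.predict(X)))

# Choose the complexity with the lowest cross validation error.
best_degree = degrees[int(np.argmin(cv_errors))]
print("best degree by cross validation:", best_degree)
```

Training error generally keeps shrinking as the degree grows, so it cannot be used to pick the degree; the cross validation error is the curve to minimize.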
Other Measures of Error
Sum of Squared Error (SSE): $\sum_{i=1}^{m} \left( y_\beta(x^{(i)}) - y_{obs}^{(i)} \right)^2$
Total Sum of Squares (TSS): $\sum_{i=1}^{m} \left( \bar{y}_{obs} - y_{obs}^{(i)} \right)^2$
Coefficient of Determination ($R^2$): $1 - \frac{SSE}{TSS}$
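A small worked example of SSE, TSS, and R2, using hypothetical observed and predicted values. The result is checked against scikit-learn's r2_score, which computes the same 1 - SSE/TSS quantity.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical observed values and model predictions.
y_obs = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# SSE: squared differences between predictions and observations.
sse = np.sum((y_pred - y_obs) ** 2)        # 0.25 + 0.25 + 0 + 0.25 = 0.75

# TSS: squared differences between the observed mean and observations.
tss = np.sum((y_obs.mean() - y_obs) ** 2)  # mean is 6: 9 + 1 + 1 + 9 = 20

r2 = 1 - sse / tss                         # 1 - 0.75 / 20 = 0.9625
print(sse, tss, r2)
print(r2_score(y_obs, y_pred))             # same value from scikit-learn
```

An R2 near 1 means the model leaves little of the variation in the observations unexplained; a model that always predicts the mean scores 0.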
Advanced Linear Regression
Scaling is a Type of Feature Transformation
[Figure: two panels plotting Age against Number of Surgeries (1 to 5), drawn with different Age axis ranges]
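Scaling can be sketched with two common scikit-learn transformers. The column meanings follow the figure labels (Number of Surgeries, Age), but the values are hypothetical, and the slides do not say which scaler they have in mind.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical rows: [number of surgeries, age].
X = np.array([
    [1, 22.0],
    [2, 35.0],
    [3, 41.0],
    [4, 58.0],
    [5, 64.0],
])

# Standard scaling: each column gets mean 0 and standard deviation 1.
standard = StandardScaler().fit_transform(X)

# Min-max scaling: each column is mapped onto the range [0, 1].
minmax = MinMaxScaler().fit_transform(X)

print(standard.round(2))
print(minmax.round(2))
```

When a train and test split is in use, the scaler is normally fit on the training data only and then applied to the test data with `transform`, so that no information from the test set leaks into the model.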
Transformation of Data Distributions