Multiple Linear Regression
Model Validation
Day 7
Session 2
Cross Validation in Predictive Modeling
• Cross-validation is the process of evaluating a model on 'out-of-sample' data.
• Model performance measures such as R-squared or Root Mean Squared Error (RMSE) tend to be optimistic on 'in-sample' data.
• More realistic measures of model performance are calculated using 'out-of-sample' data.
• Cross-validation is a procedure for estimating this generalization performance.
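A minimal simulated sketch of the point above (hypothetical data, not the motor data used later): a model fitted on one part of the data is scored both on the data it saw and on held-out data.

```r
# Simulated illustration: in-sample RMSE tends to be optimistic
set.seed(42)
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 5)
dat <- data.frame(x, y)

train_idx <- sample(n, 150)                 # 150 'in-sample' rows
fit <- lm(y ~ x, data = dat[train_idx, ])

# In-sample RMSE (computed on the fitting data)
rmse_in <- sqrt(mean(residuals(fit)^2))

# Out-of-sample RMSE (computed on the held-out rows)
holdout <- dat[-train_idx, ]
rmse_out <- sqrt(mean((holdout$y - predict(fit, holdout))^2))

c(in_sample = rmse_in, out_of_sample = rmse_out)
```

The out-of-sample RMSE is typically somewhat larger, which is why cross-validation bases performance estimates on held-out data.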
Cross Validation in Predictive Modeling
Methods
• Hold-Out Validation
• K-Fold Cross-Validation
• Repeated K-Fold Cross-Validation
• Leave-One-Out Cross-Validation (LOOCV)
Introduction to Caret Package in R
• The caret package (short for Classification And REgression Training) is a set of
functions that attempt to streamline the process for creating predictive models.
• The package contains tools for:
data splitting
pre-processing
feature selection
model tuning using resampling
variable importance estimation
Snapshot of the Data
Predicting Claim Amount
n=1000
VehicleAge CC Length Weight Claimamt
4 1495 4250 1023 72000
2 1061 3495 875 72000
2 1405 3675 980 50400
7 1298 4090 930 39960
2 1495 4250 1023 106800
1 1086 3565 854 69592
4 796 3495 740 38400
4 1061 3520 830 43182
2 796 3335 665 40346
1 1405 3675 980 76800
0 1086 3565 854 77822
2 1061 3520 825 72000
3 2499 4435 1585 88560
3 1405 3675 980 48000
8 1405 3675 980 25920
2 796 3335 665 43358
2 1086 3565 854 67200
1 1405 3675 980 78000
2 1396 3675 980 57216
Recap: Model Output
Parameter Estimates

Variable     DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  VIF
Intercept     1          -49195           5475.1511    -8.99    <.0001  0
VehicleAge    1      -6638.0765            155.5247   -42.68    <.0001  1.03836
CC            1           8.6886              1.4809     5.87    <.0001  2.83393
Length        1          32.0652              1.8522    17.31    <.0001  2.88972

R-squared = 73.19%.
Note: the variable Weight was excluded to correct the multicollinearity problem.
Cross Validation in Predictive Modeling
Hold-Out Validation
• In the hold-out validation method, the available data is split into two non-overlapping parts: 'training data' and 'testing data'.
• The model is developed using the training data and evaluated using the testing data.
• The training data should have the larger sample size; typically 70%-80% of the data is used for model development.

Training data: data used to fit the model
Test data: "fresh" data used to evaluate the model
Hold-Out Validation in R
#import csv file 'Motor Insurance claim amount'
motor<-read.csv(file.choose(),header=T)
library(caret)
#Partition data into 2 parts. p=0.8 indicates an 80:20 split
index<-createDataPartition(motor$claimamt,p=0.8,list=FALSE)
head(index)
dim(index)
traindata<-motor[index,]
testdata<-motor[-index,]
dim(traindata)
dim(testdata)

The createDataPartition function generates the list of observation numbers to be included in the training data. We now have the training and testing data sets ready.
Hold-Out Validation in R (continued)
motor_model<-lm(claimamt~Length+CC+vehage,data=traindata)
traindata$res<-residuals(motor_model)
head(traindata)
RMSEtrain<-sqrt(mean(traindata$res**2))
RMSEtrain
[1] 11512.18
testdata$pred<-predict(motor_model,testdata)
testdata$res<-(testdata$claimamt-testdata$pred)
RMSEtest<-sqrt(mean(testdata$res**2))
RMSEtest
[1] 11181.45

For comparison, the overall (full-data) performance measures were RMSE: 11444.51 and Multiple R-squared: 0.7327. The similar RMSE values indicate a stable model.
rmse function in ModelMetrics
# Obtain predicted values of Y
traindata$fit<-fitted(motor_model)
install.packages("ModelMetrics")
library(ModelMetrics)
# rmse function requires observed and predicted values of Y
rmse(traindata$claimamt,traindata$fit)
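The rmse function simply applies the formula used earlier, sqrt(mean((observed - predicted)^2)). A quick check on toy vectors (hypothetical numbers, not the motor data) confirms the equivalence:

```r
# Manual RMSE on toy observed/predicted vectors
obs  <- c(10, 12, 9, 14)
pred <- c(11, 11, 10, 13)
manual <- sqrt(mean((obs - pred)^2))
manual
# [1] 1
# ModelMetrics::rmse(obs, pred) returns the same value
```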
Cross Validation in Predictive Modeling
K-Fold Cross-Validation
• In k-fold cross-validation, the data is first partitioned into k equally (or nearly equally) sized segments, or folds.
• Then k iterations of training and testing are performed such that in each iteration one fold is kept aside for testing and the model is developed using the remaining k-1 folds.
• The model performance measure is an aggregate of the measures from the above iterations.
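The procedure above can be sketched in base R without caret. This is a minimal illustration on simulated data (hypothetical variables standing in for the motor data), not the caret implementation shown next:

```r
# Manual 4-fold cross-validation for a linear model
set.seed(7)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 5 + 2 * dat$x1 - dat$x2 + rnorm(100)

k <- 4
fold <- sample(rep(1:k, length.out = nrow(dat)))  # assign each row to a fold
fold_rmse <- numeric(k)
for (i in 1:k) {
  train <- dat[fold != i, ]              # model developed on k-1 folds
  test  <- dat[fold == i, ]              # one fold kept aside for testing
  fit   <- lm(y ~ x1 + x2, data = train)
  pred  <- predict(fit, test)
  fold_rmse[i] <- sqrt(mean((test$y - pred)^2))
}
mean(fold_rmse)  # aggregate measure over the k iterations
```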
K-Fold Cross-Validation in R
library(caret)
kfolds<-trainControl(method="cv",number=4)
model<-train(claimamt~vehage+CC+Length,data=motor,method="lm",
             trControl=kfolds)
model
__________________________________________________________________
Linear Regression

1000 samples
   3 predictor

No pre-processing
Resampling: Cross-Validated (4 fold)
Summary of sample sizes: 749, 751, 750, 750
Resampling results:

  RMSE      Rsquared
  11544.92  0.7286847

For comparison, the overall (full-data) performance measures were RMSE: 11444.51 and Multiple R-squared: 0.7327. The similar RMSE values indicate a stable model.
Cross Validation in Predictive Modeling
Repeated K-Fold Cross-Validation
• k-fold cross-validation can be repeated 'm' times to arrive at a more robust measure of model performance.
• Repeated k-fold CV does the same as above, but more than once. For example, five repeats of 10-fold CV would give 50 total resamples that are averaged. Note that this is not the same as 50-fold CV.
• The process requires a computer with good computing power.
Repeated K-Fold Cross-Validation in R
library(caret)
kfolds<-trainControl(method="repeatedcv",number=4,repeats=5)
model<-train(claimamt~vehage+CC+Length,data=motor,method="lm",
             trControl=kfolds)
model
__________________________________________________________________
Linear Regression

1000 samples
   3 predictor

No pre-processing
Resampling: Cross-Validated (4 fold, repeated 5 times)
Summary of sample sizes: 750, 750, 750, 750, 750, 748, ...
Resampling results:

  RMSE      Rsquared
  11498.15  0.7319296

For comparison, the overall (full-data) performance measures were RMSE: 11444.51 and Multiple R-squared: 0.7327. The similar RMSE values indicate a stable model.
Cross Validation in Predictive Modeling
Leave-One-Out Cross-Validation (LOOCV)
• Leave-one-out cross-validation (LOOCV) is a special case of k-fold cross-validation where k equals the sample size (k = n).
• Observation number i is kept aside, and the model is developed using the remaining data. Observation number i is then predicted using the model and the error is computed.
• The process is repeated for all i (i.e. n times in total).
• RMSE is computed based on these predicted residuals.
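The loop described above can be sketched directly in base R on simulated data (hypothetical, not the motor data). For linear models fitted with lm(), the same predicted residuals are also available in closed form via the hat values (the PRESS residuals), so the n refits are not strictly necessary:

```r
# Manual LOOCV for a simple linear model
set.seed(1)
dat <- data.frame(x = rnorm(50))
dat$y <- 1 + 2 * dat$x + rnorm(50)

n <- nrow(dat)
err <- numeric(n)
for (i in 1:n) {
  fit    <- lm(y ~ x, data = dat[-i, ])         # fit without observation i
  err[i] <- dat$y[i] - predict(fit, dat[i, ])   # predicted residual for i
}
rmse_loocv <- sqrt(mean(err^2))

# Closed-form shortcut for lm(): PRESS residuals e_i / (1 - h_ii)
fit_all   <- lm(y ~ x, data = dat)
press_res <- residuals(fit_all) / (1 - hatvalues(fit_all))
all.equal(sqrt(mean(press_res^2)), rmse_loocv)  # TRUE
```

This shortcut is one reason LOOCV is cheap for linear regression even though it looks like n separate model fits.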
Leave-One-Out Cross-Validation (LOOCV) in R
library(caret)
kfolds<-trainControl(method="LOOCV")
model<-train(claimamt~vehage+CC+Length,data=motor,method="lm",
             trControl=kfolds)
model
__________________________________________________________________
Linear Regression

1000 samples
   3 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 999, 999, 999, 999, 999, 999, ...
Resampling results:

  RMSE      Rsquared
  11515.85  0.7294088

For comparison, the overall (full-data) performance measures were RMSE: 11444.51 and Multiple R-squared: 0.7327. The similar RMSE values indicate a stable model.
THANK YOU!!