
Multiple Linear Regression

Model Validation

Day 7
Session 2

Cross Validation in Predictive Modeling

• Cross-validation is the process of evaluating a model on 'out-of-sample' data.

• Model performance measures such as R-squared or Root Mean Squared Error (RMSE) tend to be optimistic on 'in-sample' data.

• More realistic measures of model performance are calculated using 'out-of-sample' data.

• Cross-validation is, in this context, a procedure for estimating generalization performance.

Cross Validation in Predictive Modeling
Methods

• Hold-Out Validation

• K-Fold Cross-Validation

• Repeated K-Fold Cross-Validation

• Leave-One-Out Cross-Validation (LOOCV)

Introduction to Caret Package in R

• The caret package (short for Classification And REgression Training) is a set of
functions that attempt to streamline the process for creating predictive models.

• The package contains tools for:

 data splitting
 pre-processing
 feature selection
 model tuning using resampling
 variable importance estimation

Snapshot of the Data
Predicting Claim Amount
n=1000
VehicleAge CC Length Weight Claimamt
4 1495 4250 1023 72000
2 1061 3495 875 72000
2 1405 3675 980 50400
7 1298 4090 930 39960
2 1495 4250 1023 106800
1 1086 3565 854 69592
4 796 3495 740 38400
4 1061 3520 830 43182
2 796 3335 665 40346
1 1405 3675 980 76800
0 1086 3565 854 77822
2 1061 3520 825 72000
3 2499 4435 1585 88560
3 1405 3675 980 48000
8 1405 3675 980 25920
2 796 3335 665 43358
2 1086 3565 854 67200
1 1405 3675 980 78000
2 1396 3675 980 57216

Recap: Model Output
Parameter Estimates

Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   VIF
Intercept     1          -49195            5475.1511     -8.99     <.0001    0
VehicleAge    1       -6638.0765            155.5247    -42.68     <.0001    1.03836
CC            1           8.6886              1.4809      5.87     <.0001    2.83393
Length        1          32.0652              1.8522     17.31     <.0001    2.88972

R² = 73.19%.
Note: variable Weight was excluded to correct the multicollinearity problem.

Cross Validation in Predictive Modeling
Hold-Out Validation

• In the hold-out validation method, the available data is split into two non-overlapping parts: 'training data' and 'testing data'.

• The model is developed using the training data and evaluated using the testing data.

• The training data should have the larger sample size; typically 70%-80% of the data is used for model development.

Training data: data used to fit the model
Test data: "fresh" data used to evaluate the model

Hold-Out Validation in R

#import csv file 'Motor Insurance claim amount'
motor<-read.csv(file.choose(),header=T)
library(caret)

#Partition data into 2 parts. p=0.8 indicates an 80:20 partition.
#createDataPartition generates the list of observation numbers
#to be included in the training data.
index<-createDataPartition(motor$claimamt,p=0.8,list=FALSE)
head(index)
dim(index)

#Now the training and testing data sets are ready.
traindata<-motor[index,]
testdata<-motor[-index,]
dim(traindata)
dim(testdata)

Hold-Out Validation in R (contd.)

motor_model<-lm(claimamt~Length+CC+vehage,data=traindata)
traindata$res<-residuals(motor_model)
head(traindata)
RMSEtrain<-sqrt(mean(traindata$res**2))
RMSEtrain
[1] 11512.18

testdata$pred<-predict(motor_model,testdata)
testdata$res<-(testdata$claimamt-testdata$pred)
RMSEtest<-sqrt(mean(testdata$res**2))
RMSEtest
[1] 11181.45

Overall performance measures: RMSE: 11444.51; Multiple R-squared: 0.7327
The training and testing RMSE values are close, indicating a stable model.

rmse function in ModelMetrics

# Obtain predicted values of Y
traindata$fit<-fitted(motor_model)

install.packages("ModelMetrics")
library(ModelMetrics)

# rmse function requires observed and predicted values of Y
rmse(traindata$claimamt,traindata$fit)

Cross Validation in Predictive Modeling
K-Fold Cross-Validation

• In k-fold cross-validation, the data is first partitioned into k equally (or nearly equally) sized segments, called folds.

• k iterations of training and testing are performed, such that in each iteration one fold is kept aside for testing and the model is developed using the remaining k-1 folds.

• The model performance measure is an aggregate (average) of the measures from these k iterations.
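The iteration scheme above can be sketched in a few lines of base R. This is a minimal illustration on simulated data, not caret's implementation; the data and variable names here are made up:

```r
set.seed(1)
# simulated data standing in for the motor data set
n <- 100
df <- data.frame(x = runif(n))
df$y <- 2 + 3 * df$x + rnorm(n)

k <- 4
# randomly assign each observation to one of k folds
fold <- sample(rep(1:k, length.out = n))

rmse_fold <- numeric(k)
for (i in 1:k) {
  train <- df[fold != i, ]   # k-1 folds for model development
  test  <- df[fold == i, ]   # held-out fold for testing
  fit   <- lm(y ~ x, data = train)
  pred  <- predict(fit, newdata = test)
  rmse_fold[i] <- sqrt(mean((test$y - pred)^2))
}
mean(rmse_fold)   # aggregate (average) RMSE over the k folds
```

Each observation is tested exactly once, and the averaged fold RMSE is the cross-validated performance measure.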

Cross Validation in Predictive Modeling
K-Fold Cross-Validation

[Diagram: the data is divided into k folds; in each of the k iterations a different fold is held out for testing.]
K-Fold Cross-Validation in R

library(caret)
kfolds<-trainControl(method="cv",number=4)
model<-train(claimamt~vehage+CC+Length,data=motor,method="lm",
             trControl=kfolds)
model
__________________________________________________________________
Linear Regression
1000 samples
3 predictor

No pre-processing
Resampling: Cross-Validated (4 fold)
Summary of sample sizes: 749, 751, 750, 750
Resampling results:

  RMSE      Rsquared
  11544.92  0.7286847

Overall performance measures: RMSE: 11444.51; Multiple R-squared: 0.7327
The cross-validated RMSE is close to the overall RMSE, indicating a stable model.
Cross Validation in Predictive Modeling
Repeated K-Fold Cross-Validation

• k-fold cross-validation can be repeated 'm' times to arrive at a more robust measure of model performance.

• Repeated k-fold CV does the same as above, but more than once. For example, five repeats of 10-fold CV give 50 total resamples that are averaged. Note that this is not the same as 50-fold CV.

• The process requires a computer with good computing power.
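The repetition can be sketched by wrapping a single k-fold pass in an outer loop. Again a minimal illustration on simulated data, not caret's implementation; names are made up:

```r
set.seed(2)
n <- 100
df <- data.frame(x = runif(n))
df$y <- 2 + 3 * df$x + rnorm(n)

# one pass of k-fold CV: returns the k fold-level RMSE values
cv_rmse <- function(df, k) {
  fold <- sample(rep(1:k, length.out = nrow(df)))
  sapply(1:k, function(i) {
    fit  <- lm(y ~ x, data = df[fold != i, ])
    test <- df[fold == i, ]
    sqrt(mean((test$y - predict(fit, newdata = test))^2))
  })
}

m <- 5; k <- 10
# m repeats of k-fold CV: m*k = 50 resamples in total
resamples <- unlist(lapply(1:m, function(r) cv_rmse(df, k)))
length(resamples)
mean(resamples)   # averaged RMSE over all 50 resamples
```

Each repeat uses a fresh random fold assignment, which is why the averaged result is more robust than a single k-fold pass.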

Repeated K-Fold Cross-Validation in R

library(caret)
kfolds<-trainControl(method="repeatedcv",number=4,repeats=5)
model<-train(claimamt~vehage+CC+Length,data=motor,method="lm",
             trControl=kfolds)
model
__________________________________________________________________
Linear Regression
1000 samples
3 predictor

No pre-processing
Resampling: Cross-Validated (4 fold, repeated 5 times)
Summary of sample sizes: 750, 750, 750, 750, 750, 748, ...
Resampling results:

  RMSE      Rsquared
  11498.15  0.7319296

Overall performance measures: RMSE: 11444.51; Multiple R-squared: 0.7327
The cross-validated RMSE is close to the overall RMSE, indicating a stable model.
Cross Validation in Predictive Modeling
Leave-One-Out Cross-Validation (LOOCV)

• Leave-one-out cross-validation (LOOCV) is a special case of k-fold cross-validation where k equals the sample size (k = n).

• Suppose observation number i is kept aside. The model is developed using the remaining data; observation i is then predicted using this model and the prediction error is computed.

• The process is repeated for all i (i.e., n times).

• RMSE is computed from these predicted residuals.
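As an aside for linear models specifically, the n refits can be avoided: the leave-one-out (PRESS) residual for observation i equals the ordinary residual divided by (1 - h_i), where h_i is the leverage from hatvalues(). A minimal sketch on simulated data (not from the slides; names are made up):

```r
set.seed(3)
n <- 50
df <- data.frame(x = runif(n))
df$y <- 2 + 3 * df$x + rnorm(n)

fit <- lm(y ~ x, data = df)
# PRESS residuals: ordinary residual scaled by 1/(1 - leverage)
press <- residuals(fit) / (1 - hatvalues(fit))
rmse_loocv <- sqrt(mean(press^2))

# check against the explicit n-refit loop described above
loo <- sapply(1:n, function(i) {
  f <- lm(y ~ x, data = df[-i, ])
  df$y[i] - predict(f, newdata = df[i, , drop = FALSE])
})
all.equal(rmse_loocv, sqrt(mean(loo^2)))   # TRUE
```

This shortcut is exact for ordinary least squares, which is why LOOCV is cheap for linear regression even when n is large.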

Leave-One-Out Cross-Validation (LOOCV) in R

library(caret)
kfolds<-trainControl(method="LOOCV")
model<-train(claimamt~vehage+CC+Length,data=motor,method="lm",
             trControl=kfolds)
model
__________________________________________________________________
Linear Regression
1000 samples
3 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 999, 999, 999, 999, 999, 999, ...
Resampling results:

  RMSE      Rsquared
  11515.85  0.7294088

Overall performance measures: RMSE: 11444.51; Multiple R-squared: 0.7327
The cross-validated RMSE is close to the overall RMSE, indicating a stable model.
THANK YOU!!
