Cross-Validation in R programming
Last Updated: 15 Sep, 2021
The major challenge in designing a machine learning model is to make it work accurately on unseen data. To know whether a model works well, we must test it against data points that were not present during training; these points serve as unseen data and make it straightforward to evaluate the model's accuracy. Cross-validation is one of the most effective techniques for checking a machine learning model, and it can be implemented easily in the R programming language. In cross-validation, a portion of the dataset is reserved and not used in training. Once the model is ready, that reserved set is used for testing: values of the dependent variable are predicted, and the model's accuracy is calculated from the prediction error, i.e., the difference between the actual and predicted values of the dependent variable. Several statistical metrics are used for evaluating the accuracy of regression models:
- Root Mean Squared Error (RMSE): As the name suggests, it is the square root of the mean of the squared differences between the actual and predicted values of the target variable. It measures the average magnitude of the prediction error, so a lower RMSE means a more accurate model.
- Mean Absolute Error (MAE): This metric is the mean of the absolute differences between the actual values and the values predicted by the model for the target variable. Because the errors are not squared, MAE is less sensitive to outliers than RMSE, making it a suitable choice when outliers should not dominate the evaluation. A lower value indicates a better model.
- R2 Error: The R-squared metric indicates what percentage of the variance in the dependent variable is explained collectively by the independent variables. In other words, it reflects the strength of the relationship between the target variable and the model on a scale of 0-100%, so a better model has a higher R-squared. A small sketch computing all three metrics from their definitions follows this list.
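To make the definitions concrete, here is a minimal R sketch that computes the three metrics by hand on small made-up vectors (`actual` and `predicted` are hypothetical, not taken from the marketing data used later). Note that the `R2()` helper from the caret package used below defaults to the squared correlation between observations and predictions, which can differ slightly from the variance-explained definition shown here.
R
# made-up ground-truth values and model predictions
actual    <- c(3.0, 5.5, 7.2, 9.1)
predicted <- c(2.8, 6.0, 7.0, 9.5)

# Root Mean Squared Error: square root of the mean squared difference
rmse <- sqrt(mean((actual - predicted)^2))

# Mean Absolute Error: mean of the absolute differences
mae <- mean(abs(actual - predicted))

# R-squared: proportion of variance in `actual` explained by the model
r2 <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)

c(RMSE = rmse, MAE = mae, R2 = r2)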
Types of Cross-Validation
While partitioning the complete dataset into a training set and a validation set, there is a chance that important or crucial data points end up outside the training set. Since the model never sees those points during training, it may fail to detect some patterns, which can lead to overfitting or underfitting. To avoid this, different cross-validation techniques guarantee random sampling of the training and validation sets and maximize the accuracy of the model. Some of the most popular cross-validation techniques are:
- Validation Set Approach
- Leave one out cross-validation(LOOCV)
- K-fold cross-Validation
- Repeated K-fold cross-validation
Loading the Dataset
To implement linear regression, we use the marketing dataset, which comes with the datarium package for R. Below is the code to import this dataset into your R environment.
R
# loading required packages
# package to perform data manipulation
# and visualization
library(tidyverse)
# package to compute
# cross - validation methods
library(caret)
# installing package to
# import desired dataset
install.packages("datarium")
# loading the dataset
data("marketing", package = "datarium")
# inspecting the dataset
head(marketing)
Output:
youtube facebook newspaper sales
1 276.12 45.36 83.04 26.52
2 53.40 47.16 54.12 12.48
3 20.64 55.08 83.16 11.16
4 181.80 49.56 70.20 22.20
5 216.96 12.96 70.08 15.48
6 10.44 58.68 90.00 8.64
Validation Set Approach (or data split)
In this method, the dataset is divided randomly into a training set and a testing set. The following steps are performed to implement this technique:
- Randomly sample the dataset
- Train the model on the training dataset
- Apply the resultant model to the testing dataset
- Calculate the prediction error using the model performance metrics
Below is the implementation of this method:
R
# R program to implement
# the validation set approach

# setting seed to generate a
# reproducible random sampling
set.seed(123)

# creating training data as 80% of the dataset
random_sample <- createDataPartition(marketing$sales,
                                     p = 0.8, list = FALSE)

# generating the training dataset
# from the random_sample
training_dataset <- marketing[random_sample, ]

# generating the testing dataset
# from rows which are not
# included in random_sample
testing_dataset <- marketing[-random_sample, ]

# building the model: sales is the
# target variable and all other columns
# are independent variables
model <- lm(sales ~ ., data = training_dataset)

# predicting the target variable
predictions <- predict(model, testing_dataset)

# computing model performance metrics
data.frame(R2 = R2(predictions, testing_dataset$sales),
           RMSE = RMSE(predictions, testing_dataset$sales),
           MAE = MAE(predictions, testing_dataset$sales))
Output:
R2 RMSE MAE
1 0.9049049 1.965508 1.433609
Advantages:
- One of the most basic and simple techniques for evaluating a model.
- No complex steps for implementation.
Disadvantages:
- Predictions made by the model are highly dependent on the subset of observations used for training and validation, as illustrated in the sketch below.
- Using only one subset of the data for training can make the model biased.
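To illustrate the first drawback, the following sketch (an illustration only, assuming caret and the marketing data are loaded as in the code above) repeats the 80/20 split five times with fresh random samples; each split yields a noticeably different RMSE estimate.
R
set.seed(42)  # arbitrary seed, for reproducibility of the illustration
rmse_by_split <- sapply(1:5, function(i) {
  # draw a fresh random 80/20 split on every iteration
  idx <- createDataPartition(marketing$sales, p = 0.8, list = FALSE)
  fit <- lm(sales ~ ., data = marketing[idx, ])
  preds <- predict(fit, marketing[-idx, ])
  RMSE(preds, marketing[-idx, "sales"])
})
rmse_by_split  # five different estimates from five different splits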
Leave One Out Cross-Validation (LOOCV)
This method also splits the dataset into two parts, but it overcomes the drawbacks of the validation set approach. LOOCV carries out cross-validation in the following way:
- Train the model on N-1 data points
- Test the model against the one data point that was left out in the previous step
- Calculate the prediction error
- Repeat the above three steps until the model has been trained and tested on every data point
- Generate the overall prediction error by taking the average of the prediction errors from all iterations
Below is the implementation of this method:
R
# R program to implement
# leave-one-out cross-validation

# defining training control
# as leave-one-out cross-validation
train_control <- trainControl(method = "LOOCV")

# training the model: sales is the
# target variable and all other columns
# are independent variables
model <- train(sales ~ ., data = marketing,
               method = "lm",
               trControl = train_control)

# printing model performance metrics
# along with other details
print(model)
Output:
Linear Regression
200 samples
3 predictor
No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 199, 199, 199, 199, 199, 199, ...
Resampling results:
RMSE Rsquared MAE
2.059984 0.8912074 1.539441
Tuning parameter 'intercept' was held constant at a value of TRUE
Advantages:
- A less biased model, as almost every data point is used for training.
- No randomness in the performance estimate: every observation is left out exactly once, so LOOCV always produces the same result on the same dataset.
Disadvantages:
- Training the model N times leads to expensive computation time if the dataset is large (although for linear models a closed-form shortcut exists; see the sketch below).
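For linear regression specifically, the leave-one-out errors can be obtained from a single fit via the standard leverage identity: the LOOCV residual for observation i equals the ordinary residual divided by (1 - h_ii), where h_ii is the hat value. A minimal sketch, assuming the marketing data is already loaded:
R
# fit the linear model once on the full data
full_model <- lm(sales ~ ., data = marketing)

# leave-one-out residuals from the hat values:
# loocv residual_i = residual_i / (1 - h_ii)
loocv_residuals <- residuals(full_model) / (1 - hatvalues(full_model))

# LOOCV RMSE; should match the value reported by caret above
sqrt(mean(loocv_residuals^2))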
K-fold Cross-Validation
This cross-validation technique divides the data into K subsets (folds) of almost equal size. One of these K folds is used as the validation set, and the remaining K-1 folds are used to train the model. The complete working procedure of this method is as follows:
- Split the dataset into K folds randomly
- Use K-1 folds for training the model
- Test the model against the one fold that was left out in the previous step
- Repeat the above steps K times, i.e., until the model has been trained and tested on all folds
- Generate the overall prediction error by taking the average of the prediction errors from all iterations
Below is the implementation of this method:
R
# R program to implement
# K-fold cross-validation

# setting seed to generate a
# reproducible random sampling
set.seed(125)

# defining training control
# as cross-validation with
# K equal to 10
train_control <- trainControl(method = "cv",
                              number = 10)

# training the model: sales is the
# target variable and all other columns
# are independent variables
model <- train(sales ~ ., data = marketing,
               method = "lm",
               trControl = train_control)

# printing model performance metrics
# along with other details
print(model)
Output:
Linear Regression
200 samples
3 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 181, 180, 180, 179, 180, 180, ...
Resampling results:
RMSE Rsquared MAE
2.027409 0.9041909 1.539866
Tuning parameter 'intercept' was held constant at a value of TRUE
Advantages:
- Fast computation speed.
- A very effective method to estimate the prediction error and the accuracy of a model.
Disadvantages:
- A lower value of K leads to a more biased estimate, while a higher value of K increases the variability of the performance metrics. It is therefore important to choose K carefully; generally K = 5 or K = 10 is preferred. A small comparison sketch follows below.
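To see the effect of K directly, the following sketch (assuming caret and the marketing data are loaded as above) runs cross-validation with both commonly recommended values and prints the resulting RMSE estimates; the seed is reset before each run so that only K differs.
R
for (k in c(5, 10)) {
  set.seed(125)  # same seed so only the number of folds changes
  ctrl <- trainControl(method = "cv", number = k)
  fit <- train(sales ~ ., data = marketing,
               method = "lm", trControl = ctrl)
  cat("K =", k, "-> RMSE:", fit$results$RMSE, "\n")
}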
Repeated K-fold Cross-Validation
As the name suggests, in this method the K-fold cross-validation algorithm is repeated a certain number of times, and the performance metrics are averaged across all repetitions. Below is the implementation of this method:
R
# R program to implement
# repeated K-fold cross-validation

# setting seed to generate a
# reproducible random sampling
set.seed(125)

# defining training control as
# repeated cross-validation with
# K equal to 10 and 3 repetitions
train_control <- trainControl(method = "repeatedcv",
                              number = 10, repeats = 3)

# training the model: sales is the
# target variable and all other columns
# are independent variables
model <- train(sales ~ ., data = marketing,
               method = "lm",
               trControl = train_control)

# printing model performance metrics
# along with other details
print(model)
Output:
Linear Regression
200 samples
3 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 181, 180, 180, 179, 180, 180, ...
Resampling results:
RMSE Rsquared MAE
2.020061 0.9038559 1.541517
Tuning parameter 'intercept' was held constant at a value of TRUE
Advantages:
- In each repetition, the data sample is shuffled, which produces different splits of the sample data and a more stable estimate of the performance metrics.
Disadvantages:
- With each repetition, the algorithm has to train the model from scratch, so the computation time to evaluate the model is multiplied by the number of repetitions.
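One way to see what the repetitions provide is to inspect the per-fold results that caret stores on the fitted object: `model$resample` holds one row of metrics for each of the 10 x 3 = 30 resamples. A short sketch, assuming `model` is the object fitted in the code above:
R
# per-resample metrics: one row per fold and repetition
head(model$resample)

# spread of the RMSE estimate across all 30 resamples
sd(model$resample$RMSE)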
Note: For both regression and classification machine learning models, repeated K-fold cross-validation is the most preferred technique.