The validation set approach is a cross-validation technique in machine learning. Cross-validation techniques are often used to judge the performance and accuracy of a machine learning model. In the validation set approach, the dataset used to build the model is randomly divided into two parts: a training set and a validation set (or testing set). The model is trained on the training set, and its accuracy is measured by predicting the target variable for the data points it did not see during training, that is, the validation set. Splitting the data, training the model, and testing the model involves several steps, but the R language provides numerous libraries and built-in functions that carry out all of these tasks easily and efficiently.
Steps Involved in the Validation Set Approach
- Randomly split the dataset in a certain ratio (generally a 70-30 or 80-20 split is preferred)
- Train the model on the training set
- Apply the resultant model to the validation set
- Calculate the model's accuracy from the prediction error, using model performance metrics (a minimal sketch of the whole cycle follows this list)
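In code, the whole cycle is only a few lines. Below is a minimal base-R sketch on the built-in mtcars data; it is a generic illustration under assumed names (train_rows, fit, pred), separate from the detailed walkthrough that follows.
R
# reproducible split
set.seed(42)
# split: 70% of the row indices go to the training set
train_rows <- sample(seq_len(nrow(mtcars)), size = floor(0.7 * nrow(mtcars)))
train_data <- mtcars[train_rows, ]
test_data <- mtcars[-train_rows, ]
# train: fit a model on the training set only
fit <- lm(mpg ~ ., data = train_data)
# validate: predict on the held-out rows and measure the error (RMSE)
pred <- predict(fit, newdata = test_data)
sqrt(mean((test_data$mpg - pred)^2))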
This article discusses the step-by-step method of implementing the validation set approach as a cross-validation technique for both classification and regression machine learning models.
For Classification Machine Learning Models
This type of machine learning model is used when the target variable is categorical, for example positive/negative or diabetic/non-diabetic. The model predicts the class label of the dependent variable. Here, the logistic regression algorithm will be applied to build the classification model.
Step 1: Loading the dataset and other required packages
Before doing any exploration or manipulation, one must load all the required libraries and packages, along with the dataset, so that the various built-in functions can be used and the whole process becomes easier to carry out.
R
# loading required packages
# package to perform data manipulation
# and visualization
library(tidyverse)
# package to compute
# cross-validation methods
library(caret)
# package used to split the data
# used during classification into
# train and test subsets
library(caTools)
# loading package to
# import desired dataset
library(ISLR)
Step 2: Exploring the dataset
It is necessary to understand the structure and dimensions of the dataset, as this will help in building a correct model. Also, since this is a classification model, one must know the different categories present in the target variable.
R
# assigning the complete dataset
# Smarket to a variable
dataset <- Smarket[complete.cases(Smarket), ]
# display the dataset with details
# like column name and its data type
# along with values in each row
glimpse(dataset)
# checking values present
# in the Direction column
# of the dataset
table(dataset$Direction)
Output:
Rows: 1,250
Columns: 9
$ Year <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, ...
$ Lag1 <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1...
$ Lag2 <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0...
$ Lag3 <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -...
$ Lag4 <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, ...
$ Lag5 <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, ...
$ Volume <dbl> 1.19130, 1.29650, 1.41120, 1.27600, 1.20570, 1.34910, 1.44500, 1.40780, 1.16400, 1.23260, 1.30900, 1.25800, 1.09800, 1.05310, ...
$ Today <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1.747, 0...
$ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, Up, Down, Up, Down, Up, Down, Down, Down, Down, Up, Down, Down, Up...
> table(dataset$Direction)
Down Up
602 648
According to the above information, the imported dataset has 1,250 rows and 9 columns. The data type <dbl> means a double-precision floating-point number (dbl is short for double). The target variable must be of factor data type in classification models. Since the data type of the Direction column is already <fct>, there is no need to change anything.
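If the target column were stored as character or numeric values instead, converting it to a factor would be a single line. The call below is a hypothetical illustration only; Smarket already stores Direction as a factor.
R
# only needed if Direction were not already a factor
dataset$Direction <- as.factor(dataset$Direction)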
Moreover, the response (target) variable is a binary categorical variable (the values in the column are only Down and Up), and the proportion of the two class labels is approximately 1:1, meaning they are balanced. If there were a class imbalance, for example a 1:2 proportion of class labels, we would have to make sure both categories are in approximately equal proportion before training. For this purpose, there are many techniques, such as the following (a brief illustrative sketch follows the list):
- Down Sampling
- Up Sampling
- Hybrid Sampling using SMOTE and ROSE
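As a rough sketch of the first two options, the caret package ships upSample() and downSample() helpers that rebalance a factor outcome; the calls below are illustrative only (with assumed result names balanced_up and balanced_down), since Smarket is already balanced. SMOTE and ROSE come from separate packages such as smotefamily and ROSE.
R
# illustrative rebalancing with caret (not required for Smarket)
predictors <- dataset[, setdiff(names(dataset), "Direction")]
# duplicate minority-class rows until the classes match
balanced_up <- upSample(x = predictors, y = dataset$Direction,
yname = "Direction")
# or drop majority-class rows until the classes match
balanced_down <- downSample(x = predictors, y = dataset$Direction,
yname = "Direction")
table(balanced_up$Direction)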
Step 3: Building the model and generating the validation set
This step involves randomly splitting the dataset into training and validation sets and training the model. Below is the implementation.
R
# setting seed to generate a
# reproducible random sampling
set.seed(100)
# dividing the complete dataset
# into 2 parts having ratio of
# 70% and 30%
spl = sample.split(dataset$Direction, SplitRatio = 0.7)
# selecting that part of dataset
# which belongs to the 70% of the
# dataset divided in previous step
train = subset(dataset, spl == TRUE)
# selecting that part of dataset
# which belongs to the 30% of the
# dataset divided in previous step
test = subset(dataset, spl == FALSE)
# checking number of rows and column
# in training and testing dataset
print(dim(train))
print(dim(test))
# Building the model
# training the model by assigning the Direction column
# as the target variable and all other columns
# as independent variables
model_glm = glm(Direction ~ . , family = "binomial",
data = train, maxit = 100)
Output:
> print(dim(train))
[1] 875 9
> print(dim(test))
[1] 375 9
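A useful property of caTools::sample.split is that it preserves the proportion of the outcome's class labels in both subsets; this is easy to verify with a quick optional check:
R
# the Up/Down proportions should be roughly the same
# in the training and validation sets
prop.table(table(train$Direction))
prop.table(table(test$Direction))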
Step 4: Predicting the target variable
As the training of the model is complete, it is time to make predictions on the unseen data. Here, the target variable has only 2 possible values, so in the predict() function it is desirable to use type = "response" so that the model returns the probability score of the target class as a value between 0 and 1.
There is an optional step of transforming these probability scores into a factor of class labels: if the probability score of a data point is above a certain threshold, it is treated as Up, and if it is below that threshold, it is treated as Down. Here, the probability cutoff is set at 0.5. Below is the code to implement these steps.
R
# predictions on the validation set
predictTest = predict(model_glm, newdata = test,
type = "response")
# assigning the probability cutoff as 0.5
predicted_classes <- as.factor(ifelse(predictTest >= 0.5,
"Up", "Down"))
Step 5: Evaluating the accuracy of the model
The best way to judge the accuracy of a classification machine learning model is through a confusion matrix. This matrix gives numerical values showing how many data points were predicted correctly and incorrectly, by comparison with the actual values of the target variable in the testing dataset. Along with the confusion matrix, other statistical details of the model such as accuracy and kappa can be calculated using the code below.
R
# generating confusion matrix and
# other details from the
# prediction made by the model
print(confusionMatrix(predicted_classes, test$Direction))
Output:
Confusion Matrix and Statistics
Reference
Prediction Down Up
Down 177 5
Up 4 189
Accuracy : 0.976
95% CI : (0.9549, 0.989)
No Information Rate : 0.5173
P-Value [Acc > NIR] : <2e-16
Kappa : 0.952
Mcnemar's Test P-Value : 1
Sensitivity : 0.9779
Specificity : 0.9742
Pos Pred Value : 0.9725
Neg Pred Value : 0.9793
Prevalence : 0.4827
Detection Rate : 0.4720
Detection Prevalence : 0.4853
Balanced Accuracy : 0.9761
'Positive' Class : Down
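The headline figures can also be reproduced by hand from the predictions, which clarifies what confusionMatrix() reports; this is a small optional sketch.
R
# raw confusion table and overall accuracy computed manually
table(Predicted = predicted_classes, Actual = test$Direction)
mean(predicted_classes == test$Direction)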
For Regression Machine Learning Models
Regression models are used to predict a continuous quantity, like the price of a house, the sales of a product, etc. Generally, in a regression problem, the target variable is a real number such as an integer or floating-point value. The accuracy of this kind of model is calculated by averaging the errors in predicting the output of various data points. Below are the steps to implement the validation set approach for linear regression models.
Step 1: Loading the dataset and required packages
The R language contains a variety of datasets. Here we are using the trees dataset, an inbuilt dataset, for the linear regression model. Below is the code to import the required dataset and packages to perform the various operations to build the model.
R
# loading required packages
# package to perform data manipulation
# and visualization
library(tidyverse)
# package to compute
# cross-validation methods
library(caret)
# access the data from R’s datasets package
data(trees)
# look at the first several rows of the data
head(trees)
Output:
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
So, in this dataset, there are a total of 3 columns, among which Volume is the target variable. Since the variable is continuous in nature, a linear regression algorithm can be used to predict the outcome.
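Before fitting the model, a quick look at the pairwise correlations gives a rough idea of how strongly Girth and Height are related to Volume; this is an optional exploratory step.
R
# pairwise correlations between the three numeric columns
cor(trees)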
Step 2: Building the model and generating the validation set
In this step, the dataset is split randomly in a ratio of 80-20. 80% of the data points will be used to train the model, while the remaining 20% act as the validation set, which will give us the accuracy of the model. Below is the code for the same.
R
# reproducible random sampling
set.seed(123)
# creating training data as 80% of the dataset
random_sample <- createDataPartition(trees$Volume,
p = 0.8, list = FALSE)
# generating training dataset
# from the random_sample
training_dataset <- trees[random_sample, ]
# generating testing dataset
# from rows which are not
# included in random_sample
testing_dataset <- trees[-random_sample, ]
# Building the model
# training the model by assigning the Volume column
# as the target variable and all other columns
# as independent variables
model <- lm(Volume ~., data = training_dataset)
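As in the classification example, the sizes of the two subsets can be checked with dim(); trees has only 31 rows, so expect roughly 25 training rows and 6 validation rows. This is an optional check.
R
# checking the number of rows and columns in
# the training and validation datasets
print(dim(training_dataset))
print(dim(testing_dataset))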
Step 3: Predict the target variable
After building and training the model, predictions of the target variable are made for the data points belonging to the validation set.
R
# predicting the target variable
predictions <- predict(model, testing_dataset)
Step 4: Evaluating the accuracy of the model
Statistical metrics that are used for evaluating the performance of a linear regression model are Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the R2 error. Among these, the R2 metric is often considered the most informative, and its value should be high for a better model. Below is the code to calculate the prediction error of the model.
R
# computing model performance metrics
data.frame(R2 = R2(predictions, testing_dataset$Volume),
RMSE = RMSE(predictions, testing_dataset$Volume),
MAE = MAE(predictions, testing_dataset$Volume))
Output:
R2 RMSE MAE
1 0.9564487 5.274129 4.73567
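The same three numbers can be reproduced directly from their definitions, which makes it clear what caret's R2(), RMSE() and MAE() helpers compute. The sketch below uses an assumed intermediate variable (errors); the R2 shown here is the squared correlation between predictions and observations, which should match caret's default closely.
R
# computing the metrics by hand from the prediction errors
errors <- testing_dataset$Volume - predictions
data.frame(R2 = cor(predictions, testing_dataset$Volume)^2,
RMSE = sqrt(mean(errors^2)),
MAE = mean(abs(errors)))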
Advantages of the Validation Set approach
- One of the most basic and simple techniques for evaluating a model.
- No complex steps for implementation.
Disadvantages of the Validation Set approach
- Predictions made by the model are highly dependent on the subset of observations used for training and validation.
- Using only one subset of the data for training can make the model biased.
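The first point is easy to demonstrate: re-running the regression example with different seeds produces noticeably different error estimates, because each seed selects a different validation set. The sketch below is illustrative only, reusing the trees data and the caret helpers loaded above (rmse_by_seed is an assumed name).
R
# RMSE on the validation set for several different random splits
rmse_by_seed <- sapply(c(1, 42, 123, 2024), function(s) {
  set.seed(s)
  idx <- createDataPartition(trees$Volume, p = 0.8, list = FALSE)
  fit <- lm(Volume ~ ., data = trees[idx, ])
  RMSE(predict(fit, trees[-idx, ]), trees[-idx, ]$Volume)
})
rmse_by_seed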