The Validation Set Approach in R Programming
The Validation Set Approach is one of the simplest model validation techniques: the dataset is randomly divided into two parts, a training set used to fit the model and a validation set used to assess its performance. Because the model is evaluated on data it never saw during training, this gives an estimate of how well it generalizes to new data and makes overfitting easier to detect. R offers packages that make the entire process efficient and easy to implement.
Steps Involved in the Validation Set Approach
- Randomly splitting the dataset into training and validation sets, usually in a 70:30 or 80:20 ratio.
- Training the machine learning model using the training dataset.
- Applying the trained model to the validation dataset to make predictions.
- Calculating model performance with evaluation metrics such as a confusion matrix and accuracy (for classification) or RMSE, MAE and R² (for regression); a minimal end-to-end sketch of these steps follows this list.
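As a quick illustration of the four steps, the sketch below runs the full workflow on the built-in mtcars dataset with a 70:30 split and a simple linear model; the predictors (wt, hp), the target (mpg) and the split ratio are chosen purely for demonstration.
R
set.seed(42)                                             # 1. reproducible random split
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = round(0.7 * n))   #    70% of rows go to training

train_set <- mtcars[train_idx, ]
valid_set <- mtcars[-train_idx, ]

fit <- lm(mpg ~ wt + hp, data = train_set)               # 2. train on the training set
preds <- predict(fit, newdata = valid_set)               # 3. predict on the validation set

sqrt(mean((valid_set$mpg - preds)^2))                    # 4. evaluate with RMSE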
Implementation of the Validation Set Approach for Classification in R
We implement the Validation Set Approach using a logistic regression model for a classification problem where the target variable is categorical. The dataset used is Smarket from the ISLR package, and the goal is to predict the Direction of the market (Up or Down).
1. Installing the required packages
We install the necessary libraries that support data manipulation, model building, and evaluation.
- tidyverse: used for data cleaning and visualization.
- caret: used for computing model performance metrics.
- caTools: used for creating train and test splits.
- ISLR: provides the Smarket dataset for classification.
R
install.packages("tidyverse")
install.packages("caret")
install.packages("caTools")
install.packages("ISLR")
2. Loading the required packages
We load the libraries after installation to access their functions and datasets.
R
library(tidyverse)
library(caret)
library(caTools)
library(ISLR)
3. Exploring and preparing the dataset
We assign the dataset to a variable, inspect its structure, and analyze the class distribution of the target variable.
- Smarket: daily stock market data with lagged returns, trading volume and the Direction of movement (Up or Down).
- complete.cases(): filters out rows with missing values.
- glimpse(): displays structure and column types of the dataset.
- table(): checks the frequency distribution of the Direction variable.
R
dataset <- Smarket[complete.cases(Smarket), ]
glimpse(dataset)
table(dataset$Direction)
Output:
4. Splitting the dataset and training the model
We divide the dataset into training and validation sets and build a logistic regression model.
- set.seed(): ensures reproducibility of random splitting.
- sample.split(): creates a logical vector for the train-test split while preserving the Up/Down class ratio.
- subset(): subsets the dataset based on the logical split.
- glm(): fits a logistic regression model with Direction as the response.
R
set.seed(100)                                              # reproducible split
spl <- sample.split(dataset$Direction, SplitRatio = 0.7)   # 70% train, 30% test
train <- subset(dataset, spl == TRUE)
test <- subset(dataset, spl == FALSE)

# logistic regression using all remaining columns (Year, Lag1-Lag5, Volume, Today) as predictors
model_glm <- glm(Direction ~ ., family = "binomial", data = train, maxit = 100)
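As an optional sanity check (not part of the original workflow), you can confirm that sample.split() preserved the Up/Down proportions in both subsets and inspect the fitted coefficients:
R
prop.table(table(train$Direction))   # class proportions in the training set
prop.table(table(test$Direction))    # should closely match the training proportions
summary(model_glm)                   # coefficients and significance of the fitted model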
5. Predicting the target variable
We predict probabilities and classify the outcomes based on a threshold of 0.5.
- predict(): outputs predicted probabilities for the test data.
- ifelse(): assigns class labels based on probability cutoff.
- as.factor(): converts labels into factor format for evaluation.
R
predictTest = predict(model_glm, newdata = test, type = "response")
predicted_classes <- as.factor(ifelse(predictTest >= 0.5, "Up", "Down"))
6. Evaluating the accuracy of the model
We use a confusion matrix to assess the model’s classification accuracy and related statistics.
- confusionMatrix(): calculates accuracy, kappa, sensitivity, specificity, and more.
R
print(confusionMatrix(predicted_classes, test$Direction))
Output:
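If you want to verify the reported accuracy without caret, the same figure can be recovered from a plain contingency table; this is only an illustrative cross-check, not part of the original steps.
R
conf_tab <- table(Predicted = predicted_classes, Actual = test$Direction)
conf_tab                              # raw confusion matrix counts
sum(diag(conf_tab)) / sum(conf_tab)   # correct predictions / total = accuracy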
Implementation of the Validation Set Approach for Regression in R
We implement the Validation Set Approach for a regression problem using a linear regression model in R. The dataset used is the built-in trees dataset, and the goal is to predict the continuous target variable Volume.
1. Installing the required packages
We install the necessary packages that help with data handling and model evaluation.
- tidyverse: used for data manipulation and visualization.
- caret: used to compute regression metrics and handle data partitioning.
R
install.packages("tidyverse")
install.packages("caret")
2. Loading the required packages and dataset
We load the libraries and import the dataset for use in the regression model.
- trees: built-in dataset with tree Girth, Height and Volume measurements.
- data(): loads the dataset into the environment.
- head(): previews the top records of the dataset.
R
library(tidyverse)
library(caret)
data(trees)
head(trees)
3. Splitting the dataset and training the model
We divide the dataset into training and testing subsets and build the linear regression model.
- set.seed(): ensures the random sampling is reproducible.
- createDataPartition(): creates indices for training data partition.
- lm(): fits a linear model using Volume as the target variable.
R
set.seed(123)
random_sample <- createDataPartition(trees$Volume, p = 0.8, list = FALSE)
training_dataset <- trees[random_sample, ]
testing_dataset <- trees[-random_sample, ]
model <- lm(Volume ~ ., data = training_dataset)
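Optionally, the fitted model can be inspected on the training data before making predictions; this step is not part of the original workflow but is a common habit.
R
summary(model)   # coefficients, residual standard error and R-squared on the training set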
4. Predicting the target variable
We make predictions on the validation dataset using the trained regression model.
- predict(): generates predicted values of the target variable.
R
predictions <- predict(model, testing_dataset)
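To eyeball the quality of the fit before computing formal metrics, you can line up the predicted and actual values (an optional, illustrative step):
R
head(data.frame(actual = testing_dataset$Volume,
                predicted = round(predictions, 2)))   # first few predictions vs. actual Volume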
5. Evaluating the accuracy of the model
We compute performance metrics to evaluate the regression model.
- R2(): calculates the coefficient of determination.
- RMSE(): computes the root mean square error.
- MAE(): computes the mean absolute error.
R
data.frame(R2   = R2(predictions, testing_dataset$Volume),
           RMSE = RMSE(predictions, testing_dataset$Volume),
           MAE  = MAE(predictions, testing_dataset$Volume))
Output:
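As a cross-check, the same three metrics can be computed by hand without caret's helper functions; the manual R² below uses the squared correlation, which is what caret's R2() uses by default.
R
errors <- testing_dataset$Volume - predictions
data.frame(R2   = cor(predictions, testing_dataset$Volume)^2,   # squared correlation
           RMSE = sqrt(mean(errors^2)),                         # root mean squared error
           MAE  = mean(abs(errors)))                            # mean absolute error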
Advantages of the Validation Set Approach
- One of the most basic and easy-to-implement techniques for model evaluation.
- No complex steps or repeated computations required.
- Offers a quick estimate of model performance using unseen data.
- Requires less computational time compared to advanced validation techniques.
Disadvantages of the Validation Set Approach
- The performance estimate depends heavily on which observations land in the training and validation sets, so it can vary noticeably from one random split to another.
- Because the model is trained on only a subset of the data, it may perform worse than one fit on the full dataset, and the validation error tends to overestimate the true test error.