
The Validation Set Approach in R Programming

Last Updated : 08 Jul, 2025

The Validation Set Approach is a basic model-validation technique in which the dataset is divided into two separate parts: one part is used for training the model and the other for validating its performance. This gives an estimate of how well the model performs on new, unseen data and helps detect overfitting. R packages such as caret and caTools make the approach easy to implement.

Steps Involved in the Validation Set Approach

  1. Randomly splitting the dataset into training and validation sets, usually in a 70:30 or 80:20 ratio.
  2. Training the machine learning model using the training dataset.
  3. Applying the trained model to the validation dataset to make predictions.
  4. Calculating model performance using evaluation metrics such as a confusion matrix and accuracy (for classification) or RMSE, MAE and R² (for regression), as shown in the sketch below.
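
The following sketch (not part of the original article) illustrates the four steps in base R on the built-in mtcars dataset with an 80:20 split; the model formula and variable names are purely illustrative.

R
# Step 1: reproducible random split of mtcars into training and validation sets (80:20)
set.seed(42)
train_idx <- sample(nrow(mtcars), size = floor(0.8 * nrow(mtcars)))
train_set <- mtcars[train_idx, ]
valid_set <- mtcars[-train_idx, ]

# Step 2: train a model on the training set
fit <- lm(mpg ~ wt + hp, data = train_set)

# Step 3: predict on the validation set
preds <- predict(fit, newdata = valid_set)

# Step 4: evaluate performance, here with RMSE
sqrt(mean((valid_set$mpg - preds)^2))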

Implementation of the Validation Set Approach for Classification in R

We implement the Validation Set Approach using a logistic regression model for a classification problem where the target variable is categorical. The dataset used is Smarket from the ISLR package, and the goal is to predict the Direction of the market (Up or Down).

1. Installing the required packages

We install the necessary libraries that support data manipulation, model building, and evaluation.

  • tidyverse: used for data cleaning and visualization.
  • caret: used for computing model performance metrics.
  • caTools: used for creating train and test splits.
  • ISLR: provides the Smarket dataset for classification.
R
install.packages("tidyverse")
install.packages("caret")
install.packages("caTools")
install.packages("ISLR")

2. Loading the required packages

We load the libraries after installation to access their functions and datasets.

R
library(tidyverse)
library(caret)
library(caTools)
library(ISLR)

3. Exploring and preparing the dataset

We assign the dataset to a variable, inspect its structure, and analyze the class distribution of the target variable.

  • Smarket: contains market data with lag values and direction movement.
  • complete.cases(): filters out rows with missing values.
  • glimpse(): displays structure and column types of the dataset.
  • table(): checks frequency distribution of the Direction variable.
R
dataset <- Smarket[complete.cases(Smarket), ]
glimpse(dataset)
table(dataset$Direction)

Output:

The structure of the Smarket data (via glimpse()) and the class counts of the Direction variable (Down and Up).
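
As an optional complementary check (not in the original code), the same class counts can be viewed as proportions, which makes any class imbalance easier to spot:

R
# Relative frequency of Up and Down days in the full dataset
prop.table(table(dataset$Direction))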

4. Splitting the dataset and training the model

We divide the dataset into training and validation sets and build a logistic regression model.

  • set.seed(): ensures reproducibility of random splitting.
  • sample.split(): creates a logical vector for train-test division.
  • subset(): subsets the dataset based on the logical split.
  • glm(): fits a logistic regression model with Direction as the response.
R
set.seed(100)
spl <- sample.split(dataset$Direction, SplitRatio = 0.7)
train <- subset(dataset, spl == TRUE)
test <- subset(dataset, spl == FALSE)
model_glm <- glm(Direction ~ ., family = "binomial", data = train, maxit = 100)

5. Predicting the target variable

We predict probabilities and classify the outcomes based on a threshold of 0.5.

  • predict(): outputs predicted probabilities for the test data.
  • ifelse(): assigns class labels based on probability cutoff.
  • as.factor(): converts labels into factor format for evaluation.
R
predictTest <- predict(model_glm, newdata = test, type = "response")
predicted_classes <- as.factor(ifelse(predictTest >= 0.5, "Up", "Down"))
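
Since Direction is a factor with levels Down and Up, glm() models the probability of the second level, Up, which is why probabilities of 0.5 or more are labelled "Up". As a quick optional sanity check (not part of the original code), the raw accuracy can be computed directly before building the full confusion matrix:

R
# Proportion of validation-set predictions that match the true Direction
mean(as.character(predicted_classes) == as.character(test$Direction))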

6. Evaluating the accuracy of the model

We use a confusion matrix to assess the model’s classification accuracy and related statistics.

  • confusionMatrix(): calculates accuracy, kappa, sensitivity, specificity, and more.
R
print(confusionMatrix(predicted_classes, test$Direction))

Output:

The confusion matrix for the validation set, along with overall accuracy, kappa, sensitivity, specificity and related statistics.

Implementation of the Validation Set Approach for Regression in R

We implement the Validation Set Approach for a regression problem using a linear regression model in R. The dataset used is the inbuilt trees dataset and the goal is to predict the continuous target variable Volume.

1. Installing the required packages

We install the necessary packages that help with data handling and model evaluation.

  • tidyverse: used for data manipulation and visualization.
  • caret: used to compute regression metrics and handle data partitioning.
R
install.packages("tidyverse")
install.packages("caret")

2. Loading the required packages and dataset

We load the libraries and import the dataset for use in the regression model.

  • trees: inbuilt dataset that includes Girth, Height, and Volume.
  • data(): loads the dataset into the environment.
  • head(): previews the top records of the dataset.
R
library(tidyverse)
library(caret)
data(trees)
head(trees)

3. Splitting the dataset and training the model

We divide the dataset into training and testing subsets and build the linear regression model.

  • set.seed(): ensures the random sampling is reproducible.
  • createDataPartition(): creates indices for training data partition.
  • lm(): fits a linear model using Volume as the target variable.
R
set.seed(123)
random_sample <- createDataPartition(trees$Volume, p = 0.8, list = FALSE)
training_dataset  <- trees[random_sample, ]
testing_dataset <- trees[-random_sample, ]
model <- lm(Volume ~ ., data = training_dataset)

4. Predicting the target variable

We make predictions on the validation dataset using the trained regression model.

  • predict(): generates predicted values of the target variable.
R
predictions <- predict(model, testing_dataset)
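
As an optional visual check (not part of the original code), plotting predicted against observed volumes shows how closely the predictions track the actual values:

R
# Predicted vs observed Volume on the validation set; points near the line y = x indicate a good fit
plot(testing_dataset$Volume, predictions,
     xlab = "Observed Volume", ylab = "Predicted Volume")
abline(0, 1)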

5. Evaluating the performance of the model

We compute performance metrics to evaluate the regression model.

  • R2(): calculates the coefficient of determination.
  • RMSE(): computes the root mean square error.
  • MAE(): computes the mean absolute error.
R
data.frame(R2 = R2(predictions, testing_dataset$Volume), 
           RMSE = RMSE(predictions, testing_dataset$Volume), 
           MAE = MAE(predictions, testing_dataset$Volume))

Output:

A data frame containing the R², RMSE and MAE values computed on the validation set.
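
The caret helpers above wrap simple formulas. Computing the same quantities by hand (an optional check, not part of the original code) makes the metrics explicit; note that caret's R2() defaults to the squared correlation between predictions and observations, so its value can differ slightly from the 1 − SSres/SStot form used here.

R
actual <- testing_dataset$Volume
errors <- actual - predictions

rmse <- sqrt(mean(errors^2))                                 # root mean square error
mae  <- mean(abs(errors))                                    # mean absolute error
r2   <- 1 - sum(errors^2) / sum((actual - mean(actual))^2)   # coefficient of determination

c(R2 = r2, RMSE = rmse, MAE = mae)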

Advantages of the Validation Set Approach

  • One of the most basic and easy-to-implement techniques for model evaluation.
  • No complex steps or repeated computations required.
  • Offers a quick estimate of model performance using unseen data.
  • Requires less computational time compared to advanced validation techniques.

Disadvantages of the Validation Set Approach

  • Model performance estimates can vary noticeably depending on the particular random split used, as illustrated in the sketch below.
  • Because the model is trained on only a portion of the data, the resulting estimate can be biased and less representative of how the model would perform when trained on the full dataset.
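
The first point can be seen directly by repeating the regression example with different random seeds; the small illustrative sketch below (not part of the original article) typically yields noticeably different RMSE values from one split to the next.

R
# RMSE on the trees data for five different random 80:20 splits
rmse_per_seed <- sapply(1:5, function(seed) {
  set.seed(seed)
  idx   <- createDataPartition(trees$Volume, p = 0.8, list = FALSE)
  fit   <- lm(Volume ~ ., data = trees[idx, ])
  preds <- predict(fit, trees[-idx, ])
  RMSE(preds, trees[-idx, "Volume"])
})
rmse_per_seed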
