Logistic regression double loop in R
Last Updated: 20 Jun, 2024
Performing logistic regression with a double loop in R is useful for tasks like cross-validation or nested model selection, where an outer loop iterates over different sets of predictors and an inner loop iterates over subsets or parameter settings within each set. This approach helps with fine-tuning the model and comparing candidate specifications systematically.
Steps to Perform Logistic Regression with a Double Loop
Here are the main steps to perform logistic regression with a double loop in the R programming language.
- Prepare the Data: Ensure your dataset is suitable for logistic regression.
- Define Predictor Sets and Subsets: Create lists of different sets of predictors and subsets or parameter settings.
- Iterate Over Predictor Sets: Use an outer loop to iterate over different sets of predictors.
- Iterate Over Subsets or Parameters: Use an inner loop to iterate over subsets or parameter settings within each predictor set.
- Fit Logistic Regression Models: Fit the logistic regression models within the inner loop.
- Store and Summarize Results: Collect and summarize the results from each model.
Let's walk through an example using a built-in dataset, such as mtcars, modified for binary classification.
Step 1: Prepare the Data
Load the mtcars dataset and create a binary response variable.
R
# Load the necessary dataset
data(mtcars)
# Create a binary response variable: 1 if mpg > 20, else 0
mtcars$mpg_binary <- ifelse(mtcars$mpg > 20, 1, 0)
# Convert to factors where necessary
mtcars$cyl <- factor(mtcars$cyl)
mtcars$gear <- factor(mtcars$gear)
# View the dataset
head(mtcars)
Output:
mpg cyl disp hp drat wt qsec vs am gear carb mpg_binary
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 0
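Before fitting any models, it can help to check how balanced the new binary response is; a quick standalone sketch:

```r
# Recreate the binary response and count observations in each class
data(mtcars)
mtcars$mpg_binary <- ifelse(mtcars$mpg > 20, 1, 0)
table(mtcars$mpg_binary)
#  0  1
# 18 14
```

With 18 zeros and 14 ones out of 32 rows, the classes are reasonably balanced, but the dataset is small, which matters for the separation issues discussed later.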
Step 2: Define Predictor Sets and Subsets
Create lists of different sets of predictors and subsets.
R
# Define the sets of predictors
predictor_sets <- list(
  c("hp"),
  c("hp", "wt"),
  c("hp", "wt", "qsec")
)

# Define subsets (or other parameters) to iterate over within each predictor set
# For this example, we'll iterate over different interaction terms
interaction_terms <- list(
  NULL,
  c("hp:wt"),
  c("hp:wt", "wt:qsec")
)
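To see how the two lists combine, here is the formula the loops below will build when the second predictor set is paired with the second interaction entry (a standalone sketch of the same `paste()`/`as.formula()` pattern):

```r
predictors <- c("hp", "wt")   # second predictor set
interactions <- c("hp:wt")    # second interaction entry
formula_parts <- c(predictors, interactions)
f <- as.formula(paste("mpg_binary ~", paste(formula_parts, collapse = " + ")))
f
# mpg_binary ~ hp + wt + hp:wt
```

Note that `c(predictors, NULL)` silently drops the `NULL`, which is exactly why `NULL` works as the "no interactions" entry in `interaction_terms`.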
Step 3: Iterate Over Predictor Sets and Subsets
Use nested loops to fit logistic regression models.
R
# Initialize a list to store models and results
results <- list()

# Outer loop: iterate over sets of predictors
for (i in seq_along(predictor_sets)) {
  predictors <- predictor_sets[[i]]
  # Inner loop: iterate over interaction terms
  for (j in seq_along(interaction_terms)) {
    interactions <- interaction_terms[[j]]
    # Build the model formula from the predictors and interaction terms
    formula_parts <- c(predictors, interactions)
    formula <- as.formula(paste("mpg_binary ~", paste(formula_parts, collapse = " + ")))
    # Fit the logistic regression model
    model <- glm(formula, data = mtcars, family = binomial)
    # Store the model summary in the results list
    results[[paste("Model", i, "Interaction", j)]] <- summary(model)
  }
}
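Since each `summary()` printout is long, it is often convenient to also collect a compact comparison table while looping. Here is a self-contained sketch of the same double loop that records only the AIC of each fit:

```r
data(mtcars)
mtcars$mpg_binary <- ifelse(mtcars$mpg > 20, 1, 0)

predictor_sets <- list(c("hp"), c("hp", "wt"), c("hp", "wt", "qsec"))
interaction_terms <- list(NULL, c("hp:wt"), c("hp:wt", "wt:qsec"))

aic_table <- data.frame()
for (i in seq_along(predictor_sets)) {
  for (j in seq_along(interaction_terms)) {
    formula_parts <- c(predictor_sets[[i]], interaction_terms[[j]])
    f <- as.formula(paste("mpg_binary ~", paste(formula_parts, collapse = " + ")))
    # suppressWarnings(): several of these fits separate perfectly and warn
    fit <- suppressWarnings(glm(f, data = mtcars, family = binomial))
    aic_table <- rbind(aic_table,
                       data.frame(predictor_set = i, interaction = j, AIC = AIC(fit)))
  }
}
aic_table[order(aic_table$AIC), ]  # smallest AIC first
```

This gives one row per model (9 in total here), which is much easier to scan than nine full summaries. As discussed below, however, a low AIC alone should not be trusted when the fits show signs of separation.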
Step 4: Print and Summarize the Results
The models were already fitted and their summaries stored within the inner loop; now print the stored summaries.
R
# Print the results summaries
for (name in names(results)) {
  cat("\n", name, "\n")
  print(results[[name]])
}
Output:
Model 1 Interaction 1
Call:
glm(formula = formula, family = binomial, data = mtcars)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 24.4431 14.7518 1.657 0.0975 .
hp -0.2110 0.1309 -1.612 0.1069
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.8601 on 31 degrees of freedom
Residual deviance: 8.4922 on 30 degrees of freedom
AIC: 12.492
Number of Fisher Scoring iterations: 10
Model 1 Interaction 2
Call:
glm(formula = formula, family = binomial, data = mtcars)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 360.311 105224.092 0.003 0.997
hp 3.583 1144.003 0.003 0.998
hp:wt -2.080 606.438 -0.003 0.997
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4.3860e+01 on 31 degrees of freedom
Residual deviance: 2.4373e-08 on 29 degrees of freedom
AIC: 6
Number of Fisher Scoring iterations: 25
Model 1 Interaction 3
Call:
glm(formula = formula, family = binomial, data = mtcars)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.481e+02 3.331e+05 0.002 0.998
hp -7.213e-01 2.222e+03 0.000 1.000
hp:wt -6.887e-01 6.110e+02 -0.001 0.999
wt:qsec -4.887e+00 3.386e+03 -0.001 0.999
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4.3860e+01 on 31 degrees of freedom
Residual deviance: 1.2814e-08 on 28 degrees of freedom
AIC: 8
Number of Fisher Scoring iterations: 25
Model 2 Interaction 1
Call:
glm(formula = formula, family = binomial, data = mtcars)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 894.228 365884.162 0.002 0.998
hp -2.021 858.062 -0.002 0.998
wt -202.865 84688.218 -0.002 0.998
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4.3860e+01 on 31 degrees of freedom
Residual deviance: 1.1156e-08 on 29 degrees of freedom
AIC: 6
Number of Fisher Scoring iterations: 25
Model 2 Interaction 2
Call:
glm(formula = formula, family = binomial, data = mtcars)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1376.310 847009.895 0.002 0.999
hp -7.017 4404.678 -0.002 0.999
wt -384.770 240635.444 -0.002 0.999
hp:wt 1.847 1187.706 0.002 0.999
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4.3860e+01 on 31 degrees of freedom
Residual deviance: 4.7507e-09 on 28 degrees of freedom
AIC: 8
Number of Fisher Scoring iterations: 25
Model 2 Interaction 3
Call:
glm(formula = formula, family = binomial, data = mtcars)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1337.773 886478.329 0.002 0.999
hp -7.012 4486.808 -0.002 0.999
wt -336.040 580802.371 -0.001 1.000
hp:wt 1.781 1303.441 0.001 0.999
wt:qsec -1.525 18207.652 0.000 1.000
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4.3860e+01 on 31 degrees of freedom
Residual deviance: 4.6651e-09 on 27 degrees of freedom
AIC: 10
Number of Fisher Scoring iterations: 25
Model 3 Interaction 1
Call:
glm(formula = formula, family = binomial, data = mtcars)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.109e+03 2.074e+06 0.001 1.000
hp -2.588e+00 5.439e+03 0.000 1.000
wt -1.745e+02 2.734e+05 -0.001 0.999
qsec -1.253e+01 1.176e+05 0.000 1.000
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4.3860e+01 on 31 degrees of freedom
Residual deviance: 1.1077e-08 on 28 degrees of freedom
AIC: 8
Number of Fisher Scoring iterations: 25
Model 3 Interaction 2
Call:
glm(formula = formula, family = binomial, data = mtcars)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.434e+03 3.467e+06 0.000 1.000
hp -7.146e+00 8.625e+03 -0.001 0.999
wt -3.760e+02 5.690e+05 -0.001 0.999
qsec -3.498e+00 2.033e+05 0.000 1.000
hp:wt 1.835e+00 1.390e+03 0.001 0.999
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4.3860e+01 on 31 degrees of freedom
Residual deviance: 4.6546e-09 on 27 degrees of freedom
AIC: 10
Number of Fisher Scoring iterations: 25
Model 3 Interaction 3
Call:
glm(formula = formula, family = binomial, data = mtcars)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.095e+03 7.677e+06 0 1
hp -9.549e-01 1.492e+04 0 1
wt 5.949e+02 2.465e+06 0 1
qsec 1.450e+02 3.546e+05 0 1
hp:wt 2.681e-01 4.309e+03 0 1
wt:qsec -4.168e+01 1.142e+05 0 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4.3860e+01 on 31 degrees of freedom
Residual deviance: 2.4511e-09 on 26 degrees of freedom
AIC: 12
Number of Fisher Scoring iterations: 25
- Model Summaries: The output contains summaries for each model fitted using different sets of predictors and interaction terms.
- Coefficients: Each model summary includes coefficients, standard errors, z-values, and p-values for the predictors and interaction terms.
- Model Fit Statistics: The summaries also provide model fit statistics such as the AIC.
The summaries show logistic regression models fitted with different interaction terms. Most models exhibit extremely low residual deviance, enormous standard errors, and non-significant p-values for every coefficient. Together with the Fisher scoring iteration count hitting the default limit of 25, these are classic symptoms of (quasi-)complete separation: the predictors hp, wt, and qsec and their interactions can classify this small dataset perfectly, so the coefficient estimates diverge. The low AIC values therefore reflect overfitting rather than genuinely good models, and these fits are not reliable for prediction. This output highlights the importance of careful model selection and validation to avoid overfitting.
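The separation suspected above can be checked directly. A minimal standalone diagnostic sketch, refitting one of the degenerate models:

```r
data(mtcars)
mtcars$mpg_binary <- ifelse(mtcars$mpg > 20, 1, 0)

# Refit the hp + hp:wt model; the "fitted probabilities numerically 0 or 1"
# warning is itself a separation symptom, suppressed here for clarity
model <- suppressWarnings(
  glm(mpg_binary ~ hp + hp:wt, data = mtcars, family = binomial)
)

range(fitted(model))  # fitted probabilities pushed to the 0/1 boundaries
model$iter            # iteration count at the default glm limit of 25
```

When separation occurs, penalized alternatives such as Firth's bias-reduced logistic regression (for example via the logistf package) are a common remedy.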
Conclusion
Using a double loop to perform logistic regression in R allows for systematic exploration of different sets of predictors and interaction terms. This approach is particularly useful for model selection and fine-tuning. By following the steps outlined in this guide, you can automate the fitting of multiple logistic regression models and efficiently compare their performance.