
Weighted logistic regression in R

Last Updated : 23 Jul, 2025

Weighted logistic regression is an extension of logistic regression that allows for different observations to contribute differently to the estimation process. This is particularly useful in survey data where each observation might represent a different number of units in the population, or in cases where certain observations are more reliable or important than others. This article will explore how to implement weighted logistic regression in R Programming Language.

Weighted Logistic Regression

In standard logistic regression, each observation is treated equally. However, in weighted logistic regression, each observation is assigned a weight. The weights can reflect the relative importance of observations or the number of times each observation should be counted. The weighted logistic regression model estimates coefficients by maximizing a weighted likelihood function.
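In symbols, writing w_i for the weight on observation i, the objective being maximized is the weighted log-likelihood:

```latex
\ell_w(\beta) \;=\; \sum_{i=1}^{n} w_i \Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\Big],
\qquad
p_i \;=\; \frac{1}{1 + e^{-x_i^{\top}\beta}}
```

Setting every w_i = 1 recovers ordinary logistic regression; larger weights simply make an observation count for more in the fit.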

Applications of Weighted Logistic Regression

Weighted logistic regression is widely used in various fields, including:

  • Survey Analysis: Adjusting for sampling design to ensure estimates are representative of the population.
  • Epidemiology: Accounting for different levels of reliability in data from different sources.
  • Economics: Giving more weight to more reliable or recent data points.
  • Machine Learning: Handling imbalanced datasets by giving more weight to minority class observations.
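As a small illustration of the machine-learning use case above, one common scheme is inverse-frequency weighting, where each class's weight is the reciprocal of its sample proportion. This is a sketch with a simulated outcome vector; the names (y, w, class_freq) are illustrative, not from the article:

```r
# Sketch: inverse-frequency class weights for an imbalanced binary outcome
set.seed(42)
y <- rbinom(200, 1, 0.1)            # minority class is roughly 10% of observations

class_freq <- table(y) / length(y)  # sample proportion of each class
w <- ifelse(y == 1,
            1 / class_freq["1"],    # minority observations get a large weight
            1 / class_freq["0"])    # majority observations get a small weight

# With these weights, each class contributes the same total weight to the fit:
sum(w[y == 1])                      # equals length(y)
sum(w[y == 0])                      # equals length(y)
```

A weight vector built this way can be passed straight to the weights argument of glm(), exactly as shown later in this article.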

Implementing Weighted Logistic Regression in R

R provides robust tools for implementing weighted logistic regression. The primary function used for this purpose is glm() (generalized linear model), with the weights argument to specify the weights for each observation. Let’s consider an example where we have survey data on individuals' smoking habits and their health outcomes. Some respondents represent larger groups in the population, so we use weights to account for this.

Here is the step-by-step implementation of Weighted Logistic Regression in R Programming Language.

Step 1. Loading Required Libraries

First, we will load the required libraries: ggplot2 and dplyr (install them with install.packages() if they are not already available).

R
# Load necessary libraries
library(ggplot2)
library(dplyr)

Step 2. Sample Data Preparation

Assume we have a dataset survey_data with the following structure.

R
# Create a sample dataset
set.seed(123)
survey_data <- data.frame(
  health_outcome = rbinom(100, 1, 0.3),
  smoking_status = rbinom(100, 1, 0.4),
  age = rnorm(100, mean = 50, sd = 10),
  weight = runif(100, min = 1, max = 5)
)

# Inspect the first few rows of the dataset
head(survey_data)

Output:

  health_outcome smoking_status      age   weight
1              0              0 42.89593 4.944217
2              1              0 52.56884 1.548270
3              0              0 47.53308 4.621238
4              1              1 46.52457 3.305207
5              1              0 40.48381 2.581795
6              0              1 49.54972 2.799210
  • health_outcome: Binary outcome variable (1 for poor health, 0 for good health).
  • smoking_status: Predictor variable (1 for smoker, 0 for non-smoker).
  • age: Another predictor variable.
  • weight: Survey weight for each observation.

Step 3. Fitting the Weighted Logistic Regression Model

We use the glm() function with the weights argument.

R
# Fit weighted logistic regression model
weighted_logit_model <- glm(health_outcome ~ smoking_status + age, 
                            family = binomial(link = "logit"), 
                            data = survey_data, 
                            weights = survey_data$weight)

# Summarize the model
summary(weighted_logit_model)

Output:

Call:
glm(formula = health_outcome ~ smoking_status + age, family = binomial(link = "logit"),
data = survey_data, weights = survey_data$weight)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)   
(Intercept)    -1.170371   0.738051  -1.586  0.11279   
smoking_status  0.845089   0.270463   3.125  0.00178 **
age            -0.004255   0.014451  -0.294  0.76844   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 339.44 on 99 degrees of freedom
Residual deviance: 329.43 on 97 degrees of freedom
AIC: 339.55

Number of Fisher Scoring iterations: 5

The summary output provides the estimated coefficients, standard errors, z-values, and p-values. These results can be interpreted in the same way as those from a standard logistic regression model, except that they account for the specified weights. Note that glm() may warn about "non-integer #successes in a binomial glm!" when the weights are not whole numbers; the coefficient estimates are still valid weighted estimates, but for formal survey inference with design-based standard errors, the svyglm() function from the survey package is generally preferred.
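Because the coefficients are on the log-odds scale, a common next step is to exponentiate them to obtain odds ratios. The sketch below rebuilds the same data and model from the steps above so it runs standalone, then computes odds ratios and a predicted probability for a hypothetical new respondent:

```r
# Rebuild the survey data and weighted model from the earlier steps
set.seed(123)
survey_data <- data.frame(
  health_outcome = rbinom(100, 1, 0.3),
  smoking_status = rbinom(100, 1, 0.4),
  age = rnorm(100, mean = 50, sd = 10),
  weight = runif(100, min = 1, max = 5)
)
weighted_logit_model <- glm(health_outcome ~ smoking_status + age,
                            family = binomial(link = "logit"),
                            data = survey_data,
                            weights = weight)   # may warn about non-integer successes

# Odds ratios: exponentiate the log-odds coefficients.
# With the coefficients shown above, exp(0.845089) is roughly 2.33, i.e.
# smokers have about 2.3 times the odds of a poor outcome, holding age fixed.
exp(coef(weighted_logit_model))

# Predicted probability of a poor outcome for a hypothetical 50-year-old smoker
new_obs <- data.frame(smoking_status = 1, age = 50)
predict(weighted_logit_model, newdata = new_obs, type = "response")
```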

Conclusion

Weighted logistic regression is a powerful technique for accounting for the varying importance or representation of observations in your data. In R, it can be implemented easily with the glm() function and its weights argument. This article has provided a step-by-step guide to implementing and interpreting a weighted logistic regression model. By using weights appropriately, you can improve the accuracy and representativeness of your logistic regression analyses.
