
Linear regression and logistic regression with missing covariates
cran.r-project.org/web/packages/misaem/vignettes/misaem.html

Introduction to misaem
misaem is a package for linear regression and logistic regression with missing data,
under the MCAR (missing completely at random) and MAR (missing at random) mechanisms.
The covariates are assumed to be continuous variables. The methodology is based on
maximizing the observed likelihood with EM-type algorithms. The package includes:

1. Parameter estimation:

- for linear regression, we assume a joint Gaussian distribution for the covariates and the
response; the norm package then estimates the mean vector and the variance-covariance
matrix with the EM algorithm and the SWEEP operator, and these estimates are reshaped
to obtain the regression coefficients;
- for logistic regression, a stochastic approximation version of the EM algorithm
(SAEM) based on Metropolis-Hastings sampling.

2. Estimation of the standard deviation of the estimated parameters:

- for linear regression, using the property that the Gram matrix of random variables
(estimates of regression coefficients) approximates their covariance matrix;
- for logistic regression, with the Louis formula.

3. Model selection procedure based on BIC.
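
The reshaping step for linear regression can be illustrated on complete data. The sketch below is only an illustration under stated assumptions: it uses sample moments in place of EM estimates, and the names `beta.joint` and `beta.ols` are made up for this example. Under a joint Gaussian model the slopes are Sigma_XX^{-1} Sigma_Xy and the intercept is mu_y minus the slopes times mu_X.

```r
# Sketch: turn a joint mean vector and covariance matrix of (y, X) into
# regression coefficients. Sample moments stand in for EM estimates here.
set.seed(42)
n <- 200
X <- cbind(X1 = rnorm(n, 1, 1), X2 = rnorm(n, 1, 2))
y <- 2 + 3 * X[, "X1"] - X[, "X2"] + rnorm(n, 0, 0.25)

mu <- colMeans(cbind(y, X))  # joint mean of (y, X1, X2)
S  <- cov(cbind(y, X))       # joint covariance of (y, X1, X2)

slopes    <- solve(S[-1, -1], S[-1, 1])    # Sigma_XX^{-1} Sigma_Xy
intercept <- mu[1] - sum(slopes * mu[-1])  # mu_y - slopes' mu_X
beta.joint <- c(intercept, slopes)
names(beta.joint) <- c("(Intercept)", "X1", "X2")

# On complete data this reproduces ordinary least squares
beta.ols <- coef(lm(y ~ X1 + X2, data = data.frame(y, X)))
print(rbind(beta.joint, beta.ols))
```

With missing values, the same algebra would be applied to the EM estimates of the mean and covariance instead of the sample moments.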

library(misaem)

Linear regression

Synthetic dataset
Let's generate a synthetic example of classical linear regression. We first generate a design
matrix of size $n=50$ times $p=2$ by drawing each observation from a multivariate
normal distribution $\mathcal{N}(\mu,\Sigma)$. We consider as the true values for the parameters:

$$\mu = (1, 1), \qquad \Sigma = \begin{bmatrix} 1 & 1 \\ 1 & 4 \end{bmatrix}$$

Then, we generate the response according to the linear regression model with coefficient
$\beta=(2,3,-1)$ and Gaussian noise with standard deviation $\sigma=0.25$.

set.seed(1)
n <- 50 # number of rows
p <- 2 # number of explanatory variables

# Generate complete design matrix


library(MASS)
mu.X <- c(1, 1)
Sigma.X <- matrix(c(1, 1, 1, 4), nrow = 2)
X.complete <- mvrnorm(n, mu.X, Sigma.X)

# Generate response
b <- c(2, 3, -1) # regression coefficient
sigma.eps <- 0.25 # noise standard deviation (rnorm takes sd, not variance)
y <- cbind(rep(1, n), X.complete) %*% b + rnorm(n, 0, sigma.eps)

Then we randomly introduce 15% of missing values in the covariates according to the MCAR
(missing completely at random) mechanism. To do so, we use the function ampute from the
R package mice. For more details about how to generate missing values under different
mechanisms, see R-miss-tastic, the resource website on missing values.
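
To make the difference between mechanisms concrete, here is a minimal base-R sketch that does not use mice. Everything in it (the variables `x1` and `x2`, the 15% rate, the logistic form of the MAR probability) is an assumption chosen for illustration.

```r
# Sketch: MCAR versus MAR missingness on a toy covariate x2.
set.seed(3)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)

# MCAR: every entry of x2 has the same 15% chance of being missing
x2.mcar <- ifelse(runif(n) < 0.15, NA, x2)

# MAR: the chance that x2 is missing depends on the always-observed x1
p.mar  <- plogis(-2 + 1.5 * x1)
x2.mar <- ifelse(runif(n) < p.mar, NA, x2)

# Difference in mean x1 between rows with and without a missing x2
diff.mcar <- mean(x1[is.na(x2.mcar)]) - mean(x1[!is.na(x2.mcar)])
diff.mar  <- mean(x1[is.na(x2.mar)])  - mean(x1[!is.na(x2.mar)])
print(c(MCAR = diff.mcar, MAR = diff.mar))
```

Under MCAR the observed covariate looks the same whether or not `x2` is missing. Under MAR the rows with a missing `x2` have systematically larger `x1`.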

library(mice)

##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
##     cbind, rbind

# Add missing values


yX.miss <- ampute(data.frame(y, X.complete), 0.15, patterns = matrix(
c(0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0),
ncol = 3, byrow = TRUE), freq = c(1, 1, 1, 2, 2, 2) / 9,
mech = "MCAR", bycases = FALSE)
y.obs <- yX.miss$amp[, 1] # responses
X.obs <- as.matrix(yX.miss$amp[, 2:3]) # covariates with NAs

Have a look at our synthetic dataset:

head(X.obs)

## X1 X2
## [1,] 0.30528180 -0.1473747
## [2,] 1.59950261 1.2164969
## [3,] 0.22508791 -0.5764402
## [4,] 2.86148303 3.8938533
## [5,] 0.05283648 2.0009229
## [6,] -1.07586521 -0.1496864

Check the percentage of missing values:

sum(is.na(X.obs))/(n*p)

## [1] 0.17

Estimation for linear regression with missing values

The main function in our package for fitting linear regression with missing values is
miss.lm. It mimics the structure of the widely used function lm for the
case without missing values: it takes an object of class formula (a symbolic description of
the model to be fitted) and a data frame as input. Here we apply this function with its
default options.

# Estimate regression using EM with NA


df.obs = data.frame(y, X.obs)
miss.list = miss.lm(y~., data = df.obs)

## Iterations of EM:
## 1...2...3...4...5...6...7...8...9...10...11...12...13...14...15...16...17...

It returns an object of class miss.lm, which contains the parameter estimates, their
standard errors and the observed log-likelihood. We can print or summarize the
obtained results as follows:

print(miss.list)

##
## Call: miss.lm(formula = y ~ ., data = df.obs)
##
## Coefficients:
## (Intercept) X1 X2
## 1.942 3.052 -1.004
## Standard error estimates:
## (Intercept) X1 X2
## 0.04171 0.03484 0.01936
## Log-likelihood: 31.85

print(summary(miss.list))

##
## Call:
## miss.lm(formula = y ~ ., data = df.obs)
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 1.94205 0.04171
## X1 3.05205 0.03484
## X2 -1.00424 0.01936
## Log-likelihood: 31.852

summary(miss.list)$coef

## Estimate Std. Error
## (Intercept) 1.942050 0.04170924
## X1 3.052050 0.03483511
## X2 -1.004244 0.01936094
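
One way to read the estimates together with their standard errors is through approximate 95% intervals. The sketch below copies the numbers printed above and assumes a normal approximation for the estimators, which is an assumption of this example rather than something the package reports.

```r
# Sketch: approximate 95% Wald intervals from the printed estimates and
# standard errors (normal approximation assumed).
est   <- c("(Intercept)" = 1.942050, X1 = 3.052050, X2 = -1.004244)
se    <- c(0.04170924, 0.03483511, 0.01936094)
truth <- c(2, 3, -1)  # coefficients used to simulate the data

z  <- qnorm(0.975)
ci <- cbind(lower = est - z * se, upper = est + z * se)

# Does each interval contain the simulated coefficient?
covers <- ci[, "lower"] < truth & truth < ci[, "upper"]
print(cbind(ci, truth, covers))
```

In this run all three intervals contain the coefficients used to generate the data.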

Self-defined parameters can also be passed, such as the maximum number of iterations
(maxruns), the convergence tolerance (tol_em) and a logical indicating whether the
iterations should be reported (print_iter).

# Estimate regression using self-defined parameters
miss.list2 = miss.lm(y~., data = df.obs, print_iter = FALSE, maxruns = 500, tol_em = 1e-4)
print(miss.list2)

##
## Call: miss.lm(formula = y ~ ., data = df.obs, print_iter = FALSE, maxruns = 500,
## tol_em = 1e-04)
##
## Coefficients:
## (Intercept) X1 X2
## 1.942 3.052 -1.004
## Standard error estimates:
## (Intercept) X1 X2
## 0.04175 0.03487 0.01938
## Log-likelihood: 31.85

Model selection

The function miss.lm.model.select uses the BIC criterion with a step-wise method to
return the best model selected. We add a null variable with missing values to check whether the
function can distinguish it from the true variables.
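
For reference, the criterion in its standard textbook form is written out below. The symbols are assumptions of this note: $\hat{\ell}$ is the maximized observed log-likelihood, $k$ the number of free parameters and $n$ the number of observations. How misaem counts $k$ is not stated here.

```latex
\mathrm{BIC} = -2\,\hat{\ell} + k \log n
```

The model with the smallest BIC is preferred. Between two nested fits with comparable likelihoods, one extra parameter pays off only if it raises $\hat{\ell}$ by more than $\tfrac{1}{2}\log n$, which is about 1.96 for $n = 50$.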

# Add null variable with NA


X.null <- mvrnorm(n, 1, 1)
patterns <- runif(n)<0.15 # missing completely at random
X.null[patterns] <- NA
X.obs.null <- cbind.data.frame(X.obs, X.null)

# Without model selection


df.obs.null = data.frame(y, X.obs.null)
miss.list.null = miss.lm(y~., data = df.obs.null)

## Iterations of EM:
## 1...2...3...4...5...6...7...8...9...10...11...12...13...14...15...16...17...18...19...

print(miss.list.null)

##
## Call: miss.lm(formula = y ~ ., data = df.obs.null)
##
## Coefficients:
## (Intercept) X1 X2 X.null
## 1.89134 3.05640 -1.00446 0.04426
## Standard error estimates:
## (Intercept) X1 X2 X.null
## 0.05212 0.03362 0.01853 0.02713
## Log-likelihood: 10.2

# Model selection
miss.model = miss.lm.model.select(y, X.obs.null)
print(miss.model)

##
## Call: miss.lm(formula = Y ~ ., data = df, print_iter = FALSE)
##
## Coefficients:
## (Intercept) X1 X2
## 1.942 3.052 -1.004
## Standard error estimates:
## (Intercept) X1 X2
## 0.04171 0.03484 0.01936
## Log-likelihood: 31.85

Prediction on test set

To evaluate the prediction performance, we generate a test set of size $nt=20$
times $p=2$ following the same distribution as the previous design matrix, and we consider
it both without and with 15% of missing values.

# Prediction
# Generate dataset
set.seed(200)
nt <- 20 # number of new observations
Xt <- mvrnorm(nt, mu.X, Sigma.X)

# Add missing values


Xt.miss <- ampute(data.frame(Xt), 0.15, patterns = matrix(
c(0, 1, 1, 0),
ncol = 2, byrow = TRUE), freq = c(1, 1) /2,
mech = "MCAR", bycases = FALSE)
Xt.obs <- as.matrix(Xt.miss$amp) # covariates with NAs

The prediction can be performed for a complete test set:

#train with NA + test no NA


miss.comptest.pred = predict(miss.list2, data.frame(Xt), seed = 100)
print(miss.comptest.pred)

## [1] 3.3878210 2.6112345 -0.5562864 6.5926842 2.9231974 8.0234969
## [7] 0.8286503 3.9363413 6.7515266 3.3517064 6.8156632 2.2406832
## [13] 2.0321507 5.9852215 7.8101528 5.0863422 4.2238612 4.4541193
## [19] 3.5522691 3.0003519

And we can also apply the function when both train set and test set have missing values:

#both train & test with NA


miss.pred = predict(miss.list2, data.frame(Xt.obs), seed = 100)
print(miss.pred)

## [1] 3.3878210 3.4804264 -0.5562864 6.5926842 2.9231974 8.0234969
## [7] 0.1435715 3.9363413 6.7515266 3.3517064 6.8156632 2.2406832
## [13] 2.0321507 5.9150631 7.8101528 3.5570286 4.2238612 4.4541193
## [19] 3.5522691 3.0003519
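
How predict treats missing covariates internally is not described in this text. One common approach under a joint Gaussian model, shown here purely as an assumption, is to replace each missing covariate by its conditional expectation given the observed ones and then apply the linear predictor. The helper `predict_with_na` is hypothetical and uses the true simulation parameters instead of fitted ones.

```r
# Sketch: prediction with missing covariates by conditional-mean imputation
# under a Gaussian model. Not taken from misaem; illustration only.
mu    <- c(1, 1)                                # covariate means
Sigma <- matrix(c(1, 1, 1, 4), nrow = 2)        # covariate covariance
beta  <- c(2, 3, -1)                            # intercept and slopes

predict_with_na <- function(x, mu, Sigma, beta) {
  miss <- is.na(x)
  if (all(miss)) {
    x <- mu                                     # nothing observed: use the mean
  } else if (any(miss)) {
    obs <- !miss
    # E[x_miss | x_obs] = mu_miss + Sigma_mo Sigma_oo^{-1} (x_obs - mu_obs)
    x[miss] <- drop(mu[miss] + Sigma[miss, obs, drop = FALSE] %*%
      solve(Sigma[obs, obs, drop = FALSE], x[obs] - mu[obs]))
  }
  sum(beta * c(1, x))
}

pred.full   <- predict_with_na(c(2, 3),  mu, Sigma, beta)
pred.x2.na  <- predict_with_na(c(2, NA), mu, Sigma, beta)
pred.x1.na  <- predict_with_na(c(NA, 3), mu, Sigma, beta)
print(c(pred.full, pred.x2.na, pred.x1.na))
```

With a complete row the function reduces to the usual linear predictor, which matches the way only the rows with missing entries change between the two prediction vectors above.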

Logistic regression

Synthetic dataset

We first generate a design matrix of size $n=500$ times $p=5$ by drawing each
observation from a multivariate normal distribution $\mathcal{N}(\mu,\Sigma)$. Then, we generate the
response according to the logistic regression model.

We consider as the true values for the parameters

$$\beta = (0, 1, -1, 1, 1, -1), \qquad \mu = (1, 2, 3, 4, 5), \qquad \Sigma = \mathrm{diag}(\sigma)\, C\, \mathrm{diag}(\sigma),$$

where $\sigma$ is the vector of standard deviations

$$\sigma = (1, 2, 3, 4, 5)$$

and $C$ the correlation matrix

$$C = \begin{bmatrix}
1 & 0.8 & 0 & 0 & 0 \\
0.8 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0.3 & 0.6 \\
0 & 0 & 0.3 & 1 & 0.7 \\
0 & 0 & 0.6 & 0.7 & 1
\end{bmatrix}.$$
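
For completeness, the logistic regression model used to draw the response can be written as follows. The notation is an assumption of this note: $\beta_0$ is the intercept, $\beta_{1:p}$ the vector of slopes and $x_i$ the $i$-th row of the design matrix.

```latex
\mathbb{P}(y_i = 1 \mid x_i)
  = \frac{1}{1 + \exp\!\left(-\beta_0 - x_i^{\top}\beta_{1:p}\right)},
  \qquad i = 1, \dots, n
```

Each $y_i$ is then drawn as a Bernoulli variable with this success probability.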

# Generate dataset
set.seed(200)
n <- 500 # number of subjects
p <- 5 # number of explanatory variables
mu.star <- 1:p #rep(0,p) # mean of the explanatory variables
sd <- 1:p # rep(1,p) # standard deviations
C <- matrix(c( # correlation matrix
1, 0.8, 0, 0, 0,
0.8, 1, 0, 0, 0,
0, 0, 1, 0.3, 0.6,
0, 0, 0.3, 1, 0.7,
0, 0, 0.6, 0.7, 1), nrow=p)
Sigma.star <- diag(sd)%*%C%*%diag(sd) # covariance matrix
beta.star <- c(1, -1, 1, 1, -1) # coefficients
beta0.star <- 0 # intercept
beta.true = c(beta0.star,beta.star)

# Design matrix
X.complete <- matrix(rnorm(n*p), nrow=n)%*%chol(Sigma.star)+
matrix(rep(mu.star,n), nrow=n, byrow = TRUE)

# Response vector
p1 <- 1/(1+exp(-X.complete%*%beta.star-beta0.star))
y <- as.numeric(runif(n)<p1)

Then we randomly introduce 10% of missing values in the covariates according to the
MCAR (missing completely at random) mechanism.

# Generate missingness
set.seed(200)
p.miss <- 0.10
patterns <- runif(n*p)<p.miss # missing completely at random
X.obs <- X.complete
X.obs[patterns] <- NA

Have a look at our synthetic dataset:

head(X.obs)

## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0847563 1.71119812 5.0779956 9.731254821 13.02285225
## [2,] 1.2264603 0.04664033 5.3758000 6.383093558 4.84730504
## [3,] 1.4325565 1.77934455 NA 8.421927692 7.26902254
## [4,] 1.5580652 5.69782193 5.5942869 -0.440749372 -0.96662931
## [5,] 1.0597553 -0.38470918 0.4462986 0.008402997 0.04745022
## [6,] 0.8853591 0.56839374 3.4641522 7.047389616 NA

Estimation for logistic regression with missingness

The main function for fitting logistic regression with missing covariates in our package is
miss.glm, which mimics the structure of the widely used function glm. Note that we
don't need to specify the binomial family in the input of miss.glm. Here we apply
this function with its default options, and then print or summarize the obtained
results as follows:

df.obs = data.frame(y, X.obs)

#logistic regression with NA


miss.list = miss.glm(y~., data = df.obs, seed = 100)

## Iteration of SAEM:
## 50 100 150 200

print(miss.list)

##
## Call: miss.glm(formula = y ~ ., data = df.obs, seed = 100)
##
## Coefficients:
## (Intercept) X1 X2 X3 X4 X5
## -0.03659 1.50705 -1.28208 1.12342 1.03435 -1.07691
## Standard error estimates:
## (Intercept) X1 X2 X3 X4 X5
## 0.3210 0.3446 0.2056 0.1408 0.1240 0.1284
## Log-likelihood: -171.7

print(summary(miss.list))

##
## Call:
## miss.glm(formula = y ~ ., data = df.obs, seed = 100)
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.03659 0.32104
## X1 1.50705 0.34456
## X2 -1.28208 0.20560
## X3 1.12342 0.14076
## X4 1.03435 0.12396
## X5 -1.07691 0.12843
## Log-likelihood: -171.74

summary(miss.list)$coef

## Estimate Std. Error
## (Intercept) -0.03659218 0.3210369
## X1 1.50704588 0.3445570
## X2 -1.28208040 0.2056000
## X3 1.12341764 0.1407630
## X4 1.03435057 0.1239566
## X5 -1.07690679 0.1284274

Model selection

To perform model selection with missing values, we use the BIC criterion with a step-wise
method. The function miss.glm.model.select returns the best model selected. With the
current implementation, the BIC-based model selection may run into computational
difficulties when $p$ is greater than 20. In the following simulation, we add a null
variable with missing values to check whether the function can distinguish it from the true
variables.
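
The idea of a BIC-driven step-wise search can be sketched on complete data with base R. This is an illustration only, not the procedure implemented in miss.glm.model.select: it uses glm and BIC from the stats package, a forward search, and toy variables (`X1`, `X2`, `X.null`) invented for the example.

```r
# Sketch: forward step-wise selection by BIC for logistic regression
# on complete data (illustration only).
set.seed(1)
n <- 300
X <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X.null = rnorm(n))
y <- rbinom(n, 1, plogis(1.5 * X$X1 - 1.5 * X$X2))  # X.null has no effect
dat <- cbind(y = y, X)

selected  <- character(0)
remaining <- names(X)
best.bic  <- BIC(glm(y ~ 1, family = binomial, data = dat))

repeat {
  if (length(remaining) == 0) break
  # BIC of each model obtained by adding one more candidate variable
  bics <- sapply(remaining, function(v) {
    f <- reformulate(c(selected, v), response = "y")
    BIC(glm(f, family = binomial, data = dat))
  })
  if (min(bics) >= best.bic) break    # no candidate improves the BIC
  best.bic  <- min(bics)
  selected  <- c(selected, names(which.min(bics)))
  remaining <- setdiff(remaining, selected)
}
print(selected)
```

Each step fits one model per remaining candidate, so the cost grows with the number of covariates. That growth is more expensive still when every fit is itself an iterative algorithm for incomplete data.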

# Add null variable with NA


X.null <- mvrnorm(n, 1, 1)
patterns <- runif(n)<0.10 # missing completely at random
X.null[patterns] <- NA
X.obs.null <- cbind.data.frame(X.obs, X.null)

# Without model selection


df.obs.null = data.frame(y, X.obs.null)
miss.list.null = miss.glm(y~., data = df.obs.null)

## Iteration of SAEM:
## 50 100 150 200

print(miss.list.null)

##
## Call: miss.glm(formula = y ~ ., data = df.obs.null)
##
## Coefficients:
## (Intercept) X1 X2 X3 X4 X5
## -0.08280 1.52860 -1.29067 1.13314 1.05171 -1.09399
## X.null
## 0.03964
## Standard error estimates:
## (Intercept) X1 X2 X3 X4 X5
## 0.3585 0.3514 0.2084 0.1417 0.1241 0.1291
## X.null
## 0.1666
## Log-likelihood: -171.4

# model selection for SAEM


miss.model = miss.glm.model.select(y, X.obs.null)
print(miss.model)

##
## Call: miss.glm(formula = Y ~ ., data = df, print_iter = FALSE, subsets = subset_choose)
##
## Coefficients:
## (Intercept) X1 X2 X3 X4 X5
## -0.06956 1.55837 -1.30913 1.14401 1.06008 -1.10143
## Standard error estimates:
## (Intercept) X1 X2 X3 X4 X5
## 0.3244 0.3500 0.2094 0.1440 0.1279 0.1317
## Log-likelihood: -172

Prediction on test set
To evaluate the prediction performance, we generate a test set of size $nt=100$
times $p=5$ following the same distribution as the design matrix, without and with
10% of missing values. We evaluate the prediction quality with a confusion matrix.

# Generate the test design matrix (complete)


set.seed(200)
nt = 100
X.test <- matrix(rnorm(nt*p), nrow=nt)%*%chol(Sigma.star)+
matrix(rep(mu.star,nt), nrow = nt, byrow = TRUE)

# Generate the test response


p1 <- 1/(1+exp(-X.test%*%beta.star-beta0.star))
y.test <- as.numeric(runif(nt)<p1)

# Generate missingness on test set


p.miss <- 0.10
X.test[runif(nt*p)<p.miss] <- NA

# Prediction on test set


pr.saem <- predict(miss.list, data.frame(X.test))

# Confusion matrix
pred.saem = (pr.saem>0.5)*1
table(y.test,pred.saem )

## pred.saem
## y.test 0 1
## 0 34 8
## 1 6 52
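
The confusion matrix can be condensed into a few summary rates. The sketch below re-enters the counts printed above by hand. The object names (`cm`, `accuracy`, `sensitivity`, `specificity`) are chosen for this example.

```r
# Sketch: summary rates from the confusion matrix printed above.
cm <- matrix(c(34, 6, 8, 52), nrow = 2,
             dimnames = list(y.test = c("0", "1"), pred.saem = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)        # correctly classified share
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # true positives among y.test == 1
specificity <- cm["0", "0"] / sum(cm["0", ])  # true negatives among y.test == 0
print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 3))
```

Here 86 of the 100 test observations are classified correctly.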

Reference
Logistic Regression with Missing Covariates – Parameter Estimation, Model Selection and
Prediction (2020, Jiang W., Josse J., Lavielle M., TraumaBase Group), Computational
Statistics & Data Analysis.
