Full Information Multiple Imputation for Linear Regression Model with Missing Response Variable
Abstract—Linear regression models are commonly used to determine the quantitative relationships between variables and to use the resulting regression equations for prediction. This paper proposes a full information multiple imputation method based on a linear regression model with a missing response variable, which uses all observable data to obtain estimates of the regression coefficients and thereby predicted values of the missing response variable. This not only provides a good explanation of the relationship between the response variable and the independent variables, but also effectively enhances the imputation accuracy for the response variable. The stability and sensitivity of the fiMI method are evaluated through a simulation study. The proposed method is then applied to two real data sets, the admission prediction data set and the goalkeeper data set, and the results are discussed and analyzed.

Index Terms—linear regression models, missing response variables, full information, multiple imputation.

Manuscript received July 1, 2023; revised November 10, 2023. This work was supported by a grant from the National Social Science Foundation Project under project ID 23BTJ059, a grant from the Natural Science Foundation of Shandong under project ID ZR2020MA022, and a grant from the National Statistical Research Program under project ID 2022LY016.
Limin Song is a postgraduate student of Mathematics and Statistics, Shandong University of Technology, Zibo, China (e-mail: [email protected]).
Guangbao Guo is a professor of Mathematics and Statistics, Shandong University of Technology, Zibo, China (corresponding author; phone: 15269366362; e-mail: [email protected]).

I. INTRODUCTION

We consider the following linear regression model

Y = Xβ + ε,  (1)

where X = (Xij) ∈ R^(n×p) is the matrix of independent variables, Xi. = (Xi1, Xi2, ..., Xip) represents the i-th row of X (i = 1, ..., n), X.j = (X1j, X2j, ..., Xnj)⊤ represents the j-th column of X (j = 1, ..., p), β = (β1, β2, ..., βp)⊤ ∈ R^(p×1) is a vector of unknown parameters, Y = (Y1, Y2, ..., Yn)⊤ ∈ R^(n×1) is the response variable, and ε = (ε1, ε2, ..., εn)⊤ ∈ R^(n×1) is the residual vector, with ε ~ N(0, σ²In), i.e. the εi are independent and identically N(0, σ²) distributed.

Suppose there are incompletely observed, independent and identically distributed samples {(Xi., Yi, δi), 1 ≤ i ≤ n}, where {Xi., 1 ≤ i ≤ n} is fully observable, {Yi, 1 ≤ i ≤ n} may be missing, and δi is the indicator of whether Yi is missing, i.e.

δi = 0, if Yi is missing;
δi = 1, if Yi is not missing.

Assume that Y satisfies the MAR mechanism, i.e.

P(δi = 1 | Xi., Yi) = P(δi = 1 | Xi.),

i.e. given Xi., Yi is conditionally independent of δi.

The number of cells in the response variable Y with no missing data and the number of cells with missing data are denoted nOB = Σ_{i=1}^{n} δi and nNA = n − nOB, respectively. Define the observed and missing values in the response variable Y to be denoted Yobs and Ymis, respectively, and the parts of the matrix X corresponding to Yobs and Ymis to be denoted Xobs and Xmis, respectively.

For addressing the imputation problem of missing response variables in linear regression models, the most common methods are mean imputation and regression imputation, but these approaches have some disadvantages. For instance, mean imputation can reduce the correlation between variables, while regression imputation can artificially increase this correlation. Wang et al. [1] (2009) used the expectation-maximization (EM) method to calculate the asymptotic variances and standard errors of the maximum likelihood estimator (MLE) for linear models with a missing response variable. However, the standard deviation can only be calculated after the iterations have converged and cannot be obtained directly. Liu (2012) proposed a new expectation recursive least squares (ERLS) method based on the EM algorithm for linear regression models, avoiding the difficulty of inverting the correlation matrix of high-dimensional data. However, the calculation of the regression coefficients requires several iterations, which increases the computational time.

Methods for dealing with missing data have gone through two main stages: single imputation and multiple imputation. The emergence of multiple imputation addressed the shortcomings of single imputation. Rubin [4] (1987) proposed a multiple imputation procedure that replaces each missing data point with a range of plausible values (thus also reflecting the uncertainty associated with the imputed values); the multiply imputed data sets are then analyzed using standard procedures applicable to complete data sets, and the findings from these analyses are finally pooled. Buuren et al. [2] (2011) used the R package mice to impute incomplete multivariate data using chained equations, providing a practical step-by-step approach to addressing missing data in applications. The mice package is commonly used to impute missing response variables under linear regression models, the most commonly used methods being the predictive mean matching multiple imputation (PMMMI) method, the Bayesian multiple imputation (BayesMI) method, and the bootstrap multiple imputation (bootstrapMI) method. Rubin [6] (1999) and Schafer [7] (1997) conducted a series of studies on Bayesian multiple imputation methods, whose imputation accuracy is strongly influenced by the missing data mechanism. Little [8] (1988), Morris et al. [9] (2015), and Buuren [10] (2018) further discussed predictive mean matching multiple imputation methods and found that the missing data mechanism has a small impact on their imputation accuracy. Chang et al. [5] (2020) studied the problem of missing data for independent variables in a distributed environment and developed an efficient distributed multiple imputation method for horizontally partitioned incomplete data. However, no solution is provided when the response variable is missing.
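To make the setup of model (1) with a MAR response concrete, the following Python sketch generates such data; the logistic form of the observation probability and all constants are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Model (1): Y = X beta + eps, with eps_i ~ N(0, sigma^2) i.i.d.
n, p, sigma = 1000, 5, 1.0
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
Y = X @ beta + sigma * rng.normal(size=n)

# MAR: P(delta_i = 1 | X_i., Y_i) = P(delta_i = 1 | X_i.) -- the chance of
# observing Y_i depends on the covariates only (logistic form assumed here).
prob_obs = 1.0 / (1.0 + np.exp(-X[:, 0]))
delta = (rng.random(n) < prob_obs).astype(int)   # 1 = observed, 0 = missing

n_ob = int(delta.sum())                          # nOB = sum of delta_i
n_na = n - n_ob                                  # nNA = n - nOB
X_obs, Y_obs = X[delta == 1], Y[delta == 1]      # rows with observed response
X_mis = X[delta == 0]                            # design rows with Y missing
```

The arrays `X_obs`, `Y_obs`, and `X_mis` correspond to the quantities Xobs, Yobs, and Xmis defined above.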
II. FULL INFORMATION MULTIPLE IMPUTATION

Multiple imputation (MI) is arguably the most popular method for dealing with missing data. The MI method replaces each missing value with a sample from its posterior predictive distribution. The predictive imputation model is estimated from the observed data and does not use the missing values. The missing values are imputed multiple times in order to account for the uncertainty of the imputation, and each imputed data set is then used to fit an analysis model. The parameter estimates β from these analyses are combined to produce a final estimate from the multiply imputed data sets. This method yields estimates that are more robust than those obtained by using a single value to fill in the missing data.

A straightforward way to analyze such data is to aggregate information from all observable data, so that the imputation draws on everything that was observed. We refer to this as the full information (fi) method, and next we extend it to the full information multiple imputation (fiMI) method. In linear regression models with a missing response variable, general linear regression imputation requires only Xobs⊤ Xobs and Xobs⊤ Yobs to obtain the least squares estimate of the regression coefficients, as can be seen from the following equation:

β̂ = (Xobs⊤ Xobs)^{-1} (Xobs⊤ Yobs).  (2)

However, the regression coefficients estimated in equation (2) may suffer from overfitting, leading to inaccurate predictions. To address this issue, we propose to fit the linear regression imputation model using the fi method, which can be interpreted as fitting the imputation model using all observable data. By passing the imputed model parameters to the fully observable data set, it is expected to achieve the best computational performance because it fully exploits all available information.

According to (1), it follows that Yi ~ N(Xi. β, σ²). With priors

π(σ²) ∝ IG(1/2, 1/2),
β | σ² ~ N(0, σ² λ^{-1} I),

where IG and N denote the inverse gamma and multivariate Gaussian distributions, respectively, the posterior distribution of (σ², β) is given by

σ² | Xobs ~ IG((nOB + 1)/2, (SSE + 1)/2),
β | σ², Xobs ~ N((Xobs⊤ Xobs + λI)^{-1} Xobs⊤ Yobs, σ² (Xobs⊤ Xobs + λI)^{-1}),  (3)

where

SSE = ||Yobs − Xobs β*||₂²,

and the specific representation of β* will be given later. The fiMI method samples (σ², β) from (3), imputes the missing values of the response variable from (1), and fits the analytical linear regression model using the estimated complete data. This process is repeated m times. To avoid extraneous complexity, we assume that nOB > p.

First, we calculate the matrix

A = Xobs⊤ Xobs + λIp×p,

where λ is the regularization parameter, which provides a limited solution to the overfitting problem in (2). The regression weights

β* = A^{-1} Xobs⊤ Yobs

are obtained with reference to (2) and the matrix A. Next, the Cholesky decomposition of the positive definite matrix A yields the matrix CA, i.e.

A = CA⊤ CA,

where CA is an upper triangular matrix. We obtain estimates of the regression coefficients as follows:

β̂ = β* + σ (CA)^{-1} g,  (4)

where g = (g1, g2, ..., gp)⊤ is a p-dimensional variable whose components gi ~ N(0, 1) are mutually independent. At this point

β̂fi = β̂,

Cov(β̂fi) = 1/(p − 1) Σ_{i=1}^{p} ((β̂fi)i − mean(β̂fi))².

According to the sufficient statistics β̂fi and Cov(β̂fi) of the normal distribution, samples β1, ..., βM are obtained, independent of each other and obeying N(β̂fi, Cov(β̂fi)). The multiple regression coefficients β1, ..., βM are sent to the imputation model, and the multiple imputation results are integrated using Rubin's rule to obtain β̂ and Cov(β̂). Based on the final β̂, the missing values of the response variable are imputed as Ŷmis = Xmis β̂ and expanded to obtain Ŷ.

III. NUMERICAL ANALYSIS

A. Evaluation indicators

1) Mean square error of Ŷ: The mean square error (MSE) measures the difference between the imputed values and the original true values,

MSE(Ŷ) = (1/n) Σ_{i=1}^{n} (Yi − Ŷi)²,

where Yi and Ŷi denote the original true value and the imputed value, respectively.

2) Mean absolute error of Ŷ: The mean absolute error (MAE) is the average of the absolute differences between each imputed value and the corresponding actual value,

MAE(Ŷ) = (1/n) Σ_{i=1}^{n} |Ŷi − Yi|.

The smaller the difference between the imputed and true values, the better the imputation.
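Before turning to the simulations, the fiMI steps of Section II (ridge weights β*, the Cholesky factor of A, coefficient draws via (4), and Rubin's-rule pooling) can be sketched in Python. This is our reading of the procedure, not the authors' code; λ, M, and the assumption that σ² is redrawn from its inverse-gamma posterior for every draw are illustrative.

```python
import numpy as np

def fimi_impute(X_obs, Y_obs, X_mis, lam=1.0, M=20, rng=None):
    """Sketch of the fiMI imputation step (assumed reading of the paper)."""
    rng = np.random.default_rng() if rng is None else rng
    n_ob, p = X_obs.shape
    A = X_obs.T @ X_obs + lam * np.eye(p)               # A = Xobs' Xobs + lambda*I
    beta_star = np.linalg.solve(A, X_obs.T @ Y_obs)     # regression weights beta*
    sse = np.sum((Y_obs - X_obs @ beta_star) ** 2)      # SSE = ||Yobs - Xobs beta*||^2
    C = np.linalg.cholesky(A).T                         # upper triangular, A = C'C
    draws = []
    for _ in range(M):
        # sigma^2 | Xobs ~ IG((nOB+1)/2, (SSE+1)/2), drawn via a chi-square.
        sigma2 = (sse + 1.0) / rng.chisquare(n_ob + 1)
        g = rng.standard_normal(p)                      # g_i ~ N(0, 1), i.i.d.
        beta_hat = beta_star + np.sqrt(sigma2) * np.linalg.solve(C, g)  # eq. (4)
        draws.append(X_mis @ beta_hat)                  # impute Ymis = Xmis beta_hat
    # Rubin's rule for the point estimate reduces to averaging the M draws.
    return np.mean(np.stack(draws), axis=0)
```

Given the observed rows of (X, Y) and the design rows for the missing responses, `fimi_impute` returns one pooled vector Ŷmis.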
B. Simulation

Firstly, the initial parameters are fixed at (n, p, MR) = (1000, 5, 10%), and the values of MSE(Ŷ) and MAE(Ŷ) are calculated for the fiMI method and the comparison methods under a missing response variable. According to Table I, when there is a missing response variable in the linear regression model, the fiMI method has the lowest values for both MSE and MAE. Overall, for imputation in linear regression models with missing response variables, the fiMI method has the highest imputation accuracy for the parameter combination (n, p, MR) = (1000, 5, 10%), meaning that the imputed values from the fiMI method are closest to the true values.

TABLE I
MSE AND MAE VALUES OF FIMI METHOD AND COMPARISON METHODS IN SIMULATED DATA

Indicators     fiMI     ERLS     EMRE     PMMMI    BayesMI   bootstrapMI
MSE (10^-4)    1.0472   1.2040   1.2240   1.7151   2.3004    2.1565
MAE (10^-2)    8.0541   8.5226   8.6269   1.0401   1.2393    1.1368

Next, the parameters (n, p, MR) are varied to examine the MSE and MAE values of the fiMI method and the comparison methods under different sample sizes, numbers of variables, and missing ratios for sensitivity and stability analysis.

Case 1. Varying n with fixed (p, MR)
The parameter values are set as (p, MR) = (5, 10%) and n = (300, 500, 1000, 1500, 2000). The comparison results are shown in Fig. 1 and Fig. 2.

Fig. 1. Results of MSE and MAE values obtained by the fiMI method and multiple comparison methods in simulated data with different n values (case 1)

Fig. 2. Results of MSE and MAE values obtained by fiMI, ERLS, and EMRE methods in simulated data with different n values

Case 2. Varying p with fixed (n, MR)
The parameter values are set as (n, MR) = (1000, 10%) and p = (3, 5, 10, 15, 20). The comparison results are shown in Fig. 3 and Fig. 4.
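A minimal version of one simulation replicate — fix (n, p, MR), generate data from model (1), delete a fraction MR of the responses, impute, and score the imputations with the MSE and MAE indicators of Section III-A — might look like the following; the single ridge-type fit stands in for the full fiMI draw, and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def run_once(n=1000, p=5, mr=0.10, lam=1.0):
    # Generate (X, Y) from model (1) and delete a fraction mr of the responses.
    X = rng.normal(size=(n, p))
    beta = rng.normal(size=p)
    Y = X @ beta + 0.5 * rng.normal(size=n)
    mis = rng.random(n) < mr
    # Ridge-type regression imputation fitted on the observed rows only.
    A = X[~mis].T @ X[~mis] + lam * np.eye(p)
    b = np.linalg.solve(A, X[~mis].T @ Y[~mis])
    Y_hat = X[mis] @ b
    # Evaluation indicators of Section III-A, restricted to the deleted cells.
    mse = np.mean((Y[mis] - Y_hat) ** 2)
    mae = np.mean(np.abs(Y_hat - Y[mis]))
    return mse, mae

for mr in (0.10, 0.30, 0.50):
    mse, mae = run_once(mr=mr)
    print(f"MR={mr:.0%}: MSE={mse:.4f}, MAE={mae:.4f}")
```

Repeating `run_once` over a grid of (n, p, MR) values reproduces the shape of the sensitivity study, with the imputation step swapped for each competing method.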
The MAE value of the fiMI method fluctuates within the range of 8.04380e-02 to 8.3516e-02, the smallest range of fluctuation; the MAE values of the other methods are higher than those of the fiMI method.

Case 3. Varying MR with fixed (n, p)
The parameter values are set as (n, p) = (1000, 5) and MR = (10%, 20%, 30%, 40%, 50%). The results are shown in Fig. 5 and Fig. 6.

Fig. 5. Results of MSE and MAE values obtained by fiMI method and multiple comparison methods in simulated data with different MR values (case 3)

Fig. 6. Results of MSE and MAE values obtained by fiMI, ERLS, and EMRE methods in simulated data with different MR values

Upon observing Fig. 5(a) and Fig. 6(a), it can be seen that the MSE values of both the fiMI method and the other comparative methods show an overall increasing trend as MR increases for the fixed parameters (n, p), with the MSE value of the fiMI method fluctuating within the range of 1.0472e-04 to 5.1215e-04; the other imputation methods have higher MSE values than the fiMI method. Observing Fig. 5(b) and Fig. 6(b) reveals that the MAE values of both the fiMI method and the other comparative methods increase roughly linearly with MR, but the MAE value of the fiMI method is the smallest among all the imputation methods, and its fluctuation range is also the smallest.

C. Real Data Analysis

In this section, two real data sets are selected: the admission prediction data set and the goalkeeper data set; the data for this empirical study are obtained from a third-party data science community, the Heywhale. The response variable in the admission prediction data set is the chance of admission. Firstly, correlation analysis is done for each variable of the admission prediction data set, as shown in Table II:

TABLE II
THE CORRELATION COEFFICIENT AND P-VALUE BETWEEN THE INDEPENDENT VARIABLES AND RESPONSE VARIABLE IN ADMISSION PREDICT DATA SET

Statistical tests   GRE Score   TOEFL Score   University Rating   SOP       LOR       CGPA
CC                  0.803       0.792         0.711               0.676     0.670     0.873
P-value             2.2e-16     2.2e-16       2.2e-16             2.2e-16   2.2e-16   2.2e-16

The correlation and significance test analysis show that these six characteristic variables are all highly correlated with the response variable chance of admission, so the above six characteristic variables are selected as independent variables:

Yi = Σ_{j=1}^{6} Xij βj + εi,  i = 1, 2, ..., 400.

For the admission prediction data set, we set the MR of admission chances to 50%, then impute with the fiMI method and the comparison methods, and finally compare the imputation methods in terms of imputation accuracy.

Fig. 7. MSE and MAE values obtained by fiMI method and comparison methods in admission prediction data set

It can be seen from Fig. 7 that the fiMI method has the lowest MSE and MAE values for the response variable admission chances at MR = 50%, indicating that the fiMI method has the best imputation effect. Overall, for the admission prediction data set with a large ratio of missing values, the fiMI method has the highest imputation accuracy, followed by the ERLS and EMRE methods.

The second data set for the empirical study is the goalkeeper player data set, in which rating is the response variable. Firstly, we perform correlation analysis on each variable of the goalkeeper data set; the correlation coefficients and p-values between each characteristic variable and the response variable are calculated as shown in Table III below:

TABLE III
THE CORRELATION COEFFICIENT AND P-VALUE BETWEEN THE INDEPENDENT VARIABLES AND RESPONSE VARIABLE IN GK DATA SET

Statistical tests   Positioning   Diving      Kicking     Handling    Reflexes
CC                  0.923319      0.9217224   0.7543833   0.9113288   0.9262662
P-value             2.2e-16       2.2e-16     2.2e-16     2.2e-16     2.2e-16
The correlation and significance test analysis show that these five characteristic variables are all strongly correlated with rating, so the above five characteristic variables are selected as independent variables. The goalkeeper data set is suitable for multiple linear regression modeling, so p = 5 and our regression model is as follows:

Yi = Σ_{j=1}^{5} Xij βj + εi,  i = 1, 2, ..., 2003.

For the goalkeeper data set, we still consider the case of a large percentage of missing response values and set the missing ratio of the response variable rating to MR = 50%, then impute with the fiMI method and the comparison methods, and finally compare the imputation methods in terms of imputation accuracy.

Fig. 8. MSE and MAE values obtained by fiMI method and comparison methods in GK data set

IV. CONCLUSION

Big data statistical analysis has become one of the mainstream areas of current statistical research. As missing data are objectively unavoidable in statistical analysis, techniques for dealing with missing data have received much attention from the statistical community, and imputation methods for missing data have been widely used in many fields. To address this issue, this paper investigates imputation methods for handling missing response variables in linear regression models to better improve the imputation accuracy while imputing missing data. The work accomplished is as follows: the six methods are compared in terms of method steps, the advantages and disadvantages of the six methods are summarised, and the six methods are compared in terms of computational performance.

For the problem of imputation accuracy, the sensitivity and stability of the method are investigated through simulation, and real data analysis is carried out to verify the performance of the method. It is found that the proposed method has higher imputation accuracy and is more effective in dealing with data with high missing ratios.

The performance of the imputation methods discussed in this paper is mainly verified through a large number of simulation experiments and real data, i.e. mainly from practice and applications. In future work, more attention will be paid to theoretical support, and the proof of theorems for each imputation method will be studied, which can also serve as a direction for us in the future.

REFERENCES

[1] J. X. Wang and M. Yu, "Note on the EM algorithm in linear regression model," International Mathematical Forum, vol. 4, no. 38, pp. 1883-1889, 2009.
[2] S. V. Buuren and K. G. Oudshoorn, "mice: Multivariate imputation by chained equations in R," Journal of Statistical Software, vol. 45, no. 3, pp. 1-67, 2011.
[3] G. Guo, Y. Sun and X. Jiang, "A partitioned quasi-likelihood for distributed statistical inference," Computational Statistics, vol. 35, no. 4, pp. 1577-1596, 2020.
[4] G. Guo, H. Song, and L. Zhu, "ISR: The Iterated Score Regression-Based Estimation Algorithm," 2022.
[5] G. Guo, C. Wei, and G. Qian, "Sparse online principal component analysis of incomplete data in distributed health data networks," Nature Communications, vol. 11, no. 1, pp. 5467-547, 2020.
[7] D. B. Rubin, Multiple Imputation for Nonresponse in Surveys. John Wiley, 1999.
[8] J. L. Schafer, Analysis of Incomplete Multivariate Data. London: Chapman & Hall, 1997.
[9] R. J. A. Little, "Missing-data adjustments in large surveys," Journal of Business & Economic Statistics, vol. 6, no. 3, pp. 287-296, 1988.
[10] T. P. Morris, I. R. White and P. Royston, "Tuning multiple imputation by predictive mean matching and local residual draws," BMC Medical Research Methodology, vol. 14, no. 1, pp. 1-13, 2015.
[11] S. V. Buuren, Flexible Imputation of Missing Data, 2nd ed. Chapman & Hall/CRC, 2018.