Handling missing data
IMPUTATION
FILLING IN HOLES IN DATASETS
Presented for the ICEAA 2021 Online Workshop - www.iceaaonline.com
Why Imputation?
Is it worth it?
Preserves Data
Imputation prevents the reduction of sample size due to missing values. This helps to preserve all responses in the sample
Fooled by Randomness
Having more data prevents us from falling prey to overly optimistic models that are fit to more noise than signals
Impute and Assess Risk!
DATA
Foundation of All Analyses
IMPUTATION
To impute or not to impute, that is the question
01 Imputation is a powerful method for filling in blanks where values are missing from a dataset
02 An analyst must understand the data intimately to know whether a blank means that the factor is not applicable for that data point
03 Sometimes a blank does not reflect a nonresponse and should be observed “as is”
METHODS ALLOWING MISSING DATA
Complete-Case Analysis
Approach that excludes any record with missing data. Disadvantage: bias is introduced into the analysis because the removed records may provide insight into the population
Available-Case Analysis
Approach that analyzes subsets of the complete dataset so that multiple aspects of a problem can be studied. Disadvantage: bias is again introduced if data are missing in a pattern
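The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration: the records, the field names `x`, `y`, `z`, and the helper functions `complete_cases` and `available_cases` are hypothetical and do not come from the workshop dataset.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"x": 1.0, "y": 10.0, "z": 5.0},
    {"x": 2.0, "y": None, "z": 6.0},
    {"x": 3.0, "y": 30.0, "z": None},
    {"x": 4.0, "y": 40.0, "z": 8.0},
]


def complete_cases(rows):
    """Complete-case analysis: keep only rows with no missing field at all."""
    return [r for r in rows if all(v is not None for v in r.values())]


def available_cases(rows, fields):
    """Available-case analysis: keep rows complete for the requested fields only."""
    return [r for r in rows if all(r[f] is not None for f in fields)]


cc = complete_cases(records)                  # usable for any analysis
ac_xy = available_cases(records, ["x", "y"])  # usable for an x-versus-y analysis
ac_xz = available_cases(records, ["x", "z"])  # usable for an x-versus-z analysis

print(len(cc), len(ac_xy), len(ac_xz))
```

In this toy case the complete-case subset keeps two of the four records, while each available-case subset keeps three, but the two available-case subsets are built from different records, which is where pattern-dependent bias can enter.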
IMPUTATION METHODS
To retain as much of the precious gold (data) as possible, we should consider using imputation
methods. Several methods are available for making the best statistical inference of a value that
will close a data gap
IMPUTATION METHODS
How do they compare?
R
R is a language and environment for statistical computing and graphics. It is an integrated suite of software facilities for data manipulation, calculation and graphical display
Python
Python is a high-level programming language with dynamic semantics. Like R, Python supports modules and packages to help with analysis
MICE
MULTIPLE IMPUTATION BY CHAINED EQUATIONS (MICE)
Method
This method creates multiple imputations for each missing value, which accounts for the statistical uncertainty in the imputation
Assumptions
This method operates under the assumption that the data are missing at random (MAR). MAR occurs when a data gap is fully accounted for by variables for which there is complete information
Iterations
Multiple regression models are fitted, and each variable with missing data is modeled conditionally on the responses of the other variables within the dataset. With this method, each variable is modeled according to its own distribution
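To make the idea of several imputations per gap concrete, here is a deliberately simplified Python sketch. It is not the algorithm of the R mice package. It handles one incomplete variable `y` with one complete predictor `x`, and fills each gap by a basic form of predictive mean matching: a regression predicts the missing case, and an observed value is borrowed at random from the few cases with the closest predictions. The function names, the `donors` parameter, and the toy data are assumptions made for illustration. A full chained-equations run would also redraw the regression parameters for each imputation and cycle through every incomplete variable.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def pmm_impute(xs, ys, m=5, donors=3, seed=0):
    """Return m completed copies of ys; None entries are filled by simplified PMM."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit_line([x for x, _ in observed], [y for _, y in observed])
    completed = []
    for _ in range(m):
        filled = []
        for x, y in zip(xs, ys):
            if y is None:
                target = a + b * x
                # Donors: observed cases whose predicted value is closest to the target.
                nearest = sorted(observed, key=lambda p: abs(a + b * p[0] - target))[:donors]
                y = rng.choice(nearest)[1]  # borrow a real observed value
            filled.append(y)
        completed.append(filled)
    return completed


# Hypothetical toy data with two gaps in y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.1, None, 8.2, None, 11.9]
completed = pmm_impute(xs, ys, m=5, donors=3, seed=1)
for dataset in completed:
    print(dataset)
```

Because each gap is filled by a random draw among plausible donors, the completed datasets can differ from one another, and that spread is what carries the imputation uncertainty into later analysis.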
NUMBER #04
In the posterior step (P-step), the mean and covariance distributions are calculated from the filled-in data
NUMBER #05
The P-step proceeds by taking a random draw from the mean and covariance distributions, which are used to calculate regression coefficients
NUMBER #06
The coefficients of the individual equations are averaged using a simple, unweighted mean. Goodness-of-fit measures are calculated using the pooled results
Given the multiple imputations, the coefficients of the individual equations are averaged (using a
simple, unweighted mean). The other parameters, including the degrees of freedom, standard
errors, and R² values, are combined using what is known as Rubin’s Rules, after the statistician who
developed them
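As a minimal sketch of how Rubin's Rules combine a coefficient and its standard error, the Python below pools m estimates: the pooled estimate is the unweighted mean, and the total variance is the average within-imputation variance plus the between-imputation variance inflated by (1 + 1/m). The function name `pool_rubin` and the five slopes and standard errors are hypothetical values chosen for illustration. The sketch leaves out the degrees-of-freedom adjustment.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their sampling variances (squared standard errors)."""
    m = len(estimates)
    q_bar = mean(estimates)       # pooled estimate: simple, unweighted mean
    u_bar = mean(variances)       # within-imputation variance
    b = variance(estimates)       # between-imputation variance (divides by m - 1)
    t = u_bar + (1 + 1 / m) * b   # total variance
    return q_bar, t ** 0.5        # pooled estimate and pooled standard error


# Hypothetical slope estimates and standard errors from m = 5 imputed datasets.
slopes = [2.0, 2.2, 1.8, 2.1, 1.9]
std_errors = [0.30, 0.32, 0.28, 0.31, 0.29]

pooled_slope, pooled_se = pool_rubin(slopes, [se ** 2 for se in std_errors])
print(pooled_slope, pooled_se)
```

The pooled standard error is larger than the average of the individual standard errors because the disagreement between imputed datasets is added on top of the ordinary sampling variance.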
Dataset
The data used for analysis is a Wheeled and Tracked Vehicle
Engine dataset. The dataset is small, which makes the use of
imputation very important
Included Features
Identification (ID), Brake Horsepower (bHP), Displacement
(DISP), Engine Speed (EngSP), Cylinders (CYL), Unit Cost in
Dollars (UC), Dry Weight (DryWGT)
Missing Counts
Four of the seven features in the dataset have missing values.
N=9
Dataset Example
Four variables have missing data
IMPLEMENTING MICE
01
We used the statistical programming platform R and the ‘mice’ package to calculate imputed data
R code:
install.packages("mice")
library(mice)
data <- read.csv("Example.csv")
imputedata <- mice(data, m = 5, method = "pmm", seed = 23109)
Fixed seed to ensure the analysis is repeatable
The default in mice is m=5. This parameter needs to be set explicitly only if a different number of imputations is desired
02
Conduct linear regression on each of the five imputed datasets
To view each of the imputed datasets, we use the complete() function:
R code:
completedData <- complete(imputedata, 1)
The number one in the complete function indicates that you want to see the first imputed dataset. To see datasets 2 through 5, repeat the call with the corresponding number
03
Pooling Results
Combining the results of these separate analyses is referred to as pooling
The pooled regression equation has coefficients that are the arithmetic means of the coefficients for the five individual regressions
Let $m$ denote the number of imputed datasets, $\beta_i$ denote the ith pooled coefficient, and $\beta_{ij}$ denote the ith coefficient for the jth imputed dataset; then:
$$\beta_i = \frac{1}{m}\sum_{j=1}^{m} \beta_{ij}$$
IMPLEMENTING MICE
04
Pooling Results - 2
To fit a linear model to a dataset, use the lm() function. Then, pool the m estimates $Q^{(1)}, \dots, Q^{(m)}$ into one model $\bar{Q}$.
R code:
Fit1 <- with(imputedata, lm(UC ~ bHP))
summary(pool(Fit1))
05
Goodness-of-Fit Statistics
Unlike the coefficients, you cannot simply average the R² values, standard errors, F-statistics, etc., in order to calculate the goodness-of-fit statistics
R code:
library(miceadds)  # provides mi.anova()
pool.r.squared(Fit1, adjusted = FALSE)
poolF <- mi.anova(mi.res = imputedata, formula = "UC~bHP")
06
Compare Results
Compare the results from the imputed dataset to the original dataset with missing values removed
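To illustrate why R² is not simply averaged, the sketch below shows one widely cited pooling approach: take the square root of each R², apply Fisher's z transform, average on the z scale, and transform back. This is offered as an assumption about how such pooling can be done. Whether it matches the internals of pool.r.squared should be checked against the mice documentation. The function name and the five R² values are hypothetical.

```python
import math


def pool_r_squared_fisher(r2_values):
    """Pool R^2 across imputed datasets via Fisher's z transform of r = sqrt(R^2)."""
    z_values = [math.atanh(math.sqrt(r2)) for r2 in r2_values]  # requires 0 <= R^2 < 1
    z_bar = sum(z_values) / len(z_values)                       # average on the z scale
    return math.tanh(z_bar) ** 2                                # back to the R^2 scale


# Hypothetical R^2 values from m = 5 imputed datasets.
r2_by_imputation = [0.80, 0.85, 0.78, 0.83, 0.81]
pooled_r2 = pool_r_squared_fisher(r2_by_imputation)
simple_mean = sum(r2_by_imputation) / len(r2_by_imputation)
print(pooled_r2, simple_mean)
```

The pooled value always lies between the smallest and largest input, but it generally differs from the arithmetic mean because the averaging happens on a transformed scale.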
ANALYZING RESULTS
Creating plots to determine reasonableness of imputations
Scatterplot Analysis
There is a linear relationship between UC and bHP. The pattern of the relationship seems plausible for the imputed values (pink) as compared to the observed values (blue)
MICE Results
ID bHP EngSP CYL DryWGT DISP UC
FIT RESULTS
Comparing results from the original dataset to the imputed (pooled) dataset
Original dataset: The model is a solid one with a statistically significant p-value less than alpha = 0.05 and an R² equal to 87.5%. One data point was removed due to missing a unit cost value
Imputed (pooled) dataset: Though the R² statistic is lower than for the original dataset, we gained some degrees of freedom by using imputation to create this statistically significant model. The model does not gain a full degree of freedom since the iterations are pooled
EXPECTATION
MAXIMIZATION
Expectation
Maximization
Imputing by optimizing
Maximum Likelihood
The maximum likelihood method is used to impute missing values.
This method uses available data to impute a value and then checks
to determine the reasonableness of the guess
Covariance
The covariation among variables is used to infer probable values for
the missing data
Two-Step Process
The method follows a two-step process to fill in missing data
EM TWO-STEP PROCESS
How EM fills data gaps
The algorithm begins by filling the gaps with the conditional mean of the missing values.
The covariance matrix is then used to derive regression equations for the next iteration, and the cycle continues until the difference between the covariance matrices in subsequent runs falls below the convergence criteria
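The two-step cycle can be sketched in Python for the simplest case: two jointly normal variables where `x` is fully observed and `y` has gaps. This is a minimal illustration under those assumptions, not the implementation in any R package. The function name `em_bivariate`, the tolerance, and the toy data are hypothetical. The E-step fills each gap with the conditional mean of y given x (and carries the conditional variance into the second moment), and the M-step re-estimates the mean and covariance from the filled-in data.

```python
def em_bivariate(xs, ys, tol=1e-10, max_iter=1000):
    """EM for a bivariate normal with x complete and y partly missing (None)."""
    n = len(xs)
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n
    obs_y = [y for y in ys if y is not None]
    mu_y = sum(obs_y) / len(obs_y)              # starting values from observed y
    s_yy = sum((y - mu_y) ** 2 for y in obs_y) / len(obs_y)
    s_xy = 0.0
    for _ in range(max_iter):
        slope = s_xy / s_xx                     # regression implied by the covariance
        resid_var = s_yy - slope * s_xy         # conditional variance of y given x
        ey, ey2 = [], []
        for x, y in zip(xs, ys):                # E-step
            if y is None:
                y_hat = mu_y + slope * (x - mu_x)
                ey.append(y_hat)
                ey2.append(y_hat ** 2 + resid_var)
            else:
                ey.append(y)
                ey2.append(y * y)
        new_mu_y = sum(ey) / n                  # M-step
        new_s_xy = sum(x * e for x, e in zip(xs, ey)) / n - mu_x * new_mu_y
        new_s_yy = sum(ey2) / n - new_mu_y ** 2
        change = max(abs(new_mu_y - mu_y), abs(new_s_xy - s_xy), abs(new_s_yy - s_yy))
        mu_y, s_xy, s_yy = new_mu_y, new_s_xy, new_s_yy
        if change < tol:                        # convergence criterion
            break
    slope = s_xy / s_xx
    filled = [y if y is not None else mu_y + slope * (x - mu_x) for x, y in zip(xs, ys)]
    return {"mu_x": mu_x, "mu_y": mu_y, "s_xx": s_xx, "s_xy": s_xy,
            "s_yy": s_yy, "slope": slope, "filled": filled}


# Hypothetical toy data with one gap in y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, None, 8.2, 9.8, 12.1]
result = em_bivariate(xs, ys)
print(result["slope"], result["filled"][2])
```

With only `y` incomplete, the converged slope agrees with the regression fitted to the complete records, while the estimated mean of `y` also uses the `x` value of the incomplete record.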
IMPLEMENTING EM
01
Show missingness patterns
The function prelim.norm is used on a matrix of the x (bHP) and y (cost) variables to sort rows according to the missingness patterns
Fixed seed to ensure the analysis is repeatable
R code:
library(norm)  # provides prelim.norm(), em.norm(), getparam.norm()
a <- prelim.norm(cbind(y, x))
02
Performing maximum likelihood estimation using EM algorithm
R code:
b <- em.norm(a)
c1 <- getparam.norm(a, b)
em.norm produces a parameter vector, which getparam.norm then converts into a list of parameters
03
Pooling Results
The average of the imputations is calculated for the variable with missing values
R code:
c1$mu[1]
The coefficients of the model are then estimated
b.est <- c(c1$mu[1] - (c1$sigma[1,2] / c1$sigma[2,2]) * c1$mu[2],
           c1$sigma[1,2] / c1$sigma[2,2])
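The expression for b.est can be read as the regression line implied by a bivariate normal distribution. The notation below is an assumption added for illustration: because the matrix is built as cbind(y, x), index 1 refers to y (cost) and index 2 to x (bHP), so $\mu_1$ and $\mu_2$ correspond to c1$mu[1] and c1$mu[2], and $\sigma_{12}$ and $\sigma_{22}$ to c1$sigma[1,2] and c1$sigma[2,2].

```latex
\begin{aligned}
E[y \mid x] &= \mu_1 + \frac{\sigma_{12}}{\sigma_{22}}\,(x - \mu_2) = \beta_0 + \beta_1 x, \\
\beta_1 &= \frac{\sigma_{12}}{\sigma_{22}}, \\
\beta_0 &= \mu_1 - \frac{\sigma_{12}}{\sigma_{22}}\,\mu_2 .
\end{aligned}
```

The first element of b.est is therefore the intercept $\beta_0$ and the second is the slope $\beta_1$.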
EM
ID bHP UC
1 290 $40,079
2 330 $40,927
3 330 $29,563
4 515 $63,931
5 675 $111,976
6 675 $120,661
7 500 $47,873
8 362 $59,771
9 340 $40,661
FIT RESULTS - 2
Comparing results from the original dataset to the EM imputed dataset
Original dataset: The model is a solid one with a statistically significant p-value less than alpha = 0.05 and an R² equal to 87.5%. One data point was removed due to missing a unit cost value
EM imputed dataset: Compared to the results produced from removing the data points with missing values, this is a better performing model. A degree of freedom was gained and the R² metric increased while the model retained statistical significance
EXPECTATION MAXIMIZATION
Why choose EM?
ADVANTAGES
EM preserves the relationship with other variables, unlike mean imputation
DISADVANTAGES
EM can sometimes underestimate standard error
COMPARING METHODS
MICE VERSUS EM
MICE and EM are based on similar assumptions and in practice they often produce similar results. The Bayesian estimation in MICE is asymptotically equivalent to the maximum likelihood estimates in EM, so for large data sets the two methods should provide similar results
For small data sets, it is wise to run both and compare the results, as small differences in the methods could have an outsized impact when the number of data points is limited
There are multiple methods which can be used to impute data. Two of the strongest techniques, MICE
and EM, should be considered first as they preserve relationships between independent and
dependent variables and estimate error more accurately.
The MICE method for imputation has an edge over EM since MICE calculates multiple imputations for
the missing values instead of one single estimate.
Q&A
Presenters