Handling missing data

The document discusses the challenges and techniques for handling missing data, emphasizing the importance of imputation methods to preserve data integrity and improve predictive accuracy. It highlights multiple imputation using Bayesian methods, specifically the MICE (Multiple Imputation by Chained Equations) approach, which accounts for uncertainty in missing data. The document also outlines various imputation techniques and their comparative advantages and disadvantages, providing insights into practical implementation using statistical programming tools like R and Python.

Uploaded by

Mayura D

Presented for the ICEAA 2021 Online Workshop - www.iceaaonline.com

Dealing with Missing Data: The Art and Science of Imputation
May 2021

For the International Cost Estimating and Analysis Association Conference – May 2021

IMPUTATION
FILLING IN HOLES IN DATASETS

THE PROBLEM OF MISSING DATA


A significant problem, especially for small datasets
Often dealt with by removing observations with missing data

TECHNIQUES FOR HANDLING MISSING DATA


A variety of techniques exist for filling in missing data, though
some perform better than others

FILLING IN HOLES WITH STATISTICS


Recognizing the inherent uncertainty in missing data, we
adopt and advocate the method of multiple imputation
using Bayesian methods (“chained equations”)


Why Imputation?
Is it worth it?

Preserves Data
Imputation prevents the reduction of sample size due to missing values. This helps to preserve all responses in the sample

Fooled by Randomness
Having more data prevents us from falling prey to overly optimistic models that are fit to more noise than signals

Impute and Assess Risk!

Preserves Structure of Data
When we remove data points, we could be missing important patterns in the data, which can cause our analysis to distort true patterns within the data

Predictive Accuracy
Reducible uncertainty can be reduced by increasing sample size. This helps to improve predictive accuracy

DATA
Foundation of All Analyses

How Should We Handle It?
The bulk of the time in analytics should be spent on collecting, normalizing and verifying data. In defense and aerospace applications, datasets are small. Data should be preserved when possible

"The goal is to turn data into information, and information into insight."
-Carly Fiorina

IMPUTATION
To impute or not to impute, that is the question

01 Understand the available data
02 Determine variables that would benefit from imputation
03 Know when blanks are intentional

Imputation is a powerful method for filling in values that are missing from a dataset
An analyst must understand the data intimately to know if a blank means that the factor is not applicable for that data point
Sometimes a blank does not reflect a nonresponse and should be observed “as is”

Is the response missing at random?

The US Census Bureau deals with missing data all the time. If no response is provided for the name of Person 7 on the Census form from the household of six members, this missing value is not an omission; the response is “Not Applicable”

ISSUES WITH DATA GAPS


What can go wrong?

Fewer Degrees of Freedom
Removing observations with missing values results in fewer degrees of freedom in models

Reduction of Predictive Power
Predictive power is diminished when degrees of freedom are small

Inability to Use Advanced Methods
Certain Machine Learning methods cannot be applied when missing values are prevalent

METHODS ALLOWING
MISSING DATA
Complete-Case Analysis
Approach that excludes any records with missing data. Disadvantage – bias can be introduced into the analysis due to the removal of data that may provide insight into the population

Available-Case Analysis
Approach that allows the analysis of subsets of the complete dataset so that multiple aspects of a problem can be studied. Disadvantage – bias is again introduced if data are missing in a pattern

Alternative to Allowing Missingness


Though methods exist to continue with analysis upon removal
of missing data, better alternatives exist for filling data gaps


IMPUTATION METHODS

Mean Imputation
Filling missing values with the mean of the observed values

Imputing using Related Observations
Filling missing values with responses from related observations

Regression Imputation
Replacing missing values with a predicted value based on the results of fitting a regression line to the available data

Expectation Maximization
Replacing missing values by exploring the covariation among variables in order to infer values for the missing data

To retain as much of the precious gold (data) as possible, we should consider using imputation methods. Several methods are available for making the best statistical inference about a response that will close a data gap

IMPUTATION METHODS
How do they compare?

Mean Imputation
This method helps to restrict the variability of the data
Disadvantage: It weakens covariances and correlations among features

Related Observations
This method also helps to restrict variability in the data
Disadvantage: Introduces measurement error

Regression Imputation
This method uses regression to predict missing values. MICE is a regression imputation method
Advantage: Produces unbiased estimates with data that are Missing At Random (MAR)

Expectation Maximization
This method uses the maximum likelihood method to estimate missing values
Advantage: Increases precision and decreases parameter bias
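The difference between mean imputation and regression imputation can be seen on a single gap. The following is a minimal Python sketch, not the presenters' code: it uses the bHP and unit cost (UC) columns of the engine dataset shown later in this deck, and it assumes the one missing UC belongs to engine ID 8, as the MICE results slide suggests.

```python
from statistics import mean

# bHP and unit cost (UC) for the nine engines in the deck's dataset.
# None marks the missing unit cost (assumed to be engine ID 8).
bhp = [290, 330, 330, 515, 675, 675, 500, 362, 340]
uc = [40079, 40927, 29563, 63931, 111976, 120661, 47873, None, 40661]

observed = [(x, y) for x, y in zip(bhp, uc) if y is not None]
xs = [x for x, _ in observed]
ys = [y for _, y in observed]

# Mean imputation: the gap gets the mean of the observed unit costs,
# regardless of the engine's horsepower.
mean_fill = mean(ys)

# Regression imputation: fit UC = b0 + b1 * bHP on the complete pairs,
# then predict UC at the bHP of the row with the gap.
x_bar, y_bar = mean(xs), mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in observed)
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
missing_bhp = bhp[uc.index(None)]
regression_fill = b0 + b1 * missing_bhp

print(f"mean imputation:       {mean_fill:,.0f}")
print(f"regression imputation: {regression_fill:,.0f}")
```

With these inputs the mean fill is about $62,000, while the regression fill is about $42,300, because the engine with the gap has below-average horsepower. Mean imputation ignores that relationship, which is why it weakens correlations among features.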

Tools for Imputation

R
R is a language and environment for statistical computing and graphics. It is an integrated suite of software facilities for data manipulation, calculation and graphical display

Python
Python is a high-level programming language with dynamic semantics. Like R, Python supports modules and packages to help with analysis

MICE

MULTIPLE IMPUTATION BY
CHAINED EQUATIONS
MICE

Method
This method creates multiple imputations for a missing value
that accounts for the statistical uncertainty in the imputation

Assumptions
This method operates under the assumption that the missing data are MAR. MAR occurs when a data gap is fully accounted for by variables where there is complete information

Iterations
Multiple regression models are conducted and each variable
with missing data is modeled conditionally on the responses
of the other variables within the dataset. With this method,
each variable is modeled according to its own distribution


HOW MICE FILLS GAPS


Several imputed versions of the data are created using plausible data values

01 Multiple imputation is a series of stochastic regression imputations
02 The first step is an imputation step (I-step) that fills data gaps using stochastic regression
03 The number of imputations conducted in the I-step, m, is specified
04 In the posterior step (P-step), the mean and covariance distributions are calculated from the filled-in data
05 The P-step proceeds by taking a random draw from the mean and covariance distributions, which are used to calculate regression coefficients
06 The coefficients of the individual equations are averaged using a simple, unweighted mean. Goodness-of-fit measures are calculated using the pooled results
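The stochastic regression idea behind the I-step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the mice algorithm: it imputes only UC from bHP, it adds random residual noise to a single fitted regression rather than also redrawing the regression parameters in a P-step, and the function name is hypothetical. The seed reuses the value from the deck's R example.

```python
import random
from statistics import mean

bhp = [290, 330, 330, 515, 675, 675, 500, 362, 340]
uc = [40079, 40927, 29563, 63931, 111976, 120661, 47873, None, 40661]


def stochastic_regression_imputations(x, y, m=5, seed=23109):
    """Return m completed copies of y; each gap gets prediction + random noise."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    x_bar = mean(xi for xi, _ in pairs)
    y_bar = mean(yi for _, yi in pairs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in pairs)
    sxx = sum((xi - x_bar) ** 2 for xi, _ in pairs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual standard deviation of the fitted line (n - 2 degrees of freedom).
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in pairs)
    resid_sd = (sse / (len(pairs) - 2)) ** 0.5

    rng = random.Random(seed)  # fixed seed so the draws are repeatable
    completed = []
    for _ in range(m):
        completed.append([
            yi if yi is not None else b0 + b1 * xi + rng.gauss(0, resid_sd)
            for xi, yi in zip(x, y)
        ])
    return completed


completed = stochastic_regression_imputations(bhp, uc, m=5)
gap = uc.index(None)
for j, dataset in enumerate(completed, start=1):
    print(f"imputed dataset {j}: UC for row {gap + 1} = {dataset[gap]:,.0f}")
```

Each of the m completed datasets keeps the observed values untouched and differs only in the filled-in cell, so the spread of the imputed values carries the uncertainty about the missing response into the later pooling step.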

THE MICE PROCESS

Given the multiple imputations, the coefficients of the individual equations are averaged (using a simple, unweighted mean). The other parameters, including the degrees of freedom, standard errors, and R² values, are combined using what is known as Rubin’s Rules, after the statistician who developed them
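As a rough illustration of Rubin's Rules for one coefficient, the Python sketch below pools m estimates and their squared standard errors. It uses the classical large-sample degrees-of-freedom formula; statistical packages may apply small-sample adjustments, so their reported values can differ. The function name and the input numbers are made up for illustration and are not results from the deck.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's Rules.

    estimates: the m point estimates of the parameter
    variances: the m squared standard errors of those estimates
    """
    m = len(estimates)
    q_bar = mean(estimates)        # pooled estimate: simple unweighted mean
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    if between == 0:
        df = math.inf              # no between-imputation variability
    else:
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    return {"estimate": q_bar, "se": math.sqrt(total), "df": df}


# Illustrative inputs only: three slope estimates, each with standard error 0.5.
pooled = pool_rubin([2.0, 2.5, 3.0], [0.25, 0.25, 0.25])
print(pooled)
```

The pooled standard error is larger than any single-dataset standard error whenever the estimates disagree across imputations, which is how the method accounts for the uncertainty in the imputed values.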

UNDERSTANDING THE DATA


Exploring engine data

Dataset
The data used for analysis is a Wheeled and Tracked Vehicle
Engine dataset. The dataset is small, which makes the use of
imputation very important

Included Features
Identification (ID), Brake Horsepower (bHP), Displacement
(DISP), Engine Speed (EngSP), Cylinders (CYL), Unit Cost in
Dollars (UC), Dry Weight (DryWGT)

Missing Counts
Of the seven features included in the dataset, four of those
seven have missing values.
N=9


Dataset Example
Four variables have missing data (NA marks a missing value)

ID  bHP  EngSP  CYL  DryWGT  DISP  UC
1   290  2600   6    NA      7.2   $40,079
2   330  2400   6    1296    7.2   $40,927
3   330  2200   6    1905    8.8   $29,563
4   515  1500   6    3090    15.2  $63,931
5   675  2101   8    NA      14.8  $111,976
6   675  2101   8    NA      14.8  $120,661
7   500  2100   8    NA      12.1  $47,873
8   362  2300   NA   3230    12.1  NA
9   340  NA     8    912     6.6   $40,661
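The missing counts, and the cost of dropping incomplete rows, can be checked with a short Python sketch. The rows are transcribed from the table above; which column each blank belongs to is an assumption inferred from the cells that the MICE results slide fills in. The helper name `available_cases` is hypothetical.

```python
columns = ["ID", "bHP", "EngSP", "CYL", "DryWGT", "DISP", "UC"]

# None marks a missing value (column placement of blanks is an assumption).
rows = [
    [1, 290, 2600, 6, None, 7.2, 40079],
    [2, 330, 2400, 6, 1296, 7.2, 40927],
    [3, 330, 2200, 6, 1905, 8.8, 29563],
    [4, 515, 1500, 6, 3090, 15.2, 63931],
    [5, 675, 2101, 8, None, 14.8, 111976],
    [6, 675, 2101, 8, None, 14.8, 120661],
    [7, 500, 2100, 8, None, 12.1, 47873],
    [8, 362, 2300, None, 3230, 12.1, None],
    [9, 340, None, 8, 912, 6.6, 40661],
]

# Number of missing values per feature.
missing_counts = {
    name: sum(row[i] is None for row in rows) for i, name in enumerate(columns)
}

# Complete-case analysis keeps only rows with no missing value at all.
complete_cases = [row[0] for row in rows if all(v is not None for v in row)]


def available_cases(*names):
    """IDs of rows that have values for every named feature."""
    idx = [columns.index(n) for n in names]
    return [row[0] for row in rows if all(row[i] is not None for i in idx)]


print("missing per feature:", missing_counts)
print("complete cases:", complete_cases)
print("available for UC ~ bHP:", available_cases("bHP", "UC"))
```

Under this transcription only three of the nine engines are complete cases, while eight are available for a model of UC on bHP alone, which shows how much a small dataset can lose to row deletion.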

IMPLEMENTING MICE

01 We used the statistical programming platform R and the ‘mice’ package to calculate imputed data

R code:
    install.packages('mice')    # one-time installation
    library(mice)
    data<-read.csv("Example.csv")
    # m = number of imputed datasets, pmm = predictive mean matching
    imputedata<-mice(data, m=5, method='pmm', seed=23109)

Fixed seed to ensure the analysis is repeatable
The default in mice is m=5. This parameter will need to be included if another number of imputations is desired

02 Conduct linear regression on each of the five imputed datasets

To view each of the imputed datasets, we use the complete() function:

R code:
    completedData<-complete(imputedata,1)

The number one in the complete function indicates that you want to see the first imputed dataset. To see datasets 2-5, change that index, for example complete(imputedata,2)

03 Pooling Results

Combining the results of these separate analyses is referred to as pooling
The pooled regression equation has coefficients that are the arithmetic means of the coefficients for the five individual regressions
Let m denote the number of imputed datasets, $\beta_i$ denote the ith pooled coefficient, and $\beta_{ij}$ denote the ith coefficient for the jth imputed dataset; then:

$$\beta_i = \frac{1}{m}\sum_{j=1}^{m} \beta_{ij}$$

IMPLEMENTING MICE

04 Pooling Results - 2

To fit a linear model to a dataset, use the lm() function. Then, pool the m estimates $\hat{Q}^{(1)}, \ldots, \hat{Q}^{(m)}$ into one model $\bar{Q}$.

R code:
    Fit1<-with(imputedata,lm(UC~bHP))    # fit the model on each imputed dataset
    summary(pool(Fit1))                  # pooled coefficients and standard errors

05 Goodness-of-Fit Statistics

Unlike the coefficients, you cannot simply average the R2 values, standard errors, the F-stats, etc., in order to calculate the goodness-of-fit statistics

R code:
    pool.r.squared(Fit1, adjusted = FALSE)
    library(miceadds)    # provides mi.anova()
    poolF<-mi.anova(mi.res=imputedata, formula="UC~bHP")

06 Compare Results

Compare the results from the imputed dataset to the original dataset with missing values removed
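One commonly described way to pool R² across imputed datasets, instead of averaging it directly, is to average on the Fisher z scale and transform back. The Python sketch below illustrates that idea only; it is an assumption that this matches what a given package does internally, and the function name and the R² inputs are made up for illustration.

```python
import math


def pool_r_squared_fisher(r2_values):
    """Pool R^2 values by averaging Fisher z-transformed correlations."""
    # Convert each R^2 to a correlation r, then to Fisher's z = atanh(r).
    z_values = [math.atanh(math.sqrt(r2)) for r2 in r2_values]
    z_bar = sum(z_values) / len(z_values)
    # Back-transform the mean z to a correlation and square it.
    return math.tanh(z_bar) ** 2


# Illustrative inputs only: R^2 from three hypothetical imputed-data fits.
r2_by_dataset = [0.70, 0.80, 0.90]
pooled_r2 = pool_r_squared_fisher(r2_by_dataset)
simple_mean = sum(r2_by_dataset) / len(r2_by_dataset)
print(f"Fisher-z pooled R^2: {pooled_r2:.4f}")
print(f"simple average:      {simple_mean:.4f}")
```

The two numbers differ because the transformation is nonlinear; when all imputed datasets give the same R², both approaches return that same value.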

ANALYZING RESULTS
Creating plots to determine reasonableness of imputations

Scatterplot Analysis
There is a linear relationship
between UC and bHP. The pattern
of the relationship seems plausible
for the imputed values (pink) as
compared to the observed values
(blue)

Density Plot Analysis


Density plots show the shape of the distribution for each imputation. The plot is useful for spotting outlier imputations and works for variables with two or more missing values


MICE Results
Bracketed cells list the imputed values across the imputed datasets

ID  bHP  EngSP  CYL  DryWGT  DISP  UC
1   290  2600   6    [3090, 1296, 1905, 1905, 912]   7.2   $40,079
2   330  2400   6    1296                            7.2   $40,927
3   330  2200   6    1905                            8.8   $29,563
4   515  1500   6    3090                            15.2  $63,931
5   675  2101   8    [912, 3230, 1296, 3090, 1905]   14.8  $111,976
6   675  2101   8    [3090, 1905, 3090, 912, 912]    14.8  $120,661
7   500  2100   8    [912, 3090, 1296, 3090, 912]    12.1  $47,873
8   362  2300   [8, 8, 8, 6]  3230  12.1  [$47,873, $47,873, $40,079, $40,927, $111,976]
9   340  [2400, 2400, 2300, 2300, 2400]  8  912  6.6  $40,661

FIT RESULTS
Comparing results from the original dataset to the imputed (pooled) dataset

Linear Model
The model is a solid one with a statistically significant p-value less than alpha = 0.05 and an R2 equal to 87.5%. One data point was removed due to missing a unit cost value

MICE Imputed Model
Though the R2 statistic is lower than for the original dataset, imputation gained some degrees of freedom and the model remains statistically significant. The model does not gain a full degree of freedom since the imputations are pooled

EXPECTATION
MAXIMIZATION

Expectation
Maximization
Imputing by optimizing

Maximum Likelihood
The maximum likelihood method is used to impute missing values.
This method uses available data to impute a value and then checks
to determine the reasonableness of the guess

Covariance
The covariation among variables is used to infer probable values for
the missing data

Two-Step Process
The method follows a two-step process to fill in missing data


EM TWO-STEP PROCESS
How EM fills data gaps

EM is an iterative process used to fill data gaps

STEP #01: First Pass at Filling Gaps
The algorithm begins by filling the gaps with the conditional mean of the missing values.

STEP #02: Iterative Process
The maximum likelihood estimates of the mean vector and covariance matrix are calculated. The covariance matrix is then used to derive regression equations for the next iteration and the cycle continues until the difference between the covariance matrices in subsequent runs falls below the convergence criteria
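The two-step cycle can be sketched in Python for the simplest case: two variables, with gaps only in y. This is an illustrative sketch assuming a bivariate normal model, not the implementation in any R package; the function name is hypothetical, and the data are the bHP and UC columns of the deck's engine dataset with the missing UC assumed to belong to engine ID 8.

```python
bhp = [290, 330, 330, 515, 675, 675, 500, 362, 340]
uc = [40079, 40927, 29563, 63931, 111976, 120661, 47873, None, 40661]


def em_bivariate_normal(x, y, tol=1e-10, max_iter=500):
    """ML estimates of means and covariances when only y has gaps (None)."""
    n = len(x)
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    k = len(pairs)
    # x is fully observed, so its ML mean and variance are fixed up front.
    mu_x = sum(x) / n
    s_xx = sum((xi - mu_x) ** 2 for xi in x) / n
    # Starting values for the y moments come from the complete pairs.
    mu_y = sum(yi for _, yi in pairs) / k
    s_yy = sum((yi - mu_y) ** 2 for _, yi in pairs) / k
    s_xy = sum((xi - mu_x) * (yi - mu_y) for xi, yi in pairs) / k

    for _ in range(max_iter):
        slope = s_xy / s_xx
        resid_var = s_yy - slope * s_xy
        # Fill step: expected y and y^2 for every row under current estimates.
        e_y, e_y2 = [], []
        for xi, yi in zip(x, y):
            if yi is None:
                cond_mean = mu_y + slope * (xi - mu_x)
                e_y.append(cond_mean)
                e_y2.append(cond_mean ** 2 + resid_var)
            else:
                e_y.append(yi)
                e_y2.append(yi ** 2)
        # Re-estimation step: update the mean and covariance terms.
        new_mu_y = sum(e_y) / n
        new_s_yy = sum(e_y2) / n - new_mu_y ** 2
        new_s_xy = sum(xi * ei for xi, ei in zip(x, e_y)) / n - mu_x * new_mu_y
        change = max(
            abs(new_mu_y - mu_y) / abs(mu_y),
            abs(new_s_yy - s_yy) / s_yy,
            abs(new_s_xy - s_xy) / abs(s_xy),
        )
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:  # convergence criterion on relative change
            break
    return {"mu_x": mu_x, "mu_y": mu_y, "s_xx": s_xx, "s_xy": s_xy, "s_yy": s_yy}


params = em_bivariate_normal(bhp, uc)
b1 = params["s_xy"] / params["s_xx"]      # slope of UC on bHP
b0 = params["mu_y"] - b1 * params["mu_x"]  # intercept
gap_prediction = b0 + b1 * bhp[uc.index(None)]
print(f"estimated mean UC:        {params['mu_y']:,.0f}")
print(f"UC predicted at bHP=362:  {gap_prediction:,.0f}")
```

With the values as transcribed here, the estimated mean of UC comes out near $59,771, and the regression prediction for the row with the gap comes out near $42,264. The slope and intercept are derived from the estimated means and covariances in the same way as the b.est expression in the deck's R code.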

IMPLEMENTING EM

01 Show missingness patterns

The function prelim.norm is used on a matrix of the y (cost) and x (bHP) variables to sort rows according to the missingness patterns
Fixed seed to ensure the analysis is repeatable

R code:
    library(norm)    # provides prelim.norm(), em.norm(), getparam.norm()
    a<-prelim.norm(cbind(y,x))

02 Performing maximum likelihood estimation using EM algorithm

R code:
    b<-em.norm(a)
    c1<-getparam.norm(a,b)

em.norm produces a parameter vector, which getparam.norm then converts into a list of parameters

03 Pooling Results

The average of the imputations is calculated for the variable with missing values

R code:
    c1$mu[1]

The coefficients of the model are then estimated (intercept first, then slope)

R code:
    b.est<-c(c1$mu[1]-(c1$sigma[1,2]/c1$sigma[2,2])*c1$mu[2],
             c1$sigma[1,2]/c1$sigma[2,2])

The model can then be used to calculate the missing values for the dataset

EM
ID bHP UC
1 290 $40,079
2 330 $40,927
3 330 $29,563
4 515 $63,931
5 675 $111,976
6 675 $120,661
7 500 $47,873
8 362 $59,771
9 340 $40,661


FIT RESULTS - 2
Comparing results from the original dataset to the EM imputed dataset

Linear Model
The model is a solid one with a statistically significant p-value less than alpha = 0.05 and an R2 equal to 87.5%. One data point was removed due to missing a unit cost value

EM Imputed Model
Compared to the results produced from removing the data points with missing values, this is a better performing model. A degree of freedom was gained and the R2 metric increased while the model retained statistical significance

EXPECTATION MAXIMIZATION
Why choose EM?

ADVANTAGES
EM preserves the relationship with other variables, unlike mean imputation

DISADVANTAGES
EM can sometimes underestimate standard error

COMPARING METHODS
MICE VERSUS EM

MICE and EM are based on similar assumptions and in practice they often produce similar results. The Bayesian estimation in MICE is asymptotically equivalent to the maximum likelihood estimates in EM, so for large data sets the two methods should provide similar results

For small data sets, it is wise to run both and compare the results, as small differences in the methods could have an outsized impact when the number of data points is limited

There are multiple methods which can be used to impute data. Two of the strongest techniques, MICE
and EM, should be considered first as they preserve relationships between independent and
dependent variables and estimate error more accurately.

The MICE method for imputation has an edge over EM since MICE calculates multiple imputations for
the missing values instead of one single estimate.


Q&A

THE FUTURE. DELIVERED.


Galorath provides solutions that help organizational leaders make complex business decisions
with confidence. Our predictive analytics products and services give complete insight into the
implications of significant technical or financial decisions, allowing organizations to execute a
plan with assurance and reach their goals with absolute certainty.

Learn more or schedule a demo


(310) 906-6320 • Kimberly Roye • Christian Smart, PhD, CCEA

Presenters

Kimberly Roye, Senior Data Scientist
Christian Smart, Chief Scientist
Dustin Hilton, Senior Cost Analyst