Handling Missing Data
Missing Data Imputation Techniques
Let’s explore an example from mobile data. Here, one sample has a
missing value, not because of any of the dataset’s variables but for an
entirely unrelated reason.
A missing completely at random (MCAR) analysis assumes that
missingness is unrelated to both observed and unobserved data
(responses and covariates), meaning that the probability of a value
being missing is independent of every observation in the dataset.
MCAR produces reliable, unbiased estimates, but there is still a loss of
power; the loss stems from the reduced sample size, a design issue,
rather than from the absence of the data itself.
Educated Guessing
It sounds arbitrary and isn’t a preferred course of action, but one can
sometimes infer a missing value from the other responses. For
related questions, such as those often presented in a matrix, if the
participant responds with all “2s”, assume that the missing value is
also a 2.
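As a rough illustration, here is a sketch assuming survey responses stored in a pandas DataFrame with hypothetical column names q1 through q4; it fills a gap only when every other answer in the row agrees:

```python
import pandas as pd

# Hypothetical matrix-question responses; NaN marks a skipped answer.
df = pd.DataFrame({
    "q1": [2, 1, 3],
    "q2": [2, 1, 4],
    "q3": [None, 1, None],
    "q4": [2, 1, 5],
})

def fill_if_uniform(row):
    """If all observed answers in a row agree, use that value for the gaps."""
    observed = row.dropna()
    if observed.nunique() == 1:
        return row.fillna(observed.iloc[0])
    return row

print(df.apply(fill_if_uniform, axis=1))
# Row 0 becomes all 2s; row 2 stays incomplete because its answers differ.
```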
Discard Data
Dropping Variables
If too much data is missing for a variable, it may be an option to
delete that variable or column from the dataset. There is no rule of
thumb for this; it depends on the situation, and a proper analysis of
the data is needed before the variable is dropped altogether. This
should be the last option, and we need to check whether model
performance improves after the variable is deleted, as in the sketch
below.
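A minimal pandas sketch of this check, assuming a hypothetical dataset and an arbitrary 50% missingness cutoff (an illustrative threshold, not a fixed rule):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset; the 50% cutoff below is an assumption.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29],
    "income": [np.nan, np.nan, np.nan, 52000, np.nan],
    "score":  [0.7, 0.4, 0.9, np.nan, 0.6],
})

missing_fraction = df.isna().mean()  # share of missing values per column
to_drop = missing_fraction[missing_fraction > 0.5].index
print("Dropping:", list(to_drop))    # here: ['income'] (80% missing)

reduced = df.drop(columns=to_drop)
# Before committing to the drop, compare model performance with and
# without the variable, as discussed above.
```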
Random Forest
Random forest is a non-parametric imputation method applicable to
various variable types that works well with both data missing at
random and data not missing at random. Random forest uses multiple
decision trees to estimate missing values and outputs out-of-bag
(OOB) imputation error estimates. One caveat is that random forest
works best with large datasets; using random forest on small datasets
runs the risk of overfitting.
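One way to sketch this in Python is scikit-learn’s experimental IterativeImputer with a RandomForestRegressor as its estimator, which mimics the missForest approach. Note that, unlike R’s missForest, this setup does not report an OOB imputation error, so the sketch below scores the imputations against the held-out true values instead; the dataset and missingness rate are hypothetical:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical numeric dataset with ~10% of values knocked out at random.
X = rng.normal(size=(500, 4))
X[:, 3] = 2 * X[:, 0] + rng.normal(scale=0.1, size=500)  # correlated column
mask = rng.random(X.shape) < 0.1
X_missing = X.copy()
X_missing[mask] = np.nan

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)

# Score the imputations against the held-out true values.
rmse = np.sqrt(np.mean((X_imputed[mask] - X[mask]) ** 2))
print(f"Imputation RMSE: {rmse:.3f}")
```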
Maximum likelihood
The assumption that the observed data are a sample drawn from a
multivariate normal distribution is relatively easy to understand.
After the parameters are estimated from the available data, the
missing values are estimated from those newly estimated parameters.
Several strategies use the maximum likelihood method to handle
missing data.
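Concretely, under the multivariate normal assumption the missing part of a sample is predicted by its conditional mean, E[x_mis | x_obs] = mu_mis + Sigma_mo Sigma_oo^(-1) (x_obs - mu_obs). Below is a minimal NumPy sketch of that step, with hypothetical parameter values standing in for estimates obtained from the available data:

```python
import numpy as np

def conditional_mean_impute(x, mu, sigma):
    """Fill the missing entries of one sample with E[x_mis | x_obs]
    under a multivariate normal with mean mu and covariance sigma."""
    miss = np.isnan(x)
    if not miss.any():
        return x
    obs = ~miss
    s_oo = sigma[np.ix_(obs, obs)]   # observed-observed block
    s_mo = sigma[np.ix_(miss, obs)]  # missing-observed block
    x_filled = x.copy()
    x_filled[miss] = mu[miss] + s_mo @ np.linalg.solve(s_oo, x[obs] - mu[obs])
    return x_filled

# Hypothetical parameters, e.g. estimated from the complete cases.
mu = np.array([0.0, 1.0, 2.0])
sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])
print(conditional_mean_impute(np.array([0.5, np.nan, 2.2]), mu, sigma))
```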
Expectation-Maximization
Expectation-Maximization (EM) is a maximum likelihood method used
to create a new dataset in which all missing values are imputed with
maximum likelihood estimates. The approach begins with initial
estimates of the parameters (e.g., variances, covariances, and means),
perhaps obtained using listwise deletion. In the expectation step,
those estimates are used to build regression equations that predict
the missing data, and the missing values are filled in with their
expected values. The maximization step then re-estimates the
parameters from the completed data. The two steps are repeated,
with new regression equations “filling in” the missing data on each
pass, until the system stabilizes.
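The loop can be sketched in a few lines of NumPy. This is a simplified version that fills missing entries with their conditional means and re-estimates the mean and covariance until the imputations stabilize; a full EM implementation would also add conditional-variance terms when updating the covariance:

```python
import numpy as np

def em_impute(X, n_iter=50, tol=1e-6):
    """Simplified EM-style imputation under a multivariate normal model.

    Expectation step: build regression equations from the current mean
    and covariance and fill each missing entry with its conditional mean.
    Maximization step: re-estimate the mean and covariance from the
    completed data. Repeat until the imputed values stabilize.
    Assumes every row has at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    # Initialize missing entries with column means of the observed values.
    filled = np.where(miss, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        # A small ridge keeps the covariance matrix invertible.
        sigma = np.cov(filled, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        new = filled.copy()
        for i in np.where(miss.any(axis=1))[0]:
            m, o = miss[i], ~miss[i]
            s_oo = sigma[np.ix_(o, o)]
            s_mo = sigma[np.ix_(m, o)]
            new[i, m] = mu[m] + s_mo @ np.linalg.solve(s_oo, filled[i, o] - mu[o])
        if np.max(np.abs(new - filled)) < tol:  # system has stabilized
            return new
        filled = new
    return filled

# Example: impute a small hypothetical matrix.
X = np.array([[1.0, 2.0, np.nan],
              [2.0, np.nan, 6.0],
              [3.0, 6.0, 9.0],
              [4.0, 8.0, 12.0]])
print(em_impute(X))
```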
Sensitivity analysis
Sensitivity analysis is the study of how the uncertainty in the output
of a model can be apportioned to the different sources of uncertainty
in its inputs. When analysing missing data, additional assumptions
about the missingness mechanism are made, and these assumptions
carry over into the primary analysis. However, they cannot be
definitively validated for correctness. Therefore, the National
Research Council has proposed that a sensitivity analysis be
conducted to evaluate the robustness of the results to deviations
from the MAR assumption.
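One common form such a sensitivity analysis takes is a delta adjustment: shift the values imputed under MAR by a range of offsets and watch how the estimate of interest moves. Below is a hypothetical sketch of that idea (the data, missingness rate, and delta grid are all assumptions for illustration, not a prescription from the NRC report):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical outcome with ~20% of values missing.
y = rng.normal(loc=10.0, scale=2.0, size=200)
miss = rng.random(200) < 0.2

# MAR-style baseline: impute missing values with the observed mean.
y_imp = y.copy()
y_imp[miss] = y[~miss].mean()

# Delta adjustment: suppose the non-responders truly differ by delta.
for delta in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    shifted = y_imp.copy()
    shifted[miss] += delta
    print(f"delta={delta:+.1f}  estimated mean={shifted.mean():.3f}")
```

If the conclusion of interest holds across the plausible range of deltas, the result is robust to moderate departures from MAR; if it flips, the missing-data assumptions are driving the finding.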