
All About Missing Data Handling

Missing Data Imputation Techniques

Missing data is an everyday problem that a data professional needs to
deal with. Though there are many articles, blogs and videos already
available, I found it difficult to find concise, consolidated
information in a single place. That’s why I am putting my effort here,
hoping it will be useful to any data practitioner or enthusiast.

What is missing data? Missing data are values that are not available
but would be meaningful if observed. Missing data can be anything
from a missing sequence, an incomplete feature, missing files,
incomplete information, or a data entry error. Most datasets in the
real world contain missing data. Before you can use data with
missing fields, you need to transform those fields so they can be used
for analysis and modelling. Like many other aspects of data science,
this may actually be more art than science. Understanding the data
and the domain from which it comes is very important.

Having missing values in your data is not necessarily a setback;
rather, it is an opportunity to perform the right feature engineering
to guide the model to interpret the missing information the right
way. There are machine learning algorithms and packages that can
automatically detect and deal with missing data. However, it is still
recommended to transform the missing data manually through
analysis and a coding strategy. First, we need to understand the
types of missing data. Missingness is broadly divided into three
categories:

Missing Completely at Random (MCAR)


When we say data are missing completely at random, we mean
that the missingness has nothing to do with the observations being
studied (with a completely observed variable X and a partly missing
variable Y). For example, a weighing scale ran out of batteries, a
questionnaire was lost in the post, or a blood sample was damaged
in the lab. MCAR is an ideal but often unrealistic assumption.
Generally, data are regarded as MCAR when they are missing by
design, because of an equipment failure, or because the samples were
lost in transit or are technically unsatisfactory. The statistical
advantage of data that are MCAR is that the analysis remains
unbiased. Pictorially, missingness under MCAR has no relation to
the dataset variables X or Y; it is driven only by some external
reason Z.

Let’s explore an example from mobile data: one sample has a
missing value not because of the dataset variables but because of
some other reason. Missing completely at random (MCAR) analysis
assumes that missingness is unrelated to any data, observed or
unobserved (response or covariate), meaning that the probability of
a missing value is independent of any observation in the data set.

In this case, missing and observed values are generated from the
same distribution, meaning there is no systematic mechanism that
makes some data more likely to be missing than others. When this
assumption is confirmed, you can perform a complete-case (CC)
analysis on the observed data.

MCAR produces reliable estimates that are unbiased, but there is
still a loss of statistical power, because fewer observations are
analysed, not because the absence of data introduces bias.

Missing at Random (MAR)


When we say data are missing at random, we mean that missingness
on a partly missing variable (Y) is related to some other completely
observed variable(s) (X) in the analysis model, but not to the values
of Y itself. Pictorially, missingness under MAR relates to the dataset
variable X but not to Y; it can have other relationships (Z), but
nothing specifically related to the missing information itself. For
example, if a child does not attend an examination because the child
is ill, this might be predictable from other data about the child’s
health, but it would not be related to what we would have examined
had the child not been ill. Some may think that MAR does not
present a problem. However, MAR does not mean that the missing
data can be ignored.
Missing Not at Random (MNAR)

If the characteristics of the data do not meet those of MCAR or
MAR, they fall into the category of missing not at random (MNAR).
When data are missing not at random, the missingness is specifically
related to what is missing; for example, a person does not attend a
drug test because the person took drugs the night before, or a person
does not take an English proficiency test because of poor English
language skills. Cases of MNAR data are problematic. The only way
to obtain an unbiased estimate of the parameters in such a case is to
model the missing data, which requires a proper understanding of,
and domain knowledge about, the missing variable. That model may
then be incorporated into a more complex one for estimating the
missing values. Pictorially, missingness under MNAR relates
directly to variable Y; it can also have other relationships (X and Z).
Several strategies can be applied to handle missing data before
building a machine learning or statistical model.

Try to obtain the missing data.


This may be possible in the data collection phase, in a survey-like
situation where one can check whether the survey data are captured
in their entirety before the respondent leaves the room. Sometimes it
may be possible to reach out to the source to get the data, like asking
the missing question again. In a real-world scenario, this is rarely a
practical way to resolve the missing data problem.

Educated Guessing
It sounds arbitrary and isn’t a preferred course of action, but one can
sometimes infer a missing value from the other responses. For
related questions, for example, like those often presented in a
matrix, if the participant responds with all “2s”, assume that the
missing value is a 2.

Discard Data

1) List-wise Deletion (Complete-Case Analysis, CCA)


The most common approach to missing data is to omit the cases
with missing data and analyse the remaining data. This approach is
known as complete-case (or available-case) analysis or list-wise
deletion.

If the sample is large enough that power is not an issue and the
assumption of MCAR is satisfied, list-wise deletion may be a
reasonable strategy. However, when the sample is not large or the
assumption of MCAR is not satisfied, list-wise deletion is not the
optimal strategy, and it introduces bias whenever MCAR does not
hold.

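To illustrate, here is a minimal pandas sketch of list-wise deletion; the toy DataFrame is purely illustrative:

```python
import numpy as np
import pandas as pd

# Toy dataset with missing values (illustrative only)
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47],
    "income": [50000, 62000, np.nan, 58000],
    "city":   ["Pune", "Mumbai", "Delhi", np.nan],
})

# List-wise deletion: keep only the complete cases
complete_cases = df.dropna()
print(complete_cases)  # only the first row survives
```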

2) Pairwise Deletion (Available-Case Analysis, ACA)


In this case, only the missing values themselves are ignored, and the
analysis is done on the variables present. If there is missing data
elsewhere in the data set, the existing values are still used. Since
pairwise deletion uses all the information observed, it preserves
more information than list-wise deletion.

Pairwise deletion is known to be less biased for MCAR or MAR
data. However, if there are many missing observations, the analysis
will be underpowered. The problem with pairwise deletion is that
even though it uses all the available cases, the resulting analyses are
not directly comparable, because each one may be computed on a
different subset of the data.
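A small sketch of the idea: pandas computes pairwise statistics this way by default, so each cell of a correlation matrix may rest on a different subset of rows (toy data again, for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47],
    "income": [50000, 62000, np.nan, 58000],
    "score":  [0.7, 0.4, 0.9, np.nan],
})

# Pairwise deletion: each pairwise correlation uses only the rows
# where both variables are present, so different cells of the matrix
# may be based on different subsets of the data.
print(df.corr())
```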

3) Dropping Variables
If too much data is missing for a variable, it may be an option to
delete that variable or column from the dataset. There is no rule of
thumb for this; it depends on the situation, and a proper analysis of
the data is needed before the variable is dropped altogether. This
should be a last resort, and we need to check whether model
performance improves after the variable is deleted.
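One possible sketch, dropping columns by missing fraction; the 60% cut-off here is an arbitrary illustration, not a rule:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "kept":   [1.0, 2.0, np.nan, 4.0],   # 25% missing
    "sparse": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
})

# Drop any column whose share of missing values exceeds the threshold
threshold = 0.6
df_reduced = df.loc[:, df.isna().mean() <= threshold]
print(df_reduced.columns.tolist())  # ['kept']
```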

Retain All Data


Any imputation technique aims to produce a complete dataset that
can then be used for machine learning. There are a few ways we can
do imputation to retain all data for analysis and model building.
1) Mean, Median and Mode
In this imputation technique, the goal is to replace missing data with
statistical estimates of the missing values. The mean, median or
mode can be used as the imputation value.

In mean substitution, the mean value of a variable is used in place
of the missing data value for that same variable. This has the benefit
of not changing the sample mean for that variable. The theoretical
background of mean substitution is that the mean is a reasonable
estimate for a randomly selected observation from a normal
distribution. However, when missing values are not strictly random,
especially in the presence of great inequality in the number of
missing values across variables, mean substitution may introduce
bias. Distortion of the original variance and distortion of the
covariance with the remaining variables in the dataset are the two
major drawbacks of this method.

The median can be used when the variable has a skewed distribution.

The rationale for the mode is to replace the missing values with the
most frequent value, since this is the most likely occurrence.
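A minimal pandas sketch of all three estimates on an illustrative frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47],                # symmetric: mean
    "income": [50000, 62000, np.nan, 580000],      # skewed: median
    "city":   ["Pune", "Mumbai", "Pune", np.nan],  # categorical: mode
})

df["age"]    = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())
df["city"]   = df["city"].fillna(df["city"].mode()[0])
```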
2) Last Observation Carried Forward (LOCF)
If the data are time-series data, one of the most widely used
imputation methods is last observation carried forward (LOCF).
Whenever a value is missing, it is replaced with the last observed
value. This method is advantageous because it is easy to understand
and communicate. Although simple, it strongly assumes that the
value of the outcome remains unchanged after the missing
observation, which seems unlikely in many settings.
3) Next Observation Carried Backward (NOCB)
A similar approach to LOCF that works in the opposite direction:
take the first observation after the missing value and carry it
backward (“next observation carried backward”, or NOCB).
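Both directions are one-liners in pandas: ffill implements LOCF and bfill implements NOCB. A sketch on an illustrative series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan],
              index=pd.date_range("2023-01-01", periods=5, freq="D"))

locf = s.ffill()  # LOCF: [1, 1, 1, 4, 4]
nocb = s.bfill()  # NOCB: [1, 4, 4, 4, NaN] (no later value for the tail)
```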
4) Linear Interpolation
Interpolation is a mathematical method that fits a function to the
data and uses this function to estimate the missing values. The
simplest type is linear interpolation, which draws a straight line
between the last observed value before the gap and the first observed
value after it. Of course, the data could follow a pretty complex
pattern, and linear interpolation may not be enough. There are
several different types of interpolation; in pandas alone we have
options like ‘linear’, ‘time’, ‘index’, ‘values’, ‘nearest’, ‘zero’,
‘slinear’, ‘quadratic’, ‘cubic’, ‘polynomial’, ‘spline’, ‘piecewise
polynomial’ and many more.
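A sketch with pandas; the method argument selects among the interpolation types listed above:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0],
              index=pd.date_range("2023-01-01", periods=4, freq="D"))

# Straight line between the observed endpoints: [1, 2, 3, 4]
print(s.interpolate(method="linear"))
# s.interpolate(method="time") would weight gaps by the time index
```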
5) Common-Point Imputation
For a rating scale, using the middle point or most commonly chosen
value. For example, on a five-point scale, substitute a 3, the
midpoint, or a 4, the most common value (in many cases). It is
similar to the mean value but more suitable for ordinal values.

6) Adding a category to capture NA


This is perhaps the most widely used method of missing data
imputation for categorical variables. It consists of treating missing
data as an additional label or category of the variable: all the
missing observations are grouped under a newly created label such
as ‘Missing’. It does not assume anything about the missingness of
the values, and it is very well suited when the number of missing
values is high.
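A minimal sketch for a categorical column:

```python
import numpy as np
import pandas as pd

city = pd.Series(["Pune", np.nan, "Delhi", np.nan])

# Treat missingness itself as a label
city_filled = city.fillna("Missing")
```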

7) Frequent category imputation


Replacing missing values with the most frequent category is the
categorical equivalent of mean/median imputation. It consists of
replacing all occurrences of missing values within a variable with
the variable's most frequent label or category.
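A sketch using scikit-learn's SimpleImputer, whose most-frequent strategy also works on string data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([["blue"], ["red"], [np.nan], ["blue"]], dtype=object)

# Replace missing entries with the most frequent category ("blue")
imputer = SimpleImputer(strategy="most_frequent")
X_filled = imputer.fit_transform(X)
```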
8) Arbitrary Value Imputation
Arbitrary value imputation consists of replacing all occurrences of
missing values within a variable with an arbitrary value. Ideally, the
arbitrary value should be different from the median/mean/mode and
should fall outside the normal range of the variable. Typically used
arbitrary values are 0, 999, -999 (or other combinations of 9s) or -1
(if the distribution is positive). Sometimes the data already contain
an arbitrary value from the originator for the missing values. This
works reasonably well for numerical features that are predominantly
positive and for tree-based models in general. It used to be a more
common method when out-of-the-box machine learning libraries
and algorithms were not very adept at working with missing data.
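A minimal sketch, assuming -999 as the sentinel:

```python
import numpy as np
import pandas as pd

income = pd.Series([50000.0, np.nan, 62000.0])

# -999 sits well outside the normal range of a positive variable
income_filled = income.fillna(-999)
```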
9) Adding a variable to capture NA
When data are not missing completely at random, we can capture
the importance of missingness by creating an additional variable
indicating whether the data was missing for that observation. The
additional variable is binary: it takes only the values 0 and 1, with 0
indicating that a value was present for that observation and 1
indicating that the value was missing. Typically, this is combined
with mean/median imputation of the original column, so the model
can still use those observations while knowing which values were
imputed.
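A minimal sketch combining the indicator with a median fill:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50000.0, np.nan, 62000.0, np.nan]})

# Binary flag: 1 where the value was missing, 0 otherwise,
# combined with a median fill on the original column
df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())
```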
10) Random Sampling Imputation
Random sampling imputation is in principle similar to mean/median
imputation, because it aims to preserve the statistical parameters of
the original variable for which data are missing. Random sampling
consists of taking a random observation from the pool of available
observations and using that randomly extracted value to fill the NA.
One takes as many random observations as there are missing values
in the variable. Random sample imputation assumes that the data
are missing completely at random (MCAR); if this is the case, it
makes sense to substitute the missing values with values drawn from
the original variable's distribution.
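A sketch of the idea in pandas; random_state is fixed only to make the illustration reproducible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 31.0, np.nan, 47.0]})

missing = df["age"].isna()
# Draw as many random observed values as there are missing ones;
# replace=True guards against having more gaps than observed values
sampled = df["age"].dropna().sample(n=missing.sum(), replace=True,
                                    random_state=0)
sampled.index = df.index[missing]
df.loc[missing, "age"] = sampled
```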
Multiple Imputation
Multiple Imputation (MI) is a statistical technique for handling
missing data. The key concept of MI is to use the distribution of the
observed data to estimate a set of plausible values for the missing
data. Random components are incorporated into these estimated
values to reflect their uncertainty. Multiple datasets are created and
then analysed individually but identically; the resulting parameter
estimates are then combined (pooled) into a single set of estimates.
The benefit of multiple imputation is that restoring the natural
variability of the missing values incorporates the uncertainty due to
the missing data, which results in valid statistical inference. As a
flexible way of handling more than one missing variable, apply a
Multiple Imputation by Chained Equations (MICE) approach.
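scikit-learn's IterativeImputer offers a MICE-style, round-robin imputer (still experimental, hence the explicit enable import); a minimal sketch:

```python
import numpy as np
# The explicit enable import is required while the feature is experimental
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [np.nan, 6.0],
              [7.0, 8.0]])

# Round-robin (chained-equations style) imputation; by default this
# returns a single completed dataset. For true multiple imputation,
# rerun with sample_posterior=True and different random_state values,
# analyse each completed dataset, and pool the estimates.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```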

Predictive/Statistical Models that Impute the Missing Data
This should be done in conjunction with some cross-validation
scheme to avoid leakage. It can be very effective and can help with
the final model. There are many options for such a predictive model,
including a neural network. Here I am listing a few which are very
popular.
Linear Regression
In regression imputation, the existing variables are used to predict
the missing variable, and the predicted value is substituted as if it
were an actually obtained value. This approach has several
advantages: the imputation retains a great deal more data than
list-wise or pairwise deletion and avoids significantly altering the
standard deviation or the shape of the distribution. However, as in
mean substitution, while regression imputation substitutes a value
predicted from the other variables, no novel information is added,
yet the sample size is inflated and the standard error is artificially
reduced.
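A minimal sketch, fitting a regression on the complete cases of an illustrative frame and predicting the gaps:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "y": [2.1, np.nan, 6.2, np.nan, 10.1]})

# Fit on the complete cases, then predict the missing y values
observed = df.dropna()
model = LinearRegression().fit(observed[["x"]], observed["y"])
mask = df["y"].isna()
df.loc[mask, "y"] = model.predict(df.loc[mask, ["x"]])
```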

Random Forest
Random forest is a non-parametric imputation method applicable to
various variable types that works well with data missing at random
and even data not missing at random. Random forest uses multiple
decision trees to estimate missing values and outputs OOB
(out-of-bag) imputation error estimates. One caveat is that random
forest works best with large datasets; using random forest on small
datasets runs the risk of overfitting.
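The best-known implementation is the missForest package in R; a missForest-style sketch in Python, assuming scikit-learn's IterativeImputer with a random forest estimator:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [np.nan, 6.0],
              [7.0, 8.0]])

# missForest-style imputation: each feature's gaps are predicted by a
# random forest trained on the other features, iterating until stable
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10, random_state=0)
X_filled = rf_imputer.fit_transform(X)
```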

k-NN (k Nearest Neighbour)


k-NN imputes the missing attribute values based on the k nearest
neighbours, where neighbours are determined by a distance
measure. Once the k neighbours are found, the missing value is
imputed by taking the mean, median or mode of the known values
of that attribute among the neighbours.
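A sketch with scikit-learn's KNNImputer, which averages the feature over the k nearest neighbours using a distance that skips missing coordinates:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [np.nan, 6.0],
              [7.0, 8.0]])

# Each gap is filled with the mean of that feature over the 2 nearest
# neighbours (distance ignores coordinates that are missing)
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```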

Maximum likelihood
The assumption that the observed data are a sample drawn from a
multivariate normal distribution is relatively easy to understand.
After the parameters are estimated using the available data, the
missing data are estimated based on the parameters that have just
been estimated. Several strategies use the maximum likelihood
method to handle missing data.
Expectation-Maximization
Expectation-Maximization (EM) is a maximum likelihood method
used to create a new, complete data set, in which all missing values
are imputed with values estimated by maximum likelihood. The
approach begins with the expectation step, during which current
estimates of the parameters (e.g., variances, covariances and means,
perhaps obtained via list-wise deletion) are used to build regression
equations that predict, and fill in, the missing data. The
maximization step then re-estimates the parameters from the
completed data. The expectation step is repeated with the new
parameters, new regression equations are determined to fill in the
missing data, and the two steps alternate until the estimates
stabilize.

Sensitivity analysis
Sensitivity analysis is the study of how the uncertainty in the output
of a model can be apportioned to the different sources of uncertainty
in its inputs. When analysing missing data, additional assumptions
about the missingness are made, and these assumptions are often
applicable to the primary analysis. However, the assumptions cannot
be definitively validated for correctness. Therefore, the National
Research Council has proposed that sensitivity analyses be
conducted to evaluate the robustness of the results to deviations
from the MAR assumption.

Algorithms that Support Missing Values


Not all algorithms fail when there is missing data. Some algorithms
can be made robust to missing data, such as k-nearest neighbours,
which can ignore a column from the distance measure when a value
is missing. Some algorithms can use the missing value as a unique
and different value when building the predictive model, such as
classification and regression trees. An algorithm like XGBoost
handles missing data natively. If your imputation does not work
well, try a model that is robust to missing data.
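As an illustration of native support, scikit-learn's histogram-based gradient boosting (similar in spirit to XGBoost's handling) accepts NaN inputs directly; the toy data are purely illustrative:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

X = np.array([[0.0, np.nan],
              [1.0, 2.0],
              [np.nan, 3.0],
              [2.0, 1.0]])
y = np.array([0, 1, 0, 1])

# Histogram-based gradient boosting routes missing values to whichever
# side of each split works best, so no imputation step is required
clf = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
print(clf.predict(X))
```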
Recommendations
Missing data reduces the power of a model. Some missing data is
expected, and the target sample size can be increased to allow for it;
however, this cannot eliminate the potential bias. More attention
should be paid to missing data in the design and execution of
studies and in the analysis of the resulting data. Machine learning
imputation techniques should only be applied after maximal effort
has been put into reducing missing data through design and
prevention techniques.

A statistically valid analysis with appropriate mechanisms and
assumptions for the missing data is strongly recommended. Most
imputation techniques can cause bias. It is difficult to know whether
multiple imputation or full maximum likelihood estimation is best,
but both are superior to the traditional approaches, and both are
best used with large samples. In general, multiple imputation is a
good approach when analysing data sets with missing data.
