
All About Missing Data Handling

Missing Data Imputation Techniques

Missing data is an everyday problem that a data professional needs to
deal with. Though there are many articles, blogs and videos already
available, I found it difficult to find concise, consolidated
information in a single place. That’s why I am putting my effort here,
hoping it will be useful to any data practitioner or enthusiast.

What is missing data? Missing data are values that are not available
but would be meaningful if observed. Missing data can be anything
from a missing sequence, an incomplete feature, missing files,
incomplete information, or a data entry error. Most datasets in the
real world contain missing data. Before you can use data with
missing fields, you need to transform those fields so they can be used
for analysis and modelling. Like many other aspects of data science,
this may actually be more art than science. Understanding the data
and the domain from which it comes is very important.

Having missing values in your data is not necessarily a setback;
rather, it is an opportunity to perform the right feature engineering
to guide the model to interpret the missing information the right
way. There are machine learning algorithms and packages that can
automatically detect and deal with missing data. However, it is still
recommended to transform the missing data manually through
analysis and a coding strategy. First, we need to understand the
types of missing data. Missingness is broadly divided into three
categories:

Missing Completely at Random (MCAR)


When we say data are missing completely at random, we mean
that the missingness has nothing to do with the observations being
studied (with a completely observed variable X and a partly missing
variable Y). For example, a weighing scale ran out of batteries, a
questionnaire was lost in the post, or a blood sample was damaged
in the lab. MCAR is an ideal but often unrealistic assumption.
Generally, data are regarded as MCAR when they are missing by
design, because of an equipment failure, or because the samples were
lost in transit or are technically unsatisfactory. The statistical
advantage of data that are MCAR is that the analysis remains
unbiased. Pictorially, missingness under MCAR has no relation to
the dataset variables X or Y; it is driven only by some external
reason Z.

Let’s explore an example from mobile data: one sample has a
missing value not because of the dataset variables but because of
some other reason. Missing completely at random (MCAR) analysis
assumes that missingness is unrelated to any data, observed or
unobserved (response or covariate), meaning that the probability of
a missing value is independent of any observation in the data set.

In this case, missing and observed values are generated from the
same distribution, meaning there is no systematic mechanism that
makes some data more likely to be missing than others. When this
assumption is confirmed, you can perform a complete-case (CC)
analysis on the observed data.

MCAR produces reliable estimates that are unbiased, but there is
still a loss of statistical power, because fewer observations are
analysed, not because the absence of data introduces bias.

Missing at Random (MAR)


When we say data are missing at random, we mean that missingness
on a partly missing variable (Y) is related to some other completely
observed variable(s) (X) in the analysis model, but not to the values
of Y itself. Pictorially, missingness under MAR relates to the dataset
variable X but not to Y; it can have other relationships (Z), but
nothing specifically related to the missing information itself. For
example, if a child does not attend an examination because the child
is ill, this might be predictable from other data about the child’s
health, but it would not be related to what we would have examined
had the child not been ill. Some may think that MAR does not
present a problem. However, MAR does not mean that the missing
data can be ignored.
Missing Not at Random (MNAR)

If the characteristics of the data do not meet those of MCAR or
MAR, they fall into the category of missing not at random (MNAR).
When data are missing not at random, the missingness is specifically
related to what is missing; for example, a person does not attend a
drug test because the person took drugs the night before, or a person
does not take an English proficiency test because of poor English
language skills. Cases of MNAR data are problematic. The only way
to obtain an unbiased estimate of the parameters in such a case is to
model the missing data, which requires a proper understanding of,
and domain knowledge about, the missing variable. That model may
then be incorporated into a more complex one for estimating the
missing values. Pictorially, missingness under MNAR relates
directly to variable Y; it can also have other relationships (X and Z).
Several strategies can be applied to handle missing data before
building a machine learning or statistical model.

Try to obtain the missing data.


This may be possible in the data collection phase, in a survey-like
situation where one can check whether the survey data are captured
in their entirety before the respondent leaves the room. Sometimes it
may be possible to reach out to the source to get the data, like asking
the missing question again. In a real-world scenario, this is rarely a
practical way to resolve the missing data problem.

Educated Guessing
It sounds arbitrary and isn’t a preferred course of action, but one can
sometimes infer a missing value from the other responses. For
related questions, for example, like those often presented in a
matrix, if the participant responds with all “2s”, assume that the
missing value is a 2.

Discard Data

1) List-wise Deletion (Complete-Case Analysis, CCA)


The most common approach to missing data is to omit the cases
with missing data and analyse the remaining data. This approach is
known as complete-case (or available-case) analysis or list-wise
deletion.

If the sample is large enough that power is not an issue and the
assumption of MCAR is satisfied, list-wise deletion may be a
reasonable strategy. However, when the sample is not large or the
assumption of MCAR is not satisfied, list-wise deletion is not the
optimal strategy, and it introduces bias whenever MCAR does not
hold.

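To illustrate, here is a minimal pandas sketch of list-wise deletion; the toy DataFrame is purely illustrative:

```python
import numpy as np
import pandas as pd

# Toy dataset with missing values (illustrative only)
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47],
    "income": [50000, 62000, np.nan, 58000],
    "city":   ["Pune", "Mumbai", "Delhi", np.nan],
})

# List-wise deletion: keep only the complete cases
complete_cases = df.dropna()
print(complete_cases)  # only the first row survives
```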

2) Pairwise Deletion (Available-Case Analysis, ACA)


In this case, only the missing values themselves are ignored, and the
analysis is done on the variables present. If there is missing data
elsewhere in the data set, the existing values are still used. Since
pairwise deletion uses all the information observed, it preserves
more information than list-wise deletion.

Pairwise deletion is known to be less biased for MCAR or MAR
data. However, if there are many missing observations, the analysis
will be underpowered. The problem with pairwise deletion is that
even though it uses all the available cases, the resulting analyses are
not directly comparable, because each one may be computed on a
different subset of the data.
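A small sketch of the idea: pandas computes pairwise statistics this way by default, so each cell of a correlation matrix may rest on a different subset of rows (toy data again, for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47],
    "income": [50000, 62000, np.nan, 58000],
    "score":  [0.7, 0.4, 0.9, np.nan],
})

# Pairwise deletion: each pairwise correlation uses only the rows
# where both variables are present, so different cells of the matrix
# may be based on different subsets of the data.
print(df.corr())
```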

3) Dropping Variables
If too much data is missing for a variable, it may be an option to
delete that variable or column from the dataset. There is no rule of
thumb for this; it depends on the situation, and a proper analysis of
the data is needed before the variable is dropped altogether. This
should be a last resort, and we need to check whether model
performance improves after the variable is deleted.
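One possible sketch, dropping columns by missing fraction; the 60% cut-off here is an arbitrary illustration, not a rule:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "kept":   [1.0, 2.0, np.nan, 4.0],   # 25% missing
    "sparse": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
})

# Drop any column whose share of missing values exceeds the threshold
threshold = 0.6
df_reduced = df.loc[:, df.isna().mean() <= threshold]
print(df_reduced.columns.tolist())  # ['kept']
```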

Retain All Data


Any imputation technique aims to produce a complete dataset that
can then be used for machine learning. There are a few ways we can
do imputation to retain all data for analysis and model building.
1) Mean, Median and Mode
In this imputation technique, the goal is to replace missing data with
statistical estimates of the missing values. The mean, median or
mode can be used as the imputation value.

In mean substitution, the mean value of a variable is used in place
of the missing data value for that same variable. This has the benefit
of not changing the sample mean for that variable. The theoretical
background of mean substitution is that the mean is a reasonable
estimate for a randomly selected observation from a normal
distribution. However, when missing values are not strictly random,
especially in the presence of great inequality in the number of
missing values across variables, mean substitution may introduce
bias. Distortion of the original variance and distortion of the
covariance with the remaining variables in the dataset are the two
major drawbacks of this method.

The median can be used when the variable has a skewed distribution.

The rationale for the mode is to replace the missing values with the
most frequent value, since this is the most likely occurrence.
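A minimal pandas sketch of all three estimates on an illustrative frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47],                # symmetric: mean
    "income": [50000, 62000, np.nan, 580000],      # skewed: median
    "city":   ["Pune", "Mumbai", "Pune", np.nan],  # categorical: mode
})

df["age"]    = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())
df["city"]   = df["city"].fillna(df["city"].mode()[0])
```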
2) Last Observation Carried Forward (LOCF)
If the data are time-series data, one of the most widely used
imputation methods is last observation carried forward (LOCF).
Whenever a value is missing, it is replaced with the last observed
value. This method is advantageous because it is easy to understand
and communicate. Although simple, it strongly assumes that the
value of the outcome remains unchanged after the missing
observation, which seems unlikely in many settings.
3) Next Observation Carried Backward (NOCB)
A similar approach to LOCF that works in the opposite direction:
take the first observation after the missing value and carry it
backward (“next observation carried backward”, or NOCB).
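Both directions are one-liners in pandas: ffill implements LOCF and bfill implements NOCB. A sketch on an illustrative series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan],
              index=pd.date_range("2023-01-01", periods=5, freq="D"))

locf = s.ffill()  # LOCF: [1, 1, 1, 4, 4]
nocb = s.bfill()  # NOCB: [1, 4, 4, 4, NaN] (no later value for the tail)
```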
4) Linear Interpolation
Interpolation is a mathematical method that fits a function to the
data and uses this function to estimate the missing values. The
simplest type is linear interpolation, which draws a straight line
between the last observed value before the gap and the first observed
value after it. Of course, the data could follow a pretty complex
pattern, and linear interpolation may not be enough. There are
several different types of interpolation; in pandas alone we have
options like ‘linear’, ‘time’, ‘index’, ‘values’, ‘nearest’, ‘zero’,
‘slinear’, ‘quadratic’, ‘cubic’, ‘polynomial’, ‘spline’, ‘piecewise
polynomial’ and many more.
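A sketch with pandas; the method argument selects among the interpolation types listed above:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0],
              index=pd.date_range("2023-01-01", periods=4, freq="D"))

# Straight line between the observed endpoints: [1, 2, 3, 4]
print(s.interpolate(method="linear"))
# s.interpolate(method="time") would weight gaps by the time index
```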
5) Common-Point Imputation
For a rating scale, using the middle point or most commonly chosen
value. For example, on a five-point scale, substitute a 3, the
midpoint, or a 4, the most common value (in many cases). It is
similar to the mean value but more suitable for ordinal values.

6) Adding a category to capture NA


This is perhaps the most widely used method of missing data
imputation for categorical variables. It consists of treating missing
data as an additional label or category of the variable: all the
missing observations are grouped under a newly created label such
as ‘Missing’. It does not assume anything about the missingness of
the values, and it is very well suited when the number of missing
values is high.
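A minimal sketch for a categorical column:

```python
import numpy as np
import pandas as pd

city = pd.Series(["Pune", np.nan, "Delhi", np.nan])

# Treat missingness itself as a label
city_filled = city.fillna("Missing")
```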

7) Frequent category imputation


Replacing missing values with the most frequent category is the
categorical equivalent of mean/median imputation. It consists of
replacing all occurrences of missing values within a variable with
the variable's most frequent label or category.
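A sketch using scikit-learn's SimpleImputer, whose most-frequent strategy also works on string data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([["blue"], ["red"], [np.nan], ["blue"]], dtype=object)

# Replace missing entries with the most frequent category ("blue")
imputer = SimpleImputer(strategy="most_frequent")
X_filled = imputer.fit_transform(X)
```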
8) Arbitrary Value Imputation
Arbitrary value imputation consists of replacing all occurrences of
missing values within a variable with an arbitrary value. Ideally, the
arbitrary value should be different from the median/mean/mode and
should fall outside the normal range of the variable. Typically used
arbitrary values are 0, 999, -999 (or other combinations of 9s) or -1
(if the distribution is positive). Sometimes the data already contain
an arbitrary value from the originator for the missing values. This
works reasonably well for numerical features that are predominantly
positive and for tree-based models in general. It used to be a more
common method when out-of-the-box machine learning libraries
and algorithms were not very adept at working with missing data.
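A minimal sketch, assuming -999 as the sentinel:

```python
import numpy as np
import pandas as pd

income = pd.Series([50000.0, np.nan, 62000.0])

# -999 sits well outside the normal range of a positive variable
income_filled = income.fillna(-999)
```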
9) Adding a variable to capture NA
When data are not missing completely at random, we can capture
the importance of missingness by creating an additional variable
indicating whether the data was missing for that observation. The
additional variable is binary: it takes only the values 0 and 1, with 0
indicating that a value was present for that observation and 1
indicating that the value was missing. Typically, this is combined
with mean/median imputation of the original column, so the model
can still use those observations while knowing which values were
imputed.
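A minimal sketch combining the indicator with a median fill:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50000.0, np.nan, 62000.0, np.nan]})

# Binary flag: 1 where the value was missing, 0 otherwise,
# combined with a median fill on the original column
df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())
```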
10) Random Sampling Imputation
Random sampling imputation is in principle similar to mean/median
imputation, because it aims to preserve the statistical parameters of
the original variable for which data are missing. Random sampling
consists of taking a random observation from the pool of available
observations and using that randomly extracted value to fill the NA.
One takes as many random observations as there are missing values
in the variable. Random sample imputation assumes that the data
are missing completely at random (MCAR); if this is the case, it
makes sense to substitute the missing values with values drawn from
the original variable's distribution.
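A sketch of the idea in pandas; random_state is fixed only to make the illustration reproducible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 31.0, np.nan, 47.0]})

missing = df["age"].isna()
# Draw as many random observed values as there are missing ones;
# replace=True guards against having more gaps than observed values
sampled = df["age"].dropna().sample(n=missing.sum(), replace=True,
                                    random_state=0)
sampled.index = df.index[missing]
df.loc[missing, "age"] = sampled
```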
Multiple Imputation
Multiple Imputation (MI) is a statistical technique for handling
missing data. The key concept of MI is to use the distribution of the
observed data to estimate a set of plausible values for the missing
data. Random components are incorporated into these estimated
values to reflect their uncertainty. Multiple datasets are created and
then analysed individually but identically; the resulting parameter
estimates are then combined (pooled) into a single set of estimates.
The benefit of multiple imputation is that restoring the natural
variability of the missing values incorporates the uncertainty due to
the missing data, which results in valid statistical inference. As a
flexible way of handling more than one missing variable, apply a
Multiple Imputation by Chained Equations (MICE) approach.
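scikit-learn's IterativeImputer offers a MICE-style, round-robin imputer (still experimental, hence the explicit enable import); a minimal sketch:

```python
import numpy as np
# The explicit enable import is required while the feature is experimental
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [np.nan, 6.0],
              [7.0, 8.0]])

# Round-robin (chained-equations style) imputation; by default this
# returns a single completed dataset. For true multiple imputation,
# rerun with sample_posterior=True and different random_state values,
# analyse each completed dataset, and pool the estimates.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```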

Predictive/Statistical Models that Impute the Missing Data
This should be done in conjunction with some cross-validation
scheme to avoid leakage. It can be very effective and can help with
the final model. There are many options for such a predictive model,
including a neural network. Here I am listing a few which are very
popular.
Linear Regression
In regression imputation, the existing variables are used to predict
the missing variable, and the predicted value is substituted as if it
were an actually obtained value. This approach has several
advantages: the imputation retains a great deal more data than
list-wise or pairwise deletion and avoids significantly altering the
standard deviation or the shape of the distribution. However, as in
mean substitution, while regression imputation substitutes a value
predicted from the other variables, no novel information is added,
yet the sample size is inflated and the standard error is artificially
reduced.
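A minimal sketch, fitting a regression on the complete cases of an illustrative frame and predicting the gaps:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "y": [2.1, np.nan, 6.2, np.nan, 10.1]})

# Fit on the complete cases, then predict the missing y values
observed = df.dropna()
model = LinearRegression().fit(observed[["x"]], observed["y"])
mask = df["y"].isna()
df.loc[mask, "y"] = model.predict(df.loc[mask, ["x"]])
```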

Random Forest
Random forest is a non-parametric imputation method applicable to
various variable types that works well with data missing at random
and even data not missing at random. Random forest uses multiple
decision trees to estimate missing values and outputs OOB
(out-of-bag) imputation error estimates. One caveat is that random
forest works best with large datasets; using random forest on small
datasets runs the risk of overfitting.
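The best-known implementation is the missForest package in R; a missForest-style sketch in Python, assuming scikit-learn's IterativeImputer with a random forest estimator:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [np.nan, 6.0],
              [7.0, 8.0]])

# missForest-style imputation: each feature's gaps are predicted by a
# random forest trained on the other features, iterating until stable
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10, random_state=0)
X_filled = rf_imputer.fit_transform(X)
```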

k-NN (k Nearest Neighbour)


k-NN imputes the missing attribute values based on the k nearest
neighbours, where neighbours are determined by a distance
measure. Once the k neighbours are found, the missing value is
imputed by taking the mean, median or mode of the known values
of that attribute among the neighbours.
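A sketch with scikit-learn's KNNImputer, which averages the feature over the k nearest neighbours using a distance that skips missing coordinates:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [np.nan, 6.0],
              [7.0, 8.0]])

# Each gap is filled with the mean of that feature over the 2 nearest
# neighbours (distance ignores coordinates that are missing)
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```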

Maximum likelihood
The assumption that the observed data are a sample drawn from a
multivariate normal distribution is relatively easy to understand.
After the parameters are estimated using the available data, the
missing data are estimated based on the parameters that have just
been estimated. Several strategies use the maximum likelihood
method to handle missing data.
Expectation-Maximization
Expectation-Maximization (EM) is a maximum likelihood method
used to create a new, complete data set, in which all missing values
are imputed with values estimated by maximum likelihood. The
approach begins with the expectation step, during which current
estimates of the parameters (e.g., variances, covariances and means,
perhaps obtained via list-wise deletion) are used to build regression
equations that predict, and fill in, the missing data. The
maximization step then re-estimates the parameters from the
completed data. The expectation step is repeated with the new
parameters, new regression equations are determined to fill in the
missing data, and the two steps alternate until the estimates
stabilize.

Sensitivity analysis
Sensitivity analysis is the study of how the uncertainty in the output
of a model can be apportioned to the different sources of uncertainty
in its inputs. When analysing missing data, additional assumptions
about the missingness are made, and these assumptions are often
applicable to the primary analysis. However, the assumptions cannot
be definitively validated for correctness. Therefore, the National
Research Council has proposed that sensitivity analyses be
conducted to evaluate the robustness of the results to deviations
from the MAR assumption.

Algorithms that Support Missing Values


Not all algorithms fail when there is missing data. Some algorithms
can be made robust to missing data, such as k-nearest neighbours,
which can ignore a column from the distance measure when a value
is missing. Some algorithms can use the missing value as a unique
and different value when building the predictive model, such as
classification and regression trees. An algorithm like XGBoost
handles missing data natively. If your imputation does not work
well, try a model that is robust to missing data.
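As an illustration of native support, scikit-learn's histogram-based gradient boosting (similar in spirit to XGBoost's handling) accepts NaN inputs directly; the toy data are purely illustrative:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

X = np.array([[0.0, np.nan],
              [1.0, 2.0],
              [np.nan, 3.0],
              [2.0, 1.0]])
y = np.array([0, 1, 0, 1])

# Histogram-based gradient boosting routes missing values to whichever
# side of each split works best, so no imputation step is required
clf = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
print(clf.predict(X))
```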
Recommendations
Missing data reduces the power of a model. Some missing data is
expected, and the target sample size can be increased to allow for it;
however, this cannot eliminate the potential bias. More attention
should be paid to missing data in the design and execution of
studies and in the analysis of the resulting data. Machine learning
imputation techniques should only be applied after maximal effort
has been put into reducing missing data through design and
prevention techniques.

A statistically valid analysis with appropriate mechanisms and
assumptions for the missing data is strongly recommended. Most
imputation techniques can cause bias. It is difficult to know whether
multiple imputation or full maximum likelihood estimation is best,
but both are superior to the traditional approaches, and both are
best used with large samples. In general, multiple imputation is a
good approach when analysing data sets with missing data.
