
Big Data Analytics

Data Preprocessing: Handling Missing Values

Prof. Dr. Fazlul Hasan Siddiqui


Dept. of CSE, DUET, Gazipur
BSc:IUT; MSc:BUET; PhD:ANU (Australia)
[email protected]
Source:
www.kaggle.com/residentmario/simple-techniques-for-missing-data-imputation
https://youtu.be/YpqUbirqFxQ
Reasons for Missing Values
Handling Missing Values
Dropping rows with null values:

The easiest and quickest approach to a missing-data problem is dropping the offending
entries. This is an acceptable solution if we are confident that the missing data are
missing at random, and if the number of data points we have access to is high enough
that dropping some of them will not cost the models we build their generalizability.

Dropping data that is missing not at random is dangerous. It will introduce significant
bias into your model wherever the absence of a value corresponds to some real-world
phenomenon. Because detecting this requires domain knowledge, manual inspection is
usually the only way to determine whether it is a problem.

Dropping too much data is also dangerous. It can create significant bias by depriving
your algorithms of training examples. This is especially true of classifiers sensitive
to the curse of dimensionality.
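
As a concrete illustration, here is a minimal pandas sketch of row dropping; the file
name titanic.csv and the column names Age and Embarked are assumptions chosen to match
the Titanic example used later in these slides.

import pandas as pd

# Hypothetical input file; any DataFrame with nulls works the same way.
df = pd.read_csv("titanic.csv")

# Drop every row that contains at least one null value.
df_complete = df.dropna(axis=0, how="any")

# Safer variant: only drop rows whose nulls fall in columns we model on.
df_subset = df.dropna(subset=["Age", "Embarked"])

print(len(df), len(df_complete), len(df_subset))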
Dropping features with high nullity:

A feature with a high number of empty values is unlikely to be very useful for
prediction, and it can often be safely dropped. For example, in the Titanic dataset we
could drop the Cabin feature.

Dropping sparsely populated features simplifies your model, but obviously gives you
fewer features to work with. Before dropping a feature outright, consider subsetting the
dataset to the rows for which the value is available and checking the feature's
importance in a model trained on that subset. If you discover that the variable is
important in the subset where it is defined, consider making an effort to retain it.
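
As a sketch, the same idea in pandas; the 0.5 nullity threshold is an arbitrary
assumption for illustration, not a recommendation.

import pandas as pd

df = pd.read_csv("titanic.csv")  # hypothetical input file

# Fraction of null values per column.
null_fraction = df.isnull().mean()

# Drop every feature whose nullity exceeds the (assumed) 0.5 threshold;
# in the Titanic data this would drop Cabin.
too_sparse = null_fraction[null_fraction > 0.5].index
df_reduced = df.drop(columns=too_sparse)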
Simple Imputation -- mean, median, or other summary-statistic substitution:

The remainder of the techniques available are imputation methods, as opposed to data-
dropping methods. The simplest imputation method is replacing missing values with the
mean or median of the dataset at large, or some similar summary statistic. This has the
advantage of being the simplest possible approach, and when values are missing
completely at random it does not introduce undue bias.

However, with missing values that are not strictly random, especially in the presence of
great inequality in the number of missing values across variables, mean substitution may
introduce considerable bias. Furthermore, this approach adds no new information: it
merely inflates the sample size and thereby leads to an underestimate of the standard
errors. Thus, mean substitution is not generally recommended.
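
A minimal scikit-learn sketch of summary-statistic substitution; the toy matrix X is an
assumption for illustration.

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each null with the median of its column (strategy="mean" also works).
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)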
Model Imputation (k-NN, semi-supervised, maximum likelihood):

Here, we can fill in missing values by applying machine learning to the dataset itself.
If we treat a column with missing data as our target variable, and the fully observed
columns as our predictor variables, then we may construct a machine learning model using
the complete records as our train and test datasets and the records with incomplete data
as our generalization target.

This approach has a number of advantages: the imputation retains far more data than
listwise or pairwise deletion, and it avoids significantly altering the standard
deviation or the shape of the distribution. However, as with mean substitution, a
regression imputation substitutes values that are predicted from the other variables, so
no novel information is added, while the sample size is inflated and the standard error
is deflated.
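
As one concrete model-based option, scikit-learn ships a k-NN imputer; the toy matrix X
and the choice n_neighbors=2 are assumptions for illustration.

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing entry is filled with the mean of that feature over the
# n_neighbors rows nearest in the observed features.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)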
Multivariate Feature Imputation:

All of the techniques discussed so far are what one might call "single imputation": each
value in the dataset is filled in exactly once. The general limitation of single
imputation is that because these techniques fill in maximally likely point values, they
do not generate entries that accurately reflect the spread of the underlying data
distribution.

Multiple imputation estimates missing values by modeling each feature that has missing
values as a function of the other features, in a round-robin fashion. It performs
multiple regressions over random samples of the data, then takes the average of the
regression predictions and uses that average to impute the missing value. In other
words, multiple imputation breaks the process into three steps: imputation (filling in
the data multiple times), analysis (analyzing each completed dataset), and pooling
(combining the per-dataset results into the final imputed matrix).

The most popular algorithm for multiple imputation is MICE (Multivariate Imputation by
Chained Equations), and a Python implementation thereof is available as part of the
fancyimpute package.
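
Besides fancyimpute, scikit-learn provides an IterativeImputer that is explicitly
inspired by MICE; it is still marked experimental, hence the enabling import. A minimal
sketch, with the toy matrix X assumed for illustration:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [4.0, 8.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Each feature with nulls is regressed on the others in round-robin fashion;
# sample_posterior=True draws from the predictive distribution, so running
# this several times with different random_state values yields the multiple
# imputations that are then pooled.
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
X_imputed = imputer.fit_transform(X)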
