
Big Data Analytics

Data Preprocessing: Handling Missing Values

Prof. Dr. Fazlul Hasan Siddiqui


Dept. of CSE, DUET, Gazipur
BSc:IUT; MSc:BUET; PhD:ANU (Australia)
[email protected]
Source:
www.kaggle.com/residentmario/simple-techniques-for-missing-data-imputation
https://youtu.be/YpqUbirqFxQ
Reasons for Missing Values
Handling Missing Values
Dropping rows with null values:

The easiest and quickest approach to a missing-data problem is dropping the offending
entries. This is an acceptable solution if we are confident that the missing data are
missing at random, and if the number of data points we have access to is high enough
that dropping some of them will not cost the models we build their generalizability.

Dropping data that is missing not at random is dangerous. It will introduce significant
bias into your model wherever the absence of a value corresponds to some real-world
phenomenon. Because detecting this requires domain knowledge, manual inspection is
usually the only way to determine whether it is a problem.

Dropping too much data is also dangerous. It can create significant bias by depriving
your algorithms of training examples. This is especially true of classifiers sensitive
to the curse of dimensionality.
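
As a concrete illustration, here is a minimal pandas sketch of row dropping; the file
name titanic.csv and the column names Age and Embarked are assumptions chosen to match
the Titanic example used later in these slides.

import pandas as pd

# Hypothetical input file; any DataFrame with nulls works the same way.
df = pd.read_csv("titanic.csv")

# Drop every row that contains at least one null value.
df_complete = df.dropna(axis=0, how="any")

# Safer variant: only drop rows whose nulls fall in columns we model on.
df_subset = df.dropna(subset=["Age", "Embarked"])

print(len(df), len(df_complete), len(df_subset))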
Dropping features with high nullity:

A feature with a high number of empty values is unlikely to be very useful for
prediction, and it can often be safely dropped. For example, in the Titanic dataset we
could drop the Cabin feature.

Dropping sparsely populated features simplifies your model, but obviously gives you
fewer features to work with. Before dropping a feature outright, consider subsetting the
dataset to the rows for which the value is available and checking the feature's
importance in a model trained on that subset. If you discover that the variable is
important in the subset where it is defined, consider making an effort to retain it.
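
As a sketch, the same idea in pandas; the 0.5 nullity threshold is an arbitrary
assumption for illustration, not a recommendation.

import pandas as pd

df = pd.read_csv("titanic.csv")  # hypothetical input file

# Fraction of null values per column.
null_fraction = df.isnull().mean()

# Drop every feature whose nullity exceeds the (assumed) 0.5 threshold;
# in the Titanic data this would drop Cabin.
too_sparse = null_fraction[null_fraction > 0.5].index
df_reduced = df.drop(columns=too_sparse)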
Simple Imputation -- mean, median, or other summary-statistic substitution:

The remainder of the techniques available are imputation methods, as opposed to data-
dropping methods. The simplest imputation method is replacing missing values with the
mean or median of the dataset at large, or some similar summary statistic. This has the
advantage of being the simplest possible approach, and when values are missing
completely at random it does not introduce undue bias.

However, with missing values that are not strictly random, especially in the presence of
great inequality in the number of missing values across variables, mean substitution may
introduce considerable bias. Furthermore, this approach adds no new information: it
merely inflates the sample size and thereby leads to an underestimate of the standard
errors. Thus, mean substitution is not generally recommended.
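
A minimal scikit-learn sketch of summary-statistic substitution; the toy matrix X is an
assumption for illustration.

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each null with the median of its column (strategy="mean" also works).
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)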
Model Imputation (k-NN, semi-supervised, maximum likelihood):

Here, we can fill in missing values by applying machine learning to the dataset itself.
If we treat a column with missing data as our target variable, and the fully observed
columns as our predictor variables, then we may construct a machine learning model using
the complete records as our train and test datasets and the records with incomplete data
as our generalization target.

This approach has a number of advantages: the imputation retains far more data than
listwise or pairwise deletion, and it avoids significantly altering the standard
deviation or the shape of the distribution. However, as with mean substitution, a
regression imputation substitutes values that are predicted from the other variables, so
no novel information is added, while the sample size is inflated and the standard error
is deflated.
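
As one concrete model-based option, scikit-learn ships a k-NN imputer; the toy matrix X
and the choice n_neighbors=2 are assumptions for illustration.

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing entry is filled with the mean of that feature over the
# n_neighbors rows nearest in the observed features.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)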
Multivariate Feature Imputation:

All of the techniques discussed so far are what one might call "single imputation": each
value in the dataset is filled in exactly once. The general limitation of single
imputation is that because these techniques fill in maximally likely point values, they
do not generate entries that accurately reflect the spread of the underlying data
distribution.

Multiple imputation estimates missing values by modeling each feature that has missing
values as a function of the other features, in a round-robin fashion. It performs
multiple regressions over random samples of the data, then takes the average of the
regression predictions and uses that average to impute the missing value. In other
words, multiple imputation breaks the process into three steps: imputation (filling in
the data multiple times), analysis (analyzing each completed dataset), and pooling
(combining the per-dataset results into the final imputed matrix).

The most popular algorithm for multiple imputation is MICE (Multivariate Imputation by
Chained Equations), and a Python implementation thereof is available as part of the
fancyimpute package.
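
Besides fancyimpute, scikit-learn provides an IterativeImputer that is explicitly
inspired by MICE; it is still marked experimental, hence the enabling import. A minimal
sketch, with the toy matrix X assumed for illustration:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [4.0, 8.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Each feature with nulls is regressed on the others in round-robin fashion;
# sample_posterior=True draws from the predictive distribution, so running
# this several times with different random_state values yields the multiple
# imputations that are then pooled.
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
X_imputed = imputer.fit_transform(X)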
