Missing Data

Uploaded by

vedalamuparna


Introduction
• Missing values are a common issue in machine learning. They occur
when a variable lacks data points for some observations, leaving
incomplete information that can harm the accuracy and reliability of
your models.
• Addressing missing values properly is essential for robust and
unbiased results in your machine-learning projects.
What is a Missing Value?
• Missing values are data points that are absent for a specific variable in a dataset.
• They can be represented in various ways, such as blank cells, null values, or
special symbols like “NA” or “unknown.”
• These missing data points pose a significant challenge in data analysis and can
lead to inaccurate or biased results.
How is a Missing Value Represented in a Dataset?
Missing values in a dataset can be represented in various ways, depending on the
source of the data and the conventions used. Here are some common representations:
•NaN (Not a Number): In many programming languages and data analysis tools, missing
values are represented as NaN. This is the default for libraries like Pandas in Python.
•NULL or None: In databases and some programming languages, missing values are
often represented as NULL or None. For instance, in SQL databases, a missing value is
typically recorded as NULL.
•Empty Strings: Sometimes, missing values are denoted by empty strings (""). This is
common in text-based data or CSV files where a field might be left blank.
•Special Indicators: Datasets might use specific indicators like -999, 9999, or other
unlikely values to signify missing data.
•Blanks or Spaces: In some cases, particularly in fixed-width text files, missing values
might be represented by spaces or blank fields.
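The representations listed above can usually be normalised to NaN when the data is loaded. The following is a minimal sketch, assuming a small made-up CSV; the column names and the sentinel tokens "-999" and "unknown" are illustrative choices, not taken from a real dataset.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical raw CSV mixing several missing-value conventions.
raw = io.StringIO(
    "id,age,city\n"
    "1,34,Delhi\n"
    "2,-999,\n"
    "3,NA,unknown\n"
    "4,41,   \n"
)

# na_values adds extra tokens to the ones pandas already treats as missing
# (such as empty fields and "NA").
df_raw = pd.read_csv(raw, na_values=["-999", "unknown"])

# Whitespace-only strings are not converted by read_csv here,
# so they are replaced with NaN explicitly.
df_raw = df_raw.replace(r"^\s*$", np.nan, regex=True)

missing_per_column = df_raw.isna().sum()
print(missing_per_column)
```

Once every convention is mapped to NaN, the detection and imputation functions in pandas treat all of them uniformly.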
Missing values can pose a significant challenge in data analysis, as they
can:
• Reduce the sample size: This can decrease the accuracy and reliability of
your analysis.
• Introduce bias: If the missing data is not handled properly, it can bias the
results of your analysis.
• Make it difficult to perform certain analyses: Some statistical techniques
require complete data for all variables, making them inapplicable when
missing values are present.
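As a small illustration of the last point, a single NaN is enough to make a plain NumPy mean undefined, whereas NaN-aware functions skip it. The values below are made up for the sketch.

```python
import numpy as np
import pandas as pd

values = np.array([85.0, 92.0, np.nan, 78.0])

# NaN propagates through ordinary arithmetic, so the result is NaN.
plain_mean = np.mean(values)

# nanmean ignores NaN explicitly: (85 + 92 + 78) / 3.
nan_aware_mean = np.nanmean(values)

# pandas skips NaN by default (skipna=True).
pandas_mean = pd.Series(values).mean()

print(plain_mean, nan_aware_mean, pandas_mean)
```

Skipping NaN silently reduces the sample size, which is why the choice should be made deliberately rather than left to a library default.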
Why is Data Missing From the Dataset?
Data can be missing for several reasons, and the reason affects how the
missing data should be handled, so it is necessary to understand why
values are absent.

Some of the reasons are listed below:

•Past data might get corrupted due to improper maintenance.
•Observations are not recorded for certain fields, for example because of
human error while recording the values.
•The user has intentionally not provided the values.
•Item nonresponse: This means the participant refused to respond.
• Data can be missing for many reasons like technical issues, human
errors, privacy concerns, data processing issues, or the nature of the
variable itself.
• Understanding the cause of missing data helps choose appropriate
handling strategies and ensure the quality of your analysis.
• It’s important to understand the reasons behind missing data:
• Identifying the type of missing data: Is it Missing Completely at
Random (MCAR), Missing at Random (MAR), or Missing Not at Random
(MNAR)?
• Evaluating the impact of missing data: Is the missingness causing bias
or affecting the analysis?
• Choosing appropriate handling strategies: Different techniques are
suitable for different types of missing data.
Types of Missing Values
• There are three main types of missing values:

Missing Completely at Random (MCAR):

• MCAR is a specific type of missing data in which the probability of a data
point being missing is entirely random and independent of any other
variable in the dataset. In simpler terms, whether a value is missing or not
has nothing to do with the values of other variables or the characteristics
of the data point itself.
Example: In a survey about library books, some overdue book values in the
dataset are missing due to human error in recording.
Missing at Random (MAR):
• MAR is a type of missing data where the probability of a data point missing
depends on the values of other variables in the dataset, but not on the
missing variable itself. This means that the missingness mechanism is not
entirely random, but it can be predicted based on the available
information.
Example: In a survey, ‘Age’ values might be missing for those who did not
disclose their ‘Gender’. Here, the missingness of ‘Age’ depends on ‘Gender’,
but the missing ‘Age’ values are random among those who did not disclose
their ‘Gender’.
Missing Not at Random (MNAR):
• MNAR is the most challenging type of missing data to deal with. It
occurs when the probability of a data point being missing is related to
the missing value itself. This means that the reason for the missing
data is informative and directly associated with the variable that is
missing.
Example: In a survey about library books, people with more overdue
books might be less likely to respond to the survey. The number of
overdue books is therefore missing for a reason that depends on the
number of overdue books itself.
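The three mechanisms can be compared on synthetic data. The sketch below is an illustration under assumed numbers (the age range, the income formula, and the missingness probabilities are all made up); it deletes income values under each mechanism and compares the mean of the remaining values with the true mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data (an assumption for illustration): income rises with age.
age = rng.integers(20, 70, size=n)
income = 1000 * age + rng.normal(0, 5000, size=n)
df_sim = pd.DataFrame({"age": age, "income": income})

# MCAR: every income value has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.3
# MAR: missingness depends only on the observed variable 'age'.
mar_mask = rng.random(n) < np.where(age > 45, 0.5, 0.1)
# MNAR: missingness depends on the income value itself.
mnar_mask = rng.random(n) < np.where(income > np.median(income), 0.5, 0.1)

true_mean = df_sim["income"].mean()
observed_means = {
    name: df_sim.loc[~mask, "income"].mean()
    for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]
}

print(f"true mean: {true_mean:.0f}")
for name, value in observed_means.items():
    print(f"{name}: observed mean {value:.0f}")
```

In this setup the MCAR mean should stay close to the true mean, while the MAR and MNAR means are pulled downward. Under MAR the bias can in principle be corrected using the observed 'age' column; under MNAR the observed data alone do not contain enough information to do so.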
Methods for Identifying Missing Data
• Locating and understanding patterns of missingness in the dataset is an
important step in addressing its impact on analysis.
• pandas provides several functions for detecting, removing, and replacing
missing values in a Series or DataFrame.
Function     Description
.isnull()    Returns a boolean Series or DataFrame in which True marks a missing value and False a non-missing value.
.isna()      Alias of .isnull(); it returns exactly the same result.
.notnull()   The inverse of .isnull(): True marks a non-missing value and False a missing value.
.info()      Displays information about the DataFrame, including data types, memory usage, and the non-null count of each column.
.dropna()    Drops rows or columns containing missing values based on custom criteria.
.fillna()    Fills missing values with specific values, means, medians, or other calculated values.
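The detection functions are typically combined with sum(), mean(), or any() to summarise missingness. This is a minimal sketch on a made-up four-row DataFrame (df_demo is a hypothetical name used only here).

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "Marks": [85, np.nan, 78, np.nan],
    "City": ["Delhi", "Lucknow", None, "Kolkata"],
})

# Number of missing values per column (True counts as 1).
missing_counts = df_demo.isna().sum()

# Fraction of missing values per column.
missing_share = df_demo.isna().mean()

# Rows that contain at least one missing value.
rows_with_gaps = df_demo[df_demo.isna().any(axis=1)]

print(missing_counts)
print(missing_share)
print(rows_with_gaps)

# The "Non-Null Count" column of info() shows non-missing entries per column.
df_demo.info()
```

The per-column share is a useful first check, because the proportion of missing values is one of the factors that determines which handling strategy is reasonable.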
Effective Strategies for Handling Missing Values in Data Analysis
import pandas as pd
import numpy as np
# Creating a sample DataFrame with missing values
data = {
    'School ID': [101, 102, 103, np.nan, 105, 106, 107, 108],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Henry'],
    'Address': ['123xyz', '456abc', '789lmn', '101def', np.nan, '222mno', '444tuv', '555pqr'],
    'City': ['Delhi', 'Lucknow', 'Kolkata', 'Haldwani', 'Haldwani', np.nan, 'Dehradun', 'Varanasi'],
    'Subject': ['Math', 'English', 'Science', 'Math', 'History', 'Math', 'Science', 'English'],
    'Marks': [85, 92, 78, 89, np.nan, 95, 80, 88],
    'Rank': [2, 1, 4, 3, 8, 1, 5, 3],
    'Grade': ['B', 'A', 'C', 'B', 'D', 'A', 'C', 'B']
}
df = pd.DataFrame(data)
print("Sample DataFrame:")
print(df)
Removing Rows with Missing Values
•Simple and efficient: Removes data points with missing values altogether.
•Reduces sample size: Can lead to biased results if missingness is not random.
•Not recommended when many rows contain missing values: Can discard valuable information.

In this example, we are removing rows with missing values from the original
DataFrame (df) using the dropna() method and then displaying the cleaned DataFrame
(df_cleaned).

# Removing rows with missing values

df_cleaned = df.dropna()

# Displaying the DataFrame after removing missing values

print("\nDataFrame after removing rows with missing values:")
print(df_cleaned)
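dropna() also accepts parameters that make the removal less drastic than dropping every incomplete row. The sketch below uses a small made-up DataFrame (df_small is a hypothetical name) to show the subset, thresh, and axis parameters.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "City": ["Delhi", np.nan, "Kolkata", np.nan],
    "Marks": [85, 92, np.nan, np.nan],
})

# Drop a row only if 'Marks' is missing; gaps in other columns are tolerated.
needs_marks = df_small.dropna(subset=["Marks"])

# Keep rows that have at least 2 non-missing values.
at_least_two = df_small.dropna(thresh=2)

# Drop columns (axis=1) that contain any missing value.
complete_columns = df_small.dropna(axis=1)

print(needs_marks)
print(at_least_two)
print(complete_columns)
```

Restricting the check to the columns an analysis actually uses often keeps considerably more rows than a blanket dropna().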
Imputation Methods
•Replacing missing values with estimated values.
•Preserves sample size: Doesn’t reduce data points.
•Can introduce bias: Estimated values might not be accurate.

Here are some common imputation methods:

1. Mean, Median, and Mode Imputation
•Replace missing values with the mean, median, or mode of the
relevant variable.
•Simple and efficient: Easy to implement.
•Can be inaccurate: Doesn’t consider the relationships between
variables.

This example applies three imputation techniques to the ‘Marks’ column
of the DataFrame (df). It fills the missing values with the mean, the
median, and the mode of the existing values in that column, and then
prints each result for comparison.
1.Mean Imputation: Calculates the mean of the ‘Marks’ column in the
DataFrame (df).
•df['Marks'].fillna(...): Fills missing values in the ‘Marks’ column with the
mean value.
•mean_imputation: The result is stored in the variable mean_imputation.
2.Median Imputation: Calculates the median of the ‘Marks’ column in the
DataFrame (df).
•df['Marks'].fillna(...): Fills missing values in the ‘Marks’ column with the
median value.
•median_imputation: The result is stored in the variable median_imputation.
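3.Mode Imputation: Calculates the mode of the ‘Marks’ column in the
DataFrame (df). The result is a Series, because a column can have more
than one mode.
•.iloc[0]: Accesses the first element of the Series, which represents the
mode. When several values are tied, this is the smallest of them.
•df['Marks'].fillna(...): Fills missing values in the ‘Marks’ column with the
mode value.
•mode_imputation: The result is stored in the variable mode_imputation.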
# Mean, Median, and Mode Imputation
mean_imputation = df['Marks'].fillna(df['Marks'].mean())
median_imputation = df['Marks'].fillna(df['Marks'].median())
mode_imputation = df['Marks'].fillna(df['Marks'].mode().iloc[0])

print("\nImputation using Mean:")

print(mean_imputation)

print("\nImputation using Median:")

print(median_imputation)

print("\nImputation using Mode:")

print(mode_imputation)
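Because a single overall mean ignores the relationships between variables, one common refinement is to impute within groups. The sketch below is an illustration on made-up data (df_marks is a hypothetical name); it compares the overall mean with a per-subject mean.

```python
import numpy as np
import pandas as pd

df_marks = pd.DataFrame({
    "Subject": ["Math", "Math", "Math", "English", "English", "English"],
    "Marks": [80.0, 90.0, np.nan, 60.0, np.nan, 70.0],
})

# Overall mean ignores the subject: (80 + 90 + 60 + 70) / 4 = 75.
overall = df_marks["Marks"].fillna(df_marks["Marks"].mean())

# Group-wise mean: transform("mean") returns, for every row,
# the mean of that row's own subject (Math 85, English 65).
group_means = df_marks.groupby("Subject")["Marks"].transform("mean")
by_subject = df_marks["Marks"].fillna(group_means)

print(overall)
print(by_subject)
```

Here the missing Math mark becomes 85 and the missing English mark becomes 65, instead of 75 for both.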
2. Forward and Backward Fill
•Replace missing values with the previous or next non-missing value in
the same variable.
•Simple and intuitive: Preserves temporal order.
•Can be inaccurate: Assumes missing values are close to observed
values

These fill methods are particularly useful when the data has a logical
sequence or order, such as a time series, and missing values can
reasonably be assumed to follow the neighbouring observations.
The method parameter of fillna() (method='ffill' or method='bfill') is
deprecated in recent pandas releases, so the dedicated ffill() and bfill()
methods are used here.
1.Forward Fill (forward_fill)
•df['Marks'].ffill(): This method fills missing values in
the ‘Marks’ column of the DataFrame (df) using a forward fill
strategy. It replaces missing values with the last observed
non-missing value in the column.
•forward_fill: The result is stored in the variable forward_fill.

2.Backward Fill (backward_fill)

•df['Marks'].bfill(): This method fills missing values in
the ‘Marks’ column using a backward fill strategy. It replaces
missing values with the next observed non-missing value in the
column.
•backward_fill: The result is stored in the variable backward_fill.
# Forward and Backward Fill
forward_fill = df['Marks'].ffill()
backward_fill = df['Marks'].bfill()

print("\nForward Fill:")
print(forward_fill)

print("\nBackward Fill:")
print(backward_fill)

Note
• Forward fill uses the last valid observation to fill missing values.
• Backward fill uses the next valid observation to fill missing values.
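Since these methods assume that a missing value is close to its neighbour, long gaps are risky to fill. The sketch below, on made-up daily readings, uses the limit parameter of ffill() to cap how many consecutive missing values are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature readings with a one-day and a three-day gap.
dates = pd.date_range("2024-01-01", periods=7, freq="D")
temps = pd.Series([20.0, np.nan, 22.0, np.nan, np.nan, np.nan, 25.0], index=dates)

# Every gap is filled, however long it is.
filled_all = temps.ffill()

# At most one consecutive NaN is filled; the rest of a long gap stays missing.
filled_limited = temps.ffill(limit=1)

print(filled_all)
print(filled_limited)
```

With limit=1 the one-day gap is filled completely, while only the first day of the three-day gap is filled and the remaining two days stay NaN for separate treatment.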
3. Interpolation Techniques
•Estimate missing values based on surrounding data points
using techniques like linear interpolation or spline
interpolation.
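•More sophisticated than mean/median imputation: Uses the neighbouring
values of the same variable, so it follows local trends in ordered data.
•Some methods need extra dependencies: Linear interpolation works with
pandas alone, while methods such as quadratic and spline rely on SciPy.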

These interpolation techniques are useful when the values between
observed data points can reasonably be assumed to follow a linear or
quadratic pattern. The method parameter of the interpolate() method
specifies the interpolation strategy.
1.Linear Interpolation
•df['Marks'].interpolate(method='linear'): This method performs linear
interpolation on the ‘Marks’ column of the DataFrame (df). Linear
interpolation estimates missing values by considering a straight line
between two adjacent non-missing values.
•linear_interpolation: The result is stored in the variable linear_interpolation.

2.Quadratic Interpolation
•df['Marks'].interpolate(method='quadratic'): This method performs quadratic
interpolation on the ‘Marks’ column. Quadratic interpolation estimates
missing values from a second-order (quadratic) spline fitted through the
non-missing values. It requires SciPy to be installed.
•quadratic_interpolation: The result is stored in the variable
quadratic_interpolation.
# Interpolation Techniques
linear_interpolation = df['Marks'].interpolate(method='linear')
quadratic_interpolation = df['Marks'].interpolate(method='quadratic')

print("\nLinear Interpolation:")
print(linear_interpolation)

print("\nQuadratic Interpolation:")
print(quadratic_interpolation)

Note:
Linear interpolation assumes a straight line between two adjacent non-missing values.
Quadratic interpolation assumes a smooth curve built from second-order polynomial pieces through the non-missing values.
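The arithmetic behind linear interpolation can be checked by hand. The sketch below reuses the ‘Marks’ values from the sample DataFrame, where the single gap lies between 89 and 95; the second Series is made up to show a two-value gap.

```python
import numpy as np
import pandas as pd

marks = pd.Series([85, 92, 78, 89, np.nan, 95, 80, 88])

# The gap at position 4 lies halfway between 89 and 95.
by_hand = 89 + (95 - 89) * (1 / 2)
by_pandas = marks.interpolate(method="linear")

# A two-value gap is split into equal steps between the neighbours.
steps = pd.Series([10.0, np.nan, np.nan, 40.0]).interpolate(method="linear")

print(by_hand)
print(by_pandas)
print(steps)
```

Note that method='linear' treats the values as equally spaced and ignores the index, so for irregularly spaced timestamps a method such as 'time' may be more appropriate.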

Choosing the right strategy depends on several factors:

• Type of missing data: MCAR, MAR, or MNAR.
• Proportion of missing values.
• Data type and distribution.
• Analytical goals and assumptions.
Impact of Handling Missing Values
Handling missing values effectively is crucial to ensure the accuracy and reliability of your findings.
Here are some key impacts of handling missing values:
• Improved data quality: Addressing missing values enhances the overall quality of the dataset. A
cleaner dataset with fewer missing values is more reliable for analysis and model training.
• Enhanced model performance: Machine learning algorithms often struggle with missing
data, leading to biased and unreliable results. By appropriately handling missing values, models
can be trained on a more complete dataset, leading to improved performance and accuracy.
• Preservation of data integrity: Handling missing values helps maintain the integrity of the
dataset. Imputing or removing missing values ensures that the dataset remains consistent and
suitable for analysis.
• Reduced bias: Ignoring missing values may introduce bias in the analysis or modeling process.
Handling missing data allows for a more unbiased representation of the underlying patterns in the
data.
• More accurate descriptive statistics: Means, medians, and standard deviations can be more accurate
when missing values are appropriately handled. This ensures a more reliable summary of the
dataset.
• Increased efficiency: Efficiently handling missing values can save you time and effort during data
analysis and modeling.
