Missing Data

Uploaded by

vedalamuparna

Introduction
• Missing values are a common issue in machine learning. They occur
when a variable lacks data points for some observations, leaving the
dataset incomplete and potentially harming the accuracy and reliability of
your models.
• Addressing missing values properly is essential for robust and
unbiased results in your machine-learning projects.
What is a Missing Value?
• Missing values are data points that are absent for a specific variable in a dataset.
• They can be represented in various ways, such as blank cells, null values, or
special symbols like “NA” or “unknown.”
• These missing data points pose a significant challenge in data analysis and can
lead to inaccurate or biased results.
How is a Missing Value Represented in a Dataset?
Missing values in a dataset can be represented in various ways, depending on the
source of the data and the conventions used. Here are some common representations:
•NaN (Not a Number): In many programming languages and data analysis tools, missing
values are represented as NaN. This is the default for libraries like Pandas in Python.
•NULL or None: In databases and some programming languages, missing values are
often represented as NULL or None. For instance, in SQL databases, a missing value is
typically recorded as NULL.
•Empty Strings: Sometimes, missing values are denoted by empty strings (""). This is
common in text-based data or CSV files where a field might be left blank.
•Special Indicators: Datasets might use specific indicators like -999, 9999, or other
unlikely values to signify missing data.
•Blanks or Spaces: In some cases, particularly in fixed-width text files, missing values
might be represented by spaces or blank fields.
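The representations above can be normalized to NaN when loading data. The following is a minimal sketch, assuming a small made-up CSV whose column names and sentinel values (-999, "unknown") are hypothetical; it uses the na_values parameter of pandas read_csv and then treats whitespace-only text as missing.

```python
import io

import pandas as pd

# Hypothetical CSV text: -999, "unknown", an empty field, "NA",
# and a whitespace-only field all stand for missing data.
raw = io.StringIO(
    "id,age,city\n"
    "1,34,Delhi\n"
    "2,-999,unknown\n"
    "3,,Kolkata\n"
    "4,NA,  \n"
)

# Empty fields and "NA" are treated as missing by default;
# na_values adds the dataset-specific indicators.
df = pd.read_csv(raw, na_values=["-999", "unknown"])

# Whitespace-only strings are not recognized automatically,
# so strip them and mask the resulting empty strings.
city = df["city"].str.strip()
df["city"] = city.mask(city == "")

print(df)
print(df.isna().sum())
```

Which sentinel values to list depends on the data source, so they are best taken from the dataset's documentation rather than guessed.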
Missing values can pose a significant challenge in data analysis, as they
can:
• Reduce the sample size: This can decrease the accuracy and reliability of
your analysis.
• Introduce bias: If the missing data is not handled properly, it can bias the
results of your analysis.
• Make it difficult to perform certain analyses: Some statistical techniques
require complete data for all variables, making them inapplicable when
missing values are present.
Why is Data Missing From the Dataset?
There can be multiple reasons why certain values are missing from the data.
The reason the data is missing affects how it should be handled, so it is
necessary to understand why the data could be missing.

Some of the reasons are listed below:

•Past data might get corrupted due to improper maintenance.
•Observations are not recorded for certain fields, for example because of a
failure in recording the values due to human error.
•The user has intentionally not provided the values.
•Item nonresponse: The participant did not respond to a particular question.
• Data can be missing for many reasons like technical issues, human
errors, privacy concerns, data processing issues, or the nature of the
variable itself.
• Understanding the cause of missing data helps choose appropriate
handling strategies and ensure the quality of your analysis.
• It’s important to understand the reasons behind missing data:
• Identifying the type of missing data: Is it Missing Completely at
Random (MCAR), Missing at Random (MAR), or Missing Not at Random
(MNAR)?
• Evaluating the impact of missing data: Is the missingness causing bias
or affecting the analysis?
• Choosing appropriate handling strategies: Different techniques are
suitable for different types of missing data.
Types of Missing Values
• There are three main types of missing values:

Missing Completely at Random (MCAR):


• MCAR is a specific type of missing data in which the probability of a data
point being missing is entirely random and independent of any other
variable in the dataset. In simpler terms, whether a value is missing or not
has nothing to do with the values of other variables or the characteristics
of the data point itself.
Example: In a survey about library books, some overdue book values in the
dataset are missing due to human error in recording.
Missing at Random (MAR):
• MAR is a type of missing data where the probability of a data point missing
depends on the values of other variables in the dataset, but not on the
missing variable itself. This means that the missingness mechanism is not
entirely random, but it can be predicted based on the available
information.
Example: In a survey, ‘Age’ values might be missing more often for respondents
of a particular ‘Gender’. Here, the missingness of ‘Age’ depends on the observed
‘Gender’ value, but within each gender group the missing ‘Age’ values are random.
Missing Not at Random (MNAR):
• MNAR is the most challenging type of missing data to deal with. It
occurs when the probability of a data point being missing is related to
the missing value itself. This means that the reason for the missing
data is informative and directly associated with the variable that is
missing.
Example: In a survey about library books, people with more overdue
books might be less likely to respond to the survey. Thus, whether the
number of overdue books is missing depends on that number itself.
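The three mechanisms can be compared with a small simulation. This is only an illustrative sketch on synthetic data: the 'age' and 'income' columns, the thresholds, and the missingness probabilities are all assumptions chosen for the demonstration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data: income rises with age, so the two variables are related.
age = rng.integers(18, 80, size=n)
income = 20 + 0.5 * age + rng.normal(0, 5, size=n)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: every income value has the same 20% chance of being missing.
mcar = df["income"].mask(rng.random(n) < 0.2)

# MAR: missingness depends only on the observed 'age' column.
mar = df["income"].mask((df["age"] > 60) & (rng.random(n) < 0.8))

# MNAR: missingness depends on the income value that goes missing.
mnar = df["income"].mask((df["income"] > 50) & (rng.random(n) < 0.8))

summary = pd.Series({
    "true mean": df["income"].mean(),
    "MCAR mean": mcar.mean(),
    "MAR mean": mar.mean(),
    "MNAR mean": mnar.mean(),
})
print(summary.round(2))
```

In this setup the mean of the observed values stays close to the true mean under MCAR and is pulled down under MNAR. Under MAR the raw mean also shifts, but because the cause (age) is observed, it can be adjusted for using the available data.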
Methods for Identifying Missing Data
• Locating and understanding patterns of missingness in the dataset is an
important step in addressing its impact on analysis.
• pandas provides several useful functions for detecting, removing, and replacing
null values in a DataFrame:
Function      Description
.isnull()     Identifies missing values in a Series or DataFrame; returns True where a value is missing.
.notnull()    Returns a boolean Series or DataFrame where True indicates non-missing values and False indicates missing values.
.info()       Displays information about the DataFrame, including data types, memory usage, and the non-null count of each column.
.isna()       Alias of .isnull(): returns True for missing values and False for non-missing values.
.dropna()     Drops rows or columns containing missing values based on custom criteria.
.fillna()     Fills missing values with specific values, means, medians, or other calculated values.
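As a quick illustration of the detection functions, the sketch below counts missing values per column and selects the rows that contain any. The small DataFrame is a hypothetical example, not the one used later in these slides.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two kinds of missing markers (None and NaN).
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "City": ["Delhi", None, "Kolkata", "Haldwani"],
    "Marks": [85.0, np.nan, 78.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Share of missing values in each column (mean of the boolean mask).
missing_share = df.isna().mean()

# Rows that contain at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_per_column)
print(missing_share)
print(rows_with_missing)
```

Summaries like these help decide whether a column is worth imputing or whether too much of it is missing.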
Effective Strategies for Handling Missing Values in Data Analysis
import pandas as pd
import numpy as np
# Creating a sample DataFrame with missing values
data = {
'School ID': [101, 102, 103, np.nan, 105, 106, 107, 108],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Henry'],
'Address': ['123xyz', '456abc', '789lmn', '101def', np.nan, '222mno', '444tuv', '555pqr'],
'City': ['Delhi', 'Lucknow', 'Kolkata', 'Haldwani', 'Haldwani', np.nan, 'Dehradun', 'Varanasi'],
'Subject': ['Math', 'English', 'Science', 'Math', 'History', 'Math', 'Science', 'English'],
'Marks': [85, 92, 78, 89, np.nan, 95, 80, 88],
'Rank': [2, 1, 4, 3, 8, 1, 5, 3],
'Grade': ['B', 'A', 'C', 'B', 'D', 'A', 'C', 'B']
}
df = pd.DataFrame(data)
print("Sample DataFrame:")
print(df)
Removing Rows with Missing Values
•Simple and efficient: Removes data points with missing values altogether.
•Reduces sample size: Can lead to biased results if missingness is not random.
•Not recommended when many rows contain missing values: Can discard valuable information.

In this example, we are removing rows with missing values from the original
DataFrame (df) using the dropna() method and then displaying the cleaned DataFrame
(df_cleaned).

# Removing rows with missing values
df_cleaned = df.dropna()

# Displaying the DataFrame after removing missing values
print("\nDataFrame after removing rows with missing values:")
print(df_cleaned)
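dropna() also accepts criteria that make the removal less drastic. The sketch below, on a small hypothetical DataFrame, shows the subset, thresh, and axis parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical data; row 3 is missing both City and Marks.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "City": ["Delhi", np.nan, "Kolkata", np.nan],
    "Marks": [85.0, 92.0, np.nan, np.nan],
})

# Drop rows only when 'Marks' is missing; gaps in other columns are kept.
rows_with_marks = df.dropna(subset=["Marks"])

# Keep rows that have at least 2 non-missing values.
rows_thresh = df.dropna(thresh=2)

# Drop columns (instead of rows) that contain any missing value.
complete_columns = df.dropna(axis=1)

print(rows_with_marks)
print(rows_thresh)
print(complete_columns)
```

Restricting the check to the columns an analysis actually needs usually preserves more rows than a plain dropna().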
Imputation Methods
•Replacing missing values with estimated values.
•Preserves sample size: Doesn’t reduce data points.
•Can introduce bias: Estimated values might not be accurate.

Here are some common imputation methods:


1. Mean, Median, and Mode Imputation
•Replace missing values with the mean, median, or mode of the
relevant variable.
•Simple and efficient: Easy to implement.
•Can be inaccurate: Doesn’t consider the relationships between
variables.

This example demonstrates imputation for the missing values in the
‘Marks’ column of the DataFrame (df). It fills the missing values with
the mean, median, and mode of the existing values in that column, and
then prints the results for comparison.
1.Mean Imputation: Calculates the mean of the ‘Marks’ column in the
DataFrame (df).
•df['Marks'].fillna(...): Fills missing values in the ‘Marks’ column with the
mean value.
•mean_imputation: The result is stored in the variable mean_imputation.
2.Median Imputation: Calculates the median of the ‘Marks’ column in the
DataFrame (df).
•df['Marks'].fillna(...): Fills missing values in the ‘Marks’ column with the
median value.
•median_imputation: The result is stored in the variable median_imputation.
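3.Mode Imputation: Calculates the mode of the ‘Marks’ column in the
DataFrame (df). The result is a Series, because several values can tie for
most frequent.
•.iloc[0]: Accesses the first element of the Series, which represents the
mode (the smallest value when there is a tie).
•df['Marks'].fillna(...): Fills missing values in the ‘Marks’ column with the
mode value.
•mode_imputation: The result is stored in the variable mode_imputation.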
# Mean, Median, and Mode Imputation
mean_imputation = df['Marks'].fillna(df['Marks'].mean())
median_imputation = df['Marks'].fillna(df['Marks'].median())
mode_imputation = df['Marks'].fillna(df['Marks'].mode().iloc[0])

print("\nImputation using Mean:")
print(mean_imputation)

print("\nImputation using Median:")
print(median_imputation)

print("\nImputation using Mode:")
print(mode_imputation)
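One way to soften the limitation that a single overall mean ignores relationships between variables is to impute within groups. The following sketch is a variation on mean imputation rather than a method from these slides; it assumes a hypothetical DataFrame in which ‘Marks’ differ by ‘Subject’.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Science marks are lower than Math marks on average.
df = pd.DataFrame({
    "Subject": ["Math", "Math", "Math", "Science", "Science", "Science"],
    "Marks": [80.0, 90.0, np.nan, 60.0, np.nan, 70.0],
})

# Overall mean imputation: every gap receives the same value.
overall_imputation = df["Marks"].fillna(df["Marks"].mean())

# Group-wise mean imputation: each gap receives the mean of its own subject.
subject_mean = df.groupby("Subject")["Marks"].transform("mean")
groupwise_imputation = df["Marks"].fillna(subject_mean)

print(overall_imputation.tolist())
print(groupwise_imputation.tolist())
```

Here the overall mean fills both gaps with 75, while the group-wise version fills the Math gap with 85 and the Science gap with 65.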
2. Forward and Backward Fill
•Replace missing values with the previous or next non-missing value in
the same variable.
•Simple and intuitive: Preserves temporal order.
•Can be inaccurate: Assumes missing values are close to observed
values

These fill methods are particularly useful when there is a logical
sequence or order in the data, and missing values can be reasonably
assumed to follow a pattern.
pandas provides the ffill() method for forward fill and the bfill() method for
backward fill. Older code passes method='ffill' or method='bfill' to fillna(),
but that parameter is deprecated in recent pandas versions.
1.Forward Fill (forward_fill)
•df['Marks'].ffill(): This method fills missing values in
the ‘Marks’ column of the DataFrame (df) using a forward fill
strategy. It replaces missing values with the last observed
non-missing value in the column.
•forward_fill: The result is stored in the variable forward_fill.

2.Backward Fill (backward_fill)
•df['Marks'].bfill(): This method fills missing values in
the ‘Marks’ column using a backward fill strategy. It replaces
missing values with the next observed non-missing value in the
column.
•backward_fill: The result is stored in the variable backward_fill.
# Forward and Backward Fill
forward_fill = df['Marks'].ffill()
backward_fill = df['Marks'].bfill()

print("\nForward Fill:")
print(forward_fill)

print("\nBackward Fill:")
print(backward_fill)

Note
• Forward fill uses the last valid observation to fill missing values.
• Backward fill uses the next valid observation to fill missing values.
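Two edge cases are worth seeing on a small example: a forward fill cannot fill a gap at the very start of a column, and a long gap is filled entirely with one repeated value unless a limit is set. The sketch below uses a hypothetical Series and the ffill() and bfill() Series methods.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with a leading gap and a two-value gap.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Forward fill: the leading NaN has no previous value and stays missing.
forward = s.ffill()

# Backward fill: every gap takes the next valid observation.
backward = s.bfill()

# Limit the fill to one consecutive missing value per gap.
forward_limited = s.ffill(limit=1)

# Forward fill first, then backward fill whatever is still missing.
combined = s.ffill().bfill()

print(forward.tolist())
print(backward.tolist())
print(forward_limited.tolist())
print(combined.tolist())
```

Chaining the two fills removes every gap here, but it also hides how long each gap was, so a limit can be the safer choice for long runs of missing values.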
3. Interpolation Techniques
•Estimate missing values based on surrounding data points
using techniques like linear interpolation or spline
interpolation.
•More sophisticated than mean/median imputation: Captures
the trend between neighboring observations.
•Some methods (such as quadratic and spline interpolation) require
SciPy and more computation.

These interpolation techniques are useful when the
relationship between data points can be reasonably assumed
to follow a linear or quadratic pattern. The method parameter in
the interpolate() method lets you specify the interpolation
strategy.
1.Linear Interpolation
•df['Marks'].interpolate(method='linear'): This method performs linear
interpolation on the ‘Marks’ column of the DataFrame (df). Linear
interpolation estimates missing values by considering a straight line
between two adjacent non-missing values.
•linear_interpolation: The result is stored in the variable linear_interpolation.

2.Quadratic Interpolation
•df['Marks'].interpolate(method='quadratic'): This method performs quadratic
interpolation on the ‘Marks’ column. Quadratic interpolation estimates
missing values from a second-order (quadratic) curve fitted through the
neighboring non-missing values. pandas delegates this method to SciPy, which
must be installed.
•quadratic_interpolation: The result is stored in the variable
quadratic_interpolation.
# Interpolation Techniques
linear_interpolation = df['Marks'].interpolate(method='linear')
quadratic_interpolation = df['Marks'].interpolate(method='quadratic')

print("\nLinear Interpolation:")
print(linear_interpolation)

print("\nQuadratic Interpolation:")
print(quadratic_interpolation)

Note
• Linear interpolation assumes a straight line between two adjacent non-missing values.
• Quadratic interpolation assumes a quadratic curve through the neighboring non-missing values.
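The arithmetic behind linear interpolation is easy to check by hand on evenly spaced values. The sketch below uses a hypothetical Series; with the default settings, a gap before the first valid value is left missing because there is no earlier point to draw a line from.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: one leading gap, one single gap, one double gap.
s = pd.Series([np.nan, 10.0, np.nan, 30.0, np.nan, np.nan, 60.0])

# Each gap is filled along the straight line joining its two neighbors:
# between 10 and 30 the midpoint is 20; between 30 and 60 the two
# missing points are spaced 10 apart, giving 40 and 50.
linear = s.interpolate(method="linear")

print(linear.tolist())
```

Linear interpolation here treats the rows as equally spaced, which is an assumption about the data rather than something the method can verify.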

Choosing the right strategy depends on several factors:
• Type of missing data: MCAR, MAR, or MNAR.
• Proportion of missing values.
• Data type and distribution.
• Analytical goals and assumptions.
Impact of Handling Missing Values
Handling missing values effectively is crucial to ensure the accuracy and reliability of your findings.
Here are some key impacts of handling missing values:
• Improved data quality: Addressing missing values enhances the overall quality of the dataset. A
cleaner dataset with fewer missing values is more reliable for analysis and model training.
• Enhanced model performance: Machine learning algorithms often struggle with missing
data, leading to biased and unreliable results. By appropriately handling missing values, models
can be trained on a more complete dataset, leading to improved performance and accuracy.
• Preservation of data integrity: Handling missing values helps maintain the integrity of the
dataset. Imputing or removing missing values ensures that the dataset remains consistent and
suitable for analysis.
• Reduced bias: Ignoring missing values may introduce bias in the analysis or modeling process.
Handling missing data allows for a more unbiased representation of the underlying patterns in the
data.
• More accurate descriptive statistics: Means, medians, and standard deviations can be more
accurate when missing values are appropriately handled. This ensures a more reliable summary
of the dataset.
• Increased efficiency: Efficiently handling missing values can save you time and effort during data
analysis and modeling.
