Missing Data
Introduction
• Missing values are a common issue in machine learning. They occur
when one or more observations of a variable were not recorded, leaving
the dataset incomplete and potentially reducing the accuracy and
reliability of your models.
• Handling missing values carefully is essential for robust and
unbiased results in your machine-learning projects.
What is a Missing Value?
• Missing values are data points that are absent for a specific variable in a dataset.
• They can be represented in various ways, such as blank cells, null values, or
special symbols like “NA” or “unknown.”
• These missing data points pose a significant challenge in data analysis and can
lead to inaccurate or biased results.
How is a Missing Value Represented in a Dataset?
Missing values in a dataset can be represented in various ways, depending on the
source of the data and the conventions used. Here are some common representations:
•NaN (Not a Number): A special floating-point value that many data analysis tools use to
mark missing numeric entries. Pandas uses NaN by default for missing values in numeric columns.
•NULL or None: In databases and some programming languages, missing values are
often represented as NULL or None. For instance, in SQL databases, a missing value is
typically recorded as NULL.
•Empty Strings: Sometimes, missing values are denoted by empty strings (""). This is
common in text-based data or CSV files where a field might be left blank.
•Special Indicators: Datasets might use specific indicators like -999, 9999, or other
unlikely values to signify missing data.
•Blanks or Spaces: In some cases, particularly in fixed-width text files, missing values
might be represented by spaces or blank fields.
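Because these conventions are often mixed within one file, a common first step is to convert them all to NaN so that later steps see a single representation. The following is a minimal sketch, assuming pandas and a made-up table; the column names, the -999 sentinel, and the "NA"/"unknown" placeholders are illustrative assumptions, not values from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data that mixes several missing-value conventions.
raw = pd.DataFrame({
    "Name": ["Asha", "", "Ravi", "   "],       # empty string and spaces
    "Marks": [85.0, -999.0, np.nan, 72.0],     # sentinel value and NaN
    "Grade": ["A", "NA", None, "unknown"],     # text placeholders and None
})

cleaned = raw.copy()
# Empty or whitespace-only strings become NaN.
cleaned["Name"] = cleaned["Name"].mask(cleaned["Name"].str.strip() == "")
# The assumed numeric sentinel becomes NaN.
cleaned["Marks"] = cleaned["Marks"].mask(cleaned["Marks"] == -999)
# The assumed text placeholders become NaN.
cleaned["Grade"] = cleaned["Grade"].mask(cleaned["Grade"].isin(["NA", "unknown"]))

# Count the missing values per column after normalization.
missing_per_column = cleaned.isna().sum()
print(cleaned)
print(missing_per_column)
```

When the data comes from a CSV file, the `na_values` parameter of `pd.read_csv` can perform the same conversion at load time.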
Missing values can pose a significant challenge in data analysis, as they
can:
• Reduce the sample size: This can decrease the accuracy and reliability of
your analysis.
• Introduce bias: If the missing data is not handled properly, it can bias the
results of your analysis.
• Make it difficult to perform certain analyses: Some statistical techniques
require complete data for all variables, making them inapplicable when
missing values are present.
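The effect on sample size can be checked before choosing a strategy. The sketch below, using a small made-up DataFrame (the column names and values are assumptions), reports the share of missing values per column and how many rows would survive if every incomplete row were discarded.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two columns.
df = pd.DataFrame({
    "Marks": [85.0, np.nan, 78.0, 90.0, np.nan],
    "Attendance": [0.90, 0.80, np.nan, 0.95, 0.70],
})

# Fraction of missing values in each column.
missing_share = df.isna().mean()

# Number of rows that have no missing value in any column.
complete_rows = len(df.dropna())

print(missing_share)
print(f"{complete_rows} of {len(df)} rows are complete")
```

Here each column is mostly filled, yet fewer than half of the rows are complete, which is how missing values spread across columns can shrink the usable sample quickly.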
Why is Data Missing From the Dataset?
There can be multiple reasons why certain values are missing from the data.
The reason the data is missing affects which handling approach is
appropriate, so it is necessary to understand why the data could be
missing.
In this example, we are removing rows with missing values from the original
DataFrame (df) using the dropna() method and then displaying the cleaned DataFrame
(df_cleaned).
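A minimal sketch of this step follows. The DataFrame contents are hypothetical, since the original data is not shown here; only the names df, df_cleaned, and dropna() come from the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the original df is not shown in this section.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Marks": [85.0, np.nan, 78.0, np.nan],
})

# dropna() returns a new DataFrame without the incomplete rows;
# the original df is left unchanged.
df_cleaned = df.dropna()

print("Original DataFrame:")
print(df)
print("\nCleaned DataFrame:")
print(df_cleaned)
```

Note that the surviving rows keep their original index labels; calling reset_index(drop=True) on the result renumbers them if needed.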
print("\nForward Fill:")
print(forward_fill)
print("\nBackward Fill:")
print(backward_fill)
Note
• Forward fill uses the last valid observation to fill missing values.
• Backward fill uses the next valid observation to fill missing values.
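The two fills can be compared side by side in a short sketch. The 'Marks' values below are made up for illustration, and the use of ffill() and bfill() is an assumption about how forward_fill and backward_fill were produced.

```python
import numpy as np
import pandas as pd

# Hypothetical column with gaps at the start, middle, and end.
df = pd.DataFrame({"Marks": [np.nan, 80.0, np.nan, np.nan, 95.0, np.nan]})

# Forward fill: each gap takes the last valid value before it.
forward_fill = df["Marks"].ffill()

# Backward fill: each gap takes the next valid value after it.
backward_fill = df["Marks"].bfill()

print("\nForward Fill:")
print(forward_fill)
print("\nBackward Fill:")
print(backward_fill)
```

A gap at the very start has no earlier value to copy, so forward fill leaves it as NaN; likewise, backward fill leaves a gap at the very end unfilled.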
3. Interpolation Techniques
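•Estimate missing values from the surrounding data points
using techniques such as linear interpolation or spline
interpolation.
•More sophisticated than mean/median imputation: uses the order of
the observations and the values on either side of a gap, rather than
one summary value for the whole column.
•In pandas, methods beyond linear (such as quadratic or spline) require
SciPy to be installed and cost more to compute.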
2. Quadratic Interpolation
•df['Marks'].interpolate(method='quadratic'): This method performs quadratic
interpolation on the ‘Marks’ column. Missing values are estimated from a
second-order (piecewise quadratic) curve fitted through the non-missing
values. Pandas delegates this method to SciPy, so SciPy must be installed.
•quadratic_interpolation: The variable that stores the resulting Series.
# Interpolation Techniques
linear_interpolation = df['Marks'].interpolate(method='linear')
quadratic_interpolation = df['Marks'].interpolate(method='quadratic')
print("\nLinear Interpolation:")
print(linear_interpolation)
print("\nQuadratic Interpolation:")
print(quadratic_interpolation)
Note:
Linear interpolation assumes a straight line between the two non-missing values on either side of a gap.
Quadratic interpolation fits a smooth second-order curve through the non-missing values, so the estimates can follow curvature in the data.
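The difference between the two methods is easiest to see on data that curves. The sketch below uses made-up 'Marks' values that follow a square pattern (0, 1, 4, 9, 16) with one value removed. The quadratic call is wrapped in a try block because pandas needs SciPy for it, which is an optional dependency.

```python
import numpy as np
import pandas as pd

# Hypothetical values on a curve (squares of 0..4) with the middle one missing.
df = pd.DataFrame({"Marks": [0.0, 1.0, np.nan, 9.0, 16.0]})

# Linear: the gap becomes the midpoint of its two neighbours.
linear_interpolation = df["Marks"].interpolate(method="linear")

# Quadratic: delegated to SciPy, so it is skipped if SciPy is not installed.
try:
    quadratic_interpolation = df["Marks"].interpolate(method="quadratic")
except ImportError:
    quadratic_interpolation = None

print("\nLinear Interpolation:")
print(linear_interpolation)
print("\nQuadratic Interpolation:")
print(quadratic_interpolation)
```

The linear estimate for the gap is 5.0, the average of 1 and 9, which overshoots the underlying square pattern; a quadratic fit can bend with the data and land closer to it.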