Exploratory Data Analysis

The document provides a comprehensive overview of Exploratory Data Analysis (EDA), detailing its importance, steps, and techniques for data transformation and handling missing data. It covers various data types, visualization methods, and the implications of data transformation, including normalization and standardization. Additionally, it discusses methods for identifying and managing missing data, emphasizing the significance of data deduplication in ensuring clean datasets for analysis.


EDA

(EXPLORATORY DATA ANALYSIS)
2

OUTLINES
Introduction to Exploratory Data Analysis (EDA)
Steps in EDA
Data Types
Data Transformation
Introduction to Missing Data, Handling Missing Data
Data Visualization using Matplotlib, Seaborn
3

INTRODUCTION
• EDA is a crucial step in the data analysis process.

• It involves investigating datasets to summarize their main characteristics, often using visual methods. It uses graphs and summary statistics to reveal patterns and anomalies in the data without making prior assumptions or formally testing hypotheses.

• EDA helps in understanding the data structure, detecting anomalies, discovering patterns, and checking assumptions, all of which inform further analysis.

• By examining each variable and how the variables relate to one another, EDA helps us understand what is in the data and decide what to do next.

• It acts as a guide that helps us make sense of all the information we have, so we can make informed decisions.
4
STEPS IN EDA

1. Understand the Data Structure
   - Data Collection
   - Data Description
2. Data Cleaning
   - Handling Missing Values
   - Removing Duplicates
   - Data Correction
   - Outlier Detection
3. Data Transformation
   - Feature Engineering
   - Normalization/Standardization
   - Encoding Categorical Variables
4. Data Visualization
   - Univariate Analysis: histograms, box plots, and density plots
   - Bivariate Analysis: scatter plots, correlation matrices, and line charts
   - Multivariate Analysis: pair plots and heat maps
5. Data Summarization
   - Descriptive Statistics: mean, mode, median
   - Correlation Analysis: correlation, covariance
   - Distribution Analysis: normal distribution
6. Data Interpretation
   - Pattern Recognition
   - Hypothesis Testing
7. Report Findings: summarize insights and findings from the EDA process (a minimal pandas sketch of these steps follows).
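These steps map onto a handful of pandas calls. A minimal sketch, assuming a dataset loaded into a pandas DataFrame (the file name data.csv is only a placeholder):

import pandas as pd

# Hypothetical dataset; in practice df comes from pd.read_csv() or a database query
df = pd.read_csv("data.csv")

df.info()                          # Step 1: structure - column types, non-null counts
print(df.isnull().sum())           # Step 2: cleaning - missing values per column
df = df.drop_duplicates()          # Step 2: cleaning - remove duplicate rows
print(df.describe())               # Step 5: descriptive statistics (mean, std, quartiles)
print(df.corr(numeric_only=True))  # Step 5: correlation analysis on numeric columns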
6
DATA TYPES

1. Numerical Data
   o Continuous
   o Discrete
2. Categorical Data
   o Nominal
   o Ordinal
3. Binary Data
4. Text Data
5. Time Series / Date-Time Data
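In pandas, these data types show up as column dtypes. A small sketch with made-up columns, one per type:

import pandas as pd

df = pd.DataFrame({
    'price': [10.5, 20.0, 15.25],                                        # numerical (continuous)
    'count': [1, 3, 2],                                                  # numerical (discrete)
    'size': pd.Categorical(['S', 'M', 'L'], ordered=True),               # categorical (ordinal)
    'is_member': [True, False, True],                                    # binary
    'comment': ['good', 'ok', 'great'],                                  # text
    'date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03']),  # date/time
})

print(df.dtypes)  # shows how pandas stores each column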


DATA TRANSFORMATION
8

DATA TRANSFORMATION
• Data transformation is a crucial step in the
Exploratory Data Analysis (EDA) process.
• It involves converting data into a suitable
format or structure to improve its quality and
usability.
• This can help to uncover hidden patterns,
identify anomalies, and ensure that the data is
ready for modeling.
9

BENEFITS OF DATA
TRANSFORMATION
• Improved Model Performance
• Enhanced Data Visualizations
• Handling Outliers
• Better Convergence: faster training, stability, better performance
• Dimensionality Reduction
• Better Insights from Feature Engineering
10

DISADVANTAGES
• Information Loss
• Risk of Overfitting
• Data Leakage
• Increased Complexity
• Assumption Violation
11

TRANSFORMATION TECHNIQUES
1. Normalization and Standardization
2. Log Transformation
3. Box-Cox Transformation
4. Square Root and Cube Root Transformations
5. Categorical Encoding
6. Binning or Discretization
7. Handling Missing Values
8. Scaling
9. Feature Engineering
10. Dimensionality Reduction
12

NORMALIZATION AND STANDARDIZATION


• Are two common techniques in machine learning for scaling or
transforming features, especially when working with data that
vary significantly in magnitude or units.
• Proper scaling often improves model performance, particularly
in algorithms that are sensitive to feature magnitudes, such as
gradient-based models and distance-based models.
NORMALIZATION

• Normalization typically refers to scaling the


features to a specific range, usually [0, 1] or [-1, 1].
• This ensures that the values of all features lie
within a uniform range, preserving the relationships
among data points while changing their scale.
• Formula:
  Xnorm = (X − Xmin) / (Xmax − Xmin)
• Where:
  - X is the original feature value,
  - Xmin and Xmax are the minimum and maximum values of the feature.
STANDARDIZATION

• Standardization involves transforming the


features to have a mean of 0 and a standard
deviation of 1. This process shifts the
distribution of the data so that it is centered
around 0, with unit variance. It is particularly
useful when features have different units or are
distributed on vastly different scales.
• Formula:
  Xstd = (X − μ) / σ
• Where:
  - μ is the mean of the feature,
  - σ is the standard deviation of the feature.
15

DISCRETIZATION AND
BINNING
• Binning (or discretization) is the process of
converting continuous data into discrete bins
or categories.

• It is often used in data analysis and


preprocessing to simplify the data, reduce the
effects of small observation errors, and make
patterns more visible.

• Binning can also help in handling skewed data


and outliers, improving the performance of
some machine learning models.
16

TYPES OF BINNING
1. Fixed-width Binning (Equal-width Binning)-
Divides the range of data into intervals (bins) of
equal size.

Example: If you have a continuous range of data from 0 to 100 and you decide to use 5 bins, each bin would have a width of 20 (0-20, 20-40, 40-60, 60-80, 80-100).

2. Quantile Binning (Equal-frequency Binning)-


Divides the data into bins so that each bin contains
an equal number of data points.

Example: For a dataset of 100 values, if you create 4 bins using quantile binning, each bin will have 25 values.
17

3.Custom Binning-
The user defines the bin edges based on
domain knowledge or specific
requirements.

Example: Age categories like 0-18 (child), 19-35 (young adult), 36-60 (middle-aged), 61+ (senior).
4. K-Means Binning-
Uses clustering techniques, like K-Means, to
create bins where data points in each bin
are similar.

This method is less common but can be useful for


creating bins with similar properties.
18

Advantages of Binning
• Simplification
• Handling Outliers
• Improved Model Interpretability
• Data Normalization

Disadvantages of Binning
• Loss of Information
• Choice of Bins
• Introduces Bias

(a short pandas binning sketch follows)
19
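As a rough illustration of the binning types above, pandas offers pd.cut for fixed-width or custom bins and pd.qcut for quantile bins; the values below are made up:

import pandas as pd
import numpy as np

values = pd.Series(np.random.default_rng(0).uniform(0, 100, 20))

# Fixed-width binning: 5 equal-width bins over the observed range
fixed_bins = pd.cut(values, bins=5)

# Quantile binning: 4 bins, each holding roughly the same number of points
quantile_bins = pd.qcut(values, q=4)

# Custom binning: edges chosen from domain knowledge (age groups)
ages = pd.Series([5, 17, 25, 40, 70])
age_groups = pd.cut(ages, bins=[0, 18, 35, 60, 120],
                    labels=['child', 'young adult', 'middle-aged', 'senior'])

print(fixed_bins.value_counts().sort_index())
print(quantile_bins.value_counts().sort_index())
print(age_groups)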
INTRODUCTION TO MISSING DATA, HANDLING MISSING DATA
22

INTRODUCTION TO MISSING DATA


• Missing data is a common issue in data analysis, where certain
values or observations are absent from a dataset.
• This can happen for various reasons, such as human error, data
entry problems, or sensor malfunctions.
• Missing data can significantly affect the quality of the analysis
and the performance of machine learning models.
23

IDENTIFYING MISSING DATA:
• Visualization: Use heatmaps, bar plots, or matrix plots to visualize the distribution and pattern of missing data.

• Summary Statistics: Use functions like .isnull().sum() in pandas to count missing values per column (a short sketch of both follows).
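A short sketch of both approaches, assuming pandas, seaborn, and matplotlib are installed (the Age/Salary values echo the toy data used on the following slides):

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'Age': [24, np.nan, 22, 30, np.nan],
                   'Salary': [50000, 54000, np.nan, 60000, 59000]})

# Summary statistics: count of missing values per column
print(df.isnull().sum())

# Visualization: heatmap of the boolean missing-value mask
sns.heatmap(df.isnull(), cbar=False)
plt.title('Missing-value pattern')
plt.show()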
24

HANDLING MISSING DATA:

• Removing Missing Data:

• Dropping Rows: Remove rows with missing data if the


proportion of missing data is small.

• Dropping Columns: Remove columns with a high


proportion of missing values, especially if they are not
critical to the analysis.
• Example- df.dropna()
25
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer  # used later for imputation

# Create a sample DataFrame with missing values
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [24, np.nan, 22, 30, np.nan],
        'Salary': [50000, 54000, np.nan, 60000, 59000]}

df = pd.DataFrame(data)

# Show the data with missing values
print(df)
26

OUTPUT
27

# Option 1: Drop rows with missing values
df_dropped = df.dropna()

print("\nAfter Dropping Rows with Missing Values:")
print(df_dropped)

OUTPUT
28

DROP COLUMN

#Drop the 'Salary' column


df_dropped = df.drop('Salary', axis=1)
print(df_dropped)
29

# Option 2: Fill missing values with a specific value (e.g., mean, median)

# Filling missing 'Age' with the mean value
df['Age_filled'] = df['Age'].fillna(df['Age'].mean())

# Filling missing 'Salary' with the median value
df['Salary_filled'] = df['Salary'].fillna(df['Salary'].median())

print("\nAfter Filling Missing Values:")
print(df)
30

IMPUTATION:
• Mean/Median/Mode Imputation: Replace missing values with the mean, median, or mode of the column. This is simple but can introduce bias.

• Forward/Backward Fill: Fill missing values using the previous (forward fill) or next (backward fill) values. This is useful for time-series data.

• Interpolation: Estimate missing values by interpolating between existing values.

• Example: df.fillna(df.mean()) in pandas fills missing values with the column means (a scikit-learn SimpleImputer sketch follows).
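The SimpleImputer imported on the earlier slide can perform the same mean/median imputation in scikit-learn. A minimal sketch, reusing the toy Age/Salary columns:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'Age': [24, np.nan, 22, 30, np.nan],
                   'Salary': [50000, 54000, np.nan, 60000, 59000]})

# Replace missing values with the column mean; strategy can also be 'median' or 'most_frequent'
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_imputed)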


31

DATA VISUALIZATION USING MATPLOTLIB, SEABORN
• Line plot
• Scatter Plot
• Heat Map
• Box plot
(a minimal sketch using these four plot types follows)
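A minimal matplotlib/seaborn sketch covering the four plot types, with small made-up arrays:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(1)
df = pd.DataFrame({'x': np.arange(10),
                   'y': np.arange(10) + rng.normal(0, 1, 10)})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

axes[0, 0].plot(df['x'], df['y'])                  # line plot
axes[0, 0].set_title('Line plot')

axes[0, 1].scatter(df['x'], df['y'])               # scatter plot
axes[0, 1].set_title('Scatter plot')

sns.heatmap(df.corr(), annot=True, ax=axes[1, 0])  # heat map of the correlation matrix
axes[1, 0].set_title('Heat map')

sns.boxplot(y=df['y'], ax=axes[1, 1])              # box plot
axes[1, 1].set_title('Box plot')

plt.tight_layout()
plt.show()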
34

NORMALIZATION AND STANDARDIZATION
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample dataset
data = {'Age': [25, 30, 22, 35, 28],
        'Salary': [50000, 60000, 52000, 80000, 55000]}

df = pd.DataFrame(data)

# Display original data
print("Original Data:")
print(df)

OUTPUT
   Age  Salary
0   25   50000
1   30   60000
2   22   52000
3   35   80000
4   28   55000
NORMALIZATION

# === Normalization using MinMaxScaler === #
# Create a MinMaxScaler object
min_max_scaler = MinMaxScaler()

# Apply normalization to the dataset
normalized_data = min_max_scaler.fit_transform(df)

# Convert the result back to a DataFrame
df_normalized = pd.DataFrame(normalized_data, columns=df.columns)

print("\nNormalized Data (MinMaxScaler):")
print(df_normalized)

OUTPUT
        Age    Salary
0  0.230769  0.000000
1  0.615385  0.500000
2  0.000000  0.100000
3  1.000000  1.000000
4  0.461538  0.166667
37

STANDARDIZATION

# === Standardization using StandardScaler === #
# Create a StandardScaler object
standard_scaler = StandardScaler()

# Apply standardization to the dataset
standardized_data = standard_scaler.fit_transform(df)

# Convert the result back to a DataFrame
df_standardized = pd.DataFrame(standardized_data, columns=df.columns)

print("\nStandardized Data (StandardScaler):")
print(df_standardized)

OUTPUT
        Age    Salary
0 -0.707107 -1.149082
1  0.707107 -0.229816
2 -1.414214 -1.034466
3  1.414214  1.609073
4  0.000000 -0.804291
FORWARD FILL
38

import pandas as pd
import numpy as np

# Sample DataFrame with missing values


data = {'Date': ['2023-10-01', '2023-10-02', '2023-10-03', '2023-10-04', '2023-10-05'],
'Value': [10, np.nan, 20, np.nan, 30]}

df = pd.DataFrame(data)

# Display original DataFrame


print("Original DataFrame:")
print(df)

# Apply forward fill to fill missing values
# (fillna(method='ffill') is deprecated in newer pandas; df.ffill() is equivalent)
df_ffill = df.ffill()

# Display DataFrame after forward fill


print("\nDataFrame After Forward Fill:")
print(df_ffill)
39

Original DataFrame:
Date Value
0 2023-10-01 10.0
1 2023-10-02 NaN
2 2023-10-03 20.0
3 2023-10-04 NaN
4 2023-10-05 30.0

DataFrame After Forward Fill:


Date Value
0 2023-10-01 10.0
1 2023-10-02 10.0
2 2023-10-03 20.0
3 2023-10-04 20.0
4 2023-10-05 30.0
BACKWARD FILL
40

import pandas as pd
import numpy as np

# Sample DataFrame with missing values


data = {'Date': ['2023-10-01', '2023-10-02', '2023-10-03', '2023-10-04', '2023-10-05'],
        'Value': [10, np.nan, 20, np.nan, 30]}

df = pd.DataFrame(data)

# Display original DataFrame


print("Original DataFrame:")
print(df)

# Apply backward fill to fill missing values
# (fillna(method='bfill') is deprecated in newer pandas; df.bfill() is equivalent)
df_bfill = df.bfill()

# Display DataFrame after backward fill
print("\nDataFrame After Backward Fill:")
print(df_bfill)
41

OUTPUT
Original DataFrame:
Date Value
0 2023-10-01 10.0
1 2023-10-02 NaN
2 2023-10-03 20.0
3 2023-10-04 NaN
4 2023-10-05 30.0

DataFrame After Backward Fill:


Date Value
0 2023-10-01 10.0
1 2023-10-02 20.0
2 2023-10-03 20.0
3 2023-10-04 30.0
4 2023-10-05 30.0
42

INTERPOLATE MISSING
VALUES

In machine learning, interpolating missing


values involves filling in the missing data points
by estimating their values based on the
surrounding data. This can be especially useful
when you have time-series or continuous data,
where missing values can be inferred from
existing data points.
43

INTERPOLATION TECHNIQUES

There are several interpolation techniques, including:


1. Linear Interpolation: Assumes a straight line between two known points.
2. Polynomial Interpolation: Fits a polynomial function between the points.
3. Spline Interpolation: Fits a spline (piecewise polynomial) between points.
4. Time-based Interpolation: If the index of your data is a time series, this method takes the time spacing into account.
44

import pandas as pd
import numpy as np

# Create a sample dataframe with missing values
data = {'A': [1, 2, np.nan, 4, 5],
        'B': [5, np.nan, np.nan, 8, 10]}

df = pd.DataFrame(data)

print("Original Data:")
print(df)

# Perform linear interpolation
df_interpolated = df.interpolate(method='linear')

print("\nInterpolated Data (Linear):")
print(df_interpolated)
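The other techniques listed above follow the same pattern. A hedged sketch of time-based and spline interpolation (spline requires SciPy to be installed):

import pandas as pd
import numpy as np

# Time-based interpolation: weights estimates by the actual gaps between timestamps
idx = pd.to_datetime(['2023-10-01', '2023-10-02', '2023-10-05'])
s = pd.Series([10.0, np.nan, 30.0], index=idx).resample('D').asfreq()
print(s.interpolate(method='time'))

# Spline interpolation: fits a piecewise polynomial of the given order (needs SciPy)
df2 = pd.DataFrame({'A': [1.0, 2.0, np.nan, 4.0, 5.0]})
print(df2.interpolate(method='spline', order=2))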
45

DATA DEDUPLICATION
In Machine Learning (ML), data deduplication refers to
the process of identifying and removing duplicate data
entries from the dataset to ensure that the training data
is unique, clean, and efficient for model training.
Redundant or duplicate data can lead to biased models,
inefficient training processes, and increased
computational overhead.
By eliminating duplicates, deduplication enhances the
quality of the dataset, reduces storage requirements,
and improves the model's generalization ability.
COMMON TECHNIQUES USED FOR DATA DEDUPLICATION

1. Exact Match Deduplication


2. Near-Duplicate Matching (Fuzzy Matching)
3. Hash-Based Deduplication
4. Feature-Based Deduplication
5. Clustering-Based Deduplication
6. Cross-Dataset Deduplication
7. Image-Based Deduplication (for Computer Vision)
# Example in pandas
df.drop_duplicates(inplace=True)
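A small sketch of exact-match and hash-based deduplication in pandas; the columns are made up, and fuzzy (near-duplicate) matching would need an extra library such as recordlinkage or thefuzz:

import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Alice', 'bob'],
                   'email': ['a@x.com', 'b@x.com', 'a@x.com', 'b@x.com']})

# Exact-match deduplication: rows identical in every column are dropped
exact = df.drop_duplicates()

# Hash-based deduplication: hash a normalized key and keep the first row per hash
# (Python's built-in hash stands in for a stronger hash such as hashlib.sha256)
key_hash = df['email'].str.lower().apply(hash)
deduped = df[~key_hash.duplicated()]

print(exact)
print(deduped)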