EXPLORATORY DATA ANALYSIS (EDA)
OUTLINE
Introduction to Exploratory Data Analysis (EDA)
Steps in EDA
Data Types
Data Transformation
Introduction to Missing Data, Handling Missing Data
Data Visualization Using Matplotlib and Seaborn
INTRODUCTION
• EDA is a crucial step in the data analysis process.
• By examining each piece of data and how the pieces relate to one another, EDA helps us understand what is in the data and decide what to do next.
• It acts as a guide that helps us make sense of all the information we have, so we can make informed decisions.
STEPS IN EDA
DATA TRANSFORMATION
• Data transformation is a crucial step in the Exploratory Data Analysis (EDA) process.
• It involves converting data into a suitable format or structure to improve its quality and usability.
• This can help to uncover hidden patterns, identify anomalies, and ensure that the data is ready for modeling.
BENEFITS OF DATA TRANSFORMATION
• Improved Model Performance
• Enhanced Data Visualizations
• Handling Outliers
• Better Convergence: faster training, greater stability, better performance
• Dimensionality Reduction
• Better Insights from Feature Engineering
DISADVANTAGES
• Information Loss
• Risk of Overfitting
• Data Leakage
• Increased Complexity
• Assumption Violation
TRANSFORMATION TECHNIQUES
1. Normalization and Standardization
2. Log Transformation
3. Box-Cox Transformation
4. Square Root and Cube Root Transformations
5. Categorical Encoding
6. Binning or Discretization
7. Handling Missing Values
8. Scaling
9. Feature Engineering
10. Dimensionality Reduction
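As a quick illustration of one technique from this list, a minimal sketch of a log transformation in pandas; the skewed 'Income' column and its values are assumptions for illustration only:

import numpy as np
import pandas as pd

# Hypothetical right-skewed feature (assumed example data)
df = pd.DataFrame({'Income': [20000, 35000, 50000, 120000, 1000000]})

# log1p computes log(1 + x), which also handles zeros safely
df['Income_log'] = np.log1p(df['Income'])
print(df)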
DISCRETIZATION AND BINNING
• Binning (or discretization) is the process of converting continuous data into discrete bins or categories.
TYPES OF BINNING
1. Fixed-width Binning (Equal-width Binning): divides the range of the data into intervals (bins) of equal size.
2. Quantile Binning (Equal-frequency Binning): divides the data so that each bin contains approximately the same number of observations.
3. Custom Binning: the user defines the bin edges based on domain knowledge or specific requirements (see the sketch after this list).
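A minimal sketch of fixed-width and custom binning with pandas; the ages, bin edges, and labels are assumptions for illustration:

import pandas as pd

ages = pd.Series([5, 17, 24, 33, 46, 58, 72])

# Fixed-width binning: 4 equal-width bins over the data's range
equal_width = pd.cut(ages, bins=4)

# Custom binning: user-defined edges based on domain knowledge
custom = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                labels=['child', 'young', 'middle-aged', 'senior'])

print(pd.DataFrame({'age': ages, 'equal_width': equal_width, 'custom': custom}))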
Advantages of Binning:
• Simplification
• Handling Outliers
• Improved Model Interpretability
• Data Normalization

Disadvantages of Binning:
• Loss of Information
• Choice of Bins
• Introduces Bias
INTRODUCTION TO MISSING DATA, HANDLING MISSING DATA
IDENTIFYING MISSING DATA
• Visualization: use heatmaps, bar plots, or matrix plots to visualize the distribution and pattern of missing data.
import pandas as pd
import numpy as np

# Sample dataset with missing values (values assumed for illustration)
data = {'Age': [25, np.nan, 22, 35, np.nan],
        'Salary': [50000, 60000, np.nan, 80000, 55000]}
df = pd.DataFrame(data)

# Count missing values per column
print(df.isnull().sum())
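A minimal sketch of the heatmap approach mentioned above, applied to the same df; the styling choices are assumptions:

import seaborn as sns
import matplotlib.pyplot as plt

# Each True cell of df.isnull() (a missing value) shows as a distinct color
sns.heatmap(df.isnull(), cbar=False)
plt.show()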
DROP ROWS WITH MISSING VALUES

# Remove every row that contains at least one NaN
df_dropped = df.dropna()
print(df_dropped)
DROP COLUMN
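A minimal sketch of dropping columns that contain missing values, using pandas' dropna with axis=1 on the same df:

# Remove every column that contains at least one NaN
# (here both Age and Salary contain NaN, so both would be dropped)
df_no_missing_cols = df.dropna(axis=1)
print(df_no_missing_cols)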
IMPUTATION
• Mean/Median/Mode Imputation: replace missing values with the mean, median, or mode of the column. This is simple but can introduce bias.

# Fill numeric gaps with a central statistic of each column
df['Age_filled'] = df['Age'].fillna(df['Age'].mean())
df['Salary_filled'] = df['Salary'].fillna(df['Salary'].median())
NORMALIZATION AND STANDARDIZATION
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample dataset
data = {'Age': [25, 30, 22, 35, 28],
        'Salary': [50000, 60000, 52000, 80000, 55000]}
df = pd.DataFrame(data)

# Display original data
print("Original Data:")
print(df)

OUTPUT:
   Age  Salary
0   25   50000
1   30   60000
2   22   52000
3   35   80000
4   28   55000
NORMALIZATION
MinMaxScaler rescales each feature to the [0, 1] range via x' = (x - min) / (max - min).

# Create a MinMaxScaler object
min_max_scaler = MinMaxScaler()

# Apply normalization to the dataset
normalized_data = min_max_scaler.fit_transform(df)

# Convert the result back to a DataFrame
df_normalized = pd.DataFrame(normalized_data, columns=df.columns)

print("\nNormalized Data (MinMaxScaler):")
print(df_normalized)

OUTPUT:
        Age    Salary
0  0.230769  0.000000
1  0.615385  0.333333
2  0.000000  0.066667
3  1.000000  1.000000
4  0.461538  0.166667
STANDARDIZATION
StandardScaler centers each feature and scales it to unit variance via z = (x - mean) / std.

# === Standardization using StandardScaler === #
# Create a StandardScaler object
standard_scaler = StandardScaler()

# Apply standardization to the dataset
standardized_data = standard_scaler.fit_transform(df)

# Convert the result back to a DataFrame
df_standardized = pd.DataFrame(standardized_data, columns=df.columns)

print("\nStandardized Data (StandardScaler):")
print(df_standardized)

OUTPUT:
        Age    Salary
0 -0.677631 -0.867401
1  0.451754  0.055366
2 -1.355262 -0.682848
3  1.581139  1.900900
4  0.000000 -0.406017
FORWARD FILL

import pandas as pd
import numpy as np

# Sample time series with missing values
data = {'Date': ['2023-10-01', '2023-10-02', '2023-10-03',
                 '2023-10-04', '2023-10-05'],
        'Value': [10.0, np.nan, 20.0, np.nan, 30.0]}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

OUTPUT:
         Date  Value
0  2023-10-01   10.0
1  2023-10-02    NaN
2  2023-10-03   20.0
3  2023-10-04    NaN
4  2023-10-05   30.0
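The fill step itself, a minimal sketch using pandas' ffill, where each NaN takes the last valid value above it:

# Propagate the last valid observation forward into the gaps
df['Value_ffill'] = df['Value'].ffill()
print(df)

# Value_ffill becomes: 10.0, 10.0, 20.0, 20.0, 30.0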
INTERPOLATE MISSING VALUES
import pandas as pd
import numpy as np

# Sample data with gaps (values assumed; same series as the forward-fill example)
data = {'Value': [10.0, np.nan, 20.0, np.nan, 30.0]}
df = pd.DataFrame(data)
print("Original Data:")
print(df)
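A minimal sketch of linear interpolation on this df; method='linear' is the pandas default, and other methods such as 'polynomial' or 'time' exist:

# Fill each NaN with a value evenly spaced between its neighbors
df['Value_interp'] = df['Value'].interpolate(method='linear')
print(df)

# Value_interp becomes: 10.0, 15.0, 20.0, 25.0, 30.0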
DATA DEDUPLICATION
In Machine Learning (ML), data deduplication refers to the process of identifying and removing duplicate data entries from the dataset to ensure that the training data is unique, clean, and efficient for model training.
Redundant or duplicate data can lead to biased models, inefficient training processes, and increased computational overhead.
By eliminating duplicates, deduplication enhances the quality of the dataset, reduces storage requirements, and improves the model's generalization ability.
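A minimal sketch of exact-duplicate removal with pandas, one common technique; the toy rows are assumptions for illustration:

import pandas as pd

df = pd.DataFrame({'Name': ['Ana', 'Ben', 'Ana', 'Cy'],
                   'Age': [25, 30, 25, 41]})

# Keep the first occurrence of each fully identical row
df_unique = df.drop_duplicates(keep='first')
print(df_unique)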
COMMON TECHNIQUES USED FOR DATA DEDUPLICATION