Introduction To EDA: Exploratory Data Analysis (EDA) in Data Science

Exploratory Data Analysis (EDA) is essential in data science for summarizing datasets, identifying patterns, and detecting anomalies. The process involves steps such as loading data, handling missing values, visualizing data, and feature engineering to improve data quality. EDA ultimately enhances model accuracy by ensuring a thorough understanding of the data before applying predictive models.


Exploratory Data Analysis (EDA) in Data Science

1. Introduction to EDA
Exploratory Data Analysis (EDA) is a fundamental step in data science and machine
learning that involves analyzing datasets to summarize their key characteristics, identify
patterns, and detect anomalies before applying predictive models.

Objectives of EDA:

- Understand data structure and patterns.
- Identify missing values, outliers, and inconsistencies.
- Discover relationships between variables.
- Validate assumptions before building models.
- Improve data quality through feature engineering.

2. Steps in Exploratory Data Analysis


Step                      Description
Load Data                 Import the dataset using Pandas
Understand Structure      View column types, missing values, and basic stats
Handle Missing Values     Remove or fill NaNs (mean, median, mode)
Remove Duplicates         Identify and drop duplicate rows
Visualize Data            Histograms, boxplots, scatter plots, heatmaps
Outlier Detection         Use the IQR method or boxplots
Handle Categorical Data   Convert to numeric format (one-hot, label encoding)
Feature Engineering       Create new features and scale data
Save Cleaned Data         Store the processed dataset for modeling
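The steps in the table can be sketched as a single short pipeline. The toy columns ("age", "city") and values below are illustrative stand-ins for the real dataset, not from the original:

```python
import pandas as pd

# Toy dataset standing in for "data.csv" (illustrative columns and values)
df = pd.DataFrame({
    "age": [25, 30, None, 30, 120],
    "city": ["NY", "LA", "NY", "LA", None],
})

df = df.drop_duplicates()                                   # drop duplicate rows
df["age"] = df["age"].fillna(df["age"].mean())              # fill numeric NaNs with the mean
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])   # fill categorical NaNs with the mode
df = pd.get_dummies(df, columns=["city"], drop_first=True)  # one-hot encode categoricals
df.to_csv("cleaned_data.csv", index=False)                  # save for modeling
print(df.shape)  # (4, 2): one duplicate row dropped, "city" replaced by a dummy column
```

Each line here corresponds to one row of the table; the sections below walk through the same steps in detail.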

Step 1: Load the Dataset

- Import the necessary libraries and read the dataset.

import pandas as pd

df = pd.read_csv("data.csv") # Replace with actual file path


print(df.head()) # Display first five rows

Step 2: Understand Data Structure

- View column types, null values, and basic information.


print(df.info()) # Column names, data types, non-null values
print(df.describe()) # Summary statistics (mean, median, etc.)

3. Handling Missing Data


Missing data can impact model accuracy. Common techniques to handle missing values:

- Remove rows with missing values: df.dropna()
- Fill missing values with the mean, median, or mode:

df.fillna(df.mean(numeric_only=True), inplace=True)  # Fill numerical NaNs with mean
df.fillna(df.mode().iloc[0], inplace=True)           # Fill remaining (categorical) NaNs with mode
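A quick check of the fill strategy above on a toy frame (column names and values are illustrative, not from the original dataset):

```python
import pandas as pd

df = pd.DataFrame({"score": [10.0, None, 30.0], "grade": ["A", "B", None]})

# Mean fill for the numeric column: mean of 10 and 30 is 20
df["score"] = df["score"].fillna(df["score"].mean())
# Mode fill for the categorical column
df["grade"] = df["grade"].fillna(df["grade"].mode().iloc[0])

print(df["score"].tolist())  # [10.0, 20.0, 30.0]
```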

4. Handling Duplicate Data


- Detect and remove duplicate rows to avoid redundancy.

print("Duplicates:", df.duplicated().sum()) # Count duplicate rows


df.drop_duplicates(inplace=True) # Remove duplicates
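On a toy frame (illustrative values), the two calls above behave like this:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "val": ["a", "b", "b", "c"]})

print("Duplicates:", df.duplicated().sum())  # 1: the second (2, "b") row repeats the first
df = df.drop_duplicates()                    # keeps the first occurrence of each row
print(len(df))                               # 3
```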

5. Data Visualization for EDA


A. Univariate Analysis (Single Variable)

1. Histogram (data distribution)
   - Helps understand the spread of numerical features.

import matplotlib.pyplot as plt
df["column_name"].hist(bins=30)
plt.show()

2. Boxplot (outlier detection)
   - Shows quartiles and outliers.

import seaborn as sns
sns.boxplot(df["column_name"])
plt.show()

B. Bivariate Analysis (Two Variables)

1. Scatter plot (correlation between two features)
   - Used for continuous variables.

sns.scatterplot(x="feature1", y="feature2", data=df)
plt.show()

2. Correlation heatmap
   - Shows relationships between numerical variables.

plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()

3. Pairplot
   - Visualizes pairwise relationships.

sns.pairplot(df)
plt.show()

6. Outlier Detection and Handling


A. Using IQR (Interquartile Range) Method

- Remove data points beyond 1.5 times the IQR from the quartiles.

num = df.select_dtypes(include="number")  # IQR comparisons only apply to numeric columns
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1
df_cleaned = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]
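On a single toy series (illustrative values), the IQR rule keeps the bulk of the data and drops the extreme point:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 100])  # 100 is an obvious outlier

Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
mask = (s >= Q1 - 1.5 * IQR) & (s <= Q3 + 1.5 * IQR)  # keep points inside the fences
print(s[mask].tolist())  # [10, 12, 11, 13, 12]
```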

7. Handling Categorical Data


A. Encoding Categorical Variables

1. One-Hot Encoding (Best for nominal categories)

df = pd.get_dummies(df, columns=["categorical_column"], drop_first=True)
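A toy example of what get_dummies produces (column and category names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})

# drop_first=True drops the first category (alphabetically, "blue")
# to avoid redundant, perfectly correlated dummy columns
encoded = pd.get_dummies(df, columns=["color"], drop_first=True)
print(list(encoded.columns))  # ['color_red']
```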

2. Label Encoding (assigns integer codes; note that LabelEncoder orders categories alphabetically, not by meaning)

from sklearn.preprocessing import LabelEncoder


encoder = LabelEncoder()
df["encoded_column"] = encoder.fit_transform(df["categorical_column"])
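Because LabelEncoder assigns codes in alphabetical order, a truly ordinal column such as Low/Medium/High is better handled with an explicit mapping (column name here is illustrative), so the codes respect Low < Medium < High:

```python
import pandas as pd

df = pd.DataFrame({"priority": ["Low", "High", "Medium", "Low"]})

order = {"Low": 0, "Medium": 1, "High": 2}  # explicit ordinal order
df["priority_encoded"] = df["priority"].map(order)
print(df["priority_encoded"].tolist())  # [0, 2, 1, 0]
```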

8. Feature Engineering

- Create new, meaningful features to improve model performance.

A. Creating a New Feature


df["new_feature"] = df["feature1"] * df["feature2"]

B. Feature Scaling

1. Min-Max Scaling (rescale to range 0-1)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df.select_dtypes(include="number"))  # scale numeric columns only

2. Standardization (mean = 0, standard deviation = 1)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df.select_dtypes(include="number"))  # scale numeric columns only
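A minimal check of Min-Max scaling on one toy numeric column (illustrative values): the minimum maps to 0, the maximum to 1, and everything else falls in between.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"x": [0.0, 5.0, 10.0]})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)  # (x - min) / (max - min)
print(scaled.ravel().tolist())  # [0.0, 0.5, 1.0]
```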
9. Saving the Cleaned Dataset

df.to_csv("cleaned_data.csv", index=False)

EDA is a crucial step in data science that ensures data quality and model accuracy. By
exploring and visualizing the dataset, we can make informed decisions before applying
machine learning models.
