Experiment-12
Exploratory Data Analysis for Classification using Pandas or Matplotlib.
Introduction to EDA
[Figure: Exploratory Data Analysis (EDA) diagram with the labels Data Preparation, Structure, Data Exploration, Insight, Data Analysis, Reports, and Visual Graphs]
Exploratory Data Analysis (EDA) is a critical step in the data analysis process, which
involves examining and visualizing data to gain insights and uncover patterns,
anomalies, and relationships within the dataset. EDA helps data analysts and
scientists understand the data they are working with before proceeding to more
advanced analytics or modeling. Below is a detailed explanation of the key steps
involved in EDA:
1. Data Collection:
Gather the dataset from various sources, such as databases, CSV files, APIs, or web
scraping.
Ensure that the data is structured and organized for analysis.
2. Data Loading:
Import the dataset into your preferred data analysis environment, such as Python
using libraries like Pandas.
3. Initial Data Inspection:
Examine the first few rows of the dataset to get a sense of its structure and content.
Check the data types, column names, and missing values.
4. Data Cleaning:
Handle missing values by either imputing them or removing rows/columns with
missing data.
Correct data inconsistencies and errors, such as typos and outliers.
Ensure that data types are appropriate for each column (e.g., numeric, categorical).
5. Descriptive Statistics:
Calculate basic statistics for numerical variables, including mean, median, standard
deviation, and quartiles.
Understand the central tendencies and spread of the data.
6. Univariate Analysis:
Visualize the distribution of individual variables through histograms, density plots,
box plots, or bar charts.
Identify outliers and anomalies.
7. Bivariate and Multivariate Analysis:
Explore relationships between pairs of variables through scatter plots, heatmaps, or
correlation matrices.
Investigate how variables interact with each other.
Identify potential predictors for the target variable in a classification or regression
task.
8. Data Visualization:
Create meaningful visualizations such as line plots, bar charts, pie charts, and box
plots to represent data patterns.
Use color and labels to make visualizations more interpretable.
9. Feature Engineering:
Create new features based on domain knowledge or insights from the EDA.
Transform variables to better suit the modeling algorithms.
10. Outlier Detection:
Identify and handle outliers that may affect the quality of the analysis or model.
Consider whether outliers should be removed or transformed.
11. Categorical Variable Analysis:
Analyze categorical variables using frequency tables, bar plots, or stacked bar charts.
Understand the distribution of categories within each variable.
12. Time Series Analysis (if applicable):
For time series data, examine trends, seasonality, and autocorrelation.
Decompose time series data to better understand its components.
13. Hypothesis Testing (if applicable):
Perform statistical tests to validate or reject hypotheses about the data.
Common tests include t-tests, chi-squared tests, and ANOVA.
14. Summary and Insights:
Summarize the key findings from the EDA process.
Document interesting patterns, relationships, and potential insights.
15. Data Visualization and Reporting:
Create clear and informative data visualizations for reporting and presentation.
Communicate the results and insights effectively to stakeholders.
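The data cleaning and outlier detection steps listed above can be sketched in Pandas as follows. This is a minimal illustration, not the experiment's prescribed method: the small DataFrame is made-up toy data, median imputation is one choice among several, and the 1.5 * IQR rule is a common convention rather than the only way to flag outliers.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset (values are made up).
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, None, 5.0, 5.4, 12.0],
    "Species": ["setosa", "setosa", "setosa", None, "setosa", "setosa"],
})

# Data cleaning: impute numeric gaps with the median, drop rows lacking a label.
df["SepalLengthCm"] = df["SepalLengthCm"].fillna(df["SepalLengthCm"].median())
df = df.dropna(subset=["Species"])

# Make the categorical column's data type explicit.
df["Species"] = df["Species"].astype("category")

# Outlier detection: flag values outside 1.5 * IQR of the quartiles.
q1 = df["SepalLengthCm"].quantile(0.25)
q3 = df["SepalLengthCm"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["SepalLengthCm"] < q1 - 1.5 * iqr) | (df["SepalLengthCm"] > q3 + 1.5 * iqr)

outliers = df[is_outlier]   # rows flagged for review
cleaned = df[~is_outlier]   # rows kept if the outliers are removed
print(outliers)
```

Whether flagged rows are removed, capped, or kept depends on the dataset; the sketch only separates them so that decision can be made explicitly.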
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load your dataset (replace '/content/Iris.csv' with your dataset's file path)
data = pd.read_csv('/content/Iris.csv')
# Display the first few rows of the dataset to get an overview
print(data.head())
# Summary statistics for numeric columns
print(data.describe())
# Missing value analysis
print("\nMissing Values:")
print(data.isnull().sum())
# Explore missing data
plt.figure(figsize=(8, 6))
sns.heatmap(data.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Data')
plt.show()
# Class distribution for classification
class_counts = data['Species'].value_counts()
print(class_counts)
# Visualization of class distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='Species', data=data)
plt.title('Class Distribution')
plt.xlabel('Target Class')
plt.ylabel('Count')
plt.show()
# Import label encoder
from sklearn import preprocessing
# LabelEncoder maps each string class label to an integer code.
label_encoder = preprocessing.LabelEncoder()
# Encode labels in column 'Species'.
data['Species'] = label_encoder.fit_transform(data['Species'])
print(data['Species'].unique())
# Correlation matrix for numeric features
correlation_matrix = data.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
# Pairplot to visualize relationships between numerical features
sns.pairplot(data, hue='Species')
plt.show()
# Box plots for numerical features vs. target variable
plt.figure(figsize=(12, 8))
for i, feature in enumerate(data.columns[:-1]):
    plt.subplot(2, 3, i + 1)
    sns.boxplot(data=data, x='Species', y=feature)
    plt.title(f'{feature} vs. Target')
plt.tight_layout()
plt.show()
# Box plots for numeric features by class
plt.figure(figsize=(12, 8))
sns.boxplot(x='Species', y='SepalLengthCm', data=data)
plt.title('Box Plot of SepalLengthCm by Class')
plt.xlabel('Target Class')
plt.ylabel('Feature1')
plt.show()
# Histograms for numeric features
data.hist(bins=20, figsize=(12, 8))
plt.suptitle('Histograms of Numeric Features', y=1.02)
plt.show()
# Distribution plots for numerical features
numerical_features = data.select_dtypes(include=['int64', 'float64']).columns
plt.figure(figsize=(12, 8))
for i, feature in enumerate(numerical_features):
    plt.subplot(2, 3, i + 1)
    sns.histplot(data=data, x=feature, kde=True)
    plt.title(f'{feature} Distribution')
plt.tight_layout()
plt.show()
# Scatter plot for feature relationships
plt.figure(figsize=(8, 6))
sns.scatterplot(data=data, x='SepalLengthCm', y='SepalWidthCm', hue='Species')
plt.title("Scatter Plot between Feature1 and Feature2")
plt.show()
# Pairwise feature correlation with the target variable
correlation_with_target = data.corr()['Species'].abs().sort_values(ascending=False)
print("\nFeature Correlation with Target:")
print(correlation_with_target)
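The listing above does not touch the feature engineering and categorical variable analysis steps from the list of EDA steps. The sketch below shows one possible way to approach them in Pandas. The rows are illustrative values that only borrow the Iris.csv column names, and `PetalArea` is a hypothetical feature name introduced here for demonstration; it is not part of the dataset or the original experiment.

```python
import pandas as pd

# Hypothetical rows using the Iris.csv column names; values are illustrative only.
df = pd.DataFrame({
    "PetalLengthCm": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "PetalWidthCm": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
                "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# Feature engineering: derive a new column from two existing ones.
# PetalArea is an assumed name chosen for this sketch.
df["PetalArea"] = df["PetalLengthCm"] * df["PetalWidthCm"]

# Categorical variable analysis: frequency table of the target as proportions.
class_share = df["Species"].value_counts(normalize=True)

# Per-class mean of the engineered feature, to see whether it separates classes.
area_by_class = df.groupby("Species")["PetalArea"].mean()

print(class_share)
print(area_by_class)
```

A derived feature is worth keeping only if it separates the classes better than its inputs do, which the per-class summary helps to judge before any model is trained.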