The document outlines the process of Exploratory Data Analysis (EDA) for classification using Pandas and Matplotlib, detailing key steps such as data collection, cleaning, visualization, and feature engineering. It provides practical code examples for loading a dataset, inspecting data, handling missing values, and visualizing relationships between variables. The document emphasizes the importance of EDA in understanding data before advanced analytics or modeling.


Experiment-12:

Exploratory Data Analysis for Classification using Pandas and Matplotlib

Introduction to EDA

Exploratory Data Analysis (EDA) is a critical step in the data analysis process, which
involves examining and visualizing data to gain insights and uncover patterns, anomalies,
and relationships within the dataset. EDA helps data analysts and scientists understand the
data they are working with before proceeding to more advanced analytics or modeling.
Below is a detailed explanation of the key steps involved in EDA:
1. Data Collection:
Gather the dataset from various sources, such as databases, CSV files, APIs, or web
scraping.
Ensure that the data is structured and organized for analysis.
2. Data Loading:
Import the dataset into your preferred data analysis environment, such as Python using
libraries like Pandas.
3. Initial Data Inspection:
Examine the first few rows of the dataset to get a sense of its structure and content.
Check the data types, column names, and missing values.
4. Data Cleaning:
Handle missing values by either imputing them or removing rows/columns with missing
data.
Correct data inconsistencies and errors, such as typos and outliers.
Ensure that data types are appropriate for each column (e.g., numeric, categorical).
5. Descriptive Statistics:
Calculate basic statistics for numerical variables, including mean, median, standard
deviation, and quartiles.
Understand the central tendencies and spread of the data.
6. Univariate Analysis:
Visualize the distribution of individual variables through histograms, density plots, box
plots, or bar charts.
Identify outliers and anomalies.
7. Bivariate and Multivariate Analysis:
Explore relationships between pairs of variables through scatter plots, heatmaps, or
correlation matrices.
Investigate how variables interact with each other.
Identify potential predictors for the target variable in a classification or regression task.
8. Data Visualization:
Create meaningful visualizations such as line plots, bar charts, pie charts, and box plots to
represent data patterns.
Use color and labels to make visualizations more interpretable.
9. Feature Engineering:
Create new features based on domain knowledge or insights from the EDA.
Transform variables to better suit the modeling algorithms.
10. Outlier Detection:
Identify and handle outliers that may affect the quality of the analysis or model.
Consider whether outliers should be removed or transformed.
11. Categorical Variable Analysis:
Analyze categorical variables using frequency tables, bar plots, or stacked bar charts.
Understand the distribution of categories within each variable.
12. Time Series Analysis (if applicable):
For time series data, examine trends, seasonality, and autocorrelation.
Decompose time series data to better understand its components.
13. Hypothesis Testing (if applicable):
Perform statistical tests to validate or reject hypotheses about the data.
Common tests include t-tests, chi-squared tests, and ANOVA.
14. Summary and Insights:
Summarize the key findings from the EDA process.
Document interesting patterns, relationships, and potential insights.
15. Data Visualization and Reporting:
Create clear and informative data visualizations for reporting and presentation.
Communicate the results and insights effectively to stakeholders.
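Step 4 (data cleaning) can be illustrated before turning to the full script. The sketch below uses a small hypothetical DataFrame (not the Iris data, which has no missing values) and shows the two common imputation choices: a column median for numeric gaps and the mode for categorical gaps.

```python
import pandas as pd
import numpy as np

# Hypothetical toy frame with missing values
df = pd.DataFrame({
    'SepalLengthCm': [5.1, np.nan, 6.3, 5.8],
    'Species': ['setosa', 'setosa', None, 'virginica'],
})

# Numeric gap: impute with the column median (robust to outliers)
df['SepalLengthCm'] = df['SepalLengthCm'].fillna(df['SepalLengthCm'].median())

# Categorical gap: impute with the most frequent category
# (dropping the row with df.dropna() is the usual alternative)
df['Species'] = df['Species'].fillna(df['Species'].mode()[0])

print(df.isnull().sum().sum())  # 0 missing values remain
```

Whether to impute or drop depends on how much data is missing and whether the gaps are random; both options should be recorded as part of the EDA notes.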
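Step 9 (feature engineering) can be as simple as combining existing columns. As a hedged sketch, a hypothetical "petal area" feature derived from the Iris petal measurements might serve as a rough size proxy; the column name `PetalAreaCm2` is an invention for illustration, not part of the dataset.

```python
import pandas as pd

# Hypothetical rows in the Iris column schema
data = pd.DataFrame({
    'PetalLengthCm': [1.4, 4.7, 6.0],
    'PetalWidthCm': [0.2, 1.4, 2.5],
})

# Derived feature: petal area as a domain-motivated size proxy
data['PetalAreaCm2'] = data['PetalLengthCm'] * data['PetalWidthCm']
print(data['PetalAreaCm2'].tolist())
```

Whether such a feature helps should be checked the same way as any other variable, e.g. via its correlation with the target during the bivariate analysis of step 7.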

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset (the Iris dataset here; adjust the path for your environment)
data = pd.read_csv('/content/Iris.csv')

# Display the first few rows of the dataset to get an overview
print(data.head())

# Summary statistics for numeric columns
print(data.describe())

# Missing value analysis
print("\nMissing Values:")
print(data.isnull().sum())

# Explore missing data visually
plt.figure(figsize=(8, 6))
sns.heatmap(data.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Data')
plt.show()

# Class distribution for classification
class_counts = data['Species'].value_counts()
print(class_counts)

# Visualization of class distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='Species', data=data)
plt.title('Class Distribution')
plt.xlabel('Target Class')
plt.ylabel('Count')
plt.show()

# Import the label encoder
from sklearn import preprocessing

# A LabelEncoder maps string class labels to integer codes
label_encoder = preprocessing.LabelEncoder()

# Encode labels in column 'Species'
data['Species'] = label_encoder.fit_transform(data['Species'])

data['Species'].unique()

# Correlation matrix for numeric features
correlation_matrix = data.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

# Pairplot to visualize relationships between numerical features
sns.pairplot(data, hue='Species')
plt.show()

# Box plots for numerical features vs. target variable
plt.figure(figsize=(12, 8))
for i, feature in enumerate(data.columns[:-1]):
    plt.subplot(2, 3, i + 1)
    sns.boxplot(data=data, x='Species', y=feature)
    plt.title(f'{feature} vs. Target')
plt.tight_layout()
plt.show()

# Box plot of a single numeric feature by class
plt.figure(figsize=(12, 8))
sns.boxplot(x='Species', y='SepalLengthCm', data=data)
plt.title('Box Plot of SepalLengthCm by Class')
plt.xlabel('Target Class')
plt.ylabel('SepalLengthCm')
plt.show()
# Histograms for numeric features
data.hist(bins=20, figsize=(12, 8))
plt.suptitle('Histograms of Numeric Features', y=1.02)
plt.show()

# Distribution plots for numerical features
numerical_features = data.select_dtypes(include=['int64', 'float64']).columns
plt.figure(figsize=(12, 8))
for i, feature in enumerate(numerical_features):
    plt.subplot(2, 3, i + 1)
    sns.histplot(data=data, x=feature, kde=True)
    plt.title(f'{feature} Distribution')
plt.tight_layout()
plt.show()

# Scatter plot for feature relationships
plt.figure(figsize=(8, 6))
sns.scatterplot(data=data, x='SepalLengthCm', y='SepalWidthCm', hue='Species')
plt.title('Scatter Plot of SepalLengthCm vs. SepalWidthCm')
plt.show()

# Pairwise feature correlation with the target variable
correlation_with_target = data.corr()['Species'].abs().sort_values(ascending=False)
print("\nFeature Correlation with Target:")
print(correlation_with_target)
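The script above stops at correlation; step 10 (outlier detection) can be added in the same style. Below is a minimal, self-contained sketch of the common IQR rule, shown on a hypothetical series with one obvious outlier rather than a specific Iris column.

```python
import pandas as pd

# Hypothetical measurements with one obvious outlier
s = pd.Series([4.9, 5.0, 5.1, 5.2, 5.3, 9.9])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [9.9]
```

Flagged points should then be inspected: a data-entry error can be dropped or corrected, while a genuine extreme value may be kept or transformed (step 10's "removed or transformed" decision).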
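Step 13 mentions ANOVA as a common hypothesis test. As a hedged sketch, a one-way ANOVA via SciPy's `scipy.stats.f_oneway` (assuming SciPy is installed alongside pandas) can test whether a feature's mean differs across the classes; the small per-class samples below are made up for illustration.

```python
from scipy import stats

# Hypothetical per-class samples of one numeric feature
setosa = [5.0, 5.1, 4.9, 5.0]
versicolor = [5.9, 6.0, 6.1, 5.8]
virginica = [6.5, 6.6, 6.4, 6.7]

# One-way ANOVA: do the class means differ significantly?
f_stat, p_value = stats.f_oneway(setosa, versicolor, virginica)
print(p_value < 0.05)
```

A small p-value suggests the feature separates the classes and is worth keeping as a predictor; in the real experiment the lists would be replaced by the per-species slices of a column such as `SepalLengthCm`.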
