Perform Exploratory Data Analysis

What does it mean to perform exploratory data analysis?

Uploaded by

Abu Sufian

How to perform exploratory data analysis

Imagine you have a dataset of students' test scores and demographics.


Here's a simplified step-by-step approach:

1. Load Data:
o Read the data into a table.
o Example: students = pd.read_csv('students.csv')
2. Initial Exploration:
o Look at the first few rows: students.head()
o Check data structure: students.info()
3. Summary Statistics:
o Calculate mean and median of test scores.
o Count unique values in gender column.
4. Handle Missing Data:
o Identify missing entries: students.isnull().sum()
o Fill missing scores with the mean or remove those rows.
5. Visualize Data:
o Histogram of test scores.
o Box plot of test scores by gender.
o Scatter plot of test scores versus study hours.
6. Find Patterns:
o Calculate correlation between study hours and test scores.
o Cross-tabulate test scores and extracurricular participation.
7. Identify Outliers:
o Use IQR to find unusually high or low test scores.
o Use Z-score to find test scores that are far from the average.
8. Feature Engineering:
o Create a new feature combining study hours and class
participation.
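
Steps 3 and 4 above can be sketched on a small made-up table. This is only an illustration: the rows are invented, and the column names (gender, study_hours, test_score) are assumptions standing in for whatever students.csv actually contains.

```python
import pandas as pd

# Hypothetical sample standing in for students.csv; the column names
# and values are assumptions made for illustration only.
students = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "study_hours": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "test_score": [55.0, 60.0, None, 75.0, 85.0, 90.0],
})

# Step 3: summary statistics (mean and median skip missing values by default)
mean_score = students["test_score"].mean()
median_score = students["test_score"].median()
gender_counts = students["gender"].value_counts()

# Step 4: count missing entries per column, then fill missing scores
# with the mean on a copy so the original table stays untouched
missing_per_column = students.isnull().sum()
students_filled = students.copy()
students_filled["test_score"] = students_filled["test_score"].fillna(mean_score)

print(f"mean={mean_score}, median={median_score}")  # mean=73.0, median=75.0
print(gender_counts)
print(missing_per_column)
```

Filling with the mean keeps every row but pulls the distribution toward its center, whereas removing rows keeps the observed values intact at the cost of a smaller sample. Which trade-off is acceptable depends on how much data is missing.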
Performing Exploratory Data Analysis (EDA) involves several steps, from understanding the structure of
the data to summarizing its main characteristics. The template below walks through these steps in
Python using Pandas, Matplotlib, Seaborn, and SciPy. Placeholder names such as your_dataset.csv and
column_name should be replaced with names from your own data.

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from scipy import stats

# Load data

df = pd.read_csv('your_dataset.csv')

# Understand data structure

print(df.head())

print(df.shape)

df.info()  # info() prints its report directly and returns None

print(df.describe())

# Data cleaning: drop rows with any missing value, then exact duplicate rows

df.dropna(inplace=True)

df.drop_duplicates(inplace=True)

# Univariate analysis

df['column_name'].hist(bins=30)

plt.show()

sns.boxplot(x=df['column_name'])

plt.show()

# Bivariate analysis

plt.scatter(df['column_x'], df['column_y'])

plt.xlabel('column_x')

plt.ylabel('column_y')

plt.show()

correlation_matrix = df.corr(numeric_only=True)  # skip non-numeric columns

sns.heatmap(correlation_matrix, annot=True)

plt.show()
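
The heatmap shows every pairwise correlation at once. For a single pair, such as study hours and test scores in the student example, a sketch like the following may be enough. The values and column names are invented for illustration. Note that the Spearman variant relies on SciPy being installed, which this script already imports.

```python
import pandas as pd

# Hypothetical data: scores rise with study hours, but not in a straight line
students = pd.DataFrame({
    "study_hours": [1, 2, 3, 4, 5],
    "test_score": [50, 51, 55, 65, 90],
})

# Pearson measures linear association (the default for Series.corr)
pearson_r = students["study_hours"].corr(students["test_score"])

# Spearman measures monotonic association by correlating the ranks
spearman_r = students["study_hours"].corr(
    students["test_score"], method="spearman"
)

print(f"Pearson: {pearson_r:.3f}, Spearman: {spearman_r:.3f}")
```

Here the relationship is perfectly monotonic, so the rank-based coefficient is 1, while the Pearson coefficient is lower because the increase is not linear. Comparing the two can hint at a curved relationship worth checking on a scatter plot.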

# Categorical data analysis

df['categorical_column'].value_counts().plot(kind='bar')
plt.show()
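
To relate a categorical column to another variable, as in the cross-tabulation of test scores and extracurricular participation mentioned earlier, one possible sketch is shown below. The data, the column names, and the cut point of 70 are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical student records
students = pd.DataFrame({
    "extracurricular": ["yes", "no", "yes", "yes", "no", "no"],
    "test_score": [88, 62, 91, 75, 70, 58],
})

# Bin the numeric scores into two bands so they can be cross-tabulated.
# Bins are right-inclusive: (0, 70] and (70, 100].
students["score_band"] = pd.cut(
    students["test_score"], bins=[0, 70, 100], labels=["<=70", ">70"]
)

# Count students in each combination of participation and score band
table = pd.crosstab(students["extracurricular"], students["score_band"])

# Average score per participation group
mean_by_group = students.groupby("extracurricular")["test_score"].mean()

print(table)
print(mean_by_group)
```

A table like this only describes how the groups differ in the sample. It does not show that participation causes higher scores.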

# Identifying outliers using IQR

Q1 = df['column_name'].quantile(0.25)

Q3 = df['column_name'].quantile(0.75)

IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR

upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['column_name'] < lower_bound) | (df['column_name'] > upper_bound)]

print(outliers)

# Identifying outliers using Z-Score

df['z_score'] = stats.zscore(df['column_name'])

outliers = df[df['z_score'].abs() > 3]  # Series.abs() avoids needing NumPy

print(outliers)
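
The two rules do not always agree, especially on small samples. The sketch below wraps each rule in a helper and applies both to ten invented scores. The helper names iqr_outliers and zscore_outliers are hypothetical, and the z-score uses the population standard deviation (ddof=0) to match the default of stats.zscore.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return values lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]


def zscore_outliers(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return values whose absolute z-score exceeds the threshold."""
    z = (values - values.mean()) / values.std(ddof=0)
    return values[z.abs() > threshold]


# Hypothetical test scores with one obviously extreme value
scores = pd.Series([70, 72, 68, 71, 69, 73, 70, 72, 71, 150])

print(iqr_outliers(scores).tolist())     # [150]
print(zscore_outliers(scores).tolist())  # []
```

The z-score rule misses the value 150 here because the outlier itself inflates the mean and standard deviation. With n values and ddof=0, no z-score can exceed sqrt(n - 1) in absolute value, so a threshold of 3 cannot flag anything in a sample of ten. On small datasets the IQR rule, or a lower z-score threshold, may be the more useful choice.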

# Feature engineering

df['new_feature'] = df['feature1'] + df['feature2']


# Visualizing relationships

sns.pairplot(df)

plt.show()

# Hypothesis testing

group1 = df[df['group_column'] == 'group1']['numeric_column']

group2 = df[df['group_column'] == 'group2']['numeric_column']

t_stat, p_value = stats.ttest_ind(group1, group2)  # ttest_ind lives in scipy.stats

print(f'T-statistic: {t_stat}, P-value: {p_value}')
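
By default, stats.ttest_ind assumes both groups have the same variance. When that is doubtful, passing equal_var=False runs Welch's t-test instead. The sketch below compares the two on invented scores, so the group values are assumptions and not results from any real dataset.

```python
from scipy import stats

# Hypothetical scores for two groups with visibly different spread
group1 = [82, 85, 88, 90, 79, 84]
group2 = [70, 95, 60, 88, 75]

# Student's t-test: assumes equal variances in both groups
t_student, p_student = stats.ttest_ind(group1, group2)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(group1, group2, equal_var=False)

print(f"Student: t={t_student:.3f}, p={p_student:.3f}")
print(f"Welch:   t={t_welch:.3f}, p={p_welch:.3f}")
```

A small p-value suggests the group means differ by more than chance alone would explain. During exploration it is best read as a hint to investigate further, not as a final conclusion.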

This workflow provides a structured approach to performing EDA, helping you understand the dataset's
characteristics and relationships before moving on to more complex analysis or modeling.
