Perform Exploratory Data Analysis
1. Load Data:
o Read the data into a table.
o Example: students = pd.read_csv('students.csv')
2. Initial Exploration:
o Look at the first few rows: students.head()
o Check data structure: students.info()
3. Summary Statistics:
o Calculate mean and median of test scores.
o Count unique values in gender column.
4. Handle Missing Data:
o Identify missing entries: students.isnull().sum()
o Fill missing scores with the mean or remove those rows.
5. Visualize Data:
o Histogram of test scores.
o Box plot of test scores by gender.
o Scatter plot of test scores versus study hours.
6. Find Patterns:
o Calculate correlation between study hours and test scores.
o Cross-tabulate test score bands (for example, pass/fail) with extracurricular participation.
7. Identify Outliers:
o Use IQR to find unusually high or low test scores.
o Use Z-score to find test scores that are far from the average.
8. Feature Engineering:
o Create a new feature combining study hours and class participation.
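As a rough illustration of steps 3, 4, 6, 7, and 8 above, the sketch below builds a tiny in-memory students table instead of reading students.csv. The column names (test_score, study_hours, gender, extracurricular, class_participation), the pass mark of 60, the Z-score cutoff of 2, and the engineered engagement feature are all assumptions made for this example; none of them come from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for students.csv; every name and value here is invented.
students = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "study_hours": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0],
    "class_participation": [1, 2, 2, 3, 3, 4, 4, 5],
    "extracurricular": ["yes", "no", "yes", "no", "yes", "no", "yes", "no"],
    "test_score": [50.0, 55.0, np.nan, 65.0, 70.0, 75.0, 80.0, 200.0],
})

# Step 3: summary statistics (pandas skips NaN by default).
mean_score = students["test_score"].mean()
median_score = students["test_score"].median()
n_genders = students["gender"].nunique()

# Step 4: count missing entries per column, then fill missing scores with the mean.
missing_counts = students.isnull().sum()
students["test_score"] = students["test_score"].fillna(mean_score)

# Step 6: correlation between two columns, and a cross-tabulation of
# pass/fail (assumed pass mark: 60) against extracurricular participation.
hours_score_corr = students["study_hours"].corr(students["test_score"])
students["result"] = np.where(students["test_score"] >= 60, "pass", "fail")
participation_table = pd.crosstab(students["extracurricular"], students["result"])

# Step 7a: IQR rule, flagging scores beyond 1.5 * IQR from the quartiles.
q1 = students["test_score"].quantile(0.25)
q3 = students["test_score"].quantile(0.75)
iqr = q3 - q1
iqr_outliers = students[
    (students["test_score"] < q1 - 1.5 * iqr)
    | (students["test_score"] > q3 + 1.5 * iqr)
]

# Step 7b: Z-score rule. A cutoff of 3 is more common; 2 is used here only
# because a sample of eight rows cannot produce a Z-score as large as 3.
z_scores = (
    students["test_score"] - students["test_score"].mean()
) / students["test_score"].std(ddof=0)
z_outliers = students[z_scores.abs() > 2]

# Step 8: one possible engineered feature (a simple product of two columns).
students["engagement"] = students["study_hours"] * students["class_participation"]

print(participation_table)
print(iqr_outliers[["test_score"]])
```

Filling with the mean follows the outline, but the mean is itself pulled upward by the extreme score of 200, so on real data it may be worth checking for outliers before choosing a fill value.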
Exploratory Data Analysis (EDA) involves several steps, from understanding the structure of the data to
summarizing its main characteristics. The template below walks through these steps in Python using
pandas, Matplotlib, Seaborn, and SciPy. Replace the placeholder file name and column names with your own.
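import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Load data
df = pd.read_csv('your_dataset.csv')

# Initial exploration
print(df.head())
print(df.shape)
df.info()  # info() prints its report directly and returns None
print(df.describe())

# Data cleaning: drop rows with missing values, then exact duplicate rows
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)

# Univariate analysis
df['column_name'].hist(bins=30)
plt.show()
sns.boxplot(x=df['column_name'])
plt.show()

# Bivariate analysis
plt.scatter(df['column_x'], df['column_y'])
plt.xlabel('column_x')
plt.ylabel('column_y')
plt.show()

# Correlation heatmap (numeric_only=True skips non-numeric columns; needs pandas 1.5+)
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True)
plt.show()

# Categorical counts
df['categorical_column'].value_counts().plot(kind='bar')
plt.show()

# Outlier detection: IQR rule (1.5 * IQR is the conventional fence)
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['column_name'] < lower_bound) | (df['column_name'] > upper_bound)]
print(outliers)

# Outlier detection: Z-score (|z| > 3 is a common, not universal, cutoff)
df['z_score'] = stats.zscore(df['column_name'])
outliers = df[df['z_score'].abs() > 3]
print(outliers)

# Feature engineering (illustrative only: the right combination depends on the data)
df['combined_feature'] = df['column_x'] * df['column_y']

# Multivariate analysis: pairwise plots of the numeric columns
sns.pairplot(df)
plt.show()

# Hypothesis testing: pick a test that fits the question, for example
# scipy.stats.ttest_ind to compare a numeric column between two groups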
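The hypothesis-testing step depends on the question being asked, so no single test fits every dataset. As one possible sketch, the example below compares a numeric column between two groups with Welch's t-test from SciPy. The group and score column names, the data values, and the 0.05 significance level are assumptions chosen for illustration.

```python
import pandas as pd
from scipy import stats

# Invented data: 'group' and 'score' are placeholder column names.
df = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5,
    "score": [78, 82, 85, 88, 90, 60, 65, 70, 72, 75],
})

group_a = df.loc[df["group"] == "A", "score"]
group_b = df.loc[df["group"] == "B", "score"]

# Welch's t-test: equal_var=False drops the equal-variance assumption.
result = stats.ttest_ind(group_a, group_b, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

alpha = 0.05  # conventional significance level, assumed for this example
reject_null = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject H0: {reject_null}")
```

A t-test suits roughly normal numeric data in two groups. For categorical columns, a chi-square test on a cross-tabulation (scipy.stats.chi2_contingency) may be a better fit.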
This workflow provides a structured approach to performing EDA, helping you understand the dataset's
characteristics and relationships before moving on to more complex analysis or modeling.