Comprehensive Guide for Exploratory Data Analysis in Python
1. Introduction to EDA
Exploratory Data Analysis (EDA) is a crucial step in any data analysis project: it helps you
understand the data, uncover patterns, spot anomalies, test hypotheses, and check assumptions
with the help of summary statistics and graphical representations.
2. Loading Libraries and Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Example: Loading a CSV file
df = pd.read_csv('your_dataset.csv')
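pd.read_csv also accepts options for less tidy files; for example (the delimiter and date column here are placeholder assumptions):
# Hypothetical example: semicolon-delimited file with a date column to parse
# df = pd.read_csv('your_dataset.csv', sep=';', parse_dates=['date_column'])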
3. Data Overview
# Display the first few rows of the dataset
print(df.head())
# Display summary statistics
print(df.describe())
# Display information about the dataset
print(df.info())
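A few more quick checks round out the overview; this sketch uses only standard pandas calls:
# Shape, column types, and cardinality of each column
print(df.shape)
print(df.dtypes)
print(df.nunique())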
4. Data Cleaning
# Handling Missing Values
print(df.isnull().sum())
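The share of missing values per column is often more informative than the raw count when choosing between filling and dropping; a minimal sketch:
# Percentage of missing values per column
print(df.isnull().mean() * 100)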
# Fill missing values in numeric columns with the column mean (numeric_only avoids errors on non-numeric columns)
df.fillna(df.mean(numeric_only=True), inplace=True)
# Alternatively, you can fill missing values with median or mode
# df['column_name'].fillna(df['column_name'].median(), inplace=True)
# df['column_name'].fillna(df['column_name'].mode()[0], inplace=True)
# Dropping rows with missing values
# df.dropna(inplace=True)
# Handling Duplicates
print(df.duplicated().sum())
df.drop_duplicates(inplace=True)
5. Data Preprocessing
# Encoding Categorical Variables
df = pd.get_dummies(df, columns=['categorical_column'])
# Label Encoding for ordinal data
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['ordinal_column'] = le.fit_transform(df['ordinal_column'])
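Note that LabelEncoder assigns integer codes in alphabetical order, which may not match the true ranking of an ordinal variable. When the order matters, an explicit mapping is safer; a minimal sketch with hypothetical category names:
# Hypothetical example: encode ordinal categories in their true order
# order = ['low', 'medium', 'high']
# df['ordinal_column'] = pd.Categorical(df['ordinal_column'], categories=order, ordered=True).codes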
# Feature Engineering
df['new_feature'] = df['existing_feature1'] * df['existing_feature2']
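Binning a continuous feature into categories is another common feature-engineering step; a minimal sketch with hypothetical bin edges and labels:
# Hypothetical example: bin a numeric feature into three groups
# df['binned_feature'] = pd.cut(df['existing_feature1'], bins=[0, 10, 50, 100], labels=['low', 'mid', 'high'])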
6. Outlier Detection and Treatment
# Using the Z-score to identify outliers (assumes a numeric column with missing values already handled)
z_scores = stats.zscore(df['column_name'])
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3)
df = df[filtered_entries]
# Using IQR (Interquartile Range) to identify outliers
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
filtered_entries = ((df['column_name'] >= (Q1 - 1.5 * IQR)) & (df['column_name'] <= (Q3 + 1.5 * IQR)))
df = df[filtered_entries]
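Dropping rows is not the only treatment: values can instead be capped (winsorized) at the IQR fences so no rows are lost; a minimal sketch reusing Q1, Q3, and IQR from above:
# Alternatively, cap extreme values at the IQR fences instead of removing rows
# df['column_name'] = df['column_name'].clip(lower=Q1 - 1.5 * IQR, upper=Q3 + 1.5 * IQR)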
7. Scaling and Normalization
# Min-Max Scaling
scaler = MinMaxScaler()
df[['column1', 'column2']] = scaler.fit_transform(df[['column1', 'column2']])
# Standardization (an alternative to min-max scaling; apply one or the other, not both in sequence)
scaler = StandardScaler()
df[['column1', 'column2']] = scaler.fit_transform(df[['column1', 'column2']])
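When the scaled data will feed a model, fit the scaler on the training split only and apply it to the test split to avoid data leakage; a minimal sketch using scikit-learn's train_test_split:
from sklearn.model_selection import train_test_split
# Fit on the training split only, then reuse the fitted scaler on the test split
X_train, X_test = train_test_split(df[['column1', 'column2']], test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)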
8. Data Visualization
# Univariate Analysis
# Histogram
plt.figure(figsize=(10, 6))
sns.histplot(df['column_name'], kde=True)
plt.title('Histogram of column_name')
plt.show()
# Boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['column_name'])
plt.title('Boxplot of column_name')
plt.show()
# Bivariate Analysis
# Scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='column1', y='column2', data=df)
plt.title('Scatter plot between column1 and column2')
plt.show()
# Heatmap for correlation
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
# Multivariate Analysis
# Pairplot
sns.pairplot(df)
plt.show()
# Violin plot
plt.figure(figsize=(10, 6))
sns.violinplot(x='categorical_column', y='numeric_column', data=df)
plt.title('Violin plot')
plt.show()
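For a univariate view of a categorical column, a count plot complements the numeric plots above; a minimal sketch with the same placeholder column name:
# Count plot for a categorical variable
plt.figure(figsize=(10, 6))
sns.countplot(x='categorical_column', data=df)
plt.title('Count plot of categorical_column')
plt.show()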
9. Summarizing Findings
print("Key Findings:")
print("1. Description of key patterns or anomalies.")
print("2. Potential relationships between features.")
print("3. Insights on missing values and outliers.")
10. Adjusting for Different Problems and Constraints
# Imbalanced Data
# Check class distribution
print(df['target'].value_counts())
# Oversampling using SMOTE (requires the imbalanced-learn package)
from imblearn.over_sampling import SMOTE
# Separate features and target before resampling
X = df.drop(columns=['target'])
y = df['target']
smote = SMOTE()
X_res, y_res = smote.fit_resample(X, y)
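Random undersampling of the majority class is an alternative when the dataset is large enough to spare rows; a minimal sketch, also using imbalanced-learn:
# Alternatively, undersample the majority class
# from imblearn.under_sampling import RandomUnderSampler
# rus = RandomUnderSampler()
# X_res, y_res = rus.fit_resample(X, y)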
# Large Datasets
# Using Dask for larger-than-memory computations
import dask.dataframe as dd
df = dd.read_csv('large_dataset.csv')
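Dask evaluates lazily, so summaries need an explicit .compute() call; a minimal sketch:
# Computations are lazy until .compute() is called
print(df.describe().compute())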
# Time Series Data
# Converting a column to datetime
df['date_column'] = pd.to_datetime(df['date_column'])
# Setting the date column as index
df.set_index('date_column', inplace=True)
# Resampling to monthly frequency ('M' = month-end) and averaging numeric columns
df_resampled = df.resample('M').mean(numeric_only=True)
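A rolling window is another common time-series summary; a minimal sketch with a hypothetical numeric column and a 7-period window:
# Hypothetical example: 7-period rolling mean of a numeric column
# df['rolling_mean'] = df['numeric_column'].rolling(window=7).mean()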
# Text Data
# Using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(df['text_column'])
# Using TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df['text_column'])
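To inspect which terms a vectorizer learned, recent scikit-learn versions (1.0+) expose get_feature_names_out(); a minimal sketch:
# Inspect the first few terms of the learned vocabulary (scikit-learn >= 1.0)
print(tfidf.get_feature_names_out()[:20])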