
EDA - Exploratory Data Analysis in Python

Last Updated : 10 May, 2025

Exploratory Data Analysis (EDA) is an important step in data analysis which focuses on understanding patterns, trends and relationships through statistical tools and visualizations. Python offers various libraries like Pandas, NumPy, Matplotlib, Seaborn and Plotly which enable effective exploration and insight generation to support further modeling and analysis. In this article, we will see how to perform EDA using Python.

Key Steps for Exploratory Data Analysis (EDA)

Let's see the various steps involved in Exploratory Data Analysis:

Step 1: Importing Required Libraries

We need to install (if not already available) and import the Pandas, NumPy, Matplotlib and Seaborn libraries in Python to proceed further.

Python
import pandas as pd               # data loading and manipulation
import numpy as np                # numerical operations
import matplotlib.pyplot as plt   # basic plotting
import seaborn as sns             # statistical visualizations
import warnings as wr
wr.filterwarnings('ignore')       # suppress warning messages in the output

Step 2: Reading Dataset

Download the dataset from this link and let's read it using Pandas.

Python
df = pd.read_csv("/content/WineQT.csv")   # load the wine quality dataset
print(df.head())                          # preview the first five rows

Output:

[Image: first five rows of the dataset]

Step 3: Analyzing the Data

1. df.shape: This attribute gives the number of rows (observations) and columns (features) in the dataset as a tuple, providing an overview of the dataset's size and structure. Note that shape is an attribute, not a method, so it is used without parentheses.

Python
df.shape

Output:

(1143, 13)

2. df.info(): This method helps us understand the dataset by showing the number of non-null records in each column, the data type of each column, whether any values are missing and how much memory the dataset uses.

Python
df.info()

Output:

[Image: output of df.info()]

3. df.describe(): This method gives a statistical summary of the DataFrame, showing values like count, mean, standard deviation, minimum, maximum and quartiles for each numerical column. It helps in summarizing the central tendency and spread of the data.

Python
df.describe()

Output:

[Image: statistical summary from df.describe()]
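
When there are many columns, a transposed summary is often easier to scan. This is a small optional sketch on the same df:

Python
df.describe().T   # one row per column, statistics as columns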

4. df.columns.tolist(): This converts the column names of the DataFrame into a Python list making it easy to access and manipulate the column names.

Python
df.columns.tolist()

Output:

[Image: list of column names]

Step 4: Checking Missing Values

df.isnull().sum(): This checks for missing values in each column and returns the total number of null values per column, helping us identify any gaps in our data.

Python
df.isnull().sum()

Output:

[Image: count of missing values in each column]
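
If any columns did contain missing values, expressing them as a percentage of the total rows makes the extent of the gaps clearer. A minimal sketch on the same df:

Python
# Percentage of missing values per column, largest first
missing_pct = df.isnull().sum() / len(df) * 100
print(missing_pct.sort_values(ascending=False))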

Step 5: Checking Unique and Duplicate Values

df.nunique(): This function tells us how many unique values exist in each column which provides insight into the variety of data in each feature.

Python
df.nunique()

Output:

[Image: number of unique values in each column]
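
Since this step also concerns duplicates, exact duplicate rows can be counted directly. A minimal sketch on the same df:

Python
# Count exact duplicate rows in the dataset
print("Duplicate rows:", df.duplicated().sum())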

Step 6: Univariate Analysis

In univariate analysis, plotting the right charts helps us better understand the data, which is why data visualization is so important.

1. Bar Plot for evaluating the count of wines for each quality rating.

Python
quality_counts = df['quality'].value_counts()

plt.figure(figsize=(8, 6))
plt.bar(quality_counts.index, quality_counts, color='deeppink')
plt.title('Count Plot of Quality')
plt.xlabel('Quality')
plt.ylabel('Count')
plt.show()

Output:

[Image: bar plot of wine quality counts]

This bar plot shows the count of wines for each quality rating.

2. Kernel density plot for understanding variance in the dataset

Python
sns.set_style("darkgrid")

numerical_columns = df.select_dtypes(include=["int64", "float64"]).columns

plt.figure(figsize=(14, len(numerical_columns) * 3))
for idx, feature in enumerate(numerical_columns, 1):
    plt.subplot(len(numerical_columns), 2, idx)
    sns.histplot(df[feature], kde=True)
    plt.title(f"{feature} | Skewness: {round(df[feature].skew(), 2)}")

plt.tight_layout()
plt.show()

Output:

[Image: kernel density plots of the numeric features]

Features with a skewness close to 0 show a roughly symmetrical distribution. A skewness of 1 or above suggests a positively skewed (right-skewed) distribution. In a right-skewed distribution the tail extends more to the right, which indicates the presence of extremely high values.
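
The skewness values printed in the plot titles can also be computed directly as a table. A minimal sketch, reusing the numerical_columns selected in the previous step:

Python
# Skewness of each numeric feature, from most right-skewed to most left-skewed
skewness = df[numerical_columns].skew().sort_values(ascending=False)
print(skewness)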

3. Swarm Plot for showing the outliers in the data

Python
plt.figure(figsize=(10, 8))

sns.swarmplot(x="quality", y="alcohol", data=df, palette='viridis')

plt.title('Swarm Plot for Quality and Alcohol')
plt.xlabel('Quality')
plt.ylabel('Alcohol')
plt.show()

Output:

[Image: swarm plot of alcohol by quality]

This graph shows the swarm plot for the 'quality' and 'alcohol' columns. The higher point density in certain areas shows where most of the data points are concentrated, while isolated points far from these clusters represent outliers, highlighting unusual values in the dataset.
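
The visual impression of outliers can be cross-checked numerically. Below is a minimal sketch using the conventional 1.5 × IQR rule on the alcohol column; the rule and its threshold are common defaults, not something prescribed by this dataset:

Python
# Flag alcohol values outside the 1.5 * IQR whiskers (conventional outlier rule)
q1, q3 = df['alcohol'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df['alcohol'] < lower) | (df['alcohol'] > upper)]
print(f"Alcohol outliers by the IQR rule: {len(outliers)}")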

Step 7: Bivariate Analysis

In bivariate analysis, two variables are analyzed together to identify patterns, dependencies or interactions between them. This helps in understanding how changes in one variable might affect another.

Let's visualize these relationships by plotting various plots for the data which will show how the variables interact with each other across multiple dimensions.
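
As a complement to the plots, a quick numeric summary of one pair of variables is often useful. A minimal sketch, looking at the average alcohol content per quality rating (an illustrative choice of columns):

Python
# Average alcohol content for each quality rating
print(df.groupby('quality')['alcohol'].mean().sort_index())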

1. Pair Plot for showing the distribution of the individual variables

Python
sns.set_palette("Pastel1")

# pairplot creates its own figure, so no separate plt.figure() call is needed

sns.pairplot(df)

plt.suptitle('Pair Plot for DataFrame')
plt.show()

Output:

[Image: pair plot of the DataFrame]
  • The diagonal cells show histograms or kernel density plots describing the distribution of each individual variable.
  • The off-diagonal cells contain scatter plots that display the relationship between each pair of variables.
  • The scatter plots above and below the diagonal are mirror images of each other, so the grid is symmetric.
  • The location of peaks in the histograms shows where values are concentrated.
  • Skewness can be judged by observing whether a histogram is symmetrical or skewed to the left or right.
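
A pair plot over all thirteen columns is dense, so restricting it to a few features and coloring by quality can make these patterns easier to read. A minimal sketch; the selected columns are an illustrative choice, assuming the standard column names in WineQT.csv:

Python
# Focused pair plot on a few features, colored by wine quality
selected = ['alcohol', 'volatile acidity', 'sulphates', 'quality']
sns.pairplot(df[selected], hue='quality', palette='viridis')
plt.show()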

2. Violin Plot for examining the relationship between alcohol and Quality.

Python
df['quality'] = df['quality'].astype(str)   # treat quality as a categorical label for plotting

plt.figure(figsize=(10, 8))

sns.violinplot(x="quality", y="alcohol", data=df, palette={
               '3': 'lightcoral', '4': 'lightblue', '5': 'lightgreen', '6': 'gold', '7': 'lightskyblue', '8': 'lightpink'}, alpha=0.7)

plt.title('Violin Plot for Quality and Alcohol')
plt.xlabel('Quality')
plt.ylabel('Alcohol')
plt.show()

Output:

[Image: violin plot of alcohol by quality]

For interpreting the Violin Plot:

  • A wider section shows higher density, suggesting more data points at that value.
  • A symmetrical plot shows a balanced distribution.
  • A peak or bulge in the violin represents the most common value in the distribution.
  • Longer tails show greater variability.
  • The median line is the middle line inside the violin plot and helps in understanding the central tendency.

3. Box Plot for examining the relationship between alcohol and Quality

Python
sns.boxplot(x='quality', y='alcohol', data=df)

Output:

[Image: box plot of alcohol by quality]

The box represents the interquartile range (IQR); the longer the box, the greater the variability. The per-group quartiles behind these boxes can also be computed directly, as sketched after the list below.

  • The median line in the box shows the central tendency.
  • Whiskers extend from the box to the smallest and largest values within a specified range (commonly 1.5 × IQR).
  • Individual points beyond the whiskers represent outliers.
  • A compact box shows low variability while a stretched box shows higher variability.
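
A minimal sketch of those per-group quartiles, assuming quality is still the string column created for the violin plot:

Python
# Quartiles and IQR of alcohol for each quality level
stats = df.groupby('quality')['alcohol'].quantile([0.25, 0.5, 0.75]).unstack()
stats.columns = ['Q1', 'median', 'Q3']
stats['IQR'] = stats['Q3'] - stats['Q1']
print(stats)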

Step 8: Multivariate Analysis

It involves examining the interactions between three or more variables in a dataset at the same time. This approach aims to identify complex patterns, relationships and interactions, providing an understanding of how multiple variables collectively behave and influence each other.

Here, we are going to show the multivariate analysis using a correlation matrix plot.

Python
# quality was cast to string for the violin plot, so convert it back to numeric
# before computing correlations
df['quality'] = pd.to_numeric(df['quality'])

plt.figure(figsize=(15, 10))

sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='Pastel2', linewidths=2)

plt.title('Correlation Heatmap')
plt.show()

Output:

[Image: correlation heatmap]

Values close to +1 show a strong positive correlation, values close to -1 show a strong negative correlation and values near 0 suggest no linear correlation.

  • Darker colors signify stronger correlations, while lighter colors represent weaker correlations.
  • Positively correlated variables move in the same direction: as one increases, the other also increases.
  • Negatively correlated variables move in opposite directions: an increase in one variable is associated with a decrease in the other.
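
To read the heatmap in terms of the target, the correlation of each feature with quality can be sorted. A minimal sketch, assuming quality was cast back to numeric as in the previous code block:

Python
# Correlation of each feature with wine quality, strongest positive first
corr_with_quality = df.corr()['quality'].drop('quality').sort_values(ascending=False)
print(corr_with_quality)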

With these insights from the EDA, we are now ready to understand the data and explore more advanced modeling techniques.

