Exploratory Data Analysis (EDA) in Python

Exploratory Data Analysis (EDA) in Python involves analyzing datasets to uncover patterns and relationships using manual methods with libraries like Pandas, Matplotlib, and Seaborn, or through automated tools like Pandas Profiling and Sweetviz. The document outlines various techniques for univariate, bivariate, and multivariate analysis, as well as strategies for handling missing data and outliers. It emphasizes the importance of combining manual and automated approaches for a comprehensive analysis workflow.

Uploaded by

kvenkatasairahul

Exploratory Data Analysis (EDA) in Python

Exploratory Data Analysis (EDA) is the process of analyzing and summarizing the main characteristics of a
dataset to uncover patterns, relationships, and anomalies 1 . In Python, this can be done manually using
libraries like Pandas, Matplotlib and Seaborn, or automatically using specialized EDA tools. Manual EDA
gives fine control over each step (and is often iterative), while automated tools generate quick reports and
visualizations with minimal code. We outline both approaches below, along with common visualization
techniques.

Manual EDA Approach with Pandas/Matplotlib/Seaborn

• Univariate analysis (single variable): For numeric variables, compute summary statistics (e.g.
using df.describe() , which yields count, mean, std, min/max, and quartiles) and plot distributions.
Common plots are histograms or KDE plots to see the shape of the distribution 2 , and box plots to
show the median and quartiles and to flag outliers 3 . For example, a histogram ( sns.histplot() ) reveals
how values cluster, while a boxplot ( sns.boxplot() ) draws any points beyond the
1.5×IQR whiskers as individual markers 3 . For categorical variables, use value counts and bar charts (or countplots) to
show the frequency of each category 4 . (Pie charts are also possible, though they make small differences
between categories harder to judge.) These
univariate plots help detect skewness, modality, and basic anomalies in each variable.

• Bivariate analysis (two variables): Explore pairwise relationships:

• Numerical vs numerical: Use scatter plots (e.g. sns.scatterplot ) to see correlations or trends
between two numeric features 5 . You can also plot a line graph if one variable is time or otherwise ordered. If
there are many variables, a correlation heatmap ( sns.heatmap(df.corr(numeric_only=True)) ) shows pairwise linear
correlations across all numeric fields 6 ; numeric_only=True skips non-numeric columns, which pandas 2.0 and
later would otherwise reject.
• Categorical vs numerical: Use grouped plots such as boxplots or violin plots to compare the
distribution of a numeric variable across categories 7 . For example, sns.barplot(x=cat,
y=num) compares group means and sns.boxplot(x=cat, y=num) compares spread (see
the bar-chart example in 7 ).

• Categorical vs categorical: Create cross-tabulations or grouped bar charts to compare two
categorical features. For instance, a sns.countplot with hue= can show joint frequencies (as in
8 ). Alternatively, a heatmap of the contingency table highlights associations between categories.
These plots reveal how categories interact or depend on each other.
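The three bivariate cases can also be checked numerically before plotting. The sketch below uses a small hypothetical DataFrame (all column names and values are made up) and assumes pandas 1.5 or later for the numeric_only argument.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    "height_cm": [150, 160, 165, 170, 180, 185],
    "weight_kg": [50, 58, 63, 68, 80, 86],
    "sex": ["F", "F", "F", "M", "M", "M"],
    "smoker": ["no", "yes", "no", "no", "yes", "no"],
})

# Numerical vs numerical: Pearson correlation over the numeric columns only.
corr = df.corr(numeric_only=True)

# Categorical vs numerical: per-group summary (the numbers behind a grouped boxplot).
by_sex = df.groupby("sex")["height_cm"].agg(["mean", "median", "std"])

# Categorical vs categorical: contingency table of joint frequencies.
table = pd.crosstab(df["sex"], df["smoker"])

print(corr, by_sex, table, sep="\n\n")
```

Passing corr or table to sns.heatmap(..., annot=True) gives the heatmap views described above.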

• Multivariate analysis (multiple variables): To explore more than two variables at once, use
techniques like:

• Pairplot (Scatterplot matrix): Plots all pairwise scatterplots of numeric features (and histograms on
the diagonal). This provides a quick visual overview of relationships among many variables.

• Dimensionality reduction plots: Apply PCA or t-SNE on numeric data and scatter the first 2–3
components (with points colored by a category) 9 . For example, plotting the first two principal
components of the Iris dataset (colored by species) can reveal clusters (as shown in 9 ).
• Heatmaps: A heatmap of the correlation matrix (numeric variables) or of a distance/association
measure summarizes multivariate structure 6 .

These multivariate plots help uncover patterns not seen in pairwise analyses alone.
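To make the dimensionality-reduction idea concrete, here is a rough sketch of a two-component PCA projection. It is written with NumPy's SVD rather than a dedicated PCA class, purely to keep the example dependency-free; the function name pca_2d and the synthetic two-group data are assumptions for illustration.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto the first two principal components."""
    X = np.asarray(X, dtype=float)
    # Standardize each column so that no feature dominates because of its units.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardized data: the rows of Vt are the principal directions.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:2].T
    explained = S ** 2 / np.sum(S ** 2)  # share of variance per component
    return scores, explained[:2]

# Hypothetical data: two groups that differ along four correlated features.
rng = np.random.default_rng(42)
group = np.repeat([0, 1], 50)
base = rng.normal(size=(100, 1)) + 4 * group[:, None]
X = np.hstack([base + rng.normal(scale=0.3, size=(100, 1)) for _ in range(4)])

scores, explained = pca_2d(X)
print(scores.shape, explained)
```

Scattering scores[:, 0] against scores[:, 1] and coloring the points by group would then show whether the groups separate in the reduced space.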

• Handling missing data: First detect missingness by checking df.isnull().sum() or using
df.info() , which reports non-null counts 10 . Missing data can be visualized via heatmaps (e.g.
sns.heatmap(df.isnull(), cbar=False) or the missingno library). The two common
strategies are deletion and imputation. Deletion (e.g. df.dropna() ) drops every row, or with axis=1
every column, that contains a null 11 . Imputation replaces nulls with estimates: e.g. fill
numeric columns with the mean or median, and categorical columns with the mode 12 . More advanced
methods include interpolation or model-based imputation. The choice depends on the data and the
amount of missingness.
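The detection, deletion and simple imputation steps described above might look like the following sketch. The DataFrame and its columns ( age , city ) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, np.nan, 38],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi", "Pune"],
})

# Detect: number of nulls per column.
missing_counts = df.isnull().sum()

# Option 1, deletion: keep only rows with no nulls at all.
dropped = df.dropna()

# Option 2, imputation: median for the numeric column, mode for the categorical one.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode().iloc[0])

print(missing_counts, dropped, imputed, sep="\n\n")
```

Deletion halves this small table, while imputation keeps every row, which illustrates why the amount of missingness matters when choosing between the two.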

• Outlier detection: Outliers (extreme values) can distort analysis. Two common rules are:

• Z-score rule: Compute the z-score for each point ( (x–μ)/σ ) and flag points with |z| > 3 (beyond 3
standard deviations) as outliers 13 .

• IQR rule: Mark as outliers any points below Q1−1.5×IQR or above Q3+1.5×IQR 3 . These are exactly
the points beyond the whiskers of a boxplot 3 .
Visual inspection with a boxplot or scatterplot often helps spot outliers. Once identified, outliers can
be removed or capped, or further investigated case-by-case.
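Both rules are short enough to write by hand. The sketch below flags outliers with each rule and then caps them at the IQR fences; the helper names zscore_outliers and iqr_outliers and the sample values are hypothetical.

```python
import pandas as pd

def zscore_outliers(s, threshold=3.0):
    """Boolean mask of points whose |z| exceeds the threshold."""
    z = (s - s.mean()) / s.std(ddof=0)
    return z.abs() > threshold

def iqr_outliers(s, k=1.5):
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Hypothetical sample with one obviously extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95])

z_flags = zscore_outliers(values)
iqr_flags = iqr_outliers(values)

# Capping (winsorizing) at the IQR fences instead of dropping rows.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
fence_low, fence_high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
capped = values.clip(lower=fence_low, upper=fence_high)

print(values[z_flags].tolist(), values[iqr_flags].tolist(), capped.max())
```

Note that the z-score rule depends on the mean and standard deviation, which the outlier itself inflates, so in small samples it can miss points that the IQR rule catches.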

• Data distribution & transformation: Assess each numeric variable’s distribution using histograms
or density plots. Compute skewness to quantify asymmetry and kurtosis to quantify tail weight. If a variable is highly skewed
(e.g. log-normal), apply transformations such as logarithm, square-root, or Box–Cox to “normalize” it
14 . For example, a log transform is often used to reduce right-skew: it compresses the long right tail
and spreads out the values bunched near the low end, making the distribution more symmetric 14 . The
logarithm and Box–Cox require strictly positive values. Other transforms (e.g. min–max
scaling or standard scaling) rescale ranges without changing the shape of the distribution (important later for modeling). Always
check distributions before and after transformation to confirm improvement.
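A quick before-and-after check of skewness shows the effect of a log transform. The sketch assumes a synthetic log-normal column (the name income is hypothetical) and uses np.log1p , i.e. log(1 + x), which also tolerates zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable drawn from a log-normal distribution.
rng = np.random.default_rng(1)
income = pd.Series(rng.lognormal(mean=10, sigma=1.0, size=5000))

skew_before = income.skew()

# log1p computes log(1 + x); it compresses the long right tail.
log_income = np.log1p(income)
skew_after = log_income.skew()

print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness close to zero after the transform, together with a roughly symmetric histogram, is the confirmation the paragraph above recommends.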

Data/Relationship               Typical Plots/Techniques
------------------------------  ---------------------------------------------------------------------------
Numeric (single feature)        Histogram or density plot; Boxplot/Violin (to show median/IQR/outliers)
Categorical (single)            Bar chart of counts (countplot); Pie chart (rarely, for 2–3 categories)
Numeric vs Numeric              Scatter plot; Hexbin (for large data); Pairplot (scatterplot matrix)
Categorical vs Numeric          Boxplot or Violin plot (split by category); Bar plot of group means
Categorical vs Categorical      Grouped/stacked bar charts; Mosaic plots; Heatmap of contingency (crosstab)
Time series (Numeric vs Time)   Line plot (trend over time); Seasonal decomposition (trend/season plots)
Multivariate (many vars)        Pairplot (all pairs of numeric); PCA/t-SNE scatter; Heatmaps of correlation

Table: Common visualization choices for different data types and relationships (implementable via Matplotlib/
Seaborn). The examples above are supported by standard Python libraries 2 5 15 .

Automated EDA Tools and Libraries

Several Python libraries automate much of this process by scanning a DataFrame and producing reports or
dashboards:

Figure: Pandas Profiling report “Overview” for the Iris dataset. The left panel shows dataset statistics (5
variables, 150 observations, 0 missing) and duplicate count, while the right panel lists variable types 16 .

• Pandas-Profiling (ydata-profiling): This library, now published under the name ydata-profiling, generates
a comprehensive HTML report with one
command (e.g. ProfileReport(df) ). It extends df.describe() by including variable type
inference, unique counts, missing-value summaries, descriptive stats (mean, median, quartiles,
skewness, etc.), histograms, correlations, and more 16 17 . For instance, the report shows
distribution plots and flags high correlations or missing data. It saves time (requiring only a
few lines of code) and is interactive: you can drill into any variable’s details. Limitations: Pandas-
Profiling can be slow or fail on very large datasets, because it computes many statistics for every
column 18 . The resulting HTML report can also be heavy to open in a browser. Thus it is best used on small-to-
medium data to get an initial overview.

Figure: Sweetviz output for the Iris dataset. Sweetviz produces a self-contained HTML report with summary
statistics and histograms for each feature (see e.g. sepal/petal length and width) 19 .

• Sweetviz: Another auto-EDA library, Sweetviz generates visually rich reports (in one line, sv.analyze(df) ) 19 20 .
The output focuses on visual comparisons: it plots side-by-side histograms, summary tables, and
correlation heatmaps. Notably, Sweetviz highlights how each feature relates to a target variable
(“target analysis”) and can compare two datasets (e.g. train vs. test) in one report 21 20 . It also
reports missing values, duplicates, and frequent values. Sweetviz is very quick to set up and its HTML
report is easy to share. Limitations: Like Pandas-Profiling, Sweetviz can struggle with very large or
wide datasets: the report becomes cluttered and loading may slow down. Also, it offers less
granular control (you must accept its defaults) compared to manual plotting.

• D-Tale: This tool provides an interactive web GUI for Pandas data 22 . Launching D-Tale in a notebook
(via dtale.show(df) ) opens a web interface where you can sort, filter, and search the DataFrame
as if in a spreadsheet 22 23 . You can edit cells, hide columns, and even plot columns on the fly (e.g.
click on a column to see distribution or correlations). D-Tale is useful for exploratory “play” when you
want to drill into data without writing code. It also can export the filtering actions as Python code.
Strengths: instant, detailed inspection of any part of the data (the interface is “detailed and very
simple” to use 22 ). Limitations: It requires a browser session and is not suited to automated
pipelines. Very large data can be slow to load because the whole DataFrame must be held in memory, and the
interface can hang if the dataset has millions of rows.

• AutoViz (AutoViz_Class): AutoViz automates chart creation with minimal input 24 . You
instantiate AutoViz_Class() and call its AutoViz method (e.g. AV = AutoViz_Class() followed by
AV.AutoViz("datafile.csv") ). It scans the dataset and
produces a wide variety of plots: histograms, boxplots, scatterplots, violin plots, and correlation
matrices for all features 25 . If a target column is specified, it also performs class-based
comparisons. AutoViz is designed to be very fast, generating dozens of plots in seconds 24 26 .
Strengths: One command yields many insights (it often produces more plots than Pandas-Profiling
or Sweetviz 26 ). It works well for moderate data. Limitations: For very large datasets, AutoViz
samples the data (by default up to ~150,000 rows 27 ) to avoid long runtimes; the cap is controlled by the
max_rows_analyzed argument. Also, the large number of auto-generated plots can be overwhelming and
is not easily customized on the fly. AutoViz is best for an initial broad survey of the data.

Other emerging tools include pandasgui (a PyQt-based GUI for DataFrames) and Lux (a visualization
recommendation engine built on Pandas), but the above are the most common auto-EDA libraries in
Python.
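The one-line report generators above can be wrapped so that large DataFrames are down-sampled first, which is one way to work around the size limitations noted for each tool. This is only a sketch: the helper names ( sample_for_report , profile_to_html , sweetviz_to_html ), the 50,000-row cap and the output file names are assumptions, and the two report functions need the ydata-profiling and sweetviz packages installed.

```python
import pandas as pd

def sample_for_report(df, max_rows=50_000, seed=0):
    """Return df unchanged if small, else a random sample (max_rows is an arbitrary cap)."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)

def profile_to_html(df, path="profile.html"):
    """Write a ydata-profiling report; requires `pip install ydata-profiling`."""
    from ydata_profiling import ProfileReport  # imported lazily: optional dependency
    report = ProfileReport(sample_for_report(df), title="EDA profile", minimal=True)
    report.to_file(path)

def sweetviz_to_html(df, path="sweetviz.html"):
    """Write a Sweetviz report; requires `pip install sweetviz`."""
    import sweetviz as sv  # imported lazily: optional dependency
    report = sv.analyze(sample_for_report(df))
    report.show_html(path, open_browser=False)

# Example of the sampling step on a hypothetical large frame.
big = pd.DataFrame({"x": range(120_000)})
print(len(sample_for_report(big)))
```

Sampling trades exactness for speed: counts and rare categories in the report then describe the sample, not the full dataset.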

Example Workflow (Combining Manual and Automated Steps)

1. Load data & initial summary: Use Pandas to load the dataset ( df = pd.read_csv(...) ). Check
df.info() for data types and null counts, and df.describe() for basic stats. To automate this,
run a profiling report (e.g. ProfileReport(df) ) which will instantly list number of rows/columns,
missing value counts, duplicate count, and basic summaries 16 .
2. Univariate exploration: Examine individual features. For numeric features, plot histograms or
boxplots (e.g. sns.histplot(df['age']) or sns.boxplot(x=df['salary']) ) to see
distributions and outliers. For categorical features, use
df['catcol'].value_counts().plot(kind='bar') . Alternatively, run AutoViz or Sweetviz,
which will automatically generate histograms and bar charts for every column 24 19 .
3. Handle missing values and outliers: Identify missing data ( df.isnull().sum() or heatmap) and
decide how to treat it (drop or impute 10 11 ). Remove or cap outliers identified earlier (e.g. drop
points with |z| > 3 13 or outside the 1.5×IQR fences 3 ). Automated reports like Pandas-Profiling
flag columns with many missing or extreme values for you.
4. Bivariate and multivariate analysis: Plot relationships: scatter plots ( sns.scatterplot ) or
pairplots for numerical pairs 5 , and boxplots for categorical–numeric comparisons. Compute a
correlation matrix ( df.corr(numeric_only=True) ) and plot it to find strong linear relationships 6 . You can also use
a Sweetviz or profiling report to see correlation heatmaps or target analysis at a glance.
5. Interactive drilling (optional): If something still needs inspection, launch D-Tale
( dtale.show(df) ) to interactively filter or pivot the data and generate quick plots without writing
code 22 23 . For example, you might filter to a subset of interest and immediately see that subset’s
distribution.
6. Document findings: Summarize any important patterns discovered (e.g. feature distributions,
missing value impact, strong correlations) before moving to modeling. The reports from profiling or
sweetviz can be shared with stakeholders, while manual plots can be refined for presentations.

Throughout, one alternates between manual and automated tools: use code for precise control (e.g.
customizing a specific Seaborn plot) and use auto-tools for broad overviews. Combining these
approaches reduces the risk of blind spots in the EDA.
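As a rough illustration of how the manual steps of this workflow could feed the "document findings" step, the sketch below collects shape, missing counts, duplicates, IQR-rule outlier counts and strongly correlated pairs into one dictionary. The function name quick_eda_summary , the 0.8 correlation threshold and the inline CSV data are all hypothetical.

```python
import io
import pandas as pd

# Hypothetical dataset, inlined so the sketch is self-contained.
CSV = """age,salary,dept
25,36000,ops
31,42000,ops
,39000,sales
47,41000,sales
38,400000,ops
29,,sales
"""

def quick_eda_summary(df, corr_threshold=0.8):
    """Collect a few headline EDA facts into a plain dictionary."""
    numeric = df.select_dtypes(include="number")
    summary = {
        "shape": df.shape,
        "missing": df.isnull().sum().to_dict(),
        "duplicates": int(df.duplicated().sum()),
    }
    # Outlier counts per numeric column using the 1.5×IQR rule.
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    summary["outliers"] = mask.sum().to_dict()
    # Pairs of numeric columns whose absolute correlation reaches the threshold.
    corr = numeric.corr()
    cols = list(corr.columns)
    summary["strong_pairs"] = [
        (a, b, round(float(corr.loc[a, b]), 3))
        for i, a in enumerate(cols) for b in cols[i + 1:]
        if abs(corr.loc[a, b]) >= corr_threshold
    ]
    return summary

df = pd.read_csv(io.StringIO(CSV))
report = quick_eda_summary(df)
print(report)
```

A dictionary like this is easy to log or paste into notes alongside the shared HTML reports.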

Sources: The above practices are standard in Python-based EDA (as documented in tutorials and library
references 2 5 10 24 ). Each listed automated tool is open-source and documented in the Python
ecosystem 16 19 22 24 . The figures shown are examples from Pandas-Profiling and Sweetviz outputs.

1 19 21 SweetViz | Automated Exploratory Data Analysis (EDA) | GeeksforGeeks
https://www.geeksforgeeks.org/sweetviz-automated-exploratory-data-analysis-eda/

2 4 5 6 7 8 9 What is Univariate, Bivariate & Multivariate Analysis in Data Visualisation? | GeeksforGeeks
https://www.geeksforgeeks.org/what-is-univariate-bivariate-multivariate-analysis-in-data-visualisation/

3 Finding the outlier points from Matplotlib | GeeksforGeeks
https://www.geeksforgeeks.org/finding-the-outlier-points-from-matplotlib/

10 11 12 How to Handle Missing Data | Data Cleaning | Exploratory Data Analysis | by DataScienceSphere | Medium
https://medium.com/@datasciencejourney100_83560/how-to-handle-missing-data-data-cleaning-exploratory-data-analysis-b706abc563ec

13 Z score for Outlier Detection – Python | GeeksforGeeks
https://www.geeksforgeeks.org/z-score-for-outlier-detection-python/

14 Log Transformation and visualizing it using Python | by Tarique Akhtar | Medium
https://tariqueakhtar-39220.medium.com/log-transformation-and-visualizing-it-using-python-392cb4bcfc74

15 Exploratory Data Analysis Using Python
https://www.analyticsvidhya.com/blog/2022/07/step-by-step-exploratory-data-analysis-eda-using-python/

16 20 Better EDA with 3 Easy Python Libraries for Any Beginner
https://www.analyticsvidhya.com/blog/2021/08/better-eda-with-3-easy-python-libraries-for-any-beginner/

17 25 Automated EDA with Pandas Profiling, SweetViz, and Autoviz: Streamlining Data Analysis Efforts | by Nilam Ayu Rosari | Medium
https://medium.com/@nilamayurosari99/automated-eda-with-pandas-profiling-sweetviz-and-autoviz-streamlining-data-analysis-efforts-effdd26c2165

18 Pandas Profiling (ydata-profiling) in Python: A Guide for Beginners | DataCamp
https://www.datacamp.com/tutorial/pandas-profiling-ydata-profiling-in-python-guide

22 Exploratory Data Analysis Tools. Pandas-Profiling, Sweetviz, D-Tale | by Karteek Menda | Medium
https://medium.com/@karteekmenda93/exploratory-data-analysis-tools-83ef538c879f

23 Exploring Pandas DataFrame With D-Tale - Analytics Vidhya
https://www.analyticsvidhya.com/blog/2021/06/exploring-pandas-dataframe-with-d-tale/

24 Automated EDA using pandas profiling,sweetviz,autoviz | by Guhanesvar | Analytics Vidhya | Medium
https://medium.com/analytics-vidhya/automated-eda-using-pandas-profiling-sweetviz-autoviz-4f15c4031a12

27 AutoViz/autoviz/AutoViz_Class.py at master - GitHub
https://github.com/AutoViML/AutoViz/blob/master/autoviz/AutoViz_Class.py
