
Exploratory Data Analysis (EDA) in Python

Exploratory Data Analysis (EDA) is the process of analyzing and summarizing the main characteristics of a
dataset to uncover patterns, relationships, and anomalies 1 . In Python, this can be done manually using
libraries like Pandas, Matplotlib and Seaborn, or automatically using specialized EDA tools. Manual EDA
gives fine control over each step (and is often iterative), while automated tools generate quick reports and
visualizations with minimal code. We outline both approaches below, along with common visualization
techniques.

Manual EDA Approach with Pandas/Matplotlib/Seaborn


• Univariate analysis (single variable): For numeric variables, compute summary statistics (e.g.
using df.describe() , which yields count, mean, std, min/max, quartiles) and plot distributions.
Common plots are histograms or KDE plots to see the shape of the distribution 2 , and box plots to
show median, quartiles and flag outliers 3 . For example, a histogram ( sns.histplot() ) reveals
how values cluster, while a boxplot ( sns.boxplot() ) highlights any extreme points beyond the
1.5×IQR whiskers 3 . For categorical variables, use value counts and bar charts (or countplots) to
show frequencies of each category 4 . (Pie charts are also possible, though less informative.) These
univariate plots help detect skewness, modality, and basic anomalies in each variable.
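The univariate steps above can be sketched in a few lines of Pandas; the DataFrame and its columns (age, city) are illustrative, and the Seaborn plot calls are shown as comments:

```python
import pandas as pd

# Toy data with one numeric and one categorical column (illustrative names)
df = pd.DataFrame({
    "age": [22, 25, 31, 40, 41, 43, 50, 95],   # 95 is a deliberate extreme value
    "city": ["NY", "NY", "LA", "SF", "LA", "NY", "SF", "NY"],
})

# Numeric: summary statistics (count, mean, std, min/max, quartiles)
stats = df["age"].describe()
print(stats)

# Categorical: frequency of each category
counts = df["city"].value_counts()
print(counts)

# The same columns can then be plotted, e.g.:
#   sns.histplot(df["age"])      # distribution shape
#   sns.boxplot(x=df["age"])     # median, quartiles, outliers
#   sns.countplot(x="city", data=df)
```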

• Bivariate analysis (two variables): Explore pairwise relationships:

• Numerical vs numerical: Use scatter plots (e.g. sns.scatterplot ) to see correlations or trends
between two numeric features 5 . You can also plot a line graph if one variable is time or ordered. If
there are many variables, a correlation heatmap ( sns.heatmap(df.corr()) ) shows linear
correlations across all numeric fields 6 .
• Categorical vs numerical: Use grouped plots such as boxplots or violin plots to compare the
distribution of a numeric variable across categories 7 . For example, sns.barplot(x=cat,
y=num) or sns.boxplot(x=cat, y=num) can reveal differences in group means or spread (see
the bar-chart example in 7 ).

• Categorical vs categorical: Create cross-tabulations or grouped bar charts to compare two categorical features. For instance, a sns.countplot with hue= can show joint frequencies (as in 8 ). Alternatively, a heatmap of the contingency table highlights associations between categories. These plots reveal how categories interact or depend on each other.
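Each of the three pairwise cases has a numeric counterpart that can be computed before plotting; a small sketch with made-up columns (height, weight, gender, smoker):

```python
import pandas as pd

df = pd.DataFrame({
    "height": [150, 160, 165, 170, 180, 185],
    "weight": [50, 58, 61, 66, 75, 80],
    "gender": ["F", "F", "F", "M", "M", "M"],
    "smoker": ["no", "yes", "no", "no", "yes", "yes"],
})

# Numerical vs numerical: correlation matrix of the numeric columns
corr = df[["height", "weight"]].corr()
print(corr)

# Categorical vs numerical: compare a numeric variable across groups
group_means = df.groupby("gender")["weight"].mean()
print(group_means)

# Categorical vs categorical: contingency table (basis for grouped bars or a heatmap)
ct = pd.crosstab(df["gender"], df["smoker"])
print(ct)
```

The same objects feed directly into the plots mentioned above, e.g. sns.heatmap(corr) or sns.boxplot(x="gender", y="weight", data=df).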

• Multivariate analysis (multiple variables): To explore more than two variables at once, use
techniques like:

• Pairplot (Scatterplot matrix): Plots all pairwise scatterplots of numeric features (and histograms on
the diagonal). This provides a quick visual overview of relationships among many variables.

• Dimensionality reduction plots: Apply PCA or t-SNE on numeric data and scatter the first 2–3
components (with points colored by a category) 9 . For example, plotting the first two principal
components of the Iris dataset (colored by species) can reveal clusters (as shown in 9 ).
• Heatmaps: A heatmap of the correlation matrix (numeric variables) or of a distance/association
measure summarizes multivariate structure 6 .

These multivariate plots help uncover patterns not seen in pairwise analyses alone.
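A minimal PCA sketch using only NumPy (centering followed by SVD) on synthetic data; in practice sklearn.decomposition.PCA computes the same projection:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 100 samples, 4 strongly correlated numeric features
base = rng.normal(size=(100, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(100, 1)) for _ in range(4)])

# PCA by hand: center the data, then take the SVD; rows of Vt are principal axes
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Xc @ Vt.T[:, :2]        # project onto the first two PCs

# Fraction of variance explained by each component
explained = S**2 / np.sum(S**2)
print(explained[:2])
# With 4 near-duplicate features, PC1 captures almost all the variance,
# so a 2-D scatter of `components` (colored by a category) preserves
# most of the structure.
```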

• Handling missing data: First detect missingness by checking df.isnull().sum() or using df.info() , which reports non-null counts 10 . Missing data can be visualized via heatmaps (e.g. sns.heatmap(df.isnull(), cbar=False) ) or with the missingno library. Common strategies to handle missing values are deletion (e.g. df.dropna() ) or imputation. Deletion can mean dropping all rows (or columns) with nulls 11 . Imputation replaces nulls with estimates: e.g. fill numeric columns with the mean or median, and categorical columns with the mode 12 . More advanced methods include interpolation or model-based imputation. The choice depends on the data and the amount of missingness.
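Detection, deletion, and simple imputation in a short sketch; the toy DataFrame and its columns (salary, dept) are illustrative:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "salary": [50_000, np.nan, 62_000, 58_000, np.nan],
    "dept": ["IT", "HR", None, "IT", "IT"],
})

# Detect: null count per column
print(df.isnull().sum())

# Option 1 - deletion: drop any row containing a null
dropped = df.dropna()

# Option 2 - imputation: numeric -> median, categorical -> mode
filled = df.copy()
filled["salary"] = filled["salary"].fillna(filled["salary"].median())
filled["dept"] = filled["dept"].fillna(filled["dept"].mode()[0])

print(len(dropped))                  # rows remaining after deletion
print(filled.isnull().sum().sum())   # no nulls remain after imputation
```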

• Outlier detection: Outliers (extreme values) can distort analysis. Two common rules are:

• Z-score rule: Compute the z-score for each point ( (x–μ)/σ ) and flag points with |z| > 3 (beyond 3
standard deviations) as outliers 13 .

• IQR rule: Mark as outliers any points below Q1−1.5×IQR or above Q3+1.5×IQR 3 . These are exactly
the points beyond the whiskers of a boxplot 3 .
Visual inspection with a boxplot or scatterplot often helps spot outliers. Once identified, outliers can
be removed or capped, or further investigated case-by-case.
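Both rules in code, applied to a made-up series with one injected extreme value (the series is large enough for the z-score rule to trigger, which it cannot on very small samples):

```python
import numpy as np
import pandas as pd

values = [10, 12, 11, 13, 12, 11, 14, 10, 13, 12] * 3   # 30 typical points
values[-1] = 200                                        # inject one extreme value
s = pd.Series(values)

# Z-score rule: flag |z| > 3 (more than 3 standard deviations from the mean)
z = (s - s.mean()) / s.std()
z_outliers = s[np.abs(z) > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (the boxplot whiskers)
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(z_outliers.tolist())    # [200]
print(iqr_outliers.tolist())  # [200]
```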

• Data distribution & transformation: Assess each numeric variable’s distribution using histograms or density plots. Compute skewness/kurtosis to quantify asymmetry. If a variable is highly skewed (e.g. log-normal), apply transformations such as logarithm, square-root, or Box–Cox to “normalize” it 14 . For example, a log transform is often used to reduce right-skew: it compresses the long right tail of a skewed distribution, making it more symmetric 14 . Other transforms (e.g. min–max scaling or standard scaling) are used to normalize ranges (important later for modeling). Always check distributions before and after transformation to confirm improvement.
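The before/after skewness check can be sketched with synthetic log-normal data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# A right-skewed (log-normal) variable
x = pd.Series(np.exp(rng.normal(0, 1, size=1000)))
print(x.skew())         # strongly positive: long right tail

# Log transform compresses the tail; log1p (log(1 + x)) also handles zeros safely
x_log = np.log1p(x)
print(x_log.skew())     # much closer to 0 after transformation
```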

Data/Relationship              Typical Plots/Techniques
Numeric (single feature)       Histogram or density plot; Boxplot/Violin (to show median/IQR/outliers)
Categorical (single)           Bar chart of counts (countplot); Pie chart (rarely, for 2–3 categories)
Numeric vs Numeric             Scatter plot; Hexbin (for large data); Pairplot (scatterplot matrix)
Categorical vs Numeric         Boxplot or Violin plot (split by category); Bar plot of group means
Categorical vs Categorical     Grouped/stacked bar charts; Mosaic plots; Heatmap of contingency (crosstab)
Time series (Numeric vs Time)  Line plot (trend over time); Seasonal decomposition (trend/season plots)
Multivariate (many vars)       Pairplot (all pairs of numeric); PCA/t-SNE scatter; Heatmaps of correlation

Table: Common visualization choices for different data types and relationships (implementable via Matplotlib/Seaborn). The examples above are supported by standard Python libraries 2 5 15 .

Automated EDA Tools and Libraries


Several Python libraries automate much of this process by scanning a DataFrame and producing reports or
dashboards:

Figure: Pandas Profiling report “Overview” for the Iris dataset. The left panel shows dataset statistics (5 variables, 150 observations, 0 missing) and the duplicate count, while the right panel lists variable types 16 .

• Pandas-Profiling (ydata-profiling): This library generates a comprehensive HTML report with one command (e.g. ProfileReport(df) ). It extends df.describe() by including variable type inference, unique counts, missing-value summaries, descriptive stats (mean, median, quartiles, skewness, etc.), histograms, correlations, and more 16 17 . For instance, the report shows distribution plots and flags high correlations or missing data. It is very time-saving (requires only a few lines of code) and interactive – you can drill into any variable’s details. Limitations: Pandas-Profiling can be slow or fail on very large datasets, because it computes many statistics for every column 18 . It also loads data into the browser, which can be heavy. It is therefore best used on small-to-medium data to get an initial overview.


Figure: Sweetviz output for the Iris dataset. Sweetviz produces a self-contained HTML report with summary statistics and histograms for each feature (see e.g. sepal/petal length and width) 19 .

• Sweetviz: Another auto-EDA library, Sweetviz generates visually rich reports in one line ( sv.analyze(df) ) 19 20 . The output focuses on visual comparisons: it plots side-by-side histograms, summary tables, and correlation heatmaps. Notably, Sweetviz highlights how each feature relates to a target variable (“target analysis”) and can compare two datasets (e.g. train vs. test) in one report 21 20 . It also reports missing values, duplicates, and frequent values. Sweetviz is very quick to set up and its HTML report is easy to share. Limitations: Like Pandas-Profiling, Sweetviz can struggle with very large or wide datasets – the report becomes cluttered and loading may slow down. It also offers less granular control (you must accept its defaults) compared to manual plotting.

• D-Tale: This tool provides an interactive web GUI for Pandas data 22 . Launching D-Tale in a notebook
(via dtale.show(df) ) opens a web interface where you can sort, filter, and search the DataFrame
as if in a spreadsheet 22 23 . You can edit cells, hide columns, and even plot columns on the fly (e.g.
click on a column to see distribution or correlations). D-Tale is useful for exploratory “play” when you
want to drill into data without writing code. It can also export the filtering actions as Python code.
Strengths: instant, detailed inspection of any part of the data (the interface is “detailed and very
simple” to use 22 ). Limitations: It requires a local browser interface and is not suited to automated
pipelines. Loading very large data can be slow (it attempts to load all data into the browser), and the
interface can hang if the dataset has millions of rows.

• AutoViz (AutoViz_Class): AutoViz automates chart creation with minimal input 24 . You simply instantiate AutoViz_Class() and call its AutoViz("datafile.csv") method. It scans the dataset and
produces a wide variety of plots: histograms, boxplots, scatterplots, violin plots, and correlation
matrices for all features 25 . If a target column is specified, it also performs class-based
comparisons. AutoViz is designed to be very fast, generating dozens of plots in seconds 24 26 .
Strengths: One command yields many insights (it often produces more plots than Pandas-Profiling
or Sweetviz 26 ). It works well for moderate data. Limitations: For very large datasets, AutoViz
samples the data (by default up to ~150,000 rows 27 ) to avoid long runtimes. Beyond that, a paid license may be required. Also, the large number of auto-generated plots can be overwhelming and
are not easily customized on-the-fly. AutoViz is best for an initial broad survey of the data.

Other emerging tools include pandasgui (a PyQt-based GUI for DataFrames) and Lux (a visualization
recommendation engine built on Pandas), but the above are the most common auto-EDA libraries in
Python.

Example Workflow (Combining Manual and Automated Steps)


1. Load data & initial summary: Use Pandas to load the dataset ( df = pd.read_csv(...) ). Check
df.info() for data types and null counts, and df.describe() for basic stats. To automate this,
run a profiling report (e.g. ProfileReport(df) ) which will instantly list number of rows/columns,
missing value counts, duplicate count, and basic summaries 16 .
2. Univariate exploration: Examine individual features. For numeric features, plot histograms or
boxplots (e.g. sns.histplot(df['age']) or sns.boxplot(df['salary']) ) to see
distributions and outliers. For categorical features, use
df['catcol'].value_counts().plot(kind='bar') . Alternatively, run AutoViz or Sweetviz,
which will automatically generate all histograms and bar charts for you 24 19 .
3. Handle missing and outliers: Identify missing data ( df.isnull().sum() or heatmap) and
decide how to treat it (drop or impute 10 11 ). Remove or cap outliers identified earlier (e.g. drop
points with z-score >3 13 or outside 1.5×IQR 3 ). Automated reports like Pandas-Profiling will
already flag high-missing and extreme-value columns for you.
4. Bivariate and multivariate analysis: Plot relationships – scatter plots ( sns.scatterplot ) or
pairplots for numerical pairs 5 , and boxplots for categorical–numeric comparisons. Compute a
correlation matrix ( df.corr() ) and plot it to find strong linear relationships 6 . You can also use
a Sweetviz or profiling report to see correlation heatmaps or target analysis at a glance.
5. Interactive drilling (optional): If something still needs inspection, launch D-Tale
( dtale.show(df) ) to interactively filter or pivot the data and generate quick plots without writing
code 22 23 . For example, you might filter to a subset of interest and immediately see that subset’s
distribution.
6. Document findings: Summarize any important patterns discovered (e.g. feature distributions,
missing value impact, strong correlations) before moving to modeling. The reports from Pandas-Profiling or Sweetviz can be shared with stakeholders, while manual plots can be refined for presentations.
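Steps 1, 3, and 4 above can be compressed into a few lines of Pandas; the dataset here is a made-up stand-in for pd.read_csv(...) and the column names are illustrative:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset standing in for pd.read_csv(...)
df = pd.DataFrame({
    "age": [23, 35, np.nan, 45, 29, 62, 41, 150],   # 150: suspicious extreme value
    "income": [30, 55, 48, 70, 42, 90, 60, 65],
    "segment": ["A", "B", "A", "B", "A", "B", "B", None],
})

# Step 1 - initial summary: basic stats and null counts
print(df.describe())
print(df.isnull().sum())

# Step 3 - treat missing values (median/mode imputation) and IQR outliers
df["age"] = df["age"].fillna(df["age"].median())
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df["age"] >= q1 - 1.5 * iqr) & (df["age"] <= q3 + 1.5 * iqr)]

# Step 4 - bivariate check: correlation between the numeric columns
print(df[["age", "income"]].corr())
```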

Throughout, one alternates between manual and automated tools: use code for precise control (e.g.
customizing a specific Seaborn plot) and use auto-tools for broad overviews. By combining these
approaches, you ensure no blind spots in the EDA.

Sources: The above practices are standard in Python-based EDA (as documented in tutorials and library
references 2 5 10 24 ). Each listed automated tool is open-source and documented in the Python
ecosystem 16 19 22 24 . The figures shown are examples from Pandas-Profiling and Sweetviz outputs.

1, 19, 21 — SweetViz | Automated Exploratory Data Analysis (EDA) | GeeksforGeeks
https://www.geeksforgeeks.org/sweetviz-automated-exploratory-data-analysis-eda/

2, 4, 5, 6, 7, 8, 9 — What is Univariate, Bivariate & Multivariate Analysis in Data Visualisation? | GeeksforGeeks
https://www.geeksforgeeks.org/what-is-univariate-bivariate-multivariate-analysis-in-data-visualisation/

3 — Finding the outlier points from Matplotlib | GeeksforGeeks
https://www.geeksforgeeks.org/finding-the-outlier-points-from-matplotlib/

10, 11, 12 — How to Handle Missing Data | Data Cleaning | Exploratory Data Analysis | by DataScienceSphere | Medium
https://medium.com/@datasciencejourney100_83560/how-to-handle-missing-data-data-cleaning-exploratory-data-analysis-b706abc563ec

13 — Z score for Outlier Detection – Python | GeeksforGeeks
https://www.geeksforgeeks.org/z-score-for-outlier-detection-python/

14 — Log Transformation and visualizing it using Python | by Tarique Akhtar | Medium
https://tariqueakhtar-39220.medium.com/log-transformation-and-visualizing-it-using-python-392cb4bcfc74

15 — Exploratory Data Analysis Using Python | Analytics Vidhya
https://www.analyticsvidhya.com/blog/2022/07/step-by-step-exploratory-data-analysis-eda-using-python/

16, 20 — Better EDA with 3 Easy Python Libraries for Any Beginner | Analytics Vidhya
https://www.analyticsvidhya.com/blog/2021/08/better-eda-with-3-easy-python-libraries-for-any-beginner/

17, 25 — Automated EDA with Pandas Profiling, SweetViz, and Autoviz: Streamlining Data Analysis Efforts | by Nilam Ayu Rosari | Medium
https://medium.com/@nilamayurosari99/automated-eda-with-pandas-profiling-sweetviz-and-autoviz-streamlining-data-analysis-efforts-effdd26c2165

18 — Pandas Profiling (ydata-profiling) in Python: A Guide for Beginners | DataCamp
https://www.datacamp.com/tutorial/pandas-profiling-ydata-profiling-in-python-guide

22 — Exploratory Data Analysis Tools. Pandas-Profiling, Sweetviz, D-Tale | by Karteek Menda | Medium
https://medium.com/@karteekmenda93/exploratory-data-analysis-tools-83ef538c879f

23 — Exploring Pandas DataFrame With D-Tale | Analytics Vidhya
https://www.analyticsvidhya.com/blog/2021/06/exploring-pandas-dataframe-with-d-tale/

24, 26 — Automated EDA using pandas profiling, sweetviz, autoviz | by Guhanesvar | Analytics Vidhya | Medium
https://medium.com/analytics-vidhya/automated-eda-using-pandas-profiling-sweetviz-autoviz-4f15c4031a12

27 — AutoViz/autoviz/AutoViz_Class.py at master | GitHub
https://github.com/AutoViML/AutoViz/blob/master/autoviz/AutoViz_Class.py
