Exploratory Data Analysis (EDA) in Python
Exploratory Data Analysis (EDA) is the process of analyzing and summarizing the main characteristics of a
dataset to uncover patterns, relationships, and anomalies 1. In Python, this can be done manually using
libraries like Pandas, Matplotlib and Seaborn, or automatically using specialized EDA tools. Manual EDA
gives fine control over each step (and is often iterative), while automated tools generate quick reports and
visualizations with minimal code. We outline both approaches below, along with common visualization
techniques.
• Numerical vs numerical: Use scatter plots (e.g. sns.scatterplot) to see correlations or trends
between two numeric features 5. You can also plot a line graph if one variable is time or otherwise
ordered. If there are many variables, a correlation heatmap
(sns.heatmap(df.corr(numeric_only=True))) shows pairwise linear correlations across all numeric
fields 6.
• Categorical vs numerical: Use grouped plots such as boxplots or violin plots to compare the
distribution of a numeric variable across categories 7. For example, sns.boxplot(x=cat, y=num)
shows the spread within each group, while sns.barplot(x=cat, y=num) compares group means (see
the bar-chart example in 7).
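The bivariate plots described above can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the DataFrame, its column names (height, weight, group) and the random data are all assumptions made up for the example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "height": rng.normal(170, 10, n),
    "group": rng.choice(["a", "b", "c"], n),
})
df["weight"] = 0.9 * df["height"] - 90 + rng.normal(0, 5, n)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Numerical vs numerical: scatter plot of two numeric features.
sns.scatterplot(data=df, x="height", y="weight", ax=axes[0])

# Correlation heatmap restricted to numeric columns.
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm", ax=axes[1])

# Categorical vs numerical: distribution of a numeric variable per category.
sns.boxplot(data=df, x="group", y="weight", ax=axes[2])

fig.tight_layout()
```

Call plt.show() or fig.savefig(...) afterwards to view or store the figure. Passing numeric_only=True keeps the non-numeric "group" column out of the correlation matrix.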
• Multivariate analysis (multiple variables): To explore more than two variables at once, use
techniques like:
  • Pairplot (Scatterplot matrix): Plots all pairwise scatterplots of numeric features (and histograms on
  the diagonal). This provides a quick visual overview of relationships among many variables.
  • Dimensionality reduction plots: Apply PCA or t-SNE on numeric data and scatter the first 2–3
  components (with points colored by a category) 9. For example, plotting the first two principal
  components of the Iris dataset (colored by species) can reveal clusters (as shown in 9).
  • Heatmaps: A heatmap of the correlation matrix (numeric variables) or of a distance/association
  measure summarizes multivariate structure 6.
  These multivariate plots help uncover patterns not seen in pairwise analyses alone.
• Outlier detection: Outliers (extreme values) can distort analysis. Two common rules are:
  • Z-score rule: Compute the z-score for each point ((x–μ)/σ) and flag points with |z| > 3 (beyond 3
  standard deviations) as outliers 13.
  • IQR rule: Mark as outliers any points below Q1−1.5×IQR or above Q3+1.5×IQR 3. These are exactly
  the points beyond the whiskers of a boxplot 3.
  Visual inspection with a boxplot or scatterplot often helps spot outliers. Once identified, outliers can
  be removed or capped, or further investigated case-by-case.
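The two rules above can be sketched with plain Pandas. The function names (zscore_outliers, iqr_bounds, iqr_outliers) and the synthetic series are hypothetical, introduced only for this example.

```python
import numpy as np
import pandas as pd


def zscore_outliers(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Boolean mask: True where |z| exceeds the threshold."""
    z = (s - s.mean()) / s.std(ddof=0)
    return z.abs() > threshold


def iqr_bounds(s: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value lies outside the IQR fences."""
    lower, upper = iqr_bounds(s, k)
    return (s < lower) | (s > upper)


# Synthetic example: 500 roughly normal values plus two injected extremes.
rng = np.random.default_rng(42)
values = pd.Series(np.append(rng.normal(50, 5, 500), [120.0, -40.0]))

z_mask = zscore_outliers(values)
iqr_mask = iqr_outliers(values)

# Option 1: remove flagged points. Option 2: cap them at the fences.
removed = values[~iqr_mask]
lower, upper = iqr_bounds(values)
capped = values.clip(lower=lower, upper=upper)

print("z-score flags:", int(z_mask.sum()), "| IQR flags:", int(iqr_mask.sum()))
```

The IQR rule usually flags somewhat more points than the z-score rule on near-normal data, so the two masks need not agree exactly.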
• Data distribution & transformation: Assess each numeric variable’s distribution using histograms
or density plots. Compute skewness and kurtosis to quantify asymmetry and tail weight. If a variable
is highly skewed (e.g. log-normal), apply transformations such as logarithm, square-root, or
Box–Cox (which requires strictly positive values) to “normalize” it
14 . For example, a log transform is often used to reduce right-skew: it “stretches out” the long tail
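As a minimal sketch of the distribution checks and transformations described in this item, the example below measures skewness before and after a log and a Box–Cox transform. The log-normal "income" series is synthetic and assumed only for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic right-skewed variable (log-normal), assumed for illustration.
rng = np.random.default_rng(7)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=2000), name="income")

# Quantify the shape of the raw distribution.
skew_raw = income.skew()
kurt_raw = income.kurt()  # excess kurtosis

# Log transform; log1p also handles zeros safely.
log_income = np.log1p(income)
skew_log = log_income.skew()

# Box-Cox estimates a power parameter (lambda); input must be strictly positive.
boxcox_values, lam = stats.boxcox(income)
skew_boxcox = pd.Series(boxcox_values).skew()

print(f"skew raw={skew_raw:.2f}, log={skew_log:.2f}, box-cox={skew_boxcox:.2f}")
print(f"excess kurtosis raw={kurt_raw:.2f}, estimated lambda={lam:.3f}")
```

A skewness close to zero after the transform suggests the variable is now roughly symmetric; a histogram of log_income is a useful visual confirmation.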
Data/Relationship             | Typical Plots/Techniques
Numeric (single feature)      | Histogram or density plot; Boxplot/Violin (to show median/IQR/outliers)
Categorical (single)          | Bar chart of counts (countplot); Pie chart (rarely, for 2–3 categories)
Numeric vs Numeric            | Scatter plot; Hexbin (for large data); Pairplot (scatterplot matrix)
Categorical vs Numeric        | Boxplot or Violin plot (split by category); Bar plot of group means
Time series (Numeric vs Time) | Line plot (trend over time); Seasonal decomposition (trend/season plots)
Table: Common visualization choices for different data types and relationships (implementable via Matplotlib/
Seaborn). The examples above are supported by standard Python libraries 2 5 15 .
Figure: Pandas Profiling report “Overview” for the Iris dataset. The left panel shows dataset statistics (5
variables, 150 observations, 0 missing) and duplicate count, while the right panel lists variable types 16 .
• Pandas-Profiling (ydata-profiling): This library generates a comprehensive HTML report with one
command (e.g. ProfileReport(df)). It extends df.describe() by including variable type
inference, unique counts, missing-value summaries, descriptive stats (mean, median, quartiles,
skewness, etc.), histograms, correlations, and more 16 17. For instance, the report shows
distribution plots and flags high correlations or missing data. It saves considerable time (only a
few lines of code are needed) and is interactive: you can drill into any variable’s details.
Limitations: Pandas-Profiling can be slow or fail on very large datasets, because it computes many
statistics for every column 18. The resulting report is also rendered in the browser, which can be
heavy. It is therefore best used on small-to-medium data to get an initial overview.
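A minimal sketch of generating a report while working around the size limitation noted above. The helper names (sample_for_report, build_profile), the row thresholds and the output file name are assumptions for this example; ProfileReport, its minimal option and to_file come from the ydata-profiling package, which must be installed separately.

```python
import pandas as pd


def sample_for_report(df: pd.DataFrame, max_rows: int = 50_000, seed: int = 0) -> pd.DataFrame:
    """Return df unchanged if small, otherwise a random sample of max_rows rows."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


def build_profile(df: pd.DataFrame, out_path: str = "profile.html") -> str:
    """Write an HTML profiling report for (a sample of) df and return its path."""
    # Imported lazily so the sampling helper works without the package installed.
    from ydata_profiling import ProfileReport

    sampled = sample_for_report(df)
    # minimal=True skips the more expensive computations; the 10,000-row
    # cut-off used here is an arbitrary assumption.
    report = ProfileReport(sampled, title="EDA report", minimal=len(sampled) > 10_000)
    report.to_file(out_path)
    return out_path
```

Calling build_profile(df) then opening profile.html in a browser gives the overview; sampling first keeps the runtime and report size manageable on larger data.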
Figure: Sweetviz output for the Iris dataset. Sweetviz produces a self-contained HTML report with summary
statistics and histograms for each feature (see e.g. sepal/petal length and width) 19.
• Sweetviz: Another auto-EDA library, Sweetviz generates visually rich reports in one line
(sv.analyze(df)) 19 20. The output focuses on visual comparisons: it plots side-by-side histograms,
summary tables, and correlation heatmaps. Notably, Sweetviz highlights how each feature relates to a
target variable (“target analysis”) and can compare two datasets (e.g. train vs. test) in one
report 21 20. It also reports missing values, duplicates, and frequent values. Sweetviz is very quick
to set up and its HTML report is easy to share. Limitations: Like Pandas-Profiling, Sweetviz can
struggle with very large or wide datasets: the report becomes cluttered and loading may slow down.
It also offers less granular control than manual plotting (you must largely accept its defaults).
• D-Tale: This tool provides an interactive web GUI for Pandas data 22. Launching D-Tale in a notebook
(via dtale.show(df)) opens a web interface where you can sort, filter, and search the DataFrame
as if in a spreadsheet 22 23. You can edit cells, hide columns, and even plot columns on the fly (e.g.
click on a column to see its distribution or correlations). D-Tale is useful for exploratory “play”
when you want to drill into data without writing code. It can also export the filtering actions as
Python code. Strengths: instant, detailed inspection of any part of the data (the interface is
“detailed and very simple” to use 22). Limitations: It requires a local browser interface and is not
suited to automated pipelines. Loading very large data can be slow (it attempts to load all data into
the browser), and the interface can hang if the dataset has millions of rows.
• AutoViz (AutoViz_Class): AutoViz automates chart creation with minimal input 24. You simply
instantiate AutoViz_Class() and call its AutoViz("datafile.csv") method. It scans the dataset and
produces a wide variety of plots: histograms, boxplots, scatterplots, violin plots, and correlation
matrices for all features 25. If a target column is specified, it also performs class-based
comparisons. AutoViz is designed to be very fast, generating dozens of plots in seconds 24 26.
Strengths: One command yields many insights (it often produces more plots than Pandas-Profiling
or Sweetviz 26). It works well for moderate data. Limitations: For very large datasets, AutoViz
samples the data (by default up to ~150,000 rows 27) to avoid long runtimes. Beyond that, a paid
license may be required. Also, the large number of auto-generated plots can be overwhelming, and
they are not easily customized on the fly. AutoViz is best for an initial broad survey of the data.
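A minimal sketch of the AutoViz call described above, wrapped so it accepts either a CSV path or a DataFrame. The helpers autoviz_kwargs and run_autoviz are hypothetical; the keyword arguments reflect the AutoViz method's commonly documented parameters and should be checked against the installed version.

```python
import pandas as pd


def autoviz_kwargs(data, target: str = "", max_rows: int = 150_000) -> dict:
    """Build keyword arguments for AutoViz from a CSV path or a DataFrame."""
    if isinstance(data, pd.DataFrame):
        filename, frame = "", data  # a DataFrame is passed via dfte
    else:
        filename, frame = str(data), None
    return {
        "filename": filename,
        "sep": ",",
        "depVar": target,           # optional target column
        "dfte": frame,
        "header": 0,
        "verbose": 0,
        "chart_format": "svg",
        "max_rows_analyzed": max_rows,  # rows sampled for large datasets
        "max_cols_analyzed": 30,
    }


def run_autoviz(data, target: str = "", max_rows: int = 150_000):
    """Run AutoViz and return whatever the library returns."""
    # Imported lazily; requires the autoviz package to be installed.
    from autoviz.AutoViz_Class import AutoViz_Class

    av = AutoViz_Class()
    return av.AutoViz(**autoviz_kwargs(data, target, max_rows))
```

For example, run_autoviz("datafile.csv") mirrors the one-line usage in the text, while run_autoviz(df, target="label") would analyze an in-memory DataFrame against a hypothetical "label" column.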
Other emerging tools include pandasgui (a PyQt-based GUI for DataFrames) and Lux (a visualization
recommendation engine built on Pandas), but the above are the most common auto-EDA libraries in
Python.
In practice, alternate between manual and automated tools: use code for precise control (e.g.
customizing a specific Seaborn plot) and auto-tools for broad overviews. Combining these
approaches reduces the risk of blind spots in the EDA.
Sources: The above practices are standard in Python-based EDA (as documented in tutorials and library
references 2 5 10 24 ). Each listed automated tool is open-source and documented in the Python
ecosystem 16 19 22 24 . The figures shown are examples from Pandas-Profiling and Sweetviz outputs.
2 4 5 6 7 8 9 What is Univariate, Bivariate & Multivariate Analysis in Data Visualisation? | GeeksforGeeks
https://www.geeksforgeeks.org/what-is-univariate-bivariate-multivariate-analysis-in-data-visualisation/
10 11 12 How to Handle Missing Data | Data Cleaning | Exploratory Data Analysis | by DataScienceSphere | Medium
https://medium.com/@datasciencejourney100_83560/how-to-handle-missing-data-data-cleaning-exploratory-data-analysis-b706abc563ec
17 25 Automated EDA with Pandas Profiling, SweetViz, and Autoviz: Streamlining Data Analysis Efforts | Medium
https://medium.com/analytics-vidhya/automated-eda-using-pandas-profiling-sweetviz-autoviz-4f15c4031a12
22 Exploratory Data Analysis Tools. Pandas-Profiling, Sweetviz, D-Tale | by Karteek Menda | Medium
https://medium.com/@karteekmenda93/exploratory-data-analysis-tools-83ef538c879f