
Lecture 2c

Exploratory Data Analysis (EDA)

Silvia Ahmed, CSE445 Machine Learning, ECE@NSU


Topics

• Introduction to EDA

• Descriptive Statistics

• Univariate Visualizations

• Multivariate Visualizations

• Handling Categorical Variables

• Advanced Visualization Techniques

• Q&A
Learning goals
• After this presentation, you should be able to:
1.Understand and apply univariate, multivariate, and categorical data
visualization techniques, such as histograms, scatter plots, and bar
charts, to effectively explore and interpret data.
2.Use advanced visualization techniques, including PCA, t-SNE, and
UMAP, to reduce the dimensionality of high-dimensional datasets and
visualize complex data relationships.
3.Create and interpret density-based visualizations, such as hexbin plots
and contour plots, to analyze large datasets with overlapping data points.
4.Leverage specialized visualization methods, such as radar charts and
dendrograms, to compare multiple variables and represent hierarchical or
networked data structures.
5.Effectively choose the appropriate visualization technique based on
the data type and analysis goals to enhance data exploration, pattern
identification, and communication of insights.
Introduction to Exploratory Data Analysis (EDA)
• A crucial step in understanding the structure, patterns, and
relationships in a dataset before applying machine learning
models.
• Why is EDA important?
• Uncover data patterns: Identify trends and correlations.
• Check Assumptions: Validate assumptions about the data (e.g., distribution, linearity).
• Spot Anomalies: Detect outliers or errors.
• Feature Understanding: Determine which features are important for
modeling.

Introduction to EDA (contd.)
• Key objectives of EDA:
• Summarize the dataset: Use numerical and graphical methods to
describe data.
• Visualize relationships: Identify how features relate to each other.
• Hypothesis generation: Form hypotheses about the data that can be tested in later analysis.

EDA vs Data Preprocessing
• Data preprocessing: Prepare raw data for analysis and modeling. Steps include handling missing data, removing outliers, scaling and normalization, and encoding categorical variables.

• EDA: Understand the dataset by summarizing its main characteristics. Steps:
• Descriptive statistics (mean, median, standard deviation).
• Visualizing distributions (histograms, boxplots).
• Investigating relationships between variables (scatter plots, heatmaps).
• Detecting patterns and anomalies.

EDA vs Data Preprocessing (contd.)

Aspect | Data Preprocessing | EDA
Goal | Prepare data for analysis | Understand data, detect patterns
Key Steps | Cleaning, transforming, encoding | Descriptive statistics, visualizations
Focus | Making data suitable for ML models | Exploring data for insights and anomalies
Outcome | Clean, structured data | Hypotheses and insights

Descriptive Statistics for EDA
• Descriptive statistics summarize and describe the main features
of a dataset quantitatively. They provide a numerical overview
of the data.
• Key descriptive metrics:
1. Mean: The average value of a dataset.
2. Median: The middle value when data is sorted, which is less
affected by outliers.
3. Mode: The most frequently occurring value in the data.
4. Standard Deviation: Measures the spread or variability
around the mean. A larger value means the data is more
spread out.
Descriptive Statistics for EDA (contd.)
5. Skewness: Indicates asymmetry in the data distribution. Positive skew
means a long tail on the right, and negative skew means a long tail on the
left.
6. Kurtosis: Indicates the "tailedness" of the data distribution. Higher
kurtosis means more data is in the tails, indicating more extreme values
(outliers).
7. Interquartile Range (IQR): The range between the 25th percentile (Q1)
and 75th percentile (Q3), providing insight into data spread and identifying
potential outliers.
• Importance:
• Summarize data characteristics before using visualizations.
• Spot anomalies like outliers through values like skewness, kurtosis, and IQR.
• Guide the choice of visualizations, such as histograms for skewed distributions or
boxplots for outlier detection.
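
As a quick illustration, here is a minimal pandas sketch (assuming seaborn's bundled Iris dataset is available) that computes these metrics for one feature:

```python
import seaborn as sns

# Load the Iris dataset bundled with seaborn (downloaded on first use)
df = sns.load_dataset("iris")

# Count, mean, std, min, Q1, median, Q3, max in one call
print(df["sepal_length"].describe())

# Skewness and kurtosis of a single feature
print("skew:", df["sepal_length"].skew())
print("kurtosis:", df["sepal_length"].kurtosis())

# Interquartile range and the usual 1.5 * IQR outlier fences
q1, q3 = df["sepal_length"].quantile([0.25, 0.75])
iqr = q3 - q1
print("IQR:", iqr, "fences:", (q1 - 1.5 * iqr, q3 + 1.5 * iqr))
```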

Visualizing Data Distributions
• Visualizing the distribution of data helps in understanding how
values are spread across a range, whether the data is skewed,
and whether there are outliers. It gives a clearer picture of the
shape, central tendency, and variability of the data.
• Key Visualization Techniques:
• Histogram
• Density Plot
• Box Plot

Univariate Visualization Techniques
• These visualizations focus on a single variable at a time,
providing insights into its distribution, central tendency, and
spread. They are key to understanding individual feature
behavior before exploring relationships with other variables.
• Key Univariate Visualization Techniques:
• Histogram
• Boxplot
• Violin Plot

Histogram
• A histogram is a graphical representation of the distribution of a
continuous variable.
• It divides the range of the variable into intervals (bins) and
displays how many data points fall into each bin.
• Use Case:
• Ideal for understanding the distribution (e.g., normal, skewed) of a
single continuous variable.
• Helps to identify potential outliers or data that is not normally
distributed.

Histogram (contd.)
• Key Features:
1.X-axis (Bins): Represents the
range of data divided into equal
intervals.
2.Y-axis (Frequency): Shows the
count of data points that fall
within each bin.
3.Bin Width: Affects the level of
detail in the histogram. Smaller
bins provide more detail, but too
many bins can be noisy.
4.Skewness: Can reveal whether
data is symmetric or skewed
(left or right).
Figure: A simple histogram showing the frequency distribution of Sepal Length from the Iris dataset.
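
A minimal matplotlib sketch of such a histogram (assuming seaborn's bundled Iris dataset):

```python
import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset("iris")

# bins controls the detail/noise trade-off described above
plt.hist(iris["sepal_length"], bins=20, edgecolor="black")
plt.xlabel("Sepal Length (cm)")
plt.ylabel("Frequency")
plt.title("Distribution of Sepal Length")
plt.show()
```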

Boxplot
• A boxplot (or box-and-whisker plot) is a standardized way of
displaying the distribution of data based on a five-number
summary: minimum, first quartile (Q1), median, third quartile
(Q3), and maximum. It also highlights potential outliers.
• Use Case:
• Ideal for comparing distributions between different categories or
groups.
• Excellent for identifying outliers, spread, and skewness of the data.

Boxplot (contd.)
• Key Components:
1.Box: Represents the interquartile
range (IQR), from Q1 to Q3.
2.Median Line: A line inside the box
that shows the median (Q2) of the
data.
3.Whiskers: Extend from the box to the most extreme data points that lie within 1.5 times the IQR of Q1 and Q3.
4.Outliers: Points plotted outside the whiskers are considered outliers.

Figure: A boxplot visualizing the distribution of the Fare variable in the Titanic dataset, including median, quartiles, and potential outliers.
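
A minimal seaborn sketch of such a boxplot (assuming seaborn's bundled Titanic dataset):

```python
import matplotlib.pyplot as plt
import seaborn as sns

titanic = sns.load_dataset("titanic")

# Box spans Q1 to Q3, the inner line marks the median, whiskers reach
# the most extreme points within 1.5 * IQR, and outliers plot beyond them
sns.boxplot(x=titanic["fare"])
plt.xlabel("Fare")
plt.show()
```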

Violin Plot
• A violin plot is a combination of a boxplot and a kernel density
plot. It shows the distribution of the data, its probability density,
and provides insights into the spread, center, and skewness.
• Use Case:
• Ideal for comparing distributions between different groups or
categories.
• Provides a detailed view of the data distribution, including multimodal
distributions that boxplots may not reveal.

Violin Plot (contd.)
• Key Features:
1.Kernel Density Plot: The "violin"
shape shows the distribution of the
data's density. Wider sections
represent a higher concentration of
data points.
2.Boxplot Inside: Contains the median
and interquartile range (IQR) similar to
a regular boxplot.
3.Symmetry: Symmetrical violins indicate a symmetric distribution, while asymmetry reveals skewness.

Figure: A violin plot visualizing the distribution and density of the Fare variable in the Titanic dataset, with the inner boxplot showing the median and quartiles.
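
A minimal seaborn sketch of such a violin plot (assuming the bundled Titanic dataset):

```python
import matplotlib.pyplot as plt
import seaborn as sns

titanic = sns.load_dataset("titanic")

# The outline is a kernel density estimate; inner="box" draws the
# miniature boxplot with the median and IQR inside the violin
sns.violinplot(x=titanic["fare"], inner="box")
plt.xlabel("Fare")
plt.show()
```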

Multivariate Visualization Techniques
• Multivariate visualizations help us understand relationships
between three or more variables simultaneously. These
techniques are essential for identifying complex interactions in
datasets with multiple features.
• Key Multivariate Visualization Techniques:
• Scatter Plot
• Pair Plot (or Scatterplot matrix)
• Correlation Heatmap

Scatter Plot
• A scatter plot is a graphical representation of the relationship
between two continuous variables. Each point on the plot
represents an observation, with the x-axis corresponding to one
variable and the y-axis to another.
• Use Case:
• Useful for visualizing the relationship between two continuous
variables.
• Helps in identifying patterns such as linearity, clusters, or outliers.
• Can incorporate a third variable using color, size, or shape to show
additional relationships.

Scatter Plot (contd.)
• Key Features:
1.X-axis: Represents the independent
variable (e.g., RM - number of rooms).
2.Y-axis: Represents the dependent
variable (e.g., MEDV - median house
value).
3.Dots (Points): Each dot represents an
observation in the dataset.
4.Color/Size/Shape: Additional
dimensions can be represented using
different colors, sizes, or shapes of the
points.

Figure: A scatter plot displaying the relationship between RM (number of rooms) and MEDV (median house value), with color representing the LSTAT feature (lower status of the population).
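
A minimal matplotlib sketch of this kind of colored scatter plot. The file name boston.csv is a placeholder: the Boston Housing data is assumed to be available locally as a CSV with RM, MEDV, and LSTAT columns.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder path; assumes a local copy of the Boston Housing data
df = pd.read_csv("boston.csv")

# Color encodes a third variable (LSTAT) on top of the x/y relationship
sc = plt.scatter(df["RM"], df["MEDV"], c=df["LSTAT"], cmap="viridis")
plt.colorbar(sc, label="LSTAT (% lower status of the population)")
plt.xlabel("RM (average number of rooms)")
plt.ylabel("MEDV (median house value)")
plt.show()
```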
Pair Plot
• A pair plot is a grid of scatter plots that visualizes pairwise
relationships between all numerical variables in a dataset. The
diagonal of the grid typically shows histograms or density plots
for each individual variable.
• Use Case:
• Ideal for quick, comprehensive exploration of all relationships in a
dataset.
• Helps to spot correlations, trends, clusters, and outliers.
• Useful for identifying potential interactions between variables that can
be further investigated in machine learning models.

Pair Plot (contd.)
• Key Features:
1.Pairwise Scatter Plots: Each
scatter plot shows the relationship
between two variables. The x-axis
represents one variable, and the
y-axis represents another.
2.Diagonal Plots: On the diagonal,
you typically find histograms or
kernel density estimates (KDEs)
of individual variables, showing
their distribution.
3.Color (Optional): Color can be
used to represent a categorical
variable, helping to reveal
clusters or patterns within the
relationships.
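
A minimal seaborn sketch of a pair plot (assuming the bundled Iris dataset), with species mapped to color:

```python
import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset("iris")

# Pairwise scatter plots for all numeric columns; KDEs on the diagonal;
# hue colors the points by the categorical species column
sns.pairplot(iris, hue="species", diag_kind="kde")
plt.show()
```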

Correlation Heatmap
• A correlation heatmap is a graphical representation of the correlation
matrix for a set of variables. It shows the strength and direction of
relationships between variables using color gradients. The
correlation coefficient (ranging from -1 to +1) indicates the degree of
linear association between two variables:
• +1: Perfect positive correlation (as one variable increases, so does the other).
• -1: Perfect negative correlation (as one variable increases, the other
decreases).
• 0: No linear relationship between variables.
• Use Case:
• Helps to quickly identify multicollinearity (high correlation between
independent variables), which can impact model performance.
• Useful for selecting features that are highly correlated with the target variable
but not with each other.

Correlation Heatmap (contd.)
• Key Features:
1.Color Intensity: The color in
each cell represents the
strength of the correlation,
with deeper colors indicating
stronger correlations (positive
or negative).
2.Positive vs. Negative
Correlation: Warm colors
(e.g., red) typically represent
positive correlations, while
cool colors (e.g., blue)
represent negative
correlations.
3.Diagonal: The diagonal
contains 1s (perfect
correlation with itself) and is
often ignored in analysis.
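
A minimal seaborn sketch of a correlation heatmap (assuming the bundled Iris dataset):

```python
import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset("iris")

# Correlation matrix over the numeric columns only
corr = iris.select_dtypes("number").corr()

# annot prints each coefficient; a diverging colormap keeps positive
# (warm) and negative (cool) correlations visually distinct
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```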

Visualizing Categorical Variables
• Categorical variables represent discrete groups or categories
(e.g., gender, class, or region). Visualizing them helps in
understanding the distribution of data across different
categories, comparing frequencies, and identifying patterns
between categories.
• Key Visualization Techniques for Categorical Data:
• Bar Plots
• Stacked Bar Charts
• Categorical Heatmaps

Bar Plots
• A bar plot (or bar chart) is a graphical representation of
categorical data where each category is represented by a bar.
The height (or length) of each bar corresponds to the value (or
count) of that category.
• Purpose:
• Comparison: Bar plots are particularly useful for comparing the
frequency or distribution of different categories in a dataset.
• Simple Interpretation: The height of the bars makes it easy to see
which categories dominate and how they compare to each other.

Bar Plots (contd.)
• Key Features:
1.Bars: Each bar represents a
distinct category, and its
height shows the frequency
or value for that category.
2.X-axis: Represents the
categorical variable (e.g.,
Pclass for Passenger Class).
3.Y-axis: Represents the
frequency/count or value of
each category (e.g., number
of passengers).
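
A minimal seaborn sketch of such a bar plot (assuming the bundled Titanic dataset):

```python
import matplotlib.pyplot as plt
import seaborn as sns

titanic = sns.load_dataset("titanic")

# countplot draws one bar per category, with bar height = frequency
sns.countplot(x="pclass", data=titanic)
plt.xlabel("Passenger Class (Pclass)")
plt.ylabel("Number of Passengers")
plt.show()
```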

Stacked Bar Charts
• A stacked bar chart is an extension of the basic bar plot, where
each bar is divided into segments. Each segment represents a
sub-category within the main category, and the height of each
segment shows the count or value for that sub-category. The
sum of all segments equals the total for the main category.
• Purpose:
• Comparison of Sub-Categories: Stacked bar charts are useful when
you want to compare both the total of each main category and the
breakdown of sub-categories.
• Visualizing Proportions: Helps to visualize how the sub-categories
(e.g., survived vs. not survived) contribute to the total of the primary
category (e.g., each class).

Stacked Bar Charts (contd.)
• Key Features:
1.Bars: Each bar represents a main
category (e.g., Pclass in the Titanic
dataset).
2.Segments within Bars: Each bar is
divided into segments representing sub-
categories (e.g., Survived or Not
Survived).
3.X-axis: Represents the primary
categorical variable (e.g., Pclass).
4.Y-axis: Represents the total count or
percentage of sub-categories within each
primary category.
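
A minimal pandas/matplotlib sketch of such a stacked bar chart (assuming the bundled Titanic dataset):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

titanic = sns.load_dataset("titanic")

# Cross-tabulate class against survival, then stack the counts per class
counts = pd.crosstab(titanic["pclass"], titanic["survived"])
counts.plot(kind="bar", stacked=True)
plt.xlabel("Passenger Class (Pclass)")
plt.ylabel("Number of Passengers")
plt.legend(title="Survived")
plt.show()
```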

Categorical Heatmaps
• A categorical heatmap is a visualization tool that uses color to
represent the frequency or count of occurrences between two
categorical variables. It provides a clear and effective way to
observe the relationships or interactions between two
categorical variables.
• Purpose:
• Visualizing Relationships: Categorical heatmaps are used to detect
and visualize patterns, relationships, or trends between two categorical
variables.
• Identifying Associations: They help in identifying which category
combinations are more frequent than others and are ideal for showing
interactions in larger datasets.

Categorical Heatmaps (contd.)
• Key Features:
1.Matrix Layout: A heatmap is structured
as a grid, with each row representing
one category from the first variable, and
each column representing one category
from the second variable.
2.Colors: The color intensity in each cell
of the heatmap corresponds to the
count or frequency of observations for
the combination of the two categories
(e.g., darker shades representing
higher frequencies).
3.Labels: Each cell can optionally display
the actual count value, making it easier
to interpret the relationships.
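
A minimal sketch of a categorical heatmap built from a cross-tabulation (assuming the bundled Titanic dataset):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

titanic = sns.load_dataset("titanic")

# Count passengers for every (embark_town, class) combination
table = pd.crosstab(titanic["embark_town"], titanic["class"])

# annot writes the raw count into each cell; darker cells = higher counts
sns.heatmap(table, annot=True, fmt="d", cmap="Blues")
plt.show()
```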

Advanced Visualization Techniques
• As data becomes larger and more complex, traditional
visualization techniques may not be sufficient. Advanced
visualization methods help simplify and extract insights from
high-dimensional datasets by reducing their complexity, while
still retaining important patterns and relationships.
• Key Techniques:
• PCA (Principal Component Analysis)
• Hexbin Plot
• t-SNE / UMAP (Non-linear Dimensionality Reduction)

Principal Component Analysis (PCA)
• Principal Component Analysis (PCA) is a dimensionality reduction
technique that transforms high-dimensional data into a lower-
dimensional space while retaining as much of the original variance
as possible.
• PCA helps visualize complex datasets in fewer dimensions (2D or
3D) by projecting data onto principal components that explain the
largest variance in the dataset.
• Purpose:
• Data Visualization: PCA allows us to visualize high-dimensional data in 2D
or 3D by capturing the most important patterns and relationships in the
dataset.
• Feature Extraction: Reducing data to fewer dimensions can help in
understanding complex datasets and removing noise or less informative
features.

PCA (contd.)
• Key Concepts:
1.Dimensionality Reduction: Reduces the
number of variables by projecting the
original data onto a smaller set of new,
uncorrelated variables (principal
components).
2.Variance Explained: PCA aims to capture
the most variance in the data using fewer
dimensions (e.g., reducing a 4D dataset to
2D for visualization).
3.Principal Components (PC): New axes (or dimensions) that are a linear combination of the original variables. The first principal component (PC1) captures the most variance, followed by PC2, PC3, etc.

Figure: A scatter plot showing the first two principal components (PC1 and PC2) of the Iris dataset, where each point represents a flower, and the color represents the species (Setosa, Versicolor, or Virginica).
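
A minimal scikit-learn sketch of this projection (assuming the bundled Iris dataset; the features are used unscaled here, though standardizing first is common practice):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

iris = sns.load_dataset("iris")
X = iris.drop(columns="species")

# Project the 4-D feature space onto the two directions of largest variance
pca = PCA(n_components=2)
pcs = pca.fit_transform(X)
print("variance explained:", pca.explained_variance_ratio_)

# Color each projected point by species
for species in iris["species"].unique():
    mask = (iris["species"] == species).to_numpy()
    plt.scatter(pcs[mask, 0], pcs[mask, 1], label=species)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
```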
Hexbin Plot
• A hexbin plot is a type of bivariate plot that displays the relationship
between two numerical variables using hexagonal bins.
• It's a useful alternative to scatter plots, especially when there are
many data points that overlap, causing overplotting.
• Instead of plotting individual points, the data is divided into
hexagonal cells, and the color of each cell corresponds to the
number of points within that bin.
• Purpose:
• Visualizing Large Datasets: Hexbin plots help in visualizing dense regions
of data, where individual scatter points overlap and clutter the plot.
• Identifying Patterns: This visualization makes it easier to identify trends,
correlations, and clustering in large datasets.

Hexbin Plot (contd.)
• Key Features:
1.Hexagonal Bins: The plot divides the
data space into hexagons, allowing for
better representation of dense data
points.
2.Density Representation: The color
intensity of each hexagon represents
the density or frequency of the data
points that fall within that bin. Darker
shades usually represent higher
concentrations of data points.
3.Efficient for Large Datasets:
Particularly useful when scatter plots
become unreadable due to
overplotting, which often happens with
large datasets.
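
A minimal matplotlib sketch using synthetic data dense enough to overplot an ordinary scatter plot:

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic correlated data: 10,000 points would clutter a scatter plot
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = x + rng.normal(scale=0.5, size=10_000)

# gridsize sets the number of hexagons; color encodes the count per bin
plt.hexbin(x, y, gridsize=40, cmap="viridis")
plt.colorbar(label="count per bin")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```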

t-SNE / UMAP
• t-SNE (t-Distributed Stochastic
Neighbor Embedding) and UMAP
(Uniform Manifold Approximation and
Projection) are advanced techniques for
reducing high-dimensional data to two
or three dimensions, focusing on
preserving the local structure and
neighborhood of the data points.
• Purpose: Ideal for visualizing complex datasets where clusters and relationships are not easily identifiable in the original high-dimensional space. Often used for high-dimensional data like images or genetic data.

Figure: A 2D t-SNE plot, showing how the three different species of Iris flowers cluster together.
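
A minimal scikit-learn sketch of a 2D t-SNE embedding (assuming the bundled Iris dataset; perplexity is a tunable hyperparameter, not a fixed choice):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE

iris = sns.load_dataset("iris")
X = iris.drop(columns="species")

# Embed the 4-D features into 2-D; perplexity balances local vs. global
# structure and usually needs tuning per dataset
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

for species in iris["species"].unique():
    mask = (iris["species"] == species).to_numpy()
    plt.scatter(emb[mask, 0], emb[mask, 1], label=species)
plt.legend()
plt.show()
```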

Key Takeaways from EDA
• Key Points:
• EDA is essential for understanding the data.
• Visualizations are powerful tools to reveal insights.
• Select the right visualization based on the type of data and analysis
goals.
• Visual: A summary infographic with different visualization types.

Summary
Univariate Visualization Technique | Description | Example Use Case
Histogram | Displays the distribution of a single numeric variable. | Visualizing the distribution of house prices.
Box Plot | Summarizes the distribution of a numeric variable, showing the median, quartiles, and outliers. | Comparing the salaries of employees in different departments.
Violin Plot | Combines a box plot and density plot to show the distribution of a variable and its probability density. | Comparing the distribution of exam scores between two classes.
Density Plot | Smooth, continuous version of a histogram that estimates the probability density function. | Visualizing the probability distribution of a stock price.

Summary (contd.)
Multivariate Visualization Technique | Description | Example Use Case
Scatter Plot | Shows the relationship between two numeric variables. | Analyzing the relationship between house size and price.
Pair Plot | Visualizes pairwise relationships between multiple numeric variables. | Exploring relationships in the Iris dataset between all features.
Heatmap (Correlation) | Displays correlations between numeric variables using color intensity. | Identifying multicollinearity in the features of the Boston Housing dataset.
Hexbin Plot | Displays the density of points between two numeric variables using hexagonal bins. | Visualizing point density in large datasets like population data.
t-SNE/UMAP | Dimensionality reduction techniques for visualizing high-dimensional data in 2D/3D space. | Clustering similar species in the Iris dataset.

Summary (contd.)
Categorical Data Visualization Technique | Description | Example Use Case
Bar Plot | Compares the frequency of different categories. | Comparing the number of passengers in each passenger class (Pclass) on the Titanic.
Stacked Bar Chart | Shows the composition of sub-categories within each category. | Visualizing survival rates across passenger classes on the Titanic.
Categorical Heatmap | Shows the relationship between two categorical variables using color. | Displaying the count of passengers by port of embarkation and class on the Titanic.

References and further reading
• Chapter 2, "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow", Aurélien Géron.
• Jupyter Notebook:
• Under Module 2 in Canvas: T2c_EDA_Data_Visualization
