Lecture 2c
Exploratory Data Analysis (EDA)
Silvia Ahmed CSE445 Machine Learning ECE@NSU 1
Topics
• Introduction to EDA
• Descriptive Statistics
• Univariate Visualizations
• Multivariate Visualizations
• Handling Categorical Variables
• Advanced Visualization techniques
• Q&A

Learning goals
• After this presentation, you should be able to:
1. Understand and apply univariate, multivariate, and categorical data visualization techniques, such as histograms, scatter plots, and bar charts, to effectively explore and interpret data.
2. Use advanced visualization techniques, including PCA, t-SNE, and UMAP, to reduce the dimensionality of high-dimensional datasets and visualize complex data relationships.
3. Create and interpret density-based visualizations, such as hexbin plots and contour plots, to analyze large datasets with overlapping data points.
4. Leverage specialized visualization methods, such as radar charts and dendrograms, to compare multiple variables and represent hierarchical or networked data structures.
5. Choose the appropriate visualization technique based on the data type and analysis goals to enhance data exploration, pattern identification, and communication of insights.

Introduction to Exploratory Data Analysis (EDA)
• A crucial step in understanding the structure, patterns, and relationships in a dataset before applying machine learning models.
• Why is EDA important?
  • Uncover data patterns: Identify trends and correlations.
  • Check assumptions: Validate assumptions about the data (e.g., distribution, linearity).
  • Spot anomalies: Detect outliers or errors.
  • Feature understanding: Determine which features are important for modeling.
Introduction to EDA (contd.)
• Key objectives of EDA:
  • Summarize the dataset: Use numerical and graphical methods to describe data.
  • Visualize relationships: Identify how features relate to each other.
  • Hypothesis generation: Form hypotheses about the data that can be tested in later analysis.
EDA vs Data Preprocessing
• Data preprocessing: Prepare raw data for analysis and modeling. Steps include handling missing data, removing outliers, scaling and normalization, and encoding categorical variables.
• EDA: Understand the dataset by summarizing its main characteristics. Steps:
  • Descriptive statistics (mean, median, standard deviation).
  • Visualizing distributions (histograms, boxplots).
  • Investigating relationships between variables (scatter plots, heatmaps).
  • Detecting patterns and anomalies.
EDA vs Data Preprocessing (contd.)
| Aspect | Data Preprocessing | EDA |
| --- | --- | --- |
| Goal | Prepare data for analysis | Understand data, detect patterns |
| Key Steps | Cleaning, transforming, encoding | Descriptive statistics, visualizations |
| Focus | Making data suitable for ML models | Exploring data for insights and anomalies |
| Outcome | Clean, structured data | Hypotheses and insights |
Descriptive Statistics for EDA
• Descriptive statistics summarize and describe the main features of a dataset quantitatively. They provide a numerical overview of the data.
• Key descriptive metrics:
1. Mean: The average value of a dataset.
2. Median: The middle value when data is sorted, which is less affected by outliers.
3. Mode: The most frequently occurring value in the data.
4. Standard Deviation: Measures the spread or variability around the mean. A larger value means the data is more spread out.

Descriptive Statistics for EDA (contd.)
5. Skewness: Indicates asymmetry in the data distribution. Positive skew means a long tail on the right, and negative skew means a long tail on the left.
6. Kurtosis: Indicates the "tailedness" of the data distribution. Higher kurtosis means more data is in the tails, indicating more extreme values (outliers).
7. Interquartile Range (IQR): The range between the 25th percentile (Q1) and 75th percentile (Q3), providing insight into data spread and identifying potential outliers.
• Importance:
  • Summarize data characteristics before using visualizations.
  • Spot anomalies like outliers through values like skewness, kurtosis, and IQR.
  • Guide the choice of visualizations, such as histograms for skewed distributions or boxplots for outlier detection.
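The metrics above map directly onto pandas Series methods. A minimal sketch on a small made-up sample, chosen to include one extreme point:

```python
import pandas as pd

# Hypothetical numeric sample with one extreme value (an outlier)
s = pd.Series([12, 15, 14, 15, 16, 13, 14, 15, 90])

mean = s.mean()      # average value, pulled upward by the outlier
median = s.median()  # middle value, robust to the outlier
mode = s.mode()[0]   # most frequent value
std = s.std()        # spread around the mean
skew = s.skew()      # positive here: long tail on the right
kurt = s.kurt()      # large when the tails are heavy

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1        # interquartile range

# Points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are flagged as outliers
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())
```

Note that `s.describe()` reports the count, mean, std, and quartiles in one call; skewness and kurtosis must be requested separately.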
Visualizing Data Distributions
• Visualizing the distribution of data helps in understanding how values are spread across a range, whether the data is skewed, and whether there are outliers. It gives a clearer picture of the shape, central tendency, and variability of the data.
• Key Visualization Techniques:
  • Histogram
  • Density Plot
  • Box Plot
Univariate Visualization Techniques
• These visualizations focus on a single variable at a time, providing insights into its distribution, central tendency, and spread. They are key to understanding individual feature behavior before exploring relationships with other variables.
• Key Univariate Visualization Techniques:
  • Histogram
  • Boxplot
  • Violin Plot
Histogram
• A histogram is a graphical representation of the distribution of a continuous variable.
• It divides the range of the variable into intervals (bins) and displays how many data points fall into each bin.
• Use Case:
  • Ideal for understanding the distribution (e.g., normal, skewed) of a single continuous variable.
  • Helps to identify potential outliers or data that is not normally distributed.
Histogram (contd.)
• Key Features:
1. X-axis (Bins): Represents the range of data divided into equal intervals.
2. Y-axis (Frequency): Shows the count of data points that fall within each bin.
3. Bin Width: Affects the level of detail in the histogram. Smaller bins provide more detail, but too many bins can be noisy.
4. Skewness: Can reveal whether data is symmetric or skewed (left or right).
Figure: A simple histogram showing the frequency distribution of Sepal Length from the Iris dataset.
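A histogram like the one in the figure takes a few lines of matplotlib. This sketch assumes scikit-learn's bundled copy of the Iris dataset:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
sepal_length = iris.frame["sepal length (cm)"]

# 20 equal-width bins; experiment with the count, since too many bins look noisy
counts, bin_edges, _ = plt.hist(sepal_length, bins=20, edgecolor="black")
plt.xlabel("Sepal Length (cm)")
plt.ylabel("Frequency")
plt.title("Distribution of Sepal Length (Iris)")
plt.savefig("hist_sepal_length.png")
```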
Boxplot
• A boxplot (or box-and-whisker plot) is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. It also highlights potential outliers.
• Use Case:
  • Ideal for comparing distributions between different categories or groups.
  • Excellent for identifying outliers, spread, and skewness of the data.
Boxplot (contd.)
• Key Components:
1. Box: Represents the interquartile range (IQR), from Q1 to Q3.
2. Median Line: A line inside the box that shows the median (Q2) of the data.
3. Whiskers: Extend from the box toward the minimum and maximum values, but only up to 1.5 times the IQR.
4. Outliers: Points plotted outside the whiskers are considered outliers.
Figure: A boxplot visualizing the distribution of the Fare variable in the Titanic dataset, including median, quartiles, and potential outliers.
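A matplotlib sketch of the same idea, using synthetic right-skewed "fare" values rather than the real Titanic data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical right-skewed fare-like values plus a few extreme ones
fares = np.concatenate([rng.gamma(shape=2.0, scale=15.0, size=200),
                        [250.0, 300.0, 512.0]])

fig, ax = plt.subplots()
# whis=1.5: whiskers reach the last data point within 1.5 * IQR of the box
box = ax.boxplot(fares, whis=1.5)
ax.set_ylabel("Fare")
ax.set_title("Boxplot of synthetic Fare values")

# Points beyond the whiskers are drawn as individual outlier markers
outliers = box["fliers"][0].get_ydata()
print(len(outliers), "outliers flagged")
fig.savefig("boxplot_fare.png")
```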
Violin Plot
• A violin plot is a combination of a boxplot and a kernel density plot. It shows the distribution of the data, its probability density, and provides insights into the spread, center, and skewness.
• Use Case:
  • Ideal for comparing distributions between different groups or categories.
  • Provides a detailed view of the data distribution, including multimodal distributions that boxplots may not reveal.
Violin Plot (contd.)
• Key Features:
1. Kernel Density Plot: The "violin" shape shows the distribution of the data's density. Wider sections represent a higher concentration of data points.
2. Boxplot Inside: Contains the median and interquartile range (IQR) similar to a regular boxplot.
3. Symmetry: Symmetrical violins indicate a symmetric distribution, while asymmetry reveals skewness.
Figure: A violin plot visualizing the distribution and density of the Fare variable in the Titanic dataset, with the inner boxplot showing the median and quartiles.
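A sketch with matplotlib's `violinplot`, using made-up data in which one group is bimodal, exactly the case a boxplot would hide:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Two hypothetical groups: one unimodal, one bimodal
group_a = rng.normal(50, 8, 300)
group_b = np.concatenate([rng.normal(35, 5, 150), rng.normal(70, 5, 150)])

fig, ax = plt.subplots()
parts = ax.violinplot([group_a, group_b], showmedians=True)
ax.set_xticks([1, 2], labels=["Group A", "Group B"])
ax.set_ylabel("Score")
ax.set_title("Group B is bimodal: two wide lobes in its violin")
fig.savefig("violin.png")
```

seaborn's `violinplot` additionally draws the inner boxplot shown in the slide's figure; the plain matplotlib version marks only the medians here.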
Multivariate Visualization Techniques
• Multivariate visualizations help us understand relationships between three or more variables simultaneously. These techniques are essential for identifying complex interactions in datasets with multiple features.
• Key Multivariate Visualization Techniques:
  • Scatter Plot
  • Pair Plot (or Scatterplot Matrix)
  • Correlation Heatmap
Scatter Plot
• A scatter plot is a graphical representation of the relationship between two continuous variables. Each point on the plot represents an observation, with the x-axis corresponding to one variable and the y-axis to another.
• Use Case:
  • Useful for visualizing the relationship between two continuous variables.
  • Helps in identifying patterns such as linearity, clusters, or outliers.
  • Can incorporate a third variable using color, size, or shape to show additional relationships.
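The color-as-third-variable idea can be sketched as follows. The RM/MEDV/LSTAT values here are synthetic stand-ins for the Boston Housing columns, not the real data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Synthetic stand-ins: RM (rooms), LSTAT (% lower status), MEDV (house value)
rm = rng.uniform(4, 9, 300)
lstat = np.clip(40 - 4 * rm + rng.normal(0, 3, 300), 1, None)
medv = np.clip(-10 + 5 * rm - 0.3 * lstat + rng.normal(0, 3, 300), 5, 50)

fig, ax = plt.subplots()
# The third variable (LSTAT) is encoded as point color via c= and a colormap
sc = ax.scatter(rm, medv, c=lstat, cmap="viridis", s=20)
fig.colorbar(sc, label="LSTAT (% lower status)")
ax.set_xlabel("RM (number of rooms)")
ax.set_ylabel("MEDV (median house value, $1000s)")
fig.savefig("scatter_rm_medv.png")
```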
Scatter Plot (contd.)
• Key Features:
1. X-axis: Represents the independent variable (e.g., RM, the number of rooms).
2. Y-axis: Represents the dependent variable (e.g., MEDV, the median house value).
3. Dots (Points): Each dot represents an observation in the dataset.
4. Color/Size/Shape: Additional dimensions can be represented using different colors, sizes, or shapes of the points.
Figure: A scatter plot displaying the relationship between RM (number of rooms) and MEDV (median house value), with color representing the LSTAT feature (lower status of the population).

Pair Plot
• A pair plot is a grid of scatter plots that visualizes pairwise relationships between all numerical variables in a dataset. The diagonal of the grid typically shows histograms or density plots for each individual variable.
• Use Case:
  • Ideal for quick, comprehensive exploration of all relationships in a dataset.
  • Helps to spot correlations, trends, clusters, and outliers.
  • Useful for identifying potential interactions between variables that can be further investigated in machine learning models.
Pair Plot (contd.)
• Key Features:
1. Pairwise Scatter Plots: Each scatter plot shows the relationship between two variables. The x-axis represents one variable, and the y-axis represents another.
2. Diagonal Plots: On the diagonal, you typically find histograms or kernel density estimates (KDEs) of individual variables, showing their distribution.
3. Color (Optional): Color can be used to represent a categorical variable, helping to reveal clusters or patterns within the relationships.
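seaborn's `pairplot` is the usual tool for this; a dependency-lighter sketch uses `pandas.plotting.scatter_matrix` on the Iris data:

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.drop(columns="target")  # four numeric measurement columns

# Grid of pairwise scatter plots; the diagonal shows each variable's KDE
axes = scatter_matrix(df, figsize=(8, 8), diagonal="kde")
plt.savefig("pairplot_iris.png")
```

With seaborn, `sns.pairplot(df, hue="species")` would additionally color the points by category, revealing the per-species clusters.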
Correlation Heatmap
• A correlation heatmap is a graphical representation of the correlation matrix for a set of variables. It shows the strength and direction of relationships between variables using color gradients. The correlation coefficient (ranging from -1 to +1) indicates the degree of linear association between two variables:
  • +1: Perfect positive correlation (as one variable increases, so does the other).
  • -1: Perfect negative correlation (as one variable increases, the other decreases).
  • 0: No linear relationship between variables.
• Use Case:
  • Helps to quickly identify multicollinearity (high correlation between independent variables), which can impact model performance.
  • Useful for selecting features that are highly correlated with the target variable but not with each other.
Correlation Heatmap (contd.)
• Key Features:
1. Color Intensity: The color in each cell represents the strength of the correlation, with deeper colors indicating stronger correlations (positive or negative).
2. Positive vs. Negative Correlation: Warm colors (e.g., red) typically represent positive correlations, while cool colors (e.g., blue) represent negative correlations.
3. Diagonal: The diagonal contains 1s (each variable's perfect correlation with itself) and is often ignored in analysis.
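A correlation heatmap can be sketched with plain matplotlib; seaborn's `sns.heatmap(corr, annot=True, cmap="coolwarm")` is the one-line alternative:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.drop(columns="target")

corr = df.corr()  # Pearson correlation matrix, values in [-1, 1]

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)), labels=corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)), labels=corr.columns)
fig.colorbar(im, label="correlation")
# Annotate each cell with its coefficient
for i in range(len(corr)):
    for j in range(len(corr)):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
fig.tight_layout()
fig.savefig("corr_heatmap.png")
```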
Visualizing Categorical Variables
• Categorical variables represent discrete groups or categories (e.g., gender, class, or region). Visualizing them helps in understanding the distribution of data across different categories, comparing frequencies, and identifying patterns between categories.
• Key Visualization Techniques for Categorical Data:
  • Bar Plots
  • Stacked Bar Charts
  • Categorical Heatmaps
Bar Plots
• A bar plot (or bar chart) is a graphical representation of categorical data where each category is represented by a bar. The height (or length) of each bar corresponds to the value (or count) of that category.
• Purpose:
  • Comparison: Bar plots are particularly useful for comparing the frequency or distribution of different categories in a dataset.
  • Simple Interpretation: The height of the bars makes it easy to see which categories dominate and how they compare to each other.
Bar Plots (contd.)
• Key Features:
1. Bars: Each bar represents a distinct category, and its height shows the frequency or value for that category.
2. X-axis: Represents the categorical variable (e.g., Pclass for Passenger Class).
3. Y-axis: Represents the frequency/count or value of each category (e.g., number of passengers).
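A pandas sketch that counts a hypothetical `Pclass` column; the values below are made up, not the real Titanic counts:

```python
import pandas as pd

# Hypothetical passenger classes; the real column would come from the dataset
pclass = pd.Series([3, 1, 3, 1, 3, 3, 2, 3, 3, 2, 3, 1], name="Pclass")

counts = pclass.value_counts().sort_index()  # passengers per class, class order
ax = counts.plot(kind="bar")
ax.set_xlabel("Passenger Class (Pclass)")
ax.set_ylabel("Number of passengers")
ax.figure.savefig("barplot_pclass.png")
```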
Stacked Bar Charts
• A stacked bar chart is an extension of the basic bar plot, where each bar is divided into segments. Each segment represents a sub-category within the main category, and the height of each segment shows the count or value for that sub-category. The sum of all segments equals the total for the main category.
• Purpose:
  • Comparison of Sub-Categories: Stacked bar charts are useful when you want to compare both the total of each main category and the breakdown of sub-categories.
  • Visualizing Proportions: Helps to visualize how the sub-categories (e.g., survived vs. not survived) contribute to the total of the primary category (e.g., each class).
Stacked Bar Charts (contd.)
• Key Features:
1. Bars: Each bar represents a main category (e.g., Pclass in the Titanic dataset).
2. Segments within Bars: Each bar is divided into segments representing sub-categories (e.g., Survived or Not Survived).
3. X-axis: Represents the primary categorical variable (e.g., Pclass).
4. Y-axis: Represents the total count or percentage of sub-categories within each primary category.
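The crosstab-then-stack recipe can be sketched as follows, again with hypothetical survival data rather than the real Titanic records:

```python
import pandas as pd

# Hypothetical survival outcomes per class (not the real Titanic numbers)
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Cross-tabulate class vs. survival, then stack the sub-category counts
table = pd.crosstab(df["Pclass"], df["Survived"])
ax = table.plot(kind="bar", stacked=True)
ax.set_xlabel("Passenger Class")
ax.set_ylabel("Count")
ax.legend(["Not survived", "Survived"], title="Outcome")
ax.figure.savefig("stacked_bar.png")
```

Dividing `table` by its row sums (`table.div(table.sum(axis=1), axis=0)`) before plotting turns the counts into proportions per class.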
Categorical Heatmaps
• A categorical heatmap uses color to represent the frequency or count of occurrences between two categorical variables. It provides a clear and effective way to observe the relationships or interactions between the two variables.
• Purpose:
  • Visualizing Relationships: Categorical heatmaps are used to detect and visualize patterns, relationships, or trends between two categorical variables.
  • Identifying Associations: They help in identifying which category combinations are more frequent than others and are ideal for showing interactions in larger datasets.
Categorical Heatmaps (contd.)
• Key Features:
1. Matrix Layout: A heatmap is structured as a grid, with each row representing one category from the first variable, and each column representing one category from the second variable.
2. Colors: The color intensity in each cell corresponds to the count or frequency of observations for the combination of the two categories (e.g., darker shades representing higher frequencies).
3. Labels: Each cell can optionally display the actual count value, making it easier to interpret the relationships.
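A sketch that builds the frequency matrix with `pd.crosstab` and colors it with `imshow`; the Embarked/Pclass values are invented for illustration:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical embarkation-port / class combinations
df = pd.DataFrame({
    "Embarked": ["S", "S", "C", "S", "Q", "C", "S", "Q", "S", "C"],
    "Pclass":   [3, 1, 1, 3, 3, 2, 2, 3, 3, 1],
})

counts = pd.crosstab(df["Embarked"], df["Pclass"])  # frequency matrix

fig, ax = plt.subplots()
im = ax.imshow(counts, cmap="Blues")  # darker cell = more passengers
ax.set_xticks(range(counts.shape[1]), labels=counts.columns)
ax.set_yticks(range(counts.shape[0]), labels=counts.index)
ax.set_xlabel("Pclass")
ax.set_ylabel("Embarked")
# Optional labels: write the raw count into each cell
for i in range(counts.shape[0]):
    for j in range(counts.shape[1]):
        ax.text(j, i, counts.iloc[i, j], ha="center", va="center")
fig.savefig("cat_heatmap.png")
```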
Advanced Visualization Techniques
• As data becomes larger and more complex, traditional visualization techniques may not be sufficient. Advanced visualization methods help simplify and extract insights from high-dimensional datasets by reducing their complexity, while still retaining important patterns and relationships.
• Key Techniques:
  • PCA (Principal Component Analysis)
  • Hexbin Plot
  • t-SNE / UMAP (Non-linear Dimensionality Reduction)
Principal Component Analysis (PCA)
• Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space while retaining as much of the original variance as possible.
• PCA helps visualize complex datasets in fewer dimensions (2D or 3D) by projecting data onto principal components that explain the largest variance in the dataset.
• Purpose:
  • Data Visualization: PCA allows us to visualize high-dimensional data in 2D or 3D by capturing the most important patterns and relationships in the dataset.
  • Feature Extraction: Reducing data to fewer dimensions can help in understanding complex datasets and removing noise or less informative features.
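The projection described above takes a few lines with scikit-learn, whose bundled Iris dataset is used here:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
# Standardize first: PCA directions are sensitive to feature scale
X = StandardScaler().fit_transform(iris.data)

pca = PCA(n_components=2)   # project the 4D measurements down to 2D
X2 = pca.fit_transform(X)

fig, ax = plt.subplots()
ax.scatter(X2[:, 0], X2[:, 1], c=iris.target, cmap="viridis", s=20)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
fig.savefig("pca_iris.png")

# Fraction of the original variance each component retains
print(pca.explained_variance_ratio_)
```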
PCA (contd.)
• Key Concepts:
1. Dimensionality Reduction: Reduces the number of variables by projecting the original data onto a smaller set of new, uncorrelated variables (principal components).
2. Variance Explained: PCA aims to capture the most variance in the data using fewer dimensions (e.g., reducing a 4D dataset to 2D for visualization).
3. Principal Components (PC): New axes (or dimensions) that are a linear combination of the original variables. The first principal component (PC1) captures the most variance, followed by PC2, PC3, etc.
Figure: A scatter plot showing the first two principal components (PC1 and PC2) of the Iris dataset, where each point represents a flower, and the color represents the species (Setosa, Versicolor, or Virginica).

Hexbin Plot
• A hexbin plot is a type of bivariate plot that displays the relationship between two numerical variables using hexagonal bins.
• It is a useful alternative to scatter plots, especially when many overlapping data points cause overplotting.
• Instead of plotting individual points, the data is divided into hexagonal cells, and the color of each cell corresponds to the number of points within that bin.
• Purpose:
  • Visualizing Large Datasets: Hexbin plots help in visualizing dense regions of data, where individual scatter points overlap and clutter the plot.
  • Identifying Patterns: This visualization makes it easier to identify trends, correlations, and clustering in large datasets.
Hexbin Plot (contd.)
• Key Features:
1. Hexagonal Bins: The plot divides the data space into hexagons, allowing for better representation of dense data points.
2. Density Representation: The color intensity of each hexagon represents the density or frequency of the data points that fall within that bin. Darker shades usually represent higher concentrations of data points.
3. Efficient for Large Datasets: Particularly useful when scatter plots become unreadable due to overplotting, which often happens with large datasets.
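A sketch with matplotlib's `hexbin`, using 50,000 synthetic correlated points that would overplot badly in an ordinary scatter plot:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
# 50,000 correlated points: a scatter plot of these would be a solid blob
x = rng.normal(0, 1, 50_000)
y = 0.8 * x + rng.normal(0, 0.6, 50_000)

fig, ax = plt.subplots()
hb = ax.hexbin(x, y, gridsize=40, cmap="inferno")  # color = points per hexagon
fig.colorbar(hb, label="count per bin")
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.savefig("hexbin.png")
```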
t-SNE / UMAP
• t-SNE (t-Distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection) are advanced techniques for reducing high-dimensional data to two or three dimensions, focusing on preserving the local structure and neighborhood of the data points.
• Purpose:
  • Ideal for visualizing complex datasets where clusters and relationships are not easily identifiable in the original high-dimensional space.
  • Often used for high-dimensional data like images or genetic data.
Figure: A 2D t-SNE plot, showing how the three different species of Iris flowers cluster together.
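A t-SNE sketch with scikit-learn; UMAP has an analogous `fit_transform` API via the third-party `umap-learn` package, which is only mentioned here, not used:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

iris = load_iris()
# t-SNE maps the 4D measurements to 2D while preserving local neighborhoods;
# perplexity (roughly, the effective neighborhood size) is the key knob to tune
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(iris.data)

fig, ax = plt.subplots()
ax.scatter(emb[:, 0], emb[:, 1], c=iris.target, cmap="viridis", s=20)
ax.set_title("t-SNE embedding of the Iris dataset")
fig.savefig("tsne_iris.png")
```

Note that t-SNE distances between far-apart clusters are not meaningful; the plot shows neighborhood structure, not a faithful global geometry.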
Key Takeaways from EDA
• Key Points:
  • EDA is essential for understanding the data.
  • Visualizations are powerful tools to reveal insights.
  • Select the right visualization based on the type of data and analysis goals.
• Visual: A summary infographic with different visualization types.
Summary

| Univariate Visualization Technique | Description | Example Use Case |
| --- | --- | --- |
| Histogram | Displays the distribution of a single numeric variable. | Visualizing the distribution of house prices. |
| Box Plot | Summarizes the distribution of a numeric variable, showing the median, quartiles, and outliers. | Comparing the salaries of employees in different departments. |
| Violin Plot | Combines a box plot and density plot to show the distribution of a variable and its probability density. | Comparing the distribution of exam scores between two classes. |
| Density Plot | Smooth, continuous version of a histogram that estimates the probability density function. | Visualizing the probability distribution of a stock price. |
Summary (contd.)

| Multivariate Visualization Technique | Description | Example Use Case |
| --- | --- | --- |
| Scatter Plot | Shows the relationship between two numeric variables. | Analyzing the relationship between house size and price. |
| Pair Plot | Visualizes pairwise relationships between multiple numeric variables. | Exploring relationships in the Iris dataset between all features. |
| Heatmap (Correlation) | Displays correlations between numeric variables using color intensity. | Identifying multicollinearity in the features of the Boston Housing dataset. |
| Hexbin Plot | Displays the density of points between two numeric variables using hexagonal bins. | Visualizing point density in large datasets like population data. |
| t-SNE/UMAP | Dimensionality reduction techniques for visualizing high-dimensional data in 2D/3D space. | Clustering similar species in the Iris dataset. |
Summary (contd.)

| Categorical Data Visualization Technique | Description | Example Use Case |
| --- | --- | --- |
| Bar Plot | Compares the frequency of different categories. | Comparing the number of passengers in each passenger class (Pclass) on the Titanic. |
| Stacked Bar Chart | Shows the composition of sub-categories within each category. | Visualizing survival rates across passenger classes on the Titanic. |
| Categorical Heatmap | Shows the relationship between two categorical variables using color. | Displaying the count of passengers by port of embarkation and class on the Titanic. |
Reference and further reading
• Chapter 2: "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow", Aurélien Géron.
• Jupyter Notebook: Under Module 2 in Canvas: T2c_EDA_Data_Visualization