EDA - Module 4

EXPLORATORY DATA ANALYSIS

1
WHAT IS EDA?
• The analysis of datasets based on various numerical methods and
graphical tools.
• Exploring data for patterns, trends, underlying structure, deviations
from the trend, anomalies and strange structures.
• It facilitates discovering unexpected as well as conforming the
expected.
• Another definition: An approach/philosophy for data analysis that
employs a variety of techniques (mostly graphical).

2
Exploratory Data Analysis

3
AIM OF THE EDA
• Maximize insight into a dataset
• Uncover underlying structure
• Extract important variables
• Detect outliers and anomalies
• Test underlying assumptions
• Develop valid models
• Determine optimal factor settings (Xs)

4
AIM OF THE EDA
• The goal of EDA is to open-mindedly explore data.
• EDA is detective work… Unless the detective finds the clues, the judge or jury has nothing to consider.
• Here, the judge or jury is confirmatory data analysis.
• Confirmatory data analysis goes further, assessing the strength of the evidence.
• With EDA, we can examine data and try to understand the meaning of variables and what the abbreviations stand for.

5
Exploratory vs Confirmatory Data
Analysis
EDA:
• No hypothesis at first
• Generate hypothesis
• Uses graphical methods (mostly)

CDA:
• Start with hypothesis
• Test the null hypothesis
• Uses statistical models
6
STEPS OF EDA
• Generate good research questions
• Data restructuring: You may need to make new variables from the existing ones.
• Instead of using two variables, obtain rates or percentages from them
• Creating dummy variables for categorical variables
• Based on the research questions, use appropriate graphical tools and obtain
descriptive statistics. Try to understand the data structure, relationships, anomalies,
unexpected behaviors.
• Try to identify confounding variables, interaction relations and multicollinearity, if
any.
• Handle missing observations
• Decide on the need for transformation (on response and/or explanatory variables).
• Decide on the hypothesis based on your research questions
7
AFTER EDA
• Confirmatory Data Analysis: Verify the hypothesis by statistical
analysis
• Draw conclusions and present your results clearly.

8
Classification of EDA*
• Exploratory data analysis is generally cross-classified in two ways. First,
each method is either non-graphical or graphical. And second, each
method is either univariate or multivariate (usually just bivariate).
• Non-graphical methods generally involve calculation of summary statistics,
while graphical methods obviously summarize the data in a diagrammatic
or pictorial way.
• Univariate methods look at one variable (data column) at a time, while
multivariate methods look at two or more variables at a time to explore
relationships. Usually our multivariate EDA will be bivariate (looking at
exactly two variables), but occasionally it will involve three or more
variables.
• It is almost always a good idea to perform univariate EDA on each of the
components of a multivariate EDA before performing the multivariate EDA.
*Seltman, H.J. (2015). Experimental Design and Analysis. http://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf
9
EXAMPLE 1
Data from the Places Rated Almanac (Boyer and Savageau, 1985): 9 variables for 329 metropolitan areas in the USA
1. Climate mildness
2. Housing cost
3. Health care and environment
4. Crime
5. Transportation supply
6. Educational opportunities and effort
7. Arts and culture facilities
8. Recreational opportunities
9. Personal economic outlook
+ latitude and longitude of each city

Questions:
1. How is climate related to location?
2. Are there clusters in the data (excluding location)?
3. Are nearby cities similar?
4. Any relation between economic outlook and crime?
5. What else?

10
Examples of Variables
• Identifier(s):
- patient number,
- visit # or measurement date (if measured more than once)
• Attributes at study start (baseline):
- enrollment date,
- demographics (age, BMI, etc.)
- prior disease history, labs, etc.
- assigned treatment or intervention group
- outcome variable
• Attributes measured at subsequent times
- any variables that may change over time
- outcome variable

11
Data Types and Measurement
Scales
• Variables may be one of several types, and have a defined set of
valid values.
• Two main classes of variables are:
Continuous Variables: (Quantitative, numeric).
Continuous data can be rounded or binned to create categorical data.
Categorical Variables: (Discrete, qualitative).
Some categorical variables (e.g. counts) are sometimes treated as
continuous.

12
Categorical Data
• Unordered categorical data (nominal)
2 possible values (binary or dichotomous)
Examples: gender, alive/dead, yes/no.
Greater than 2 possible values - no order to categories
Examples: marital status, religion, country of birth, race.
• Ordered categorical data (ordinal)
Ratings or preferences
Quality of life scales
BCCI contracts
IPL contracts (base price: 30 L, 50 L, 75 L, 1 cr, 2 cr)
Number of copies of a recessive gene (0, 1 or 2)
13
EDA Part 2: Summarizing Data With
Tables and Plots
Examine the entire data set using basic techniques before starting a
formal statistical analysis.

• Familiarize yourself with the data.
• Find possible errors and anomalies.
• Examine the distribution of values for each variable.

14
Summarizing Variables
• Categorical variables
Frequency tables - how many observations in each category?
Relative frequency table - percent in each category.
Bar chart and other plots.
• Continuous variables
Bin the observations (create categories, e.g., (0-10), (11-20), etc.), then treat as ordered categorical.
Plots specific to continuous variables.

The goal for both categorical and continuous data is data reduction
while preserving/extracting key information about the process under
investigation.
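A minimal sketch of the binning idea above, assuming pandas is available; the ages and bin edges are made up for illustration.

```python
import pandas as pd

# Hypothetical continuous measurements (e.g., ages)
ages = pd.Series([3, 7, 12, 18, 25, 34, 41, 58, 63, 72])

# Bin into ordered categories, then summarize like ordinal data
bins = [0, 10, 20, 40, 60, 100]
labels = ["0-10", "11-20", "21-40", "41-60", "61-100"]
age_groups = pd.cut(ages, bins=bins, labels=labels)

print(age_groups.value_counts().sort_index())  # frequency table of the bins
```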
15
Categorical Data Summaries
• A survey of 100 people asking for their favorite color results in the
following categories:
• Red: 30
• Blue: 40
• Green: 20
• Yellow: 10

16
Frequency Table

• Frequency Table: Categories with counts
• Relative Frequency Table: Percentage in each category
• Mode (Most Frequent Category)
• Example: From a set of responses about preferred transportation (Car, Bike, Walk, Bus), if the counts are:
• Car: 50
• Bike: 30
• Walk: 10
• Bus: 10
then the mode (most frequent category) is Car.
17
Relative Frequency Table

• A relative frequency table shows the proportion of each category compared to the total.
• Example: In a class of 30 students:
• Male: 12
• Female: 18
• Relative Frequencies:
• Male: 12/30 = 0.4 (or 40%)
• Female: 18/30 = 0.6 (or 60%)
• Summary: 40% of students are male, and 60% are female.
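A minimal sketch of building frequency and relative frequency tables with pandas (assumed available); the responses reproduce the class example above.

```python
import pandas as pd

# Hypothetical class of 30 students
gender = pd.Series(["Male"] * 12 + ["Female"] * 18)

freq = gender.value_counts()                    # frequency table (counts)
rel_freq = gender.value_counts(normalize=True)  # relative frequencies (proportions)

print(freq)            # Female 18, Male 12
print(rel_freq * 100)  # Female 60.0, Male 40.0 (percent)
```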

18
Graphing a Frequency Table - Bar
Chart:
A bar chart can visually represent the frequency or percentage of each
category.
Example: Plot a bar chart with the categories (Red, Blue, Green, Yellow)
on the x-axis and the frequency or percentage on the y-axis. The
heights of the bars represent the counts or percentages for each color.
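A minimal sketch of the bar chart described above, using matplotlib (an assumption; base R barplot() or ggplot2 would work equally well). The counts come from the colour survey on the earlier slide.

```python
import matplotlib.pyplot as plt

colors = ["Red", "Blue", "Green", "Yellow"]
counts = [30, 40, 20, 10]

plt.bar(colors, counts)
plt.xlabel("Favorite color")
plt.ylabel("Number of respondents")
plt.title("Favorite color survey (n = 100)")
plt.show()
```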

19
Chi-Square Test (for association
between categorical variables)
• Example: If you want to test whether there is an association between
gender (Male, Female) and whether people prefer watching movies
at home or in the theater, you would collect data and perform a chi-square test on a contingency table.
• Summary: The test might indicate whether gender and movie-watching preference are significantly related, helping you assess patterns or trends.
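A minimal sketch of a chi-square test of independence using scipy (assumed available); the contingency table counts are hypothetical.

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = gender, columns = preference
#               Home  Theater
table = [[45, 35],   # Male
         [30, 50]]   # Female

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}, dof = {dof}")
# A small p-value (e.g., < 0.05) suggests gender and preference are associated
```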

20
Continuous Data - Tables

Pie Chart
A pie chart can be used to show the proportion of each category in a whole.
•Example: For the above color preference survey, a pie chart would divide a circle
into segments, each representing the percentage of people who chose each color
(Blue, Red, Green, Yellow). The chart would visually display that Blue takes up the
largest segment.
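A minimal sketch of the pie chart, again with matplotlib (an assumption), using the same colour survey counts.

```python
import matplotlib.pyplot as plt

counts = [40, 30, 20, 10]
labels = ["Blue", "Red", "Green", "Yellow"]

plt.pie(counts, labels=labels, autopct="%1.0f%%")  # show the percentage on each slice
plt.title("Favorite color survey")
plt.show()
```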

21
Plotting Functions
R has several distinct plotting systems
Base R functions
• hist()
• barplot()
• boxplot()
• plot()
lattice package
ggplot2 package

22
Techniques involved in Exploratory
Data Analysis
1. Data Collection and Understanding
•Data Sources: Understanding where the data is coming from (e.g., databases, APIs, CSV files, spreadsheets).
•Types of Data: Distinguishing between numerical, categorical, ordinal, and nominal data.
•Data Structure: Exploring rows, columns, and the data types in the dataset (e.g., integer, float, object, etc.).
•Initial Data Inspection: Using basic functions like head(), info(), and describe() to get a quick summary.

2. Data Cleaning
•Handling Missing Data: Techniques for imputation, removal, or using algorithms that handle missing values
(mean/median imputation, forward/backward fill, etc.).
•Handling Duplicates: Identifying and removing duplicate records.
•Data Transformation: Converting data types, changing the format, or scaling features (e.g., converting
categorical variables to numeric).
•Outlier Detection and Treatment: Identifying outliers and deciding how to handle them (removal, capping,
transformation).
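A minimal sketch of the inspection and cleaning steps in points 1 and 2, assuming pandas and a hypothetical file data.csv; the "age" column is made up for illustration.

```python
import pandas as pd

df = pd.read_csv("data.csv")   # hypothetical data source

# 1. Initial inspection
print(df.head())       # first rows
df.info()              # column types and non-null counts
print(df.describe())   # summary statistics for numeric columns

# 2. Cleaning
print(df.isna().sum())                            # missing values per column
df = df.drop_duplicates()                         # remove duplicate records
df["age"] = df["age"].fillna(df["age"].median())  # median imputation (hypothetical column)
```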

23
3. Univariate Analysis
•Summary Statistics: Mean, median, mode, range, variance, standard deviation,
skewness, and kurtosis.
•Histograms: Plotting the distribution of single variables to visualize their
frequency.
•Boxplots: Identifying the spread and central tendency, and detecting potential
outliers.
•Bar Charts: Visualizing the distribution of categorical variables.
•Density Plots: Visualizing the smooth distribution of a variable (Kernel Density
Estimation).
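A minimal sketch of univariate summaries and plots, assuming pandas and matplotlib; the variable x is simulated for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

x = pd.Series(np.random.default_rng(0).normal(50, 10, 500))  # hypothetical variable

print(x.describe())                                    # mean, std, quartiles
print("skewness:", x.skew(), "kurtosis:", x.kurt())    # shape of the distribution

fig, axes = plt.subplots(1, 2)
x.plot.hist(ax=axes[0], bins=20, title="Histogram")    # frequency distribution
x.plot.box(ax=axes[1], title="Boxplot")                # spread and potential outliers
plt.show()
```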

24
4. Bivariate Analysis
•Scatter Plots: Analyzing the relationship between two continuous variables.
•Correlation Matrix: Identifying linear relationships between numerical features
using correlation coefficients (Pearson, Spearman).
•Boxplots and Violin Plots: Comparing distributions of continuous data across
categorical groups.
•Heatmaps: Visualizing the correlation matrix or missing data patterns.
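A minimal sketch of bivariate exploration (scatter plot, correlation matrix, heatmap), assuming pandas, seaborn, and matplotlib; the data frame and its relationship are simulated.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 2 * df["x"] + rng.normal(size=200)   # hypothetical related variable

df.plot.scatter(x="x", y="y")                  # relationship between two continuous variables
corr = df.corr(method="pearson")               # correlation matrix (method="spearman" also works)
sns.heatmap(corr, annot=True)                  # heatmap of the correlation matrix
plt.show()
```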

25
5. Multivariate Analysis
•Pairplots: Visualizing relationships between multiple continuous variables at once.
•Heatmaps for Correlation: Analyzing the correlation matrix between several
features.
•Principal Component Analysis (PCA): Reducing the dimensionality of the
dataset to identify the most significant features and visualize high-dimensional data.
•t-SNE (t-Distributed Stochastic Neighbor Embedding): A non-linear
dimensionality reduction technique for high-dimensional data visualization.
•Pairwise Comparisons: Comparing distributions or relationships across multiple
dimensions.
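A minimal sketch of PCA and t-SNE with scikit-learn (assumed available); the iris data is used only as a convenient built-in example.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)   # scale features before PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)            # linear projection to 2 components
print("explained variance ratio:", pca.explained_variance_ratio_)

# Non-linear embedding of the same data for visualization
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X_scaled)
```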

26
6. Feature Engineering
•Feature Creation: Creating new variables based on existing data (e.g., date
extraction, text vectorization).
•Feature Scaling: Normalization and standardization techniques (e.g., Min-Max
Scaling, Z-Score Standardization).
•Feature Encoding: Techniques for encoding categorical variables (e.g., One-Hot
Encoding, Label Encoding, Target Encoding).
•Dimensionality Reduction: Using techniques like PCA, t-SNE, and Autoencoders
to reduce the number of features while retaining essential information.
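A minimal sketch of feature scaling and encoding, assuming pandas and scikit-learn; the data frame and its columns are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"income": [20, 35, 50, 80], "city": ["A", "B", "A", "C"]})  # hypothetical

# Feature scaling: Min-Max normalization and z-score standardization
df["income_minmax"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()
df["income_zscore"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# One-hot encoding of a categorical feature
df = pd.get_dummies(df, columns=["city"], prefix="city")
print(df)
```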

27
7. Handling Skewed Data
•Data Transformation: Applying log, square root, or Box-Cox transformations to
handle skewed distributions.
•Identifying Skewness: Visualizing skewness with histograms or skewness-kurtosis
tests.
•Dealing with Skewed Target Variable: Using transformations for regression or
classification models to deal with non-normal target distributions.
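A minimal sketch of checking skewness and applying log and Box-Cox transforms, assuming numpy, pandas, and scipy; the data is simulated to be right-skewed.

```python
import numpy as np
import pandas as pd
from scipy import stats

x = pd.Series(np.random.default_rng(2).lognormal(mean=0, sigma=1, size=1000))  # right-skewed

print("skewness before:", x.skew())
x_log = np.log(x)                         # log transform (values must be positive)
print("skewness after log:", x_log.skew())

x_boxcox, lam = stats.boxcox(x)           # Box-Cox chooses a power transform automatically
print("Box-Cox lambda:", lam)
```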

28
8. Data Visualization
•Plotting Techniques: Understanding how to use different types of plots like line
plots, histograms, boxplots, bar plots, heatmaps, and pie charts to visualize data.
•Seaborn/Matplotlib: Using Python libraries to create advanced visualizations and
customizing plots.
•Faceted Plots: Creating subsets of plots based on different categories or values to
explore relationships in the data.
•Interactive Plots: Using tools like Plotly and Dash for more advanced, interactive
visualizations.
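A minimal sketch of a faceted plot with seaborn (assumed available), using its bundled "tips" example dataset (fetched on first use); displot with col= draws one histogram panel per category.

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")                     # example dataset shipped with seaborn
sns.displot(data=tips, x="total_bill", col="day")   # one histogram panel per day
plt.show()
```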

29
9. Detecting Anomalies and Outliers
•Visual Techniques: Using scatter plots, box plots, and z-scores to detect
anomalous points.
•Statistical Methods: Using statistical tests (e.g., Grubbs' Test, Modified Z-score)
for outlier detection.
•Robust Statistics: Using methods that are not sensitive to outliers, such as median
and interquartile ranges (IQR).
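A minimal sketch of simple outlier flags using z-scores and the IQR rule, assuming numpy and pandas; the data and thresholds are illustrative.

```python
import numpy as np
import pandas as pd

x = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11] * 2 + [95])  # 95 is an obvious outlier

# Z-score rule: flag points far from the mean (commonly |z| > 3)
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 3])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])
```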

30
10. Time Series Analysis (if applicable)
•Trend Analysis: Identifying trends and seasonality in time series data.
•Autocorrelation: Using autocorrelation plots (ACF, PACF) to check for repeating
patterns in time series data.
•Decomposition: Decomposing time series into trend, seasonal, and residual
components.
•Stationarity Tests: Checking if the data is stationary using tests like the
Augmented Dickey-Fuller (ADF) test.
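A minimal sketch of decomposition and a stationarity check with statsmodels (assumed available); the monthly series is simulated with a trend and yearly seasonality.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller

# Simulated monthly series with trend + seasonality + noise
idx = pd.date_range("2015-01-01", periods=96, freq="MS")
t = np.arange(96)
y = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 12)
              + np.random.default_rng(3).normal(0, 1, 96), index=idx)

decomp = seasonal_decompose(y, model="additive", period=12)  # trend / seasonal / residual parts

adf_stat, p_value = adfuller(y)[:2]                          # Augmented Dickey-Fuller test
print(f"ADF statistic = {adf_stat:.2f}, p-value = {p_value:.3f}")  # large p suggests non-stationarity
```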

31
11. Data Sampling Techniques
•Random Sampling: Drawing random samples from the dataset for quick analysis.
•Stratified Sampling: Ensuring samples represent all key segments of the
population.
•Bootstrapping: A method of resampling to estimate the variability of a statistic.

12. Handling Categorical Variables


•Frequency Distribution: Checking the distribution of values in categorical
variables.
•Chi-Square Test: Testing for independence between categorical variables.
•Cross-tabulation: Analyzing the relationship between two categorical variables
using contingency tables.
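A minimal sketch of stratified sampling and cross-tabulation, assuming pandas and scikit-learn; the data frame and its segments are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "segment": ["A"] * 70 + ["B"] * 30,            # hypothetical population segments
    "bought":  [1, 0] * 35 + [1] * 10 + [0] * 20,  # hypothetical categorical outcome
})

# Stratified sampling: the 30% sample keeps the A/B proportions of the population
sample, _ = train_test_split(df, train_size=0.3, stratify=df["segment"], random_state=0)
print(sample["segment"].value_counts(normalize=True))

# Cross-tabulation of two categorical variables
print(pd.crosstab(df["segment"], df["bought"]))
```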
32
13. Data Integrity and Quality Assessment
•Consistency Checks: Ensuring data is consistent across variables and records.
•Missing Data Patterns: Identifying missing data patterns and deciding whether
they are missing completely at random (MCAR), missing at random (MAR), or
missing not at random (MNAR).
•Data Quality Metrics: Evaluating the quality of data by checking for accuracy,
completeness, consistency, and timeliness.

33
14. Multicollinearity
•Variance Inflation Factor (VIF): Measuring multicollinearity between predictor
variables.
•Condition Number: Checking the stability of the regression model when
multicollinearity is present.
•Removing Highly Correlated Features: Addressing multicollinearity by
removing redundant variables.
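A minimal sketch of computing VIF with statsmodels (assumed available); the predictors are simulated so that x2 is nearly a copy of x1, which inflates both of their VIFs.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),
})

X_const = sm.add_constant(X)  # VIF assumes a regression that includes an intercept
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif)  # x1 and x2 should show large VIFs (> 10); x3 should be near 1
```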

34
15. Handling Imbalanced Data
•Resampling Methods: Over-sampling the minority class (SMOTE) or
under-sampling the majority class.
•Class Weight Adjustment: Adjusting the weight of classes in models
to give more importance to the minority class.
•Anomaly Detection: Using techniques like Isolation Forest or One-Class SVM to detect rare events or classes.
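A minimal sketch of handling class imbalance via class weights in scikit-learn, plus SMOTE from the separate imbalanced-learn package (both assumed installed); the data is simulated with roughly a 95/5 class split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE   # from the imbalanced-learn package

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Option 1: re-weight classes so the minority class gets more importance in the loss
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Option 2: over-sample the minority class with SMOTE, then fit as usual
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("positives before:", y.sum(), " positives after SMOTE:", y_res.sum())
```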

35
