Exploratory Data Analysis types

Exploratory Data Analysis (EDA) is a statistical approach used to summarize the main characteristics of data sets through graphical and non-graphical techniques. The process involves various steps including variable identification, univariate and multivariate analysis, and treatment of missing values and outliers. EDA can be categorized into four types: univariate non-graphical, multivariate non-graphical, univariate graphical, and multivariate graphical.

Uploaded by

happylifemanu82
Copyright
© All Rights Reserved

Exploratory Data Analysis

Definition
• In statistics, Exploratory Data Analysis is an approach of
analyzing data sets to summarize their main
characteristics, often using statistical graphics and
other data visualization methods.
• EDA employs a variety of techniques (mostly graphical) to:
– maximize insight into a data set;
– uncover underlying structure;
– extract important variables;
– detect outliers and anomalies;
– test underlying assumptions;
– develop parsimonious models; and
– determine optimal factor settings.
https://en.wikipedia.org/wiki/Exploratory_data_analysis
Steps in Data Exploration
• Identification of variables and data types
• Analyzing the basic metrics
• Non-Graphical Univariate Analysis
• Graphical Univariate Analysis
• Bivariate Analysis
• Variable transformations
• Missing value treatment
• Outlier treatment
• Correlation Analysis
• Dimensionality Reduction

https://towardsai.net/p/data-analysis/exploratory-data-analysis-in-python-ebdf643a33f6
Typical Data format
• Generally in CSV format: a rectangular array
– with one row per experimental subject
– and one column for each
• subject identifier
• outcome variable
• explanatory variable
• Each column contains the numeric values for a
particular quantitative variable or the levels
for a categorical variable.
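
As a rough illustration of this layout, the sketch below reads a small CSV with Python's standard csv module. The file contents and the column names (subject_id, age, gender, task_speed) are made up for the example and are not taken from the slides.

```python
import csv
import io

# Hypothetical CSV content: one row per subject, one column per variable.
raw = """subject_id,age,gender,task_speed
1,67,F,12.4
2,72,M,15.1
3,81,F,18.9
"""

# Each row becomes a dict keyed by the column names in the header line.
rows = list(csv.DictReader(io.StringIO(raw)))

# CSV values arrive as text, so convert the quantitative columns to numbers.
for row in rows:
    row["age"] = int(row["age"])
    row["task_speed"] = float(row["task_speed"])

columns = list(rows[0].keys())
print("variables:", columns)
print("number of subjects:", len(rows))
```

Here gender stays as text because it is a categorical variable whose values are levels, while age and task_speed are quantitative.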

https://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf
Types of EDA
• The four types of EDA are
– univariate non-graphical
– multivariate non-graphical
– univariate graphical
– multivariate graphical
• These types come from cross-classifying EDA methods in two ways
• First Method
– Non-graphical - summary statistics
– Graphical - summarize the data in a pictorial way
• Second Method
– Univariate - look at one variable (data column) at a time
– Multivariate - look at two or more variables at a time to
explore relationships
• It is always a good idea to perform univariate EDA on each
of the components of a multivariate EDA before
performing the multivariate EDA.
Univariate non-graphical EDA
• The data that come from making a particular measurement
on all of the subjects in a sample represent our observations
for a single characteristic such as age, gender, speed at a task,
or response to a stimulus.
• Categorical data
– Tabulation of frequency of categories
– fraction (or percent) of data that falls in each category.
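
A minimal sketch of such a tabulation, using collections.Counter from the standard library. The gender values are invented sample data.

```python
from collections import Counter

# Hypothetical categorical observations for eight subjects.
gender = ["F", "M", "F", "F", "M", "F", "M", "F"]

counts = Counter(gender)      # frequency of each category
total = len(gender)
fractions = {level: n / total for level, n in counts.items()}

for level, n in counts.most_common():
    print(f"{level}: count={n}, percent={100 * fractions[level]:.1f}%")
```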
Univariate non-graphical EDA
• Quantitative data
– center, spread, modality (number of peaks in the
pdf), shape (including “heaviness of the tails”),
and outliers
• Central tendency - “location” of a distribution
has to do with typical or middle values.
– mean, median (mid-value in an ordered list), and
sometimes mode
– geometric, harmonic, truncated, or Winsorized
means
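
The measures above can be sketched with Python's statistics module. The sample values are invented, and trimmed_mean and winsorized_mean are hypothetical helper names written for this example (the standard library does not provide them).

```python
import statistics

# Hypothetical sample with one unusually large value.
values = [2, 3, 3, 4, 5, 6, 40]

mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)
geometric = statistics.geometric_mean(values)
harmonic = statistics.harmonic_mean(values)

def trimmed_mean(data, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(data)
    return statistics.mean(ordered[k:len(ordered) - k])

def winsorized_mean(data, k):
    """Mean after pulling the k most extreme values at each end
    in to the nearest value that is kept."""
    ordered = sorted(data)
    low, high = ordered[k], ordered[-k - 1]
    return statistics.mean([min(max(x, low), high) for x in ordered])

print("mean:", mean, "median:", median, "mode:", mode)
print("geometric:", round(geometric, 3), "harmonic:", round(harmonic, 3))
print("trimmed (k=1):", trimmed_mean(values, 1))
print("Winsorized (k=1):", round(winsorized_mean(values, 1), 3))
```

The single large value pulls the ordinary mean well above the median, while the trimmed and Winsorized means stay close to the bulk of the data.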
Univariate non-graphical EDA
• Spread
– indicator of how far away from the center we are still likely
to find data values
– variance, standard deviation, and interquartile range
• Variance - the standard measure of spread: the average
squared deviation from the mean
• Standard deviation - the square root of the variance
• Interquartile range (IQR)
– The quartiles of a population or a sample are the three
values which divide the distribution or observed data into
even fourths.
– IQR = Q3 − Q1.
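
A short sketch of these spread measures using the statistics module, on invented data. Note that several conventions for computing quartiles exist; statistics.quantiles uses its default "exclusive" method here, so other software may report slightly different Q1 and Q3 for the same data.

```python
import statistics

# Hypothetical quantitative sample.
values = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.variance(values)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(values)       # square root of the sample variance

# Three cut points that split the ordered data into four equal parts.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

print("variance:", round(variance, 3))
print("standard deviation:", round(std_dev, 3))
print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
```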
Univariate non-graphical EDA
• range
– distance from the minimum value to the
maximum value
– range = maximum – minimum
– The minimum and maximum of a sample may be
useful for detecting outliers
• Such outliers often come from data-entry errors such as
typing a digit twice or transposing digits (e.g., entering
211 instead of 21, or 19 instead of 91, for data that
represent ages of senior citizens)
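
The check described above can be sketched as follows. The ages are invented, and the plausible bounds of 60 to 120 years are an assumption chosen for this example, not a rule from the slides.

```python
# Hypothetical ages of senior citizens, including two entry errors.
ages = [67, 72, 211, 19, 81, 75]

minimum, maximum = min(ages), max(ages)
value_range = maximum - minimum

# Assumed plausible bounds for a senior-citizen sample.
LOW, HIGH = 60, 120
suspicious = [a for a in ages if not LOW <= a <= HIGH]

print("min:", minimum, "max:", maximum, "range:", value_range)
print("values to double-check:", suspicious)
```

Values flagged this way are candidates for checking against the original records, not automatic deletions.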
Univariate non-graphical EDA
• Skewness - measure of asymmetry
• Kurtosis - measure of “peakedness” relative to
a Gaussian shape
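
One common way to compute both quantities is from central moments, sketched below. This uses the simple moment-based definitions (dividing by n) and reports excess kurtosis, so a Gaussian shape scores about 0; statistical packages often apply small-sample corrections and may give slightly different numbers.

```python
def central_moment(data, order):
    """Average of (x - mean) ** order over the data."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** order for x in data) / n

def skewness(data):
    """Moment-based skewness: 0 for symmetric data, > 0 for a long right tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Moment-based kurtosis minus 3, so a Gaussian shape gives about 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]          # hypothetical symmetric sample
right_skewed = [1, 1, 1, 2, 10]      # hypothetical sample with a long right tail

print("skewness (symmetric):", skewness(symmetric))
print("skewness (right-skewed):", round(skewness(right_skewed), 3))
print("excess kurtosis (symmetric):", round(excess_kurtosis(symmetric), 3))
```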
Univariate graphical EDA

• Categorical data
– Bar plot of the tabulation of the data
– Pie chart
• Quantitative data
– Histogram
– Stem and leaf plot
– Box plot
– Quantile-normal plot
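
Of the plots listed, the stem and leaf plot is simple enough to sketch as plain text without a plotting library. The function name stem_and_leaf and the sample ages are invented for this example, and the sketch assumes non-negative integers with the tens as the stem and the units as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return text lines of a stem and leaf plot for non-negative integers
    (stem = tens digit(s), leaf = units digit)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Include empty stems so gaps in the data stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

# Hypothetical ages.
ages = [61, 64, 67, 72, 72, 75, 78, 81, 83, 95]
for line in stem_and_leaf(ages):
    print(line)
```

Read sideways, the rows of leaves give the same picture as a histogram with bins of width 10 while still showing every individual value.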
Multivariate non-graphical EDA
• Cross-tabulation
– two-way table with column headings that match the levels
of one variable and row headings that match the levels of
the other variable, then filling in the counts of all subjects
that share a pair of levels.
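
A minimal sketch of building such a two-way table by counting pairs of levels. The two categorical variables (gender and smoker) and their values are invented for the example.

```python
from collections import Counter

# Hypothetical categorical variables measured on the same six subjects.
gender = ["F", "M", "F", "F", "M", "M"]
smoker = ["yes", "no", "no", "yes", "no", "yes"]

# Count how many subjects share each (gender, smoker) pair of levels.
pair_counts = Counter(zip(gender, smoker))

row_levels = sorted(set(gender))
col_levels = sorted(set(smoker))
table = {r: {c: pair_counts[(r, c)] for c in col_levels} for r in row_levels}

# Print with column headings for one variable and row headings for the other.
print("    " + "  ".join(f"{c:>3}" for c in col_levels))
for r in row_levels:
    print(f"{r:>3} " + "  ".join(f"{table[r][c]:>3}" for c in col_levels))
```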
Multivariate non-graphical EDA
• Correlation
– for two quantitative variables, the correlation coefficient
measures the strength and direction of their linear
relationship; it ranges from −1 to +1, with 0 indicating
no linear relationship.
– for more than two variables, the pairwise correlations
can be arranged in a correlation matrix.
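
For two quantitative variables, the Pearson correlation coefficient can be computed directly from its definition, as sketched below. The function name pearson_correlation and the data are illustrative only.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired measurements.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]

print("perfect positive:", pearson_correlation(hours, score))
print("perfect negative:", pearson_correlation(hours, score[::-1]))
```

The result always lies between -1 and +1; values near 0 indicate little linear relationship, although a strong non-linear relationship may still be present.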
Multivariate graphical EDA
• Scatter plot - For two quantitative variables
