EDA - Day 3
Module Name: EDA & Statistics
Course : EDA
Lecture On : EDA - Day - 3
Instructor :
Today’s Agenda
● Revision
● Introduction to Univariate Analysis
● Categorical Unordered Univariate Analysis
● Categorical Ordered Univariate Analysis
● Statistics on Numerical Features
● Key Takeaways
• Given a dataset, the first step is to understand what kind of data it contains.
• Information about a dataset can be gained simply by looking at its metadata, which, in
simple terms, is the data that describes each variable in detail.
• Metadata includes information such as the size of the dataset, when the dataset was
created, and the type of variable in each column.
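A minimal sketch of reading basic metadata from a small dataset, using only the Python standard library. The CSV content, the column names and the `infer_type` helper are hypothetical and serve only as an illustration; the type check is deliberately crude (a column counts as numerical if every value parses as a number).

```python
import csv
import io

# Hypothetical dataset, inlined so the example is self-contained.
raw = """student,gender,marks
Asha,F,72
Ben,M,64
Chitra,F,58
"""

rows = list(csv.DictReader(io.StringIO(raw)))


def infer_type(values):
    """Label a column 'numerical' if every value parses as a float."""
    try:
        for value in values:
            float(value)
        return "numerical"
    except ValueError:
        return "categorical"


n_rows = len(rows)                 # size of the dataset: number of records
columns = list(rows[0].keys())     # variable names
column_types = {col: infer_type([row[col] for row in rows]) for col in columns}

print("rows:", n_rows, "columns:", len(columns))
for col, kind in column_types.items():
    print(col, "->", kind)
```

In practice a data-frame library would usually report the same information directly; the point here is only what kind of questions metadata answers.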
• “Uni” means “one”, so in univariate analysis you perform EDA on one variable at a
time and look for useful insights hidden in it.
• For example:
• For a student, the examiner is an antagonist most of the time, who prevents
you from getting the scores you deserve.
• And everyone has an opinion on when and where grace marks are justified.
Quantitative Variables - Summary Metrics
• The simplest way to perform univariate analysis on quantitative data is to compute
the mean, median, mode, standard deviation and quartile values of the data.
• Mean and median are single values that broadly give a representation of the entire
data. It is very important to understand when to use these metrics to avoid doing an
inaccurate analysis.
• A box plot is a good way to visualise a numerical variable, since it shows the spread
of the data between the first and the third quartile. It also shows the
minimum and the maximum values in the dataset.
• Standard deviation and interquartile range are both used to represent the spread
of the data.
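The summary metrics above can be sketched with Python's built-in `statistics` module. The `marks` list is hypothetical data, chosen to contain one unusually high value so that the difference between mean and median is visible.

```python
import statistics

# Hypothetical sample: marks scored by ten students (illustrative only).
marks = [35, 42, 42, 48, 50, 53, 57, 61, 66, 98]

mean = statistics.mean(marks)
median = statistics.median(marks)
mode = statistics.mode(marks)
std_dev = statistics.stdev(marks)  # sample standard deviation

# quantiles(n=4) returns the three quartile cut points: Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(marks, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: spread of the middle half of the data

print("mean:", mean)        # 55.2
print("median:", median)    # 51.5
print("mode:", mode)        # 42
print("std dev:", round(std_dev, 2))
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)  # 43.5 60.0 16.5
```

The single score of 98 pulls the mean (55.2) above the median (51.5), which is why the median is often the safer summary when the data contains extreme values.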
Segmented Univariate Analysis
• Segmented univariate analysis allows you to compare subsets of data, which is a
powerful technique because it helps you understand how a relevant metric varies
across different segments.
• For example:
• In the case of the number of runs scored by a batsman against each opponent, the
column containing the list of opponents can become the basis of segmentation.
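The batsman example can be sketched as follows, using only the standard library. The innings records are hypothetical; the opponent column is the basis of segmentation and runs is the metric being compared.

```python
from collections import defaultdict
import statistics

# Hypothetical innings records: (opponent, runs scored).
innings = [
    ("Australia", 45), ("England", 102), ("Australia", 12),
    ("England", 67), ("South Africa", 30), ("Australia", 78),
    ("South Africa", 54),
]

# Step 1: split the runs into one bucket per opponent (the segments).
runs_by_opponent = defaultdict(list)
for opponent, runs in innings:
    runs_by_opponent[opponent].append(runs)

# Step 2: summarise the same metric within each segment.
segment_summary = {
    opponent: {
        "innings": len(runs),
        "mean": statistics.mean(runs),
        "median": statistics.median(runs),
    }
    for opponent, runs in runs_by_opponent.items()
}

for opponent, summary in segment_summary.items():
    print(opponent, summary)
```

With so few innings per opponent, the per-segment means are only a starting point for comparison, not a conclusion.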
Basis of Segmentation
• The entire segmentation process can be divided into four parts:
• Group by dimensions
• When you have a large number of variables in your dataset, performing the same
analysis on each of them becomes a very repetitive task.
• One way of solving this problem is to make a table with the categorical variables on
one axis and the numeric variables (or measures/facts) on the other.
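One possible way to build such a table, sketched in plain Python. The records, the column names (`gender`, `stream`, `marks`, `attendance`) and the `segment_means` helper are all hypothetical; each cell of the table holds the mean of one numeric variable within each level of one categorical variable.

```python
from collections import defaultdict
import statistics

# Hypothetical student records with two categorical and two numeric columns.
records = [
    {"gender": "F", "stream": "Science", "marks": 72, "attendance": 90},
    {"gender": "M", "stream": "Science", "marks": 64, "attendance": 80},
    {"gender": "F", "stream": "Arts", "marks": 58, "attendance": 85},
    {"gender": "M", "stream": "Arts", "marks": 70, "attendance": 75},
]

categorical = ["gender", "stream"]   # one axis of the table
numeric = ["marks", "attendance"]    # the other axis (measures)


def segment_means(rows, cat_col, num_col):
    """Mean of num_col for every level of cat_col."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[cat_col]].append(row[num_col])
    return {level: statistics.mean(values) for level, values in groups.items()}


# One cell per (categorical, numeric) pair, so the same analysis
# is written once and applied to every combination.
table = {
    (cat_col, num_col): segment_means(records, cat_col, num_col)
    for cat_col in categorical
    for num_col in numeric
}

for (cat_col, num_col), means in table.items():
    print(f"{num_col} by {cat_col}: {means}")
```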
Comparison of Averages
• Once you are done with segmentation, the next step is to compare your results
within the category.
• You can either compare the means, or you can go for other descriptive statistics
such as median, max, min, etc.
• But you should be careful while comparing averages, especially if the difference in
average values is small.
• Don’t blindly believe in the averages of the buckets — you need to observe the
distribution of each bucket closely and ask yourself if the difference in means is
significant enough to draw a conclusion.
• If the difference in means is small, you may not be able to draw inferences. In
such cases, a technique called hypothesis testing is used to ascertain whether
the difference in means is significant or due to randomness.
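As a rough illustration of the idea, here is a sketch of one simple hypothesis test, an exact permutation test on the difference in means. The slides do not say which test is intended, so this choice is an assumption, and the two segments below are hypothetical data. The test asks: if the segment labels were assigned at random, how often would the difference in means be at least as large as the one observed?

```python
from itertools import combinations
import statistics


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test for a difference in means.

    Enumerates every way of splitting the pooled values into groups of
    the original sizes, so it is only practical for small samples.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))

    extreme = 0
    splits = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        splits += 1
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1
    return extreme / splits


# Hypothetical metric values for two segments with similar averages.
segment_a = [52, 55, 49, 61, 58]
segment_b = [54, 50, 57, 63, 59]

p_value = permutation_p_value(segment_a, segment_b)
print("difference in means:",
      statistics.mean(segment_b) - statistics.mean(segment_a))
print("p-value:", round(p_value, 3))
```

A large p-value means a difference of this size appears often under random labelling, so it should not be read as a real difference between the segments. A small p-value (0.05 is a common convention) suggests the difference is unlikely to be due to randomness alone.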
Comparison of Other Metrics
• Once you have identified the variables based on the business problem for analysing
the segments, the next step is to know the distribution of segments and compare the
average of each segment.
• But this is not the only way of comparing segments. There are various metrics which
you can use to understand and explain your data easily.
• Besides finding the segments and comparing the metrics, your primary focus should
be on understanding the results arising from the segments.
Key Takeaways
● Revision
● Introduction to Univariate Analysis
● Categorical Unordered Univariate Analysis
● Categorical Ordered Univariate Analysis
● Statistics on Numerical Features
● Key Takeaways
Thank You!