EDA - Day 3

#LifeKoKaroLift
Data Science Program
Module Name: EDA & Statistics
Course : EDA
Lecture On : EDA - Day - 3
Instructor :
Today’s Agenda
● Revision
● Introduction to Univariate Analysis
● Categorical Unordered Univariate Analysis
● Categorical Ordered Univariate Analysis
● Statistics on Numerical Features
● Key Takeaways
Data Science Certification

Revision
• In the previous session, we learnt about:
○ Data Cleaning Processes
■ Fixing the Rows and Columns
■ Impute/Remove Missing Values
■ Handling Outliers
■ Standardising Values
■ Fixing Invalid Values
■ Filtering Data
Types of Data Variables
• Given a dataset, the first step is to understand what kind of data it contains.
• Information about a dataset can be gained simply by looking at its metadata, which,
in simple terms, is the data that describes each variable in detail.
• Metadata includes information such as the size of the dataset, when the dataset was
created, and the type of variable in each column.
• There are two types of variables:
• Categorical
• Numerical (quantitative)
Types of Data Variables
• Categorical variables are further divided into two types: ordered and unordered.
• Ordered categorical variables follow a specific sequence.
• For example:
• Salary: A person's salary can be categorised as high, medium or low.
• Month: In a school, student records can be categorised according to the
students' month of birth.
• Unordered categorical variables do not follow any sequence.
• For example:
• Type of loan: A loan taken from a bank can be classified as home, personal or
automobile.
• Department: An employee may work in the HR, sales or accounts department.
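The practical difference between the two kinds of categorical variable shows up as soon as you tabulate them. The sketch below uses made-up salary bands; the names `salary_band` and `BAND_ORDER` are illustrative assumptions, not part of the course material. It shows that an ordered variable needs its sequence stated explicitly, because alphabetical sorting does not recover it.

```python
from collections import Counter

# Hypothetical salary bands for six employees (made-up data).
salary_band = ["low", "high", "medium", "low", "medium", "low"]

# For an ordered categorical variable the sequence must be stated explicitly.
BAND_ORDER = ["low", "medium", "high"]

counts = Counter(salary_band)

# Counts reported in the meaningful order low < medium < high.
ordered_counts = [(band, counts[band]) for band in BAND_ORDER]

# Alphabetical sorting loses that order: high, low, medium.
alphabetical = sorted(counts)

print(ordered_counts)
print(alphabetical)
```

For an unordered variable such as loan type, no such sequence exists, so categories are usually sorted by frequency instead.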
Unordered Categorical Variables - Univariate Analysis
• Univariate analysis is the simplest form of analysing data.
• “Uni” means “one”: you perform EDA on only one variable at a time and look for
useful insights.
• Plots are immensely helpful in identifying hidden patterns in the data.
• It is possible to extract meaningful insights from unordered categorical variables
using rank-frequency plots.
• Rank-frequency plots of unordered categorical variables, when plotted on a log-log
scale, typically follow a power law distribution, which appears as a roughly straight line.
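A rank-frequency table is easy to build by hand. The following is a minimal sketch on made-up loan-type data; the helper name `rank_frequency` and the sample values are assumptions for illustration. It ranks categories by count and computes the log-log coordinates you would plot, since a power law shows up as an approximately straight line in those coordinates.

```python
import math
from collections import Counter

# Hypothetical sample: loan types recorded for 16 customers (made-up data).
loan_types = (
    ["home"] * 8 + ["personal"] * 4 + ["automobile"] * 3 + ["education"] * 1
)

def rank_frequency(values):
    """Return (rank, category, count) tuples, most frequent category first."""
    counts = Counter(values).most_common()
    return [(rank, cat, n) for rank, (cat, n) in enumerate(counts, start=1)]

table = rank_frequency(loan_types)

# Log-log coordinates: these are the points a log-log plot would draw.
log_points = [(math.log10(rank), math.log10(n)) for rank, _, n in table]

for (rank, cat, n), (lx, ly) in zip(table, log_points):
    print(f"rank={rank} {cat:<11} count={n:>2} "
          f"log10(rank)={lx:.2f} log10(count)={ly:.2f}")
```

With a plotting library, the same `(rank, count)` pairs can be drawn on logarithmic axes. Four categories are too few to judge a power law; the pattern is clearer on variables with many distinct values.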
Ordered Categorical Variables - Univariate Analysis
• Whenever you have a continuous or an ordered categorical variable, make sure
you plot a histogram or a bar chart and look for any unexpected trends in it.
• For example:
• For a student, the examiner often seems like an antagonist who prevents
you from getting the scores you deserve.
• You might also have been intrigued by questions such as how many students
obtained marks similar to yours, how many were ahead, and how many lagged behind.
• And everyone has an opinion on when and where grace marks are justified.
• A histogram of the marks answers such questions directly.
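The questions about marks can be answered with a simple binned count. This is a sketch on made-up marks; the pass mark of 20, the bin width of 5 and the helper `bin_counts` are all assumptions for illustration. It prints a text histogram and counts how many students are ahead of, level with, or behind one particular score.

```python
from collections import Counter

# Hypothetical exam marks (out of 50) for a small class (made-up data).
marks = [12, 15, 18, 20, 20, 20, 20, 21, 23, 25, 25, 28, 30, 31, 35, 38, 41, 44]
PASS_MARK = 20  # assumed pass mark, for illustration only

def bin_counts(values, width):
    """Count values per bin; each key is the lower edge of its bin."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))  # keep the natural order of the bins

hist = bin_counts(marks, width=5)
for lower, n in hist.items():
    print(f"{lower:>2}-{lower + 4:<2} {'#' * n}")

# Position of one student relative to the rest of the class.
my_marks = 25
ahead = sum(m > my_marks for m in marks)
same = sum(m == my_marks for m in marks)
behind = sum(m < my_marks for m in marks)

# A cluster exactly at the pass mark is the kind of unexpected trend to look for.
at_pass = sum(m == PASS_MARK for m in marks)
print(f"ahead={ahead} same={same} behind={behind} at_pass_mark={at_pass}")
```

In this made-up sample the 20-24 bin is the tallest, which is the sort of pattern that might prompt a closer look at how grace marks were awarded.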
Quantitative Variables - Summary Metrics
• The simplest way to perform univariate analysis on quantitative data is to compute
the mean, median, mode, standard deviation and quartile values of the data.
• Mean and median are single values that broadly represent the entire data. It is very
important to understand when to use each metric to avoid an inaccurate analysis.
• The median is often a better measure of ‘representativeness’ when the data is skewed
or contains outliers, because the mean is pulled towards outliers whereas the median
is largely unaffected by them.
• It is useful to create a box plot of a numerical variable since it shows the spread
of the data between the first and the third quartile. It also indicates the
minimum and the maximum values in the dataset.
• Standard deviation and the interquartile range (the difference between the third and
first quartiles) are both used to represent the spread of the data.
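The effect of an outlier on these metrics is easiest to see on a small example. The sketch below uses Python's standard `statistics` module on made-up salary figures; the data and variable names are assumptions. It computes each summary metric, then recomputes the mean and median with the outlier removed.

```python
import statistics

# Hypothetical monthly salaries (in thousands); the last value is a deliberate outlier.
salaries = [30, 32, 35, 35, 38, 40, 42, 45, 400]

mean = statistics.mean(salaries)
median = statistics.median(salaries)
mode = statistics.mode(salaries)
std_dev = statistics.stdev(salaries)                # sample standard deviation
q1, q2, q3 = statistics.quantiles(salaries, n=4)    # quartile cut points
iqr = q3 - q1                                       # interquartile range

# Same central metrics with the outlier removed, to see which one moves more.
trimmed = salaries[:-1]
trimmed_mean = statistics.mean(trimmed)
trimmed_median = statistics.median(trimmed)

print(f"with outlier:    mean={mean:.2f} median={median} mode={mode}")
print(f"without outlier: mean={trimmed_mean:.2f} median={trimmed_median}")
print(f"std dev={std_dev:.2f} IQR={iqr:.2f}")
```

Here the mean changes by roughly 40 when the outlier is dropped, while the median changes by 1.5. Note that different tools use slightly different quartile conventions, so quartile values may not match exactly across libraries.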
Segmented Univariate Analysis
• In segmented univariate analysis, we segment the data by a categorical variable and
then conduct univariate analysis within each of its categories.
• Segmented univariate analysis allows you to compare subsets of data, which is a
powerful technique because it helps you understand how a relevant metric varies
across different segments.
• As a general rule of thumb, any categorical variable can become a basis of
segmentation.
• For example:
• For the number of runs scored by a batsman against different opponents, the column
containing the opponents can become a basis of segmentation.
Basis of Segmentation
• The entire segmentation process can be divided into four parts:
• Take raw data
• Group by dimensions
• Summarise using a relevant metric such as mean, median, etc.
• Compare the aggregated metric across groups/categories
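The four parts above can be sketched as follows, using the batsman example. The opponents ("Team A" and so on) and the run values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, median

# Step 1: raw data. Hypothetical (opponent, runs) records for one batsman.
innings = [
    ("Team A", 45), ("Team A", 10), ("Team A", 62),
    ("Team B", 80), ("Team B", 95), ("Team B", 71),
    ("Team C", 30), ("Team C", 120), ("Team C", 0),
]

# Step 2: group by the segmentation dimension (the opponent).
groups = defaultdict(list)
for opponent, runs in innings:
    groups[opponent].append(runs)

# Step 3: summarise each group using relevant metrics.
summary = {
    opponent: {"mean": mean(runs), "median": median(runs), "n": len(runs)}
    for opponent, runs in groups.items()
}

# Step 4: compare the aggregated metric across groups, highest mean first.
ranked = sorted(summary.items(), key=lambda kv: kv[1]["mean"], reverse=True)
for opponent, stats in ranked:
    print(f"{opponent}: mean={stats['mean']:.1f} "
          f"median={stats['median']} innings={stats['n']}")
```

In this made-up sample, Team C has a higher mean than Team A but a lower median, because a single innings of 120 pulls the mean up. The choice of metric in step 3 can therefore change the comparison in step 4.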

Quick Way of Segmentation
• When you have a large number of variables in your dataset, performing the same
analysis on each of them becomes very repetitive.
• One way of solving this problem is to make a table with the categorical variables on
one axis and the numeric variables (or measures/facts) on the other.
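Such a table can be built in one pass over the data. The sketch below uses made-up employee records; the helper `segment_table` and the column names are assumptions for illustration. It aggregates every numeric measure for every level of one categorical column.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical employee records (made-up values).
records = [
    {"department": "HR", "salary": 40, "experience": 3},
    {"department": "HR", "salary": 44, "experience": 5},
    {"department": "Sales", "salary": 50, "experience": 4},
    {"department": "Sales", "salary": 58, "experience": 8},
    {"department": "Accounts", "salary": 46, "experience": 6},
]

def segment_table(rows, category, measures, agg=mean):
    """Aggregate each numeric measure for each level of one categorical column."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for m in measures:
            grouped[row[category]][m].append(row[m])
    return {
        level: {m: agg(vals) for m, vals in cols.items()}
        for level, cols in grouped.items()
    }

measures = ["salary", "experience"]
table = segment_table(records, "department", measures)

# Categories as rows, numeric measures as columns.
print(f"{'department':<10}" + "".join(f"{m:>12}" for m in measures))
for level, row in table.items():
    print(f"{level:<10}" + "".join(f"{row[m]:>12.1f}" for m in measures))
```

If you work in pandas, `DataFrame.groupby(...).agg(...)` or `DataFrame.pivot_table(...)` produce the same kind of table with less code.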
Comparison of Averages
• Once you are done with segmentation, the next step is to compare your results
across the categories.
• You can either compare the means, or you can use other descriptive statistics
such as the median, maximum or minimum.
• But you should be careful while comparing averages, especially if the difference in
average values is small.
Comparison of Averages
• Don’t blindly believe in the averages of the buckets: you need to observe the
distribution of each bucket closely and ask yourself if the difference in means is
significant enough to draw a conclusion.
• If the difference in means is small, you may not be able to draw inferences. In
such cases, a technique called hypothesis testing is used to ascertain whether
the difference in means is significant or due to randomness.
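The slides do not name a specific test, so the following is only one possible illustration: a permutation test, which is a simple form of hypothesis testing that needs nothing beyond the standard library. The two segments, the function name `permutation_test` and its parameters are assumptions made for this sketch.

```python
import random

def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and the fraction of random relabellings
    whose absolute difference is at least as large (an approximate p-value).
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = sum(left) / len(left) - sum(right) / len(right)
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_resamples

# Hypothetical metric values for two segments (made-up data).
segment_a = [52, 48, 55, 50, 47, 53, 49, 51]
segment_b = [51, 49, 54, 50, 46, 52, 50, 48]

observed, p_value = permutation_test(segment_a, segment_b)
print(f"difference in means = {observed:.3f}, approximate p-value = {p_value:.3f}")
```

A large p-value means that a difference this size arises often when the segment labels are shuffled at random, so the data give no good reason to treat the segments as different. A small p-value (below 0.05 is a common convention) suggests the difference is unlikely to be due to randomness alone.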
Comparison of Other Metrics
• Once you have identified the variables based on the business problem for analysing
the segments, the next step is to know the distribution of segments and compare the
average of each segment.
• But this is not the only way of comparing segments. There are various metrics which
you can use to understand and explain your data easily.
• Besides finding the segments and comparing the metrics, your primary focus should
be on understanding the results arising from the segments.
Key Takeaways
● Revision
● Introduction to Univariate Analysis
● Categorical Unordered Univariate Analysis
● Categorical Ordered Univariate Analysis
● Statistics on Numerical Features
● Key Takeaways
Thank You!