EDA Plots

The document provides an overview of Exploratory Data Analysis (EDA) techniques, including Stem and Leaf Plots, Box Plots, Median Polish, Histograms, and Scatter Plots, emphasizing their use in understanding datasets without formal testing. It outlines step-by-step instructions for creating these visualizations using Microsoft Excel, along with interpretations of the results. The document serves as a guide for students in a Data Science Fundamentals course to apply EDA tools effectively.

Uploaded by

Manish Verma
Copyright
© All Rights Reserved

School of Computing Science and Engineering

Program: M.C.A.
Course Code: MCAS9220
Course Name: Data Science Fundamentals
Exploratory Data Analysis

Body of knowledge

1. Stem and Leaf Plot
2. Box Plot
3. Median Polish
4. Resistant Line
5. Resistant Smooth
6. Rootogram
Introduction to Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach made up of a set of techniques for understanding data better without the need for significance or confidence-level testing.

The uses of Exploratory Data Analysis are as follows:

1. Get detailed insight into your dataset.
2. Identify the critical variables that influence the dataset.
3. Detect whether any outliers are present in the dataset.
4. Test the underlying assumptions of the dataset.

Exploratory Data Analysis can be done in a matter of 3 minutes using Minitab or any other statistical software package.

Surprisingly, though, we will use Microsoft Excel® to build these tools.
Stem and Leaf Plot

A contact center quality team evaluates 100 calls in the contact center. The Quality Manager decides to review the quality scores of the operations floor.

Let us draw a stem and leaf plot to understand the data.

A snapshot of the data sheet is attached here. This data sheet can be found in the file EDA.xls.
Stem and Leaf Plot

Step 1 – Sort the data in ascending order.

Step 2 – Find the minimum and maximum values using the MIN and MAX functions.

Step 3 – Find the range using the formula MAX – MIN.

Step 4 – Construct the stems, starting from 0 and ending with 8. Rule for constructing stems – if the data set has 3-digit values, the stems need to be constructed according to the hundreds place.
Stem and Leaf Plot

Step 5 – We need to write the formula to compute the leaves. For example, take Stem 3, highlighted with a yellow background. We need to count how many values fall in the 30s, starting from 30.

Let us first write the formula to count the values that are exactly 30.

Press Enter. See how the leaf shows up as 0. This means we have one value of 30.

Let us change the first value of the dataset to 30, for the sake of simulation. As we see here, there are now two values of 30. So the formula works!
Stem and Leaf Plot

Step 6 – Let us now build the formula that counts all the numbers in the 30-40 series, i.e. 31, 32, 33, 34, 35, and so on.

That formula seems never to end, doesn't it? Do it just once, though, and the rest is easy.

It takes some pain, but it is worth it!
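The Excel formula itself is not reproduced in this text, so the following is only a minimal Python sketch of the same counting idea, assuming two-digit integer scores. For one stem, each leaf digit is repeated once per matching value. The `scores` list and the helper name `leaves_for_stem` are hypothetical and are not taken from EDA.xls.

```python
def leaves_for_stem(values, stem):
    """Return the leaf string for one stem (tens digit) of two-digit scores."""
    leaves = []
    for leaf in range(10):                  # leaf digits 0..9
        target = stem * 10 + leaf           # e.g. stem 3, leaf 1 -> 31
        count = values.count(target)        # how many scores equal the target
        leaves.append(str(leaf) * count)    # repeat the digit once per score
    return "".join(leaves)

# Hypothetical sample scores, not the data from EDA.xls
scores = [30, 30, 31, 35, 35, 35, 42, 28]
print(leaves_for_stem(scores, 3))           # two 30s, one 31, three 35s
```

Counting each value separately is what makes the spreadsheet formula long; in code the loop hides that repetition.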


Stem and Leaf Plot

Step 7 – The Stem and Leaf Plot appears as shown here.

Step 8 – Let us use the LEN and SUBSTITUTE functions together to add the interpretation.
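For comparison, the whole plot can be sketched in a few lines of Python. This is an illustrative sketch, assuming non-negative integer scores whose stem is the tens digit; the function name `stem_and_leaf` and the sample data are hypothetical. The leaf count printed next to each stem is one plausible reading of the "interpretation" column, which the text does not spell out.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaf digits as a string."""
    groups = defaultdict(list)
    for v in sorted(values):                # sort ascending, as in Step 1
        stem, leaf = divmod(v, 10)          # 37 -> stem 3, leaf 7
        groups[stem].append(str(leaf))
    lo, hi = min(groups), max(groups)
    # Include empty stems between the smallest and largest so gaps stay visible
    return {s: "".join(groups.get(s, [])) for s in range(lo, hi + 1)}

scores = [7, 12, 15, 30, 31, 35, 35, 52]    # hypothetical integer scores
for stem, leaves in stem_and_leaf(scores).items():
    # The number of leaves per stem is simply the length of the leaf string
    print(f"{stem} | {leaves:<6} ({len(leaves)})")
```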
Stem and Leaf Plot

1. You have an easier option of running a macro to generate the Stem and Leaf Plot, but VBA coding is not everyone's cup of tea.

2. You could use statistical software, but that may turn out to be slightly expensive.

3. With some simple Excel formulas, you have discovered tool 1, which is used to show the granularity of information in the dataset.

4. That is the Stem and Leaf Plot for you.


Box Plot

The granularity provided by the Stem and Leaf Plot is good, but at times you need a graph that shows the shape of the data, its distribution and its spread. That is where we use the Box Plot.

Let us draw a Box Plot to understand the data.

Five teams of a factory produce homogeneous units. The sampled cycle times are shown below.
Box Plot

Step 1 – Let us set up the table as seen here. We already know how to calculate the Minimum and Maximum values.

Step 2 – Calculate the Median and the Quartile values using the formulas below.

Median: =MEDIAN(Data range)
1st Quartile: =PERCENTILE(Data range, 25%)
3rd Quartile: =PERCENTILE(Data range, 75%)
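The same five summary values can be sketched in Python with the standard library. This is a minimal illustration, not the course's method: the `cycle_times` values are hypothetical, and the use of `method="inclusive"` is an assumption that it interpolates the same way as Excel's PERCENTILE.

```python
import statistics

def five_number_summary(data):
    """Minimum, 1st quartile, median, 3rd quartile and maximum of a sample."""
    # method="inclusive" interpolates between the sorted data points; it is
    # assumed here (not confirmed by the slides) to agree with Excel's PERCENTILE
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": min(data), "q1": q1, "median": med, "q3": q3, "max": max(data)}

cycle_times = [12, 15, 18, 20, 22, 25, 31, 40, 55]   # hypothetical values
print(five_number_summary(cycle_times))
```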
Box Plot

Step 3 – Although you have prepared the basic data needed, we aren't ready to draw the Box Plot yet. We need to prepare another table, the one shown here.

Step 4 – In the row titled Series 1, fetch the minimum values for the Teams.

In the row titled Series 2, subtract the Minimum value from the 1st Quartile value in the Summary Range table.
Box Plot

Step 5 – In the row titled Series 3, subtract the 1st Quartile value from the Median value.

In the row titled Series 4, subtract the Median from the 3rd Quartile value.

In the row titled Series 5, subtract the 3rd Quartile value from the Maximum value.

Let us now try to draw the Box Plot.
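The five series in Steps 4 and 5 are successive differences of the summary values, so that stacking them rebuilds the maximum. A small Python sketch of that arithmetic follows; the function name `stacked_series` and the numbers for the example team are hypothetical.

```python
def stacked_series(summary):
    """Heights of the five stacked segments described in Steps 4 and 5."""
    return {
        "Series 1": summary["min"],                      # base, hidden later
        "Series 2": summary["q1"] - summary["min"],      # lower whisker length
        "Series 3": summary["median"] - summary["q1"],   # lower half of the box
        "Series 4": summary["q3"] - summary["median"],   # upper half of the box
        "Series 5": summary["max"] - summary["q3"],      # upper whisker length
    }

# Hypothetical summary values for one team (not taken from the slides)
team = {"min": 12, "q1": 18, "median": 22, "q3": 31, "max": 55}
series = stacked_series(team)
print(series)
print(sum(series.values()))   # the segments stack back up to the maximum
```

Because every segment is a difference between neighbouring summary values, a stacked column chart places the box and whisker boundaries at the right heights.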
Box Plot

Step 6 – Select the data from Series 1 to Series 4. Don't select Series 5 yet. We will do that later.

Select 2D Column – Stacked Column Chart.
Box Plot

Step 7 – Obviously the chart is not yet a completed Box Plot. We need to work around a few things in Excel. Let us first hide Series 1 in the graph generated.

To do this, right-click on Series 1 on the graph.

Click on Format Data Series. Click on Fill.

Select No fill.

Click on Border Color. Select No color.
Box Plot

Step 8 – Repeat the same steps as in Step 7 for Series 2, but leave the cursor on Series 2 so that it stays selected.

Step 9 – We need to define the whiskers. To do that, click on Layout, click on Error Bars and click on More Error Bar Options.

Step 10 – In the dialog window that opens, select Minus for Direction and change the Percentage to 100.
Box Plot

Step 11 – After Step 9 and Step 10, the graph changes shape to what is seen here. Take a look at the graph.

Step 12 – Repeat Steps 9 and 10 for Series 4, with one small change: in the More Error Bar Options, set the Direction to Plus.

You will see how the lower and upper whiskers are defined now.
Box Plot

Step 13 – Oops, something went wrong with the graph here. We have not defined the Maximum values.

Step 14 – Click on the lines at the top. Click on Layout, click on More Error Bars and, in the window that opens, select Custom and specify the values.

Select the maximum values from the data-for-chart table, i.e. Series 5.
Box Plot

Step 15 – The Box Plot is ready now, and we can start interpreting it. We spent some time making this Box Plot, but it is a one-time effort. Once you are able to construct it, you can reuse it as a Box Plot template.

Box Plot Interpretation

1. The Median cycle time for Team C seems the lowest at approximately 20 minutes.

2. Team A shows the greatest spread in data.

3. Data for Team A is also heavily skewed.

4. Team E seems to have a good % of population in the lower end of the
Box Plot

1. A Box Plot doesn't confirm anything. It is thus not a confirmatory data analysis tool.

2. Because a Box Plot tells you about the central tendency, spread and shape of the data, you can use this EDA tool pretty much everywhere you have stratified data.

3. You can also use this tool where you have just one sample of data and wish to study the properties of that sample.
Median Polish

In inferential statistics, Analysis of Variance is a hypothesis-testing method that fits an additive model to a 2-way design and identifies data patterns not explained by the Row and Column variable effects.

Median Polish does a similar thing, except that it uses medians.

A company wishes to conduct a Median Polish on the percentage scores achieved by students in each course of an IT institution.

Table 1
Median Polish

Step 1 – First find the medians of all the course scores individually and subtract the individual mean performance scores from the median. This is known as the 1st sweep.

Table 2

Step 2 – Now do the 2nd sweep. In the second sweep, subtract the median from Table 2 (last row) and the row median from Table 2 (last column), both highlighted, from the table values of Table 1. For the column median, subtract the 2nd sweep value for any cell from the corresponding cell in the 1st sweep.

Table 3
Median Polish

Step 3 – Let us do the 3rd sweep now. Subtract the row values obtained in Table 3 from the row medians. Identify the new column medians in the 3rd sweep itself. The new row medians = Change Median – Median from Table 3.

Table 4
Median Polish

Table 5

Step 4 – Time for the 4th sweep. Subtract all the row values in Table 4 from the 3rd sweep column median. This will give you the row values for the new table that we will construct.

Also add the Column Median value to the 3rd Sweep Column Median.
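The sweeps are easier to follow in code. The sketch below uses the standard iterative form of median polish (alternating row and column sweeps), which is an assumption and may not match the exact sweep order of the tables on the slides. The function name `median_polish` and the example table are hypothetical. It splits every cell into overall + row effect + column effect + residual.

```python
from statistics import median

def median_polish(table, sweeps=4):
    """Fit table[i][j] = overall + row[i] + col[j] + residual[i][j] using medians."""
    resid = [list(r) for r in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(sweeps):
        # Row sweep: move each row's median out of the residuals
        for i in range(n_rows):
            m = median(resid[i])
            resid[i] = [x - m for x in resid[i]]
            row_eff[i] += m
        # Move the median of the column effects into the overall term
        m = median(col_eff)
        col_eff = [c - m for c in col_eff]
        overall += m
        # Column sweep: move each column's median out of the residuals
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            for i in range(n_rows):
                resid[i][j] -= m
            col_eff[j] += m
        # Move the median of the row effects into the overall term
        m = median(row_eff)
        row_eff = [r - m for r in row_eff]
        overall += m
    return overall, row_eff, col_eff, resid

# Hypothetical, perfectly additive table: every residual should come out as 0
table = [[10, 11, 14], [12, 13, 16], [15, 16, 19]]
overall, row_eff, col_eff, resid = median_polish(table)
print(overall, row_eff, col_eff, resid)
```

Large residuals left after the sweeps mark the cells that the row and column effects do not explain.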
Median Polish

Table 6 – Final Residual Table
Median Polish

Interpretations

1. The average test score performance across all the courses was 44.25%.

2. People who take JAVA programs alone score approximately 13 points less than those who take .NET.

3. Look at the column effects from the Residual table: students with 90% attendance outscore those with 70% attendance by 5 points.
Median Polish

Final Notes

1. The tediousness of the calculations shouldn't scare you away from this wonderful tool.

2. In a 2*2 design where one of the variables may be categorical, Median Polish comes in very handy for establishing relationships.

3. Because Median Polish lets you calculate residuals, you can also use it to predict what could happen in the future.
Histogram

The histogram is another important EDA tool, which you can use when you wish to check the shape of the data. Importantly, a histogram will highlight issues in the data such as:

1. Modality issues
2. Skew issues
3. Mixed distribution issues

Let us go back to the cycle time data and plot the histogram with the help of Excel.
Histogram

Step 1 – Let us first calculate the descriptive statistics for all the teams. As you can see from the table shown here, most of the formulas are basic except for the ones shaded with a light amber background.

IQR = 3rd Quartile – 1st Quartile
Bin width = 2*Count^(1/3)
Number of bins = (Maximum – Minimum)/Bin width
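The bin-width expression on this slide is hard to read in the extracted text. A widely used rule that combines the IQR with the cube root of the count is the Freedman-Diaconis rule, bin width = 2 × IQR / Count^(1/3). The sketch below implements that named rule as an assumption; it may or may not be the formula the slide intended. The function name and the sample data are hypothetical.

```python
import math
import statistics

def freedman_diaconis_bins(data):
    """Bin width and bin count under the Freedman-Diaconis rule (an assumption here)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                                   # IQR = 3rd quartile - 1st quartile
    width = 2 * iqr / len(data) ** (1 / 3)          # 2 * IQR / cube root of the count
    # Round up so that the bins cover the full range (fails if the IQR is 0)
    n_bins = math.ceil((max(data) - min(data)) / width)
    return width, n_bins

data = [1, 2, 3, 4, 5, 6, 7, 10]                    # hypothetical sample of 8 values
width, n_bins = freedman_diaconis_bins(data)
print(round(width, 2), n_bins)
```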
Histogram

Step 2 – Let us now define the bins. Start with the minimum value. For example, for Team A the first bin would be 0.32. The next bin would be 0.32 + Bin Size (7.26). The third bin would be 7.53 + 7.26, and so on. Continue until you reach 7 bins.
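Generating the bin values is a simple running sum, sketched here in Python with the minimum (0.32), bin size (7.26) and bin count (7) quoted for Team A. The helper name `bin_edges` is hypothetical. Note that with these inputs the second bin works out to 7.58 rather than the 7.53 quoted in the text, so one of the quoted figures appears to be rounded or mistyped.

```python
def bin_edges(minimum, bin_width, n_bins):
    """Bin values starting at the minimum and stepping by the bin width."""
    return [round(minimum + k * bin_width, 2) for k in range(n_bins)]

# Minimum, bin size and number of bins as quoted for Team A in Step 2
edges = bin_edges(0.32, 7.26, 7)
print(edges)
```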
Histogram

Step 3 – Let us first draw the Histogram for one team's metric performance, e.g. Team A.

Steps to draw a Histogram

1. Click on Data. Click on Data Analysis (if this option is not available, please enable the Data Analysis add-in).
2. From the Data Analysis dialog window, choose Histogram.
3. In the Input Range section, select the data corresponding to Team A.
4. In the Bin Range section, select the bin range corresponding to Team A.
5. Tick Chart Output and click OK.
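What the Histogram tool produces is a frequency count per bin. The Python sketch below assumes that each bin value acts as an inclusive upper limit and that values above the last bin fall into an overflow bucket; that reading of Excel's behaviour is an assumption, and the function name and data are hypothetical.

```python
from bisect import bisect_left

def histogram_counts(data, bins):
    """Count values per bin, treating each bin value as an inclusive upper limit."""
    bins = sorted(bins)
    counts = [0] * (len(bins) + 1)          # last slot is the overflow bucket
    for x in data:
        # bisect_left finds the first bin value >= x, so x == bin lands in that bin
        counts[bisect_left(bins, x)] += 1
    return counts

data = [1, 5, 10, 11, 20, 25, 31]           # hypothetical cycle times
print(histogram_counts(data, [10, 20, 30]))
```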
Histogram

We achieved this nice-looking Histogram by reducing the Gap to 0% on the graph.
Histogram

Interpretations

1. Bi-modality is observed at 7.53 and 56. Is this due to an external issue?

2. If the bi-modality is resolved, we would get close to a perfect distribution, but what is the reason for this bi-modality?

3. It could be a difference in suppliers, a difference in changeovers, a difference in raw materials. Anything?
Rootogram

Interpretations

1. Here we introduce a new tool. Instead of plotting the frequencies on the vertical axis, you plot the square root of each frequency, and what you have is known as the Rootogram.

2. The x-axis is the response variable instead of the bins used in a Histogram.
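The transformation itself is a one-liner. A minimal Python sketch, with hypothetical bin counts and a hypothetical function name:

```python
import math

def rootogram_heights(frequencies):
    """Square root of each frequency, used as the bar height in a rootogram."""
    return [math.sqrt(f) for f in frequencies]

freqs = [1, 4, 9, 16, 4]                    # hypothetical bin counts
print(rootogram_heights(freqs))
```

Taking square roots compresses the tall bars more than the short ones, which makes differences among low-frequency bins easier to see.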
Histogram

Based on the 4 Histograms drawn for each of the teams, what can you infer?

Which team's data distribution is close to being a normal distribution?
Scatter Plot

In projects we often stumble upon the fact that x impacts y. In other words, y = f(x). Using scatter plots, you can visually check whether there is a relationship between x and y.

Let us use data for two variables, machine downtime and production capacity for a factory, to understand how a scatter plot works. Downtime is expressed in % and Production Capacity is expressed in tons.
Scatter Plot

Step 1 – Select the data, click on Insert, click on Scatter and click on Scatter with only markers.

Step 2 – Voila, you are done. There you have the scatter chart as seen here.
Scatter Plot

Step 3 – Modification to a Regression equation

This is where you can use an EDA tool as an inferential statistics tool. Right-click on any point in the graph and click on Add Trendline. Select Linear, Display equation and Display R-Square.
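The linear trendline and its R-Square can be reproduced with ordinary least squares. The following Python sketch shows the arithmetic under that assumption; the downtime and capacity values are hypothetical and the function name `linear_fit` is illustrative.

```python
def linear_fit(xs, ys):
    """Least-squares slope, intercept and R-squared for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy)    # squared Pearson correlation
    return slope, intercept, r_squared

downtime = [2, 4, 6, 8, 10]                 # hypothetical downtime in %
capacity = [95, 90, 88, 80, 77]             # hypothetical capacity in tons
slope, intercept, r2 = linear_fit(downtime, capacity)
print(f"y = {slope:.2f}x + {intercept:.2f}, R^2 = {r2:.3f}")
```

For a straight-line fit with one predictor, R-Square equals the square of the correlation coefficient between x and y.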
Scatter Plot

Step 4 – Interpretation

While the scatter graph itself visually revealed the absence of any strong correlation between downtime and production capacity, the regression statistics merely confirm it.

The R-Square value needs to be > 0.64 (equivalently, a correlation coefficient above 0.8 in absolute value) for us to conclude strong correlation.
Final Notes

1. This module covers most of the tools used in Exploratory Data Analysis.

2. Some other tools are:

a. Parallel Coordinates
b. Run Charts
c. Odds Ratio
d. Principal Components Analysis
e. Ordination
Thank you.
