EDA Plots
Program: M.C.A.
Course Code: MCAS9220
Course Name: Data Science Fundamentals
Exploratory Data Analysis
Body of knowledge
Exploratory Data Analysis is an approach comprising a set of techniques that help you
understand the data better without the need for significance or confidence level testing.
Exploratory Data Analysis can be done in a matter of 3 minutes using Minitab or any
other statistical software package.
Be surprised, though: we will use Microsoft Excel® to build these tools.
Stem and Leaf Plot
Let us first write the formula to count the values that are 30.
Press Enter. See how the Leaf shows up as 0. This means we have one value of 30.
Step 6 – Let us now build the formula which will count all the
numbers in the series of 30-40, i.e. 31, 32, 33, 34, 35, and so on.
Huh! That formula seems to never end, does it? Well, do it just once
and then it will be easy.
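For readers who want to cross-check the Excel result, the stem-and-leaf grouping can be sketched in Python. This is a minimal sketch, not the Excel formula from the slides: it assumes non-negative integer data, uses the tens digit as the stem and the units digit as the leaf, and the function name `stem_and_leaf` and the sample values are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers by stem (value // 10) and leaf (value % 10)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    return dict(groups)


# Hypothetical sample data, not taken from the slides.
data = [30, 31, 35, 38, 42, 47, 47, 53]
plot = stem_and_leaf(data)

# Print one row per stem, with the leaves in ascending order.
for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```

A value of 30 appears as stem 3 with leaf 0, which is the same reading as the Leaf of 0 described above.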
Median: =MEDIAN(Data range)
1st Quartile: =PERCENTILE(Data range, 25%)
3rd Quartile: =PERCENTILE(Data range, 75%)
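The same Box Plot inputs can be sketched outside Excel. The example below is an illustration under stated assumptions: it uses Python's `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between the minimum and maximum of the data and should therefore agree with Excel's PERCENTILE; the helper name `box_plot_stats` and the cycle times are hypothetical.

```python
import statistics


def box_plot_stats(values):
    """Return the five numbers a Box Plot is built from."""
    # n=4 asks for the three quartile cut points (Q1, median, Q3).
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(values),
    }


# Hypothetical cycle times in minutes, not taken from the slides.
cycle_times = [12, 15, 18, 20, 22, 25, 31]
stats = box_plot_stats(cycle_times)
for name, value in stats.items():
    print(f"{name}: {value}")
```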
Box Plot
Step 3 – Although you have prepared the basic data needed, we aren’t ready to
draw the Box Plot yet. We need to prepare another table, one that is shown here.
Select No fill.
Box Plot Interpretation
1. The Median cycle time for Team C seems the lowest at approximately 20 minutes.
2. Given the fact that a Box Plot is able to tell you information
about central tendency, spread and shape of the data, you can
use this EDA tool pretty much everywhere you have stratified
data.
3. You can also use this tool where you just have one sample of
data and you wish to study properties of that sample.
Median Polish
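In inferential statistics, Analysis of Variance is a hypothesis
testing procedure that fits an additive model to a 2-way design and
identifies data patterns not explained by Row and Column
variable effects. Median Polish is its exploratory counterpart: it fits
the same kind of additive model using medians instead of means.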
Table 1
Step 1 – First find out the medians of all the course scores individually and
subtract the individual mean performance scores from the median. This is known as the
1st sweep.
Step 2 – Now, do the 2nd sweep. In the second sweep, subtract the column median from
Table 2 (last row) and the row median from Table 2 (last column) (both
highlighted) from the values of Table 1.
For the column median, subtract the 2nd sweep value for any cell with the
corresponding cell in the 1st sweep.
Table 2
Table 3
Step 3 – Let’s do the 3rd sweep now. Subtract the row values obtained in table 3 from the
row medians. Identify the new column medians in the 3rd sweep itself. The new row
medians = Change Median – Median from table 3.
Table
Table 5
Step 4 – Time for the 4th sweep. Subtract all the row values in
table 4 from the 3rd sweep column median. This will give you
the row values for the new table that we will be constructing.
Also add the Column Median value to the 3rd Sweep Column Median.
Table 6 – Final Residual Table
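The sweeps can also be sketched in code. The version below is the textbook Tukey-style median polish, which may differ in bookkeeping and sign conventions from the spreadsheet tables in the slides: each iteration does a row sweep followed by a column sweep, so two iterations correspond to four sweeps. The function name `median_polish`, its `iterations` parameter and the score table are hypothetical.

```python
from statistics import median


def median_polish(table, iterations=2):
    """Fit value = overall + row effect + column effect + residual using medians."""
    resid = [[float(x) for x in row] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols

    for _ in range(iterations):
        # Row sweep: move each row's median out of the residuals.
        for i in range(n_rows):
            m = median(resid[i])
            resid[i] = [x - m for x in resid[i]]
            row_eff[i] += m
        # Re-centre the column effects, moving their median to the overall term.
        m = median(col_eff)
        col_eff = [c - m for c in col_eff]
        overall += m

        # Column sweep: move each column's median out of the residuals.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            for i in range(n_rows):
                resid[i][j] -= m
            col_eff[j] += m
        # Re-centre the row effects in the same way.
        m = median(row_eff)
        row_eff = [r - m for r in row_eff]
        overall += m

    return overall, row_eff, col_eff, resid


# Hypothetical scores (rows = students, columns = courses), not from the slides.
scores = [
    [62, 70, 55],
    [75, 81, 60],
    [58, 90, 52],
]
overall, row_eff, col_eff, residuals = median_polish(scores)
print("overall:", overall)
print("row effects:", row_eff)
print("column effects:", col_eff)
for row in residuals:
    print(row)
```

Large entries in the final residual table mark cells that the row and column effects do not explain, which is the pattern the technique is meant to expose.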
Interpretations
1. Modality issues
2. Skew issues
3. Mixed distribution issues
Let us go back to the cycle time data and try to plot the
histogram with the help of Excel.
Histogram
Step 1 – Let us first calculate the descriptive statistics measures for
all the teams. As you can see from the table shown here, most of the
formulas are basic except for the ones shaded in Light amber
background.
We achieved this nice-looking Histogram by reducing the Gap to 0% on the graph.
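The frequency table behind such a histogram can be sketched as follows. This is a minimal illustration, assuming equal-width bins that begin at a chosen `start` value no larger than the smallest observation; the function name `bin_counts`, the bin width and the cycle times are hypothetical.

```python
from collections import Counter


def bin_counts(values, bin_width, start):
    """Count values per equal-width bin [lower, upper); assumes min(values) >= start."""
    counts = Counter((v - start) // bin_width for v in values)
    n_bins = max(counts) + 1
    return [
        (start + k * bin_width, start + (k + 1) * bin_width, counts.get(k, 0))
        for k in range(n_bins)
    ]


# Hypothetical cycle times in minutes, not taken from the slides.
cycle_times = [12, 15, 18, 20, 22, 25, 31, 33, 21, 19]
bins = bin_counts(cycle_times, bin_width=5, start=10)

# Adjacent text bars stand in for columns drawn with no gap between them.
for lower, upper, count in bins:
    print(f"{lower:>3}-{upper:<3} {'#' * count}")
```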
Interpretations
While the scatter graph itself visually revealed the absence of any strong
correlation between downtime and production capacity, the
regression statistics merely confirm it.
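The regression statistics referred to here can be reproduced with a short least-squares sketch. It assumes a simple one-predictor linear fit; the function name `simple_regression` and the downtime and capacity figures are hypothetical and are not the data behind the slides.

```python
from math import sqrt
from statistics import mean


def simple_regression(x, y):
    """Return (slope, intercept, r) for a least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


# Hypothetical observations, not taken from the slides.
downtime_hours = [2, 5, 1, 7, 3, 6, 4]
capacity_units = [410, 395, 400, 405, 390, 415, 398]

slope, intercept, r = simple_regression(downtime_hours, capacity_units)
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f} r_squared={r * r:.3f}")
```

A value of r close to 0, and with it a small r squared, is the numerical counterpart of a scatter graph that shows no clear trend.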