Module 2 - Exploratory Data Analysis (EDA) : Central Tendency and Variability
• The properties of this distribution can be described in
several ways - Central tendency, Position, Variability
Describing a Population/Sample
• Central Tendency or “Average”
– Mode
– Median
– Mean
• Position
– Quantiles
– Percentiles
• Variability or Dispersion
– Range, Interquartile Range (IQR)
– Variance, Standard Deviation
– Standard Error of the Sample Mean
[Figure: histogram of Frequency (0 to 15) against height (16 to 32)]
Working With an Example
• Sample mean
– Represented by $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
• Population mean
– Represented by $\mu$
• Note that $\sum_{i=1}^{n}$ means
– sum all values $x_i$ from $i = 1$ to $n$
Calculating the Mean
• The summation of all of our data values
$\sum_{i=1}^{23} x_i = 1714.1$ kg
• So the mean is
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1714.1}{23} = 74.5$ kg
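The calculation above can be checked with a short Python sketch. The 23 weights are copied from the "Number" column of the worked example table in this module; the variable names are illustrative only.

```python
# Weights in kg, copied from the "Number" column of the worked example.
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

n = len(weights)      # number of observations
total = sum(weights)  # the summation of all data values
mean = total / n      # sample mean

print(f"n = {n}, sum = {total:.1f} kg, mean = {mean:.1f} kg")
# prints: n = 23, sum = 1714.1 kg, mean = 74.5 kg
```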
Position
• Quantiles
– General name for measures of position that
divide the distribution (or ranked data) into
equal groups. For example: quarters, tenths,
hundredths, etc.
• Quartiles
– Measures of position that divide the
distribution (or ranked data) into Quarters.
• Percentiles
– Measures of position that divide the
distribution (or ranked data) into 100 equal
subsets
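As a rough illustration of these measures of position, the sketch below uses `statistics.quantiles` from the Python standard library on the 23 weights from the worked example. Note the assumption: the function's default ("exclusive") interpolation rule is used here, and other packages, including SPSS, may use a different rule and give slightly different cut points.

```python
import statistics

# Weights in kg, copied from the "Number" column of the worked example.
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

# n is the number of equal groups; n - 1 cut points are returned.
quartiles = statistics.quantiles(weights, n=4)      # 3 cut points
deciles = statistics.quantiles(weights, n=10)       # 9 cut points
percentiles = statistics.quantiles(weights, n=100)  # 99 cut points

print("Quartiles (Q1, Q2, Q3):", quartiles)
print("Number of decile cut points:", len(deciles))
print("50th percentile:", percentiles[49])
```

The second quartile, the fifth decile and the 50th percentile are all the same point: the median.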
Central Tendency vs. Variability
• The mean, median, and mode all tell us about
the central tendency of a distribution.
• Variability can be described as the difference (or distance) between each data point and the mean: $x - \bar{x}$
[Figure: plot (axis 30 to 70) with the Mean marked]
Variability Around the Mean
• If we square the differences then we will always get a positive number (or zero)
– the sum of the squared differences is known as the sum of squares (SS)
– this can be represented by the following equation
$SS = \sum (x - \bar{x})^2$
– Where;
$\bar{x}$ represents the mean
$x$ represents each individual number

Number   Mean       Number - Mean   Squared
73       74.52609   -1.526086957    2.328941
93       74.52609   18.47391304     341.2855
68.5     74.52609   -6.026086957    36.31372
101      74.52609   26.47391304     700.8681
65.5     74.52609   -9.026086957    81.47025
78.5     74.52609   3.973913043     15.79198
83       74.52609   8.473913043     71.8072
80       74.52609   5.473913043     29.96372
80.5     74.52609   5.973913043     35.68764
87       74.52609   12.47391304     155.5985
73       74.52609   -1.526086957    2.328941
75.6     74.52609   1.073913043     1.153289
61       74.52609   -13.52608696    182.955
86.5     74.52609   11.97391304     143.3746
61.5     74.52609   -13.02608696    169.6789
65.5     74.52609   -9.026086957    81.47025
39       74.52609   -35.52608696    1262.103
98       74.52609   23.47391304     551.0246
69.5     74.52609   -5.026086957    25.26155
52.5     74.52609   -22.02608696    485.1485
71.5     74.52609   -3.026086957    9.157202
76       74.52609   1.473913043     2.17242
74.5     74.52609   -0.026086957    0.000681
Total               0               4386.944
Variability Around the Mean
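• Dividing the sum of squares by n - 1 gives the sample variance
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{4386.944}{23 - 1} = 199.4$ kg$^2$
Notice that we are in squared units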
Variability - Sample Standard Deviation
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{199.4} = 14.12$ kg
Notice that we are now back in our original units
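The path from deviations to sum of squares, variance and standard deviation can be traced step by step in a short Python sketch. It uses the 23 weights from the worked example; the variable names are illustrative, and the last line only cross-checks the hand calculation against the standard library.

```python
import math
import statistics

# Weights in kg, copied from the "Number" column of the worked example.
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

n = len(weights)
mean = sum(weights) / n

deviations = [x - mean for x in weights]  # x - mean; these sum to ~0
ss = sum(d ** 2 for d in deviations)      # sum of squares (SS)
variance = ss / (n - 1)                   # sample variance, in kg^2
sd = math.sqrt(variance)                  # sample standard deviation, in kg

print(f"SS = {ss:.3f}, variance = {variance:.1f} kg^2, s = {sd:.2f} kg")

# Cross-check against the library's sample variance (n - 1 denominator).
assert math.isclose(variance, statistics.variance(weights))
```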
The Standard Error of the Sample Mean
• The Std. Dev. divided by the square
root of n is called the Standard Error of
the sample mean - we will encounter this
measure later on in the course.
$s_{\bar{x}} = \sqrt{\frac{s^2}{n}} = \frac{s}{\sqrt{n}} = \sqrt{\frac{199.4}{23}} = \frac{14.12}{\sqrt{23}} = 2.94$ kg
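A minimal Python sketch of the same calculation, again using the 23 weights from the worked example. `statistics.stdev` returns the sample standard deviation (n - 1 in the denominator); the variable names are illustrative.

```python
import math
import statistics

# Weights in kg, copied from the "Number" column of the worked example.
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

n = len(weights)
sd = statistics.stdev(weights)  # sample standard deviation, s
se = sd / math.sqrt(n)          # standard error of the sample mean

print(f"s = {sd:.2f} kg, n = {n}, S.E. = {se:.2f} kg")
# prints: s = 14.12 kg, n = 23, S.E. = 2.94 kg
```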
Sample vs. Population
Sample     Population
Sample Only
$s_{\bar{x}}$  Standard error of the sample mean (S.E.)
Module 2 - Exploratory Data Analysis (EDA) : Graphical Methods
[Figure: bar chart of Count (0 to 80) by Element (Hydrogen, Helium, Other); cases weighted by MASS]
Clustered Bar Chart
Two variables with two categories each
[Figure: clustered bar chart of Count (0 to 500) by Smoking Status (Smoker, Non Smoker), clustered by Cancer Status (Cancer, No Cancer); cases weighted by freq]
Graphs for Continuous Variables
• Measurement scale - Scale
– Other terms - quantitative
– Examples - Length, Temperature, Species Richness
Histogram
[Figure: histogram of Frequency (0 to 16) against Height Categories (or Bins): 16 – 17.9, 18 – 19.9, 20 – 21.9, 22 – 23.9, 24 – 25.9, 26 – 27.9, 28 – 29.9, 30 – 31.9]
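To show how a histogram's bars are obtained, here is a small Python sketch that counts values into the eight height bins shown on the axis (16 – 17.9 up to 30 – 31.9). The function name `bin_counts` and the `heights` list are hypothetical; the heights are made up purely to exercise the function and are not the course data.

```python
def bin_counts(values, start=16.0, width=2.0, n_bins=8):
    """Count how many values fall in each bin [lower, lower + width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:  # values outside the bins are ignored
            counts[index] += 1
    return counts

# Hypothetical plant heights (illustration only, not the course data).
heights = [16.5, 18.2, 19.9, 21.0, 22.4, 22.8, 23.1, 23.9, 25.5, 31.0]

counts = bin_counts(heights)
for i, c in enumerate(counts):
    lower = 16.0 + 2.0 * i
    # One '#' per observation gives a sideways text histogram.
    print(f"{lower:g} - {lower + 1.9:g}: {'#' * c}")
```

Each bar's height is simply the count for its bin, so changing the bin width changes the shape of the histogram.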
Here’s One We Prepared Earlier
Histogram of Plant height
[Figure: histogram of plant height; x-axis: height (16 to 32); y-axis: Frequency; Mean = 23.03, Std. Dev. = 2.7412, N = 50]
[Figure: histogram of a single sample; x-axis: height (16 to 32); y-axis: Frequency; Mean = 23.03, Std. Dev. = 2.7412, N = 50]
[Figure: plot against Observed Value (16 to 32)]
Box and Whisker Plots
• The Box includes
– The Median
– Q1 and Q3 as the edges of the box
• The Whiskers
– either (method 1) – “5 number summary”
• Max and the Min are the ends of the whiskers
– or (method 2) – default method used in SPSS
• Q3+1.5 IQR and Q1-1.5 IQR are the ends of the
whiskers
• Q3+3.0 IQR and Q1-3.0 IQR border between outliers
and extreme outliers
• symbols used for outliers (O) and extreme outliers (*)
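The two whisker methods can be sketched in Python using the 23 weights from the worked example earlier in this module. This is an illustration under stated assumptions: the quartiles come from `statistics.quantiles` with its default rule, which may differ from the rule SPSS uses, so the fences SPSS reports could be slightly different.

```python
import statistics

# Weights in kg, copied from the "Number" column of the worked example.
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

q1, q2, q3 = statistics.quantiles(weights, n=4)
iqr = q3 - q1

# Method 1: the five number summary.
five_numbers = (min(weights), q1, q2, q3, max(weights))

# Method 2: fences at 1.5 and 3.0 times the IQR beyond the box.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr

extremes = [x for x in weights if x < outer_low or x > outer_high]
outliers = [x for x in weights
            if (x < inner_low or x > inner_high) and x not in extremes]

print("Five number summary:", five_numbers)
print("Inner fences:", inner_low, inner_high)
print("Outliers (O):", outliers)
print("Extreme outliers (*):", extremes)
```

Under these assumptions the value 39 falls just below the lower inner fence of 39.25, so it would be drawn as an outlier (O), and no value is an extreme outlier.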
Box and Whisker Plot
Method 1 - 5 Number Summary
This type of Box and Whisker Plot is the simplest.
It is based on a five number summary: Max, Q3, Q2, Q1, Min
[Figure: box and whisker plot labelled, top to bottom, Max, Q3, Q2 (Median), Q1, Min, with the Range and the IQR marked]
Box and Whisker Plot
Method 2 - SPSS (Boxplot)
[Figure: box and whisker plot labelled, top to bottom: Extreme Outlier (*), Q3 + 3 IQR, Outliers (o), Q3 + 1.5 IQR (or max), Q3, Q2 (Median), Q1, Q1 - 1.5 IQR (or min), Outlier (o), Q1 - 3 IQR]
Making a Boxplot in SPSS
SPSS Clustered Boxplot
Note: Outlier present in second site (sample)
[Figure: clustered boxplot of GALLS (y-axis, -10 to 70) for groups 1 to 5; N = 8 in each group]
• Make sure you select the correct measure of variability
• The default multiplier is 2, so make sure that you always change it to 1
SPSS Clustered Error Bar Plot
Note: Mean ± 1 S.E.
[Figure: error bar plot of Mean ± 1 SE of GALLS (y-axis, 0 to 40) for groups 1 to 5; N = 8 in each group]
[Figure: scatter plots involving Oxygen Concentration (2.00 to 5.00) and Turbidity (10.0 to 20.0); linear fit with R Sq Linear = 0.979]