Module 2 - Exploratory Data Analysis (EDA) : Central Tendency and Variability

Uploaded by

teganalexis
Copyright
© Attribution Non-Commercial (BY-NC)

Module 2 - Exploratory Data Analysis (EDA)

Central Tendency and Variability

Text: Field, A. 2009 2nd edition
-Chapter 1: 1.7
-Chapter 2: 2.1 – 2.5
-Chapter 4: 4.1 – 4.9
Describing a Population/Sample
• Statistics is the study of data which has some element of random variation - random variable.
• This variation in the variable under study can be conceptualised as a frequency or probability distribution.
• An example - Distribution of a normal random variable (x)
[Figure: distribution curve of a normal random variable, with x on the horizontal axis]
• The properties of this distribution can be described in several ways - Central tendency, Position, Variability
Describing a Population/Sample
• Central Tendency or “Average”
– Mode
– Median
– Mean
• Position
– Quantiles
– Quartiles
– Percentiles
• Variability or Dispersion
– Range, Interquartile Range (IQR)
– Variance, Standard Deviation
– Standard Error of the Sample Mean
[Figure: histogram of height; x-axis: height, 16 to 32; y-axis: Frequency, 0 to 15; Mean = 23.03, Std. Dev. = 2.7412, N = 50]
Working With an Example

Note that for the following definitions, we will be working with the following data set (n=23) of individual weights (kg)

73     78.5   73     65.5   71.5
93     83     75.6   39     76
68.5   80     61     98     74.5
101    80.5   86.5   69.5
65.5   87     61.5   52.5
Central Tendency - Mode
• The mode is the most common value
• It has the highest frequency in the dataset
• You can see that the example dataset has two modes: 65.5kg and 73kg both have a frequency of 2
• This dataset is bimodal

Value   Frequency
39      1
52.5    1
61      1
61.5    1
65.5    2
68.5    1
69.5    1
71.5    1
73      2
74.5    1
75.6    1
76      1
78.5    1
80      1
80.5    1
83      1
86.5    1
87      1
93      1
98      1
101     1
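The frequency table above can be reproduced with a short Python sketch. This is only an illustration and not part of the SPSS workflow used in the module; the variable names (`weights`, `frequency`, `modes`) are chosen here for the example.

```python
from collections import Counter

# Example data set from the slides: 23 individual weights (kg).
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

# Frequency table: how many times each value occurs.
frequency = Counter(weights)

# The mode(s) are the value(s) with the highest frequency.
highest = max(frequency.values())
modes = sorted(value for value, count in frequency.items() if count == highest)

print(modes)    # the slide reports two modes, 65.5 kg and 73 kg
print(highest)  # the frequency shared by the modes
```

Collecting every value that reaches the highest frequency, instead of stopping at the first one, is what lets the sketch report a bimodal data set.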
Central Tendency - Median
• The median is the middle value in an ordered list of n numbers
• 50% of the data lie on either side of this value
• It is also represented as Q2 (2nd Quartile)
• The position of Q2 can be calculated by using the following
$$\frac{n + 1}{2}$$
Calculating the Median
In our example the dataset contains 23 numbers:
$$\frac{n + 1}{2} = \frac{23 + 1}{2} = 12$$
Therefore the 12th number in the ascending data set will be the median (Q2 = 74.5kg)

Order   Number
1       39
2       52.5
3       61
4       61.5
5       65.5
6       65.5
7       68.5
8       69.5
9       71.5
10      73
11      73
12      74.5
13      75.6
14      76
15      78.5
16      80
17      80.5
18      83
19      86.5
20      87
21      93
22      98
23      101
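The position rule can be checked with a minimal Python sketch, assuming the same 23 weights. It handles only the odd-n case used on the slide; for an even n the position falls halfway between two values, and the usual convention is to average them, which is what the standard-library `statistics.median` does.

```python
import statistics

# Example data set from the slides: 23 individual weights (kg).
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

ordered = sorted(weights)  # the median is defined on the ordered list
n = len(ordered)

# Position of Q2 from the slide: (n + 1) / 2. With n = 23 this is a whole number.
position = (n + 1) // 2
median = ordered[position - 1]  # positions are 1-based, list indexes are 0-based

print(position, median)            # the slide reports the 12th number, 74.5 kg
print(statistics.median(weights))  # library cross-check
```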
Central Tendency - Mean
• Sample mean
– Represented by $\bar{x}$
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• Population mean
– Represented by $\mu$
• Note that $\sum_{i=1}^{n}$ means
– sum all values from 1 to n
Calculating the Mean
• The summation of all of our data values
$$\sum_{i=1}^{23} x_i = 1714.1\ \text{kg}$$
• Divided by the number of values (n = 23)
• So the mean is
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1714.1}{23} \approx 74.5\ \text{kg}$$
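The same arithmetic as a small Python sketch, again assuming the 23 weights from the example slide. `math.fsum` is used only to keep floating-point rounding out of the total; a plain `sum` would give essentially the same answer.

```python
from math import fsum

# Example data set from the slides: 23 individual weights (kg).
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

n = len(weights)
total = fsum(weights)    # sum of x_i for i = 1 .. n
sample_mean = total / n  # x-bar = total / n

print(f"sum = {total:.1f} kg, n = {n}, mean = {sample_mean:.1f} kg")
```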
Position
• Quantiles
– General name for measures of position that divide the distribution (or ranked data) into equal groups, for example quarters, tenths, hundredths, etc.
• Quartiles
– Measures of position that divide the
distribution (or ranked data) into Quarters.
• Percentiles
– Measures of position that divide the
distribution (or ranked data) into 100 equal
subsets
Central Tendency vs. Variability
• The mean, median, and mode all tell us about the central tendency of a distribution.
• They cannot tell us about the spread of the distribution (variability).
Variability - Range

• The Range of the distribution of data is given by the difference between the maximum value and the minimum value
Range = Max - Min
• A measurement of variability that usually accompanies the Median.
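Applied to the example weights, the range can be computed in a couple of lines of Python. This is an illustrative sketch only; the slides do not work this value out.

```python
# Example data set from the slides: 23 individual weights (kg).
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

# Range = Max - Min
data_range = max(weights) - min(weights)

print(f"Range = {max(weights)} - {min(weights)} = {data_range} kg")
```

Because the range uses only the two most extreme values, a single unusual observation (here the 39 kg weight) changes it substantially.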
Variability - Interquartile Range
• Quartiles are the three points (Q1, Q2, Q3)
in the distribution defining four equal
quarters.
• The quartiles cut the data distribution into
four sections each containing 25% of the
data.
[Figure: distribution curve divided at Q1, Q2 and Q3 into four sections, each containing 25% of the data]
Variability - Interquartile Range
• The Interquartile Range (IQR) is represented by
the difference between the lower quartile (Q1)
and the upper quartile (Q3)
• These quartile positions can be calculated via
$$\frac{n + 1}{4} \text{ for Q1} \qquad \frac{3(n + 1)}{4} \text{ for Q3}$$
• The IQR can then be calculated using the value at these positions
• A measurement of variability that usually
accompanies the Median.
Calculating the Interquartile Range
$$\frac{n + 1}{4} = \frac{23 + 1}{4} = 6 \quad \text{for Q1}$$
6th number: Q1 is therefore 65.5kg.
$$\frac{3(n + 1)}{4} = \frac{3(23 + 1)}{4} = 18 \quad \text{for Q3}$$
18th number: Q3 is therefore 83.0kg.

Order   Number
1       39
2       52.5
3       61
4       61.5
5       65.5
6       65.5    (Q1)
7       68.5
8       69.5
9       71.5
10      73
11      73
12      74.5    (Q2 or Median)
13      75.6
14      76
15      78.5
16      80
17      80.5
18      83      (Q3)
19      86.5
20      87
21      93
22      98
23      101
Calculating the Interquartile Range
Q1 = 65.5kg (6th number in the ordered data)
Q3 = 83.0kg (18th number in the ordered data)
IQR = Q3 - Q1
= 83.0 - 65.5
= 17.5kg
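The quartile positions and the IQR can be sketched in Python as follows. The helper `value_at` is a hypothetical name introduced for this example, and it deliberately handles only whole-number positions such as those that arise when n = 23. Statistical packages interpolate when the position is fractional, and their rules differ, so other data sets may give slightly different quartiles in different software.

```python
# Example data set from the slides: 23 individual weights (kg).
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

ordered = sorted(weights)
n = len(ordered)


def value_at(position):
    """Return the value at a 1-based position in the ordered data."""
    if position != int(position):
        raise ValueError("this sketch only handles whole-number positions")
    return ordered[int(position) - 1]


q1 = value_at((n + 1) / 4)      # position 6 when n = 23
q3 = value_at(3 * (n + 1) / 4)  # position 18 when n = 23
iqr = q3 - q1

print(f"Q1 = {q1} kg, Q3 = {q3} kg, IQR = {iqr} kg")
```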
Variability Around the Mean
Variation around the mean can be described as the difference (or distance) between the data point and the mean
$$x - \bar{x}$$
[Figure: data points for each Sample plotted around a horizontal Mean line; vertical axis from 30 to 80]
Variability Around the Mean
We cannot simply add up the differences between each number and the mean, because the sum of these differences will be zero - the positive differences will cancel out the negative differences

Number   Mean       Number - Mean
73       74.52609   -1.526086957
93       74.52609   18.47391304
68.5     74.52609   -6.026086957
101      74.52609   26.47391304
65.5     74.52609   -9.026086957
78.5     74.52609   3.973913043
83       74.52609   8.473913043
80       74.52609   5.473913043
80.5     74.52609   5.973913043
87       74.52609   12.47391304
73       74.52609   -1.526086957
75.6     74.52609   1.073913043
61       74.52609   -13.52608696
86.5     74.52609   11.97391304
61.5     74.52609   -13.02608696
65.5     74.52609   -9.026086957
39       74.52609   -35.52608696
98       74.52609   23.47391304
69.5     74.52609   -5.026086957
52.5     74.52609   -22.02608696
71.5     74.52609   -3.026086957
76       74.52609   1.473913043
74.5     74.52609   -0.026086957
Total               0
• If we square the differences then we will always get a positive number (or zero)
– the total of these squared differences is known as the sum of squares (SS)
– this can be represented by the following equation
$$SS = \sum (x - \bar{x})^2$$
– Where;
$\bar{x}$ represents the mean
$x$ represents each individual number

Number   Mean       Number - Mean   Difference Squared
73       74.52609   -1.526086957    2.328941
93       74.52609   18.47391304     341.2855
68.5     74.52609   -6.026086957    36.31372
101      74.52609   26.47391304     700.8681
65.5     74.52609   -9.026086957    81.47025
78.5     74.52609   3.973913043     15.79198
83       74.52609   8.473913043     71.8072
80       74.52609   5.473913043     29.96372
80.5     74.52609   5.973913043     35.68764
87       74.52609   12.47391304     155.5985
73       74.52609   -1.526086957    2.328941
75.6     74.52609   1.073913043     1.153289
61       74.52609   -13.52608696    182.955
86.5     74.52609   11.97391304     143.3746
61.5     74.52609   -13.02608696    169.6789
65.5     74.52609   -9.026086957    81.47025
39       74.52609   -35.52608696    1262.103
98       74.52609   23.47391304     551.0246
69.5     74.52609   -5.026086957    25.26155
52.5     74.52609   -22.02608696    485.1485
71.5     74.52609   -3.026086957    9.157202
76       74.52609   1.473913043     2.17242
74.5     74.52609   -0.026086957    0.000681
Total               0               4386.944
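Both columns of the table can be reproduced with a short Python sketch, assuming the same 23 weights. It shows the raw differences cancelling to (numerically) zero while the squared differences accumulate into the sum of squares.

```python
from math import fsum

# Example data set from the slides: 23 individual weights (kg).
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

sample_mean = fsum(weights) / len(weights)

# Difference between each number and the mean (x - x-bar).
deviations = [x - sample_mean for x in weights]

# Positive and negative differences cancel, up to floating-point rounding.
sum_of_deviations = fsum(deviations)

# Squaring removes the sign, so the total no longer cancels.
sum_of_squares = fsum(d * d for d in deviations)

print(f"sum of differences = {sum_of_deviations:.6f}")
print(f"sum of squares (SS) = {sum_of_squares:.3f}")  # the slide reports 4386.944
```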
Variability Around the Mean

• Although useful in some calculations, the sum of squares does not take into account the number of observations (is dependent on sample size).
• There are some important ways that the spread of the data around the mean can be represented (based on sum of squares).
– The Variance ($s^2$).
– The Standard Deviation ($s$).
– The Standard Error of the Sample Mean (S.E. or $s_{\bar{x}}$).
Variability - Sample Variance
• The Variance uses the Sum of Squares adjusted for the number of “independent” observations in the sample: an “average” variation
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
• We can use the Sums of Squares calculated in the previous slide:
$$s^2 = \frac{4386.944}{23 - 1} \approx 199.4\ \text{kg}^2$$
Notice that we are in squared units
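As a cross-check, the sketch below divides the sum of squares reported on the slide by n - 1 and compares the result with Python's standard-library `statistics.variance`, which also uses the n - 1 denominator. It is an illustration under the assumption that the 23 weights are the data behind the slide's figures.

```python
from statistics import variance

# Example data set from the slides: 23 individual weights (kg).
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

n = len(weights)
sum_of_squares = 4386.944  # value taken from the sum of squares slide

# Sample variance: SS divided by the n - 1 "independent" observations.
s2_from_slide = sum_of_squares / (n - 1)

# Library cross-check computed directly from the raw data.
s2_from_data = variance(weights)

print(f"s^2 = {s2_from_slide:.1f} kg^2 (from SS), {s2_from_data:.1f} kg^2 (from data)")
```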
Variability - Sample Standard Deviation
• The sample’s Standard Deviation is the square root of the Variance:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
$$s = \sqrt{199.4} \approx 14.12\ \text{kg}$$
Notice that we are now back in our original units
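A minimal Python sketch of the same step, assuming the 23 example weights. `statistics.stdev` is the standard-library sample standard deviation (n - 1 denominator), and the explicit square root is included to show that the two routes agree.

```python
from math import sqrt
from statistics import stdev, variance

# Example data set from the slides: 23 individual weights (kg).
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

s_from_variance = sqrt(variance(weights))  # square root of s^2
s_direct = stdev(weights)                  # same quantity in one call

print(f"s = {s_direct:.2f} kg")  # the slide reports 14.12 kg
```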
The Standard Error of the Sample Mean
• The Std. Dev. divided by the square root of n is called the Standard Error of the sample mean - we will encounter this measure later on in the course.
$$s_{\bar{x}} = \sqrt{\frac{s^2}{n}} = \frac{s}{\sqrt{n}}$$
$$= \sqrt{\frac{199.4}{23}} = \frac{14.12}{\sqrt{23}} \approx 2.94\ \text{kg}$$
Sample VS Population

Sample                        Population
$\bar{x}$ = sample mean       $\mu$ = population mean
$s$ = sample std dev          $\sigma$ = population std dev
$s^2$ = sample variance       $\sigma^2$ = population var.
n = sample size               N = population size

Sample Only
$s_{\bar{x}}$ = Standard error of the sample mean (S.E.)
Module 2 - Exploratory Data Analysis (EDA)

Graphical Methods

Text: Field, A. 2009 2nd edition
-Chapter 1: 1.7
-Chapter 2: 2.1 – 2.5
-Chapter 4: 4.1 – 4.9
Graphical Methods & SPSS
• Graphical methods are a good way of summarising information and are useful to visualise patterns within your data.
• Various methods can be used depending on the measurement scale of the variables.
• SPSS is the statistical package that you will be using this semester and has a similar spreadsheet format to Microsoft Excel.
• Generally, when entering data into SPSS, each column contains a different variable.
Graphs for Discrete Variables
• Measurement scale - nominal or ordinal
– Other terms - categorical, binned, class, qualitative
– Examples - gender, age group, trap type
• Common graphical methods are:
– Pie charts for proportions, percentages, or values that sum to a fixed value
– Bar charts for most other discrete variables
• Data can be entered into SPSS in two forms
– Each case (row) represents a single observation
– Each case (row) represents the count, percentage, or proportion of each level of the discrete variable
Data Entry for Discrete Variables
Data entry type 1: Can create charts directly using this type of data
Data entry type 2: First tell SPSS that each discrete level has been counted
An Example - Mass (%) of Each Element Within a Star
• The data is entered into SPSS as in data entry type 2
• You must then tell SPSS to weight each observation (case) by the variable “mass”
• You will need to do this for a pie chart and for a bar graph
Making a Pie Chart in SPSS
The Pie Chart
[Figure: pie chart with slices for Hydrogen, Helium and Other; cases weighted by MASS]
Making a Bar Chart in SPSS
Simple Bar Chart
One variable with three categories
[Figure: bar chart of Count (y-axis, 0 to 80) by Element (Hydrogen, Helium, Other); cases weighted by MASS]
Clustered Bar Chart
Two variables with two categories each
[Figure: clustered bar chart of Count (y-axis, 0 to 500) by Smoking Status (Smoker, Non Smoker), with bars clustered by Cancer Status (Cancer, No Cancer); cases weighted by freq]
Graphs for Continuous Variables
• Measurement scale - Scale
– Other terms - quantitative
– Examples - Length, Temperature, Species Richness
• Common graphical methods are:
– For a single sample - Histograms, Box and Whisker plots, Error Bar plots, Q-Q plots.
– For 2 or more samples - Clustered Box and Whisker plots, Clustered Error Bar plots.
– For 2 scale variables - Scatter plots.
An Example - Plant Heights
We will be using the following data set of plant heights (cm) to construct a histogram.

21     24.5   20     23.5   24.5
20     26     21     24     25
21.5   23.5   21     20     28
23     24.5   22.5   21     28
21     25     21.5   22     26
21.5   26.5   22.5   21.5   25
24     21.5   23     16.5   29
25.5   23     25     19     31
20.5   22.5   23     19     21.5
24     23.5   23     19.5   22.5
Histogram
To create a histogram by hand, we need to create a series of “bins” or categories.
– The data range from 16.5 to 31.0.
– We can use the following groups to classify the data.
You can see that the ‘bins’ have been organised so that each datum belongs to a unique group

Bin          Tally   Frequency
16 – 17.9
18 – 19.9
20 – 21.9
22 – 23.9
24 – 25.9
26 – 27.9
28 – 29.9
30 – 31.9
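The tallying step can be sketched in Python as below, assuming the 50 plant heights from the example slide and the eight 2 cm wide bins listed in the table. The names `bin_start`, `bin_width` and `counts` are chosen for this illustration; the text tally only imitates the marks you would make by hand.

```python
# Plant heights (cm) from the example slide, n = 50.
heights = [21, 24.5, 20, 23.5, 24.5, 20, 26, 21, 24, 25,
           21.5, 23.5, 21, 20, 28, 23, 24.5, 22.5, 21, 28,
           21, 25, 21.5, 22, 26, 21.5, 26.5, 22.5, 21.5, 25,
           24, 21.5, 23, 16.5, 29, 25.5, 23, 25, 19, 31,
           20.5, 22.5, 23, 19, 21.5, 24, 23.5, 23, 19.5, 22.5]

bin_start = 16  # lower edge of the first bin
bin_width = 2   # each bin covers 2 cm, e.g. 16 to 17.9
n_bins = 8      # last bin is 30 to 31.9

counts = [0] * n_bins
for height in heights:
    # Each height falls into exactly one bin.
    index = int((height - bin_start) // bin_width)
    counts[index] += 1

for i, count in enumerate(counts):
    low = bin_start + i * bin_width
    high = low + bin_width - 0.1
    print(f"{low} to {high:.1f}  {'|' * count:<16} {count}")
```

Changing `bin_width` to 1 and `n_bins` to 16 gives a finer histogram of the same data.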
Histogram
[Figure: blank axes for "Histogram of Plant height"; y-axis: Frequency, 0 to 16; x-axis: Height Categories (or Bins), 16 – 17.9 through 30 – 31.9]
Histogram
Here’s One We Prepared Earlier
[Figure: completed "Histogram of Plant height"; y-axis: Frequency, 0 to 16; x-axis: Height Categories (or Bins)]
Histogram Using SPSS
• SPSS will create the bins, work out frequencies and create the histogram for you
• The data needs to be entered in a single column
Histogram
Using SPSS
Histogram Using SPSS
Single sample, variable height (8 bins)
[Figure: SPSS histogram of height with 8 bins; x-axis: height, 16 to 32; y-axis: Frequency; Mean = 23.03, Std. Dev. = 2.7412, N = 50]

Histogram Using SPSS
Single sample, variable height (16 bins)
[Figure: SPSS histogram of height with 16 bins; x-axis: height, 16 to 32; y-axis: Frequency; Mean = 23.03, Std. Dev. = 2.7412, N = 50]
Q-Q Plot
• For a single sample
• Plots the quantiles of a variable's distribution (observed - unknown distribution) against the quantiles of a test distribution (expected - e.g. Normal Dist.).
• The test distribution (expected values) has the same mean and standard deviation as the observed data.
• Available test distributions include Beta, Chi-square, Exponential, Gamma, Logistic, Lognormal, Normal, Student’s t, and Uniform.
Q-Q Plot

• Probability plots are generally used to determine whether the distribution of a variable (observed - unknown distribution) matches a given distribution (expected - e.g. Normal Dist.).
• If the selected variable matches the test distribution, the points line up on a 45° line (observed = expected).
• Note, if using a sample from a population the sample size needs to be reasonably large.
• An alternative is the P-P plot (percentile plot)
Q-Q Plot
[Figure: Normal Q-Q Plot of HEIGHT. x-axis: Observed Value, 16 to 32 (observed quantiles from our sample of plant heights); y-axis: Expected Normal Value, 16 to 30 (expected quantiles for a normal distribution with the same mean and standard deviation as the observed distribution)]
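The pairs of points behind a normal Q-Q plot can be sketched in Python with the standard-library `statistics.NormalDist`. This is a rough illustration and not the SPSS procedure: the plotting positions `(i - 0.5) / n` used below are one common convention and an assumption made for this example, so SPSS output may differ slightly because it can use a different formula.

```python
from statistics import NormalDist, mean, stdev

# Plant heights (cm) from the example slide, n = 50.
heights = [21, 24.5, 20, 23.5, 24.5, 20, 26, 21, 24, 25,
           21.5, 23.5, 21, 20, 28, 23, 24.5, 22.5, 21, 28,
           21, 25, 21.5, 22, 26, 21.5, 26.5, 22.5, 21.5, 25,
           24, 21.5, 23, 16.5, 29, 25.5, 23, 25, 19, 31,
           20.5, 22.5, 23, 19, 21.5, 24, 23.5, 23, 19.5, 22.5]

observed = sorted(heights)  # observed quantiles
n = len(observed)

# Test distribution: normal with the same mean and standard deviation as the data.
reference = NormalDist(mu=mean(observed), sigma=stdev(observed))

# Expected quantiles at the assumed plotting positions (i - 0.5) / n.
expected = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Each (observed, expected) pair is one point on the Q-Q plot.
pairs = list(zip(observed, expected))
for obs, exp in pairs[:3] + pairs[-3:]:
    print(f"observed {obs:5.1f}   expected {exp:6.2f}")
```

If the heights were close to normally distributed, the observed and expected values in each pair would be nearly equal, which is the 45° line described on the slide.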
Box and Whisker Plots
• The Box includes
– The Median
– Q1 and Q3 as the edges of the box

• The Whiskers
– either (method 1) – “5 number summary”
• Max and the Min are the ends of the whiskers
– or (method 2) – default method used in SPSS
• Q3 + 1.5 × IQR and Q1 - 1.5 × IQR are the ends of the whiskers
• Q3 + 3.0 × IQR and Q1 - 3.0 × IQR border between outliers and extreme outliers
• symbols used for outliers (O) and extreme outliers (*)
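Box and Whisker Plot
Method 1 - 5 Number Summary
This type of Box and Whisker Plot is the simplest.
It is based on a five number summary: Max, Q3, Q2, Q1, Min
[Figure: box and whisker plot labelled, from top to bottom, Max, Q3, Q2 (Median), Q1, Min; the box spans the IQR and the whiskers span the Range]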
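Box and Whisker Plot
Method 2 - SPSS (Boxplot)
[Figure: SPSS boxplot labelled, from top to bottom: Extreme Outlier (*) beyond Q3 + 3 × IQR; Outliers (o) between Q3 + 1.5 × IQR and Q3 + 3 × IQR; upper whisker at Q3 + 1.5 × IQR (or max); Q3; Q2 (Median); Q1; lower whisker at Q1 - 1.5 × IQR (or min); Outlier (o) between Q1 - 1.5 × IQR and Q1 - 3 × IQR]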
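The Method 2 boundaries can be applied to the weights data set from the first half of the module in a short Python sketch. It assumes Q1 = 65.5 kg and Q3 = 83.0 kg as calculated on the interquartile range slides; the function name `classify` and the treatment of values lying exactly on a boundary are choices made for this illustration, not something the slides specify.

```python
# Example data set from the slides: 23 individual weights (kg).
weights = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
           61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

q1, q3 = 65.5, 83.0  # quartiles taken from the interquartile range slides
iqr = q3 - q1

# Whisker limits: Q1 - 1.5 x IQR and Q3 + 1.5 x IQR.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
# Border between outliers and extreme outliers: Q1 - 3 x IQR and Q3 + 3 x IQR.
outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr


def classify(value):
    """Label a value as 'extreme outlier', 'outlier' or 'inside'."""
    if value < outer_low or value > outer_high:
        return "extreme outlier"
    if value < inner_low or value > inner_high:
        return "outlier"
    return "inside"


outliers = [x for x in weights if classify(x) == "outlier"]
extremes = [x for x in weights if classify(x) == "extreme outlier"]

print(f"whisker limits: {inner_low} to {inner_high}")
print(f"outliers (o): {outliers}, extreme outliers (*): {extremes}")
```

With these limits the 39 kg weight falls just below the lower whisker limit, so it would be drawn as an outlier (o) and the lower whisker would stop at the smallest remaining value.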
Making a Boxplot in SPSS
SPSS Clustered Boxplot
Several samples
Note: Outlier present in second site (sample)
[Figure: clustered boxplot of GALLS (y-axis, -10 to 70) by SITES (x-axis, 1 to 5), N = 8 per site; one point, labelled 15, is flagged as an outlier]

Error Bar Plot

The Error Bar plot is used to represent
• The mean
• Plus a measure of variation around the mean
– Confidence Interval of the Sample Mean
– The Standard Error of the Sample Mean
– The Standard Deviation of the sample
• The most common form of the Error Bar Plot
– Is the Standard Error Plot
– Mean ± 1 Standard Error of the Sample Mean
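The limits of a Mean ± 1 S.E. error bar can be computed with a short Python sketch. It uses the 50 plant heights from the histogram example as an assumed input; `multiplier` is a name introduced here to mirror the multiplier setting mentioned for SPSS.

```python
from math import sqrt
from statistics import mean, stdev

# Plant heights (cm) from the histogram example, n = 50.
heights = [21, 24.5, 20, 23.5, 24.5, 20, 26, 21, 24, 25,
           21.5, 23.5, 21, 20, 28, 23, 24.5, 22.5, 21, 28,
           21, 25, 21.5, 22, 26, 21.5, 26.5, 22.5, 21.5, 25,
           24, 21.5, 23, 16.5, 29, 25.5, 23, 25, 19, 31,
           20.5, 22.5, 23, 19, 21.5, 24, 23.5, 23, 19.5, 22.5]

n = len(heights)
centre = mean(heights)
standard_error = stdev(heights) / sqrt(n)  # s / sqrt(n)

multiplier = 1  # Mean +/- 1 S.E.
lower = centre - multiplier * standard_error
upper = centre + multiplier * standard_error

print(f"mean = {centre:.2f} cm, S.E. = {standard_error:.3f} cm")
print(f"error bar from {lower:.2f} to {upper:.2f} cm")
```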
Error Bar Plot in SPSS
Make sure you select the correct measure of variability
The default multiplier is 2 so make sure that you always change it to 1
SPSS Clustered Error Bar Plot
Several samples
Note: Mean ± 1 S.E.
[Figure: clustered error bar plot of Mean +- 1 SE GALLS (y-axis, 0 to 40) by SITES (x-axis, 1 to 5), N = 8 per site]
Scatter Plot
Two scale variables
[Figure: two scatter plots of Oxygen Concentration (y-axis, 2.00 to 5.00) against Temperature (x-axis, -20.00 to 20.00); the second plot adds a line of best fit or linear regression model, R Sq Linear = 0.979]


Scatter Plot
Three scale variables
[Figure: three-dimensional scatter plot with Turbidity on the vertical axis (10.0 to 20.0); the two horizontal axes run from 20.0 to 40.0 and from 4.0 to 11.0, and their labels are not recoverable]
