L2-More On Describing Data
Example (trimmed mean): 12 14 19 20 22 24 25 26 26 50
Trimming the smallest and largest values leaves 14 19 20 22 24 25 26 26,
so the trimmed mean is x̄T = 176 / 8 = 22.
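A trimmed mean like the one above can be computed directly; a minimal sketch (trimming one value from each end is assumed here, via the `k` parameter):

```python
def trimmed_mean(data, k=1):
    """Mean after removing the k smallest and k largest observations."""
    trimmed = sorted(data)[k:len(data) - k]
    return sum(trimmed) / len(trimmed)

data = [12, 14, 19, 20, 22, 24, 25, 26, 26, 50]
print(trimmed_mean(data))  # 22.0
```

Trimming makes the mean less sensitive to extreme values such as the 50 in this data set.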
Why is the study of variability important?
Variability (or dispersion) measures the amount of scatter in a dataset.
(Does this can of cola contain exactly 330 ml?)
There is variability in virtually everything.
It allows us to distinguish between usual and unusual values.
Reporting only a measure of centre doesn’t provide a complete picture of
the distribution.
Types of Variation
Systematic Variation
Differences in performance created by a specific experimental
manipulation.
Unsystematic Variation
Differences in performance created by unknown factors.
Age, gender, IQ, time of day, measurement error, etc.
Randomization
Minimizes unsystematic variation.
[Three dot plots, each on a scale from 20 to 70.]
Notice that these three data sets all have the same mean and median
(at 45), but they have very different amounts of variability.
Methods of Variability/Dispersion
Measurement
Commonly used methods:
Range, variance, standard deviation, interquartile range,
coefficient of variation, etc.
Measures of Variability
The simplest numeric measure of variability is the range.
It’s a crude measure of variability.
Range = largest observation – smallest observation
[The same three dot plots on the 20 to 70 scale.]
The first two data sets have a range of 50 (70 − 20), but the third
data set has a much smaller range of 10.
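The range calculation can be sketched in code (the example values here are hypothetical, chosen to match the slide's 20 to 70 scale):

```python
def data_range(data):
    """Range = largest observation - smallest observation."""
    return max(data) - min(data)

a = [20, 30, 40, 50, 60, 70]  # hypothetical data spanning the full scale
print(data_range(a))  # 50
```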
Measures of Variability
The deviations from the mean always sum to 0, so their total cannot
measure spread.
Measures of Variability
The square root of the variance is called the standard deviation.
The standard deviation represents a typical deviation from the
mean.
IQR = Q3 − Q1
21 27 26 19 30 35 35 26 47 26 27 30 24 29 22 24 29 20 20
27 35 38 25 31 19 24 27 27 23 34 25 32 26 24 22 28 26
30 23 25 22 25 29 33 34 30 17 25 23 34 26
Find the interquartile range for this set of data.
Sorted (n = 51):
17 19 19 20 20 21 22 22 22 23 23 23 24 24 24 24 25 25 25 25 25 26 26
26 26 26 26 27 27 27 27 27 28 29 29 29 30 30 30 30 31 32 33 34 34 34
35 35 35 38 47
The median is the 26th value (26). Q1 is the median of the lower 25
values (the 13th value, 24) and Q3 is the median of the upper 25 values
(the 39th value overall, 30), so IQR = 30 − 24 = 6.
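The same tally can be reproduced in code. A minimal sketch using the median-of-halves quartile rule (the median itself is excluded from both halves when n is odd); note that statistical packages use several different quartile conventions and may return a slightly different IQR:

```python
def median(xs):
    """Median of a list of numbers."""
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def iqr(data):
    """IQR = Q3 - Q1, with quartiles as medians of the lower/upper halves."""
    xs = sorted(data)
    n = len(xs)
    lower = xs[: n // 2]
    upper = xs[(n + 1) // 2 :]
    return median(upper) - median(lower)

data = [21, 27, 26, 19, 30, 35, 35, 26, 47, 26, 27, 30, 24, 29, 22, 24, 29, 20, 20,
        27, 35, 38, 25, 31, 19, 24, 27, 27, 23, 34, 25, 32, 26, 24, 22, 28, 26,
        30, 23, 25, 22, 25, 29, 33, 34, 30, 17, 25, 23, 34, 26]
print(iqr(data))  # 6
```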
KEY SLIDE
The Research Process
Analysing Data
First step: Graph the data
Frequency Distributions (aka Histograms)
A graph plotting values of observations on the horizontal axis,
with a bar showing how many times each value occurred in the
data set.
Ideal: The ‘Normal’ Distribution
Bell-shaped
Symmetrical around the centre
The Normal Distribution
Skew
Properties of Frequency Distributions
Skew
The symmetry of the distribution.
Positive skew (scores bunched at low values with the tail
pointing to high values).
Negative skew (scores bunched at high values with the tail
pointing to low values).
Kurtosis
The ‘heaviness’ of the tails.
Leptokurtic = heavy tails (more scores in the tails).
Platykurtic = light tails (more scores in the middle).
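Skew and kurtosis can be estimated from data as the standardized third and fourth moments; a minimal sketch in plain Python ("excess" kurtosis is used here, so a normal distribution scores 0):

```python
def skewness(xs):
    """Standardized third moment: positive for a right (high-value) tail."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(xs):
    """Standardized fourth moment minus 3: >0 leptokurtic, <0 platykurtic."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3

right_skewed = [1, 1, 2, 2, 2, 3, 3, 4, 8, 12]  # scores bunched at low values
print(skewness(right_skewed) > 0)  # True: positive skew
```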
Kurtosis
Going beyond the data
Frequency Distribution
Not only useful for descriptive purposes
Can be used to calculate likelihood of particular values
occurring – probability
For any distribution we could calculate the probability of
achieving any of the possible values
Tedious, time consuming
Statisticians have created a range of idealized distributions
(probability distributions); if our data distribution matches one of
them, we can calculate the likelihood of achieving particular values
from it
Probability
Chance behaviour is unpredictable in the short term, but has a
regular and predictable pattern in the long term.
The probability of any outcome of a random phenomenon is the
proportion of times the outcome would occur in a very long series
of repetitions.
Sample Space
The set of all possible outcomes of a random phenomenon
Event
Any set of outcomes of interest
Probability of an event
The relative frequency of this set of outcomes over an infinite number of
trials
P(A) is the probability of event A
Probability Distributions
X represents the random variable X.
P(X) represents the probability of X.
P(X = x) refers to the probability that the discrete random
variable X is equal to a particular value, denoted by x.
As an example, P(X = 1) refers to the probability that the
random variable X is equal to 1.
Cumulative probability is the probability that a value falls
within a particular range or interval:
P(X ≤ x)
Probability Distributions
The probability distribution for a random variable X gives
the possible values for X, and the probabilities associated
with each possible value (i.e., the likelihood that the
values will occur)
Has a probability assigned to each distinct value of the variable
A cumulative probability refers to the probability that
the value of a random variable falls within a specified
range.
The methods used to specify discrete probability
distributions are similar to (but slightly different from)
those used to specify continuous probability distributions
Discrete Random Variable
Has a probability assigned to each distinct value of the
variable
The sum of all assigned probabilities must be 1
Probability distribution can be considered a relative-
frequency distribution and therefore has a mean and standard
deviation
Mean is often called the expected value
Represents a cluster point for the entire distribution
Need not be an actual value of a point of the sample space
Standard deviation can be interpreted as a measure of risk
The larger the standard deviation, the more likely it is that the random
variable X is far from the expected value
Discrete Probability Distribution
Shows us the complete space on which the distribution is
based
The corresponding probability of each event in the sample
space
Probability Distributions
Suppose you flip a coin two times.
This simple statistical experiment can have four possible
outcomes:
HH (two heads), HT (heads then tails), TH (tails then heads),
and TT (two tails).
Let the variable X represent the number of Heads that
result from this experiment.
X can take on the values 0, 1, or 2.
In this example, X is a random variable because its value
is determined by the outcome of a statistical experiment.
Probability Distribution
A probability distribution is a table or an equation that
links each outcome of a statistical experiment with its
probability of occurrence.
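The coin-flip example above can be tabulated by enumerating the sample space; a minimal sketch:

```python
from itertools import product
from collections import Counter

# Enumerate the sample space of two coin flips and tabulate
# P(X = x), where X = number of heads in the outcome.
outcomes = list(product("HT", repeat=2))  # HH, HT, TH, TT
counts = Counter(o.count("H") for o in outcomes)
dist = {x: counts[x] / len(outcomes) for x in sorted(counts)}
print(dist)  # {0: 0.25, 1: 0.5, 2: 0.25}
```

The probabilities sum to 1, as they must for any discrete probability distribution.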
For a sample:
z = (X − X̄) / s
z-scores transform our original IQ scores into scores with a mean of
0 and an SD of 1.
Raw IQ scores (mean = 100, SD = 15)
z for 100 = (100 − 100) / 15 = 0, z for 115 = (115 − 100) / 15 = 1,
z for 70 = (70 − 100) / 15 = −2, etc.
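The IQ conversions can be sketched as a one-line function:

```python
def z_score(x, mean, sd):
    """Standardize a raw score: how many SDs it lies from the mean."""
    return (x - mean) / sd

# Raw IQ scores (mean = 100, SD = 15)
print(z_score(100, 100, 15))  # 0.0
print(z_score(115, 100, 15))  # 1.0
print(z_score(70, 100, 15))   # -2.0
```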
Example: a score of X = 65 on a scale with X̄ = 50, s = 10 gives
z = (65 − 50) / 10 = 1.5; a score of X = 236 on a scale with X̄ = 200,
s = 24 gives z = (236 − 200) / 24 = 1.5. Both scores are equally
extreme relative to their own distributions.
Step 2: Convert 89 into a z-score:
z = (89 − 92) / 6 = −3 / 6 = −0.5
Step 3: Use the table to find the “area beyond z” for our z of −0.5
(by symmetry, the same as for z = +0.5):
Area beyond z = 0.3085
z-score value   Area between the mean and z   Area beyond z
0.44            0.1700                        0.3300
0.45            0.1736                        0.3264
0.46            0.1772                        0.3228
0.47            0.1808                        0.3192
0.48            0.1844                        0.3156
0.49            0.1879                        0.3121
0.50            0.1915                        0.3085
0.51            0.1950                        0.3050
0.52            0.1985                        0.3015
0.53            0.2019                        0.2981
0.54            0.2054                        0.2946
0.55            0.2088                        0.2912
0.56            0.2123                        0.2877
0.57            0.2157                        0.2843
0.58            0.2190                        0.2810
0.59            0.2224                        0.2776
0.60            0.2257                        0.2743
0.61            0.2291                        0.2709
Conclusion: 0.31 (31%) of people without brain damage are likely to
have a comprehension score this low or lower.
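The "area beyond z" column of such a table can be computed from the standard normal CDF; a sketch using Python's error function:

```python
import math

def area_beyond_z(z):
    """P(Z > z) for the standard normal, via the error function."""
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return 1 - cdf

# By symmetry, the area below z = -0.5 equals the area above z = +0.5.
print(round(area_beyond_z(0.5), 4))  # 0.3085
```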
The Normal Distribution
If we know µ and σ, we
can derive a lot of additional
information about the
data with a normal
distribution.
Normal Distribution
The Empirical Rule - The 68-95-99.7 Rule
In the normal distribution with mean µ and standard
deviation σ:
68% of the observations fall within σ of the mean µ.
95% of the observations fall within 2σ of the mean µ.
99.7% of the observations fall within 3σ of the mean µ.
The Normal Distribution
If a variable is normally distributed, then:
within one standard deviation of the mean there will be
approximately 68% of the data
within two standard deviations of the mean there will be
approximately 95% of the data
within three standard deviations of the mean there will be
approximately 99.7% of the data
Properties of z-scores
1.96 cuts off the top 2.5% of the distribution.
−1.96 cuts off the bottom 2.5% of the distribution.
As such, 95% of z-scores lie between −1.96 and 1.96.
99% of z-scores lie between −2.58 and 2.58,
99.9% of them lie between −3.29 and 3.29.
Normal Distribution in summary
Many psychological/biological properties are normally
distributed.
This is very important for statistical inference
(extrapolating from samples to populations)
z-scores provide a way of
(a) comparing scores on different raw-score scales;
(b) showing how a given score stands in relation to the overall
set of scores;
(c) using probability tables to calculate the likelihood of particular
scores.
Normal distribution in summary
The logic of z-scores underlies many statistical tests:
1. Scores are normally distributed around their mean.
2. Sample means are normally distributed around the population
mean.
3. Differences between sample means are normally distributed
around zero ("no difference").
We can exploit these phenomena in devising tests to help
us decide whether or not an observed difference between
sample means is due to chance.
Distribution is central to choosing the
correct test
Parametric Tests
Normal distribution
Non-parametric Tests
Non normal distribution
Always start by looking at the data!
The Research Process
Populations and Samples
Population
The collection of units (be they people, plankton, plants, cities,
suicidal authors, etc.) to which we want to generalize a set of
findings or a statistical model
Sample
A smaller (but hopefully representative) collection of units from
a population used to determine truths about that population
The Only Equation You Will Ever Need
deviation = xᵢ − x̄
Use the Total Error?
We could just take the error between the mean and the
data and add them.
Score   Mean   Deviation
1       2.6    −1.6
2       2.6    −0.6
3       2.6     0.4
3       2.6     0.4
4       2.6     1.4
Total = 0
Σ(X − X̄) = 0
Sum of Squared Errors
We could add the deviations to find out the total error.
Deviations cancel out because some are positive and
others negative.
Therefore, we square each deviation.
If we add these squared deviations we get the sum of
squared errors (SS).
Score   Mean   Deviation   Squared Deviation
1       2.6    −1.6        2.56
2       2.6    −0.6        0.36
3       2.6     0.4        0.16
3       2.6     0.4        0.16
4       2.6     1.4        1.96
Total                      5.20
SS = Σ(X − X̄)² = 5.20
Variance
The sum of squares is a good measure of overall
variability, but is dependent on the number of scores.
We calculate the average variability by dividing by the
number of scores (n).
This value is called the variance (s2).
Standard Deviation
The variance has one problem: it is measured in units
squared.
This isn’t a very meaningful metric so we take the square
root value (measured in units).
This is the standard deviation (s).
s = √( Σᵢ₌₁ⁿ (xᵢ − x̄)² / n ) = √(5.20 / 5) ≈ 1.02
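The worked example (scores 1, 2, 3, 3, 4) can be reproduced in code. Note that, as on the slide, the variance divides by n; many software packages instead default to the sample formula, which divides by n − 1:

```python
import math

scores = [1, 2, 3, 3, 4]
mean = sum(scores) / len(scores)           # 2.6
ss = sum((x - mean) ** 2 for x in scores)  # sum of squared errors = 5.20
variance = ss / len(scores)                # dividing by n, as on the slide
sd = math.sqrt(variance)                   # standard deviation, approx. 1.02
print(round(ss, 2), round(sd, 2))  # 5.2 1.02
```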
Same Mean, Different SD
The SD and the Shape of a Distribution
So what is the mean a model of?
We have used it to model a summary of a set of data
The standard deviation in this case represents how good a
‘fit’ that model is to the set of data
So we are assessing the fit of the model by comparing the data
we have to the model we’ve ‘fitted’ to the data
This is a fundamental idea within the linear statistical model
Important Things to Remember
The sum of squares, variance, and standard deviation
represent the same thing:
The ‘fit’ of the mean to the data
The variability in the data
How well the mean represents the observed data
Error
Samples vs. Populations
Sample
Mean and SD describe only the sample from which they were
calculated.
Population
Mean and SD are intended to describe the entire population
(very rare in most studies).
Sample to Population:
Mean and SD are obtained from a sample, but are used to
estimate the mean and SD of the population (very common).
Going beyond the data
We now know how to fit a simple model to our data
But usually we want to move beyond our data to the wider
world the data represents and say something about the
world
Based on our sample
So we need to look at whether this model is a good fit for
the population from which our sample came
Going beyond the data
We ideally want to collect data from all members of the
population
We usually collect a number of samples
Each sample could have a different mean - sampling variation
We can plot the sample means as a frequency
distribution
The sampling distribution
Going beyond the data
So what?
If we have enough samples we can calculate the population
mean
But how well does it fit ?
Need to calculate the standard deviation of the sample
means
Standard error of the mean (SE)
Sampling Variation
[Figure: samples drawn from a population give differing sample means,
e.g. X̄ = 25, 33, 30 and 29 around a population mean of 30. In a second
example, nine samples from a population with mean 10 give sample means
between 8 and 12; plotted as a frequency distribution (frequency
against sample mean, 6 to 14), these have Mean = 10 and SD = 1.22.]
Going beyond the data
In reality we can’t collect enough samples
Instead we rely on an approximation of the sampling distribution
of the mean and its standard error
Based on the Central Limit Theorem
As samples get large, the sampling distribution has a normal
distribution with a sample mean equal to the population mean
and a standard deviation of
σX̄ = s / √N
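The standard error can be sketched as follows (the sample values are hypothetical, and the sample SD here uses the common n − 1 divisor, which is a convention choice):

```python
import math

def standard_error(sample):
    """SE = s / sqrt(N), using the sample SD (n - 1 divisor)."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return s / math.sqrt(n)

sample = [8, 9, 9, 10, 10, 10, 11, 11, 12]  # hypothetical observations
print(round(standard_error(sample), 3))  # 0.408
```

Larger samples shrink the SE, so the sample mean becomes a more precise estimate of the population mean.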
So what does this mean?
We can use the standard deviation of the sampling
distribution as an approximation of the standard error
If our distribution follows the normal distribution
For other shapes of distribution we have other ways of
approximating the population mean and standard error.
Confidence Intervals
CI represents a range of values between which we
think a population value will fall
Suppose we are looking at our fish in Lough Mask
True mean
15 thousand fish
Sample mean
17 thousand fish
Interval estimate
12 to 22 thousand (contains the true value)
16 to 18 thousand (misses the true value)
CIs constructed such that 95% contain the true value.
[Three plots of interval estimates of Fish Numbers (thousands).]
How to construct a CI?
Typically look at 95% CI but can also look at 99%
What does this mean?
If we say the CI is 95%, then if we collected 100 samples and
calculated the mean for each,
a 95% CI means we are confident that 95 of the resulting intervals
would contain the true mean
How to calculate?
Need to know the limits within which 95% of the means fall
Go back to the normal distribution: 95% of scores fall between
−1.96 and +1.96 standard deviations of the mean
Once we know the mean and standard deviation we can
calculate any score and therefore the CI
CI
Lower boundary = X̄ − (1.96 × SE)
Upper boundary = X̄ + (1.96 × SE)
X̄ = sample mean (our estimate of the population mean)
SE = standard error of the mean
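The boundary formulas can be sketched in code (the mean and SE below are hypothetical values loosely based on the fish example):

```python
def confidence_interval_95(mean, se):
    """95% CI assuming normality: mean +/- 1.96 * SE."""
    return mean - 1.96 * se, mean + 1.96 * se

# Hypothetical: sample mean 17 thousand fish, SE 2.5 thousand
low, high = confidence_interval_95(17, 2.5)
print(round(low, 1), round(high, 1))  # 12.1 21.9
```

For a 99% CI, replace 1.96 with 2.58, the z-score cutting off the extreme 1% of the distribution.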
[Bar chart: Number of Obsessive Thoughts/Actions (vertical axis, 0 to
18) by Therapy (CBT, BT, No Treatment), with Thoughts and Actions
plotted separately; error bars show the 95% CI.]
Therapy
Be Careful with your scales
KEY SLIDE
Stem-and-Leaf plot
Shows data arranged by place value.
You can use a stem-and-leaf plot when you want to
display data in an organized way that allows you to see
each value.
Use for small to moderate sized data sets.
Doesn’t work well for large data sets.
Accompany it with a comment on the centre, spread, and
shape of the distribution, and on any unusual
features.
Creating Stem-and-Leaf Plots
Test Scores
75 86 83 91 94
88 84 99 79 86
7 | 5 9
8 | 3 4 6 6 8
9 | 1 4 9
Key: 4 | 0 means 40
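Building the plot can be sketched in code (assuming two-digit values, so the stem is the tens digit and the leaf is the units digit):

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Build stem-and-leaf rows (stem = tens digit, leaf = units digit)."""
    stems = defaultdict(list)
    for value in sorted(data):
        stems[value // 10].append(value % 10)
    return ["{} | {}".format(stem, " ".join(map(str, stems[stem])))
            for stem in sorted(stems)]

scores = [75, 86, 83, 91, 94, 88, 84, 99, 79, 86]
print("\n".join(stem_and_leaf(scores)))
# 7 | 5 9
# 8 | 3 4 6 6 8
# 9 | 1 4 9
```

Because every leaf is kept, the plot shows the shape of the distribution without losing any individual value.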
Reading Stem-and-Leaf Plot
12 2 4 6 6 7 8 7 8 11 8 3 5 6 7 10 1
9 7 6 9 7 5 4 7 4 6 7 8 10
0 0 1 2 3 4 5 6 7 8 9 10 11 12