L2-More On Describing Data

Here are the steps to find the variance of the sample of fish lengths: 1) The sample lengths are: 3", 4", 5", 6", 8", 10" 2) The mean (x̄) is 6" 3) Calculate the deviations (x - x̄): 3" - 6" = -3, 4" - 6" = -2, 5" - 6" = -1, 6" - 6" = 0, 8" - 6" = 2, 10" - 6" = 4 4) Square the deviations: (-3)^2 = 9, (-2)^2 = 4, (-1)^2 = 1, 0^2 = 0, 2^2 = 4, 4^2 = 16


Probability and Statistical Inference

Describing and Presenting Data

Sources used in creation of this lecture:


Statistics and Data Analysis, Peck, Olsen and Devore; Discovering Statistics Using IBM SPSS
(equivalent R book) , Andy Field; Understanding Basic Statistics, Brase and Brase
Basics of Statistics
 Science of collection, presentation, analysis, and
reasonable interpretation of data.
 Statistics provides a rigorous scientific method for gaining
insight into data.
Experiments and Variables
 For our purposes a statistical experiment or observation is
any process through which measurements are obtained
 Common to use the letter x to represent the quantitative
results of an experiment or observation
 X is a variable, x is the value of the variable
Random Variables
 A variable is considered a random variable if the value it
takes on in a given experiment or observation is
determined by chance
 A variable whose realization is determined by chance
 A discrete random variable
 May only take on a finite number of values or countable
number of values
 Result of a count
 A continuous random variable
 May take on any value in an interval of the number line
 Measured on a continuous scale
Variables
 Some things we can measure directly
 Others we measure indirectly
 There will sometimes be a difference between the numbers we
use to represent a thing we are measuring and the actual value
of the thing (if we were measuring it directly)
 Measurement error
 E.g. psychological tests are approximate measures
Variables
 May be
 Things we can manipulate
 Compute
 Or control for
Study Design
 A careful advance plan of data collection and the analytic
approach is needed to answer the question under
investigation in a scientific way.
 The basic elements of a study design
 Selecting an appropriate sample size for a specified level of
power and level of significance
 Select appropriate measures
 Selecting methods of sampling, data collection, and analysis
appropriate to the study's objectives
Guidelines for Presenting Descriptive
Statistics
 Ensure that you are using the most appropriate way of
summarizing and presenting your data
 Be as efficient as possible when presenting your findings.
 All charts and tables should, as far as possible, be self-
explanatory.
 Use appropriate visualisation for the variables of interest
 Be consistent in the way you present your findings.
 Ensure that your data are not presented in a way that may
be misleading and/or confusing.
What do I need to describe for numerical
data?
 Centre
 Discuss where the middle of the data falls
 Measures of central tendency
 mean, median and mode
 Spread
 Discuss how spread out the data is
 Refers to the variability in the data
 Range, standard deviation, IQR
 Shape
 Refers to the overall shape of the distribution
 Symmetrical, uniform, skewed, or bimodal
What do I need to describe for numerical
data?
 Unusual Occurrences
 Outliers (value that lies away from the rest of the data)
 Gaps
 Clusters
 Context
 You must write your answer in reference to the context in the
problem, using correct statistical vocabulary and using
complete sentences.
Methods of centre Measurement
 Mean
 Summing up all the observations and dividing by the number of
observations.
 Median
 The middle value in an ordered sequence of observations. That
is, to find the median we need to order the data set and then
find the middle value.
 In case of an even number of observations the average of the
two middle most values is the median.
 Mode
 The value that is observed most frequently.
 Undefined for sequences in which no observation is repeated.
Mean or Median
 The median is less sensitive to outliers (extreme scores)
than the mean and thus a better measure than the mean
for highly skewed distributions, e.g. family income.
 Example
 Mean of 20, 30, 40, and 990 is (20+30+40+990)/4 =270.
 Median of these four observations is (30+40)/2 =35.
 3 observations out of 4 lie between 20-40.
 Mean of 270 really fails to give a realistic picture of the
major part of the data.
It is influenced by extreme value 990.
 Median is more reflective of the data.
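The comparison above can be checked with a short sketch. It uses Python's standard `statistics` module; the variable names are illustrative and not part of the lecture.

```python
from statistics import mean, median

# The four observations from the example above
observations = [20, 30, 40, 990]

mean_value = mean(observations)      # (20 + 30 + 40 + 990) / 4
median_value = median(observations)  # average of the two middle values, 30 and 40

print("mean:", mean_value)
print("median:", median_value)
```

The single extreme value 990 moves the mean far away from the bulk of the data, while the median stays among the three typical values.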
Population characteristic
 Fixed value about a population
 Typically unknown
Suppose we want to know the MEAN length of all the fish in Lough Mask . . .
Is this a value that is known?
Can we find it out?
At any given point in time, how many values are there for the mean length of fish in the lake?
Statistic
 Value calculated from a sample
Suppose we want to know the MEAN length of all the fish in Lough Mask.
What can we do to estimate this unknown population characteristic?
Review
 In a symmetrical distribution, the mean and median are
equal.
 In a symmetrical distribution, you should report the mean
 In a skewed distribution, the mean is pulled in the
direction of the skewness.
 In a skewed distribution, you should report the median
Trimmed mean:
 Purpose is to remove outliers from a data set
 To calculate a trimmed mean:
 Multiply the percent to trim by n (number in the sample)
 Truncate that many observations from BOTH ends of the
distribution (when listed in order)
 Calculate the mean with the shortened data set
 Not often used for large datasets
 Example Olympic Diving/Gymnastics scoring
 Used to eliminate extreme scores/bias from judges
Find the mean of the following set of data.

12 14 19 20 22 24 25 26 26 50

Find a 10% trimmed mean.

Mean = 23.8
10%(10) = 1
So remove one observation from each side!

$\bar{x}_T = \frac{14 + 19 + 20 + 22 + 24 + 25 + 26 + 26}{8} = 22$
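The trimming steps can be sketched in a few lines of Python. The helper name `trimmed_mean` is hypothetical, and truncating the trim count with `int()` is an assumption about how fractional counts are handled, since the slides only show a case where the count is a whole number.

```python
def trimmed_mean(data, proportion):
    """Mean after dropping int(proportion * n) values from EACH end."""
    ordered = sorted(data)
    k = int(proportion * len(ordered))  # observations to drop per side
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [12, 14, 19, 20, 22, 24, 25, 26, 26, 50]

full_mean = sum(data) / len(data)   # ordinary mean of all 10 values
trimmed = trimmed_mean(data, 0.10)  # drops 12 and 50, averages the other 8

print("mean:", full_mean)
print("10% trimmed mean:", trimmed)
```

Dropping the value 50 is what pulls the result down from the ordinary mean.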
Why is the study of variability
important?
 Variability (or dispersion) measures the amount of scatter in a dataset.
 There is variability in virtually everything
 Does this can of cola contain exactly 330 ml?
 Allows us to distinguish
between usual and unusual
values
 Reporting only a measure of
centre doesn’t provide a
complete picture of the
distribution.
Types of Variation
 Systematic Variation
 Differences in performance created by a specific experimental
manipulation.
 Unsystematic Variation
 Differences in performance created by unknown factors.
 Age, gender, IQ, time of day, measurement error, etc.
 Randomization
 Minimizes unsystematic variation.
[Figure: three dot plots of different data sets, each drawn on an axis from 20 to 70]

Notice that these three data sets all have the same mean and median
(at 45), but they have very different amounts of variability.
Methods of Variability/Dispersion
Measurement
 Commonly used methods:
 Range, variance, standard deviation, interquartile range,
coefficient of variation etc.
Measures of Variability
 The simplest numeric measure of variability is range.
 It's a crude measure of variability.
 Range = largest observation – smallest observation

[Figure: three dot plots, each drawn on an axis from 20 to 70]

The first two data sets have a range of 50 (70-20) but the third data set has a much smaller range of 10.
Measures of Variability

Another measure of the variability in a data set uses the deviations from the mean $(x - \bar{x})$.
A sample of 6 fish that we caught from the lake . . .
They were the following lengths:
3”, 4”, 5”, 6”, 8”, 10”
The mean length was 6 inches. We can calculate the deviations from the mean.
What was the sum of these deviations?
Can we find an average deviation?
What can we do to the deviations so that we could find an average?
The estimated average of the squared deviations is called the variance, denoted by $s^2$.
(The population variance is the average of the squared deviations.)

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

The denominator n – 1 is the degrees of freedom.
Remember the sample of 6 fish that we caught from
the lake . . .
Find the variance of the length of fish.

First square the deviations.

$x$     $(x - \bar{x})$   $(x - \bar{x})^2$
3       -3                9
4       -2                4
5       -1                1
6       0                 0
8       2                 4
10      4                 16
Sum     0                 34

The sum (and so the average) of the deviations from the mean always equals 0.
What is the sum of the deviations squared? 34
Divide this by 5 (n – 1): $s^2 = 34 / 5 = 6.8$
Measures of Variability
 The square root of variance is called standard deviation.
 A typical deviation from the mean is the standard
deviation.

 Our fish example: $s^2 = 6.8$ square inches, so $s = \sqrt{6.8} \approx 2.608$ inches


 The fish in our sample deviate from the mean of 6 by an
average of 2.608 inches.
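As a check on the fish example, here is a minimal sketch that follows the same steps (deviations, squared deviations, divide by n – 1, square root). The variable names are illustrative only.

```python
from math import sqrt

lengths = [3, 4, 5, 6, 8, 10]  # fish lengths in inches

n = len(lengths)
x_bar = sum(lengths) / n                     # sample mean
deviations = [x - x_bar for x in lengths]    # x - x_bar for each fish
sum_sq = sum(d ** 2 for d in deviations)     # sum of squared deviations
s2 = sum_sq / (n - 1)                        # sample variance (n - 1 degrees of freedom)
s = sqrt(s2)                                 # sample standard deviation

print("mean:", x_bar)
print("sum of deviations:", sum(deviations))
print("sum of squared deviations:", sum_sq)
print("variance:", s2)
print("standard deviation:", round(s, 3))
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n – 1 denominator and should agree with these values.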
When calculating sample variance, we use
degrees of freedom (n – 1) in the denominator
instead of n because this tends to produce better
estimates.
Suppose that everyone in the class caught a
sample of 6 fish from the lake. Would each of
our samples contain the same fish?

Would our mean lengths be the same?

The samples would also have different ranges!
Degrees of freedom
 Degrees of freedom of an estimate is the number of
independent pieces of information that went into
calculating the estimate.
 It’s not quite the same as the number of items in the sample.
 In order to get the df for the estimate, you have to subtract
1 from the number of items.
 Why do we subtract 1 from the number of items?
 Another way to look at degrees of freedom is that they are the
number of values that are free to vary in a data set.
 Or the number of values that must be known before the remaining
values are fixed by a required result (such as a given mean).
Degrees of freedom
 What does “free to vary” mean?
 An example using the mean:
 Pick a set of numbers that have a mean of 10.
 Some sets of numbers you might pick: 9, 10, 11 or 8, 10, 12 or 5,
10, 15.
 Once you have chosen the first two numbers in the set, the third
is fixed.
 In other words, you can’t choose the third item in the set.
 The only numbers that are free to vary are the first two.
 You can pick 9 and 10 or 5 and 15, but once you’ve made that decision
you must choose a particular number that will give you the mean you
are looking for.
 So degrees of freedom for a set of three numbers is TWO (n-1).
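The "free to vary" idea can be illustrated with a small sketch: once the mean and all but one of the values are chosen, the last value is forced. The helper name `last_value` is hypothetical.

```python
def last_value(free_values, target_mean, n):
    """The one remaining value that makes n numbers average to target_mean."""
    return n * target_mean - sum(free_values)

# Choose the first two numbers freely; the third is then fixed by the mean of 10.
for first_two in [(9, 10), (8, 10), (5, 10)]:
    third = last_value(first_two, target_mean=10, n=3)
    print(first_two, "->", third)
```

Only two of the three numbers carry independent information, which is why the set has n – 1 = 2 degrees of freedom.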
Degrees of freedom
 Two Samples
 If you have two samples and want to find a parameter, like
the mean, you have two “N”s to consider (sample 1(N1)
and sample 2(N2)).
 Degrees of freedom in that case is: (N1 + N2) – 2.
Measures of Variability
 Interquartile range (IQR) is the range of the middle half of
the data.
 Lower quartile (Q1) is the median of the lower half of the
data
 Upper quartile (Q3) is the median of the upper half of the
data

 IQR = Q3 – Q1

What advantage does the interquartile range have over the standard deviation?
The IQR is resistant to extreme values
The Chronicle of Higher Education (2009-2010
issue) published the accompanying data on the
percentage of the population with a bachelor’s or
higher degree in 2007 for each of the 50 states
and the District of Columbia.

21 27 26 19 30 35 35 26 47 26 27 30 24 29 22 24 29 20 20
27 35 38 25 31 19 24 27 27 23 34 25 32 26 24 22 28 26
30 23 25 22 25 29 33 34 30 17 25 23 34 26
Find the interquartile range for this set of data.
First put the data in order and find the median.

17 19 19 20 20 21 22 22 22 23 23 23 24 24 24 24 25
25 25 25 25 26 26 26 26 26 26 27 27 27 27 27 28 29
29 29 30 30 30 30 31 32 33 34 34 34 35 35 35 38 47

Median = 26 (the 26th of the 51 ordered values)
Find the lower quartile (Q1) by finding the median of the lower half: Q1 = 24
Find the upper quartile (Q3) by finding the median of the upper half: Q3 = 30
IQR = 30 – 24 = 6
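The quartile steps above can be sketched as follows. The function name `quartiles_by_halves` is hypothetical, and it implements the "median of each half" rule used in this lecture, leaving the median itself out of both halves when n is odd. Other quartile conventions exist (for example the default of `statistics.quantiles`) and may give slightly different values.

```python
from statistics import median

# Percentage with a bachelor's or higher degree, 50 states plus DC
percent_degree = [
    21, 27, 26, 19, 30, 35, 35, 26, 47, 26, 27, 30, 24, 29, 22, 24, 29, 20, 20,
    27, 35, 38, 25, 31, 19, 24, 27, 27, 23, 34, 25, 32, 26, 24, 22, 28, 26,
    30, 23, 25, 22, 25, 29, 33, 34, 30, 17, 25, 23, 34, 26,
]

def quartiles_by_halves(data):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle value so it belongs to neither half
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

q1, med, q3 = quartiles_by_halves(percent_degree)
iqr = q3 - q1

print("Q1:", q1, "median:", med, "Q3:", q3, "IQR:", iqr)
```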
Which Descriptive Statistic to use?
 Depends on measurement type and data dispersion
 Interval or Ratio (Scale)
 Normally distributed
 Mean and Standard Deviation
 Skewed
 Median and Interquartile Range
 Ordinal or nominal
 Mode and/or simple frequencies

The Research Process
Analysing Data
 First step: Graph the data
 Frequency Distributions (aka Histograms)
 A graph plotting values of observations on the horizontal axis,
with a bar showing how many times each value occurred in the
data set.
 Ideal: The ‘Normal’ Distribution
 Bell-shaped
 Symmetrical around the centre
The Normal Distribution
Skew
Properties of Frequency Distributions
 Skew
 The symmetry of the distribution.
 Positive skew (scores bunched at low values with the tail
pointing to high values).
 Negative skew (scores bunched at high values with the tail
pointing to low values).
 Kurtosis
 The ‘heaviness’ of the tails.
 Leptokurtic = heavy tails (more scores in the tails).
 Platykurtic = light tails (more scores in the middle).
Kurtosis
Going beyond the data
Frequency Distribution
 Not only useful for descriptive purposes
 Can be used to calculate likelihood of particular values
occurring – probability
 For any distribution we could calculate the probability of
achieving any of the possible values
 Tedious, time consuming
 Statisticians have created a range of idealized distributions
(probability distributions), and from these we can calculate the
likelihood of achieving particular values if our data distribution
matches one of them
Probability
 Chance behaviour is unpredictable in the short term, but has a
regular and predictable pattern in the long term.
 The probability of any outcome of a random phenomenon is the
proportion of times the outcome would occur in a very long series
of repetitions.
 Sample Space
 The set of all possible outcomes of a random phenomenon
 Event
 Any set of outcomes of interest
 Probability of an event
 The relative frequency of this set of outcomes over an infinite number of
trials
 P(A) is the probability of event A
Probability Distributions
 X represents the random variable X.
 P(X) represents the probability of X.
 P(X = x) refers to the probability that the discrete random
variable X is equal to a particular value, denoted by x.
 As an example, P(X = 1) refers to the probability that the
random variable X is equal to 1.
 Cumulative probability is the probability that a value falls
within a particular range or interval
 P(X<=x)
Probability Distributions
 The probability distribution for a random variable X gives
the possible values for X, and the probabilities associated
with each possible value (i.e., the likelihood that the
values will occur)
 Has a probability assigned to each distinct value of the variable
 A cumulative probability refers to the probability that
the value of a random variable falls within a specified
range.
 The methods used to specify discrete probability
distributions are similar to (but slightly different from)
those used to specify continuous probability distributions
Discrete Random Variable
 Has a probability assigned to each distinct value of the
variable
 The sum of all assigned probabilities must be 1
 Probability distribution can be considered a relative-
frequency distribution and therefore has a mean and standard
deviation
 Mean is often called the expected value
 Represents a cluster point for the entire distribution
 Need not be an actual value of a point of the sample space
 Standard deviation is represented as a measure of risk
 Larger the standard deviation, the more likely it is that a random
variable x is different from the expected value
Discrete Probability Distribution
 Shows us the complete space on which the distribution is
based
 The corresponding probability of each event in the sample
space
Probability Distributions
 Suppose you flip a coin two times.
 This simple statistical experiment can have four possible
outcomes:
 HH (two heads), HT (heads and tails), TH (tails and heads),
and TT (tails and tails).
 Let the variable X represent the number of Heads that
result from this experiment.
 X can take on the values 0, 1, or 2.
 In this example, X is a random variable because its value
is determined by the outcome of a statistical experiment.
Probability Distribution
 A probability distribution is a table or an equation that
links each outcome of a statistical experiment with its
probability of occurrence.

Number of heads(X) Probability


0 0.25
1 0.50
2 0.25
Probability of X=the number of Heads that result from this experiment
Probability Distributions
 A cumulative probability refers to the probability that the
value of a random variable falls within a specified range.
 This can be represented by a table or an equation which
refers to the probability that the random variable X is less
than or equal to x.
 If we flip a coin two times, what is the probability that the
coin flips would result in one or fewer heads?
 The answer would be a cumulative probability.
 It would be the probability that the coin flip experiment results in
zero heads plus the probability that the experiment results in one
head.
P(X ≤ 1) = P(X = 0) + P(X = 1) = 0.25 + 0.50 = 0.75
Probability Distribution

Number of heads (x)   Probability P(X = x)   Cumulative probability P(X ≤ x)
0                     0.25                   0.25
1                     0.50                   0.75
2                     0.25                   1.00
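The table for the two-coin experiment can be rebuilt by enumerating the sample space. This is a small sketch using only the standard library; the names `pmf` and `cdf` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for two flips: HH, HT, TH, TT (all equally likely)
outcomes = list(product("HT", repeat=2))

# X = number of heads in each outcome
counts = Counter(flips.count("H") for flips in outcomes)

# Probability distribution P(X = x)
pmf = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

# Cumulative distribution P(X <= x): running total of the probabilities
cdf = {}
running = Fraction(0)
for x, p in pmf.items():
    running += p
    cdf[x] = running

for x in pmf:
    print(x, float(pmf[x]), float(cdf[x]))
```

Using `Fraction` keeps the probabilities exact, so the assigned probabilities sum to exactly 1.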
Probability Distribution
 The simplest probability distribution occurs when all of the
values of a random variable occur with equal probability.
 This probability distribution is called the uniform distribution.
 Suppose the random variable X can assume k different values.
 Suppose also that P(X = x_k) is constant. Then P(X = x_k) =
1/k.
 Suppose a die is tossed.
 What is the probability that the die will land on 5 ?
 There are 6 possible outcomes represented by: S = { 1, 2, 3, 4, 5, 6 }.
 Each possible outcome is a value of the random variable X, and each outcome is
equally likely to occur.
 Thus, we have a uniform distribution. Therefore, the P(X = 5) = 1/6.
Probability Distribution
 Suppose we undertake a dice tossing experiment
 This time, we ask what is the probability that the die will
land on a number that is smaller than 5 ?
 There are still 6 possible outcomes represented by: S =
{ 1, 2, 3, 4, 5, 6 }.
 This problem involves a cumulative probability.
 The probability that the die will land on a number smaller
than 5 is equal to:
 P( X < 5 ) = P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4) =
1/6 + 1/6 + 1/6 + 1/6 = 2/3
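Both die questions follow directly from the uniform rule P(X = x) = 1/k. A minimal sketch, with illustrative variable names:

```python
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]         # sample space S for one toss of a die
p_each = Fraction(1, len(faces))   # uniform distribution: 1/k for each of k values

p_five = p_each                                                # P(X = 5)
p_less_than_five = sum(p_each for face in faces if face < 5)   # P(X < 5)

print("P(X = 5) =", p_five)
print("P(X < 5) =", p_less_than_five)
```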
Continuous Probability Distribution
 If a random variable is a continuous
variable (variable can take on any value between two
specified values), its probability distribution is called
a continuous probability distribution.
 A continuous probability distribution cannot be expressed
in tabular form
 An equation or formula (probability density function) is used.
 Hypothesis testing relies extensively on the idea that, having
such a function, one can compute the probability of all the
corresponding events, e.g. the probability of X taking a value less
than or equal to a particular value (a).
Continuous Probability Distribution
 The density function has the following properties:
 Since the continuous random variable is defined over a
continuous range of values (called the domain of the variable),
the graph of the density function will also be continuous over
that range.
 The area bounded by the curve of the density function and the
x-axis is equal to 1, when computed over the domain of the
variable.
 The probability that a random variable assumes a value
between a and b is equal to the area under the density function
graph bounded by a and b.
Continuous Probability Function
 Consider the probability density
function shown in the graph below.
 Suppose we wanted to know the
probability that the random
variable X was less than or equal to a.
 This is equal to the area under the curve
bounded by a and minus infinity - as
indicated by the shaded area.
 The shaded area in the graph represents
the probability that the random
variable X is less than or equal to a.
 This is a cumulative probability.
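To make "probability as area under the density curve" concrete, here is a hedged sketch that approximates the area numerically and compares it with a built-in cumulative probability. It assumes a standard normal density purely as an example, since the density in the slide's graph is not specified. The helper `area_under` is hypothetical, and -10 and 10 stand in for minus and plus infinity.

```python
from statistics import NormalDist

def area_under(pdf, lower, upper, steps=10_000):
    """Trapezoidal approximation of the area under pdf between lower and upper."""
    width = (upper - lower) / steps
    total = 0.5 * (pdf(lower) + pdf(upper))
    for i in range(1, steps):
        total += pdf(lower + i * width)
    return total * width

density = NormalDist(mu=0, sigma=1)  # assumed example density
a = 1.0

approx = area_under(density.pdf, -10.0, a)         # shaded area to the left of a
exact = density.cdf(a)                             # cumulative probability P(X <= a)
total_area = area_under(density.pdf, -10.0, 10.0)  # whole curve, should be about 1

print(round(approx, 4), round(exact, 4), round(total_area, 4))
```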
Calculating Probability from a Frequency
distribution
 Statisticians have described several common frequency
distributions
 For each they have created mathematical formulae
(probability density functions) that specify idealized versions
of these distributions
 We can draw the function by plotting the value of a variable x
against the probability of it occurring y which gives us the
probability distribution
 The area under the curve of this distribution tells us something
about the probability of a value occurring
 We can use the area under the curve between two values to tell
us how likely it is that a score falls between these two values
So what does this mean for us?
 Our frequency distribution gives us the opportunity to
calculate the likelihood of particular values occurring
(their probability) using relevant probability calculations
 Tedious, time consuming
 Statisticians have created a range of idealized distributions
(probability distributions), and from these we can calculate the
likelihood of achieving particular values if our data distribution
matches one of them
Standard normal
 Statisticians have calculated the probability of scores
occurring in a distribution with a mean of 0 and a standard
deviation of 1
 So what?
 If we have data shaped like the normal distribution then the
mean can be mapped to 0 and the standard deviation to 1
 We can then use the tables of probability created by these
statisticians to work out the probability of particular scores
occurring within that distribution
 How do we map our scores to fit the standard normal?
Z Scores and Raw scores
 If we want to compare samples with normal distributions,
the mean of each may be located anywhere on the x-axis
and the scores may be more or less spread out, as determined by the
standard deviation
 This causes difficulties when calculating the area under
the curve and hence the probability that a measurement
will fall into the interval of interest
 We could have sets of tables that calculate the area under
the curve for each combination of µ and σ but this would
be quite an onerous task to compile or use
Z Scores and Raw scores
 We need a way to standardise the distributions so we can
use one table for all normal distributions
 We can use the standard deviation as the measurement
scale
 We consider how many standard deviations a measure is from
the mean
 This allows comparison between a value in one normal
distribution with a value in another
Going beyond the data: Z-scores
 Z-scores
 Standardising a score with respect to the other scores in the
group.
 Expresses a score in terms of how many standard deviations it
is away from the mean.
 The distribution of z-scores has a mean of 0 and SD = 1.
z-scores
A z-score states the position of a raw score in relation to the mean
of the distribution, using the standard deviation as the unit of
measurement.

$z = \frac{\text{raw score} - \text{mean}}{\text{standard deviation}}$

1. Find the difference between a score and the mean of the set of scores.
2. Divide this difference by the SD (in order to assess how big it really is).

For a population: $z = \frac{X - \mu}{\sigma}$
For a sample: $z = \frac{X - \bar{X}}{s}$

This is a z test. The z score is therefore a test statistic.
z-scores transform our original IQ scores into scores with a mean of
0 and an SD of 1.
Raw IQ scores (mean = 100, SD = 15)
z for 100 = (100-100) / 15 = 0, z for 115 = (115-100) / 15 = 1,
z for 70 = (70-100) / 15 = -2, etc.

raw:      55   70   85   100   115   130   145
z-score:  -3   -2   -1   0     +1    +2    +3
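The IQ conversion can be reproduced with a one-line function. The name `z_score` is illustrative; it is simply the formula (raw score − mean) / standard deviation.

```python
def z_score(raw, mean, sd):
    """How many standard deviations the raw score lies from the mean."""
    return (raw - mean) / sd

iq_scores = [55, 70, 85, 100, 115, 130, 145]
z_values = [z_score(x, mean=100, sd=15) for x in iq_scores]

for raw, z in zip(iq_scores, z_values):
    print(raw, z)
```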
Raw score distributions V Z Score distributions
A score, X, is expressed in the original units of measurement:

Distribution 1: $\bar{X} = 50$, $s = 10$, X = 65
Distribution 2: $\bar{X} = 200$, $s = 24$, X = 236

Both raw scores correspond to z = 1.5 in the z-score distribution, which has $\bar{X} = 0$ and $s = 1$.

X is expressed in terms of its deviation from the mean (in SDs).

So plotting the z scores we are plotting the test statistic
The Standard Normal Distribution
 The distribution of a normal variable with mean equal to
zero and standard deviation equal to 1.
 Looks identical to that of the normal but uses a different
measurement scale.
 So what?
 It is the fact that we can now have a table showing, for each
point in (−∞, +∞), the probability that we have a realization of a
variable to the left and to the right of that point.
The probability that a realization is lower than the point 2.33 is 0.99.
Then the probability that the realization is above 2.33 is 1 − 0.99 = 0.01.
Why use z-scores?
 z-scores make it easier to compare scores from
distributions using different scales.
 e.g. two tests:

 Test A: Fred scores 78. Mean score = 70, SD = 8.


 Test B: Fred scores 78. Mean score = 66, SD = 6.

 Did Fred do better or worse in comparison to the rest of the
class on the second test?
Test A: as a z-score, z = (78 - 70) / 8 = 1.00
Test B: as a z-score, z = (78 - 66) / 6 = 2.00

Conclusion: Fred comparatively did much better on Test B.
z-scores enable us to determine the relationship between
one score and the rest of the scores, using just one table
for all normal distributions.
If we have 480 scores, normally distributed with a mean
of 60 and an SD of 8, how many would be 76 or above?
Graph the problem:
Work out the z-score for 76:
$z = (X - \bar{X}) / s = (76 - 60) / 8 = 16 / 8 = 2.00$
We need to know the size of the area beyond z
(remember - the area under the Normal curve
corresponds directly to the proportion of scores).
Many statistics books have z-score tables, giving us this
information:
z       (a) Area between mean and z   (b) Area beyond z
0.00    0.0000                        0.5000
0.01    0.0040                        0.4960
0.02    0.0080                        0.4920
:       :                             :
1.00    0.3413                        0.1587
:       :                             :
2.00    0.4772                        0.0228
:       :                             :
3.00    0.4987                        0.0013

So: as a proportion of 1, 0.0228 of scores are likely to be 76 or more.
As a percentage, 0.0228 = 2.28%
As a number, 0.0228 * 480 = 10.94 scores, i.e. about 11 scores.
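Instead of a printed table, the "area beyond z" can be computed directly. This sketch assumes Python 3.8 or later, where `statistics.NormalDist` is available; the variable names are illustrative.

```python
from statistics import NormalDist

mean, sd, n_scores = 60, 8, 480

z = (76 - mean) / sd                    # z-score for a raw score of 76
area_beyond = 1 - NormalDist().cdf(z)   # upper-tail area of the standard normal
expected_count = area_beyond * n_scores

print("z:", z)
print("area beyond z:", round(area_beyond, 4))
print("expected number of scores:", round(expected_count, 2))
```

The count comes out marginally below the 10.94 shown above because the table value 0.0228 is already rounded; either way the answer is about 11 scores.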
Word comprehension test scores:
People with no brain damage: no. correct: mean = 92,
SD = 6 out of 100
Brain-damaged person: no. correct: 89 out of 100.
Is this person's comprehension significantly impaired?
Step 1: Graph the problem.
[Figure: normal curve centred on 92, with the area below 89 marked "?"]
Step 2: Convert 89 into a z-score:
z = (89 - 92) / 6 = -3 / 6 = -0.5
Step 3: Use the table to find the "area beyond z" for our z of -0.5:
Area beyond z = 0.3085

z-score value   Area between the mean and z   Area beyond z
0.44            0.1700                        0.3300
0.45            0.1736                        0.3264
0.46            0.1772                        0.3228
0.47            0.1808                        0.3192
0.48            0.1844                        0.3156
0.49            0.1879                        0.3121
0.50            0.1915                        0.3085
0.51            0.1950                        0.3050
0.52            0.1985                        0.3015
0.53            0.2019                        0.2981
0.54            0.2054                        0.2946
0.55            0.2088                        0.2912
0.56            0.2123                        0.2877
0.57            0.2157                        0.2843
0.58            0.2190                        0.2810
0.59            0.2224                        0.2776
0.60            0.2257                        0.2743
0.61            0.2291                        0.2709

Conclusion: .31 (31%) of people without brain damage are likely to have a
comprehension score this low or lower.
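The same lower-tail proportion can be obtained without converting to z first, by describing the raw-score distribution itself. This is a sketch that assumes the healthy scores are normally distributed, as the example does; the name `healthy` is illustrative.

```python
from statistics import NormalDist

# Assumed model for people with no brain damage: mean 92, SD 6
healthy = NormalDist(mu=92, sigma=6)

# Proportion of healthy people scoring 89 or lower
p_this_low_or_lower = healthy.cdf(89)

print(round(p_this_low_or_lower, 4))
```

Because the normal curve is symmetrical, this lower-tail area for z = -0.5 equals the tabulated "area beyond z" for z = 0.5.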
The Normal Distribution

The curve shows the idealized shape.


It is important that our data is close to this shape if we wish to
use Parametric tests.
The Normal Distribution
 Normal Curve or Bell-shaped Curve
 Key players: Abraham de Moivre (1667-1754) and Carl Friedrich Gauss
(1777-1855)
 Sometimes normal distribution is referred to as a Gaussian distribution
 Smooth, symmetrical curve about the mean which is the highest
point of the curve
 Approaches the horizontal axis but never touches it (asymptotic)
 The spread of the curve is determined by the standard deviation
 Larger this value the more spread out the curve is, smaller the more
peaked it is
 The inflection points where it starts to transition are determined by
the mean +/- one standard deviation
 The area under the curve is 1
Normal Distribution
 A density curve describes the overall pattern of a
distribution.
 Formula used to generate the shape of the curve is the
normal density function
 A distribution is normal if its density curve is symmetric,
single-peaked and bell-shaped.
 Mean, Median, and mode are same for a normal distribution.
Properties of the Normal Distribution:

1. It is bell-shaped and asymptotic at the extremes.


2. It's symmetrical around the mean.
3. The mean, median and mode all have same value.
4. It can be specified completely, once mean and SD
are known.
5. The area under the curve is directly proportional
to the relative frequency of observations.
Thus we can calculate the probability of observations occurring in a
population
e.g. here, 50% of scores fall below the mean, as
does 50% of the area under the curve.
e.g. here, 85% of scores fall below score X,
corresponding to 85% of the area under the curve.
Normal Distribution

If we know µ and σ, we can
derive a lot of additional
information about
data with a normal
distribution.
Normal Distribution
 The Empirical Rule - The 68-95-99.7 Rule
 In the normal distribution with mean µ and standard
deviation σ:
 68% of the observations fall within σ of the mean µ.
 95% of the observations fall within 2σ of the mean µ.
 99.7% of the observations fall within 3σ of the mean µ.
The Normal Distribution
 If a variable is normally distributed, then:
 within one standard deviation of the mean there will be
approximately 68% of the data
 within two standard deviations of the mean there will be
approximately 95% of the data
 within three standard deviations of the mean there will be
approximately 99.7% of the data
Properties of z-scores
 1.96 cuts off the top 2.5% of the distribution.
 −1.96 cuts off the bottom 2.5% of the distribution.
 As such, 95% of z-scores lie between −1.96 and 1.96.
 99% of z-scores lie between −2.58 and 2.58,
 99.9% of them lie between −3.29 and 3.29.
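The 68-95-99.7 rule and the z cut-offs listed above can both be checked against the standard normal distribution. A minimal sketch, assuming Python 3.8 or later for `statistics.NormalDist`; the helper `share_within` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

within = {k: share_within(k) for k in (1, 2, 3)}

# z value that leaves the stated central proportion between -z and +z
cutoffs = {central: std_normal.inv_cdf(0.5 + central / 2)
           for central in (0.95, 0.99, 0.999)}

for k, share in within.items():
    print(f"within {k} SD: {share:.4f}")
for central, z in cutoffs.items():
    print(f"central {central}: +/- {z:.2f}")
```

The rule's 95% for two standard deviations is a rounded figure; the exact 95% cut-off is 1.96 rather than 2.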
Normal Distribution in summary
 Many psychological/biological properties are normally
distributed.
 This is very important for statistical inference
(extrapolating from samples to populations)
 z-scores provide a way of
 (a) comparing scores on different raw-score scales;
 (b) showing how a given score stands in relation to the overall
set of scores.
 (c) using probability tables to calculate likelihood of particular
scores.
Normal distribution in summary
 The logic of z-scores underlies many statistical tests:
1. Scores are normally distributed around their mean.
2. Sample means are normally distributed around the population
mean.
3. Differences between sample means are normally distributed
around zero ("no difference").
 We can exploit these phenomena in devising tests to help
us decide whether or not an observed difference between
sample means is due to chance.
Distribution is central to choosing the
correct test
 Parametric Tests
 Normal distribution
 Non-parametric Tests
 Non normal distribution
 Always start by looking at the data!
The Research Process
Populations and Samples
 Population
 The collection of units (be they people, plankton, plants, cities,
suicidal authors, etc.) to which we want to generalize a set of
findings or a statistical model
 Sample
 A smaller (but hopefully representative) collection of units from
a population used to determine truths about that population
The Only Equation You Will Ever Need

$\text{outcome}_i = (\text{model}) + \text{error}_i$


A Simple Statistical Model
 In statistics we fit models to our data (i.e. we use a
statistical model to represent what is happening in the
real world).
 The mean is a hypothetical value (i.e. it doesn’t have
to be a value that actually exists in the data set).
 As such, the mean is simple statistical model.
The Mean
 The mean is the sum of all scores divided by the number
of scores.
 The mean is also the value from which the (squared)
scores deviate least (it has the least error).
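The "least error" property can be checked numerically. The sketch below uses the five scores (1, 2, 3, 3, 4) from the worked example in this lecture; sse() is a helper defined here for illustration.

```r
scores <- c(1, 2, 3, 3, 4)
mean(scores)                        # 2.6

# Sum of squared errors if some value m were used as the model
sse <- function(m) sum((scores - m)^2)

sse(2.6)   # 5.2
sse(3)     # 6, a worse fit than the mean

# Numerical search for the value with the least squared error
best <- optimize(sse, interval = range(scores))$minimum
best       # approximately 2.6, i.e. the mean
```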
Measuring the ‘Fit’ of the Model
 The mean is a model of what happens in the real world:
the typical score.
 It is not a perfect representation of the data.
 How can we assess how well the mean represents reality?
Calculating ‘Error’
 A deviation is the difference between the mean and an
actual data point.
 Deviations can be calculated by taking each score and
subtracting the mean from it:

$\text{deviation} = x_i - \bar{x}$
Use the Total Error?
 We could just take the error between the mean and the
data and add them.

Score   Mean   Deviation
1       2.6    -1.6
2       2.6    -0.6
3       2.6     0.4
3       2.6     0.4
4       2.6     1.4
Total           0

$\sum (x_i - \bar{x}) = 0$
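The same table can be reproduced in R. This is a minimal sketch; the variable names are illustrative.

```r
scores <- c(1, 2, 3, 3, 4)
deviations <- scores - mean(scores)
deviations        # -1.6 -0.6  0.4  0.4  1.4
sum(deviations)   # zero, up to floating-point rounding
```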
Sum of Squared Errors
 We could add the deviations to find out the total error.
 Deviations cancel out because some are positive and
others negative.
 Therefore, we square each deviation.
 If we add these squared deviations we get the sum of
squared errors (SS).
Score   Mean   Deviation   Squared Deviation
1       2.6    -1.6        2.56
2       2.6    -0.6        0.36
3       2.6     0.4        0.16
3       2.6     0.4        0.16
4       2.6     1.4        1.96
Total                      5.20

$SS = \sum (x_i - \bar{x})^2 = 5.20$
Variance
 The sum of squares is a good measure of overall
variability, but is dependent on the number of scores.
 We calculate the average variability by dividing by the
number of scores (n).
 This value is called the variance ($s^2$).
Standard Deviation
 The variance has one problem: it is measured in units
squared.
 This isn’t a very meaningful metric so we take the square
root value (measured in units).
 This is the standard deviation (s).

$s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} = \sqrt{\dfrac{5.20}{5}} = 1.02$
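The calculation can be followed step by step in R. A minimal sketch: note that the slides divide by n, whereas R's built-in var() and sd() divide by n − 1, so the built-in results differ slightly.

```r
scores <- c(1, 2, 3, 3, 4)
n  <- length(scores)
ss <- sum((scores - mean(scores))^2)   # sum of squared errors: 5.2

var_n <- ss / n        # 1.04, dividing by n as on the slide
sd_n  <- sqrt(var_n)   # about 1.02

# R's built-in functions divide by n - 1 instead
var(scores)   # 1.3
sd(scores)    # about 1.14
```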
Same Mean, Different SD
The SD and the Shape of a Distribution
So what is the mean a model of?
 We have used it to model a summary of a set of data
 The standard deviation in this case represents how good a
‘fit’ that model is to the set of data
 So we are assessing the fit of the model by comparing the data
we have to the model we’ve ‘fitted’ to the data
 This is a fundamental idea within the linear statistical model
Important Things to Remember
 The sum of squares, variance, and standard deviation
represent the same thing:
 The ‘fit’ of the mean to the data
 The variability in the data
 How well the mean represents the observed data
 Error
Samples vs. Populations
 Sample
 Mean and SD describe only the sample from which they were
calculated.
 Population
 Mean and SD are intended to describe the entire population
(very rare in most studies).
 Sample to Population:
 Mean and SD are obtained from a sample, but are used to
estimate the mean and SD of the population (very common).
Going beyond the data
 We now know how to fit a simple model to our data
 But usually we want to move beyond our data to the wider
world the data represents and say something about the
world
 Based on our sample
 So we need to look at whether our model is a good fit for
the population from which it came
Going beyond the data
 We ideally want to collect data from all members of the
population
 We usually collect a number of samples
 Each sample could have a different mean - sampling variation
 We can plot the sample means into a frequency
distribution
 Sampling distribution
Going beyond the data
 So what?
 If we have enough samples we can calculate the population
mean
 But how well does it fit ?
 Need to calculate the standard deviation of the sample
means
 Standard error of the mean (SE)
Sampling Variation
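[Figure: a population with μ = 10 and nine samples drawn from it, with sample means M = 10, 9, 11, 10, 9, 8, 12, 10 and 11. A histogram of the sample means (horizontal axis: Sample Mean, 6 to 14; vertical axis: Frequency) has Mean = 10 and SD = 1.22. The figure also shows the standard error formula $\sigma_{\bar{X}} = \dfrac{s}{\sqrt{N}}$.]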
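The summary values in the figure can be checked from the nine sample means. A minimal sketch; here the SD of 1.22 matches R's sd(), which divides by n − 1.

```r
sample_means <- c(10, 9, 11, 10, 9, 8, 12, 10, 11)

mean(sample_means)    # 10
sd(sample_means)      # about 1.22
table(sample_means)   # the frequencies plotted in the histogram
```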
Going beyond the data
 In reality we can’t collect enough samples
 Instead we rely on an approximation of the sample mean
and sample error
 Based on the Central Limit Theorem
 As samples get large, the sampling distribution has a normal
distribution with a mean equal to the population mean
and a standard deviation of
$\sigma_{\bar{X}} = \dfrac{s}{\sqrt{N}}$
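A small simulation can make this concrete. The sketch below assumes a hypothetical skewed population (exponential, with mean 10 and SD 10) and a sample size of 50; neither comes from the lecture.

```r
set.seed(42)   # arbitrary seed so the run is repeatable
N <- 50        # size of each sample (assumed)

# Draw 5000 samples from the hypothetical population and keep each sample mean
sample_means <- replicate(5000, mean(rexp(N, rate = 1 / 10)))

mean(sample_means)   # close to the population mean of 10
sd(sample_means)     # close to the theoretical value below
10 / sqrt(N)         # population SD / sqrt(N), about 1.41

# In practice we have one sample, so the standard error is estimated as s / sqrt(N)
one_sample <- rexp(N, rate = 1 / 10)
se <- sd(one_sample) / sqrt(N)
```

A histogram of sample_means, for example hist(sample_means), should look roughly bell-shaped even though the population itself is skewed.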
So what does this mean?
 We can use the standard deviation of the sampling
distribution as the approximation of the sample error
 If our distribution follows the normal distribution
 For other shapes of distribution we have other ways of
approximating the population mean and standard error.
Confidence Intervals
 CI represents a range of values between which we
think a population value will fall
 Suppose we are looking at our fish in Lough Mask
 True mean
 15 thousand fish
 Sample mean
 17 thousand fish
 Interval estimate
 12 to 22 thousand (contains true value)
 16 to 18 thousand (misses true value)
 CIs constructed such that 95% contain the true value.
[Figure: plots of fish numbers (thousands)]
How to construct a CI?
 Typically look at 95% CI but can also look at 99%
 What does this mean?
 Suppose we collected 100 samples and calculated the mean and a
95% CI for each
 Then we would expect about 95 of these intervals to
contain the true mean
 How to calculate?
 Need to know the limits within which 95% of the means fall
 Go back to the normal distribution – 95% of z-scores fall between
−1.96 and +1.96
 Once we know the mean and standard deviation we can
calculate any score and therefore the CI
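The "how many intervals contain the true mean" idea can be sketched by simulation. The population mean, SD and sample size below are hypothetical values chosen for illustration.

```r
set.seed(1)
mu <- 15; sigma <- 5; N <- 40   # hypothetical population and sample size

# Repeat the study 1000 times; record whether each 95% CI contains mu
covers <- replicate(1000, {
  x  <- rnorm(N, mean = mu, sd = sigma)
  se <- sd(x) / sqrt(N)
  lower <- mean(x) - 1.96 * se
  upper <- mean(x) + 1.96 * se
  lower <= mu && mu <= upper
})

mean(covers)   # proportion of intervals containing the true mean, near 0.95
```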
CI
 Lower boundary = $\mu - (1.96 \times SE)$
 Upper boundary = $\mu + (1.96 \times SE)$
 $\mu$ = population mean
 SE = standard error of the mean

 And we can use our approximations for this also
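Using the approximations (the sample mean in place of the population mean, and s/√N for the standard error), a 95% interval can be sketched as follows. The data are a hypothetical sample of six fish lengths; with so few observations the interval is only illustrative.

```r
x <- c(3, 4, 5, 6, 8, 10)   # hypothetical sample

m  <- mean(x)                   # sample mean: 6
se <- sd(x) / sqrt(length(x))   # estimated standard error of the mean

lower <- m - 1.96 * se
upper <- m + 1.96 * se
c(lower = lower, upper = upper)
```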


The Art of Presenting Data
 Graphs should (Tufte, 2001):
 Show the data.
 Induce the reader to think about the data being presented (rather
than some other aspect of the graph).
 Avoid distorting the data.
 Present many numbers with minimum ink.
 Make large data sets (assuming you have one) coherent.
 Encourage the reader to compare different pieces of data.
 Reveal data.

Tufte (2001): Edward Tufte, The Visual Display of Quantitative Information, Graphics Press, 2nd edition, 2001
Why Is This Graph Bad?
Why Is This Graph Better?

[Figure: bar chart with legend "Thoughts" and "Actions"; error bars show 95% CI; horizontal axis: Therapy (CBT, BT, No Treatment); vertical axis: "Number of Obsessiv… Thoughts/Actions p…" (label truncated in source)]
Be Careful with your scales

Two graphs about cheese


Graphical Presentation
 Nominal or Ordinal:
 Bar charts, Pie charts or Frequency Tables
 Interval or Ratio numerical:
 Histogram, Stem and Leaf diagrams or Box-plot
 Depending on dispersion

KEY
SLIDE
Stem-and-Leaf plot
 Shows data arranged by place value.
 You can use a stem-and-leaf plot when you want to
display data in an organized way that allows you to see
each value.
 Use for small to moderate sized data sets.
 Doesn’t work well for large data sets.
 Accompany with a comment on the centre, spread, and
shape of the distribution and if there are any unusual
features.
Creating Stem-and-Leaf Plots

Use the data in the table to make a stem-and-leaf plot.

Test Scores
75 86 83 91 94
88 84 99 79 86

Step 1: Group the data by tens digits.
Step 2: Order the data from least to greatest.

75 79
83 84 86 86 88
91 94 99
Creating Stem-and-Leaf Plots

Step 3: List the tens digits of the data in order from least to greatest. Write these in the “stems” column.
Step 4: For each tens digit, record the ones digits of each data value in order from least to greatest. Write these in the “leaves” column.
Step 5: Title the graph and add a key.

Test Scores
Stems | Leaves
7     | 5 9
8     | 3 4 6 6 8
9     | 1 4 9

Key: 7 | 5 means 75
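The five steps can be sketched in R with the same test scores. The manual version below mirrors the steps; base R's stem() does the job directly, though its printed layout may differ from the slide.

```r
scores <- c(75, 86, 83, 91, 94, 88, 84, 99, 79, 86)

# Steps 1-2: order the data
ordered <- sort(scores)

# Steps 3-4: tens digits become stems, ones digits become leaves
leaves <- split(ordered %% 10, ordered %/% 10)

# Step 5: print one row per stem
for (s in names(leaves)) {
  cat(s, "|", paste(leaves[[s]], collapse = " "), "\n")
}

# Built-in alternative
stem(scores)
```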
Reading Stem-and-Leaf Plots
Find the least value, greatest value, mean, median, mode,
and range of the data.

The least stem and least leaf give the least value, 40.
The greatest stem and greatest leaf give the greatest value, 94.
Use the data values to find the mean: (40 + … + 94) ÷ 23 = 64.

Stems | Leaves
4     | 0 0 1 5 7
5     | 1 1 2 4
6     | 3 3 3 5 9 9
7     | 0 4 4
8     | 3 6 7
9     | 1 4

Key: 4 | 0 means 40
Reading Stem-and-Leaf Plot

The median is the middle value in the table, 63.
To find the mode, look for the number that occurs most often in a row of leaves. Then identify its stem. The mode is 63.
The range is the difference between the greatest and the least value: 94 – 40 = 54.

Stems | Leaves
4     | 0 0 1 5 7
5     | 1 1 2 4
6     | 3 3 3 5 9 9
7     | 0 4 4
8     | 3 6 7
9     | 1 4

Key: 4 | 0 means 40
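These summary values can be verified by reading the 23 data values off the plot and computing them in R. A minimal sketch; base R has no function for the statistical mode, so it is taken here as the most frequent value in a frequency table.

```r
x <- c(40, 40, 41, 45, 47,
       51, 51, 52, 54,
       63, 63, 63, 65, 69, 69,
       70, 74, 74,
       83, 86, 87,
       91, 94)

length(x)        # 23 values
min(x)           # 40
max(x)           # 94
mean(x)          # 64
median(x)        # 63

counts <- table(x)
mode_value <- as.numeric(names(counts)[which.max(counts)])   # 63

diff(range(x))   # 54
```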
Histograms
 When to Use
 Univariate (single variable) numerical data
 For comparative histograms: use two separate graphs with the same scale on the horizontal axis
 Discrete data
 May only take on a finite or countable number of values
 Constructed differently for discrete versus continuous data
 Draw a horizontal scale and mark it with the possible values for the
variable
 Draw a vertical scale and mark it with frequency or relative frequency
 Above each possible value, draw a rectangle centred at that value with a
height corresponding to its frequency or relative frequency
 To describe
 Comment on the centre, spread, and shape of the distribution and if there
are any unusual features
Queen honey bees mate shortly after they
become adults. During a mating flight, the
queen usually takes several partners, collecting
sperm that she will store and use throughout
the rest of her life. A study on honey bees
provided the following data on the number of
partners for 30 queen bees.

12 2 4 6 6 7 8 7 8 11 8 3 5 6 7 10 1
9 7 6 9 7 5 4 7 4 6 7 8 10

Create a histogram for the number of partners of the queen bees.
First draw a horizontal axis, scaled with the possible values of the variable of interest.
Next draw a vertical axis, scaled with frequency or relative frequency.
Draw a rectangle above each value with a height corresponding to the frequency.

[Figure: histogram of the number of partners (horizontal axis, 0 to 12) against frequency (vertical axis)]

Suppose we use relative frequency instead of frequency on the vertical axis.
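Both versions of the vertical axis can be sketched in base R from the bee data. Because the variable is discrete, a bar per possible value (barplot() on a frequency table) is used here rather than hist(); the axis labels are illustrative.

```r
partners <- c(12, 2, 4, 6, 6, 7, 8, 7, 8, 11, 8, 3, 5, 6, 7, 10, 1,
              9, 7, 6, 9, 7, 5, 4, 7, 4, 6, 7, 8, 10)

freq <- table(partners)                # frequency of each value
rel_freq <- freq / length(partners)    # relative frequency; sums to 1

freq
round(rel_freq, 3)

# Same shape as the frequency plot; only the vertical scale changes
barplot(rel_freq, xlab = "Number of partners", ylab = "Relative frequency")
```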
Histograms
 When to Use
 Univariate numerical data (one variable)
 Continuous data
 Mark the boundaries of the class intervals on the horizontal axis
 Draw a vertical scale and mark it with frequency or relative
frequency
This is the type of histogram that most people are familiar with.
 Draw a rectangle directly above each class interval with a
height corresponding to its frequency or relative frequency
 To describe
 Comment on the centre, spread, and shape of the distribution
and if there are any unusual features
A study examined the number of hours spent
watching TV per day for a sample of children
age 1 and for a sample of children age 3.
Below are comparative histograms.

Notice the common scale on the horizontal axis.

Write a few sentences comparing the distributions.

The median number of hours spent watching TV per day was greater for the 1-year-olds than for the 3-year-olds. The distribution for the 3-year-olds was more strongly skewed right than the distribution for the 1-year-olds, but the two distributions had similar ranges.

[Figure: comparative histograms, panels "Children Age 1" and "Children Age 3"]
Frequency Distribution
 Graphs are useful in assisting us in assessing the distribution in a set of
data for a particular variable
 Frequency distribution shows the relative frequencies of values for
variables of interest in a dataset
 Where values have been binned into groups (e.g. 10 to 20, 21 to 30 etc)
 The height of each bar is proportional to the relative frequency in the data set of
the group it represents.
 Normal distribution
 Bell-shaped, scores equally distributed around a central value (mean)
 Skewed
 Lack of symmetry
 Data pulled towards one end of the graph
 Kurtosis
 Pointiness
Plots in R
ggplot2

In ggplot2 a plot is made up of layers.
A graph is made up of a series of layers.
Each layer can be made up of different visual elements (geoms).
The anatomy of a graph
Geometric Objects (geoms)
 geom_bar(): creates a layer with bars
 geom_point(): creates a layer with data points
 geom_line(): creates a layer with lines
 geom_histogram(): creates a layer with a histogram
 geom_boxplot(): creates a layer with a box-whisker
diagram
 geom_text(): creates a layer with text on it
 geom_density(): creates a layer with a density plot on it
Geometric Object functions
 Each takes different parameters.
 See handout for pages from Andy Field re properties
associated with common geom + some common aesthetics
Using ggplot()
 Create an object that specifies the plot
 Pass in the data and set whatever aesthetics you want to apply
to all layers (if any)
 Adjust the layers
 Display/save the graph
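The steps above can be sketched with the queen bee data from earlier in the lecture. This assumes the ggplot2 package is installed; the data frame name, colours, labels and output file name are illustrative choices, not taken from PSIWeek2.R.

```r
library(ggplot2)   # assumes the ggplot2 package is installed

bees <- data.frame(partners = c(12, 2, 4, 6, 6, 7, 8, 7, 8, 11, 8, 3, 5, 6,
                                7, 10, 1, 9, 7, 6, 9, 7, 5, 4, 7, 4, 6, 7,
                                8, 10))

# 1. Create the plot object: pass in the data and the aesthetics
#    that apply to all layers
p <- ggplot(bees, aes(x = partners))

# 2. Add and adjust the layers
p <- p +
  geom_histogram(binwidth = 1, colour = "black", fill = "grey80") +
  labs(x = "Number of partners", y = "Frequency")

# 3. Display and save the graph
print(p)
ggsave("bee_partners.pdf", plot = p, width = 6, height = 4)
```

Other geoms are added the same way, with each extra `+ geom_...()` call contributing another layer to the same plot object.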
Examples
 PSIWeek2.R
 Details of the datasets used are also provided.
