3 SamplingDistributions Complete

Uploaded by

Geetha Panneer
Unit 3: Sampling Distributions, Parameters and Parameter Estimates

What are the risks of excessive drinking at a department party?

A picture is worth a thousand words...

Inferential Statistics
Inferential statistics are used to estimate “parameters” in the
population from parameter estimates in a sample drawn from
that population.

In inferential statistics, we use these parameter estimates to
test hypotheses (predictions; null and alternative
hypotheses) about the size of the population parameter.

These predictions about the size of population parameters
typically map directly onto research questions about (causal)
relationships between variables (IVs and DV).

Answers from inferential statistics are probabilistic. In other
words, all answers have the potential to be wrong, and you
will provide an index of that probability along with your
results.

Populations
A population is any clearly defined set of objects or events
(people, occurrences, animals, etc.). Populations usually
represent all events in a particular class (e.g., all college
students, all alcoholics, all depressed people, all people). It
is often an abstract concept because in many/most
instances you will never have access to the entire
population.

For example, many of our studies may have the population
of all people as their target.

Nonetheless, researchers usually want to describe or draw
conclusions about populations (e.g., we don’t care if some
new drug is an effective treatment for the 100 people in your
sample. Will it work, on average, for everyone we might
treat?).

(Population) Parameters
A parameter is a value used to describe a certain
characteristic of a population. It is usually unknown and
therefore has to be estimated.
For example, the population mean is a parameter that is often
used to indicate the average/typical value of a variable in the
population.
Within a population, a parameter is a fixed value which does
not vary within the population at the time of measurement
(e.g., the mean height of people in the US at the present
moment).
You typically can’t calculate these parameters directly
because you don’t have access to the entire population.
We use Greek letters to represent population parameters (μ,
σ, σ², β0, βj).

Samples & Parameter Estimates
A sample is a finite group of units (e.g., participants)
selected from the population of interest.

A sample is generally selected for study because the
population is too large to study in its entirety. We typically
have only one sample in a study.

We use the sample to estimate and test parameters in the
population.

These estimates are called parameter estimates.

We use Roman letters to represent sample parameter
estimates (X̄, s, s², b0, bj).

Sampling Error
Since a sample does not include all members of the
population, parameter estimates generally differ from the
parameters of the entire population (e.g., using the mean height
of a sample of 1000 people to estimate the mean height of the US
population).

The difference between the (sample) parameter estimate and
the (population) parameter is sampling error.

You will not be able to calculate the sampling error of your
parameter estimate directly because you don’t know the
value of the population parameter. However, you can
estimate it by probabilistic modeling of the hypothetical
sampling distribution for that parameter.

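As a rough illustration of sampling error, the sketch below builds a hypothetical population of 10,000 simulated scores. The names `pop`, `mu`, and `x_bar`, the normal shape, and the SD of 23.67 are assumptions for illustration, not real data. Because the whole simulated population is available here, the sampling error can be computed directly, which is not possible in real research.

```r
# Hypothetical population: 10,000 simulated scores (an assumption for illustration)
set.seed(2)
pop <- rnorm(10000, mean = 0, sd = 23.67)

mu <- mean(pop)                    # population parameter (normally unknown)
x_bar <- mean(sample(pop, 1000))   # parameter estimate from one sample of 1000

sampling_error <- x_bar - mu       # estimate minus parameter
sampling_error
```

Rerunning the last three lines without resetting the seed gives a different sample and therefore a different sampling error each time.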
Hypothetical Sampling Distribution
A sampling distribution is the probability distribution of a
parameter estimate (sample statistic) across all possible samples
of size N taken from a population.
A sampling distribution can be formed for the estimate of any
population parameter.
Each time you draw a sample of size N from a population,
you can calculate an estimate of that population parameter
from that sample.
Because of sampling error, these parameter estimates will
not exactly equal the population parameter. They will not
equal each other either. They will form a distribution.
A sampling distribution, like a population, is an abstract
concept that represents the outcome of repeated (infinite)
sampling. You will typically only have one sample.

What if we didn’t need samples?
Research question: How do inhabitants of a remote Pacific
island feel about the ocean?
Population size = 10,000
Dependent measure: Ocean Liking Scale scores range from
-100 (strongly dislike) to 100 (strongly like); 0 represents
neutral
Hypotheses: H0: μ = 0; Ha: μ ≠ 0

How would you answer this question if you had unlimited
resources (time, money, and patience!)?
Administer the Ocean Liking Scale to all 10,000
inhabitants in the population and calculate the
population mean score. Is it 0? If not, the inhabitants
are not neutral on average.

Ocean Liking Scale Scores in Full Population
> setwd("C:/Users/LocalUser/Desktop/GLM")
> d = lm.readDat('3_SamplingDistributions_Like.dat')
> str(d)
'data.frame': 10000 obs. of 2 variables:
$ Like0: num -23.61 -9.01 30.54 5.89 -9.16 ...

> lm.describeData(d)
var n mean sd median min max skew kurtosis
Like0 1 10000 0 23.67 0.1 -86.64 84.46 0 -0.03

Ocean Liking Scale Scores in Full Population
> windows() #quartz() for MAC users
> par('cex' = 1.5, 'lwd' = 2, 'font.axis'=1.5, 'font.lab' = 2)
> hist(d$Like0, col='yellow')

Parameter Estimation and Testing
What do you conclude?
Inhabitants of the island ARE neutral on average on the
Ocean Liking Scale; μ = 0.

How confident are you about this conclusion?
Excluding issues of measurement of the scale (i.e.,
reliability), you are 100% confident that the population
mean score on this scale is 0 (μ = 0).

Of course, this approach to answering a research question is
not typical. Why? And how would you normally answer this
question?
You will very rarely have access to all scores in the
population. Instead, you have to use inferential
statistics to “infer” (estimate) the size of the population
parameter from a sample.

Obtain a Sample
You are a poor graduate student. All you can afford is N=10

> dS = data.frame(Sample1 = sample(d$Like0,10))


> lm.describeData(dS$Sample1,1)
n mean sd min max
Sample1 10 2.4 24.97 -33.93 46.97

What do you conclude and why?
A sample mean of 2.40 is not 0. However, you know
that the sample mean will not match the population
mean exactly. How likely is it to get a sample mean of
2.40 if the population mean is 0? (Think about it!)

Obtain a Sample
Your friend is a poor graduate student too. All she can afford
is N=10 too.
> dS$Sample2 = sample(d$Like0,10)
> lm.describeData(dS$Sample2,1)
n mean sd min max
sample2 10 1.04 22.42 -22.74 44.43

What do you conclude and why?
A sample mean of 1.04 is not 0. However, you know
that the sample mean will not match the population
mean exactly. It is more likely to get a sample mean of
1.04 than 2.40 if the population mean is 0, but you still
don’t know how likely either outcome is. What if she
obtained a sample with a mean of 30?

Sampling Distribution of the Mean
You can construct a sampling distribution for any sample
statistic (e.g., mean, s, min, max, r, b0, b1).
For the mean, you can think of the sampling distribution
conceptually as follows:
1. Imagine drawing many samples (let’s say 1000 samples,
but in theory the sampling distribution is based on infinitely
many) of N=10 participants (10 participants in each sample)
from your population.

2. Next, calculate the mean for each of these samples of 10
participants.

3. Finally, create a histogram (or density plot) of these
sample means.

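The three steps above can be sketched in a few lines of R. This is a minimal sketch, assuming a simulated normal population as a stand-in for the real Ocean Liking scores, which come from a data file not reproduced here. The names `pop` and `sample_means` are illustrative.

```r
# Hypothetical stand-in for the population: 10,000 simulated scores
# (normal shape and SD of 23.67 are assumptions for illustration)
set.seed(1)
pop <- rnorm(10000, mean = 0, sd = 23.67)

# Steps 1 and 2: draw 1000 samples of N = 10 and take the mean of each
sample_means <- replicate(1000, mean(sample(pop, 10)))

# Step 3: plot the distribution of the 1000 sample means
hist(sample_means, col = "yellow", main = "1000 sample means, N = 10")

mean(sample_means)  # should be near the population mean
sd(sample_means)    # much smaller than the SD of the raw scores
```

Increasing the number of replications makes the histogram a closer approximation to the theoretical sampling distribution, which is based on infinitely many samples.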
1000 Samples of N=10 OLS Scores
Descriptives for each of 1000 samples of N=10
n mean sd min max
sample1 10 2.40 24.97 -33.93 46.97
sample2 10 1.04 22.42 -22.74 44.43
sample3 10 -2.52 25.39 -47.83 37.05
sample4 10 -0.08 22.78 -32.19 34.35
sample5 10 -13.48 21.14 -42.61 15.04
sample6 10 2.77 26.30 -49.92 45.56
sample7 10 -9.56 21.35 -38.03 25.86
sample8 10 -5.32 16.74 -30.69 25.57
sample9 10 5.89 30.28 -55.08 44.32
sample10 10 4.51 30.30 -43.83 56.36
sample11 10 5.65 28.41 -55.83 43.16
sample12 10 8.23 23.62 -37.88 54.17
sample13 10 1.14 23.90 -29.68 48.80
sample14 10 -9.44 27.63 -47.12 32.19
sample15 10 -6.20 24.50 -51.34 33.58

...

sample999 10 -6.33 22.38 -31.03 36.77
sample1000 10 15.22 22.12 -19.57 59.47

Sampling Distribution of the Mean
Descriptives for 1000 sample means of N=10
n mean sd median min max skew kurtosis
mean 1000 0.02 7.48 -0.14 -27.5 22.25 -0.03 0.09

NOTE: In your research, you don’t form a sampling
distribution. You (typically) only have one sample.

Raw Score Distribution vs. Sampling Distribution
NOTE: The distinction between the raw score distribution and the
sampling distribution is very important to keep clear in your
mind!

Sampling Distribution of the Mean
What will the mean of the sample means be? In other
words, what is the mean of the sampling distribution?
The mean of the sample means (i.e., the mean of the
sampling distribution) will equal the population mean of
raw scores on the dependent measure. This is
important because it indicates that the sample mean is an
unbiased estimator of the population mean.

Sampling Distribution of the Mean
The mean is an unbiased estimator:
The mean of the sample means will equal the mean of the
population. Therefore, individual sample means will neither
systematically underestimate nor overestimate the population mean.
Raw Ocean Liking scores
n mean sd median min max skew kurtosis
Like0 10000 0 23.67 0.1 -86.64 84.46 0 -0.03

Sample (N=10) means


n mean sd median min max skew kurtosis
mean 1000 0.02 7.48 -0.14 -27.5 22.25 -0.03 0.09

The sample variance (s²; with n-1 denominator) is also an
unbiased estimator of the population variance (σ²). In other
words, the mean of the sample s²’s will approximate the
population variance. The sample s, however, is negatively
biased: on average it underestimates σ.

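The claims about bias can be checked by simulation. The sketch below is a rough check under stated assumptions: samples are drawn directly from a normal distribution with SD 23.67 rather than from the real population, and the variable names are illustrative.

```r
set.seed(3)
sigma <- 23.67   # population SD (value taken from the slides)
n <- 10          # sample size

# 20,000 hypothetical samples of size n, one per column
samples <- replicate(20000, rnorm(n, mean = 0, sd = sigma))

s2 <- apply(samples, 2, var)   # var() uses the n - 1 denominator
s <- sqrt(s2)                  # sample standard deviations

ratio_var <- mean(s2) / sigma^2   # close to 1: s^2 is unbiased
ratio_sd <- mean(s) / sigma       # below 1: s underestimates sigma on average
c(ratio_var = ratio_var, ratio_sd = ratio_sd)
```

The shortfall in `ratio_sd` shrinks as `n` grows, which is why the bias in s matters most for small samples.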
Sampling Distribution of the Mean
Will all of the sample means be the same?
No, there was a distribution of means that varied from each
other. The mean of the sampling distribution was the
population mean, but the standard deviation was not zero.
n mean sd median min max skew kurtosis
mean 1000 0.02 7.48 -0.14 -27.5 22.25 -0.03 0.09

Standard Error (SE)
The standard deviation of the sampling distribution of the mean
(i.e., the standard deviation of the infinite set of possible
sample means) is equal to:

SE = σ / √Nsample

where σ is the standard deviation of the population of raw
scores.
This variability in the sampling distribution is due to
sampling error.
Therefore, because we use sample statistics (parameter
estimates) to estimate population parameters, we would like
to minimize sampling error.
The standard deviation of the sampling distribution of a
parameter estimate has a technical name. It is called the
standard error of the statistic. Here, we are talking about the
standard error of the mean.

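A quick way to see that the formula matches the spread of sample means is to compare it with a simulation. This is a minimal sketch, assuming normally distributed scores with σ = 23.67 and N = 10. The names `se_formula` and `se_empirical` are illustrative.

```r
set.seed(4)
sigma <- 23.67   # population SD of raw scores
n <- 10          # sample size

# Standard error from the formula: sigma / sqrt(N)
se_formula <- sigma / sqrt(n)

# Standard deviation of 10,000 simulated sample means
sample_means <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))
se_empirical <- sd(sample_means)

round(c(formula = se_formula, simulated = se_empirical), 2)
```

The formula value is about 7.49, which agrees with the SD of 7.48 reported for the 1000 sample means in the slides.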
Standard Error
What factors affect the size of the sampling error of the mean
(i.e., the standard error)?

SE = σ / √Nsample

The standard deviation of the population raw scores (σ) and the
sample size (Nsample).

Factors that Affect the Standard Error (SE)

Variation among raw scores for a variable in the population
is broadly caused by two factors. What are they?
(a) Individual differences
(b) Measurement error (the opposite of reliability)

What is the relationship between population variability (σ)
and SE?
As the variability of the variable increases in the population,
the SE increases.

What would happen to SE if there were no variation in
population scores?
The SE would be 0 because no matter which participants you
sampled, they would all have the same scores.

Factors that Affect the Standard Error (SE)
What is the relationship between sample size and SE?
As the sample size increases, the SE for the statistic will
decrease.

What would the SE be if the sample size = population size?
If the sample contained ALL participants from the
population, the SE would be equal to 0 because each sample
mean would have exactly the same value as the overall
population mean (because every sample would contain the same
scores).

What would happen if the samples contained only 1
participant?
If each sample contained only 1 participant, the SE would be
equal to the variation (σ) observed within the population.

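These relationships can be explored numerically. The helper names below (`se_mean`, `se_mean_fpc`) are hypothetical. Note that the simple formula σ / √N assumes an effectively infinite population, so it never reaches exactly 0. The second function adds the standard finite population correction for sampling without replacement, which the slides do not cover, to show the SE reaching 0 when the sample is the whole population.

```r
# Standard error of the mean for an effectively infinite population
se_mean <- function(sigma, n) sigma / sqrt(n)

# Same, with the finite population correction for a population of size
# N_pop sampled without replacement (an addition, not from the slides)
se_mean_fpc <- function(sigma, n, N_pop) {
  se_mean(sigma, n) * sqrt((N_pop - n) / (N_pop - 1))
}

se_mean(23.67, c(1, 10, 100, 1000))   # SE shrinks as sample size grows
se_mean(c(0, 10, 23.67), 10)          # SE grows with sigma; 0 when sigma = 0
se_mean_fpc(23.67, 10000, 10000)      # 0 when the sample is the whole population
se_mean_fpc(23.67, 1, 10000)          # equals sigma when each sample has 1 participant
```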
Shape of the Sampling Distribution
Central Limit Theorem:
The shape of the sampling distribution approaches normal
as N increases.
It is roughly normal even for moderate sample sizes, assuming
that the original distribution isn’t really weird (i.e.,
extremely non-normal).
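The Central Limit Theorem can be illustrated by sampling from a clearly skewed distribution and watching the skew of the sample means shrink as N grows. This is a sketch under assumptions: an exponential distribution stands in for a skewed population, and `skewness` is a small helper defined here rather than a base R function.

```r
set.seed(6)

# Simple moment-based skewness (helper defined for this example)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

sizes <- c(2, 10, 50)

# For each sample size, simulate 5000 sample means from an exponential
# population and measure the skew of those means
skews <- sapply(sizes, function(n) {
  skewness(replicate(5000, mean(rexp(n))))
})
names(skews) <- paste0("N=", sizes)
round(skews, 2)
```

The skew values should fall toward 0 as N increases, meaning the sampling distribution looks more normal even though the raw scores do not.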

Normal Pop and Various Sampling Distributions
NOTES: Population size = 100,000; Simulated 10,000 samples

Uniform Pop and Various Sampling Distributions

Skewed Pop and Various Sampling Distributions
NOTE: x-axis scale changes across figures on this slide

An Important Normal Distribution: Z-scores
Z-scores are normally distributed scores with a mean of 0
and a standard deviation of 1.
You can therefore think of a z-score as telling you the
position of the score in terms of standard deviations above
(positive z) or below (negative z) the mean.
The probability distribution is known for z-scores.

Tail areas shown in the figures: 16% in each tail, 2.5% in each
tail, 0.5% in each tail.

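Because the z distribution is known, tail probabilities can be looked up with `pnorm()`. The tail percentages quoted for the z distribution (16%, 2.5%, 0.5%) appear to correspond to cutoffs of about 1, 1.96, and 2.58 standard deviations. That mapping is an assumption, since the cutoffs are not stated in the text. `z_score` is a hypothetical helper.

```r
# Area in one tail beyond selected z cutoffs (cutoffs are assumed)
cutoffs <- c(1, 1.96, 2.58)
tail_area <- pnorm(cutoffs, lower.tail = FALSE)
round(tail_area, 3)   # 0.159 0.025 0.005

# Converting a score to a z-score: distance from the mean in SD units
z_score <- function(x, mean, sd) (x - mean) / sd
z_score(2.4, mean = 0, sd = 7.5)   # 0.32
```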
Probability of Parameter estimate given H0
How could you use the z-score distribution to determine
the probability of obtaining a sample mean (parameter
estimate) of 2.40 if you draw a sample of N=10 from a
population of Ocean Liking scores with a population mean
(parameter) of 0?

Think about it……

Hypothetical Sampling Distribution for H0
If H0 is true, the sampling distribution has a mean of 0 and a
standard deviation of σ / √Nsample = 23.7 / √10 = 7.5

Hypothetical Sampling Distribution for H0
If H0 is true and this is the sampling distribution (in blue),
how likely is it to get a sample mean of 2.4 or more extreme?
Pretty likely...
But we can do better than that...

Our first inferential test: the z-test
z = (2.4 – 0) / 7.5 = 0.32; p = .749

> pnorm(0.32, mean=0, sd=1, lower.tail=FALSE) * 2
[1] 0.7489683

Figure: 37.4% of the sampling distribution lies in each tail
beyond z = ±0.32.

t vs. z
z = (2.4 – 0) / 7.5 = 0.32

Where did we get the 2.4 from in our z test?
Our sample mean from our study. This is our parameter
estimate of the population mean of OLS (Like0) scores.

Where did we get the 0 from in our z test?
This is the mean of the sampling distribution of the OLS sample
mean if H0 is true.

Where did we get the 7.5 from in our z test and what is the
problem with this?
This was the standard deviation of the sampling distribution
(the standard error): σ / √Nsample.
The problem is that we do not know σ.

t vs. z
How can we estimate σ?
We can use our sample standard deviation (s), but s is a
negatively biased parameter estimate. On average, it will
underestimate σ.

So what do we do?
We account for this underestimation of σ, and therefore of the
standard deviation (standard error) of the sampling
distribution, by using the t distribution rather than the z
distribution to calculate the probability of our parameter
estimate if H0 is true.

The t distribution is slightly wider, particularly for small
sample sizes, to correct for our underestimate of the
standard deviation.

Our second inferential test: t-test
t(df) = (Parameter estimate – Parameter under H0) /
        (Standard error of parameter estimate)

where the SE is estimated using s from the sample data.

df = N – P = 10 – 1 = 9 (N is the sample size; P is the number of
estimated parameters, here 1)

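The t-test can be worked through by hand from the summary statistics reported for the first sample (mean 2.4, s 24.97, N 10). This is a sketch of the arithmetic, not the course's own function. The variable names are illustrative, and the two-tailed p-value follows the same pattern as the z-test.

```r
# Summary statistics for Sample1 as reported in the slides
x_bar <- 2.4    # sample mean (parameter estimate)
s <- 24.97      # sample standard deviation
n <- 10         # sample size
mu0 <- 0        # value of the parameter under H0

se_hat <- s / sqrt(n)              # standard error estimated from s
t_stat <- (x_bar - mu0) / se_hat   # about 0.30
df <- n - 1                        # N - P with P = 1

# Two-tailed p-value from the t distribution with df degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
c(t = t_stat, df = df, p = p_value)
```

With the raw scores available, base R's `t.test(x, mu = 0)` performs the same one-sample test in a single call.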
t vs. z
The bias in s decreases with increasing N. Therefore, t
approaches z with larger sample sizes.

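One way to see t approaching z is to compare two-tailed .05 critical values across degrees of freedom. The df values below are arbitrary choices for illustration.

```r
dfs <- c(9, 29, 99, 999)

t_crit <- qt(0.975, df = dfs)   # two-tailed .05 critical values of t
z_crit <- qnorm(0.975)          # corresponding z critical value (about 1.96)

# The gap between t and z shrinks as df increases
round(rbind(df = dfs, t_crit = t_crit, gap = t_crit - z_crit), 3)
```

With df = 9 the critical t is about 2.26, noticeably larger than 1.96, while with df = 999 the two are nearly identical.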
Null Hypothesis Significance Testing (NHST)
1. Divide reality regarding the size of the population parameter
into two non-overlapping possibilities (null hypothesis and
alternative hypothesis).

2. Assume that the null hypothesis is true.

3. Collect data.

4. Calculate the probability (p-value) of obtaining your parameter
estimate (or a more extreme estimate) given your assumption
(i.e., that the null hypothesis is true).

5. Compare this probability to some cut-off value (alpha level).

6. (a) If the parameter estimate is less probable than the cut-off
value, reject the null hypothesis in favor of the alternative
hypothesis.

6. (b) If the parameter estimate is not less probable than the
cut-off value, fail to reject the null hypothesis.
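
The NHST steps can be traced in code. This is a minimal sketch, assuming a simulated population in place of the real scores, an alpha of .05, and base R's `t.test()` for the p-value. The variable names are illustrative.

```r
set.seed(10)
# Hypothetical population standing in for the real scores (an assumption)
pop <- rnorm(10000, mean = 0, sd = 23.67)

mu0 <- 0        # Steps 1-2: H0: mu = 0 vs. Ha: mu != 0; assume H0 is true
alpha <- 0.05   # Step 5: cut-off value, chosen before looking at the data

x <- sample(pop, 10)            # Step 3: collect data (one sample of N = 10)
result <- t.test(x, mu = mu0)   # Step 4: p-value given that H0 is true

# Step 6: compare the p-value with alpha and decide
decision <- if (result$p.value < alpha) "reject H0" else "fail to reject H0"
decision
```

Whatever the decision, it is probabilistic: with alpha = .05, a true null hypothesis would still be rejected in about 5% of samples.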
