Sampling Error and Confidence Interval

Haomin Yang
School of Public Health
Fujian Medical University
Content
• Sampling error and sampling distribution
• Central limit theorem
• Standard error
• t distribution
• Point estimation
• Confidence interval estimation
Normal distribution

• Any normal distribution can be transformed into the standard normal distribution.

If $X \sim N(\mu, \sigma^2)$, then $Z = \dfrac{X - \mu}{\sigma} \sim N(0, 1)$.
Critical value

$Z_{\alpha/2}$: two-sided critical value
$Z_{\alpha}$: one-sided critical value

Critical value    Area of one tail    Area of two tails
1.645             0.05                0.10
1.960             0.025               0.05
2.576             0.005               0.01
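The table above can be reproduced numerically. The following is a minimal sketch using Python's standard library; the helper name `z_critical` is an illustrative assumption, not part of the slides.

```python
from statistics import NormalDist

# Standard normal distribution N(0, 1)
std_normal = NormalDist(mu=0.0, sigma=1.0)

def z_critical(tail_area: float) -> float:
    """Return z such that the upper-tail area P(Z >= z) equals tail_area."""
    return std_normal.inv_cdf(1.0 - tail_area)

# A one-tail area of alpha/2 corresponds to a two-tail area of alpha.
for one_tail in (0.05, 0.025, 0.005):
    z = z_critical(one_tail)
    print(f"Z = {z:.3f}  one tail = {one_tail}  two tails = {2 * one_tail}")
```

The same value serves as a one-sided critical value for tail area α and as a two-sided critical value for total tail area 2α.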
The relationship between the population and sample

(Diagram: Population (the complete set) → sampling → Sample (a subset of the population); Sample → inference → Population.)

Samples are taken from populations to provide estimates of population parameters. We then use the sample data to make an inference about the population.
Sampling error

• The difference between statistics from different samples, as well as the difference between a sample statistic and the population parameter, is called sampling error.
Sampling distribution

• The sampling distribution is the probability distribution of a statistic (usually the mean) obtained from a large number of samples, each of size n, drawn from a specific population.
Central limit theorem
• For simple random samples of n observations taken from a population with mean μ and standard deviation σ, regardless of the population's distribution, provided the sample size n is sufficiently large, the distribution of the sample mean will be approximately normal, with mean equal to μ and standard deviation equal to the population standard deviation divided by the square root of the sample size.
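A small simulation can make the theorem concrete. This is a sketch under assumptions chosen for illustration only: a skewed exponential population with mean 2 (so σ = 2), samples of size 50, and 5000 repetitions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

MU = 2.0           # population mean of the assumed exponential population
SIGMA = 2.0        # for an exponential distribution, sigma equals the mean
N = 50             # size of each sample
N_SAMPLES = 5000   # number of repeated samples

def sample_mean(n: int) -> float:
    """Draw one simple random sample of size n and return its mean."""
    sample = [random.expovariate(1.0 / MU) for _ in range(n)]
    return statistics.fmean(sample)

means = [sample_mean(N) for _ in range(N_SAMPLES)]

mean_of_means = statistics.fmean(means)   # should be close to MU
sd_of_means = statistics.stdev(means)     # should be close to SIGMA / sqrt(N)
theoretical_se = SIGMA / N ** 0.5

print(f"mean of sample means: {mean_of_means:.3f} (mu = {MU})")
print(f"sd of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {theoretical_se:.3f})")
```

Even though the population is strongly skewed, the sample means cluster around μ with a spread close to σ/√n.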
Standard error

• The variation of the sample mean, that is, the standard deviation of the sample mean, is called the standard error of the mean (SE), denoted by $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$.
• In practice, the population standard deviation σ is usually unknown and is replaced by the sample standard deviation s, which gives an approximate standard error:

$\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}} \approx S_{\bar{X}} = \dfrac{S}{\sqrt{n}}$
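As a quick illustration of the estimated standard error, here is a minimal sketch. The function names and the four-value sample are hypothetical; the summary values s = 4.75 and n = 120 are taken from the height example later in these slides.

```python
import math
import statistics

def standard_error(sample):
    """Estimated standard error of the mean: s / sqrt(n)."""
    s = statistics.stdev(sample)          # sample SD (n - 1 in the denominator)
    return s / math.sqrt(len(sample))

def se_from_summary(s, n):
    """Same quantity when only the sample SD and sample size are reported."""
    return s / math.sqrt(n)

print(standard_error([4.0, 6.0, 8.0, 10.0]))   # hypothetical raw data
print(round(se_from_summary(4.75, 120), 4))    # height example: s = 4.75, n = 120
```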
t distribution

• If the population standard deviation σ is unknown and is replaced by the sample standard deviation s, the Z transformation becomes the t transformation:

$Z = \dfrac{\bar{X} - \mu}{\sigma_{\bar{X}}} \;\longrightarrow\; t = \dfrac{\bar{X} - \mu}{S_{\bar{X}}}, \qquad S_{\bar{X}} = \dfrac{S}{\sqrt{n}}$

• It was developed by William Sealy Gosset under the pseudonym Student.
• The t value follows the t distribution with ν degrees of freedom:

$t = \dfrac{\bar{X} - \mu}{S_{\bar{X}}}, \qquad t \sim t(\nu), \qquad \nu = n - 1$

(Figure: t distribution curves compared with the standard normal distribution.)
The properties of t distribution:

• The t distribution is a family of curves, each determined by its degrees of freedom ν.
• The t distribution is bell-shaped and symmetric about 0.
• As the degrees of freedom increase, the peak of the curve becomes higher, the base becomes narrower, and the t distribution approaches the standard normal distribution. The t distribution with ν = ∞ is the standard normal distribution.
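These properties can be checked numerically from the density of the t distribution. The sketch below uses the standard closed-form t density; the function names are illustrative assumptions.

```python
import math

def t_pdf(x: float, df: float) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def normal_pdf(x: float) -> float:
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Peak height (x = 0) rises and tail height (x = 3) falls as df grows,
# both moving toward the standard normal values.
for df in (1, 5, 30, 100):
    print(f"df = {df:>3}: peak = {t_pdf(0, df):.4f}, tail = {t_pdf(3, df):.4f}")
print(f"normal  : peak = {normal_pdf(0):.4f}, tail = {normal_pdf(3):.4f}")
```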
Area under t distribution curve

One-sided critical value: $t_{\alpha,\nu}$

$P(t \le -t_{\alpha,\nu}) = \alpha, \qquad P(t \ge t_{\alpha,\nu}) = \alpha$

Two-sided critical value: $t_{\alpha/2,\nu}$

$P(|t| \ge t_{\alpha/2,\nu}) = P(t \le -t_{\alpha/2,\nu}) + P(t \ge t_{\alpha/2,\nu}) = \alpha$
$P(-t_{\alpha/2,\nu} < t < t_{\alpha/2,\nu}) = 1 - \alpha$
Example

• When ν = 9 and the one-sided probability α = 0.05, what is $t_{\alpha,\nu}$?

$t_{0.05,9} = 1.833$, so $P(t \le -1.833) = 0.05$ and $P(t \ge 1.833) = 0.05$.

• When ν = 9 and the two-sided probability α = 0.05, what is $t_{\alpha/2,\nu}$?

$t_{0.05/2,9} = 2.262$, so $P(|t| \ge 2.262) = P(t \le -2.262) + P(t \ge 2.262) = 0.05$.
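The tabulated values can be cross-checked without a t table. The following sketch is one possible approach, not the method used in the slides: it integrates the t density with Simpson's rule and finds the critical value by bisection. All function names are illustrative.

```python
import math

def t_pdf(x: float, df: float) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x: float, df: float, steps: int = 1000) -> float:
    """P(T <= x), using Simpson's rule on [0, |x|] and symmetry about 0."""
    a = abs(x)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # area between 0 and |x|
    return 0.5 + area if x >= 0 else 0.5 - area

def t_critical(upper_tail: float, df: float) -> float:
    """Find t such that P(T >= t) = upper_tail, by bisection on [0, 50]."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1.0 - t_cdf(mid, df) > upper_tail:
            lo = mid                   # tail still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_critical(0.05, 9), 3))    # one-sided, alpha = 0.05   -> 1.833
print(round(t_critical(0.025, 9), 3))   # two-sided, alpha/2 = 0.025 -> 2.262
```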


The Difference Between a T Distribution and a Normal Distribution
• Normal distributions are used when the population distribution is assumed to be normal. The T distribution is similar to the normal distribution, just with fatter tails. Both assume a normally distributed population. T distributions have higher kurtosis than normal distributions. The probability of getting values very far from the mean is larger with a T distribution than with a normal distribution.
KEY TAKEAWAYS
• The T distribution is a continuous probability distribution of the z-score when the estimated standard deviation is used in the denominator rather than the true standard deviation.
• The T distribution, like the normal distribution, is bell-shaped and symmetric, but it has heavier tails, which means it tends to produce values that fall far from its mean.
• T-tests are used in statistics to estimate significance.
Exercise

• The total area under a t-curve is:
A. 10
B. 100
C. 1/2
D. 1

Exercise

• A t-curve is symmetrical around:
A. Its standard deviation
B. 100
C. 10
D. 0

Exercise

• A t-curve never:
A. Touches the y-axis
B. Touches the vertical axis
C. Looks like the standard normal distribution curve
D. Touches the horizontal axis
Example

• The CEO of a light bulb manufacturing company claims that an average light bulb lasts 300 days. A researcher randomly selects 15 bulbs for testing. The sampled bulbs last an average of 290 days, with a standard deviation of 50 days. If the CEO's claim were true, what is the probability that 15 randomly selected bulbs would have an average life of no more than 290 days?
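One way to work this example is to apply the t transformation with ν = 15 − 1 = 14 and evaluate the lower-tail area. The sketch below assumes the population of bulb lifetimes is approximately normal and evaluates the t distribution by numerical integration; the helper functions are illustrative, not from the slides.

```python
import math

def t_pdf(x: float, df: float) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x: float, df: float, steps: int = 1000) -> float:
    """P(T <= x), using Simpson's rule on [0, |x|] and symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

mu0 = 300.0    # mean claimed by the CEO
xbar = 290.0   # observed sample mean
s = 50.0       # sample standard deviation
n = 15         # sample size

se = s / math.sqrt(n)            # estimated standard error S / sqrt(n)
t_value = (xbar - mu0) / se      # t transformation
p = t_cdf(t_value, n - 1)        # P(T <= t) with nu = n - 1 = 14

print(f"t = {t_value:.3f}, P(sample mean <= 290) = {p:.3f}")
```

Under these assumptions the probability comes out at a little under one quarter, so a sample mean of 290 days would not be unusual if the claim were true.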


Exercise
A chemical engineer has the following results for the active ingredient yields from 16 pilot batches processed under a retorting procedure: $\bar{X}$ = 32 grams/liter, s = 3. Determine the approximate probability of getting a result this rare or rarer if the true mean yield is 30.5 grams/liter.
Exercise

• Suppose that T has the t distribution with n = 10 degrees of freedom. For each of the following, compute the true value using Excel and then compute the normal approximation. Compare the results.

a. P(−0.8 < T < 1.2)
b. The 90th percentile of T.
Parameter estimation & confidence interval

Statistical analysis
• Statistical description: numerical variable, nominal variable
• Statistical inference: parameter estimation, hypothesis testing
Parameter
• Given a model, the parameters are the numbers that yield the actual distribution.

In the real world you often don't know the “true” parameters, but you get to observe data. Next, we will explore how we can use data to estimate the model parameters.
Statistics as Estimators

• We use sample data to compute statistics.
• The statistics estimate population values, e.g., the sample mean $\bar{X}$ estimates the population mean $\mu$.
• An estimator is a method for producing a best guess about a population value.
• An estimate is a specific value provided by an estimator.
• We want good estimates. What is a good estimator? What properties should it have?
Parameter estimation
• Using sample statistics to estimate a population parameter is called parameter estimation.
• There are two types of estimation for a population parameter: point estimation and interval estimation.
Point estimation

• Directly use a sample statistic to estimate the population parameter.
• The point estimate of the population mean μ is the sample mean $\bar{X}$.
• The point estimate of the population standard deviation σ is the sample standard deviation s.
• Point estimation represents our best “determination” of the parameter.
• However, it does not express the uncertainty in the estimation.
• Point estimation does not consider sampling error.
A best estimator

• Unbiased: a sample statistic whose expected value is equal to the parameter.
  o The mean of the sampling distribution of $\bar{X}$ is equal to $\mu$.
• Minimum variance: no other sample statistic based on random samples of size n has a sampling distribution with a smaller variance.
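Both properties can be illustrated by simulation. The sketch below is an illustration under assumed settings (a normal population with μ = 10, σ = 2, samples of size 25); the sample median is brought in only as a competing estimator for comparison and does not appear in the slides.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

MU, SIGMA = 10.0, 2.0   # assumed "true" population parameters
N = 25                  # sample size
REPS = 4000             # number of repeated samples

means, medians = [], []
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

# Unbiasedness: the average of the sample means should be close to MU.
bias_of_mean = statistics.fmean(means) - MU

# Minimum variance: for a normal population the sample mean varies less
# from sample to sample than the sample median does.
var_of_mean = statistics.variance(means)
var_of_median = statistics.variance(medians)

print(f"bias of sample mean:       {bias_of_mean:+.4f}")
print(f"variance of sample mean:   {var_of_mean:.4f}")
print(f"variance of sample median: {var_of_median:.4f}")
```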
Methods of Point Estimation

1. Method of Moments
2. Maximum Likelihood
3. Bayesian

Law of large numbers: Let M1, M2, ... be independent random variables having a common distribution with mean µ. Then the sample means converge to the distributional mean as the number of observations increases.
Interval estimation

• An interval estimate of a parameter is an interval or a range of values used to estimate the parameter.
• A confidence interval (CI) is a specific interval estimate of a parameter determined by using data obtained from a sample.
Confidence interval (CI)

• An interval constructed so that it contains the population mean with probability 1−α is called a 1−α confidence interval (CI) of the population mean.
• The 1−α CI is denoted by (A, B), where A is the lower confidence limit and B is the upper confidence limit.
• The CI is an open interval.
• In practice, 1−α is usually 90%, 95%, or 99%.
Methods to calculate the confidence interval

• Parametric interval estimation
  o Method of normal distribution: n is large
  o Method of t distribution: n is small
• Nonparametric interval estimation (bootstrap)
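Method of normal distribution

$\bar{X} \sim N\!\left(\mu, \dfrac{\sigma^2}{n}\right)$

$Z = \dfrac{\bar{X} - \mu}{\sigma_{\bar{X}}}, \qquad Z \sim N(0, 1)$

(Figure: standard normal curve with central area 1−α between $-Z_{\alpha/2}$ and $Z_{\alpha/2}$, and area α/2 in each tail.)

$P(-Z_{\alpha/2} < Z < Z_{\alpha/2}) = 1 - \alpha$

$P\!\left(-Z_{\alpha/2} < \dfrac{\bar{X} - \mu}{\sigma_{\bar{X}}} < Z_{\alpha/2}\right) = 1 - \alpha$

$P(-Z_{\alpha/2}\,\sigma_{\bar{X}} < \bar{X} - \mu < Z_{\alpha/2}\,\sigma_{\bar{X}}) = 1 - \alpha$

$P(\bar{X} - Z_{\alpha/2}\,\sigma_{\bar{X}} < \mu < \bar{X} + Z_{\alpha/2}\,\sigma_{\bar{X}}) = 1 - \alpha$

The 1−α CI is $(\bar{X} - Z_{\alpha/2}\,\sigma_{\bar{X}},\ \bar{X} + Z_{\alpha/2}\,\sigma_{\bar{X}})$ or, when σ is replaced by S, $(\bar{X} - Z_{\alpha/2}\,S_{\bar{X}},\ \bar{X} + Z_{\alpha/2}\,S_{\bar{X}})$.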
• When the sample size is large, say n greater than or equal to 100, not only is the sampling distribution of sample means well approximated by the normal distribution, but the sample standard deviation, s, is also a reliable estimate of the population standard deviation, σ, which is usually unknown.
Example

• The average height of 120 seven-year-old boys was 123.62 cm and the standard deviation was 4.75 cm. Calculate the 90% confidence interval of the population mean height for seven-year-old boys.
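Answer

$n = 120,\ \bar{X} = 123.62,\ S = 4.75,\ Z_{\alpha/2} = Z_{0.10/2} = 1.645$

$(\bar{X} - Z_{\alpha/2}\,S_{\bar{X}},\ \bar{X} + Z_{\alpha/2}\,S_{\bar{X}})$
$= \left(123.62 - 1.645 \times \dfrac{4.75}{\sqrt{120}},\ 123.62 + 1.645 \times \dfrac{4.75}{\sqrt{120}}\right)$
$= (122.91,\ 124.33)$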
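The calculation above can be checked with a few lines of code. This is a minimal sketch; the function name `normal_ci` is an illustrative assumption.

```python
import math
from statistics import NormalDist

def normal_ci(mean: float, sd: float, n: int, confidence: float):
    """Large-sample CI for the mean: mean -/+ Z_{alpha/2} * sd / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin

low, high = normal_ci(123.62, 4.75, 120, 0.90)
print(f"({low:.2f}, {high:.2f})")   # (122.91, 124.33)
```

Calling the same function with `confidence=0.95` or `0.99` shows the interval widening as the confidence level rises.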
Method of t distribution

• However, for a small sample (n < 100), s may be far from σ, so the method of normal distribution cannot be used.
• Because σ is unknown for a small sample, we cannot apply the Z transformation to the sample mean.
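$\bar{X} \sim N\!\left(\mu, \dfrac{\sigma^2}{n}\right)$

$t = \dfrac{\bar{X} - \mu}{S_{\bar{X}}}, \qquad t \sim t(n - 1)$

(Figure: t distribution curve with central area 1−α between $-t_{\alpha/2,\nu}$ and $t_{\alpha/2,\nu}$.)

$P(-t_{\alpha/2,\nu} < t < t_{\alpha/2,\nu}) = 1 - \alpha$

$P\!\left(-t_{\alpha/2,\nu} < \dfrac{\bar{X} - \mu}{S_{\bar{X}}} < t_{\alpha/2,\nu}\right) = 1 - \alpha$

$P(-t_{\alpha/2,\nu}\,S_{\bar{X}} < \bar{X} - \mu < t_{\alpha/2,\nu}\,S_{\bar{X}}) = 1 - \alpha$

$P(\bar{X} - t_{\alpha/2,\nu}\,S_{\bar{X}} < \mu < \bar{X} + t_{\alpha/2,\nu}\,S_{\bar{X}}) = 1 - \alpha$

The 1−α CI is $(\bar{X} - t_{\alpha/2,\nu}\,S_{\bar{X}},\ \bar{X} + t_{\alpha/2,\nu}\,S_{\bar{X}})$.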
Example

• A random sample of 25 healthy adult males has an average pulse of 73.6 beats/min and a standard deviation of 6.5 beats/min. Calculate the 95% confidence interval of the population mean pulse for healthy adult males.
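Answer

$n = 25,\ \bar{X} = 73.6,\ S = 6.5,\ t_{\alpha/2,\nu} = t_{0.05/2,24} = 2.064$

$(\bar{X} - t_{\alpha/2,\nu}\,S_{\bar{X}},\ \bar{X} + t_{\alpha/2,\nu}\,S_{\bar{X}})$
$= \left(73.6 - 2.064 \times \dfrac{6.5}{\sqrt{25}},\ 73.6 + 2.064 \times \dfrac{6.5}{\sqrt{25}}\right)$
$= (70.92,\ 76.28)$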
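The t-based interval can be computed the same way once the critical value is known. The sketch below takes the tabulated value $t_{0.05/2,24} = 2.064$ as an input and uses the sample mean of 73.6 beats/min given in the example statement; the function name `t_ci` is an illustrative assumption.

```python
import math

def t_ci(mean: float, sd: float, n: int, t_crit: float):
    """Small-sample CI for the mean: mean -/+ t_{alpha/2, n-1} * sd / sqrt(n)."""
    margin = t_crit * sd / math.sqrt(n)
    return mean - margin, mean + margin

# t_crit = 2.064 is the table value for alpha/2 = 0.025 and nu = 24.
low, high = t_ci(mean=73.6, sd=6.5, n=25, t_crit=2.064)
print(f"({low:.2f}, {high:.2f})")   # (70.92, 76.28)
```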
How to interpret the 95% CI

• We can increase the confidence (accuracy) that the interval covers the population mean by increasing the confidence level 1−α. Increasing the confidence level, however, increases the width of the confidence interval (it decreases the precision).
• To improve the precision of the confidence interval, we can decrease the confidence level.
• To improve accuracy and precision at the same time, the only method is to increase the sample size.
• It is incorrect to interpret a 95% CI by saying that “there is a 95% probability that the population mean lies within the CI”.
• Either the population mean is in the interval or it is not. Once the interval has been calculated, this is deterministic rather than probabilistic.
• A 95% CI indicates that if we repeatedly draw 100 independent random samples from the same population and calculate a 95% confidence interval of the population mean from each sample, then in theory about 95 of the 100 intervals will contain the population mean and about 5 will not.
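The repeated-sampling interpretation can be demonstrated by simulation. The sketch below assumes a hypothetical normal population whose true mean and standard deviation are set to the pulse example's values (73.6 and 6.5), draws many samples of size 25, and counts how often the t-based 95% CI covers the true mean.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the run is reproducible

MU, SIGMA = 73.6, 6.5   # assumed "true" population mean and SD
N = 25                  # size of each sample
T_CRIT = 2.064          # table value t_{0.05/2, 24}
REPS = 10000            # number of repeated samples

covered = 0
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    xbar = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)
    # Does this sample's 95% CI contain the true mean?
    if xbar - T_CRIT * se < MU < xbar + T_CRIT * se:
        covered += 1

coverage = covered / REPS
print(f"proportion of intervals containing mu: {coverage:.3f}")
```

Each individual interval either contains μ or does not; the 95% describes the long-run proportion of intervals that do.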
• However, statisticians often describe confidence intervals in this way: the value of 0.95 is really the probability that the limits calculated from a random sample will include the population value. For 95% of the calculated confidence intervals, it will be true to say that the population mean, μ, lies within the interval.
• The problem is that with a single study we do not know which one of these 100 intervals we will obtain, and hence we do not know whether it includes μ. So we usually interpret a confidence interval as the range of values within which we are 95% confident that the true population mean lies.
Exercise
• What is meant by the term “90% confident” when constructing a confidence interval for a mean?
A. If we took repeated samples, approximately 90% of the samples would produce the same confidence interval.
B. If we took repeated samples, approximately 90% of the confidence intervals calculated from those samples would contain the sample mean.
C. If we took repeated samples, approximately 90% of the confidence intervals calculated from those samples would contain the true value of the population mean.
D. If we took repeated samples, the sample mean would equal the population mean in approximately 90% of the samples.
Exercise
Among various ethnic groups, the standard deviation of heights is known to be approximately three inches. We wish to construct a 95% confidence interval for the mean height of male Swedes. Forty-eight male Swedes are surveyed. The sample mean is 71 inches. The sample standard deviation is 2.8 inches.
1. a. $\bar{x}$ = ________
   b. σ = ________
   c. n = ________
2. In words, define the random variables X and $\bar{X}$.
3. Which distribution should you use for this problem? Explain your choice.
4. Construct a 95% confidence interval for the population mean height of male Swedes.
   a. State the confidence interval.
5. What will happen to the level of confidence obtained if 1,000 male Swedes are surveyed instead of 48? Why?

Answer
1. a. $\bar{x}$ = 71
   b. σ = 3
   c. n = 48
2. X is the height of a Swedish male, and $\bar{X}$ is the mean height from a sample of 48 Swedish males.
3. Normal. We know the standard deviation for the population, and the sample size is greater than 30.
4. CI: (70.15, 71.85)
5. The confidence interval will decrease in size, because the sample size increased. Recall that, when all other factors remain unchanged, an increase in sample size decreases variability. Thus, we do not need as large an interval to capture the true population mean.
Exercise
A pharmaceutical company makes tranquilizers. It is assumed that the distribution for the length of time they last is approximately normal. Researchers in a hospital used the drug on a random sample of nine patients. The effective period of the tranquilizer for each patient (in hours) was as follows: 2.7; 2.8; 3.0; 2.3; 2.3; 2.2; 2.8; 2.1; and 2.4.
1. a. $\bar{x}$ = __________
   b. $s_x$ = __________
   c. n = __________
   d. n − 1 = __________
2. Define the random variable X in words.
3. Define the random variable $\bar{X}$ in words.
4. Which distribution should you use for this problem? Explain your choice.
5. Construct a 95% confidence interval for the population mean length of time.
   a. State the confidence interval.
   b. Sketch the graph.
   c. Calculate the error bound.
6. What does it mean to be “95% confident” in this problem?

Answer
1. a. $\bar{x}$ = 2.51
   b. $s_x$ = 0.318
   c. n = 9
   d. n − 1 = 8
2. The effective length of time for a tranquilizer.
3. The mean effective length of time of tranquilizers from a sample of nine patients.
4. We need to use a Student's t distribution, because we do not know the population standard deviation.
5. a. CI: (2.27, 2.76)
   b. Check student's solution.
   c. EBM: 0.24
6. If we were to sample many groups of nine patients, 95% of the confidence intervals calculated from those samples would contain the true population mean length of time.
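The summary statistics and the interval for this exercise can be recomputed from the raw data. In this sketch the critical value $t_{0.05/2,8} = 2.306$ is an assumed standard table value that does not appear in the slides.

```python
import math
import statistics

# Effective period (hours) for the nine patients, from the exercise.
hours = [2.7, 2.8, 3.0, 2.3, 2.3, 2.2, 2.8, 2.1, 2.4]

n = len(hours)
xbar = statistics.mean(hours)     # sample mean
s = statistics.stdev(hours)       # sample SD (n - 1 in the denominator)

T_CRIT = 2.306                    # assumed table value t_{0.05/2, 8}
ebm = T_CRIT * s / math.sqrt(n)   # error bound for the mean
ci = (xbar - ebm, xbar + ebm)

print(f"n = {n}, mean = {xbar:.2f}, s = {s:.3f}")
print(f"EBM = {ebm:.3f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```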
Exercise
• The average height of young adult males has a normal distribution with a standard deviation of 2.5 inches. You want to estimate the mean height of students at your college or university to within one inch with 93% confidence. How many male students must you measure?
EBM = error bound for the mean
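One possible way to approach this kind of question is to solve EBM = $Z_{\alpha/2}\,\sigma/\sqrt{n}$ for n and round up. This rearrangement is not stated in the slides, and the function name `required_n` is an illustrative assumption.

```python
import math
from statistics import NormalDist

def required_n(sigma: float, ebm: float, confidence: float) -> int:
    """Smallest n with Z_{alpha/2} * sigma / sqrt(n) <= ebm (sigma known)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    return math.ceil((z * sigma / ebm) ** 2)  # round up to a whole subject

n_needed = required_n(sigma=2.5, ebm=1.0, confidence=0.93)
print(n_needed)
```

Rounding up rather than to the nearest integer keeps the error bound at or below the target.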
