7 Estimation

This document discusses statistical estimation: how to infer population parameters from sample data. It covers point and interval estimation, the properties of good estimators, the calculation of confidence intervals, the effect of sample size on the precision of estimates, and the use of the t-distribution for small samples when the population variance is unknown.

Yilma Chisha (MSc, Biostatistician)

AMU, CMHS, School of Public Health


Estimation
• Up until this point, we have assumed that
the values of the parameters of a
probability distribution are known.
• In the real world, the values of these
population parameters are usually not
known
• Instead, we must try to say something
about the way in which a random variable
is distributed using the information
contained in a sample of observations
• The process of drawing conclusions about
an entire population based on the data in a
sample is known as statistical inference.
• Methods of inference usually fall into one of
two broad categories: estimation or
hypothesis testing.
• For now, we will focus on using the
observations in a sample to estimate a
population parameter
Estimation
• Is concerned with estimating the values
of specific population parameters based
on sample statistics.
• is about using information in a sample to
make estimates of the characteristics
(parameters) of the source population.
Example
• A sample survey revealed:
– Proportion of smokers among a certain
group of population aged 15 to 24.
– Mean of SBP among sampled population
– Prevalence of HIV-positive among people
involved in the study

The next question is: what can we predict about the characteristics of the population from which the sample was drawn?
Estimation, Estimator & Estimate
♣ Estimation is the computation of a statistic
from sample data, often yielding a value that
is an approximation (guess) of its target, an
unknown true population parameter value.
♣ The statistic itself is called an estimator and
can be of two types - point or interval.
♣ The value or values that the estimator
assumes are called estimates.
• Two methods of estimation are commonly
used: point estimation and interval
estimation
• Point estimation involves the calculation
of a single number to estimate the
population parameter
• Interval estimation specifies a range of
reasonable values for the parameter
Point versus Interval Estimators
♣ An estimator that represents a "single
best guess" is called a point estimator.
♣ When the estimate is of the form of a
"range of plausible values", it is called
an interval estimator.
 Thus,
– A point estimate is of the form: [ Value ],
– Whereas, an interval estimate is of the
form: [ lower limit, upper limit ]
The sample mean (x̄) is an unbiased estimator of the population mean: E(x̄) = µ.
Properties of good estimates
A. Unbiased Estimator
♣ A statistic is said to be an unbiased estimator of the corresponding population parameter if its average value, taken over its sampling distribution, is equal to the population parameter value.
♣ The "long run" average of the statistic is equal to the population parameter value.
♣ The sample mean and median are unbiased estimators of the population mean µ.
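The unbiasedness of the sample mean can be illustrated with a short simulation. The sketch below is in Python and assumes NumPy is available; the population values µ = 50 and σ = 10 are arbitrary illustrative choices, not data from these notes.

import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 50.0, 10.0, 25, 100_000   # illustrative values only

# Draw many random samples of size n and record each sample mean
sample_means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

# The "long run" average of the sample means should be close to µ
print("population mean µ       :", mu)
print("average of sample means :", round(sample_means.mean(), 2))   # close to 50, i.e. E(x̄) = µ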
B. Minimum Variance
Estimating the Sampling Error
• Any estimates derived from samples are
subject to the sampling error.
• This comes from the fact that only a part of
the population was observed, instead of the
whole.
• Different samples could have come up with different results. The amount of variation that exists among the estimates from the different possible samples is the sampling error.
• The set of sample means in repeated random samples of size n from a given population has variance σ²/n.

• The standard deviation of this set of sample means is σ/√n and is referred to as the standard error of the mean (sem) or the standard error.

• The sem is estimated by s/√n if σ is unknown.
• The sampling error depends on the sample size (n), the variability of individual sample points (σ), and the sampling and estimation methods.
• As n increases, the sample mean (x̄) and the sample variance s² approach the values of the true population parameters, µ and σ², respectively.
Example
• Suppose that the mean ± sd of DBP on
20 old males is 78.5 ± 10.3 mm Hg.
1. What is our best estimate of µ ?
2. What is the sem?
3. Compare the sem with the sd.
• The following table gives the se for mean
of DBP for different sample sizes.

n sem
1 10.3
20 2.3
100 1.0

• Our best estimate of µ is 78.5 mm Hg.
• The sem of this estimate is 10.3/√20 = 2.3.
• The sem (2.3) is much smaller than the sd (10.3).
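As a quick check, the sem column of the table can be reproduced from sd = 10.3 (a minimal Python sketch):

import math

sd = 10.3                      # sample standard deviation of DBP (mm Hg)
for n in (1, 20, 100):
    sem = sd / math.sqrt(n)    # standard error of the mean = sd/√n
    print(n, round(sem, 1))    # prints 10.3, 2.3 and 1.0, matching the table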
1. Point Estimate
• A single numerical value used to estimate
the corresponding population parameter.
Sample statistics are estimators of population parameters:

Sample statistic                     Population parameter
Sample mean (x̄)                      µ
Sample variance (s²)                 σ²
Sample proportion (p̂)                P or π
Sample odds ratio (OR̂)               OR
Sample relative risk (RR̂)            RR
Sample correlation coefficient (r)   ρ
2. Interval Estimation
• Interval estimation specifies a range of
reasonable values for the population
parameter based on a point estimate.
• A confidence interval is a particular
type of interval estimator.
Confidence Intervals
• Give a plausible range of values of the
estimate likely to include the “true”
(population) value with a given confidence
level.
• An interval estimate provides more
information about a population
characteristic than does a point estimate
• Such interval estimates are called
confidence intervals.
• CIs also give information about the
precision of an estimate.
• How much uncertainty is associated with a
point estimate of a population parameter?
• When sampling variability is high, the CI
will be wide to reflect the uncertainty of the
observation.
• Wider CIs indicate less certainty.
• CIs can also answer the question of whether or not an association exists or a treatment is beneficial or harmful (analogous to p-values).
– e.g., if the CI of an odds ratio includes the value 1.0, we cannot be confident that exposure is associated with disease.
• A CI in general:
– Takes into consideration variation in
sample statistics from sample to sample
– Based on observation from 1 sample
– Gives information about closeness to
unknown population parameters
– Stated in terms of level of confidence
• Never 100% sure
General Formula:
The general formula for all CIs is:

point estimate ± (measure of how confident we want to be) × (standard error)

where:
– the point estimate is the value of the statistic in the sample (e.g., mean, odds ratio, etc.),
– the measure of how confident we want to be (the critical value) comes from a Z table or a t table, depending on the sampling distribution of the statistic, and
– the standard error is the standard error of the statistic.

Lower limit = Point Estimate - (Critical Value) x (Standard Error)

Upper limit = Point Estimate + (Critical Value) x (Standard Error)

• A wide interval suggests imprecision of estimation.
• A narrow CI reflects a large sample size, low variability, or both.

• Note: the measure of how confident we want to be = critical value = confidence coefficient
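The general formula translates directly into code. The helper below is a sketch in Python; the function name confidence_interval is ours, not from any statistical library, and it works for any statistic once a point estimate, critical value, and standard error are supplied.

def confidence_interval(point_estimate, critical_value, standard_error):
    """General CI: point estimate ± (critical value) × (standard error)."""
    margin = critical_value * standard_error
    return point_estimate - margin, point_estimate + margin

# e.g. a 95% CI for the DBP mean above: point estimate 78.5, SE 2.3, Z = 1.96
print(confidence_interval(78.5, 1.96, 2.3))   # roughly (74.0, 83.0)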
Confidence Level
• Confidence Level
– Confidence in which the interval will contain
the unknown population parameter
• A percentage (less than 100%)
– Example: 95%
• Also written (1 - α) = .95
Definition: 95% CI
1. Probabilistic interpretation:
• If all possible random samples (an infinite
number) of a given sample size (e.g. 10 or
100) were obtained and if each were used to
obtain its own CI, then 95% of all such CIs
would contain the unknown population
parameter; the remaining 5% would not.
• It is incorrect to say “There is a 95%
probability that the CI contains the unknown
population parameter”.
2. Practical interpretation
• When sampling is from a normally distributed population with known standard deviation, we are 100(1-α)% [e.g., 95%] confident that the single computed interval contains the unknown population parameter.
Estimation for Single Population
1. CI for a Single Population
Mean (normally distributed)
A. Known variance (large sample size)
• There are 3 elements to a CI:
1. Point estimate
2. SE of the point estimate
3. Confidence coefficient ( 1-alpha)
• Consider the task of computing a CI
estimate of μ for a population distribution
that is normal with σ known.
• Available are data from a random sample of
size = n.
Assumptions
– Population standard deviation (σ) is known
– Population is normally distributed
– If the population is not normal, use a large sample
• A 100(1-α)% CI for µ is:

x̄ ± Z(1-α/2) × σ/√n

• α is chosen by the researcher; the most common values of α are 0.05, 0.01 and 0.1.
• Commonly used confidence levels are 90%, 95%, and 99%.
Finding the Critical Value
• For a 95% CI the critical value is Z = 1.96; for a 90% CI it is Z = 1.645; for a 99% CI it is Z = 2.58.
Margin of Error
(Precision of the estimate)
• Margin of error = (critical value) × (standard error).
Factors Affecting Margin of Error

The width of the CI for the mean (the margin of error) is determined by n, s, and α:
– As n increases, the width of the CI decreases.
– As s increases, the width of the CI increases.
– As the confidence level increases (α decreases), the width of the CI increases.
Example:
1. Waiting times (in hours) at a particular hospital are believed to be approximately normally distributed with a variance of 2.25 hr².
a. A sample of 20 outpatients revealed a mean
waiting time of 1.52 hours. Construct the 95%
CI for the estimate of the population mean.
b. Suppose that the mean of 1.52 hours had
resulted from a sample of 32 patients. Find the
95% CI.
c. What effect does larger sample size have on
the CI?
a. 1.52 ± 1.96 × √(2.25/20) = 1.52 ± 1.96(0.33)
   = 1.52 ± 0.65 = (0.87, 2.17)
• We are 95% confident that the true mean waiting time is between 0.87
and 2.17 hrs.

• Although the true mean may or may not be in this interval, 95% of the
intervals formed in this manner will contain the true mean.

• An incorrect interpretation is that there is a 95% probability that this interval contains the true population mean.
b. 1.52 ± 1.96 × √(2.25/32) = 1.52 ± 1.96(0.27)
   = 1.52 ± 0.53 = (0.99, 2.05)

c. A larger sample size makes the CI narrower (more precise).
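Parts (a) and (b) can be verified with a few lines of Python (a sketch using only the values given in the exercise; small differences from the slide's rounded answers come from rounding the SE):

import math

sigma = math.sqrt(2.25)    # population SD of waiting time (hours)
z = 1.96                   # critical value for 95% confidence

for n in (20, 32):
    se = sigma / math.sqrt(n)                   # standard error of the mean
    lower, upper = 1.52 - z * se, 1.52 + z * se
    print(n, round(lower, 2), round(upper, 2))
# n = 20 gives about (0.86, 2.18); n = 32 gives about (1.00, 2.04)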
• When constructing CIs, it has been assumed that the standard deviation of the underlying population, σ, is known.
• What if σ is not known?
• In practice, if the population mean µ is unknown, then the population standard deviation, σ, is probably unknown as well.
• In this case, σ can be replaced by the sample standard deviation s, provided the sample size is large enough (n > 30). With a large sample size, we assume the sampling distribution of the mean is approximately normal.
• Example: It was found that a sample of 35 patients were
17.2 minutes late for appointments, on the average, with
SD of 8 minutes. What is the 90% CI for µ? Ans: (15.0,
19.4).
• Since the sample size is fairly large (>30) and the population SD is unknown, we assume, based on the CLT, that the sample mean is approximately normally distributed and use the sample SD s in place of the population σ.
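A sketch of this calculation in Python (1.645 is the standard normal critical value for 90% confidence):

import math

mean, sd, n = 17.2, 8.0, 35
se = sd / math.sqrt(n)      # sample SD replaces σ because n > 30
z90 = 1.645                 # Z value for a 90% CI
print(round(mean - z90 * se, 1), round(mean + z90 * se, 1))   # roughly (15.0, 19.4)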
B. Unknown variance
(small sample size, n ≤ 30)
• What if σ for the underlying population is unknown and the sample size is small?
• As an alternative, we use Student's t distribution.
Student’s t Distribution
• The t is a family of distributions
• Bell Shaped
• Symmetric about zero (the mean)
• Flatter than the Normal (0,1). This means
– The variability of a t is greater than that of a Z that
is normal(0,1)
– Thus, there is more area under the tails and less
at center
– Because variability is greater, resulting
confidence intervals will be wider.
• Note: t approaches z as n increases
What happens as
sample gets larger?
[Figure: t-distribution and standard normal Z distribution. The density curve of a t distribution with 60 d.f. is almost indistinguishable from the Z distribution.]

As the df gets larger, the Student's t-distribution looks more and more like the standard normal distribution with mean = 0 and variance = 1.
What happens to the CI as the sample gets larger?

x̄ ± Z × (s/√n)
x̄ ± t × (s/√n)

For large samples, the Z and t values become almost identical, so the CIs are almost identical.
Degrees of Freedom (df)
df = Number of observations that are free to vary after
sample mean has been calculated
df = n-1
Student’s t Table
t distribution values
• With comparison to the Z value
Example

• Standard error =
• t-value at 90% CL at 19 df =1.729
Exercise
• Compute a 95% CI for the mean birth
weight based on n = 10, sample mean =
116.9 oz and s =21.70.
• From the t Table, t9, 0.975 = 2.262
• Ans: (101.4, 132.4)
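A sketch of the exercise in Python; scipy.stats.t.ppf is used to look up t9, 0.975 (if SciPy is not available, 2.262 can be entered directly):

import math
from scipy import stats

n, xbar, s = 10, 116.9, 21.70
se = s / math.sqrt(n)                   # standard error of the mean
t_crit = stats.t.ppf(0.975, n - 1)      # df = n - 1, roughly 2.262
print(round(xbar - t_crit * se, 1), round(xbar + t_crit * se, 1))   # roughly (101.4, 132.4)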
2. CIs for single population proportion, p
• Is based on the three elements of a CI:
– Point estimate
– SE of point estimate
– Confidence coefficient
Lower limit = Point Estimate - (Critical Value) x (Standard Error of Estimate)

Upper limit = Point Estimate + (Critical Value) x (Standard Error of Estimate)

Hence,

p̂ ± 1.96 × √(p̂(1 − p̂)/n)

is an approximate 95% CI for the true proportion p.
Example 1
• A random sample of 100 people shows that
25 are left-handed. Form a 95% CI for the
true proportion of left-handers.
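A minimal sketch of the computation for Example 1 in Python:

import math

x, n = 25, 100
p_hat = x / n                               # sample proportion = 0.25
se = math.sqrt(p_hat * (1 - p_hat) / n)     # SE of the proportion, about 0.043
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se
print(round(lower, 3), round(upper, 3))     # roughly (0.165, 0.335)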
Interpretation
Changing the sample size
Example 2
• It was found that 28.1% of 153 cervical-cancer cases
had never had a Pap smear prior to the time of case’s
diagnosis. Calculate a 95% CI for the percentage of
cervical-cancer cases who never had a Pap test.


Example 3
• Suppose that among 10,000 female operating-room
nurses, 60 women have developed breast cancer over
five years. Find the 95% for p based on point estimate.
• Point estimate = 60/10,000 = 0.006
• The 95% CI for p is given by the interval:

0.006 ± 1.96 × √(0.006 × 0.994 / 10,000)

• The 95% CI for p is: 0.006 ± 0.0015 = (0.0045, 0.0075)
Estimation for Two Populations
3. CI for the difference between
population means (normally distributed)

A. Known variances (2 independent samples)


• When σ₁ and σ₂ are known and both populations are normal, or both sample sizes are at least 30, the critical value is a Z value and the CI for µ₁ − µ₂ is

(x̄₁ − x̄₂) ± Z(1−α/2) × √(σ₁²/n₁ + σ₂²/n₂)
Assumptions
• Samples are randomly and independently
drawn
• Population distributions are normal or
both sample sizes are ≥30
• Population standard deviations are known
Illustration
• A researcher performs a drug trial
involving two independent groups.
– A control group is treated with a placebo
while, separately;
– The intervention group is treated with an
active agent.
– Interest is in a comparison of the mean
control response with the mean
intervention response under the
assumption that the responses are
independent.
Examples
• We are interested in the similarity of the
two groups.
1) Is mean blood pressure the same for males and
females?
2) Is body mass index (BMI) similar for breast
cancer cases versus non-cancer patients?
3) Is length of stay (LOS) for patients in hospital “A”
the same as that for similar patients in hospital
“B”?
Example
• Researchers are interested in the difference between
serum uric acid levels in patients with and without
Down’s syndrome.
• Patients without Down's syndrome
– n = 12, sample mean = 4.5 mg/100 ml, σ² = 1.0
• Patients with Down's syndrome
– n = 15, sample mean = 3.4 mg/100 ml, σ² = 1.5
• Calculate the 95% CI.
• SE = √(1.0/12 + 1.5/15) = 0.43, 95% CI = 1.1 ± 1.96(0.43) = (0.26, 1.94)
• We are 95% confident that the true difference between the two population means is between 0.26 and 1.94 mg/100 ml.
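The example can be reproduced with a short Python sketch using only the summary statistics quoted above:

import math

n1, mean1, var1 = 12, 4.5, 1.0   # patients without Down's syndrome
n2, mean2, var2 = 15, 3.4, 1.5   # patients with Down's syndrome

diff = mean1 - mean2                     # point estimate = 1.1 mg/100 ml
se = math.sqrt(var1 / n1 + var2 / n2)    # SE, about 0.43
print(round(diff - 1.96 * se, 2), round(diff + 1.96 * se, 2))   # roughly (0.26, 1.94)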
B. Unknown variances
(Independent samples)
I. Population variances equal (large sample)
• Assumptions:
– Samples are randomly and independently drawn
– Both sample sizes are ≥30
– Population standard deviations are unknown
Forming confidence estimates:
• Use the sample standard deviation s to estimate σ, and
• the test statistic is a z-value
Example
• The mean CD4 + cells for 112 men with HIV
infection was 401.8 with a SD of 226.4. For 75
men without HIV, the mean and SD were 828.2
and 274.9, respectively. Calculate a 99% CI for
the difference between population means.
• SE of the difference b/n two means = 38.28
• 99% CI = 426.4 ± 2.58 (38.28)
= (327.6, 525.2)
II. Population variances equal (small
sample)
• Assumptions:
– Populations are normally distributed
– The populations have equal variances
– Samples are independent
– Both sample sizes are <30
– Population standard deviations are unknown

* If 0.5 ≤ s₁²/s₂² ≤ 2, then we assume that the population variances are equal.
Forming confidence estimates:
• The population variances are assumed equal, so use the two sample standard deviations and pool them to estimate σ.
• The test statistic is a t value with (n₁ + n₂ − 2) degrees of freedom.
• The pooled estimate (s²p) is the weighted average of the two sample variances:

s²p = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)

• The pooled standard deviation is: sp = √s²p
• The standard error of the estimate is given by:

SE = sp × √(1/n₁ + 1/n₂)
Example 1
• A study was conducted to compare the serum iron levels
of children with cystic fibrosis to those of healthy
children. Serum iron levels were measured for random
samples of n1 = 9 healthy children and n2 = 13 children
with cystic fibrosis.
• The two underlying populations of serum
iron levels are independent and normally
distributed.
A t-value at 95% CL with 20 df is
2.086
Example 2
• Birth weights of children born to 14 heavy
smokers (group 1) and to 15 non-smokers
(group 2) were sampled from live births at a
large teaching hospital. For the heavy
smokers, sample mean = 3.17 kg, SD =
0.46 and for non-smokers, sample mean =
3.63 kg and SD = 0.36.
• Sp = 0.4121, SE = 0.1531, t-value at 27 df = 2.05
• 95% CI = (0.14, 0.77)
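A sketch of the pooled-variance calculation for Example 2 in Python (scipy.stats.t.ppf supplies the critical value at 27 df; small differences from the slide's (0.14, 0.77) are due to rounding):

import math
from scipy import stats

n1, mean1, sd1 = 14, 3.17, 0.46   # heavy smokers
n2, mean2, sd2 = 15, 3.63, 0.36   # non-smokers

df = n1 + n2 - 2                                               # 27
sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)   # pooled SD, about 0.41
se = sp * math.sqrt(1 / n1 + 1 / n2)                           # about 0.153
t_crit = stats.t.ppf(0.975, df)                                # about 2.05
diff = mean2 - mean1                                           # 0.46 kg
print(round(diff - t_crit * se, 2), round(diff + t_crit * se, 2))   # roughly (0.15, 0.77)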
III. Population variances unequal (small sample)

• The confidence interval for µ₁ − µ₂ is:

(x̄₁ − x̄₂) ± t(d′) × √(s₁²/n₁ + s₂²/n₂)

• where the degrees of freedom (d′) are given by the Satterthwaite approximation:

d′ = (s₁²/n₁ + s₂²/n₂)² / [ (s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1) ]
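Because the numerical data for the example that follows are not reproduced in these notes, the sketch below is a generic unequal-variance (Welch) CI in Python; the function name welch_ci is ours, and scipy.stats.t.ppf is assumed available for the t critical value.

import math
from scipy import stats

def welch_ci(mean1, s1, n1, mean2, s2, n2, conf=0.95):
    """CI for µ1 - µ2 when the population variances are not assumed equal."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    se = math.sqrt(v1 + v2)
    # Satterthwaite approximation for the degrees of freedom d'
    d_prime = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, d_prime)
    diff = mean1 - mean2
    return diff - t_crit * se, diff + t_crit * se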
Example
• For a t distribution with 19 df at 95% CL t-
value is 2.093.
• Therefore, a 95% confidence interval
would take the form
• Using the data from two samples of
patients with tuberculosis meningitis, the
95% CI for μ1 − μ2 is
C. Paired Samples
 Tests Means of 2 Related Populations
∆ Paired or matched samples
∆ Repeated measures (before/after)
∆ Use difference between paired values:
d = x1-x2
 Eliminates variation among subjects
 Assumptions:
 Both populations are normally distributed,
 Or, if not normal, use large samples.
Paired Data
• Paired data arises when each individual
(more specifically, each unit of
measurement) in a sample is measured
twice.
• Measurement might be "pre/post”,
"before/after", “right/left, “parent/child”,
etc.
Examples of paired data
1) Blood pressure prior to and following
treatment,
2) Number of cigarettes smoked per week
measured prior to and following participation
in a smoking cessation program,
3) Number of sex partners in the month prior to
and in the month following an HIV education
campaign.
• Notice in each of these examples that the two
occasions of measurement are linked by
virtue of the two measurements being made
on the same individual.
• Longitudinal or follow-up study
Paired differences
• If two measurements of the same phenomenon (e.g. blood pressure, # cigarettes/week, etc.), X and Y, are measured on an individual and each is normally distributed, then their difference is also normally distributed.
• The interest is in the difference between the two measurements, d = x₁ − x₂.
• A 100(1 − α)% CI for the mean difference is:

d̄ ± t(α/2) × sd/√n

• where tα/2 has n − 1 df.
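A sketch of the paired-difference CI in Python (the function paired_ci is ours; it takes the paired before/after measurements for each subject and applies the formula above):

import math
from scipy import stats

def paired_ci(x1, x2, conf=0.95):
    """CI for the mean of the within-pair differences d = x1 - x2."""
    diffs = [a - b for a, b in zip(x1, x2)]
    n = len(diffs)
    d_bar = sum(diffs) / n                                        # mean difference
    sd = math.sqrt(sum((d - d_bar) ** 2 for d in diffs) / (n - 1))
    se = sd / math.sqrt(n)                                        # SE of the mean difference
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, n - 1)               # t with n-1 df
    return d_bar - t_crit * se, d_bar + t_crit * se

Once the ten pairs of initial and follow-up SBP readings from the example below are supplied, the same calculation gives the mean difference, its standard error, and the requested 95% CI.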
Example
• Ten hypertensive patients are screened at
a neighborhood health clinic and are given
methyl dopa, a strong antihypertensive
medication for their condition. They are
asked to come back 1 week later and have
their blood pressures measured again.
Suppose the initial and follow-up SBPs
(mm Hg) of the patients are given below.
1. What is the mean and sd of the
difference?
2. What is the standard error of the mean?
3. Assume that the difference is normally
distributed, construct a 95% CI for µ.
Answer
• We have the following data and summary statistics
4. Two Population Proportions
• We are often interested in comparing
proportions from 2 populations:
• Is the incidence of disease A the same in
two populations?
• Patients are treated with either drug D, or
with placebo. Is the proportion “improved”
the same in both groups?
Confidence Interval for Two Population Proportions
• SE of the difference = √( p̂₁(1 − p̂₁)/n₁ + p̂₂(1 − p̂₂)/n₂ )
• The confidence interval for p₁ − p₂ is:

(p̂₁ − p̂₂) ± Z(1−α/2) × √( p̂₁(1 − p̂₁)/n₁ + p̂₂(1 − p̂₂)/n₂ )

• An approximate 95% confidence interval takes the form:

(p̂₁ − p̂₂) ± 1.96 × √( p̂₁(1 − p̂₁)/n₁ + p̂₂(1 − p̂₂)/n₂ )
Example
• In a clinical trial for a new drug to treat hypertension,
N1 = 50 patients were randomly assigned to receive
the new drug, and N2 = 50 patients to receive a
placebo. 34 of the patients receiving the drug
showed improvement, while 15 of those receiving
placebo showed improvement.
• Compute a 95% CI estimate for the difference
between proportions improved.
• p1 = 34/50 = 0.68, p2 = 15/50 = 0.30
• The point estimate for the difference is: 0.68 − 0.30 = 0.38

• SE of the difference = √(0.68 × 0.32/50 + 0.30 × 0.70/50) = 0.0925
• 95% CI
– Lower = ( point estimate ) - (Zα/2) (SE)
= 0.38 – (1.96)(0.0925) = 0.20
– Upper = ( point estimate ) + (Zα/2) (SE)
= 0.38 + (1.96)(0.0925) = 0.56
• 95% CI = (0.20, 0.56)
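A sketch reproducing the drug-trial CI in Python:

import math

x1, n1 = 34, 50   # improved on the new drug
x2, n2 = 15, 50   # improved on placebo

p1, p2 = x1 / n1, x2 / n2                                  # 0.68 and 0.30
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)    # about 0.0925
diff = p1 - p2                                             # 0.38
print(round(diff - 1.96 * se, 2), round(diff + 1.96 * se, 2))   # (0.20, 0.56)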
CI for the true OR
• The odds ratio is defined as the odds of
disease among exposed individuals
divided by the odds of disease among the
unexposed
• It is estimated from the cells (a, b, c, d) of a 2×2 table by: OR̂ = ad/bc
• Upper and lower confidence limits for the natural log of the odds ratio are calculated with the formula:

ln(OR̂) ± Z × √(1/a + 1/b + 1/c + 1/d)

• where
– Z is the standard normal value for the level of confidence desired
– a, b, c, d are the cells of a 2x2 table
• Exponentiate the upper and lower confidence limits for the log of the OR to obtain the confidence limits for the OR itself.
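A sketch of the calculation in Python; the function name or_ci and the cell counts in the usage line are illustrative assumptions, not values from these notes.

import math

def or_ci(a, b, c, d, z=1.96):
    """95% CI for the odds ratio from a 2x2 table with cells a, b, c, d."""
    or_hat = (a * d) / (b * c)                          # estimated odds ratio
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of ln(OR)
    lower = math.exp(math.log(or_hat) - z * se_log)     # exponentiate the limits
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper

# hypothetical 2x2 table: 20 exposed cases, 80 exposed controls,
#                         10 unexposed cases, 90 unexposed controls
print(or_ci(20, 80, 10, 90))   # OR about 2.25, 95% CI roughly (0.99, 5.1)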
