Estimation & Sample Size Determination
January 2024
Estimation 1
Objectives
By the end of this part, learners will be able to:
• Know the methods and principles of drawing conclusions about a larger group (or
population) based on samples taken from that population
• Define point estimate, standard error, confidence level, and margin of error
Revisions
• Descriptive statistics:
o Data collection
o Data organization and presentation
o Data summarization
• Probability:
o Concepts, and
o Distributions
Statistical inference
Estimation
Hypothesis testing
The concept of statistical inference
[Figure: a random sample is drawn from the population; the sample statistic is used to estimate the population parameter (statistical estimation).]
Estimation …
• In practice, we select a sample from the population and use sample
statistics to estimate the unknown population parameters.
• The techniques for estimation and other procedures in statistical
inference depend on:
o The appropriate classification of the outcome/dependent
variable (the key study variable) as continuous or dichotomous
o The number of comparison groups in the investigation
o E.g., two comparison groups may be:
Independent (e.g., men vs. women)
Dependent (matched/paired)
Estimation techniques
• There are two types of estimates that can be produced for any
population parameter: point estimates and confidence interval estimates.
Point estimate
• The sample mean is an unbiased estimator of the population
mean
• The same holds true for the sample proportion with regard to
estimating the population proportion
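The unbiasedness claim can be illustrated with a small simulation. This is only a sketch: the population below is hypothetical (simulated values, not data from these slides), and the function name and seeds are illustrative choices. Averaged over many random samples, the sample mean should land very close to the population mean.

```python
import random
import statistics


def average_of_sample_means(population, sample_size, n_samples, seed=0):
    """Draw repeated simple random samples and average their sample means."""
    rng = random.Random(seed)
    sample_means = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]
    return statistics.fmean(sample_means)


# Hypothetical population of 10,000 values (an assumption for illustration).
pop_rng = random.Random(42)
population = [pop_rng.gauss(120, 15) for _ in range(10_000)]
population_mean = statistics.fmean(population)

avg_of_means = average_of_sample_means(population, sample_size=25, n_samples=5_000)
print(f"population mean:         {population_mean:.2f}")
print(f"average of sample means: {avg_of_means:.2f}")
```

Replacing the population values with 0/1 indicators would illustrate the same property for the sample proportion.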
Desirable properties of estimators
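• Unbiasedness
o the expected value of the estimator equals the parameter it estimates.
• Efficiency
o among unbiased estimators, it has the smallest variance.
• Consistency
o it gets closer to the parameter as the sample size increases.
• Sufficiency
o it contains all the information in the data about the parameter it estimates.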
Confidence interval estimate
Margin of error
Confidence interval estimate …
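• CI estimate = point estimate ± margin of error, where the margin of error is the critical value (z or t) multiplied by the standard error of the point estimate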
A 95% confidence interval estimate of the mean
Recall the Central Limit Theorem, which states that for large samples,
the distribution of the sample means is approximately normal
with mean $\mu_{\bar{x}} = \mu$ and standard deviation $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
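A quick simulation can make the theorem concrete. The sketch below assumes a skewed (exponential) population with mean 2 and standard deviation 2, chosen only for illustration. It checks that the standard deviation of the simulated sample means is close to the standard error predicted by the theorem, which is the population standard deviation divided by the square root of n.

```python
import random
import statistics
from math import sqrt


def simulate_sample_means(draw, sample_size, n_samples):
    """Return the means of n_samples samples, each of size sample_size."""
    return [
        statistics.fmean([draw() for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


rng = random.Random(7)
rate = 0.5            # exponential population: mean = sd = 1 / rate = 2
sigma = 1 / rate
n = 50                # size of each sample

means = simulate_sample_means(lambda: rng.expovariate(rate), n, 4_000)

observed_se = statistics.stdev(means)   # spread of the simulated sample means
theoretical_se = sigma / sqrt(n)        # sigma / sqrt(n) from the CLT

print(f"mean of sample means: {statistics.fmean(means):.3f}")
print(f"observed SE:          {observed_se:.3f}")
print(f"theoretical SE:       {theoretical_se:.3f}")
```

Even though the individual observations are far from normal, the sample means cluster around the population mean with the predicted spread.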
Confidence interval estimate …
• For the standard normal distribution, $P(-1.96 < z < 1.96) = 0.95$: there is a 95%
chance that a standard normal variable (z) will fall between -1.96
and 1.96
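The 95% figure can be checked directly with the standard library's `statistics.NormalDist`. The helper `critical_value` is an illustrative name introduced here, not something from the slides. It returns the z value that leaves the requested confidence level in the middle of the distribution.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability that a standard normal variable falls between -1.96 and 1.96.
coverage = z.cdf(1.96) - z.cdf(-1.96)


def critical_value(confidence):
    """Two-sided critical z value for a given confidence level (e.g. 0.95)."""
    return z.inv_cdf(0.5 + confidence / 2)


print(f"P(-1.96 < z < 1.96) = {coverage:.4f}")
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> z = {critical_value(level):.3f}")
```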
Confidence interval estimate …
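• Standardizing the sample mean gives $P\left(-1.96 < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95$, which
rearranges to $P\left(\bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95$.
• The 95% CI for the population mean is the interval in the last
probability statement and is given by $\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}}$ (with the sample standard deviation s used in place of $\sigma$ when $\sigma$ is unknown).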
Confidence intervals for one sample, continuous outcome
• Point estimate ± margin of error: 127.3 ± 0.63
• Therefore, the 95% CI is (126.7, 127.9).
• The margin of error is very small here because of the large
sample size (narrow CI and precise estimate).
• We are 95% confident that the true mean is between 126.7
and 127.9.
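The arithmetic behind an interval of this kind can be sketched as follows. The inputs that produced the margin of 0.63 are not shown in these slides, so the standard deviation (19.0) and sample size (3534) used below are assumptions chosen only because they are consistent with the reported margin. The function name `mean_ci` is likewise illustrative.

```python
from math import sqrt
from statistics import NormalDist


def mean_ci(xbar, s, n, confidence=0.95):
    """Large-sample z interval for a mean: xbar +/- z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * s / sqrt(n)
    return xbar - margin, xbar + margin, margin


# xbar = 127.3 is from the slides; s and n are ASSUMED illustrative inputs.
lower, upper, margin = mean_ci(xbar=127.3, s=19.0, n=3534)
print(f"margin of error: {margin:.2f}")
print(f"95% CI: ({lower:.1f}, {upper:.1f})")
```

Because the margin shrinks with the square root of n, a sample of several thousand gives a narrow interval even with a standard deviation of this size.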
Confidence intervals for one sample, dichotomous outcome
Confidence intervals for two independent samples, continuous outcome
• The formulas (in the Table) assume equal variability in the two
populations (i.e., that the population variances are equal, or $\sigma_1^2 = \sigma_2^2$).
• This means that the outcome is equally variable in each of the
comparison populations.
• For analysis, we have samples from each of the comparison populations. If
the sample variances are similar, then the assumption about variability in
the populations is reasonable.
• As a guideline, if the ratio of the sample variances, $s_1^2/s_2^2$, is between 0.5
and 2 (i.e., if one variance is no more than double the other), then the
formulas in the above Table are appropriate.
• If the ratio of the sample variances is greater than 2 or less than 0.5, then
alternative formulas must be used to account for the heterogeneity in
variances.
                      Men                      Women
               n      mean     s        n      mean     s
Systolic BP    1623   128.2    17.5     1911   126.5    20.1
• We have large samples (more than 30) of both men and
women, and therefore we use the CI formula from the above Table with z
as opposed to t.
• However, before implementing the formula, we first check whether the
assumption of equality of the population variances is reasonable.
• The guideline suggests investigating the ratio of the sample variances,
$s_1^2/s_2^2$.
• Suppose we call men Group 1 and women Group 2. (Again, this is
arbitrary. It only needs to be noted when interpreting the results.)
• The ratio of the sample variances is $17.5^2/20.1^2 = 0.76$, which falls
between 0.5 and 2, suggesting that the assumption of equality of the
population variances is reasonable.
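The guideline check is simple enough to express in a few lines. The function names below are illustrative. The thresholds 0.5 and 2 are the guideline values stated in these slides, and the standard deviations are those for men (17.5) and women (20.1).

```python
def variance_ratio(s1, s2):
    """Ratio of sample variances, s1^2 / s2^2."""
    return s1 ** 2 / s2 ** 2


def equal_variance_reasonable(s1, s2, low=0.5, high=2.0):
    """Guideline: treat the variances as equal if the ratio is within [low, high]."""
    return low <= variance_ratio(s1, s2) <= high


ratio = variance_ratio(17.5, 20.1)            # men = Group 1, women = Group 2
pooled_ok = equal_variance_reasonable(17.5, 20.1)
print(f"variance ratio = {ratio:.2f}, equal-variance formula appropriate: {pooled_ok}")
```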
• The 95% CI for the difference in mean systolic blood pressures is:
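The computation can be sketched as below. The Table with the formula is not reproduced in these slides, so this assumes the usual equal-variance (pooled) large-sample formula: the difference in sample means plus or minus z times the pooled standard deviation times the square root of (1/n1 + 1/n2). The function names are illustrative. The inputs are the systolic blood pressure summary statistics for men and women given earlier.

```python
from math import sqrt
from statistics import NormalDist


def pooled_sd(n1, s1, n2, s2):
    """Pooled standard deviation under the equal-variance assumption."""
    return sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))


def diff_means_ci(n1, mean1, s1, n2, mean2, s2, confidence=0.95):
    """Large-sample z interval for mean1 - mean2 using the pooled SD."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    sp = pooled_sd(n1, s1, n2, s2)
    margin = z * sp * sqrt(1 / n1 + 1 / n2)
    diff = mean1 - mean2
    return diff, diff - margin, diff + margin


# Men are Group 1 and women are Group 2, as in the slides.
sp = pooled_sd(1623, 17.5, 1911, 20.1)
diff, lower, upper = diff_means_ci(1623, 128.2, 17.5, 1911, 126.5, 20.1)
print(f"pooled SD = {sp:.2f}")
print(f"difference = {diff:.1f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Under these assumptions the interval lies entirely above 0. Intermediate rounding (for example, rounding the pooled standard deviation before use) can shift the limits in the second decimal place.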
Confidence intervals for matched sample, continuous outcome
Confidence intervals for two independent samples, dichotomous outcome
• It is very common to compare two groups in terms of the presence or absence of a
particular characteristic or attribute.
• There are many instances in which the outcome variable is dichotomous (e.g.,
prevalent cardiovascular disease or diabetes, current smoking status, incident
coronary heart disease, cancer remission, successful device implant).
• Similar to the applications described for continuous outcomes, we focus here on
the case where there are two comparison groups that are independent or
physically separate and the outcome is dichotomous.
• The two groups might be determined by a particular attribute or characteristic of
the participant (e.g., sex, age less than 65 versus age 65 and older) or might be set
up by the investigator (e.g., participants assigned to receive an experimental drug
or a placebo, a pharmacological versus a surgical treatment).
• When the outcome is dichotomous, the analysis involves comparing the
proportions of successes between the two groups.
Proportions in two independent groups
• Three measures are used to compare proportions in two
independent groups:
o Risk difference: computed by taking the difference
in proportions between comparison groups; it is similar to
the estimate of the difference in means for a continuous
outcome described earlier.
o Relative risk: computed by taking the ratio of proportions.
o Odds ratio: computed by taking the ratio of the odds of
success in the comparison groups.
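The three measures can be computed side by side from the number of successes and the sample size in each group. The sketch below uses the smoker and nonsmoker counts for prevalent CVD that appear in the Framingham example later in these slides (81 of 744 and 298 of 3055). The function name and the dictionary keys are illustrative.

```python
def compare_proportions(x1, n1, x2, n2):
    """Risk difference, relative risk and odds ratio for Group 1 vs Group 2."""
    p1, p2 = x1 / n1, x2 / n2
    odds1 = p1 / (1 - p1)   # odds of success in Group 1
    odds2 = p2 / (1 - p2)   # odds of success in Group 2
    return {
        "risk_difference": p1 - p2,
        "relative_risk": p1 / p2,
        "odds_ratio": odds1 / odds2,
    }


# Group 1 = current smokers, Group 2 = nonsmokers.
measures = compare_proportions(81, 744, 298, 3055)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

A risk difference of 0, or a relative risk or odds ratio of 1, corresponds to no difference between the groups.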
Confidence Intervals for the Risk Difference
The risk difference (RD) is similar to the difference in means
when the outcome is continuous. The parameter of interest is
the difference in proportions in the population, $RD = p_1 - p_2$.
The point estimate is the difference in sample proportions,
$\widehat{RD} = \hat{p}_1 - \hat{p}_2$. The sample proportions are computed by taking
the ratio of the number of successes (x) to the sample size (n)
in each group, $\hat{p}_1 = x_1/n_1$ and $\hat{p}_2 = x_2/n_2$, respectively. The
formula for the CI for the difference in proportions, or the
RD, is given in the Table below.
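• Confidence interval for ($p_1 - p_2$): $\hat{p}_1 - \hat{p}_2 \pm z\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$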
• The formula in the above Table is appropriate for large
samples, defined as at least five successes ($n\hat{p}$) and at least
five failures [$n(1-\hat{p})$] in each sample. If there are fewer than
five successes or five failures in either comparison group, then
alternative procedures called exact methods must be used to
estimate the difference in population proportions.
• Example: we presented data measured in participants
who attended the fifth examination of the offspring in
the Framingham Heart Study. A total of n=3799
participants attended the fifth examination, and the
Table below contains data on prevalent CVD among
participants who were and were not currently smoking
cigarettes at the time of the fifth examination in the
Framingham Offspring Study.
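• Prevalent CVD in Smokers and Nonsmokers

                  Free of CVD   Prevalent CVD   Total
Nonsmoker            2757            298         3055
Current smoker        663             81          744
Total                3420            379         3799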
• The outcome is prevalent CVD and the two comparison groups are defined
by current smoking status. The point estimate of prevalent CVD among
nonsmokers is 298 / 3055=0.0975, and the point estimate of prevalent
CVD among current smokers is 81 / 744= 0.1089. When constructing CIs
for the RD, the convention is to call the exposed or treated Group 1 and
the unexposed or untreated Group 2. Here smoking status defines the
comparison groups, and we will call the current smokers Group 1 and the
nonsmokers Group 2. A CI for the difference in prevalent CVD (or RD)
between smokers and nonsmokers is given below.
• In this example, we have more than enough successes (cases of prevalent
CVD) and failures (persons free of CVD) in each comparison group.
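The computation can be written out as follows. This assumes the large-sample formula for the difference in two independent proportions with z = 1.96. Using the rounded point estimates 0.1089 and 0.0975, it reproduces the limits -0.0133 and 0.0361.

```latex
\[
\hat{p}_1 - \hat{p}_2 \pm z\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}
\]
\[
0.1089 - 0.0975 \pm 1.96\sqrt{\frac{0.1089\,(1-0.1089)}{744} + \frac{0.0975\,(1-0.0975)}{3055}}
\]
\[
0.0114 \pm 1.96\,(0.0126) = 0.0114 \pm 0.0247
\quad\Longrightarrow\quad (-0.0133,\ 0.0361)
\]
```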
• We are 95% confident that the difference in proportions of smokers
as compared to nonsmokers with prevalent CVD is between -0.0133
and 0.0361.
• The null, or no difference, value for the RD is 0. Because the 95% CI
includes 0, we cannot conclude that there is a statistically significant
difference in prevalent CVD between smokers and nonsmokers.
• Example: A randomized trial is conducted to evaluate the effectiveness of
a newly developed pain reliever designed to reduce pain in patients
following joint replacement surgery. The trial compares the new pain
reliever to the pain reliever currently used (called the standard of care). A
total of 100 patients undergoing joint replacement surgery agree to
participate in the trial. Patients are randomly assigned to receive either
the new pain reliever or the standard pain reliever following surgery. The
patients are blind to the treatment assignment. Before receiving the
assigned treatment, patients are asked to rate their pain on a scale of 0 to
10, with higher scores indicative of more pain. Each patient is then given
the assigned treatment and after 30 minutes is again asked to rate his or
her pain on the same scale. The primary outcome is a reduction in pain of
3 or more scale points (defined by clinicians as a clinically meaningful
reduction). The data shown in the Table below are observed in the trial
• A point estimate for the difference in proportions of patients
reporting a clinically meaningful reduction in pain between
treatment groups is 0.46-0.22=0.24. (Notice that we call the
experimental or new treatment Group 1 and the standard Group 2.)
• There is a 24 percentage point increase in patients reporting a
meaningful reduction in pain with the new pain reliever as
compared to the standard pain reliever.
• We now construct a 95% CI for the difference in proportions.
• The sample sizes in each comparison group are adequate—i.e.,
each treatment group has at least five successes (patients reporting
reduction in pain) and at least five failures—therefore the formula
for the CI is
• We are 95% confident that the difference in proportions of patients
reporting a meaningful reduction in pain is between 0.06 and 0.42
comparing the new and standard pain relievers.
• Our best estimate is an increase of 24 percentage points with the new
pain reliever.
• Because the 95% CI does not contain 0 (the null value), we can conclude
that there is a statistically significant difference between pain relievers—in
this case, in favor of the new pain reliever.
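The interval for the pain reliever trial can be reproduced with a short sketch. The Table with the trial counts is not reproduced in these slides, so the code assumes the 100 patients were split evenly, with 23 of 50 (0.46) and 11 of 50 (0.22) reporting a meaningful reduction. These counts are an assumption chosen to match the stated proportions. The function name is illustrative, and the check for at least five successes and five failures follows the large-sample rule described earlier.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_ci(x1, n1, x2, n2, confidence=0.95):
    """Large-sample CI for p1 - p2; requires >= 5 successes and failures per group."""
    for x, n in ((x1, n1), (x2, n2)):
        if x < 5 or n - x < 5:
            raise ValueError("fewer than 5 successes or failures: use an exact method")
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    rd = p1 - p2
    return rd, rd - margin, rd + margin


# Group 1 = new pain reliever, Group 2 = standard; counts are ASSUMED (50 per arm).
rd, lower, upper = risk_difference_ci(23, 50, 11, 50)
excludes_null = not (lower <= 0 <= upper)   # null value for the RD is 0
print(f"RD = {rd:.2f}, 95% CI = ({lower:.2f}, {upper:.2f}), excludes 0: {excludes_null}")
```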
Sample size determination
Thanks!