7 Estimation

Statistical estimation involves using sample statistics to estimate unknown population parameters. There are two main types of estimation - point estimation and interval estimation. Point estimation provides a single value for the population parameter while interval estimation provides a range of plausible values for the population parameter with a specified level of confidence, such as a 95% confidence interval. Desirable properties for estimators include being unbiased, efficient, consistent, and sufficient.

Uploaded by

TESFAYE YIRSAW

7-STATISTICAL ESTIMATION

By: Amare M. (MPH/Biostatistics)


Learning objectives
By the end of this session, learners should be able to:

• Distinguish a sample statistic from a population parameter
• Describe sampling distributions (means and proportions)
• State the central limit theorem
• Define point estimate, standard error, confidence level, and margin of error
• Compute and interpret confidence intervals for means and proportions
• Apply the sample size determination formulas for a mean and a proportion
Inferential statistics

• Consists of generalizing from samples to populations by performing
– estimation,
– hypothesis testing,
– determining relationships among variables, and
– making predictions.

• Used when we want to draw conclusions about a population from the data obtained from a sample
Inferential statistics,…

• The two primary methods for making inferences are estimation and hypothesis testing.
Sampling distributions

Sampling distributions of
• Means and
• Proportions

• Statistical analysis of data begins with a description of quantities, their relationships, and their structure.
• A quantity is anything that can be counted, ranked, or measured.
• It is essential to examine the distribution of the variable for skewness (tails), kurtosis (peaked or flat distribution), spread (range of the values), and outliers (data values separated from the rest of the data).
• Information about each of these characteristics determines which statistical analyses to choose and allows the results to be accurately explained and interpreted.

10/17/2023 6
• Two facts are key to statistical inference:

• Population parameters are fixed numbers whose values are usually unknown.

• Sample statistics are known values for any given sample, but they vary from sample to sample taken from the same population.

• This variability of sample statistics (sampling variation) is always present and must be accounted for in any inferential procedure by identifying probability distributions that describe the variability of sample statistics.
Sampling distributions

• Suppose we repeatedly draw samples of the same size from a population and compute the same statistic from each one. The frequency distribution of the values from all these samples forms the sampling distribution of the sample statistic.
• This sampling distribution has characteristics that can be related to those of the population from which the sample is drawn.
• This relationship is usually provided by the parameters of the probability distribution describing the population.
• E.g. sampling distribution of the means
• E.g. sampling distribution of the proportions
If the sample statistic is the sample mean, then the distribution is the sampling distribution of sample means.

The sampling distribution consists of the values of the sample means computed from the repeated samples.

• In practice, we do not take repeated samples from a population; i.e., we do not encounter sampling distributions empirically.
• However, it is necessary to know their properties in order to draw statistical inferences.
Properties of sampling distribution

Properties of Sampling Distributions of Sample Means

• The mean of the sample means, $\mu_{\bar{x}}$, is equal to the population mean: $\mu_{\bar{x}} = \mu$.
• The standard deviation of the sample means, $\sigma_{\bar{x}}$, is equal to the population standard deviation divided by the square root of n: $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
• Hence, the sampling distribution of averages has a smaller variance than the underlying population.
• The standard deviation of the sampling distribution of the sample means is called the standard error of the mean.
The central limit theorem

If a sample of size n ≥ 30 is taken from a population with any type of distribution that has mean = μ and standard deviation = σ, the sample means will have an approximately normal distribution, with mean μ and standard deviation $\sigma/\sqrt{n}$.
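The central limit theorem can be checked by simulation. The sketch below is only an illustration under assumed settings: the skewed exponential population (mean 1, standard deviation 1), the helper names `sample_means` and `draw_exponential`, the number of repetitions, and the seed are all choices made here, not part of the slides. For each sample size it compares the mean and standard deviation of the simulated sample means with μ and σ/√n.

```python
import random
import statistics


def sample_means(draw, n, reps=3000, seed=0):
    """Return `reps` sample means, each computed from n values produced by draw(rng)."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]


def draw_exponential(rng):
    """Assumed population: exponential with mean 1 and standard deviation 1 (right-skewed)."""
    return rng.expovariate(1.0)


for n in (2, 30, 100):
    means = sample_means(draw_exponential, n)
    print(
        f"n={n:3d}  mean of sample means={statistics.fmean(means):.3f}  "
        f"SD of sample means={statistics.stdev(means):.3f}  "
        f"sigma/sqrt(n)={1 / n ** 0.5:.3f}"
    )
```

With larger n, the simulated standard deviation should track σ/√n closely, and a histogram of `means` would look increasingly bell-shaped even though the population is skewed.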
[Figure: an illustration showing how the sample size determines the shape of the sampling distribution]
If the population itself is normally distributed, with mean = μ and standard deviation = σ, the sample means will have a normal distribution for any sample size n.

[Figure: population distribution and the distribution of sample means]

• If the sample statistic is a proportion, then provided n is large, the sample proportions will be approximately normally distributed with mean p and standard deviation $\sqrt{p(1-p)/n}$, called the standard error of the proportion.
The mean and standard error
Example:

The weights of one-year-old children in a certain region are distributed with a mean weight of 8 kg and a standard deviation of 0.7 kg. Samples of 38 children are randomly selected from the population, and the mean of each sample is determined.

a. Find the mean and standard error of the mean of the sampling distribution.
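A short check of part (a), using the properties stated earlier (the mean of the sample means equals μ, and the standard error equals σ/√n). The variable names are illustrative only.

```python
import math

mu = 8.0      # population mean weight in kg (from the example)
sigma = 0.7   # population standard deviation in kg (from the example)
n = 38        # size of each sample

mean_of_sample_means = mu               # mean of the sampling distribution
standard_error = sigma / math.sqrt(n)   # standard error of the mean

print(f"mean = {mean_of_sample_means:.1f} kg, standard error = {standard_error:.3f} kg")
```

So the sampling distribution of the mean is centred at 8 kg with a standard error of roughly 0.11 kg.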
Interpreting the Central Limit Theorem

Important points about assumptions of normal distribution

We apply the normal distribution to a sample if:

• Our sample is taken from a normally distributed population whose population standard deviation (σ) is known, or
• The sample size is large (i.e., at least 30) so that we can use the Central Limit Theorem (CLT).
We use the Student t-distribution (t-statistic) provided that the following three conditions are satisfied:
• The sample is from a normally distributed population,
• The population standard deviation is unknown, and
• The sample size is small, i.e., less than 30. (Note that we can also use the t-distribution even if n ≥ 30 if we want to be more conservative!)

• Otherwise, we use a nonparametric test.
The concept of statistical inference
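[Figure: a random sample is drawn from the population; the statistic computed from the sample is used to make inferences about the population parameters]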
Statistical Estimation

What is the goal of conducting a survey?

• The goal of conducting surveys is to obtain information about a particular population.
• When the sample has been selected and the information collected, there still remains the task of linking the information gathered from the sample back to the overall population.
• Estimation is the use of sample statistics to estimate population parameters.
• The true population parameter value is usually unknown.
Statistical Estimation….
Statistical estimation

Point estimate          Interval estimate
• Sample mean           • Confidence interval for mean
• Sample proportion     • Confidence interval for proportion

• The point estimate is always within the interval estimate.
• A point estimate for a population parameter is a single-valued estimate of that parameter.
• A confidence interval (CI) estimate is a range of values for a population parameter with a level of confidence attached (e.g., 95% confidence that the interval contains the unknown parameter).
• A confidence interval provides additional information about variability.
Estimation

♣ Estimation is the computation of a statistic from sample data, often yielding a value that is an approximation (guess) of its target, an unknown true population parameter value.

♣ The statistic itself is called an estimator and can be of two types: point or interval.

♣ The value or values that the estimator assumes are called estimates.
Point estimation

• From a single sample, we can calculate a sample statistic to estimate a single parameter (a point estimate).
• The point estimate for the population mean µ is the sample mean, $\bar{x} = \frac{\sum x_i}{n}$.
• The point estimate for the population proportion p is given by $\hat{p} = \frac{x}{n}$, where x is the total number of successes (events).
point estimates
Estimation
Desirable properties of estimators include:
• Unbiasedness
– The expected value of the estimator equals the population parameter
– Unbiasedness is an average or long-run property
– Any systematic deviation of the estimator from the population parameter is called bias
• Efficiency
– An estimator is efficient if it has a relatively small variance
• Consistency
– The probability of the estimator being close to the parameter it estimates increases as the sample size increases
• Sufficiency
– The estimator contains all the information in the data about the parameter it estimates
Interval estimation
• A confidence interval (CI) estimate is a range of
values for a population parameter with a level of
confidence attached (e.g., 95% confidence that the
interval contains the unknown parameter).
• The level of confidence is similar to a probability.
• The CI starts with the point estimate and builds in
what is called a margin of error.
• The margin of error incorporates the confidence level (e.g., 90% or 95%, which is chosen by the investigator) and the sampling variability, or the standard error, of the point estimate.
Interval estimation….
• A CI is a range of values that is likely to cover
the true population parameter, and
• Its general form is point estimate ± margin of
error.
• The point estimate is determined first.
• The point estimates for the population mean
and proportion are the sample mean and
sample proportion, respectively.
• These are our best single-valued estimates of
the unknown population parameters.
Interval estimation….

• A level of confidence is selected that reflects the likelihood that the CI contains the true, unknown parameter.
• Usually, confidence levels of 90%, 95%, and 99% are chosen.
• The Central Limit Theorem states that for large samples, the distribution of the sample means is approximately normal with mean $\mu_{\bar{x}} = \mu$ and standard deviation $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
• We use the Central Limit Theorem to develop the margin of error.
Interval estimation….

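• For the standard normal distribution, $P(-1.96 < Z < 1.96) = 0.95$.
• The Central Limit Theorem states that for large samples, $Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ is approximately standard normal.
• If we make this substitution, the following statement is true: $P\left(\bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95$.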
Interval estimation….
• The 95% CI for the population mean is the interval in the last probability statement, and the margin of error is $1.96\,\sigma/\sqrt{n}$,
• where 1.96 reflects the fact that a 95% confidence level is selected and $\sigma/\sqrt{n}$ is the standard error (or the standard deviation of the point estimate, $\bar{x}$).
• The general form of a CI can be rewritten as follows: point estimate ± z × SE(point estimate).
Interval estimation….
• We find for 90%, $z_{\alpha/2}$ = 1.645; for 95%, $z_{\alpha/2}$ = 1.96; and for 99%, $z_{\alpha/2}$ = 2.58.
• Higher confidence levels have larger $z_{\alpha/2}$ values, which translate to larger margins of error and wider CIs.
• There are instances in which the sample size is not
sufficiently large (e.g., n < 30), and therefore the
general result of the Central Limit Theorem does not
apply.
• In this case, we cannot use the standard normal
distribution (z) in the confidence interval.
• Instead we use another probability distribution, called
the t distribution, which is appropriate for small
samples.
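The z values quoted above can be reproduced from the standard normal distribution. This is a minimal sketch using Python's standard library; the helper name `z_critical` is an illustrative choice, not something defined in the slides.

```python
from statistics import NormalDist


def z_critical(confidence):
    """Two-sided critical value z_(alpha/2) for a confidence level between 0 and 1."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_critical(level):.3f}")
# 90%: z = 1.645
# 95%: z = 1.960
# 99%: z = 2.576
```

The 99% value, 2.576, is the figure usually rounded to 2.58 in tables.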
Student’s t-distribution
• Was proposed in 1908 by William Gosset.

• The t distribution is another probability model for a continuous variable.
• The t distribution is similar to the standard normal distribution but takes a slightly different shape depending on the exact sample size.
• Specifically, the t values for CIs are larger for smaller samples, resulting in larger margins of error (i.e., there is more imprecision with small samples).
• t values are indexed by degrees of freedom (df), defined as n − 1.
• It is important to note that appropriate use of the t distribution assumes that the outcome of interest is approximately normally distributed.

The t distribution is:
• Unimodal;
• Asymptotic to the horizontal axis;
• Symmetrical about zero;
• Dependent on n or the degrees of freedom (v = n − 1);
• Approximately the same as the standard normal distribution if the degrees of freedom v are large (≥ 100).
Formula for t-distribution
Interval estimation….

Confidence Level
Interpretation of CIs in general
• Suppose we want to estimate a population mean using a 95%
confidence level.
• If we take 100 different samples (in practice, we take only one) and compute a 95% CI for each sample, then in theory about 95 of the 100 CIs will contain the true mean value (μ).
• This leaves about 5 of 100 CIs that will not include the true mean value.
• In practice, we select one random sample and generate one CI.
• This interval may or may not contain the true mean; the observed
interval may overestimate μ or underestimate μ.
• The 95% CI is the likely range of the true, unknown parameter.
• It is important to note that a CI does not reflect the variability in
the unknown parameter but instead provides a range of values
that are likely to include the unknown parameter.
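The repeated-sampling interpretation can be illustrated by simulation. In the sketch below the population (normal with μ = 119 and σ = 2.1), the sample size of 40, the number of repetitions, and the seed are all assumptions made for illustration. Each simulated sample yields one large-sample interval x̄ ± z·s/√n, and the function reports the fraction of intervals that cover μ.

```python
import math
import random
import statistics
from statistics import NormalDist


def coverage(mu=119.0, sigma=2.1, n=40, reps=2000, confidence=0.95, seed=0):
    """Fraction of simulated large-sample CIs for the mean that contain mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        margin = z * statistics.stdev(sample) / math.sqrt(n)
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / reps


print(f"share of intervals containing mu: {coverage():.3f}")
```

The share should land near 0.95 rather than exactly on it. Any single interval either contains μ or does not; the 95% describes the long-run behaviour of the procedure.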
Example
• In a study of preeclampsia, the mean systolic blood pressure of 10 healthy, non-pregnant women was found to be 119, with a standard deviation of 2.1.

a) What is the estimated standard error of the mean?
b) Construct the 99% confidence interval for the mean of the population from which the 10 subjects may be presumed to be a random sample.
c) What is the precision of the estimate?
d) What assumptions are necessary for the validity of the confidence interval you constructed?

a) SE = $s/\sqrt{n}$, with s = 2.1 and n = 10, so SE = $2.1/\sqrt{10}$ = 0.664 ≈ 0.66.
b) df = n − 1 = 9. For 99% confidence, α = 0.01 and α/2 = 0.005, so $t_{9,\,0.005}$ = 3.25.
The 99% CI is 119 ± 3.25 × 0.664 = (116.8, 121.2).
c) Precision (margin of error) = SE × $t_{9,\,0.005}$ = 0.664 × 3.25 = 2.16.
d) The population is normally distributed, and the 10 subjects represent a random sample from this population.
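The arithmetic in this example can be reproduced in a few lines. The critical value 3.25 is hard-coded from the t table value used in the solution; the Python standard library has no t quantile function, so looking it up programmatically would need an external package.

```python
import math

n, mean, s = 10, 119.0, 2.1   # sample size, sample mean, sample standard deviation
t_crit = 3.25                 # t for df = 9 and alpha/2 = 0.005, taken from the t table

se = s / math.sqrt(n)         # estimated standard error of the mean
margin = t_crit * se          # precision (margin of error)
lower, upper = mean - margin, mean + margin

print(f"SE = {se:.2f}, margin = {margin:.2f}, 99% CI = ({lower:.1f}, {upper:.1f})")
# SE = 0.66, margin = 2.16, 99% CI = (116.8, 121.2)
```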
Confidence intervals for one sample, continuous outcome

• The formulas for CIs for the population mean depend on the sample size.
• Confidence intervals for μ:

n ≥ 30: $\bar{x} \pm z\,\frac{s}{\sqrt{n}}$ (find z in the z table)

n < 30: $\bar{x} \pm t\,\frac{s}{\sqrt{n}}$ (find t in the t table, df = n − 1)


Example

• In an examination of the Framingham Offspring Study, the mean and standard deviation of systolic blood pressure were 127.3 and 19, respectively, among 3534 participants.
• Generate a 95% CI for systolic blood pressure.
• Solution
• Because the sample size is large, we use the following formula: $\bar{x} \pm z\,\frac{s}{\sqrt{n}}$
Example
• $127.3 \pm 1.96 \times \frac{19}{\sqrt{3534}}$ = 127.3 ± 0.63
• Adding and subtracting the margin of error, we get the CI (126.7, 127.9).
• Why is the margin of error (0.63) so small?
• Because of the large sample size.
• The 90% CI is narrower still: 127.3 ± 1.645 × 0.32 = 127.3 ± 0.53, or (126.8, 127.8).
• That CI is very precise (narrow) because of the large sample size and the lower confidence level.
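As a check on the large-sample formula x̄ ± z·s/√n, the sketch below recomputes the systolic blood pressure interval at several confidence levels. The function name `z_interval` is illustrative; the summary statistics are those given in the example.

```python
import math
from statistics import NormalDist


def z_interval(mean, sd, n, confidence):
    """Large-sample CI for a mean: mean ± z * sd / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin


for level in (0.90, 0.95, 0.99):
    low, high = z_interval(127.3, 19.0, 3534, level)
    print(f"{level:.0%} CI: ({low:.2f}, {high:.2f})  width = {high - low:.2f}")
```

The printed widths grow with the confidence level, which is the trade-off between confidence and precision described earlier.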
• For 10 participants attending the Framingham Offspring Study, the mean body mass index (BMI) was 27.26 with a standard deviation of 3.1.
• Compute a 90% CI for the true mean BMI.
• Because the sample size is small, we must now use the CI formula that involves t rather than z.
• df = 10 − 1 = 9
• The t value for 90% confidence with df = 9 is t = 1.833.
• $27.26 \pm 1.833 \times \frac{3.1}{\sqrt{10}}$ = 27.26 ± 1.80
• We are 90% confident that the true mean BMI in the population is between 25.46 and 29.06.
• Because of the small sample size, the CI is less precise.
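To see what the t value contributes for a small sample, the sketch below computes the 90% margin of error for the BMI example twice: once with t = 1.833 (df = 9, the value used in the example) and once with z = 1.645, which would be the wrong choice here. Both critical values are hard-coded from the tables.

```python
import math

n, mean, s = 10, 27.26, 3.1
t_crit = 1.833   # 90% confidence, df = 9 (from the t table)
z_crit = 1.645   # 90% confidence, standard normal (shown only for comparison)

se = s / math.sqrt(n)
t_margin = t_crit * se
z_margin = z_crit * se
lower, upper = mean - t_margin, mean + t_margin

print(f"t-based: {mean} ± {t_margin:.2f} -> ({lower:.2f}, {upper:.2f})")
print(f"z-based margin would be {z_margin:.2f}, narrower than it should be")
```

The t-based interval is wider because the standard deviation is itself estimated from only 10 observations.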
Confidence intervals for one sample, dichotomous outcome/proportion

• There are many applications where the


outcome of interest is dichotomous.
• The parameter of interest is the unknown
population proportion, denoted p
• For example, suppose we wish to estimate the
proportion of people with diabetes in a
population, or the proportion of people with
hypertension or obesity
• The sample proportion is denoted $\hat{p}$ and is computed by taking the ratio of the number of successes in the sample to the sample size: $\hat{p} = x/n$.
• The formula for the CI for the population proportion is $\hat{p} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.
• The standard error of the point estimate is $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.
• The preceding formula is appropriate for large samples, defined as at least five successes and at least five failures in the sample.
• If there are fewer than five successes or failures, then alternative procedures called exact methods must be used to estimate the population proportion.
• There were a total of 1219 participants on a certain treatment and 2313 participants not on treatment.
• If we call treatment a success, then x = 1219 and n = 3532.
• The sample proportion is $\hat{p} = 1219/3532 = 0.345$.
• To use the preceding formula, we need to satisfy the sample size criterion: specifically, we need at least five successes and five failures. Here we more than satisfy that requirement, so the CI formula can be used: $0.345 \pm 1.96\sqrt{\frac{0.345(1-0.345)}{3532}}$ = 0.345 ± 0.016.
• Thus, we are 95% confident that the true
proportion of persons on certain medication is
between 0.329 and 0.361, or between 32.9%
and 36.1%.
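A minimal sketch of the large-sample CI for a proportion, including the "at least five successes and five failures" check. The function name `proportion_ci` and its error message are illustrative assumptions; exact methods for small counts are not implemented here.

```python
import math
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Return (p_hat, lower, upper) using p_hat ± z * sqrt(p_hat * (1 - p_hat) / n)."""
    if min(successes, n - successes) < 5:
        raise ValueError("fewer than five successes or failures; use an exact method")
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - z * se, p_hat + z * se


p_hat, lower, upper = proportion_ci(1219, 3532)
print(f"p_hat = {p_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
# p_hat = 0.345, 95% CI = (0.329, 0.361)
```

The same function applies to prevalence estimates, for example `proportion_ci(244, 1792)` for a count of 244 cases among 1792 people.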
• Specific applications of estimation for a single
population with a dichotomous outcome
involve estimating prevalence and cumulative
incidence
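• Prevalence of CVD in women and men
• Total prevalence of CVD: $\hat{p}_1$ = 379/3799 = 0.0998
• Prevalence among men: $\hat{p}_2$ = 244/1792 = 0.1362
• Prevalence among women: $\hat{p}_3$ = 135/2007 = 0.0673
• Here we more than satisfy the sample size requirement (at least five successes and five failures) for men, women, and the pooled or total sample.
• For the total sample: $0.0998 \pm 1.96\sqrt{\frac{0.0998(1-0.0998)}{3799}}$ = 0.0998 ± 0.0095, or (0.090, 0.109)
• For men: $0.1362 \pm 1.96\sqrt{\frac{0.1362(1-0.1362)}{1792}}$ = 0.1362 ± 0.0159, or (0.120, 0.152)
• For women: $0.0673 \pm 1.96\sqrt{\frac{0.0673(1-0.0673)}{2007}}$ = 0.0673 ± 0.0110, or (0.056, 0.078)
• We are 95% confident that the true prevalence of CVD in men is between 12% and 15.2%, and in women, between 5.6% and 7.8%.
• Each of these CI estimates is very precise (small margins of error) because of the large sample sizes.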
Reading assignments #1:

Confidence intervals for two


independent samples,
continuous outcome
Confidence intervals for two independent samples, continuous outcome

• There are many applications in which two groups are compared with respect to their mean scores on a continuous outcome,
• e.g., mean systolic blood pressures in men versus women, or
• mean BMI or total cholesterol levels in patients assigned to an experimental treatment versus placebo.
• A key feature here is that the two comparison groups are independent, or physically separate.
• In the two-independent-samples application with a continuous outcome, the parameter of interest is the difference in population means, μ1 − μ2.
• The point estimate for the difference in population means is the difference in sample means, $\bar{x}_1 - \bar{x}_2$.
• The standard error of the point estimate incorporates the variability in the outcome of interest in each of the comparison groups.
Two sample means
• Confidence intervals for (μ1 − μ2):

n1 ≥ 30 and n2 ≥ 30: $(\bar{x}_1 - \bar{x}_2) \pm z\,S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$

n1 < 30 or n2 < 30: $(\bar{x}_1 - \bar{x}_2) \pm t\,S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$, with df = n1 + n2 − 2

• The formulas assume equal variability in the two populations (i.e., that the population variances are equal, or $\sigma_1^2 = \sigma_2^2$).
• For analysis, we have samples from each of the comparison populations. If the sample variances are similar, then the assumption about variability in the populations is reasonable.
• As a guideline, if the ratio of the sample variances $s_1^2/s_2^2$ is between 0.5 and 2 (i.e., if one variance is no more than double the other), then the formulas are appropriate.
• If the ratio of the sample variances is greater than 2 or less than 0.5, then alternative formulas must be used to account for the heterogeneity in variances.
• In the formulas, $\bar{x}_1$ and $\bar{x}_2$ are the means of the outcome in the independent samples, z or t are values from the z or t distributions reflecting the desired confidence level, and $S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$ is the standard error of the point estimate, $\bar{x}_1 - \bar{x}_2$.
• Here, $S_p$ is the pooled estimate of the common standard deviation (again, assuming that the variances in the populations are similar), computed from a weighted average of the sample variances: $S_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$.
Example

Systolic blood pressure
            Sample size (n)   Sample mean (x̄)   Standard deviation (s)
Men         1623              128.2              17.5
Women       1911              126.5              20.1

$s_1^2/s_2^2 = 17.5^2/20.1^2 = 0.76$, which falls between 0.5 and 2, suggesting that the assumption of equality of the population variances is reasonable.
The appropriate CI formula for the difference in mean systolic blood pressures between men and women is: $(\bar{x}_1 - \bar{x}_2) \pm z\,S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
• Before substituting, we first compute $S_p$, the pooled estimate of the common standard deviation: $S_p = \sqrt{\frac{(1623 - 1)(17.5)^2 + (1911 - 1)(20.1)^2}{1623 + 1911 - 2}} = \sqrt{359.12} \approx 19.0$
• Notice that the pooled estimate of the common standard deviation, $S_p$, falls between the standard deviations in the comparison groups (i.e., 17.5 and 20.1).
• $S_p^2$ is a weighted average of the variances in the comparison groups, weighted by the respective degrees of freedom (n − 1).
• The 95% CI for the difference in mean systolic blood pressures is $(128.2 - 126.5) \pm 1.96\,(19.0)\sqrt{\frac{1}{1623} + \frac{1}{1911}}$ = 1.7 ± 1.26
• The CI is interpreted as follows: We are 95% confident that the difference in mean systolic blood pressures between men and women is between 0.44 and 2.96 units.
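The pooled-variance interval can be sketched as below. The function name `pooled_ci` is an illustrative assumption, and the sketch uses z because both samples are large; for small samples a t value with n1 + n2 − 2 degrees of freedom would be substituted. It also applies the 0.5 to 2 variance-ratio guideline before pooling.

```python
import math
from statistics import NormalDist


def pooled_ci(n1, mean1, s1, n2, mean2, s2, confidence=0.95):
    """Return (s_p, lower, upper) for mu1 - mu2, assuming equal population variances."""
    ratio = s1 ** 2 / s2 ** 2
    if not 0.5 <= ratio <= 2:
        raise ValueError("variance ratio outside 0.5 to 2; pooling is not appropriate")
    s_p = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    se = s_p * math.sqrt(1 / n1 + 1 / n2)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = mean1 - mean2
    return s_p, diff - z * se, diff + z * se


s_p, lower, upper = pooled_ci(1623, 128.2, 17.5, 1911, 126.5, 20.1)
print(f"S_p = {s_p:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Because this sketch keeps $S_p$ unrounded (about 18.95), its limits can differ from the 0.44 and 2.96 quoted above by about 0.01; hand calculations typically round intermediate values.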
Reading assignments #2:

Confidence intervals for


matched/paired samples,
continuous outcome
Confidence intervals for matched samples, continuous outcome

• Here the two comparison groups are dependent (matched or paired).
• This involves a single sample of participants, with each participant measured twice (for example, before and after an intervention, or under each treatment in a crossover trial).
• The goal of the analysis is to compare the mean score measured before the intervention with the mean score measured afterward.
• Another scenario is matched-samples analysis,
• for example, the difference in an outcome between twins or siblings.
• In one-sample and two-independent-samples applications, participants are the units of analysis.
• In the two-dependent-samples application, the pair is the unit of analysis, so n is the number of pairs and not the number of measurements.
• Confidence intervals for μd:

n ≥ 30: $\bar{x}_d \pm z\,\frac{s_d}{\sqrt{n}}$

n < 30: $\bar{x}_d \pm t\,\frac{s_d}{\sqrt{n}}$, with df = n − 1
Example

• The blood pressure (BP) of 10 mothers was measured before and after taking a new drug.
$\bar{x}_d$ = −9.5, $s_d$ = 8.64, n = 10, df = 9, $t_{9,\,0.025}$ = 2.262

95% CI = (−9.5 − 2.262 × 8.64/3.162, −9.5 + 2.262 × 8.64/3.162) = (−15.68, −3.32)
The mean difference before and after taking the drug is statistically significant because the confidence interval does not contain zero.
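The paired-sample arithmetic, −9.5 ± 2.262 × 8.64/√10, can be verified directly. The t value 2.262 (df = 9, 95% confidence) is hard-coded from the calculation above, and the variable names are illustrative.

```python
import math

n, mean_d, s_d = 10, -9.5, 8.64   # number of pairs, mean difference, SD of differences
t_crit = 2.262                    # t for df = 9 at 95% confidence (from the t table)

se = s_d / math.sqrt(n)
margin = t_crit * se
lower, upper = mean_d - margin, mean_d + margin
contains_zero = lower <= 0 <= upper

print(f"95% CI = ({lower:.2f}, {upper:.2f}), contains zero: {contains_zero}")
# 95% CI = (-15.68, -3.32), contains zero: False
```

Both limits are negative, so zero lies outside the interval.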
Example
• A crossover trial is conducted to evaluate the effectiveness
of a new drug designed to reduce the symptoms of
depression in adults over 65 years of age following a stroke.
Symptoms of depression are measured on a scale of 0 to 100,
with higher scores indicative of more frequent and severe
symptoms of depression. Patients who suffered a stroke are
eligible for the trial. The trial is run as a crossover trial where
each patient receives both the new drug and a placebo.
Patients are blind to the treatment assignment, and the order of treatments (e.g., placebo and then new drug, or new drug and then placebo) is randomly assigned. After each treatment, depressive symptoms are measured in each patient. The difference in depressive symptoms is measured in each patient by subtracting the depressive symptom score after taking the placebo from the depressive symptom score after taking the new drug. A total of 100 participants completed the trial, and the data are summarized in the table below.
Summary statistics on differences in depressive symptoms

Difference                                                                  n     Mean difference   Std dev of difference
Depressive symptoms after new drug − depressive symptoms after placebo      100   −12.7             8.9

$\bar{x}_d \pm z\,\frac{s_d}{\sqrt{n}} = -12.7 \pm 1.96 \times \frac{8.9}{\sqrt{100}}$ = −12.7 ± 1.74, or (−14.4, −11.0)

We are 95% confident that the mean improvement in depressive symptoms after taking the new drug as compared to placebo is between 11.0 and 14.4 units. Because the 95% CI for the mean difference does not include 0, we can conclude that there is a statistically significant difference.
Reading assignments #3:

Confidence intervals for two


independent samples,
dichotomous outcome
Confidence intervals for two independent samples; dichotomous outcome
Reading assignments #4:

Sample size determination


for two independent
samples, continuous and
dichotomous outcome
Points to be considered
during sample size
determination
• Design effect
• Non-response rate
• Correction formula
THANK YOU!
