Lecture 5

Uploaded by

Moybon Kalif
Copyright
© © All Rights Reserved
Statistical Estimation Techniques

1
Introduction
In the real world, the values of population parameters are fixed
and usually not known.
Instead, we must try to say something about the way in which
a variable is distributed using the information contained in a
sample of observations.
The process of drawing conclusions about an entire population
based on the data in a sample is known as statistical inference.
Two broad categories: Estimation and Hypothesis testing.

2
Estimation
Estimation is concerned with estimating the values of specific population
parameters based on sample statistics.
It uses the information in a sample to estimate the characteristics
(parameters) of the source population.

Examples: a sample survey revealed:
• The proportion of smokers among a population group aged 15 to 24.
• The mean SBP among the sampled population.

The next question is what we can infer about the characteristics of
the population from which the sample was drawn.
3
Estimation, Estimator & Estimate
♣ Estimation is the computation of a statistic from sample data, often yielding a value that is an approximation (guess) of its target, an unknown true population parameter value.

♣ The statistic itself is called an estimator and can be of two types: point or interval.

♣ The value or values that the estimator assumes are called estimates.
4
Two methods of estimation are commonly used:
point estimation and interval estimation

Point estimation involves the calculation of a single number to estimate the population parameter.
Interval estimation specifies a range of reasonable values for the parameter.

5
Point versus Interval Estimators
♣ An estimator that represents a "single best guess" is called a point estimator.

♣ When the estimate takes the form of a "range of plausible values", it is called an interval estimator.

Thus:
• A point estimate is of the form: [ value ],
• whereas an interval estimate is of the form: [ lower limit, upper limit ].
6
The sample mean ($\bar{x}$) is an unbiased estimator of the population mean (µ).

7
Estimating the Sampling Error

• Any estimate derived from a sample is subject to sampling error.
• This comes from the fact that only a part of the population was observed, instead of the whole.
• Different samples could have produced different results.
• The amount of variation that exists among the estimates from the different possible samples is the sampling error.
8
• The set of sample means in repeated random samples of size n from a given population has variance $\sigma^2/n$.
• The standard deviation of this set of sample means is $\sigma/\sqrt{n}$ and is referred to as the standard error of the mean (sem) or the standard error.
• The sem is estimated by $s/\sqrt{n}$ if σ is unknown.

9
• The sampling error depends on the sample size (n) and the variability of individual sample points (σ).
• As n increases, the sample mean ($\bar{x}$) and the sample variance $s^2$ approach the values of the true population parameters, µ and $\sigma^2$, respectively.

10
Example
• Suppose that the mean ± sd of DBP in 20 old males is 78.5 ± 10.3 mm Hg.

1. What is our best estimate of µ?

2. What is the sem?

3. Compare the sem with the sd.

11
• The following table gives the se for the mean of DBP for different sample sizes.
• Our best estimate of µ is 78.5 mm Hg.
• The sem of this estimate is 10.3/√20 = 2.3 mm Hg.
• The sem (2.3) is much smaller than the sd (10.3).
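As a minimal sketch of how the sem shrinks with sample size, the snippet below applies sem = sd/√n to the DBP example. Only n = 20 and sd = 10.3 come from the example; the other sample sizes and the function name `standard_error` are illustrative assumptions.

```python
import math


def standard_error(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)


sd = 10.3  # sample sd of DBP from the example (mm Hg)

# n = 20 is the example's sample size; the remaining values are illustrative.
for n in (20, 50, 100, 500):
    print(f"n = {n:>3}: sem = {standard_error(sd, n):.2f} mm Hg")
```

Quadrupling the sample size halves the sem, because n enters through its square root.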

12
1. Point Estimate
• A single numerical value used to estimate the corresponding population parameter.
Sample statistics are estimators of population parameters:

Sample statistic (estimator)                 Population parameter
Sample mean, $\bar{x}$                       µ
Sample variance, $s^2$                       $\sigma^2$
Sample proportion, p                         P or π
Sample odds ratio, $\widehat{OR}$            OR
Sample relative risk, $\widehat{RR}$         RR
Sample correlation coefficient, r            ρ
13
2. Interval Estimation
• Interval estimation specifies a range of reasonable values for the population parameter based on a point estimate.
• A confidence interval is a particular type of interval estimator.

Confidence Intervals
• Give a plausible range of values, built around the estimate, that is likely to include the “true” (population) value with a given confidence level.
• An interval estimate provides more information about a population characteristic than a point estimate does.
14
• CIs also give information about the precision of an estimate.
• When sampling variability is high, the CI will be wide to reflect the uncertainty of the observation.
• Wider CIs indicate less certainty.
• CIs can also answer the question of whether or not an association exists (analogous to p-values).
• Narrow CIs reflect a large sample size, low variability, or both.
15
General Formula:
The general formula for all CIs is:

point estimate ± (measure of how confident we want to be) × (standard error)

• Point estimate: the value of the statistic in the sample (e.g., mean, proportion, etc.).
• Measure of confidence: taken from a Z table or a t table, depending on the sampling distribution of the statistic.

16
A confidence interval has 3 components:

1) A point estimate (e.g. the sample mean)

2) The standard error of the point estimate ( e.g. SEM =σ/√ n )

3) A confidence coefficient (conf. coeff)


Lower limit = Point Estimate - (Critical Value, i.e. confidence coefficient) × (Standard Error)
Upper limit = Point Estimate + (Critical Value, i.e. confidence coefficient) × (Standard Error)
17
Confidence Level
• The confidence with which the interval will contain the unknown population parameter.
• A percentage (less than 100%).
• Example: 95%, also written (1 - α) = 0.95.

18
Definition of 95% CI
1. Probabilistic interpretation:
 If all possible random samples of a given sample size were obtained
and if each were used to obtain its own CI, then 95% of all such CIs
would contain the unknown population parameter; the remaining 5%
would not.

2. Practical interpretation
• When sampling is from a normally distributed population with known standard deviation, we are 100(1-α)% [e.g., 95%] confident that the single computed interval contains the unknown population parameter.
19
Estimation for Single Population

20
1. CI for a Population Mean (normally distributed)

A. Known variance (large sample size)

Consider the task of computing a CI estimate of µ for a population distribution that is normal with σ known.
• Available are data from a random sample of size n.

21
Assumptions
• Population standard deviation (σ) is known
• Population is normally distributed
• If the population is not normal, use a large sample

A 100(1-α)% C.I. for µ is:

$\bar{x} \pm z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$

• α is chosen by the researcher; the most common values of α are 0.1, 0.05 and 0.01.
22
3. Commonly used CLs are 90%, 95%, and 99%

23
Finding the Critical Value
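A minimal sketch of how the critical value for a given confidence level can be looked up, assuming a standard normal sampling distribution. It uses `statistics.NormalDist` from the Python standard library in place of a printed Z table; the helper name `z_critical` is an assumption made for illustration.

```python
from statistics import NormalDist


def z_critical(confidence):
    """Two-sided critical value z_(alpha/2) for a confidence level such as 0.95."""
    alpha = 1 - confidence
    # The upper alpha/2 tail is cut off at the (1 - alpha/2) quantile.
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")
```

The three printed values are the familiar table entries 1.645, 1.960 and 2.576.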

24
Margin of Error
(Precision of the estimate)
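For the known-σ case treated in this section, the margin of error can be written as follows. This is a sketch under the assumption of a normal sampling distribution; the symbol E is introduced here only for illustration.

```latex
E = z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}},
\qquad
\text{CI} = \bar{x} \pm E
```

The interval has total width 2E, so anything that shrinks E (larger n, smaller σ, lower confidence level) narrows the interval.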

25
Factors Affecting Margin of Error

The width of the CI for a mean (the margin of error) is determined by n, s, and α.
• As n increases, the length of the CI decreases.
• As s increases, the length of the CI increases.
• As the confidence level increases (α decreases), the length of the CI increases.
26
Example:
1. Waiting times (in hours) at a particular hospital are believed to be approximately normally distributed with a variance of 2.25 hr².

a. A sample of 20 outpatients revealed a mean waiting time of 1.52 hours. Construct the 95% CI for the estimate of the population mean.

b. Suppose that the mean of 1.52 hours had resulted from a sample of 32 patients. Find the 95% CI.

c. What effect does a larger sample size have on the CI?
27


a. $1.52 \pm 1.96\sqrt{\dfrac{2.25}{20}} = 1.52 \pm 1.96(0.33) = 1.52 \pm 0.65 = (0.87,\ 2.17)$

• We are 95% confident that the true mean waiting time is between 0.87 and 2.17 hrs.
• 95% of the intervals formed in this manner will contain the true mean.

28
b. $1.52 \pm 1.96\sqrt{\dfrac{2.25}{32}} = 1.52 \pm 1.96(0.27) = 1.52 \pm 0.53 = (0.99,\ 2.05)$

c. A larger sample size makes the CI narrower (more precision).
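The waiting-time example can be checked with a short sketch, assuming the known-variance z interval used in this section. The function name `z_interval` is hypothetical. Because the code does not round the standard error before multiplying, its limits can differ from the hand calculation in the second decimal place.

```python
import math
from statistics import NormalDist


def z_interval(mean, variance, n, confidence=0.95):
    """CI for a population mean when the population variance is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(variance / n)  # z times the standard error
    return mean - margin, mean + margin


# Mean 1.52 hr and variance 2.25 hr^2 come from the example.
for n in (20, 32):
    lower, upper = z_interval(1.52, 2.25, n)
    print(f"n = {n}: ({lower:.2f}, {upper:.2f}), width = {upper - lower:.2f}")
```

The width printed for n = 32 is smaller than for n = 20, which is the point of part (c).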

29
• When constructing CIs, it has been assumed that the standard deviation of the underlying population, σ, is known.
• What if σ is not known?
• In this case, the population SD (σ) can be replaced by the sample SD (s) if the sample size is large enough (n > 30). With a large sample size, we assume the sample mean is approximately normally distributed.

30
• Example: In a sample of 35 patients, patients were on average 17.2 minutes late for appointments, with an SD of 8 minutes. What is the 90% CI for µ? Ans: 17.2 ± 1.645(8/√35) = (14.98, 19.42).
• Since the sample size is fairly large (> 30) and the population SD is unknown, we take the sample mean to be approximately normally distributed based on the CLT and use the sample SD in place of the population σ.

31
B. Unknown variance
(small sample size, n ≤ 30)
• What if σ for the underlying population is unknown and the sample size is small?
• As an alternative, we use Student’s t distribution.

32
33
Student’s t Distribution
• The t is a family of continuous probability distributions
• Bell shaped
• Symmetric about zero (the mean)
• Flatter than the Normal(0,1). This means:
  • The variability of a t is greater than that of a Z that is Normal(0,1).
  • Thus, there is more area under the tails and less at the center.
  • Because variability is greater, the resulting confidence intervals will be wider.
34
• Note: t approaches z as n increases

35
Student’s t Table

36
t distribution values
 With comparison to the Z value

37
Example

 Standard error =

 t-value at 90% CI at 19 df =1.729
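A minimal sketch of a t-based interval. The Python standard library has no t quantile function, so the critical value is passed in from the t table (1.729 for 90% confidence at 19 df, as quoted above). The data are an assumption: the DBP summary from the earlier example (n = 20, mean 78.5, sd 10.3) is reused purely for illustration, and the function name `t_interval` is hypothetical.

```python
import math


def t_interval(mean, sd, n, t_crit):
    """CI for a mean using the sample sd and a t critical value from a table."""
    se = sd / math.sqrt(n)  # estimated standard error of the mean
    margin = t_crit * se
    return mean - margin, mean + margin


# Assumed inputs: DBP summary from the earlier example; t_crit for 19 df at 90%.
lower, upper = t_interval(78.5, 10.3, 20, t_crit=1.729)
print(f"90% CI: ({lower:.2f}, {upper:.2f})")
```

Using the z value 1.645 instead of 1.729 would give a slightly narrower interval, which is why the t table matters for small samples.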

38
39
2. CIs for population proportion, p

A CI for a proportion is based on the same three elements:
• Point estimate
• SE of the point estimate
• Confidence coefficient
40
41
42
Lower limit = Point Estimate - (Critical Value) × (Standard Error of Estimate)
Upper limit = Point Estimate + (Critical Value) × (Standard Error of Estimate)

Hence,

$\hat{p} \pm 1.96\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$

is an approximate 95% CI for the true proportion p.

43
Example 1
• A random sample of 100 people shows that 25 are left-handed. Form a 95% CI for the true proportion of left-handers.

44
Interpretation

45
Example
• It was found that 28.1% of 153 cervical-cancer cases had never had a Pap smear prior to the time of the case’s diagnosis. Calculate a 95% CI for the percentage of cervical-cancer cases who never had a Pap test.
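Both proportion examples can be worked with a short sketch, assuming the large-sample normal approximation p̂ ± z·√(p̂(1-p̂)/n). The function name `proportion_ci` is an assumption made for illustration.

```python
import math
from statistics import NormalDist


def proportion_ci(p_hat, n, confidence=0.95):
    """Approximate (normal-theory) CI for a population proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p_hat
    return p_hat - z * se, p_hat + z * se


# Left-handers: 25 out of 100.
left_lo, left_hi = proportion_ci(25 / 100, 100)
print(f"Left-handed: ({left_lo:.3f}, {left_hi:.3f})")

# Cervical-cancer cases with no prior Pap smear: 28.1% of 153.
pap_lo, pap_hi = proportion_ci(0.281, 153)
print(f"No Pap smear: ({pap_lo:.3f}, {pap_hi:.3f})")
```

Under these assumptions the intervals come out at roughly 16.5% to 33.5% and 21.0% to 35.2%.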

46
Sample Size Determination
Too small a sample size:
• May fail to detect an important effect
• Estimates of effect may be too imprecise (wide CIs)

Too large a sample size:
• May result in wastage of resources.

As a rough guide, making generalizations about an entire population often calls for a total sample size of about 200-400; the exact number should come from a sample size calculation.
47
Confidence interval approach
• Given the confidence interval

  mean (or proportion) $\pm\ z_{\alpha/2} \times \text{s.e.}$

• the absolute precision (margin of error), denoted by d, is given as

  $d = z_{\alpha/2} \times \text{s.e.}$

• where s.e. is the standard error of the estimator of the parameter of interest.

48
Steps to determine sample size:
1. Specify the tolerable error (i.e., desired precision and confidence level via d and α)
2. Identify the appropriate equation relating the tolerable error (d, α) to the sample size (n)
3. Estimate the unknown quantities in the equation
4. Solve for n
5. Evaluate (and return to the first step if necessary)

The sample size calculation should relate to the study’s outcome variable.
49
Estimating a single population
mean/proportion
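The sample size formulas used in the worked examples of this section follow from setting d = z × s.e. and solving for n. This sketch takes s.e. = σ/√n for a mean and s.e. = √(p(1-p)/n) for a proportion, under simple random sampling.

```latex
n = \frac{z_{\alpha/2}^{2}\,\sigma^{2}}{d^{2}} \quad \text{(single mean)},
\qquad
n = \frac{z_{\alpha/2}^{2}\,p\,(1-p)}{d^{2}} \quad \text{(single proportion)}
```

Because d is squared in the denominator, halving the margin of error roughly quadruples the required sample size.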

50
Examples
1. A survey is being planned to determine what proportion of families in a certain area are medically indigent. Previous studies found the proportion to be 0.35. A 95% confidence interval is desired with d = 5%. What sample size of families should be selected?
2. Suppose that you are interested in the proportion of infants who are breastfed beyond 18 months of age in a rural area. Suppose that in a similar area, the proportion (p) of breastfed infants was found to be 0.20. What sample size is required to estimate the true proportion within ±3 percentage points with 95% confidence? Let p = 0.20, d = 0.03, α = 5%.
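The two proportion examples can be worked with a minimal sketch, assuming the formula n = z²p(1-p)/d² with z = 1.96 for 95% confidence. The function name `n_for_proportion` is hypothetical, and rounding up to the next whole subject is a conventional choice, not something stated in the examples.

```python
import math


def n_for_proportion(p, d, z=1.96):
    """Unrounded sample size needed to estimate a proportion within +/- d."""
    return z ** 2 * p * (1 - p) / d ** 2


# Example 1: p = 0.35, d = 0.05.  Example 2: p = 0.20, d = 0.03.
for p, d in ((0.35, 0.05), (0.20, 0.03)):
    raw = n_for_proportion(p, d)
    print(f"p = {p}, d = {d}: n = {raw:.2f}, rounded up to {math.ceil(raw)}")
```

Under these assumptions the examples need about 350 and 683 subjects respectively.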

52
Example
3. Suppose that for a certain group of cancer patients, we are interested in estimating the mean age at diagnosis. We would like a 95% CI with a margin of error of 2 years.

If the population variance is 124 years² (SD ≈ 11.1 years), how large should our sample be?

$n = \dfrac{z_{\alpha/2}^{2}\,\sigma^{2}}{d^{2}} = \dfrac{1.96 \times 1.96 \times 124}{2 \times 2} \approx 119$

53
Suppose there is no prior information about the proportion (p) who breastfeed.

For a fixed absolute precision (d), the required sample size increases as p increases from 0 to 0.5, and then decreases in the same way as p approaches 1. Taking p = 0.5 therefore gives the largest (most conservative) sample size.

54
• An estimate of p is not always available.
• However, the formula may also be used for sample size calculation based on various assumptions for the value of p (here with d = 0.05 and 95% confidence):

P = 0.1 → n = (1.96)²(0.1)(0.9)/(0.05)² = 138
P = 0.2 → n = (1.96)²(0.2)(0.8)/(0.05)² = 246
P = 0.3 → n = (1.96)²(0.3)(0.7)/(0.05)² = 323
P = 0.5 → n = (1.96)²(0.5)(0.5)/(0.05)² = 384
P = 0.7 → n = (1.96)²(0.7)(0.3)/(0.05)² = 323
P = 0.8 → n = (1.96)²(0.8)(0.2)/(0.05)² = 246
55
Some Considerations

56
Using design effect
• The loss of effectiveness from using cluster sampling, instead of simple random sampling, is the design effect.
• The design effect is the ratio of the actual variance, under the sampling method actually used, to the variance computed under the assumption of simple random sampling.
Using design effect cont.…
As commonly used rules of thumb (the true design effect depends on cluster size and within-cluster correlation):
• When simple or systematic random sampling is used, the design effect is one.
• When cluster sampling is used, the design effect is taken as two.
• When multistage sampling is used, the design effect is taken as equal to the number of stages.
Hypothesis Testing

60
Introduction
In statistical analyses, hypotheses are formulated, experiments are performed, and results are evaluated for their consistency (or inconsistency) with a hypothesis.
Hypothesis Testing (HT) provides an objective framework for making
decisions using probabilistic methods.
The purpose of HT is to aid the clinician, researcher or administrator
in reaching a decision (conclusion).

61
Hypothesis
A statistical hypothesis is an assumption, claim or statement, which may or may not be true, concerning one or more populations.
It is a statement about one or more population parameters.

62
Examples of Research
Hypotheses
Population Mean
The average length of stay of patients admitted to the hospital is five
days
The mean birth weight of babies delivered by mothers with low SES is
lower than those from higher SES.

Population Proportion
The proportion of adult smokers in Harar is P = 0.40.
The prevalence of HIV among non-married adults is higher than that among married adults.
63
Types of Hypothesis

1. The Null Hypothesis, H0


Is a statement claiming that there is no difference between the
hypothesized value and the population value. (The effect of
interest is zero = no difference)
States the assumption (hypothesis) to be tested
H0 is a statement of agreement (or no difference)
H0 is always about a population parameter, not about a sample
statistic
64
Begin with the assumption that the Ho is true
Similar to the notion of innocent until proven guilty

Always contains “=” , “ ≤” or “≥ ” sign

May or may not be rejected

65
2. The Alternative Hypothesis, HA

Is a statement of what we will believe is true if our sample data cause us to reject Ho.
Is generally the hypothesis that is believed (or needs to be supported) by the researcher.
Is a statement that disagrees with (opposes) Ho (the effect of interest is not zero).
Never contains the “=”, “≤” or “≥” sign; it contains “≠”, “<” or “>” instead.
May or may not be accepted


66
Steps in Hypothesis Testing
1. Choose the null hypothesis that is to be questioned.
2. Choose an alternative hypothesis, which is accepted if the original hypothesis is rejected.
3. Choose a rule for making a decision about when to reject the original hypothesis and when to fail to reject it.
4. Choose a random sample from the appropriate population and compute the appropriate statistics: that is, mean, proportions and so on.
5. Make the decision.
67
Rules for Stating Statistical
Hypotheses
Indication of equality (either =, ≤ or ≥) must appear in Ho.

Ho: μ = μo, HA: μ ≠ μo

Ho: P = Po, HA: P ≠ Po


Can we conclude that a certain population mean is not 50?

Ho: μ = 50 and HA: μ ≠ 50


 Can we conclude that a certain population mean is greater than
50?

Ho: μ ≤ 50 HA: μ > 50


68
Can we conclude that the proportion of patients with leukemia
who survive more than six years is not 60%?

Ho: P = 0.6 HA: P ≠ 0.6


Now think about how the hypothesis test should be carried out.
We draw a random sample of size n from the underlying population and calculate its sample mean ($\bar{x}$).
We compare $\bar{x}$ to the postulated mean $\mu_0$.

69
Decision Rule
The decision to reject or not to reject Ho is based on the magnitude of the test statistic.
An example of a test statistic is the quantity

$z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$

When the variance of the population is unknown and the sample size is small, the test statistic is:

$t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$

70
Rejection and Non-Rejection Regions
The values that the test statistic can assume lie on the horizontal axis of its sampling distribution and are divided into two groups:
• Rejection region, and
• Non-rejection region.

The values of the test statistic forming the rejection region are less likely to occur if Ho is true.
The values making up the non-rejection region are more likely to occur if Ho is true.
71
Example: Two-sided test at α = 5%
[Figure: standard normal curve. Rejection regions of area α/2 = 0.025 lie in each tail (Z ≤ -1.96 and Z ≥ 1.96); the central non-rejection region has area 0.95.]

72
Statistical Decision
Reject Ho if the value of the test statistic that we compute from our sample is one of the values in the rejection region.
Don’t reject Ho if the computed value of the test statistic is one of the values in the non-rejection region.

73
Level of Significance, α
Is the probability of rejecting a true Ho (type I error)
Defines the unlikely values of the sample statistic if Ho is true
Defines the rejection region of the sampling distribution
The decision is made on the basis of the level of significance, designated by α.
The most frequently used values of α are 0.01, 0.05 and 0.10.
α is selected by the researcher at the beginning

74
One tail and two tail tests
In a one tail test, the rejection region is at one end of the distribution or the other.
In a two tail test, the rejection region is split between the two tails.
Which one is used depends on the way the hypotheses are written: the direction of HA determines the tail.

Level of Significance and the Rejection Region
Example:
The average survival after cancer diagnosis is less than 3 years (HA: µ < 3, a lower tail test).

75
76
Types of Errors in Hypothesis Tests
Whenever we reject or fail to reject Ho, we risk committing an error.
Two types of errors can be committed:
• Type I Error
• Type II Error

77
Type I Error
The error committed when a true Ho is rejected
The probability of a type I error is the probability of rejecting Ho when it is true.
The probability of a type I error is α
• Called the level of significance of the test
• Set by the researcher in advance

78
Type II Error
The error committed when a false Ho is not rejected
The probability of a Type II Error is β

Power
The probability of rejecting Ho when it is false.
Power = 1 – β = 1 – probability of type II error
We would like to maintain a low probability of a Type I error (α) and a low probability of a Type II error (β) [high power = 1 – β].

79
Action (Conclusion)    Reality: Ho True                          Reality: Ho False
Do not reject Ho       Correct action (Prob. = 1-α)              Type II error (Prob. = β = 1-Power)
Reject Ho              Type I error (Prob. = α = Sign. level)    Correct action (Prob. = Power = 1-β)

80
Type I & II Error
Relationship

81
1. Hypothesis Testing of a Single Mean (Normally Distributed)

82
1.1 Known Variance

83
Example:
1. A simple random sample of 10 people from a certain population has a mean age of 27. Can we conclude that the mean age of the population is not 30? The variance is known to be 20 and the population is normally distributed. Take α = 0.05.
Data: n = 10, sample mean = 27, σ² = 20, α = 0.05

84
State the Hypotheses

Ho: µ = 30

HA: µ ≠ 30

Test statistic
 As the population variance is known, we use Z as the test
statistic.

85
Decision Rule
Reject Ho if the Z value falls in the rejection region.

Don’t reject Ho if the Z value falls in the non-rejection region.

Because of the structure of Ho it is a two tail test.

Therefore, reject Ho if Z ≤ -1.96 or Z ≥ 1.96.

86
Calculation of test statistic

$z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \dfrac{27 - 30}{\sqrt{20/10}} = -2.12$

Statistical decision
• We reject Ho because Z = -2.12 is in the rejection region.

Conclusion
• We conclude that the mean age of the population is not 30.
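The calculation for this example can be sketched in a few lines, assuming the known-variance z statistic z = (x̄ - µ0)/(σ/√n). The p-value is an addition for illustration (the example itself uses only critical values), and the function name `z_test` is hypothetical.

```python
import math
from statistics import NormalDist


def z_test(sample_mean, mu0, variance, n):
    """Return the z statistic and its two-sided p-value (known variance)."""
    z = (sample_mean - mu0) / math.sqrt(variance / n)
    p_two_sided = 2 * NormalDist().cdf(-abs(z))
    return z, p_two_sided


# Data from the example: n = 10, mean 27, variance 20, Ho: mu = 30.
z, p_value = z_test(27, 30, 20, 10)
z_crit = NormalDist().inv_cdf(0.975)  # two-sided critical value at alpha = 0.05

print(f"z = {z:.2f}, critical value = ±{z_crit:.2f}, p = {p_value:.3f}")
print("Reject Ho" if abs(z) >= z_crit else "Do not reject Ho")
```

A one-tailed version (HA: µ < 30) would compare z with -1.645 and use `NormalDist().cdf(z)` as the p-value.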

87
Hypothesis test using confidence
interval
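A problem like the above example can also be solved using a confidence interval.
If the hypothesized value µ0 does not fall within the boundaries of the 100(1-α)% confidence interval, Ho is rejected at level α.

Confidence interval

$27 \pm 1.96\sqrt{20/10} = 27 \pm 2.77 = (24.23,\ 29.77)$

Because 30 lies outside this interval, Ho: µ = 30 is rejected at α = 0.05.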

88
Example: One -Tailed Test
A simple random sample of 10 people from a certain population has a
mean age of 27. Can we conclude that the mean age of the population
is less than 30? The variance is known to be 20. Let α = 0.05.
Data
n = 10, sample mean = 27, σ² = 20, α = 0.05
Hypotheses
Ho: µ ≥ 30, HA: µ < 30

89
Test statistic

$z = \dfrac{27 - 30}{\sqrt{20/10}} = -2.12$

Rejection Region
Lower tail test
With α = 0.05 and the inequality in HA, the entire rejection region is in the left tail.
The critical value is Z = -1.645. Reject Ho if Z < -1.645.
90
Statistical decision
 We reject the Ho because -2.12 < -1.645.
Conclusion
 We conclude that µ < 30.

91
Suppose that the Ho and HA take the form

Ho: µ = µo, HA: µ > µo


In this case, Ho would be rejected for large values of test statistic

Upper tail test

92
1.2 Unknown Variance
In most practical applications the standard deviation of the underlying population is not known.
In this case, σ can be estimated by the sample standard deviation s.
If the underlying population is normally distributed, then the test statistic is:

$t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$, which follows Student's t distribution with n - 1 degrees of freedom when Ho is true.

93
Example: Two-Tailed Test
A random sample of 14 people from a certain population gives a sample mean body mass index (BMI) of 30.5 and an sd of 10.64. Can we conclude that the mean BMI is not 35 at α = 5%?

Ho: µ = 35, HA: µ ≠ 35

Test statistic

$t = \dfrac{30.5 - 35}{10.64/\sqrt{14}} = -1.58$

If the assumptions are correct and Ho is true, the test statistic follows Student's t distribution with 13 degrees of freedom.

94
Decision rule

We have a two tailed test. With α = 0.05, each tail has area 0.025. The critical t values with 13 df are -2.1604 and 2.1604. We reject Ho if t ≤ -2.1604 or t ≥ 2.1604.

Do not reject Ho because -1.58 is not in the rejection region.

Conclusion: Based on the sample data, there is not sufficient evidence to conclude that µ differs from 35.
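The BMI example can be reproduced with a minimal sketch. The t statistic is computed directly; the critical value 2.1604 is taken from the t table as in the example, since the Python standard library does not provide t quantiles. The function name `one_sample_t` is an assumption.

```python
import math


def one_sample_t(sample_mean, mu0, sd, n):
    """t statistic for a single mean when the population variance is unknown."""
    return (sample_mean - mu0) / (sd / math.sqrt(n))


t_stat = one_sample_t(30.5, 35, 10.64, 14)
t_crit = 2.1604  # table value for 13 df, two-sided alpha = 0.05

print(f"t = {t_stat:.2f} with {14 - 1} df")
print("Reject Ho" if abs(t_stat) >= t_crit else "Do not reject Ho")
```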

95
Summary
Population mean, known population variance (or standard
deviation): Normal test.
Population mean, Unknown population variance (or standard
deviation) and small sample: Student’s t-test.
Single population proportion: Normal test.

96
Hypothesis Tests for Proportions
• Involve categorical values
• Two possible outcomes:
  • “Success” (possesses a certain characteristic)
  • “Failure” (does not possess that characteristic)
• The fraction or proportion of the population in the “success” category is denoted by p

97
Hypothesis Testing for Population Proportion

98
Example
We are interested in the probability of developing asthma over a
given one-year period for children 0 to 4 years of age whose mothers
smoke in the home.
In the general population of 0 to 4-year-olds, the annual incidence of
asthma is 1.4%. If 10 cases of asthma are observed over a single year
in a sample of 500 children whose mothers smoke, can we conclude
that this is different from the underlying probability of p0 = 0.014?
alpha = 5% H0 : p = 0.014
HA: p ≠ 0.014
99
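The test statistic is given by:

$z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} = \dfrac{0.02 - 0.014}{\sqrt{0.014(0.986)/500}} = 1.14$, where $\hat{p} = 10/500 = 0.02$.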

100
The critical value of Zα/2 at α = 5% is ±1.96.
Don’t reject Ho since Z (= 1.14) lies in the non-rejection region between ±1.96.
We do not have sufficient evidence to conclude that the probability of developing asthma for children whose mothers smoke in the home is different from the probability in the general population.
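The asthma example can be checked with a short sketch, assuming the one-sample z test for a proportion with the standard error computed under Ho. The two-sided p-value is an illustrative addition, and the function name `proportion_z_test` is hypothetical.

```python
import math
from statistics import NormalDist


def proportion_z_test(successes, n, p0):
    """z statistic and two-sided p-value for Ho: p = p0."""
    p_hat = successes / n
    se_null = math.sqrt(p0 * (1 - p0) / n)  # standard error under Ho
    z = (p_hat - p0) / se_null
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value


# 10 asthma cases among 500 children; p0 = 0.014 from the general population.
z, p_value = proportion_z_test(10, 500, 0.014)
print(f"z = {z:.2f}, p = {p_value:.3f}")
print("Reject Ho" if abs(z) >= 1.96 else "Do not reject Ho")
```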

101
Chi-squared test
The Chi-squared test measures the disparity between observed frequencies (data from the sample) and expected frequencies.
It helps to check for association between categorical variables.

The Chi-squared test is valid
• if no observed cell count is 0, and
• if no more than 20% of the expected cell counts are less than 5.
Chi-square test (χ²)…
The chi-square test is used for nominal or ordinal explanatory and response variables.
Variables can have any number of distinct levels.
If the two variables have two levels each, the resulting contingency table will be 2×2:

                    Variable 1
Variable 2     Diseased   Not diseased   Total
Exposed        a          b              a+b
Not exposed    c          d              c+d
Total          a+c        b+d            N

$\chi^2_{cal} = \dfrac{N(ad - bc)^2}{(a+c)(b+d)(a+b)(c+d)}$
Chi-square test (χ²)…
• If the two variables have more than two levels (r rows and c columns), the resulting table will be r×c:

                      Variable 1
Variable 2     1      2      …      c      Total
1              O11    O12    …      O1c    r1
2              O21    O22    …      O2c    r2
…              …      …      …      …      …
r              Or1    Or2    …      Orc    rr
Total          c1     c2     …      cc     n

$\chi^2 = \sum_{\text{all cells}} \dfrac{(\text{observed frequency} - \text{expected frequency})^2}{\text{expected frequency}} = \sum_{\text{all cells}} \dfrac{(O - E)^2}{E}$
Chi-square test (χ²)…
Oij is the observed frequency and Eij is the expected frequency.

Expected frequency (E) for the ith row and the jth column (Eij):

$E_{ij} = \dfrac{(\text{row } i \text{ total})(\text{column } j \text{ total})}{\text{sample size}} = \dfrac{r_i\,c_j}{n}$

where ri = row i total, cj = column j total, and n = total sample size.

• The degrees of freedom are df = (r-1)×(c-1),
• where r = number of rows and c = number of columns in the contingency table.
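The expected-frequency and degrees-of-freedom formulas above can be sketched for a general r×c table as follows. The function name `chi_square` and the 2×3 table of counts are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # E_ij = r_i * c_j / n
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical 2 x 3 table of observed counts (not from the lecture).
observed = [[10, 20, 30],
            [20, 20, 20]]
statistic, df = chi_square(observed)
print(f"chi-square = {statistic:.2f} with {df} df")
```

For this table every expected count is 15, 20 or 25, so the statistic is 2 × (25/15 + 0 + 25/25) ≈ 5.33 with 2 df; it would then be compared with the table value of χ² at 2 df.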
Chi-square test (χ²)…
Hypothesis testing steps in the chi-square test
1. State hypotheses:
   Null hypothesis (Ho): The classification variables are independent.
   Alternative hypothesis (Ha): There is a relationship between the variables.
2. Compute the test statistic: get the calculated χ² value.
3. Determine the critical value: find the table value of χ² at the given df.
4. Decision: Reject Ho if the calculated χ² > the critical value of χ² from the table.


Example 1

Consider the following 2×2 table. Is there an association between wearing a helmet and head injury? Use the 95% confidence level (α = 0.05).

                   Wearing helmet
Head injury     Yes     No      Total
Yes             17      218     235
No              130     428     558
Total           147     646     793


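Example 1…
Step 1: Hypotheses
• Ho: There is no association between wearing a helmet and head injury
• HA: There is an association between wearing a helmet and head injury
Step 2: Test statistic

$\chi^2 = \dfrac{N(ad - bc)^2}{n_D\,n_{ND}\,n_E\,n_{NE}} = \dfrac{793\,(17 \times 428 - 218 \times 130)^2}{235 \times 558 \times 147 \times 646} = 28.26$

Step 3: Critical value: χ² with 1 df at α = 0.05 is 3.84
Step 4: Decision: since 28.26 > 3.84, reject the null hypothesis and conclude that there is an association between wearing a helmet and head injury.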
Example 2

A sample of 263 students who bought lunch at a school canteen were asked whether or not they developed gastroenteritis. The responses are given below.

                   Gastroenteritis
Ate sandwich    Yes     No      Total
Yes             109     116     225
No              4       34      38
Total           113     150     263


Example 2…
• Step 1: Hypotheses
  • Ho: There is no association between eating a sandwich and gastroenteritis
  • Ha: There is an association between eating a sandwich and gastroenteritis
• Step 2: Test statistic: χ² = 17.6 (this value includes Yates' continuity correction; the uncorrected formula N(ad-bc)²/[(a+c)(b+d)(a+b)(c+d)] gives 19.07)
• Step 3: Critical value of χ²1(0.05) = 3.84
• Step 4: Decision: since 17.6 > 3.84, reject the null hypothesis and conclude that there is an association between eating a sandwich and gastroenteritis.
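Both 2×2 examples can be checked with a minimal sketch of the shortcut formula N(ad-bc)²/[(a+c)(b+d)(a+b)(c+d)]. The `yates` switch and the function name `chi_square_2x2` are assumptions added for illustration. The optional Yates continuity correction is not described in the lecture, but it is the variant that reproduces the value 17.6 quoted for the sandwich example.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for a 2x2 table with cells a, b (row 1) and c, d (row 2)."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Continuity correction: shrink |ad - bc| by n/2, never below zero.
        diff = max(diff - n / 2, 0)
    return n * diff ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))


helmet = chi_square_2x2(17, 218, 130, 428)
sandwich = chi_square_2x2(109, 116, 4, 34)
sandwich_yates = chi_square_2x2(109, 116, 4, 34, yates=True)

print(f"Helmet example:            {helmet:.2f}")
print(f"Sandwich example:          {sandwich:.2f}")
print(f"Sandwich example (Yates):  {sandwich_yates:.2f}")
```

All three values exceed the 1 df critical value of 3.84, so the decision to reject the null hypothesis is the same with or without the correction.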
Parametric vs nonparametric test
• Statistical methods that depend on assumptions about the distribution of the data in the population are referred to as parametric methods.
• Parametric tests include the t-test, ANOVA, regression, correlation, and so on.
• To use a parametric test, we must assume a normal distribution for the dependent variable, equality of variance where populations are compared, and a large sample size.
• However, in real research situations things do not come with labels detailing the characteristics of the population of origin.
111
01/19/25
Parametric vs nonparametric test
• Non-parametric statistics (sometimes called distribution-free statistics) were designed to be used when we know nothing about the distribution of the variable of interest in the population.
• They require fewer assumptions about the population probability distribution.
• They also handle data collected in the form of rankings.
• Nonparametric methods are often the way to analyze nominal or ordinal data and draw statistical conclusions.
112
Parametric vs nonparametric test
More generally, nonparametric methods have the following advantages:
• Methods are quick and easy to apply.
• More powerful when the assumption of normality has been violated.
• Can be used with small sample sizes.
• Less affected by the presence of outliers.
• Less sensitive to measurement error, as they use ranks.
• Inherently robust due to the lack of stringent assumptions.
113
Parametric Versus Non-parametric tests

Parametric          Non-parametric
Unpaired t-test     Wilcoxon rank sum test / Mann-Whitney-Wilcoxon test
Paired t-test       Wilcoxon signed rank test
One-way ANOVA       Kruskal-Wallis test

114
1. Wilcoxon Signed-Rank Test
• This test is the nonparametric alternative to the parametric matched-sample (paired) test.
• The parametric matched-sample analysis requires the assumption that the population of differences between the pairs of observations is normally distributed.
• If the assumption of normally distributed differences is not appropriate, the Wilcoxon signed-rank test can be used.
115
1. Wilcoxon Signed-Rank Test (cont’d)
Case: We are testing the effectiveness of a new fuel additive. We run an experiment with 12 cars. We first run each car without the fuel treatment and measure the mileage. We then add the fuel treatment and repeat the experiment.

We want to test the null hypothesis that the treatment had no effect.
STATA CODE: signrank mpg1=mpg2
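To illustrate what a signed-rank procedure ranks, the sketch below computes only the positive and negative rank sums for paired data. It is not a reimplementation of Stata's `signrank` and produces no p-value. The mileage figures and the function name `signed_rank_sums` are hypothetical.

```python
def signed_rank_sums(before, after):
    """Return (W+, W-): rank sums of positive and negative paired differences."""
    # Zero differences are dropped, as in the usual signed-rank procedure.
    diffs = [a - b for b, a in zip(before, after) if a != b]

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(len(diffs)), key=lambda k: abs(diffs[k]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical mileage for five cars, without and with the additive.
mpg1 = [20, 23, 21, 25, 18]
mpg2 = [24, 25, 21, 22, 23]
print(signed_rank_sums(mpg1, mpg2))
```

If the treatment had no effect, W+ and W- would be expected to be of similar size; a large imbalance is evidence against the null hypothesis.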

116
1. Wilcoxon Signed-Rank Test (cont’d)
Stata outputs

Conclusion: We reject Ho and conclude that the treatment had an effect.

117
2. Wilcoxon rank-sum test / Mann-Whitney-Wilcoxon Test
• It is frequently used as the nonparametric analogue of Student’s t-test to compare two independent groups.
• It is used when the data are not normally distributed.
• It is also used when the original measurements were made on an ordinal scale.
• This test, unlike the Wilcoxon signed-rank test, is not based on a matched sample.

118
2. Wilcoxon rank-sum test / Mann-Whitney-Wilcoxon Test (cont’d)
• This method tests whether the two populations are identical.

The hypotheses are:
H0: The two populations are identical
Ha: The two populations are not identical

119
2. Wilcoxon rank-sum test / Mann-Whitney-Wilcoxon Test (cont’d)

• We want to check whether weight differs between male and female infants (assume that normality is violated even after transformation).

STATA CODE: ranksum weight, by(sexChild)

120
2. Wilcoxon rank-sum test / Mann-Whitney-Wilcoxon Test (cont’d)

Stata outputs

Conclusion: We reject Ho and conclude that there is a weight difference between male and female infants.

121
2. Wilcoxon rank-sum test / Mann-Whitney-Wilcoxon Test (cont’d)
Student class activity 1
1. Check whether length differs between male and female infants (assume that normality is violated even after transformation).

122
3. Kruskal-Wallis Test
• For a Gaussian outcome, the means of three or more independent groups are compared by one-way ANOVA.
• The assumptions of one-way ANOVA may not be met, i.e.:
  • the populations are not normally distributed with equal variance, or
  • the data consist of only ranks.
• In that case, the alternative is the Kruskal-Wallis one-way analysis of variance by ranks.

123
3. Kruskal-Wallis Test (cont’d)
• The Mann-Whitney-Wilcoxon test can be used to test whether two populations are identical.
• The MWW test has been extended by Kruskal and Wallis to cases of three or more populations.
• The Kruskal-Wallis test can be used with ordinal data as well as with interval or ratio data.
• Also, the Kruskal-Wallis test does not require the assumption of normally distributed populations.

The hypotheses are:
H0: All populations are identical
Ha: Not all populations are identical

124
3. Kruskal-Wallis Test (cont’d)

• We want to check whether weight differs among categories of gravidity (assume that normality is violated even after transformation).

STATA CODE: kwallis weight, by(CatGravida)

125
3. Kruskal-Wallis Test (cont’d)

Stata outputs

Conclusion: We reject Ho and conclude that weight differs among categories of gravidity.

126
3. Kruskal-Wallis Test (cont’d)

Students class activity 2

1. Check whether length differs among categories of gravidity (assume that normality is violated even after transformation).

127
Spearman rank correlation
• Measures the strength and direction of the monotonic association between two variables that are measured on an ordinal or continuous scale.
• The Spearman correlation coefficient is often denoted by the symbol rs (or the Greek letter ρ, pronounced rho).
• It is useful when Pearson's correlation cannot be run due to violations of normality, a non-linear relationship, or when ordinal variables are being used.
• Stata code: spearman variable1 variable2, or using the menu bar: Statistics > Nonparametric analysis > Tests of hypotheses > Spearman's rank correlation

128
Quiz
Write the null and alternative hypotheses for the following statements (1 & 2).

1. The average weight of patients admitted to the hospital is 45 kg.

2. The proportion of children exposed to asbestos is 0.21.

TRUE OR FALSE (3 & 4)

3. The chi-square distribution is used to assess association between categorical variables.

4. A hypothesis is written in terms of a sample statistic.

5. A researcher who rejects a null hypothesis that is in fact true is committing which type of error?
A. Type one error
B. Type two error
C. Both types of error

129
THE END OF THE COURSE

THANKS!!!

All the best!!!

130
