Lecture Note On Biostatistics

The lecture notes on Biostatistics cover fundamental concepts such as statistics, data types, measurement scales, and the distinction between descriptive and inferential statistics. It discusses various statistical measures, including central tendency and dispersion, as well as probability concepts and hypothesis testing. Additionally, it addresses correlation analysis and non-parametric tests, providing a comprehensive overview of biostatistical methods and their applications in biological and health sciences.

Uploaded by

zambuzaw
Lecture note on
Biostatistics

Wai Yan Htun 1/1/18


Statistics – A field of study concerned with the collection, organization, summarizing and analysis of data and the
drawing of inference about a body of data when only a part of data is observed.

A descriptive measure computed from the data of a sample is called a statistic.

A descriptive measure computed from the data of a population is called a parameter.

Biostatistics is the science of data when the focus is on the biological and health sciences.

Summarizing and describing data with graphs, tables and summary measures is called descriptive statistics.

Drawing conclusions, decisions or predictions about a population from sample evidence is called inferential statistics.

Data – The numbers that result from measuring or counting; the raw material of statistics.

Variable – A characteristic that can take different values for different people, places, materials, etc. E.g. height, sex, age.

Types of variables

1. Qualitative variable – It measures a quality or characteristic of each experimental unit, e.g. eye color, gender, degree. Presented graphically. Categories: nominal and ordinal.
2. Quantitative variable – It measures a numerical quantity or amount for each experimental unit. Summarized by measures of central tendency and measures of dispersion.
Two classifications
1. Discrete (no. of children, no. of eggs)
2. Continuous (age, weight, height)

Random variable – A variable whose values cannot be exactly predicted in advance because they arise from chance factors.

Measurement scales

1. Nominal: Categorized only; naming or classifying observations into mutually exclusive and collectively exhaustive categories. E.g. gender (male/female), status (no IHD, IHD).
2. Ordinal: Categorized and ranked according to some criterion. E.g. degree of pain, socioeconomic status.
3. Interval: Categorized, ranked and measured in constant units, so the distance between two measurements is known, but the zero is arbitrary (not a true zero). E.g. temperature in degrees, IQ level.
4. Ratio: Categorized, ranked, measured in constant units and with a true zero, i.e. zero means absence of the characteristic. E.g. height, weight.

Data graphical presentation

For numerical data – measures of central tendency and measures of dispersion. Histogram, frequency polygon, line chart, stem and leaf plot, and box and whisker plot can be used.

For categorical data – tables and frequency distributions. Pie chart and bar graph can be used.

Measurements of central tendency – Mean, Median, Mode

Properties of mean

• Unique (only one)
• Simple to calculate
• Extreme values (outliers) can influence the mean. A trimmed mean reduces the influence of outliers.

Properties of median

• Unique and simple to calculate
• Not affected by outliers

Trimmed mean

Estimators that are insensitive to outliers are called robust estimators. Another robust measure and estimator of central tendency is the trimmed mean.

1. Order the measurements.
2. Discard a chosen percentage of the smallest measurements and the same percentage of the largest measurements.
3. Compute the arithmetic mean of the remaining measurements.
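
The three steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `trimmed_mean`, the 10% trimming proportion and the data are assumptions chosen for the example.

```python
def trimmed_mean(values, proportion):
    """Mean after discarding `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)                # step 1: order the measurements
    k = int(len(ordered) * proportion)      # how many to drop from each end
    kept = ordered[k:len(ordered) - k]      # step 2: discard both tails
    return sum(kept) / len(kept)            # step 3: mean of what remains


# Hypothetical data with one extreme value (98).
data = [2, 4, 5, 5, 6, 6, 7, 8, 9, 98]
plain = sum(data) / len(data)       # ordinary mean, pulled up by the outlier
robust = trimmed_mean(data, 0.10)   # drops the smallest and largest value
print(plain, robust)
```

With these made-up values the ordinary mean is far from most of the data, while the trimmed mean stays close to the bulk of the observations.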

Measurements of dispersion (Deviation, Variation, Scatter, Spread)

• Range
• Variance
• Standard deviation
• Standard error
• Coefficient of variation
• Percentiles, quartiles
• Interquartile range

Range

• Difference between the maximum and minimum values

Interquartile range (IQR) = Q3 − Q1, the difference between the third and first quartiles.

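Variance

• Dispersion related to the scatter of values about their mean
• The average of the squared deviations of the individual values from the mean
• The sum of the squared deviations of the values from their mean is divided by the number of observations (for a sample, by the sample size minus 1)

Formula:

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ (for sample)

$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$ (for population)

Standard deviation: The square root of the variance (the variation of individual values). Journals usually report Mean ± SD or Median ± IQR.

Formula:

$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$ (for population) $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$ (for sample)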

Difference between SD and SE

• SD measures the dispersion of individual observations about the mean.
• SE measures the precision of a sample statistic (such as the sample mean) as an estimate of the population value; for the mean, SE = SD/√n.

Kurtosis

Kurtosis is a measure of the degree to which a distribution is "peaked" or flat in comparison to a normal distribution, whose graph is characterized by a bell-shaped appearance.
Characteristics of normal distribution

1. It is symmetrical about the mean (µ). The curve on either side of µ is mirror image of the other side.
2. The mean, median and mode are all equal.
3. The total area under the curve above X axis is one square unit.
4. Within 1 SD of the mean in both directions the area is 68.2%; within 2 SD and 3 SD it is 95.5% and 99.7% respectively.
5. The normal distribution is determined by µ (location: shifts the curve along the X axis) and σ (spread: degree of flatness). In other words, the mean fixes the centre and the SD fixes the width of the bell shape.
6. The tails extend indefinitely and never touch the base line.

Coefficient of Variation

Formula: CV = (SD / Mean) × 100%
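
The dispersion measures described above (variance, standard deviation, standard error and coefficient of variation) can be computed with Python's standard `statistics` module. This is only a sketch; the sample values are hypothetical.

```python
import math
import statistics

data = [4, 8, 6, 5, 3, 7, 9]               # hypothetical sample
n = len(data)

mean = statistics.mean(data)
sample_var = statistics.variance(data)     # sum of squared deviations / (n - 1)
pop_var = statistics.pvariance(data)       # sum of squared deviations / N
sd = math.sqrt(sample_var)                 # sample standard deviation
se = sd / math.sqrt(n)                     # standard error of the mean
cv = sd / mean * 100                       # coefficient of variation, in percent

print(f"mean={mean}, s^2={sample_var:.3f}, sigma^2={pop_var}, "
      f"SD={sd:.3f}, SE={se:.3f}, CV={cv:.1f}%")
```

Note that the same data give a larger variance with the sample formula (divisor n − 1) than with the population formula (divisor N).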

Central limit theorem

If random samples of n observations are drawn from a non-normal population with finite µ and σ, then, when n is large (n ≥ 30), the sampling distribution of the sample mean $\bar{x}$ is approximately normal. The approximation becomes more accurate as n becomes larger.
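
A small simulation can illustrate the theorem. The sketch below assumes an exponential population (strongly right-skewed, with µ = 1 and σ = 1); the seed, sample size and number of repetitions are arbitrary choices for the illustration.

```python
import math
import random
import statistics

random.seed(1)          # fixed seed so the run is repeatable
n = 30                  # size of each sample
num_samples = 2000      # how many samples to draw

# Exponential population with rate 1: mean 1, SD 1, clearly non-normal.
sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = 1 / math.sqrt(n)   # sigma / sqrt(n)

print(mean_of_means, sd_of_means, theoretical_se)
```

The mean of the sample means should land near the population mean, and their standard deviation near σ/√n, even though the population itself is far from normal.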
Probability

Probability is the likelihood that an event will occur; it takes a value between 0 and 1.

Two views of probability

I. Subjective probability
• Also called personalistic (based on belief, experience, prejudices)
• Does not rely on the repeatability of a process
• Not fully accepted by statisticians
• E.g. probability of a major earthquake, probability of winning a lottery.
II. Objective probability (classical, relative frequency probability)
• Based on equally likely events and the long-run relative frequency of events
• Is the same for all observers (objective)
• Not based on personal belief
• E.g. tossing a coin, picking a card.

Elementary properties of Probability

1. The probability of any event must be a non-negative number.
P(E) ≥ 0
2. The sum of the probabilities of all mutually exclusive and exhaustive outcomes is equal to 1.
P(E1) + P(E2) + … + P(En) = 1
3. For mutually exclusive events, P(E1 or E2) = P(E1) + P(E2)
4. For events that are not mutually exclusive, P(E1 or E2) = P(E1) + P(E2) − P(E1 and E2)

Rules of probability

1. Addition rule
For mutually exclusive events,
P(E1 or E2) = P(E1 ∪ E2) = P(E1) + P(E2)
For events that are not mutually exclusive,
P(E1 ∪ E2) = P(E1) + P(E2) − P(E1 and E2)
2. Multiplication rule
If two events are independent,
P(A and B) = P(A) × P(B)
If two events are not independent,
P(A and B) = P(A) × P(B | A) (or) P(B) × P(A | B)
3. Complementary rule
P(A) = 1 − P(not A)

Types of probability

1. Marginal probability: a marginal total is used as the numerator and the total group as the denominator. E.g. P(male), P(eyeglass wearing)
2. Joint probability: the probability that a subject possesses two characteristics at the same time. E.g. P(male and E+)
3. Conditional probability: the probability of an event occurring given that another event has occurred. E.g. P(A | B) = number of subjects possessing both A and B / marginal total of B

Joint probability = Marginal P × Conditional P

P(male and E+) = P(male) × P(E+ | male)
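
The relationship between marginal, joint and conditional probabilities can be checked on a small cross-classification table. The counts below are hypothetical and only illustrate the arithmetic.

```python
# Hypothetical counts: sex cross-classified by eyeglass wearing (E+ / E-).
counts = {
    ("male", "E+"): 30, ("male", "E-"): 20,
    ("female", "E+"): 25, ("female", "E-"): 25,
}
total = sum(counts.values())
male_total = counts[("male", "E+")] + counts[("male", "E-")]
e_total = counts[("male", "E+")] + counts[("female", "E+")]

p_male = male_total / total                             # marginal
p_e = e_total / total                                   # marginal
p_male_and_e = counts[("male", "E+")] / total           # joint
p_e_given_male = counts[("male", "E+")] / male_total    # conditional

# Multiplication rule: joint = marginal x conditional
product = p_male * p_e_given_male
# Addition rule for events that are not mutually exclusive
p_male_or_e = p_male + p_e - p_male_and_e

print(p_male, p_male_and_e, p_e_given_male, product, p_male_or_e)
```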

Probability Distribution

• Standard normal distribution: continuous data distribution
• Binomial distribution: discrete data distribution
• Poisson distribution: discrete data distribution (number of occurrences, characterized by the average rate of occurrence)

Estimation
Estimation is the process of calculating, from the data of a sample, some statistic that is offered as an approximation of the corresponding parameter of the population from which the sample was drawn.

Estimation is one branch of inferential statistics.

A point estimate is a single numerical value used to estimate the corresponding population parameter.

$\bar{x} = \frac{\sum x_i}{n}$

An interval estimate consists of two numerical values defining a range of values that, with a specified degree of
confidence, we feel includes the parameter being estimated.

Sampled population is the population from which one actually draws a sample.

Target population is the population about which one wishes to make an inference.

Confidence interval: An interval estimate provides more information about a population characteristic than does a point estimate. Such interval estimates are called confidence intervals.

Confidence interval for population proportion

$\hat{p} - d < p < \hat{p} + d$

The Estimate and The Estimator:

The estimate is a single computed value, but the estimator is the rule that tells us how to compute this value, or estimate.
For example, $\bar{x} = \frac{\sum x_i}{n}$ is an estimator of the population mean, µ. The single numerical value that results from evaluating this formula is called an estimate of the parameter µ.

Point and Interval Estimates

Formula for estimation = Estimator ± (reliability coefficient) × SE

*Estimator may be mean or proportion. Reliability coefficient may be z or t.*
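
The general formula above can be sketched for the simplest case: a mean with a known population SD, so that the reliability coefficient is z. The function name `mean_ci` and the numbers are hypothetical; with an unknown SD and a small sample, a t value would take the place of z.

```python
import math
from statistics import NormalDist


def mean_ci(xbar, sigma, n, confidence=0.95):
    """Interval estimate: estimator +/- (reliability coefficient) x SE."""
    se = sigma / math.sqrt(n)                           # standard error
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    d = z * se                                          # margin of error
    return xbar - d, xbar + d, d


# Hypothetical values: sample mean 120, known sigma 15, n = 36.
lower, upper, margin = mean_ci(120, 15, 36)
print(f"95% CI: {lower:.2f} to {upper:.2f} (d = {margin:.2f})")
```

Raising the confidence level or shrinking the sample widens the interval, because either change increases the margin of error.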

Margin of error

The maximum likely difference between the sample mean ($\bar{x}$) and the population mean (µ); also called the maximum error of the estimate.

It is denoted by d.

$d = (\text{reliability coefficient}) \times SE$, e.g. $d = z \cdot \frac{\sigma}{\sqrt{n}}$

The T Distribution

1. Bell-shaped, symmetric curve with fatter tails than the normal distribution (ND).
2. It has a mean of zero.
3. It is symmetric about the mean.
4. The variance of the t distribution is (n−1)/(n−3), which is greater than 1 (the variance is 1 for Z).
5. It ranges from −∞ to +∞, so the two tails never touch the baseline.
6. Compared to the normal distribution, the t distribution is less peaked in the center and has higher tails. (The larger the sample size, the closer it is to the normal distribution.)
7. It depends on the degrees of freedom (n−1).
8. The t distribution approaches the standard normal distribution as (n−1) approaches ∞.

Hypothesis

- A hypothesis may be defined simply as a statement about one or more populations. (OR)
- An "educated guess" based on prior knowledge and observation.
- Hypothesis testing is the process of making an inference or generalization about population parameters based on the results of a study on samples.
- The research hypothesis is the conjecture or supposition that motivates the research. It is a statement about the expected relationship between the variables.
- Statistical hypotheses are hypotheses that are stated in such a way that they may be evaluated by appropriate statistical techniques. The null hypothesis states that there is no relationship between the variables.

Type I error (α)

– The rejection of a null hypothesis that is actually true.
– The probability that we find an association between 2 factors when, in truth, one does not exist.
– Also called a false positive.
– Denoted by α.
– α is the level of significance, or criterion, for a hypothesis test.
– Usually considered the more serious type of error.

Type II error (β)

- Failure to reject a false null hypothesis.
- The probability that we find no association between 2 factors when, in truth, one does exist.
- More likely to occur when the sample is too small.
- Also called a false negative.

P Value

– The p-value is defined as the smallest value of α for which the null hypothesis can be rejected.
– If the p-value is less than or equal to α ,we reject the null hypothesis (p ≤ α)
– If the p-value is greater than α ,we do not reject the null hypothesis (p > α)

Chi square
Characteristics

- Data must be discrete and in the form of frequencies.
- Data are actual counts.
- No negative numbers.
- At least 80% of cells have an expected frequency of at least 5 (no more than 20% of cells have an expected frequency < 5).
- No cell has an expected frequency < 1.
- Chi square takes values between 0 and ∞.
- It is derived from the normal distribution.
- For every cell, the expected frequency cannot be equal to zero.

Uses

- To investigate the distribution of categorical variables.
- As a non-parametric test to determine whether a distribution of observed frequencies differs from the theoretical expected frequencies.
- To compare the counts of categorical responses between two or more independent groups.
- As a test of agreement between observation and hypothesis whenever data are in the form of frequencies.

Chi square test of independence

- Tests the null hypothesis that, in the population, the two criteria of classification are independent.
- A single sample is drawn from a single population.
- Observations are cross-classified on the basis of the two variables of interest.
- Expected frequencies are calculated from the multiplication rule for independent events: expected = (row total × column total) / grand total.
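
The calculation of expected frequencies and of the chi-square statistic can be sketched as follows. The 2 × 2 table of observed counts is hypothetical; the expected count in each cell is (row total × column total) / grand total, which is what independence of the two criteria implies.

```python
# Hypothetical 2 x 2 table of observed frequencies
# (rows: exposure yes/no, columns: disease yes/no).
observed = [
    [20, 30],
    [30, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency under independence for every cell.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

# Chi-square statistic: sum over cells of (O - E)^2 / E.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

print(expected, chi_square, df)
```

The computed statistic is then compared with the tabulated chi-square value for the same degrees of freedom and the chosen α.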

Chi square test of Goodness of Fit

- One of the most commonly used non-parametric tests.
- This test determines how well an observed set of data fits an expected outcome.
- It is used to test the hypothesis that an observed frequency distribution fits (or conforms to) some claimed distribution.

Advantages and disadvantages of non-parametric tests

Advantages

– Suitable for hypotheses that are not concerned with population parameters.
– Can be used when the functional form of the sampled population is unknown.
– Calculation is easier and quicker than for parametric tests.
– Can be used for ranked and classified data.

Disadvantages

– Wasteful of information when the data could be handled by a parametric test.
– Laborious when n is large.

Correlation
– Correlation analysis is used to measure the strength of the association (linear relationship) between two variables.
o Correlation is only concerned with the strength of the relationship.
o No causal effect is implied by correlation.

Types of coefficients

– Pearson's correlation coefficient (r)
– Spearman's (rho)

Pearson's correlation coefficient (r)

– r measures the strength of the linear relationship between two variables.
– It is a measure of how well the data fit a straight line.

Interpretation of Correlation coefficient

Correlation coefficient value | Direction and strength of correlation
-1.0 | Perfectly negative
-0.8 | Strongly negative
-0.5 | Moderately negative
-0.2 | Weakly negative
0.0 | No association
+0.2 | Weakly positive
+0.5 | Moderately positive
+0.8 | Strongly positive
+1.0 | Perfectly positive

Uses for correlations
• Prediction
• Validity
• Reliability
• Theory verification

Features of Correlation Coefficient, r
– Unit free
– Ranges between –1 and 1
– The closer to –1, the stronger the negative linear relationship
– The closer to 1, the stronger the positive linear relationship
– The closer to 0, the weaker any linear relationship
– If r > 0 we have a positive correlation
– A plus sign means positive correlation: the two variables tend to increase or decrease together.
– If r < 0 we have a negative correlation
– A minus sign means negative correlation: one variable tends to increase as the other tends to decrease.
– If r = 0 we have no linear correlation

Common Errors Involving Correlation

– Causation: It is wrong to conclude that correlation implies causality.
– Averages: Averages suppress individual variation and may inflate the correlation coefficient.
– Linearity: There may be some relationship between x and y even when there is no significant linear correlation.

Factors Affecting the size of the Pearson r

– Linearity
– Homogeneity or Restriction of Range
– Outliers

Spearman's Rank Correlation

– A non-parametric method of analysis
– Measured by Spearman's rank correlation coefficient (Spearman's rho or Spearman's r)
– Measures the association between 2 ordinal variables, or between one ordinal and one numerical variable

When to use

- At least one variable is ordinal.
- Variables are not normally distributed.
- Sample size is small.
- Relationship is non-linear.
- Data include outlying values.
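
Both coefficients can be sketched in plain Python. Here Spearman's coefficient is computed as Pearson's r applied to the ranks of the data, with tied values given the average of their ranks. The function names and the data are hypothetical; the data were chosen to be monotone but non-linear, with one outlying value.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]      # increases with x, but not along a straight line
r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(round(r, 3), rho)
```

Because y always increases with x, the rank correlation is perfect, while Pearson's r is lowered by the non-linearity and the outlying value.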

Simple Linear Regression

– Simple linear regression is the process of predicting or estimating scores on a Y variable based on knowledge of scores on an X variable: the regression of Y on X.

– The fitted straight line is the linear regression line; it represents how, on average, a change in the X variable is associated with a change in the Y variable.

– Regression is most often used to study things that cannot be studied with experimental or quasi-experimental procedures.

– The independent and dependent variables are usually continuous.

o Exception: regression can be used if the independent variable is dichotomous (example: gender).

– Regression allows us to have statistical control over confounding effects.

Steps in Regression analysis

– Determine whether the assumptions underlying a linear relationship are met.

– Obtain the equation for the line that best fits the sample data.

– Evaluate the strength of the relationship and the usefulness of the equation for predicting and estimating.

– If the data conform satisfactorily to the linear model, use the equation obtained from the sample data to predict and to estimate.

Interpretation of the Slope and the Intercept

– a (Y intercept) is the estimated average value of Y when the value of X is zero

– b (slope) is the estimated change in the average value of Y as a result of a one-unit change in X

Coefficient of determination (r²)

– The coefficient of determination is the proportion of the total variation in the dependent variable that is explained by variation in the independent variable
– The coefficient of determination is also called r-squared and is denoted as r²

$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$

Note: $0 \le r^2 \le 1$
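
A least-squares fit makes the intercept, slope and coefficient of determination concrete. The sketch below uses the usual least-squares formulas, b = Sxy/Sxx and a = ȳ − b·x̄; the function name `fit_line` and the five data points are hypothetical.

```python
def fit_line(x, y):
    """Least-squares line y = a + b*x and its coefficient of determination."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx                   # slope
    a = mean_y - b * mean_x         # Y intercept
    sst = sum((yi - mean_y) ** 2 for yi in y)           # total sum of squares
    ssr = sum((a + b * xi - mean_y) ** 2 for xi in x)   # regression sum of squares
    return a, b, ssr / sst


x = [1, 2, 3, 4, 5]     # hypothetical X values
y = [2, 4, 5, 4, 5]     # hypothetical Y values
a, b, r_squared = fit_line(x, y)
print(f"Y = {a:.2f} + {b:.2f} X, r^2 = {r_squared:.2f}")
```

For these made-up points, each one-unit increase in X is associated with an estimated 0.6-unit increase in the average value of Y, and the line explains 60% of the variation in Y.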

ANOVA
– Analysis of variance (ANOVA) is a technique whereby the total variation present in a set of data is
partitioned into two or more components. Associated with each of these components is a specific source
of variation, so that in the analysis it is possible to ascertain the magnitude of the contributions of each of
these sources to the total variation.
– ANOVA is used for two different purposes:
(1) to estimate and test hypotheses about population variances, and
(2) to estimate and test hypotheses about population means.

One-way ANOVA | Two-way ANOVA
A hypothesis test in which only one categorical variable is considered. | A hypothesis test in which two categorical variables are considered.
Involves one independent variable | Involves two independent variables
Analyzes three or more categorical groups | Analyzes multiple groups of two factors
Needs only two principles of experimental design (replication and randomization) | Needs three principles (replication, randomization and local control)
Does not need the same number of observations in each group | Needs the same number of observations in each group

Features of Two-Way ANOVA F Test

– Degrees of freedom always add up
• n − 1 = rc(n′ − 1) + (r − 1) + (c − 1) + (r − 1)(c − 1)
• Total = error + factor A + factor B + interaction
– The denominator of the F test is always the same but the numerator is different
– The sums of squares always add up
• SST = SSE + SSA + SSB + SSAB
• Total = error + factor A + factor B + interaction

Assumptions
Hypothesis

Single mean

• Population is normally or approximately normally distributed with known or unknown variance (sample
size n may be small or large),

• Population is not normal with known or unknown variance (n is large i.e. n≥30).

Two means

• Samples are randomly and independently drawn


• Population is normally or approximately normally distributed with known or unknown variance (sample
size n may be small or large)
• Population is not normal with known variances (n is large i.e. n≥30).

Two proportions

Normally distributed and randomly selected

Chi Square

– Variables are QUALITATIVE data (e.g. sex, age group, intensity of exposure, etc.)
– For every cell in the table, the expected frequency cannot be equal to zero
– Not more than 20% of the total number of cells in the table should have expected frequencies < 5
– If these conditions are not met, Fisher's Exact Test should be used instead of the χ² test.
Correlation

(1) For each value of X there is a normally distributed subpopulation of Y values.

(2) For each value of Y there is a normally distributed subpopulation of X values.

(3) The joint distribution of X and Y is a normal distribution called the bivariate normal distribution.

(4) The subpopulations of Y values all have the same variance.

(5) The subpopulations of X values all have the same variance.

ANOVA

1. Populations are normally distributed

2. Populations have equal variances

3. Samples are randomly and independently drawn

VIVA QUOTES

1. What is statistics and why study statistics?

To obtain information, to increase credibility, and to make correct decisions.

2. What is descriptive and inferential statistics?

Descriptive – Describes the data in a meaningful way by using graphs and tables.
Inferential – Same as the definition.
3. What is data? (See the definition)
4. What is a variable? (See the definition)

Variable

By characteristics
– Qualitative (categorized)
o Nominal (just naming), e.g. blood group
o Ordinal (naming plus order), e.g. BMI
– Quantitative (measurable)
o Discrete (with gaps)
o Continuous (no gaps), e.g. blood glucose level, weight, height

By measurement scale
– Nominal
– Ordinal
– Interval
– Ratio

By function
– Dependent (outcome)
– Independent (exposure)

By statistical use
– Univariate
– Bivariate
– Multivariate

Probability Concepts
Probability distribution

A probability distribution summarizes the relationship between the values of a random variable (which vary due to chance factors) and the probabilities of their occurrence. Probability distributions are useful

• for summarizing and describing a set of data, and
• for reaching conclusions about a population of data.


TABLE 1. Number of Assistance Programs Utilized by Families with Children in Head Start Programs in Southern Ohio (probability and cumulative probability distribution).

No. of programs (x) | Frequency | P(X = x) | Cumulative probability P(X ≤ x)
1 | 62 | .2088 | .2088
2 | 47 | .1582 | .3670
3 | 39 | .1313 | .4983
4 | 39 | .1313 | .6296

Answers: 1. 0.1313 2. 0.3670 3. 0.3670 4. 0.4983, since P(X < 4) = P(X ≤ 3) 5. 1 − P(X ≤ 4) = 0.3704

• Continuous random variable

has infinitely many values, and those values can be associated with measurements on a continuous scale with no gaps or interruptions. Normal, Uniform, Exponential.

• Discrete random variable

has a finite (or countable) number of values. Binomial, Hypergeometric, Poisson.

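Group Data Exercise

Age | No. of cases (frequency, Fi) | Cumulative frequency | Midpoint (Mi) | Mi² | MiFi (midpoint × frequency) | Mi²Fi | True class interval
5-14 | 5 | 5 | 9.5 | 90.25 | 47.5 | 451.25 | 4.5-14.5
15-24 | 10 | 15 | 19.5 | 380.25 | 195 | 3802.5 | 14.5-24.5
25-34 | 20 | 35 | 29.5 | 870.25 | 590 | 17405 | 24.5-34.5
35-44 | 22 | 57 | 39.5 | 1560.25 | 869 | 34325.5 | 34.5-44.5
45-54 | 13 | 70 | 49.5 | 2450.25 | 643.5 | 31853.25 | 44.5-54.5
55-64 | 5 | 75 | 59.5 | 3540.25 | 297.5 | 17701.25 | 54.5-64.5
Total | 75 | | | | 2642.5 | 105538.75 |

Mean = $\frac{\sum M_i F_i}{\sum F_i} = \frac{2642.5}{75} = 35.23$

Median = $L_i + \frac{j}{F_i}(U_i - L_i) = 34.5 + \frac{3}{22}(10) = 35.86$

where $L_i$ and $U_i$ are the true lower and upper limits of the class that contains the median, $F_i$ is the frequency of that class, and $j = \frac{n + 1}{2} - F = 38 - 35 = 3$ (n = total frequency, F = cumulative frequency of the classes above the median class in the table).

Variance = $\frac{\sum M_i^2 F_i - (\sum M_i F_i)^2 / n}{n - 1} = \frac{105538.75 - 93104.08}{74} = 168.04$

SD = $\sqrt{168.04} = 12.96$

Histogram and polygon: age on the X axis and no. of cases on the Y axis.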
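
The grouped-data mean, variance and SD from the exercise can be checked with a short script. It uses the class midpoints and frequencies from the table above and the computational formula for the sample variance; the variable names are chosen for the illustration.

```python
import math

# (midpoint, frequency) for each age class in the exercise.
classes = [
    (9.5, 5), (19.5, 10), (29.5, 20),
    (39.5, 22), (49.5, 13), (59.5, 5),
]

n = sum(f for _, f in classes)                  # total frequency
sum_mf = sum(m * f for m, f in classes)         # sum of midpoint x frequency
sum_m2f = sum(m * m * f for m, f in classes)    # sum of midpoint^2 x frequency

mean = sum_mf / n
variance = (sum_m2f - sum_mf ** 2 / n) / (n - 1)
sd = math.sqrt(variance)

print(f"n={n}, mean={mean:.2f}, variance={variance:.2f}, SD={sd:.2f}")
```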

• Binomial distribution

- is a probability distribution

- is derived from an experiment or trial that can result in only one of two mutually exclusive outcomes (Bernoulli trial)

- Probability of success = p

- Probability of failure = q = 1 − p

- No. of occurrences (successes) = x

- No. of independent trials = n (sample size)

- Combination formula: ${}_nC_x = \frac{n!}{x!(n-x)!}$

$f(x) = {}_nC_x \, p^x q^{n-x}$

The Bernoulli process

1) Each trial results in one of two possible, mutually exclusive outcomes. One is denoted as a success and the other as a failure.

2) The probability of success is denoted by p and the probability of failure by q = 1 − p.

3) Trials are independent.

$P(x) = {}_nC_x \, p^x q^{n-x}$

P(X ≤ x) = read from the binomial table

Mean (µ) = np

Variance (σ²) = npq

f(x) is used for "exactly x" questions; the cumulative P(X ≤ x) is used when the table applies.

Rule of Combinations

• The number of combinations of selecting X objects out of n objects is

${}_nC_X = \frac{n!}{X!(n-X)!}$

where:

n! = n(n − 1)(n − 2) . . . (2)(1)

X! = X(X − 1)(X − 2) . . . (2)(1)

0! = 1 (by definition)

Binomial Distribution Formula

$P(X) = {}_nC_X \, p^X q^{n-X}$

P(X) = probability of X successes in n trials, with probability of success p on each trial

X = number of "successes" in the sample (X = 0, 1, 2, ..., n)

n = sample size (number of trials or observations)

p = probability of "success"

Example: Flip a coin four times and let X = number of heads:

n = 4, p = 0.5, 1 − p = (1 − .5) = .5, X = 0, 1, 2, 3, 4

Example: Calculating a Binomial Probability

1. What is the probability of one success in five observations if the probability of success is .1?

X = 1, n = 5, and p = .1

$P(X = 1) = \frac{n!}{X!(n-X)!} p^X (1-p)^{n-X} = \frac{5!}{1!(5-1)!} (.1)^1 (1-.1)^{5-1} = (5)(.1)(.9)^4 = .32805$

2. The data from the North Carolina State Center for Health Statistics (A-3) show that 14 percent of mothers admitted to smoking one or more cigarettes per day during pregnancy. If a random sample of size 10 is selected from this population, what is the probability that it will contain exactly four mothers who admitted to smoking during pregnancy?

n = 10, x = 4, p = 14% = 0.14, P(4) = ? Use $P(x) = {}_nC_x \, p^x q^{n-x}$

ANS. 0.0326
3. Suppose it is known that 10 percent of a certain population is color blind. If a random sample of 25 people is drawn from this population, use Table B in the Appendix to find the probability that:
n = 25, p = 10% = 0.1

(a) Five or fewer will be color blind. P(x ≤ 5) = 0.9666 (for ≤, read the table value directly)

(b) Six or more will be color blind. P(x ≥ 6) = 1 − 0.9666 = 0.0334 (for ≥, use 1 minus the value for the next lower number)

(c) Between six and nine inclusive will be color blind.

P(6 ≤ x ≤ 9) = P(x ≤ 9) − P(x ≤ 5) = 0.9999 − 0.9666 = 0.0333

(d) Two, three, or four will be color blind.

P(2 ≤ x ≤ 4) = P(x ≤ 4) − P(x ≤ 1) = 0.9020 − 0.2712 = 0.6308
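
The worked binomial examples above can be reproduced without tables. This sketch uses `math.comb` from the Python standard library; the helper names `binom_pmf` and `binom_cdf` are chosen for the illustration.

```python
from math import comb


def binom_pmf(x, n, p):
    """P(X = x): exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def binom_cdf(x, n, p):
    """P(X <= x): the cumulative value that a binomial table lists."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))


# Example 1: one success in five observations, p = .1
example_1 = binom_pmf(1, 5, 0.1)
# Example 2: exactly four smokers among 10 mothers, p = .14
example_2 = binom_pmf(4, 10, 0.14)
# Example 3: color blindness, n = 25, p = .1
five_or_fewer = binom_cdf(5, 25, 0.1)
six_or_more = 1 - binom_cdf(5, 25, 0.1)
two_to_four = binom_cdf(4, 25, 0.1) - binom_cdf(1, 25, 0.1)

print(round(example_1, 5), round(example_2, 4),
      round(five_or_fewer, 4), round(six_or_more, 4), round(two_to_four, 4))
```

An interval such as "two, three, or four" is the difference of two cumulative values, P(X ≤ 4) − P(X ≤ 1), exactly as when reading a cumulative table.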

Using Table B When p > .5

• P(X = x | n, p > .5) = P(X = n − x | n, 1 − p)

• P(X ≤ x | n, p > .5) = P(X ≥ n − x | n, 1 − p)

• P(X ≥ x | n, p > .5) = P(X ≤ n − x | n, 1 − p)

4. Assuming that the probability of giving this answer to the question is .55 for any Massachusetts resident, use Table B to find the probability that if 12 residents are chosen at random: p = .55, n = 12

(a) Exactly seven will answer "serious problem." X = 7

P(X = x | n, p > .5) = P(X = n − x | n, 1 − p)

P(X = 5 | 12, 0.45) = P(x ≤ 5) − P(x ≤ 4) = 0.5269 − 0.3044 = 0.2225 (look up x = 5 instead of 7)

(b) Five or fewer households will answer "serious problem." X ≤ 5

P(X ≤ x | n, p > .5) = P(X ≥ n − x | n, 1 − p)

= P(x ≥ 7 | 12, 0.45) = 1 − P(x ≤ 6) = 1 − 0.7393 = 0.2607

(c) Eight or more households will answer "serious problem."

P(X ≥ x | n, p > .5) = P(X ≤ n − x | n, 1 − p)

= P(x ≤ 4 | 12, 0.45) = 0.3044 #

EXERCISE 4.3.1

n = 20, p = .24

(a) Exactly 3 = P(x ≤ 3) − P(x ≤ 2) = 0.2569 − 0.1085 = 0.1484 #
(b) 3 or more = 1 − P(x ≤ 2) = 1 − 0.1085 = 0.8915 #
(c) Fewer than 3 = P(x < 3) = P(x ≤ 2) = 0.1085 #
(d) Between 3 and 7 inclusive = P(x ≤ 7) − P(x ≤ 2) = 0.9165 − 0.1085 = 0.8080 #

EXERCISE 4.3.4

n = 15, p = 0.32

(a) Three = P(x = 3) = P(x ≤ 3) − P(x ≤ 2) = 0.2420 − 0.0962 = 0.1458 #
(b) Less than 5 = P(x < 5) = P(x ≤ 4) = 0.4477 #
(c) Between 5 and 9 inclusive = P(x ≤ 9) − P(x ≤ 4) = 0.9938 − 0.4477 = 0.5461 #
(d) P(5 < x < 10) = P(x ≤ 9) − P(x ≤ 5) = 0.9938 − 0.6607 = 0.3331 #

EXERCISE 4.3.7

n = 3, p = 0.19

(a) Exactly 0: P(0) = ${}_3C_0 (0.19)^0 (0.81)^3$ = 0.5314 #
(b) P(1) = 0.374 #
(c) P(x > 1) = P(x = 2) + P(x = 3) = 0.0945 #
(d) P(x ≤ 2) = P(x = 0) + P(x = 1) + P(x = 2) = 0.9931 #
(e) P(x = 2 or 3) = P(x ≥ 2) = P(x = 2) + P(x = 3) = 0.0945 #
(f) P(x = 3) = 0.0069 #

Poisson Model

$f(x) = P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}$, x = 0, 1, 2, ...

Exercise 4.4.1

λ = 4

(a) P(exactly 5) = $\frac{e^{-4} 4^5}{5!}$ = 0.1563 #

Use Table C
(b) P(x > 5) = 1 − P(x ≤ 5) = 1 − 0.785 = 0.215 #
(c) P(x < 5) = P(x ≤ 4) = 0.629 #
(d) P(5 ≤ x ≤ 7) (inclusive) = P(x ≤ 7) − P(x ≤ 4) = 0.949 − 0.629 = 0.320 #

Exercise 4.4.3

λ = 5

(a) P(exactly 7) = $\frac{e^{-5} 5^7}{7!}$ = 0.1044 #
(b) P(x ≥ 10) = 1 − P(x ≤ 9) = 1 − 0.968 = 0.032 #
(c) P(x = 0) = 0.007 #
(d) P(x < 5) = P(x ≤ 4) = 0.440 #

Exercise 4.4.4

λ = 0.5

(a) P(exactly 1) = $\frac{e^{-0.5} (0.5)^1}{1!}$ = 0.303 #
(b) P(x = 0) = 0.607 #
(c) P(x ≥ 4) = 1 − P(x ≤ 3) = 1 − 0.998 = 0.002 #
(d) P(x ≥ 1) = 1 − P(x ≤ 0) = 1 − 0.607 = 0.393 #

Exercise 4.4.5

λ = 13

(a) P(x = 10) = 0.086 #
(b) P(x ≥ 8) = 1 − P(x ≤ 7) = 1 − 0.054 = 0.946 #
(c) P(x ≤ 12) = 0.463 #
(d) P(9 ≤ x ≤ 15) = P(x ≤ 15) − P(x ≤ 8) = 0.764 − 0.100 = 0.664 #
(e) P(x < 7) = P(x ≤ 6) = 0.026 #
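
The Poisson exercises can be checked directly from the formula instead of Table C. This is a minimal sketch; the helper names `poisson_pmf` and `poisson_cdf` are chosen for the illustration.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson variable with mean lam."""
    return exp(-lam) * lam ** x / factorial(x)


def poisson_cdf(x, lam):
    """P(X <= x): the cumulative value that a Poisson table lists."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))


# Exercise 4.4.1 (lambda = 4)
exactly_5 = poisson_pmf(5, 4)
more_than_5 = 1 - poisson_cdf(5, 4)
fewer_than_5 = poisson_cdf(4, 4)

# Exercise 4.4.4 (lambda = 0.5)
at_least_1 = 1 - poisson_cdf(0, 0.5)

print(round(exactly_5, 4), round(more_than_5, 3),
      round(fewer_than_5, 3), round(at_least_1, 3))
```

As with the binomial table, "more than 5" is 1 − P(X ≤ 5) and "fewer than 5" is P(X ≤ 4).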

Exercise 5.3.1

µ = 204, σ = 44, n = 50

Use $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{44}{\sqrt{50}}$

= 6.2225 #

Exercise 5.3.3

If the uric acid values in normal adult males are approximately normally distributed with a mean and standard
deviation of 5.7 and 1 mg percent, respectively, find the probability that a sample of size 9 will yield a mean:
(a) Greater than 6 (b) Between 5 and 6 (c) Less than 5.2
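µ = 5.7, σ = 1, n = 9

Use $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

(a) Greater than 6, so $\bar{x}$ = 6:

$z = \frac{6 - 5.7}{1/\sqrt{9}} = 0.9$; P(Z < 0.9) = 0.8159 (Table D)

P($\bar{x}$ > 6) = 1 − 0.8159 = 0.1841 #

(b) Between 5 and 6, so $\bar{x}$ = 5 and $\bar{x}$ = 6:

$z = \frac{5 - 5.7}{1/\sqrt{9}} = -2.1$ and $z = \frac{6 - 5.7}{1/\sqrt{9}} = 0.9$; P(Z < −2.1) = 0.0179 and P(Z < 0.9) = 0.8159 (Table D)

P(5 < $\bar{x}$ < 6) = 0.8159 − 0.0179 = 0.7980 #

(c) Less than 5.2, so $\bar{x}$ = 5.2:

$z = \frac{5.2 - 5.7}{1/\sqrt{9}} = -1.5$; P($\bar{x}$ < 5.2) = P(Z < −1.5) = 0.0668 (Table D) #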
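
The three answers to Exercise 5.3.3 can be verified with `statistics.NormalDist` from the Python standard library, treating the sample mean as normal with mean µ and standard error σ/√n. The variable names are chosen for the illustration.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 5.7, 1.0, 9

# Sampling distribution of the sample mean: Normal(mu, sigma / sqrt(n)).
xbar_dist = NormalDist(mu, sigma / sqrt(n))

p_greater_than_6 = 1 - xbar_dist.cdf(6)
p_between_5_and_6 = xbar_dist.cdf(6) - xbar_dist.cdf(5)
p_less_than_5_2 = xbar_dist.cdf(5.2)

print(round(p_greater_than_6, 4),
      round(p_between_5_and_6, 4),
      round(p_less_than_5_2, 4))
```

Working with the distribution of the sample mean directly gives the same probabilities as converting to z and reading Table D.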

Exercise 5.3.5 is similar to 5.3.3.

Distribution of the difference between two sample means

Exercise 5.4.1

Use these estimates as the mean µ and standard deviation σ for the respective U.S. populations. Suppose we select a simple random sample of size 50 independently from each population. What is the probability that the difference between sample means $\bar{x}_B - \bar{x}_A$ will be more than 8?

Given: $\mu_A$ = 183, $\mu_B$ = 189, $\sigma_A$ = 37.2, $\sigma_B$ = 34.7, n = 50 for each sample.

$z = \frac{(\bar{x}_B - \bar{x}_A) - (\mu_B - \mu_A)}{\sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}}} = \frac{8 - 6}{\sqrt{\frac{37.2^2}{50} + \frac{34.7^2}{50}}} = 0.28$; P(Z < 0.28) = 0.6103 (Table D)

P($\bar{x}_B - \bar{x}_A$ > 8) = 1 − 0.6103 = 0.3897 #

Exercise 5.4.3: $\mu_1 - \mu_2$ = 0 (given); P($\bar{x}_1 - \bar{x}_2$ ≥ 10) = 1 − 0.9962 = 0.0038 #

Exercises 5.4.2, 5.4.4, 5.4.5 and 5.4.6 use similar methods.

*Read each problem carefully and note wording such as "equal means" or "equal variances".*

DISTRIBUTION OF THE SAMPLE PROPORTION

$z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}}$

$\hat{p}$ is the sample proportion. Find np and nq to determine whether both are > 5.

Remember q = 1 − p.

EXERCISE 5.5.1

Smith et al. [A-5] performed a retrospective analysis of data on 782 eligible patients admitted with myocardial
infarction to a 46-bed cardiac service facility. Of these patients, 248 (32 percent) reported a past myocardial
infarction. Use .32 as the population proportion. Suppose 50 subjects are chosen at random from the population.
What is the probability that over 40 percent would report previous myocardial infarctions?
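n = 50, p = 0.32, q = (1 − p) = 0.68, np = 16, nq = 34 (both > 5)

$z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} = \frac{0.40 - 0.32}{\sqrt{\frac{(0.32)(0.68)}{50}}} = 1.21$; P(Z < 1.21) = 0.8869

P($\hat{p}$ > 0.40) = 1 − 0.8869 = 0.1131 #

EXERCISE 5.5.3 is similar to the above.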

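DISTRIBUTION OF THE DIFFERENCE BETWEEN TWO SAMPLE PROPORTIONS

The mean difference: $\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2$

The variance: $\sigma^2_{\hat{p}_1 - \hat{p}_2} = \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}$

$z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}}$
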
EXERCISE 5.6.1

According to the 2000 U.S. Census Bureau [A-8], in 2000, 9.5 percent of children in the state of Ohio were not
covered by private or government health insurance. In the neighboring state of Pennsylvania, 4.9 percent of
children were not covered by health insurance. Assume that these proportions are parameters for the child
populations of the respective states. If a random sample of size 100 children is drawn from the Ohio population,
and an independent random sample of size 120 is drawn from the Pennsylvania population, what is the probability
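that the samples would yield a difference, $\hat{p}_1 - \hat{p}_2$, of .09 or more?

$\mu_{\hat{p}_1 - \hat{p}_2}$ = 0.095 − 0.049 = 0.046

$\sigma^2_{\hat{p}_1 - \hat{p}_2} = \frac{(0.095)(0.905)}{100} + \frac{(0.049)(0.951)}{120}$ = 0.001248

$z = \frac{0.09 - 0.046}{\sqrt{0.001248}}$ = 1.25; P(Z < 1.25) = 0.8944 (Table D)

P($\hat{p}_1 - \hat{p}_2$ ≥ 0.09) = 1 − 0.8944 = 0.1056 #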

EXERCISE 5.6.3

From the results of a survey conducted by the U.S. Bureau of Labor Statistics [A-9], it was estimated that 21
percent of workers employed in the Northeast participated in health care benefits programs that included vision
care. The percentage in the South was 13 percent. Assume these percentages are population parameters for the
respective U.S. regions. Suppose we select a simple random sample of size 120 northeastern workers and an
independent simple random sample of 130 southern workers. What is the probability that the difference between
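sample proportions, p̂1 − p̂2, will be between .04 and .20?

p1 = 0.21, n1 = 120 (Northeast); p2 = 0.13, n2 = 130 (South); p1 − p2 = 0.08

σ²(p̂1 − p̂2) = (0.21)(0.79)/120 + (0.13)(0.87)/130 = 0.00225

For 0.04: z = (0.04 − 0.08) / √0.00225 = −0.84; P(Z < −0.84) = 0.2005 (Table D)

For 0.20: z = (0.20 − 0.08) / √0.00225 = 2.53; P(Z < 2.53) = 0.9943 (Table D)

P(0.04 < p̂1 − p̂2 < 0.20) = 0.9943 − 0.2005 = 0.7938 #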
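As a rough numerical check of Exercise 5.6.3, the sketch below evaluates the normal approximation for the difference between two sample proportions. The function names are illustrative only. The z values are not rounded, so the answer differs slightly from the Table D result.

```python
import math

def phi(z):
    """Standard normal CDF, P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_diff_props_between(lo, hi, p1, n1, p2, n2):
    """P(lo < p1_hat - p2_hat < hi) under the normal approximation."""
    mean = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return phi((hi - mean) / se) - phi((lo - mean) / se)

p = prob_diff_props_between(0.04, 0.20, 0.21, 120, 0.13, 130)
print(f"P = {p:.4f}")
```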

STANDARD NORMAL DISTRIBUTION

• It is a special case of the normal distribution, with mean 0 and standard deviation 1.
• It is symmetrical about 0.
• We can use Table D to find probabilities (areas under the curve).

f(z) = (1/√(2π)) e^(−z²/2)

Note that the cumulative probabilities P(Z ≤ z) are given in tables for −3.49 < z < 3.49. Thus, P(−3.49 < Z < 3.49) ≈ 1.
For the standard normal distribution, P(Z > 0) = P(Z < 0) = 0.5.
Example 4.6.1:

If Z has a standard normal distribution, then P(Z < 2) is the area to the left of 2, and it equals 0.9772 #

Example 4.6.2:
P(−2.55 < Z < 2.55) is the area between −2.55 and 2.55:
P(−2.55 < Z < 2.55) = 0.9946 − 0.0054 = 0.9892 #

Example 4.6.3:
P(−2.74 < Z < 1.53) is the area between −2.74 and 1.53:
P(−2.74 < Z < 1.53) = 0.9370 − 0.0031 = 0.9339 #

Example 4.6.4:
P(Z > 2.71) is the area to the right of 2.71.
So, P(Z > 2.71) = 1 − 0.9966 = 0.0034 #

Example 4.6.5:
P(0.84 ≤ Z ≤ 2.45) is the area between 0.84 and 2.45.
So, P(0.84 ≤ Z ≤ 2.45) = P(Z ≤ 2.45) − P(Z ≤ 0.84)
= 0.9929 − 0.7995 = 0.1934 #
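The table look-ups in Examples 4.6.1 to 4.6.5 can be reproduced without Table D. This is a minimal sketch using the error function from Python's standard library; the helper name `phi` is an assumption made for illustration.

```python
import math

def phi(z):
    """Standard normal CDF, P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

ex1 = phi(2.0)                 # P(Z < 2)
ex2 = phi(2.55) - phi(-2.55)   # P(-2.55 < Z < 2.55)
ex3 = phi(1.53) - phi(-2.74)   # P(-2.74 < Z < 1.53)
ex4 = 1.0 - phi(2.71)          # P(Z > 2.71)
ex5 = phi(2.45) - phi(0.84)    # P(0.84 <= Z <= 2.45)

for name, value in [("4.6.1", ex1), ("4.6.2", ex2), ("4.6.3", ex3),
                    ("4.6.4", ex4), ("4.6.5", ex5)]:
    print(f"Example {name}: {value:.4f}")
```

Small differences in the fourth decimal place are expected, because the table answers subtract values that were already rounded.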

Example 4.7.1:

The 'Uptime' is a custom-made, lightweight, battery-operated activity monitor that records the amount of time an
individual spends in the upright position. In a study of children ages 8 to 15 years, the researchers found that the
amount of time children spend in the upright position followed a normal distribution with a mean of 5.4 hours and a
standard deviation of 1.3 hours. If a child is selected at random, find:

1- The probability that the child spends less than 3 hours in the upright position in a 24-hour period

P(X < 3) = P(Z < (3 − 5.4)/1.3) = P(Z < −1.85) = 0.0322 #

2- The probability that the child spends more than 5 hours in the upright position in a 24-hour period

P(X > 5) = P(Z > (5 − 5.4)/1.3) = P(Z > −0.31)

= 1 − P(Z < −0.31) = 1 − 0.3783 = 0.6217 #

3- The probability that the child spends exactly 6.2 hours in the upright position in a 24-hour period

P(X = 6.2) = 0 (for a continuous variable, the probability of any single exact value is zero)

4- The probability that the child spends from 4.5 to 7.3 hours in the upright position in a 24-hour period

P(4.5 < X < 7.3) = P((4.5 − 5.4)/1.3 < Z < (7.3 − 5.4)/1.3) = P(−0.69 < Z < 1.46)

= P(Z < 1.46) − P(Z < −0.69) = 0.9279 − 0.2451 = 0.6828 #

EXAMPLE 4.7.2
Diskin et al. (A-11) studied common breath metabolites such as ammonia, acetone, isoprene, ethanol, and
acetaldehyde in five subjects over a period of 30 days. Each day, breath samples were taken and analyzed in the
early morning on arrival at the laboratory. For subject A, a 27-year-old female, the ammonia concentration in
parts per billion (ppb) followed a normal distribution over 30 days with mean 491 and standard deviation
119. What is the probability that on a random day, the subject’s ammonia concentration is between 292 and 649
ppb?

µ = 491, σ = 119, P(292 ≤ x ≤ 649) = ?

z1 = (292 − 491)/119 = −1.67; P(Z < −1.67) = 0.0475
z2 = (649 − 491)/119 = 1.33; P(Z < 1.33) = 0.9082

P(292 ≤ x ≤ 649) = 0.9082 − 0.0475 = 0.8607 #

EXERCISE 4.7.1

For another subject (a 29-year-old male) in the study by Diskin et al. (A-11), acetone levels were normally
distributed with a mean of 870 and a standard deviation of 211 ppb. Find the probability that on a given day the
subject’s acetone level is:

(a) Between 600 and 1000 ppb


(b) Over 900 ppb
(c) Under 500 ppb
(d) Between 900 and 1100 ppb

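µ = 870, σ = 211

(a) P(600 < x < 1000) = ?

z1 = (600 − 870)/211 = −1.28; P(Z < −1.28) = 0.1003

z2 = (1000 − 870)/211 = 0.62; P(Z < 0.62) = 0.7324

P(600 < x < 1000) = 0.7324 − 0.1003 = 0.6321 #

(b) P(x > 900): z = (900 − 870)/211 = 0.14; P(Z < 0.14) = 0.5557; P(x > 900) = 1 − 0.5557 = 0.4443 #

(c) P(x < 500): z = (500 − 870)/211 = −1.75; P(Z < −1.75) = 0.0401 #

(d) P(900 < x < 1100): z = (1100 − 870)/211 = 1.09; P(Z < 1.09) = 0.8621; P(900 < x < 1100) = 0.8621 − 0.5557 = 0.3064 #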
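The standardization step used in Exercise 4.7.1 can be wrapped in one small function. The sketch below is illustrative; the name `normal_prob` and its argument layout are assumptions, not part of the notes. It uses unrounded z values, so the results agree with the Table D answers only to about two or three decimals.

```python
import math

def phi(z):
    """Standard normal CDF, P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_prob(mu, sigma, lo=None, hi=None):
    """P(lo < X < hi) for X ~ Normal(mu, sigma); omit lo or hi for a one-sided tail."""
    lower = 0.0 if lo is None else phi((lo - mu) / sigma)
    upper = 1.0 if hi is None else phi((hi - mu) / sigma)
    return upper - lower

a = normal_prob(870, 211, lo=600, hi=1000)   # between 600 and 1000 ppb
b = normal_prob(870, 211, lo=900)            # over 900 ppb
c = normal_prob(870, 211, hi=500)            # under 500 ppb
d = normal_prob(870, 211, lo=900, hi=1100)   # between 900 and 1100 ppb
print(round(a, 4), round(b, 4), round(c, 4), round(d, 4))
```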

EXERCISE 4.7.2 ***

In the study of fingerprints, an important quantitative characteristic is the total ridge count for the
10 fingers of an individual. Suppose that the total ridge counts of individuals in a certain population
are approximately normally distributed with a mean of 140 and a standard deviation of 50. Find the
probability that an individual picked at random from this population will have a ridge count of:

(a) 200 or more


(b) Less than 100

(c) Between 100 and 200

(d) Between 200 and 250


(e) In a population of 10,000 people how many would you expect to have a ridge count of 200 or
more?

EXERCISE 4.7.4***

Suppose the average length of stay in a chronic disease hospital of a certain type of patient is 60 days with
a standard deviation of 15. If it is reasonable to assume an approximately normal distribution of lengths of
stay, find the probability that a randomly selected patient from this group will have a length of stay:
(a) Greater than 50 days (b) Less than 30 days
(c) Between 30 and 60 days (d) Greater than 90 days

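Estimation
Case 1

The population is normal (N) or approximately normal (≈ N):
• σ² is known (n large or small): use z
• σ² is unknown and n is large: use z or t
• σ² is unknown and n is small: use t

Case 2

If the population is not normally distributed and n is large, then whether σ² is known or unknown, use
z (Central Limit Theorem) or t.

Case 3

The two populations are normal (N) or approximately normal (≈ N):
• σ1² and σ2² are known (n1, n2 large or small): use z
• σ1² and σ2² are unknown and n1, n2 are large: use z or t
• σ1² and σ2² are unknown, n1, n2 are small, and the population variances are equal: use pooled t
• σ1² and σ2² are unknown, n1, n2 are small, and the population variances are not equal: use t′

Case 4

If the populations are not normally distributed, n1 and n2 are large (n1 ≥ 30, n2 ≥ 30), and the population variances
are known → z (Central Limit Theorem).

Confidence Interval for μ (σ Known)

x̄ ± z(1−α/2) · σ/√n

where z(1−α/2) is the critical value of Z for the chosen confidence level (for example, 1.96 for 95%).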

Example 1.

 A sample of 11 circuits from a large normal population has a mean resistance of 2.20 ohms. We know
from past testing that the population standard deviation is .35 ohms. Determine a 95% confidence
interval for the true mean resistance of the population.

Solution:

x̄ ± z · σ/√n
= 2.20 ± 1.96 (0.35/√11)
= 2.20 ± 0.2068
= (1.9932, 2.4068)

Interpretation

• We are 95% confident that the true mean resistance is between 1.9932 and 2.4068 ohms.
• Although the true mean may or may not be in this interval, 95% of intervals formed in this manner will
contain the true mean.
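The interval in Example 1 can be reproduced with a few lines of code. This is a minimal sketch; the function name `mean_ci_z` is illustrative, and the critical value 1.96 is passed in as read from the table rather than computed.

```python
import math

def mean_ci_z(xbar, sigma, n, z_crit):
    """Confidence interval for a mean when the population SD (sigma) is known."""
    half_width = z_crit * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

lo, hi = mean_ci_z(xbar=2.20, sigma=0.35, n=11, z_crit=1.96)
print(f"95% CI: ({lo:.4f}, {hi:.4f})")
```

Passing 1.645 or 2.58 as `z_crit` gives the 90% and 99% intervals in the same way.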

Confidence Interval for μ (σ Unknown)

 If the population standard deviation σ is unknown, we can substitute the sample standard deviation, S

 This introduces extra uncertainty, since S is variable from sample to sample

 So we use the t distribution instead of the normal distribution

Assumptions

• The population standard deviation is unknown.
• The population is normally distributed; if it is not, use a large sample.
• Use Student’s t distribution (or, for large samples, the z distribution), substituting the sample standard deviation s for σ.

Confidence Interval Estimate: x̄ ± t(α/2, n−1) · s/√n

Example 2

A random sample of n = 25 has x̄ = 50 and s = 8. Form a 95% confidence interval for μ.

d.f. = n − 1 = 24, so t(α/2, n−1) = t(0.025, 24) = 2.0639

The confidence interval is

x̄ ± t(α/2, n−1) · s/√n = 50 ± (2.0639)(8/√25) = 50 ± 3.302

(46.698, 53.302)

Confidence Interval for the difference between two Population Means: (CI)

When the populations are normal:

When the variances are known and the sample sizes are large or small, the C.I. has the form:

(x̄1 − x̄2) − z(1−α/2) √(σ1²/n1 + σ2²/n2) < μ1 − μ2 < (x̄1 − x̄2) + z(1−α/2) √(σ1²/n1 + σ2²/n2)

When the variances are unknown and the sample sizes are large, the sample variances estimate the population variances and the C.I. has the form:

(x̄1 − x̄2) − z(1−α/2) √(s1²/n1 + s2²/n2) < μ1 − μ2 < (x̄1 − x̄2) + z(1−α/2) √(s1²/n1 + s2²/n2)

When the variances are unknown but equal, and the sample sizes are small, the C.I. has the form:

(x̄1 − x̄2) − t(1−α/2, n1+n2−2) √(sp²/n1 + sp²/n2) < μ1 − μ2 < (x̄1 − x̄2) + t(1−α/2, n1+n2−2) √(sp²/n1 + sp²/n2)

where

sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)

When the variances are unknown and not equal, and the sample sizes are small, the C.I. has the form:

(x̄1 − x̄2) − t′(1−α/2) √(s1²/n1 + s2²/n2) < μ1 − μ2 < (x̄1 − x̄2) + t′(1−α/2) √(s1²/n1 + s2²/n2)

where

t′(1−α/2) = (w1·t1 + w2·t2) / (w1 + w2),

with w1 = s1²/n1, w2 = s2²/n2, t1 = t(1−α/2, n1−1) and t2 = t(1−α/2, n2−1).

Example 3

A research team is interested in the difference between serum uric acid levels in patients with and without
Down’s syndrome. In a large hospital for the treatment of the mentally retarded, a sample of 12 individuals with
Down’s syndrome yielded a mean of x̄1 = 4.5 mg/100 ml. In a general hospital, a sample of 15 normal
individuals of the same age and sex were found to have a mean value of x̄2 = 3.4 mg/100 ml. If it is reasonable to assume
that the two populations of values are normally distributed with variances equal to 1 and 1.5, find the 95% C.I. for
μ1 − μ2.

Solution:

1 − α = 0.95 → α = 0.05 → α/2 = 0.025 → z(1−α/2) = z(0.975) = 1.96

(x̄1 − x̄2) ± z(1−α/2) √(σ1²/n1 + σ2²/n2) = (4.5 − 3.4) ± 1.96 √(1/12 + 1.5/15)

= 1.1 ± 1.96(0.4282) = 1.1 ± 0.84 = (0.26, 1.94)

Example 4

The purpose of the study was to determine the effectiveness of an integrated outpatient dual-diagnosis treatment
program for mentally ill subjects. The authors were addressing the problem of substance abuse among
people with severe mental disorders. A retrospective chart review was carried out on 50 patients. The researchers were
interested in the number of inpatient treatment days for psychiatric disorder during the year following the end of the
program. Among 18 patients with schizophrenia, the mean number of treatment days was 4.7 with a standard
deviation of 9.3. For 10 subjects with bipolar disorder, the mean number of treatment days was 8.8 with a
standard deviation of 11.5. We wish to construct a 95% C.I. for the difference between the means of the
populations represented by the two samples.

Solution: 1 − α = 0.95 → α = 0.05 → α/2 = 0.025 → 1 − α/2 = 0.975, n1 + n2 − 2 = 18 + 10 − 2 = 26

t(1−α/2, n1+n2−2) = t(0.975, 26) = 2.0555, then the 95% C.I. for μ1 − μ2 is

(x̄1 − x̄2) ± t(1−α/2, n1+n2−2) √(sp²/n1 + sp²/n2)

sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2) = [(17 × 9.3²) + (9 × 11.5²)] / (18 + 10 − 2) = 102.33

Then, (4.7 − 8.8) ± 2.0555 √(102.33/18 + 102.33/10) = −4.1 ± 8.20 = (−12.3, 4.1)
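The pooled-variance interval of Example 4 can be sketched as follows. The function name `pooled_ci` is an illustrative choice, and the t critical value (2.0555 for 26 degrees of freedom) is supplied from the t table rather than computed.

```python
import math

def pooled_ci(x1, s1, n1, x2, s2, n2, t_crit):
    """CI for mu1 - mu2 with unknown but equal variances (pooled estimate)."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)  # pooled variance
    half_width = t_crit * math.sqrt(sp2 / n1 + sp2 / n2)
    diff = x1 - x2
    return sp2, diff - half_width, diff + half_width

sp2, lo, hi = pooled_ci(4.7, 9.3, 18, 8.8, 11.5, 10, t_crit=2.0555)
print(f"sp2 = {sp2:.2f}, 95% CI = ({lo:.1f}, {hi:.1f})")
```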

NOTES:

When the interval includes zero, the population means may be equal. When it does not include zero, we conclude that the population
means are different.

Confidence Intervals for the Population Proportion, p

p̂ ± z(1−α/2) √(p̂(1 − p̂)/n)

where

• z(1−α/2) is the standard normal value for the level of confidence desired

• p̂ is the sample proportion, calculated as the number of sample members with the characteristic divided by n

• n is the sample size

Example 5

A random sample of 100 people shows that 25 are left-handed. Form a 95% confidence interval for the true
proportion of left-handers.

p̂ ± z √(p̂(1 − p̂)/n)
= 25/100 ± 1.96 √(0.25(0.75)/100)
= 0.25 ± 1.96 (0.0433)
= (0.1651, 0.3349)

Interpretation

We are 95% confident that the true percentage of left-handers in the population is between 16.51% and 33.49%.
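The interval in Example 5 can be checked with a short sketch. The function name `proportion_ci` is illustrative, and the critical value 1.96 is taken from the table.

```python
import math

def proportion_ci(successes, n, z_crit):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = successes / n
    half_width = z_crit * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

lo, hi = proportion_ci(successes=25, n=100, z_crit=1.96)
print(f"95% CI: ({lo:.4f}, {hi:.4f})")
```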

Confidence Interval for the difference between two Population proportions

A 100(1 − α)% confidence interval for p1 − p2 is given by

(p̂1 − p̂2) ± z(1−α/2) √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)

Example 6

Connor investigated gender differences in proactive and reactive aggression in a sample of 323 adults (68 females
and 255 males). In the sample, 31 of the females and 53 of the males were using the internet in an internet café. We
wish to construct a 99% confidence interval for the difference between the proportions of adults who go to internet cafés
in the two sampled populations.

Solution:

1 − α = 0.99 → α = 0.01 → α/2 = 0.005 → 1 − α/2 = 0.995, z(1−α/2) = z(0.995) = 2.58, nF = 68, nM = 255,

p̂F = aF/nF = 31/68 = 0.4559, p̂M = aM/nM = 53/255 = 0.2078

The 99% CI is:

(p̂F − p̂M) ± z(1−α/2) √(p̂F(1 − p̂F)/nF + p̂M(1 − p̂M)/nM)

= (0.4559 − 0.2078) ± 2.58 √(0.4559(1 − 0.4559)/68 + 0.2078(1 − 0.2078)/255)

= 0.2481 ± 2.58(0.0655) = (0.07914, 0.4171)
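A minimal sketch of the two-proportion interval from Example 6 follows. The function name `two_proportion_ci` is illustrative, and the critical value 2.58 is taken from the table. Because the proportions are not rounded to four decimals, the limits differ from the hand calculation in the fourth decimal place.

```python
import math

def two_proportion_ci(a1, n1, a2, n2, z_crit):
    """Normal-approximation CI for p1 - p2 from two independent samples."""
    p1, p2 = a1 / n1, a2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_crit * se, diff + z_crit * se

lo, hi = two_proportion_ci(31, 68, 53, 255, z_crit=2.58)
print(f"99% CI: ({lo:.4f}, {hi:.4f})")
```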

Exercises

Exercise 6.2.1

We wish to estimate the average number of heartbeats per minute for a certain population. The average number of
heartbeats per minute for a sample of 49 subjects was found to be 90. Assume that these 49 patients constitute a
random sample, and that the population is normally distributed with a standard deviation of 10.

n = 49, x̄ = 90, σ = 10

x̄ ± z(1−α/2) · σ/√n

90% CI for µ = 90 ± 1.645 (10/√49) = 90 ± 2.35 = (87.65, 92.35)

This interval may be interpreted from both the probabilistic and practical points of view. We are 90 percent
confident that the true population mean µ is somewhere between 87.65 and 92.35 because, in repeated sampling,
90 percent of intervals constructed in like manner will include the population mean.
To calculate the 95% CI for µ, use z = 1.96; for the 99% CI, use z = 2.58 (2.576 to three decimals).

Ans: (87.2, 92.8) for 95% and (86.32, 93.68) for 99%.

Exercise 6.3.3

Pedroletti et al. (A-3) reported the maximal nitric oxide diffusion rate in a sample of 15 asthmatic schoolchildren
and 15 controls as mean ± standard error of the mean. For asthmatic children, they reported 3.5 ± 0.4 nL/s
(nanoliters per second) and for control subjects they reported 0.7 ± 0.1 nL/s. For each group, determine the
following:
(a) What was the sample standard deviation?
(b) What is the 95 percent confidence interval for the mean maximal nitric oxide diffusion rate of the population?
(c) What assumptions are necessary for the validity of the confidence interval you constructed?
(d) What are the practical and probabilistic interpretations of the interval you constructed?
(e) Which interpretation would be more appropriate to use when discussing confidence intervals
with someone who has not had a course in statistics? State the reasons for your choice.
(f) If you were to construct a 90 percent confidence interval for the population mean from the information given
here, would the interval be wider or narrower than the 95 percent confidence interval? Explain your answer
without actually constructing the interval.
(g) If you were to construct a 99 percent confidence interval for the population mean from the
information given here, would the interval be wider or narrower than the 95 percent confidence
interval? Explain your answer without actually constructing the interval.

Asthmatic Group

n = 15, x̄ = 3.5, SE = 0.4
(a) SE = s/√n, so s = SE × √n = 0.4 × √15 = 1.549
(b) 95% CI = x̄ ± t(0.975, 14) × SE = 3.5 ± 2.1448 × 0.4 = (2.6, 4.4)
(c) Assumptions
• The standard error of the mean is known (it is reported).
• The population is normally distributed.
• The sample is small, so the t distribution is used.
(d) We are 95% confident that the true population mean µ is somewhere between 2.6 and 4.4
because, in repeated sampling, 95% of intervals constructed in like manner will
include the population mean.

Control Group

n = 15, x̄ = 0.7, SE = 0.1
(a) SE = s/√n, so s = SE × √n = 0.1 × √15 = 0.39
(b) 95% CI = x̄ ± t(0.975, 14) × SE = 0.7 ± 2.1448 × 0.1 = (0.49, 0.91)
(c) Assumptions
• The standard error of the mean is known (it is reported).
• The population is normally distributed.
• The sample is small, so the t distribution is used.
(d) We are 95% confident that the true population mean µ is somewhere between 0.49 and 0.91
because, in repeated sampling, 95% of intervals constructed in like manner will
include the population mean.
(e)

(f) Narrower
(g) Wider

Exercise 6.4.1

Iannelo et al. (A-8) performed a study that examined free fatty acid concentrations in 18 lean subjects and 11
obese subjects. The lean subjects had a mean level of 299 mEq/L with a standard error of the mean of 30, while
the obese subjects had a mean of 744 mEq/L with a standard error of the mean of 62.

Let group 1 be the lean subjects and group 2 the obese subjects.

n1 = 18, x̄1 = 299, SE1 = 30, s1 = 30 × √18 = 127.28

n2 = 11, x̄2 = 744, SE2 = 62, s2 = 62 × √11 = 205.63

Standard error of the difference: √(s1²/n1 + s2²/n2) = √(30² + 62²) = 68.876

Using (x̄1 − x̄2) ± z(1−α/2) √(s1²/n1 + s2²/n2):

95% CI = (299 − 744) ± 1.96 × 68.876 = −445 ± 135.0 = (−580.0, −310.0) #

For the 90% and 99% CIs, use z = 1.645 and z = 2.58.

Exercise 6.4.7

Twenty-four experimental animals with vitamin D deficiency were divided equally into two groups. Group 1
received treatment consisting of a diet that provided vitamin D. The second group was not treated. At the end of
the experimental period, serum calcium determinations were made with the following results:
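Treated group: x̄ = 11.1 mg/100 ml; s = 1.5
Untreated group: x̄ = 7.8 mg/100 ml; s = 2.0
Assume normally distributed populations with equal variances.

Let group 1 be the treated group and group 2 the untreated group.

n1 = n2 = 12, x̄1 = 11.1, s1 = 1.5, x̄2 = 7.8, s2 = 2.0

Using the equation for the pooled estimate of the variance,

sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2) = [11(1.5²) + 11(2.0²)] / 22 = 3.13 #

95% CI for μ1 − μ2 = (11.1 − 7.8) ± 2.0739 × 0.72, where t(0.975, 22) = 2.0739 (Table E) and √(sp²/n1 + sp²/n2) = 0.72

= 3.3 ± 1.49 = (1.8, 4.8) #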

90% CI and 99% CI use respective t value.

Exercise 6.4.9

The average length of stay of a sample of 20 patients discharged from a general hospital was 7 days with a
standard deviation of 2 days. A sample of 24 patients discharged from a chronic disease hospital had an average
length of stay of 36 days with a standard deviation of 10 days. Assume normally distributed populations with
unequal variances.
Let group 1 be the acute (general hospital) cases and group 2 the chronic cases.

n1 = 20, x̄1 = 7, s1 = 2, n2 = 24, x̄2 = 36, s2 = 10

Using the t′ equation for populations with unequal variances,

(x̄1 − x̄2) ± t′(1−α/2) √(s1²/n1 + s2²/n2)

where

t′(1−α/2) = (w1·t1 + w2·t2) / (w1 + w2),

w1 = s1²/n1 = 4/20 = 0.2, w2 = s2²/n2 = 100/24 = 4.17

For the 95% CI: df1 = n1 − 1 = 19, t1 = 2.0930; df2 = n2 − 1 = 23, t2 = 2.0687 (using Table E)

t′ = [0.2(2.0930) + 4.17(2.0687)] / (0.2 + 4.17) = 2.07 #

95% CI for μ1 − μ2 = (7 − 36) ± 2.07 √(0.2 + 4.17) = −29 ± 4.33 = (−33.33, −24.67) #

The 90% and 99% CIs are found by the same method.
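The t′ calculation for unequal variances can be sketched in code as below. The function name `t_prime_ci` is illustrative, and the two tabulated t values (2.0930 for 19 degrees of freedom, 2.0687 for 23) are passed in from Table E rather than computed. The weights are kept unrounded here, so the limits may differ from a hand calculation in the second decimal place.

```python
import math

def t_prime_ci(x1, s1, n1, t1, x2, s2, n2, t2):
    """CI for mu1 - mu2 with unequal variances, using the weighted critical value t'."""
    w1, w2 = s1**2 / n1, s2**2 / n2          # weights = estimated variances of the means
    t_prime = (w1 * t1 + w2 * t2) / (w1 + w2)
    half_width = t_prime * math.sqrt(w1 + w2)
    diff = x1 - x2
    return t_prime, diff - half_width, diff + half_width

t_prime, lo, hi = t_prime_ci(7, 2, 20, 2.0930, 36, 10, 24, 2.0687)
print(f"t' = {t_prime:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```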

Exercise 6.5.1

Luna et al. (A-14) studied patients who were mechanically ventilated in the intensive care unit of six hospitals in
Buenos Aires, Argentina. The researchers found that of 472 mechanically ventilated patients, 63 had clinical
evidence of ventilator-associated pneumonia (VAP). Construct a 95 percent confidence interval for the proportion
of all mechanically ventilated patients at these hospitals who may be expected to develop VAP.

Similar method with Example 5.

Exercise 6.6.1 and 3


Similar method with Example 6.

…………………………………………………………………………………………………………….

Hypothesis Testing
Testing a hypothesis about the mean of a population

1. Data: identify the variable, sample size (n), sample mean (x̄), and the population standard deviation (σ) or sample standard
deviation (s). Numerical data are summarized by means; categorical data are summarized by proportions.

2. Assumptions: We have two cases:

• Case 1: The population is normally or approximately normally distributed, with known or unknown variance; the
sample size n may be small or large.

This is similar to estimation Case 1.

• Case 2: The population is not normal, with known or unknown variance, and n is large; the z or t test may be used.

This is similar to estimation Case 2.

3.Hypothesis: we have three cases

• Case I : e.g. we want to test that the population mean is different from 50

H0: μ=50 HA: μ ≠ 50

• Case II : e.g. we want to test that the population mean is greater than 50

H0: μ ≤50 HA: μ > 50

• Case III: e.g. we want to test that the population mean is less than 50

H0: μ ≥ 50 HA: μ< 50

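4. Test statistic

z = (x̄ − μ0) / (σ/√n) when σ is known, or t = (x̄ − μ0) / (s/√n) when σ is unknown.

5. Distribution of test statistic
If the null hypothesis is true and the assumptions are met, z follows the standard normal distribution and t follows
Student’s t distribution with n − 1 degrees of freedom.
6. Decision rule
If HA: μ ≠ μ0, reject H0 if Z > z(1−α/2) or Z < −z(1−α/2) (when using the Z test), or
reject H0 if T > t(1−α/2, n−1) or T < −t(1−α/2, n−1) (when using the t test).

If HA: μ > μ0, reject H0 if Z > z(1−α) (when using the Z test), or
reject H0 if T > t(1−α, n−1) (when using the t test).

If HA: μ < μ0, reject H0 if Z < −z(1−α) (when using the Z test), or
reject H0 if T < −t(1−α, n−1) (when using the t test).

Note: In the t table, for a two-sided test at α = 0.05 look up t(0.975); for a one-sided test at α = 0.05 look up t(0.95). The t table
lists only positive values, so use the negative of the tabulated value for lower-tail (<) cases.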

7. Calculation of test statistic


By using Equation from test statistic and calculate from sample data.

8. Statistical decision
We reject the null hypothesis when the computed value of the test statistic falls in the rejection region, that is, when it is
at least as extreme as the critical value from the table.

9. Conclusion
If H0 is rejected, we conclude that HA is true; if H0 is not rejected, we conclude that H0 may be true.

10. Calculation of p value

The p-value is defined as the smallest value of α for which the null hypothesis can be rejected.
If the p-value is less than or equal to α, we reject the null hypothesis (p ≤ α).
If the p-value is greater than α, we do not reject the null hypothesis (p > α).

Note: The decision depends on α (equivalently, on the confidence level). Look up the computed z value in Table D. For an
upper-tail test, p = 1 − (tabulated area); for a lower-tail test, p = tabulated area. For a two-sided test, multiply the
one-tail p value by 2; do not do so for a one-sided test.
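The p value rules in the note above can be written as a small function. This is a sketch for z tests only; the function name `p_value_z` and the tail labels "upper", "lower" and "two" are assumptions introduced for illustration.

```python
import math

def phi(z):
    """Standard normal CDF, P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_value_z(z, tail):
    """p value for a computed z statistic; tail is 'upper', 'lower' or 'two'."""
    if tail == "upper":      # HA: mu > mu0
        return 1.0 - phi(z)
    if tail == "lower":      # HA: mu < mu0
        return phi(z)
    if tail == "two":        # HA: mu != mu0, double the one-tail area
        return 2.0 * (1.0 - phi(abs(z)))
    raise ValueError("tail must be 'upper', 'lower' or 'two'")

p_two = p_value_z(1.67, "two")      # two-sided test with z = 1.67
p_upper = p_value_z(0.76, "upper")  # upper-tail test with z = 0.76
print(round(p_two, 4), round(p_upper, 4))
```

Comparing the returned value with α then gives the same decision as the critical-value rule.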

Exercise 7.2.3

The purpose of a study by Luglie et al. (A-5) was to investigate the oral status of a group of patients diagnosed
with thalassemia major (TM). One of the outcome measures was the decayed, missing, and filled teeth index
(DMFT). In a sample of 18 patients the mean DMFT index value was 10.3 with a standard deviation of 7.3. Is this
sufficient evidence to allow us to conclude that the mean DMFT index is greater than 9.0 in a population of
similar subjects? Let α = 0.1
Solution
1. Data
Sample size (n) = 18, mean DMFT index (x̄) = 10.3, standard deviation = 7.3 (treated here as the known population value σ),
α = 0.1
2. Assumptions
The 18 patients are assumed to be a random sample from a normally distributed population with known variance.
3. Hypothesis
H0: µ ≤ 9
HA: µ > 9

4. Test statistic
Since we assume that the population is normally distributed, and since the population variance is taken as known,
our test statistic is
z = (x̄ − µ0) / (σ/√n)

5. Distribution of test statistic
The test statistic follows the standard normal distribution (mean 0, variance 1) if H0 is true.
6. Decision rule
With α = 0.1, the critical value of the test statistic is 1.29 according to Table D. Reject H0 if computed z ≥ 1.29.
The rejection region lies to the right of 1.29.
7. Calculation of the test statistic
z = (10.3 − 9) / (7.3/√18) = 1.3/1.72 = 0.76
8. Statistical decision
We are not able to reject the null hypothesis since our calculated value, 0.76, is less than 1.29.
9. Conclusion
There is not sufficient evidence to conclude that the mean DMFT index is greater than 9 in the population of similar subjects.
10. P value
p = P(Z > 0.76) = 1 − 0.7764 = 0.2236, which is greater than 0.1.

Exercise 7.2.7

A sample of 25 freshman nursing students made a mean score of 77 on a test designed to measure attitude toward
the dying patient. The sample standard deviation was 10. Do these data provide sufficient evidence to indicate, at
the .05 level of significance, that the population mean is less than 80? What assumptions are necessary?
Solution

1. Data
Sample size (n) = 25, mean score (x̄) = 77, sample standard deviation (s) = 10, α = 0.05
2. Assumption
The sample is small, and the population is assumed to be normally distributed.
3. Hypothesis
H0: µ ≥ 80, HA: µ < 80
4. Test statistic
t = (x̄ − µ0) / (s/√n)
5. Distribution of test statistic
Our test statistic is distributed as Student’s t with n − 1 = 25 − 1 = 24 degrees of freedom if H0 is true.
6. Decision rule
With α = 0.05, the critical value of the test statistic is −1.7109 (Table E, t(0.95) with 24 degrees of freedom).
Reject H0 if computed t ≤ −1.7109. The rejection region lies to the left of −1.7109.

7. Calculation of the test statistic
t = (77 − 80) / (10/√25) = −3/2 = −1.5
8. Statistical decision
Do not reject H0 since −1.5 > −1.7109.
9. Conclusion
The data do not provide sufficient evidence that the population mean is less than 80; it may be 80 or more.
10. P value
0.05 < p < 0.10 (Table E with 24 degrees of freedom; for comparison, Table D gives 0.0668 for z = −1.5)

Exercise 7.2.17

Suppose it is known that the IQ scores of a certain population of adults are approximately normally distributed
with a standard deviation of 15. A simple random sample of 25 adults drawn from this population had a mean IQ
score of 105. On the basis of these data can we conclude that the mean IQ score for the population is not 100? Let
the probability of committing a type I error be .05.
Solution

1. Data
n = 25, population standard deviation (σ) = 15, sample mean (x̄) = 105, α = 0.05
2. Assumption
The population of adults is approximately normally distributed and the variance is known.
3. Hypothesis
H0: µ = 100, HA: µ ≠ 100
4. Test statistic
z = (x̄ − µ0) / (σ/√n)

5. Distribution of test statistic
When the null hypothesis is true, the test statistic follows the standard normal distribution.
6. Decision rule
Let α = 0.05. Since we have a two-sided test, α/2 = 0.025 in each tail, and the critical values of z are ±1.96.
Reject H0 if computed z ≤ −1.96 or computed z ≥ 1.96. The non-rejection zone lies between −1.96 and +1.96; the rejection
zones lie in the two tails.
7. Calculation of test statistic
z = (105 − 100) / (15/√25) = 5/3 = 1.67

8. Statistical decision - Do not reject H0 since 1.67 falls in the non-rejection zone.
9. Conclusion – the mean of the population from which the sample was drawn may be 100.
10. P value
P = 2 × 0.0475 = 0.095, so the p value is > 0.05 (multiplied by 2 because the test is two-sided).

Exercise 7.3.1

Subjects in a study by Dabonneville et al. (A-9) included a sample of 40 men who claimed to engage in a variety
of sports activities (multisport). The mean body mass index (BMI) for these men was 22.41 with a standard
deviation of 1.27. A sample of 24 male rugby players had a mean BMI of 27.75 with a standard deviation of 2.64.
Is there sufficient evidence for one to claim that, in general, rugby players have a higher BMI than the multisport
men? Let α = 0.01.

Solution
1. Data
Supposing, multisport is 1 and rugby player is 2
So, n1 = 40, x̄1 = 22.41, s1 = 1.27
n2 = 24, x̄2 = 27.75, s2 = 2.64
2. Assumption
The statistics were computed from two independent samples from normally distributed populations. Since the population
variances are unknown, we use the sample variances in the test statistic, pooling them on the assumption that the
population variances are equal.
3. Hypothesis
H0: μ1 ≥ μ2, HA: μ1 < μ2
4. Test statistic
Since the population variances are unknown and assumed equal,

T = [(x̄1 − x̄2) − (μ1 − μ2)0] / √(sp²/n1 + sp²/n2), where sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)

5. Distribution of test statistic
When the null hypothesis is true, the test statistic follows Student’s t distribution with n1 + n2 − 2 = 62 degrees of freedom.
6. Decision rule
Let α = 0.01. This is a one-sided test with a critical value of T equal to −2.388. Reject H0 if computed T ≤
−2.388.
7. Calculation of test statistic
sp² = [39(1.27²) + 23(2.64²)] / 62 = 3.6

T = (22.41 − 27.75) / √(3.6/40 + 3.6/24) = −5.34/0.49 = −10.9

8. Statistical decision
Reject H0 since T = −10.9 is in the rejection region.
9. Conclusion
The data support the claim that rugby players have a higher mean BMI than the multisport men.
10. P value
P < 0.0001

Exercise 7.3.5

Garção and Cabrita (A-13) wanted to evaluate the community pharmacist’s capacity to positively influence the
results of antihypertensive drug therapy through a pharmaceutical care program in Portugal. Eighty-two subjects
with essential hypertension were randomly assigned to an intervention or a control group. The intervention group
received monthly monitoring by a research pharmacist to monitor blood pressure, assess adherence to treatment,
prevent, detect, and resolve drug-related problems, and encourage non-pharmacologic measures for blood
pressure control. The changes after 6 months in diastolic blood pressure (pre − post, mm Hg) are given in the
following table for patients in each of the two groups.

On the basis of these data, what should the researcher conclude? Let α = 0.05.

Solution
1. Data
Supposing intervention group is 1 and control group is 2,
So, n1 = 42, x̄1 = 13.22, s1 = 9.5
n2 = 42, x̄2 = 6.44, s2 = 6.8, α = 0.05

2. Assumption
Since the sample sizes are large and the population variances are unknown, we will use the sample variances
in the calculation of the test statistic.

3. Hypothesis
H0: μ1 = μ2, HA: μ1 ≠ μ2

4. Test statistic
Since we have large samples and the population variances are unknown,

z = [(x̄1 − x̄2) − (μ1 − μ2)0] / √(s1²/n1 + s2²/n2)

5. Distribution of test statistic

When the null hypothesis is true, the test statistic is distributed approximately as the standard normal.

6. Decision rule
Let α = 0.05. Since we have a two-sided test, α/2 = 0.025 in each tail, and the critical values of z are ±1.96.
Reject H0 if computed z ≤ −1.96 or computed z ≥ 1.96.

7. Calculation of test statistic

z = (13.22 − 6.44) / √(9.5²/42 + 6.8²/42) = 6.78/1.80 = 3.76 ≈ 3.8

8. Statistical decision
Reject H0 since 3.8 > 1.96.

9. Conclusion
The two population means are not equal.

10. P value = 2 × 0.0001 = 0.0002

Exercise 7.4.1
Ellen Davis Jones (A-15) studied the effects of reminiscence therapy for older women with
depression. She studied 15 women 60 years or older residing for 3 months or longer in an assisted living long-
term care facility. For this study, depression was measured by the Geriatric Depression Scale (GDS). Higher
scores indicate more severe depression symptoms. The participants received reminiscence therapy for long-term
care, which uses family photographs, scrapbooks, and personal memorabilia to stimulate memory and
conversation among group members. Pre-treatment and post treatment depression scores are given in the
following table. Can we conclude, based on these data, that subjects who participate in reminiscence therapy
experience, on average, a decline in GDS depression scores? Let α = 0.01

Pre–GDS: 12 10 16 2 12 18 11 16 16 10 14 21 9 19 20
Post–GDS: 11 10 11 3 9 13 8 14 16 10 12 22 9 16 18

Solution
1. Data
Pre–GDS: 12 10 16 2 12 18 11 16 16 10 14 21 9 19 20
Post–GDS: 11 10 11 3 9 13 8 14 16 10 12 22 9 16 18
di = post − pre = −1, 0, −5, 1, −3, −5, −3, −2, 0, 0, −2, 1, 0, −3, −2
Σdi = −24
di² = 1, 0, 25, 1, 9, 25, 9, 4, 0, 0, 4, 1, 0, 9, 4
Σdi² = 92
2. Assumption
The observed differences constitute a simple random sample from a normally distributed population of differences.

3. Hypothesis
H0: μd ≥ 0, HA: μd < 0

4. Test statistic
t = (d̄ − μd0) / (sd/√n)

5. Distribution of test statistic

If the null hypothesis is true, the test statistic is distributed as Student’s t with n − 1 = 14 degrees of
freedom.

6. Decision rule
Let α = 0.01. The critical value of t is −2.624. Reject H0 if computed t is less than or equal to −2.624.
The rejection region lies to the left of −2.624.

7. Calculation of test statistic
n = 15
d̄ = Σdi/n = −24/15 = −1.6

sd² = [Σdi² − (Σdi)²/n] / (n − 1) = (92 − 38.4)/14 = 3.83

t = −1.6 / √(3.83/15) = −3.17

8. Statistical decision
Reject H0 since −3.17 is in the rejection zone.
9. Conclusion
We may conclude that subjects who participate in reminiscence therapy experience, on average, a decline in GDS depression
scores.
10. P value
P < 0.005
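The paired calculation for Exercise 7.4.1 can be sketched in code. The function name `paired_t` is illustrative, and the differences are taken as post minus pre, as in the solution above. Only the test statistic is computed; the critical value still comes from Table E.

```python
import math
import statistics

pre = [12, 10, 16, 2, 12, 18, 11, 16, 16, 10, 14, 21, 9, 19, 20]
post = [11, 10, 11, 3, 9, 13, 8, 14, 16, 10, 12, 22, 9, 16, 18]

def paired_t(before, after, mu_d0=0.0):
    """Return (mean difference, variance of differences, t) for paired data."""
    d = [a - b for a, b in zip(after, before)]   # d_i = after - before
    n = len(d)
    d_bar = statistics.mean(d)
    s_d = statistics.stdev(d)                    # sample SD, divisor n - 1
    t = (d_bar - mu_d0) / (s_d / math.sqrt(n))
    return d_bar, s_d**2, t

d_bar, var_d, t = paired_t(pre, post)
print(f"d_bar = {d_bar:.2f}, s_d^2 = {var_d:.2f}, t = {t:.2f}")
```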

Exercise 7.4.3

The purpose of an investigation by Morley et al. (A-17) was to evaluate the analgesic effectiveness of a daily dose
of oral methadone in patients with chronic neuropathic pain syndromes. The researchers used a visual analogue
scale (0–100 mm, higher number indicates higher pain) ratings for maximum pain intensity over the course of the
day. Each subject took either 20 mg of methadone or a placebo each day for 5 days. Subjects did not know which
treatment they were taking. The following table gives the mean maximum pain intensity scores for the 5 days on
methadone and the 5 days on placebo. Do these data provide sufficient evidence, at the .05 level of significance,
to indicate that in general the maximum pain intensity is lower on days when methadone is taken?

Solution
1. Data
di (methadone − placebo) = −27.4, 3.2, 0.4, −3.6, −6.6, −13.4, −10.6, −6.4, −1.4, −13.4, −26.6
Σdi = −105.8
di² = 750.76, 10.24, 0.16, 12.96, 43.56, 179.56, 112.36, 40.96, 1.96, 179.56, 707.56
Σdi² = 2039.64

2. Assumption
The observed differences constitute a simple random sample from a normally distributed population of differences.

3. Hypothesis
H0: μd ≥ 0, HA: μd < 0

4. Test statistic
t = (d̄ − μd0) / (sd/√n)

5. Distribution of test statistic

If the null hypothesis is true, the test statistic is distributed as Student’s t with n − 1 = 10 degrees of
freedom.

6. Decision rule
Let α = 0.05. The critical value of t is −1.8125. Reject H0 if computed t is less than or equal to the critical
value. The rejection region lies to the left of −1.8125.

7. Calculation of test statistic
n = 11
d̄ = Σdi/n = −105.8/11 = −9.6

sd² = [Σdi² − (Σdi)²/n] / (n − 1) = (2039.64 − 1017.60)/10 = 102.2

t = −9.6 / √(102.2/11) = −3.15

8. Statistical decision
Reject H0 since −3.15 is in the rejection zone.

9. Conclusion
We may conclude that maximum pain intensity is, on average, lower on days when methadone is taken.

10. P value
0.005 < P < 0.01 (Table E with 10 degrees of freedom), so P < 0.05

Paired Comparison

The objective in paired comparisons tests is to eliminate a maximum number of sources of extraneous variation
by making the pairs similar with respect to as many variables as possible.

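The test statistic is $t = \dfrac{\bar{d} - \mu_{d_0}}{s_{\bar{d}}}$, where $d_i$ is the difference between pairs of observations, $\bar{d}$ is the sample mean difference, $\mu_{d_0}$ is the hypothesized
population mean difference, $s_{\bar{d}} = s_d/\sqrt{n}$, n is the number of sample differences, and $s_d$ is the standard deviation of
the sample differences.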

The Use of z
If, in the analysis of paired data, the population variance of the differences is known, the appropriate test statistic
is

$z = \dfrac{\bar{d} - \mu_d}{\sigma_d / \sqrt{n}}$

If the assumption of normally distributed $d_i$’s cannot be made, the central limit theorem may be employed if n is
large. In such cases, the test statistic is the z equation above, with $s_d$ used to estimate $\sigma_d$ when, as is generally the case, the latter is unknown. If
we do not use paired observations, we have 2n - 2 degrees of freedom available as compared to n - 1 when we use
the paired comparisons procedure.

Exercise 7.4.1 *****

Ellen Davis Jones (A-15) studied the effects of reminiscence therapy for older women with
depression. She studied 15 women 60 years or older residing for 3 months or longer in an assisted living long-
term care facility. For this study, depression was measured by the Geriatric Depression Scale (GDS). Higher
scores indicate more severe depression symptoms. The participants received reminiscence therapy for long-term
care, which uses family photographs, scrapbooks, and personal memorabilia to stimulate memory and
conversation among group members. Pre-treatment and post treatment depression scores are given in the
following table. Can we conclude, based on these data, that subjects who participate in reminiscence therapy
experience, on average, a decline in GDS depression scores? Let α = 0.01.

Solution
1. Data – the data consist of GDS depression scores for 15 older women before and after reminiscence therapy.

$d_i$ = pretreatment – posttreatment differences:

1, 0, 5, -1, 3, 5, 3, 2, 0, 0, 2, -1, 0, 3, 2
2. Assumption - The observed differences constitute a simple random sample from a normally distributed
population of differences.
3. Hypothesis - In the problem, we want to know if we can conclude that reminiscence therapy is useful
in decreasing GDS scores. If it is effective, we would expect the posttreatment score to tend
to be lower than the pretreatment score. If, therefore, we subtract the posttreatment score from the
pretreatment score, we would expect the differences to tend to be positive.
$H_0: \mu_d \le 0$, $H_A: \mu_d > 0$

4. Test statistics - $t = \dfrac{\bar{d} - \mu_{d_0}}{s_d / \sqrt{n}}$

5. Distribution of test statistic - If the null hypothesis is true, the test statistic is distributed as Student’s t
with n - 1 degrees of freedom.
6. Decision rule - Let α = 0.01. The critical value of t is 2.624. Reject H0 if computed t is greater than or
equal to the critical value. The rejection and non-rejection regions are shown in Figure.

[Figure: upper-tail rejection region of area α = 0.01, beginning at t = 2.624.]

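7. Calculation of test statistics

$\bar{d} = \dfrac{\sum d_i}{n} = \dfrac{24}{15} = 1.6$

$s_d^2 = \dfrac{n\sum d_i^2 - (\sum d_i)^2}{n(n-1)} = \dfrac{15(92) - (24)^2}{15(14)} = 3.83$

$t = \dfrac{\bar{d} - \mu_{d_0}}{s_d/\sqrt{n}} = \dfrac{1.6 - 0}{\sqrt{3.83/15}} = \dfrac{1.6}{0.505} = 3.17$ #

8. Statistical Decision
Reject H0 since 3.17 is in the rejection zone.
9. Conclusion - Subjects who participate in reminiscence therapy experience, on average, a decline in GDS depression scores.
10. P value – p < 0.005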

HYPOTHESIS TESTING: A SINGLE POPULATION PROPORTION

Exercise 7.5.1
Jacquemyn et al. (A-21) conducted a survey among gynecologists-obstetricians in the Flanders region and
obtained 295 responses. Of those responding, 90 indicated that they had performed at least one cesarean section
on demand every year. Does this study provide sufficient evidence for us to conclude that less than 35 percent of
the gynecologists-obstetricians in the Flanders region perform at least one cesarean section on demand each year?
Let α = 0.05
Solution
1. Data – the data are obtained from 295 responses, of which 90 indicated that they had performed at least
one cesarean section on demand every year. $\hat{p} = 90/295 = 0.305$

2. Assumption - The study subjects may be treated as a simple random sample from a population of similar
subjects, and the sampling distribution of ^p is approximately normally distributed in accordance with the
central limit theorem.
3. Hypothesis
$H_0: p \ge 0.35$
$H_A: p < 0.35$
4. Test statistics - $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$

5. Distribution of test statistic. If the null hypothesis is true, the test statistic is approximately normally
distributed with a mean of zero.
6. Decision rule
Let α = 0.05. The critical value of z is -1.645. Reject H0 if the computed z is ≤ -1.645.
7. Calculation of test statistics
$z = \dfrac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}} = \dfrac{0.305 - 0.35}{\sqrt{(0.35)(0.65)/295}} = \dfrac{-0.045}{0.0278} = -1.62$ #

8. Statistical decision
Do not reject H0 since -1.62 is greater than -1.645.
9. Conclusion - We cannot conclude that in the sampled population the proportion who perform at least one
cesarean section on demand each year is less than 35 percent.
10. P value – p = 0.0526, so it is greater than 0.05.
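
As a cross-check, the single-proportion z statistic can be recomputed directly from the counts given in the exercise (90 of 295 respondents, hypothesized proportion 0.35). The sketch below assumes Python 3.8 or later for `statistics.NormalDist`; the function name `one_proportion_z` is illustrative.

```python
import math
from statistics import NormalDist


def one_proportion_z(successes, n, p0):
    """Return (sample proportion, z statistic) for H0: p = p0."""
    p_hat = successes / n
    # Standard error uses the hypothesized proportion p0, not p_hat.
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    return p_hat, (p_hat - p0) / standard_error


p_hat, z = one_proportion_z(90, 295, 0.35)
# Lower-tail p value, because the alternative hypothesis is p < 0.35.
p_value = NormalDist().cdf(z)

print(f"p_hat = {p_hat:.4f}, z = {z:.3f}, one-sided p = {p_value:.4f}")
print("reject H0" if z <= -1.645 else "do not reject H0")
```

Carrying more decimal places than a hand calculation matters here, because the statistic lies very close to the critical value of -1.645.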

HYPOTHESIS TESTING THE DIFFERENCE BETWEEN TWO POPULATION PROPORTIONS

Analysis of Variance (ANOVA)
Hypotheses of One-Way ANOVA

$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$

– All population means are equal
– i.e., no treatment effect (no variation in means among groups)

$H_A$: Not all of the population means are the same

– At least one population mean is different
– i.e., there is a treatment effect
– Does not mean that all population means are different (some pairs may be the same)

Total variation can be split into two parts

SST = SSA + SSW

SST = Total Sum of Squares (Total variation)

SSA = Sum of Squares Among Groups (Among-group variation)

= Variation Due to Factor

= Commonly referred to as:

– Sum of Squares Between
– Sum of Squares Among
– Sum of Squares Explained

SSW = Sum of Squares Within Groups (Within-group variation)

= Variation Due to Random Sampling

= Commonly referred to as:

– Sum of Squares Within
– Sum of Squares Error
– Sum of Squares Unexplained

$SST = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{\bar{X}})^2$

$SST = (X_{11} - \bar{\bar{X}})^2 + (X_{12} - \bar{\bar{X}})^2 + \dots + (X_{n_k k} - \bar{\bar{X}})^2$

$MST = \dfrac{SST}{n - 1}$

SST = Total sum of squares

k = number of groups (levels or treatments)

$n_j$ = number of observations in group j

$X_{ij}$ = ith observation from group j

$\bar{\bar{X}}$ = grand mean (mean of all data values)

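$SSA = \sum_{j=1}^{k} n_j (\bar{X}_j - \bar{\bar{X}})^2$ (Or) $SSB = \left( \dfrac{T_1^2}{n_1} + \dfrac{T_2^2}{n_2} + \dots + \dfrac{T_k^2}{n_k} \right) - \dfrac{(\sum x)^2}{n}$

$SSA = n_1 (\bar{x}_1 - \bar{\bar{x}})^2 + n_2 (\bar{x}_2 - \bar{\bar{x}})^2 + \dots + n_k (\bar{x}_k - \bar{\bar{x}})^2$

$MSA = \dfrac{SSA}{k - 1}$ (Mean Square Among = SSA/degrees of freedom)

k = number of groups or populations

$n_j$ = sample size from group j ($n_1$, $n_2$, …)

$\bar{X}_j$ = sample mean from group j

$\bar{\bar{X}}$ = grand mean (mean of all data values)

$T_j$ = total (sum of the observations) of group j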

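$SSW = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2$

$SSW = (X_{11} - \bar{X}_1)^2 + (X_{12} - \bar{X}_2)^2 + \dots + (X_{n_k k} - \bar{X}_k)^2$

$= \sum x^2 - \left( \dfrac{T_1^2}{n_1} + \dfrac{T_2^2}{n_2} + \dots + \dfrac{T_k^2}{n_k} \right)$

$MSW = \dfrac{SSW}{n - k}$ (Mean Square Within = SSW/degrees of freedom)

F Test Statistic

$F = \dfrac{MSA}{MSW}$
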
Degrees of freedom

– df1 = k – 1 (k = number of groups)


– df2 = n – k (n = sum of sample sizes from all populations)

Example 1

You want to see if three different golf clubs yield different distances. You randomly select five measurements
from trials on an automated driving machine for each club. At the .05 significance level, is there a difference in
mean distance?

Club 1   Club 2   Club 3
254      234      200
263      218      222
241      235      197
237      227      206
251      216      204

$\bar{x}_1 = 249.2$, $T_1 = 1246$
$\bar{x}_2 = 226$, $T_2 = 1130$
$\bar{x}_3 = 205.8$, $T_3 = 1029$
$n_1 = n_2 = n_3 = 5$, N = 15, K = 3
$\bar{\bar{x}} = 227$, $\sum x = 3405$

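$SSA = 5(249.2 - 227)^2 + 5(226 - 227)^2 + 5(205.8 - 227)^2 = 4716.4$

OR

$SSB = \dfrac{1246^2}{5} + \dfrac{1130^2}{5} + \dfrac{1029^2}{5} - \dfrac{3405^2}{15}$

= 777651.4 – 772935 = 4716.4

$SSW = (254 - 249.2)^2 + (263 - 249.2)^2 + \dots + (204 - 205.8)^2 = 1119.6$

$= \sum x^2 - \left( \dfrac{T_1^2}{n_1} + \dfrac{T_2^2}{n_2} + \dfrac{T_3^2}{n_3} \right)$ = 778771 – 777651.4 = 1119.6

MSA or MSB = 4716.4 / (3-1) = 2358.2

MSW = 1119.6 / (15-3) = 93.3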

$F = \dfrac{MSA}{MSW} = \dfrac{2358.2}{93.3} = 25.275$ #
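
The sums of squares in Example 1 can be reproduced from the raw distances. The following is a minimal sketch in plain Python; `one_way_anova` is an illustrative helper, and the critical value of about 3.89 for F with 2 and 12 degrees of freedom at α = .05 comes from standard F tables, not from these notes.

```python
# Distances from Example 1, one list per golf club.
clubs = [
    [254, 263, 241, 237, 251],  # Club 1
    [234, 218, 235, 227, 216],  # Club 2
    [200, 222, 197, 206, 204],  # Club 3
]


def one_way_anova(groups):
    """Return (SSA, SSW, MSA, MSW, F) for a one-way layout."""
    all_values = [x for g in groups for x in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    group_means = [sum(g) / len(g) for g in groups]
    # Among-group variation: group size times squared deviation of each group mean.
    ssa = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group variation: squared deviations from each group's own mean.
    ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    msa = ssa / (k - 1)
    msw = ssw / (n - k)
    return ssa, ssw, msa, msw, msa / msw


ssa, ssw, msa, msw, f_ratio = one_way_anova(clubs)
f_critical = 3.89  # assumed table value for F(2, 12) at alpha = .05

print(f"SSA = {ssa:.1f}, SSW = {ssw:.1f}, MSA = {msa:.1f}, MSW = {msw:.1f}, F = {f_ratio:.3f}")
print("reject H0" if f_ratio >= f_critical else "do not reject H0")
```

Since the computed F is far above the tabulated value, the hypothesis of equal mean distances would be rejected at the .05 level.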

Example 2

Source of variation   Sum of squares     Degrees of freedom   Mean square          Variance ratio
Treatment             5.05835 (SSA)      2 (k – 1)            MSA = SSA/(k – 1)    F = MSA/MSW
Errors                SSW                (n – k)              MSW = SSW/(n – k)
Total                 70.47925 (SST)     29 (n – 1)

n – k = 27 (n = 30, k = 3)
SSW = SST - SSA = 65.4209

MSA = SSA/(k – 1) = 5.05835 / 2 = 2.529

MSW = SSW/(n – k) = 65.4209 / 27 = 2.423

F = MSA/MSW = 2.529 / 2.423 = 1.044 #

The Tukey-HSD Procedure


– Tells which population means are significantly different (e.g.: μ1 = μ2 ≠ μ3)
– Done after rejection of equal means in ANOVA
– Allows pair-wise comparisons
– Compare absolute mean differences with critical range

$\text{Critical Range} = q_c \sqrt{\dfrac{MSW}{2} \left( \dfrac{1}{n_j} + \dfrac{1}{n_{j'}} \right)}$

where:

– $q_c$ = Value from the Studentized Range Distribution with k and n - k degrees of freedom for the
desired level of α
– MSW = Mean Square Within
– $n_j$ and $n_{j'}$ = Sample sizes from groups j and j’

By Example 1,

1. Compute absolute mean differences:

$|\bar{x}_1 - \bar{x}_2| = |249.2 - 226.0| = 23.2$
$|\bar{x}_1 - \bar{x}_3| = |249.2 - 205.8| = 43.4$
$|\bar{x}_2 - \bar{x}_3| = |226.0 - 205.8| = 20.2$

2. Find the $q_c$ value from the appendix table H with k = 3 and (n – k) = (15 – 3) = 12 degrees of
freedom for the desired level of α (α = .05 used here):

$q_c = 3.77$

3. Compute Critical Range:


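$\text{Critical Range} = q_c \sqrt{\dfrac{MSW}{2} \left( \dfrac{1}{n_j} + \dfrac{1}{n_{j'}} \right)} = 3.77 \sqrt{\dfrac{93.3}{2} \left( \dfrac{1}{5} + \dfrac{1}{5} \right)} = 16.285$ #

4. Compare each absolute mean difference with the critical range (16.285):

$|\bar{x}_1 - \bar{x}_2| = 23.2$
$|\bar{x}_1 - \bar{x}_3| = 43.4$
$|\bar{x}_2 - \bar{x}_3| = 20.2$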

5. All of the absolute mean differences are greater than critical range. Therefore there is a
significant difference between each pair of means at 5% level of significance.
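
The Tukey comparison for Example 1 can be sketched in a few lines. The group means, MSW and the table value q_c = 3.77 are taken from the worked example; the helper name `critical_range` is illustrative.

```python
import math
from itertools import combinations

# Summary values from Example 1.
means = {"Club 1": 249.2, "Club 2": 226.0, "Club 3": 205.8}
sizes = {"Club 1": 5, "Club 2": 5, "Club 3": 5}
msw = 93.3   # mean square within from the one-way ANOVA
q_c = 3.77   # table value for k = 3 groups and n - k = 12 degrees of freedom


def critical_range(q, mean_square_within, n_a, n_b):
    """Tukey critical range for comparing two groups of sizes n_a and n_b."""
    return q * math.sqrt(mean_square_within / 2 * (1 / n_a + 1 / n_b))


results = {}
for a, b in combinations(means, 2):
    difference = abs(means[a] - means[b])
    cr = critical_range(q_c, msw, sizes[a], sizes[b])
    # A pair differs significantly when its absolute mean difference exceeds the range.
    results[(a, b)] = (difference, cr, difference > cr)
    print(f"{a} vs {b}: |diff| = {difference:.1f}, critical range = {cr:.3f}, "
          f"significant = {difference > cr}")
```

With unequal group sizes the critical range changes from pair to pair, which is why it is computed inside the loop.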

………………………………………………………………………………………………………………

Two-Way ANOVA
Assumptions

– Populations are normally distributed


– Populations have equal variances
– Independent random samples are drawn

Sources of Variation

Two Factors of interest: Drug and Age group

r = number of levels of factor A (Drug)

c = number of levels of factor B (Age gp)

n’ = number of replications for each cell

n = total number of observations in all cells (n = rcn’)

Xijk = value of the kth observation of level i of factor A and level j of factor B

Two Factor ANOVA Equations

Total Variation:
$SST = \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{n'} (X_{ijk} - \bar{\bar{X}})^2$

Factor A Variation:
$SSA = c\,n' \sum_{i=1}^{r} (\bar{X}_{i..} - \bar{\bar{X}})^2$

Factor B Variation:
$SSB = r\,n' \sum_{j=1}^{c} (\bar{X}_{.j.} - \bar{\bar{X}})^2$

Interaction Variation:
$SSAB = n' \sum_{i=1}^{r} \sum_{j=1}^{c} (\bar{X}_{ij.} - \bar{X}_{i..} - \bar{X}_{.j.} + \bar{\bar{X}})^2$

Sum of Squares Error:
$SSE = \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{n'} (X_{ijk} - \bar{X}_{ij.})^2$

Grand Mean:
$\bar{\bar{X}} = \dfrac{\sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{n'} X_{ijk}}{r c n'}$

Mean of ith level of factor A (i = 1, 2, ..., r):
$\bar{X}_{i..} = \dfrac{\sum_{j=1}^{c} \sum_{k=1}^{n'} X_{ijk}}{c n'}$

Mean of jth level of factor B (j = 1, 2, ..., c):
$\bar{X}_{.j.} = \dfrac{\sum_{i=1}^{r} \sum_{k=1}^{n'} X_{ijk}}{r n'}$

Mean of cell ij:
$\bar{X}_{ij.} = \sum_{k=1}^{n'} \dfrac{X_{ijk}}{n'}$

Mean Square Calculations

$MSA = \text{Mean square factor A} = \dfrac{SSA}{r - 1}$

$MSB = \text{Mean square factor B} = \dfrac{SSB}{c - 1}$

$MSAB = \text{Mean square interaction} = \dfrac{SSAB}{(r - 1)(c - 1)}$

$MSE = \text{Mean square error} = \dfrac{SSE}{rc(n' - 1)}$

Features of Two-Way ANOVA F Test

– Degrees of freedom always add up
   n - 1 = rc(n’ - 1) + (r - 1) + (c - 1) + (r - 1)(c - 1)
   Total = error + factor A + factor B + interaction
– The denominator of the F Test is always the same but the numerator is different
– The sums of squares always add up
   SST = SSE + SSA + SSB + SSAB
   Total = error + factor A + factor B + interaction

Exercise 8.3.3

A remotivation team in a psychiatric hospital conducted an experiment to compare five methods for remotivating
patients. Patients were grouped according to level of initial motivation. Patients in each group were randomly
assigned to the five methods. At the end of the experimental period the patients were evaluated by a team
composed of a psychiatrist, a psychologist, a nurse, and a social worker, none of whom was aware of the method
to which patients had been assigned. The team assigned each patient a composite score as a measure of his or her
level of motivation. The results were as follows:
Level of initial    Remotivation method
motivation          A        B        C        D        E        Total    Mean
Nil                 58       68       60       68       64       318      63.6
Very low            62       70       65       80       69       346      69.2
Low                 67       78       68       81       70       364      72.8
Average             70       81       70       89       74       384      76.8
Total               257      297      263      318      277      1412
Mean                64.25    74.25    65.75    79.5     69.25             70.6

Solution

The randomized complete block design is the appropriate design for this remotivation team.

Data

As in table. & $\sum x = 1412$, $\sum x^2 = 100854$

Assumption

– Populations are normally distributed


– Populations have equal variances
– Independent random samples are drawn

Hypotheses

$H_0: \tau_j = 0$ (j = 1, 2, ..., 5), $H_A$: not all $\tau_j = 0$

Test statistic

The test statistic is V.R = MSTr / MSE.

Distribution of test statistic


When H0 is true and the assumptions are met, V.R. follows an F distribution with 4 and 12 degrees of freedom.
Decision rule
Let α = . 05. Reject the null hypothesis if the computed V.R. is equal to or greater than the critical F, which we
find in Table G to be 3.26.
Calculation of test statistics

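$SST = \sum x^2 - \dfrac{(\sum x)^2}{N} = 100854 - \dfrac{(1412)^2}{20} = 1166.8$ #

$SSBl = N_{Tr} \sum (\bar{x}_{i.} - \bar{\bar{x}})^2 = 5[(63.6 - 70.6)^2 + \dots + (76.8 - 70.6)^2] = 471.2$ #

$SSTr = N_{Bl} \sum (\bar{x}_{.j} - \bar{\bar{x}})^2 = 4[(64.25 - 70.6)^2 + \dots + (69.25 - 70.6)^2] = 4[158.2] = 632.8$ #

SSE = SST – SSBl – SSTr = 62.8 #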

Statistical decision

Since our computed variance ratio, 30.22, is greater than 3.26 , we reject the null hypothesis of no treatment
effects on the assumption that such a large V.R. reflects the fact that the two sample mean squares are not
estimating the same quantity.
ANOVA Table

Source        SS       d.f    MS        V.R
Treatments    632.8    4      158.2     30.22
Blocks        471.2    3      157.07
Residuals     62.8     12     5.233

Conclusion
Yes, the provided data were sufficient to indicate a difference in mean score among methods.
P value.
P < 0.01
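
The randomized complete block calculation can be verified from the score table. This is a minimal sketch in plain Python: rows are the four blocks (levels of initial motivation), columns are the five remotivation methods, and the function name `rcb_anova` is illustrative.

```python
# Composite motivation scores: one row per block, one column per method A-E.
scores = [
    [58, 68, 60, 68, 64],  # Nil
    [62, 70, 65, 80, 69],  # Very low
    [67, 78, 68, 81, 70],  # Low
    [70, 81, 70, 89, 74],  # Average
]


def rcb_anova(table):
    """Sums of squares and variance ratio for a randomized complete block design."""
    n_blocks, n_treatments = len(table), len(table[0])
    n = n_blocks * n_treatments
    grand_mean = sum(sum(row) for row in table) / n
    sst = sum((x - grand_mean) ** 2 for row in table for x in row)
    # Block sum of squares uses the row means.
    row_means = [sum(row) / n_treatments for row in table]
    ss_blocks = n_treatments * sum((m - grand_mean) ** 2 for m in row_means)
    # Treatment sum of squares uses the column means.
    col_means = [sum(col) / n_blocks for col in zip(*table)]
    ss_treatments = n_blocks * sum((m - grand_mean) ** 2 for m in col_means)
    sse = sst - ss_blocks - ss_treatments
    ms_treatments = ss_treatments / (n_treatments - 1)
    mse = sse / ((n_blocks - 1) * (n_treatments - 1))
    return {"SST": sst, "SSBl": ss_blocks, "SSTr": ss_treatments,
            "SSE": sse, "MSTr": ms_treatments, "MSE": mse,
            "VR": ms_treatments / mse}


anova = rcb_anova(scores)
for name, value in anova.items():
    print(f"{name} = {value:.3f}")
```

Computing SSE by subtraction mirrors the hand calculation and relies on the identity SST = SSBl + SSTr + SSE for this design.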
………………………………………………………………………………………………………………

Chi-Square Tests
Test of Qualitative Variables
– One of the most frequently employed statistical techniques
– Commonly utilized for the analysis of COUNT or FREQUENCY data
– The χ² value is a measure of the extent to which pairs of Observed and Expected frequencies agree.
– The χ² value is small if “agreement” is HIGH, while if the “agreement” is low, χ² is HIGH
– Direction of agreement, whether positive or negative, has no influence since the differences are squared.
Some formulae indicate squaring of absolute values.
– Chi-square assumes values between 0 and infinity.
– Chi-square distribution derived from normal distribution
– For analysis of count data or frequency data
– The Chi-square test statistic is

$\chi^2 = \sum_{\text{all cells}} \dfrac{(O_i - E_i)^2}{E_i}$

where:
$O_i$ = observed frequency in a particular cell
$E_i$ = expected frequency in a particular cell if H0 is true
Degrees of freedom = (r – 1)(c – 1) for a table with r rows and c columns, so χ² for the 2 x 2 case has 1 degree of freedom

Types of Chi-square Tests


Basically three types
– Test of goodness of fit
– Test of independence
– Test of homogeneity
Other extensions
– Fisher’s exact test
– Mc Nemar’s test (Paired comparisons )
– Mantel Haenszel’s test
– Chi-square Test for trend

Test for Goodness of Fit
– A chi-square test of the significance of the distribution of a single variable.
– Uses sample data to test hypotheses about the shape or proportions of a population distribution
– Tests the fit of the proportions in the obtained sample with the hypothesized proportions of the population
– This test enables us to see how well an assumed theoretical distribution (such as the binomial,
Poisson or normal distribution) fits the observed data.
– The χ² test formula for goodness of fit is:
$\chi^2 = \sum (o - e)^2 / e$
– Chi-square distribution is positively skewed
– Degrees of freedom for Goodness of Fit Test (df = C – 1) . C is the number of categories
– If χ² (calculated) > χ² (tabulated) with C – 1 d.f, the null hypothesis is rejected; otherwise it is not rejected.
– If the null hypothesis is not rejected, it can be concluded that the data are consistent with the theoretical
distribution.
Example
Cranor and Christensen (A-1) conducted a study to assess short-term clinical, economic, and humanistic outcomes
of pharmaceutical care services for patients with diabetes in community pharmacies. For 47 of the subjects in the
study, cholesterol levels are summarized in Table 12.3.1. We wish to know whether these data provide sufficient
evidence to indicate that the sample did not come from a normally distributed population. Let α = .05.

Solution:
1. Data. See Table 12.3.1.

2. Assumptions. We assume that the sample available for analysis is a simple random sample.

3. Hypotheses

H0: In the population from which the sample was drawn, cholesterol levels are normally distributed.
HA: The sampled population is not normally distributed.
4. Test statistic. The test statistic is
$X^2 = \sum (o - e)^2 / e$
5. Distribution of test statistics
Approximately as chi-square with k - r degrees of freedom. The values of k and r will be determined
later.
6. Decision rule. We will reject H0 if the computed value of X² is equal to or greater than the critical value
of chi-square.
7. Calculation of test statistic. Since the mean and variance of the hypothesized distribution are not
specified, the sample data must be used to estimate them.

$\bar{x} = 198.67$

8. Statistical decision. When we compare X² = 10.566 with values of χ² in Appendix Table F, we see that it
is less than $\chi^2_{.95} = 11.070$, so that, at the .05 level of significance, we cannot reject the null hypothesis
that the sample came from a normally distributed population.

9. Conclusion. We conclude that in the sampled population, cholesterol levels may follow a normal
distribution.
10. p value. Since 11.070 > 10.566 > 9.236, .05 < p < .10.

Example

The flu season in southern Nevada for 2005–2006 ran from December to April, the coldest months of the year.
The Southern Nevada Health District reported the numbers of vaccine-preventable influenza cases shown in Table
12.3.9. We are interested in knowing whether the numbers of flu cases in the district are equally distributed
among the five flu season months. That is, we wish to know if flu cases follow a uniform distribution.
Solution:
1. Data. See Table 12.3.9.

2. Assumptions. We assume that the reported cases of flu constitute a simple random sample of cases of flu
that occurred in the district.
3. Hypotheses.
H0: Flu cases in southern Nevada are uniformly distributed over the five flu season months.
HA: Flu cases in southern Nevada are not uniformly distributed over the five flu season months.
Let α = .01.
4. Test statistic.
The test statistic is

$X^2 = \sum_{\text{all cells}} \dfrac{(O_i - E_i)^2}{E_i}$

5. Distribution of test statistic

If H0 is true, X² is distributed approximately as χ² with (5 - 1) = 4 degrees of freedom.
6. Decision rule. Reject H0 if the computed value of X² is equal to or greater than 13.277.
7. Calculation of test statistic. If the null hypothesis is true, we would expect to observe 200 / 5 = 40 cases
per month.
X² = 97.15

8. Statistical decision. Since 97.15, the computed value of X2, is greater than 13.277, we reject, based on
these data, the null hypothesis of a uniform distribution of flu cases during the flu season in southern
Nevada.
9. Conclusion. We conclude that the occurrence of flu cases does not follow a uniform distribution.
10. p value. p < 0.0001
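
A goodness-of-fit statistic against a uniform distribution can be sketched as follows. Table 12.3.9 is not reproduced in these notes, so the monthly counts below are illustrative values, chosen only so that they total 200 cases and give the reported statistic; they should not be read as the published data. The function name `chi_square_gof` is also an assumption.

```python
def chi_square_gof(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Illustrative counts for December to April (assumed, not taken from the notes).
observed = [62, 84, 17, 16, 21]
total = sum(observed)
# Under a uniform distribution every month has the same expected count.
expected = [total / len(observed)] * len(observed)

x2 = chi_square_gof(observed, expected)
df = len(observed) - 1
critical = 13.277  # value quoted in the decision rule for 4 d.f. and alpha = .01

print(f"expected per month = {expected[0]:.0f}, X2 = {x2:.2f}, df = {df}")
print("reject H0" if x2 >= critical else "do not reject H0")
```

For a non-uniform null hypothesis, only the `expected` list changes: each entry becomes the total multiplied by the hypothesized proportion for that category.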

Test of Independence
– Testing the null hypothesis that in the population the two criteria of classification are independent
– A single sample drawn from a single population
– Observations cross-classified on the basis of two variables of interest
– Calculating Ei based on joint probability law
Calculating Expected Frequencies
The null hypothesis here is that the two criteria are independent.
Then,
If A and B are independent,
– P(A and B) = P(A) x P(B)
– The expected frequency = P x total number
– P(A) and P(B) here are marginal probabilities.
– For rows, marginal probability P(r) = row total/grand total
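– For columns, marginal probability P(c) = column total/grand total

For a 2 x 2 table with cell counts a, b (first row) and c, d (second row) and n = a + b + c + d:

$\chi^2 = \dfrac{n(ad - bc)^2}{(a + c)(b + d)(a + b)(c + d)}$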

Example
Observed

                Farming    Others    Total
Leptospira +    31         91        122
Leptospira -    19         359       378
TOTAL           50         450       500

Expected

                Farming    Others    Total
Leptospira +    12.2       109.8     122
Leptospira -    37.8       340.2     378
TOTAL           50         450       500

$\chi^2 = \dfrac{(31 - 12.2)^2}{12.2} + \dfrac{(91 - 109.8)^2}{109.8} + \dfrac{(19 - 37.8)^2}{37.8} + \dfrac{(359 - 340.2)^2}{340.2} = 42.58$ #

Alternatively,

$\chi^2 = \dfrac{500[(31)(359) - (19)(91)]^2}{(50)(450)(122)(378)} = \dfrac{500(11129 - 1729)^2}{1037610000} = 42.58$
Interpretation
Ho: The distribution of leptospirosis is independent of occupation.
Ha: The distribution of leptospirosis is NOT independent of occupation.
Decision rule: Reject Ho if calculated χ2 > tabulated χ2
LS: α = 0.05
Result: Calculated χ2 = 42.58; Tabulated χ2df=1= 3.84
Decision: since calculated χ2 > tabulated χ2, Reject Ho
Conclusion: We conclude that the leptospirosis infection is associated with occupation.
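
The 2 x 2 shortcut formula is easy to check numerically. The sketch below is plain Python; the function name `chi_square_2x2` is illustrative, and the cell layout (a, b in the first row, c, d in the second) is an assumption about how the formula's letters map onto the table.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for a 2 x 2 table via n(ad - bc)^2 / (product of the four margins)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))


# Observed counts from the leptospirosis example:
# first row = Leptospira +, second row = Leptospira -; columns = Farming, Others.
chi2 = chi_square_2x2(31, 91, 19, 359)
critical = 3.84  # tabulated value for 1 d.f. at alpha = 0.05, as quoted in the notes

print(f"chi-square = {chi2:.2f}")
print("reject H0" if chi2 > critical else "do not reject H0")
```

The shortcut gives the same value as summing (O - E)²/E over the four cells, without having to compute the expected frequencies first.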

Test of Homogeneity
– Two or more populations are identified in advance
– An independent sample drawn from each
– Sampled observations placed in appropriate categories of variables of interest
– The researcher controls one set of marginal totals: the sample size drawn from each population is fixed
in advance
Example
Narcolepsy is a disease involving disturbances of the sleep–wake cycle. Members of the German Migraine and
Headache Society (A-8) studied the relationship between migraine headaches in 96 subjects diagnosed with
narcolepsy and 96 healthy controls. The results are shown in Table 12.5.2.We wish to know if we may conclude,
on the basis of these data, that the narcolepsy population and healthy populations represented by the samples are
not homogeneous with respect to migraine frequency.

Solution:
Data. See Table 12.5.2.
Assumptions. We assume that we have a simple random sample from each of the two populations of interest.
Hypotheses.
H0: The two populations are homogeneous with respect to migraine frequency.
HA: The two populations are not homogeneous with respect to migraine frequency.
Let α = .05
Test statistic
The test statistic is

$X^2 = \sum_{\text{all cells}} \dfrac{(O_i - E_i)^2}{E_i}$

Distribution of test statistic.

If H0 is true, X² is distributed approximately as χ² with (2-1) (2-1) = 1 degree of freedom.
Decision rule
Reject H0 if the computed value of X² is equal to or greater than 3.841.
Calculation of test statistic
Expected frequencies:

            Migraine +    Migraine -    Total
Subjects    20            76            96
Controls    20            76            96
Total       40            152           192

X² = 0.126
Statistical decision.
Since 0.126 is less than the critical value of 3.841, we are unable to reject the null hypothesis.
Conclusion.
We conclude that the two populations may be homogeneous with respect to migraine frequency.
p value.
p > 0.10

Exercise 12.5.1
Refer to the study by Carter et al. [A-9], who investigated the effect of age at onset of bipolar disorder on the
course of the illness. One of the variables studied was subjects’ family history. Table 3.4.1 shows the frequency of
a family history of mood disorders in the two groups of interest: early age at onset (18 years or younger) and later
age at onset (later than 18 years). Can we conclude on the basis of these data that subjects 18 or younger differ
from subjects older than 18 with respect to family histories of mood disorders? Let α = .05.

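Data: see the table.
Assumptions: as in the preceding example, we assume a simple random sample from each population.
Hypotheses
H0: The two populations are homogeneous with respect to family history of mood disorders.
HA: The two populations are not homogeneous with respect to family history of mood disorders.
Test statistic
$X^2 = \sum (O_i - E_i)^2 / E_i$
Distribution of test statistic
If H0 is true, X² is distributed approximately as χ² with d.f = 3.
Decision rule
Reject H0 if the computed χ2 value is ≥ 7.815.
Calculation of test statistic
Χ2 = 3.622
Statistical decision
Do not reject H0, since 3.622 is less than 7.815.
Conclusion
The two populations may be homogeneous with respect to family history of mood disorders.
P value > 0.1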

Exercise 12.5.3
Swor et al. (A-11) examined the effectiveness of cardiopulmonary resuscitation (CPR) training in people over 55
years of age. They compared the skill retention rates of subjects in this age group who completed a course in
traditional CPR instruction with those who received chest-compression–only cardiopulmonary resuscitation (CC-
CPR). Independent groups were tested 3 months after training. Among the 27 subjects receiving traditional CPR,
12 were rated as competent. In the CC-CPR group, 15 out of 29 were rated competent. Do these data provide
sufficient evidence for us to conclude that the two populations are not homogeneous with respect to competency
rating 3 months after training? Let α = .05.

Solution

                   Competent    Not competent    Total
Traditional CPR    12           15               27
CC-CPR             15           14               29
Total              27           29               56

The calculation steps are similar to those above.

The critical value of χ2 is 3.841 and the calculated χ2 is 0.297. Do not reject H0.
P value is > 0.1

Exercise 12.5.5
In a simple random sample of 250 industrial workers with cancer, researchers found that 102 had worked at jobs
classified as “high exposure” with respect to suspected cancer-causing agents. Of the remainder, 84 had worked at
“moderate exposure” jobs, and 64 had experienced no known exposure because of their jobs. In an independent
simple random sample of 250 industrial workers from the same area who had no history of cancer, 31 worked in
“high exposure” jobs, 60 worked in “moderate exposure” jobs, and 159 worked in jobs involving no known
exposure to suspected cancer causing agents. Does it appear from these data that persons working in jobs that
expose them to suspected cancer-causing agents have an increased risk of contracting cancer? Let α = .05.
Solution
                      Cancer    No cancer    Total
High exposure         102       31           133
Moderate exposure     84        60           144
No known exposure     64        159          223
Total                 250       250          500

The calculation steps are similar to those above.

D.f = 2
Critical value of χ2 is 5.991
Calculated χ2 is 82.373.
Reject H0; P value is < 0.005 #
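
For tables larger than 2 x 2, the expected frequencies have to be built from the margins. The sketch below is a plain-Python illustration using the counts stated in the exercise text (102, 84 and 64 workers with cancer; 31, 60 and 159 without); the function name `chi_square_table` is an assumption.

```python
def chi_square_table(observed):
    """Return (chi-square, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count = row total x column total / grand total.
            e = row_totals[i] * col_totals[j] / n
            chi2 += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df


# Rows: high, moderate, no known exposure; columns: cancer, no cancer.
exposure = [[102, 31], [84, 60], [64, 159]]
chi2, df = chi_square_table(exposure)
critical = 5.991  # tabulated value for 2 d.f. at alpha = .05

print(f"chi-square = {chi2:.3f}, df = {df}")
print("reject H0" if chi2 >= critical else "do not reject H0")
```

The same function handles the 2 x 2 homogeneity exercises, since only the shape of the input table changes.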

Fisher Exact Test (Exact Probability Test)
– For 2 x 2 tables
– n < 20 (OR) 20 < n < 40 with any expected frequency < 5 (OR)
– Any cell with expected frequency < 1
– A>B then a/A > b/B

Hypotheses The following are the null hypotheses that may be tested and their alternatives.
1. (Two-sided)
H0: The proportion with the characteristic of interest is the same in both populations; that is, p1 = p2.
HA: The proportion with the characteristic of interest is not the same in both populations; p1≠ p2.
2. (One-sided)
H0: The proportion with the characteristic of interest in population 1 is less than or the same as the proportion in
population2. p1 ≤ p2.
HA: The proportion with the characteristic of interest is greater in population 1 than in population 2; p1 > p2.
Test Statistic The test statistic is b, the number in sample 2 with the characteristic of interest.
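
The exact probability behind Fisher's test comes from the hypergeometric distribution with all margins held fixed. The sketch below is an illustration under stated assumptions: rows are the two samples, the first column counts subjects with the characteristic, the one-sided alternative is p1 > p2, and the 2 x 2 counts are hypothetical. It sums table probabilities directly rather than using the tabulated statistic b described above.

```python
from math import comb


def table_probability(a, b, c, d):
    """Probability of one 2 x 2 table when all margins are fixed (hypergeometric)."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_one_sided(a, b, c, d):
    """P(first cell >= a) under H0, the one-sided p value for HA: p1 > p2."""
    row1, row2, col1 = a + b, c + d, a + c
    p_value = 0.0
    # Step through every table at least as extreme as the observed one.
    for x in range(a, min(row1, col1) + 1):
        y = col1 - x  # sample 2 count with the characteristic
        p_value += table_probability(x, row1 - x, y, row2 - y)
    return p_value


# Hypothetical data: 3 of 4 subjects in sample 1 and 1 of 4 in sample 2
# have the characteristic of interest.
p_value = fisher_one_sided(3, 1, 1, 3)
print(f"one-sided p = {p_value:.4f}")
```

With samples this small no chi-square approximation is needed, because every possible table can be enumerated.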

……………………………………………………………………………………………………………

Correlation

$r = \dfrac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{\sqrt{[n\sum x_i^2 - (\sum x_i)^2][n\sum y_i^2 - (\sum y_i)^2]}}$


Example:
A sample of 6 children was selected, data about their age in years and weight in kilograms was recorded as shown
in the following table . It is required to find the correlation between age and weight.
r = 0.759 #

Simple linear regression


Regression analysis is used to:
– Predict the value of a dependent variable based on the value of at least one independent variable
– Explain the impact of changes in an independent variable on the dependent variable
– Dependent variable: the variable we wish to explain
– Independent variable: the variable used to explain the dependent variable
– Only one independent variable, X
– Relationship between X and Y is described by a linear function
– Changes in Y are assumed to be caused by changes in X
Simple Linear Regression Equation
The simple linear regression equation provides an estimate of the population regression line
$\hat{y} = a + bx$
$\hat{y}$ = Estimated (or predicted) Y value for observation i
a = Estimate of the regression intercept
b = Estimate of the regression slope
x = Value of X for observation i

$b = \dfrac{N\sum XY - (\sum X)(\sum Y)}{N\sum X^2 - (\sum X)^2}$

$a = \bar{Y} - b\bar{X}$
Example
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured
in square feet). A random sample of 10 houses is selected
Dependent variable (Y) = house price in $1000s
Independent variable (X) = square feet

By the calculator, $\hat{y} = a + bx$
house price = 98.24833 + 0.10977 (square feet)
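
The slope, intercept and correlation coefficient can all be obtained from the same five sums. Because the house data are not reproduced in these notes, the sketch below uses a small hypothetical data set; the function name `least_squares` is illustrative.

```python
import math


def least_squares(xs, ys):
    """Return (intercept a, slope b, Pearson r) from the computational formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = sum_y / n - b * sum_x / n  # a = mean of Y minus b times mean of X
    r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
        (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
    )
    return a, b, r


# Hypothetical data, used only to exercise the formulas.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, r = least_squares(xs, ys)

print(f"y-hat = {a:.2f} + {b:.2f} x, r = {r:.4f}, r squared = {r * r:.3f}")
```

In simple linear regression the square of r equals the coefficient of determination, so the same function also reports the share of variation in Y explained by X.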

Interpolation vs. Extrapolation

Measures of Variation or Deviation

SST = total sum of squares (Measures the variation of the Yi values around their mean Y)
SSR = regression sum of squares (Explained variation attributable to the relationship between X and Y)
SSE = error sum of squares (Variation attributable to factors other than the relationship between X and Y)
Coefficient of Determination, r2
$r^2 = \dfrac{SSR}{SST} = \dfrac{\text{regression sum of squares}}{\text{total sum of squares}}$
Inference about the Slope: t Test
t test for a population slope - Is there a linear relationship between X and Y?
Null and alternative hypotheses
H0: β = 0 (no linear relationship)
HA: β ≠ 0 (linear relationship does exist)
Test statistic –
$t = \dfrac{b - \beta}{S_b}$ (d.f = n – 2)
where:
b = regression slope coefficient
β = hypothesized slope
$S_b$ = standard error of the slope

F-Test for Significance
$F = \dfrac{MSR}{MSE}$
$MSR = \dfrac{SSR}{1}$
$MSE = \dfrac{SSE}{n - 2}$

Confidence Interval Estimate for the Slope

$b \pm t_{(1 - \alpha/2)} S_b$ (d.f = n – 2)


........................................................................................................................................................................

Type of outcome variable determines choice of statistical tests.

Type of outcome    Example of outcome variable                              Type of statistics

Univariate
Numerical          Age, blood glucose                                       Central tendency, dispersion
Categorical        Sex, disease grading, type of staple food                Frequency and percent distribution

Bivariate analysis
Numerical          Two means                                                Student’s t test
                   e.g. blood glucose determined by two treatments
                   Two means (before-after)                                 Paired t test
                   e.g. uric acid level before and after treatment
                   > Two means                                              ANOVA
                   e.g. blood loss determined by three types of operation
Numerical          Linear relationship                                      Pearson’s correlation
                   e.g. gestational age and birth weight
                   Prediction based on one variable                         Simple linear regression
                   e.g. square feet and selling price
Categorical        Two proportions or groups                                Chi-square test OR
                   e.g. treatment success by drug A and B                   two proportions z test
                   Two proportions (before-after)                           McNemar’s Chi-square test
                   e.g. prevalence of smoking before and after peer group
                   health education
                   > Two proportions                                        Chi-square test
                   e.g. smoking prevalence among first year, second year
                   and third year cadets

Dichotomous        Death, cancer, intensive care unit admission             Binary logistic regression
Nominal            Site of metastasis                                       Multinomial logistic regression
Ordinal            Cancer stage, disease stage (Nor/PreDM/DM)               Ordinal logistic regression
Continuous         Blood pressure, weight, temperature                      Multiple linear regression
Rare outcomes      Time to rare cancer, number of urinary tract infections  Poisson regression
and counts
Time to event      Time to death, time to cancer                            Cox regression
                                                                            (Proportional hazards analysis)