
Fundamentals of Biostatistics

Written by
Dr. Fadal Aldhufairi
Edited by
Prof. Ibrahim Almanjahie
Department of Mathematics
King Khalid University
Reference:
Wayne W. Daniel, Chad L. Cross: Biostatistics: A Foundation for Analysis in the Health
Sciences.

Note: This handout does not replace the main textbook as the primary reference. It serves
as a supplementary resource to enhance understanding and comprehension

January 14, 2025


0
STATISTICS GROUP AT KKU 0214STAT: Fundamentals of Biostatistics 0 / 315
Chapter 1: Introduction to Biostatistics

Learning Objectives
After studying this chapter, students will:

• Understand the basic concepts and terminology of biostatistics, including various kinds of variables, measurement, and measurement scales.
• Be able to select a simple random sample and other scientific
samples from a population of subjects.
• Understand the processes involved in the scientific method
and the design of experiments.

1
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 1 / 315
Introduction to Biostatistics

Biostatistics plays a pivotal role in health sciences, enabling the analysis of data to inform:
Clinical decisions.
Public health policies.
Medical research outcomes.

Utility: It provides tools for organizing, analyzing, and interpreting data effectively.
Biostatistics is a branch of statistics that plays a crucial role in
designing experiments, conducting clinical trials, and improving
public health outcomes.

2
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 2 / 315
What is Statistics?

Some Basic Concepts:


Statistics involves the collection, organization, summariza-
tion, and analysis of data.
Data are numbers resulting from measurements or counts.
The ultimate goal of statistics is to draw meaningful infer-
ences from sample data.

3
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 3 / 315
Sources of Data

Some primary sources of data include:


• Routinely Kept Records: e.g., hospital records.
• Surveys: e.g., when routine data are unavailable.
• Experiments: Controlled environments for specific hypothe-
ses.
• External Sources: e.g., published reports or research data
banks.

4
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 4 / 315
Concept of Biostatistics

Biostatistics is the application of statistical methods to biological,


medical, or health-related studies. It aids in designing experiments,
analyzing data, and interpreting results in medical research.

Examples:
Assessing the effectiveness of a new drug.
Analyzing the correlation between smoking and lung cancer.

5
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 5 / 315
Role of a Biostatistician

A biostatistician applies statistical methods to medical data in


research settings. Their role includes:
Designing research studies.
Analyzing data to uncover trends and relationships.
Interpreting results to ensure medical conclusions are based on
sound statistical evidence.

Importance: Biostatisticians ensure that data-driven decisions in


healthcare are accurate and reliable.

6
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 6 / 315
Examples

Some examples of biostatistics include:


• A study discovered that 50% of children who drink milk
regularly have stronger bones.
• Doctors found that 70% of patients with a specific illness recover
within three weeks with treatment.
• A clinical trial found that a new medicine helped 80% of
children feel better.
• A survey found that 60% of people get a flu shot every year.

Note: Statistics help us understand patterns and make decisions


based on data from real-life situations. These examples show
how biostatistics are used to describe things like preferences and
health.

7
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 7 / 315
Uses of Biostatistics

• Biostatistics is a vital tool for making evidence-based decisions in


health and medicine.
• We use statistical methods to evaluate treatment effectiveness, mon-
itor disease outbreaks, and improve patient care.
• For students in health sciences, a solid understanding of statistics
is crucial. It enables you to collect, analyze, and interpret clinical
data, aiding in research and patient outcomes.

In health and medicine, biostatistics helps professionals make ac-


curate diagnoses, develop effective treatments, and contribute to
public health initiatives.

8
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 8 / 315
Why Study Biostatistics?
• Numerical data is essential in healthcare for making evidence-
based decisions and improving patient outcomes. Biostatistics is
widely applied in:
• Clinical trials (e.g., determining the effectiveness of new drugs or
treatments).
• Epidemiology (e.g., tracking disease outbreaks and trends).
• Patient care (e.g., using data to predict outcomes and personalize
treatments).
• Public health (e.g., assessing vaccination programs and health poli-
cies).
• Medical research (e.g., analyzing genetic data to understand dis-
eases).
• Statistical methods help healthcare professionals make critical de-
cisions that directly impact lives.
Biostatistics is the foundation of evidence-based medicine, en-
abling medical professionals to provide the best possible care.
9
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 9 / 315
Who Uses Statistics and Biostatistics?

Statistical methods are essential for professionals in various fields,


especially in healthcare:
Physicians (to evaluate treatment effectiveness, monitor patient
outcomes, and make evidence-based decisions).
Epidemiologists (to study disease patterns, outbreaks, and public
health trends).
Healthcare administrators (to allocate resources efficiently and im-
prove operational performance).
Pharmacologists (to assess drug efficacy and safety through bio-
statistical analysis).

Biostatistics plays a critical role in modern medicine, ensuring


that decisions and policies are based on reliable, scientific evi-
dence.

10
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 10 / 315
Types of Statistics – Descriptive and Inferential

(1) Descriptive Statistics and (2) Inferential Statistics


Descriptive Statistics
Descriptive statistics is the branch of statistics focused on organizing,
summarizing, and presenting data in a clear and understandable way.

Descriptive statistics includes organizing data into


frequency distributions (such as frequency tables),
presenting the data with various types of graphs (such as bar charts,
histograms, pie charts, and box plots),
and calculating important measures like central tendency (mean,
median, mode) and dispersion (range, variance, standard devia-
tion).

11
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 11 / 315
Types of Statistics – Descriptive and Inferential

Some examples include:


A clinical trial collected data on the blood pressure levels of 200
patients before and after treatment. This data is summarized using
measures like the mean and standard deviation to compare treat-
ment effectiveness.
A study on the prevalence of diabetes in a city surveyed 1,000
individuals, finding that 15% are diabetic. This proportion can be
visualized using a pie chart to show the distribution of diabetic vs.
non-diabetic individuals.
In a study on vaccine efficacy, researchers found that 90% of par-
ticipants showed immunity after receiving the vaccine. Descrip-
tive statistics such as percentages and rates are used to summarize
the effectiveness of treatments in clinical studies.

12
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 12 / 315
Types of Statistics – Descriptive and Inferential

2) Inferential Statistics
Inferential statistics is the branch of statistics that uses data from a sam-
ple to make conclusions or decisions about a larger population. It also
involves assessing the reliability of these conclusions.

Remark: Inferential statistics makes predictions, estimates, and gener-


alizations about a population based on sample data, and evaluates how
confident we can be in those results.
Some examples include:
Medical Research: In clinical trials, researchers often use a sam-
ple of patients to make conclusions about the effectiveness of a
new drug. Based on the response of this sample , they infer whether
the drug will work for the larger population of patients.

13
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 13 / 315
Variables: Quantitative and Qualitative

Definition of a Variable: A variable is a characteristic that can


take on different values.

Types of variables include:

Quantitative Variables: Measurable, e.g., height, weight, age.


Qualitative Variables: Categorical, e.g., medical diagnoses,
ethnic groups.

Key Concept: Random variables are influenced by chance.


Discrete Random Variable: Takes distinct, separate values,
e.g., hospital admissions or number of patients.
Continuous Random Variable: Takes any value within a range,
e.g., weight or height.
14
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 14 / 315
Population and Sample

Population: The entire set of entities we are interested in


studying, e.g., all children in a county school system.
Sample: A subset of a population, used for practical mea-
surement, e.g., a selected group of children from the hospital
system.

Remarks:
Statistical terminology can have different meanings in every-
day language versus statistical contexts.
Understanding the basic vocabulary is crucial for effective
communication and analysis.

15
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 15 / 315
Introduction to Measurement

Definition: Measurement refers to the assignment of numbers to


objects or events according to a set of rules.
Measurement scales categorize the results of measurements.
The scales include nominal, ordinal, interval, and ratio.

Remark: Different types of data require different statistical meth-


ods for analysis.

16
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 16 / 315
Nominal Scale

Definition: The lowest measurement scale, used for classifying


observations into mutually exclusive categories.
Examples: Male/Female, Married/Not Married, Medical Diag-
noses
Numbers are used to ”name” or classify the data.

Example
Medical diagnoses: 1 = Flu, 2 = Cold, 3 = COVID-19

17
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 17 / 315
Ordinal Scale

Definition: Observations are ranked according to some criterion,


but the differences between categories are not necessarily equal.
Examples: Socioeconomic status (low, medium, high), Improve-
ment status (unimproved, improved, much improved)

Example
Improvement: 1 = Unimproved, 2 = Improved, 3 = Much Im-
proved

18
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 18 / 315
Interval Scale

Definition: The interval scale allows for the ordering of measure-


ments, where the distance between two measurements is known
and consistent. The zero point is arbitrary.
Example: Temperature (Celsius or Fahrenheit)
Zero does not imply absence of the measured quantity.

Example
Temperature: 0°C is not the absence of heat.

19
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 19 / 315
Ratio Scale

Definition: The highest level of measurement, where both equal-


ity of intervals and ratios can be determined, and there is a true
zero point.
Examples: Height, Weight, Length
Zero represents a complete absence of the quantity being mea-
sured.
Example
Height: 0 cm means no height.

20
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 20 / 315
Summary of Measurement Scales

Nominal Scale: Classification, no order or distance.


Ordinal Scale: Ordered categories, unequal distances.
Interval Scale: Ordered categories, equal distances, arbitrary zero.
Ratio Scale: Ordered categories, equal distances, true zero.

Remarks on measurement scales:


Measurement scales are crucial in selecting appropriate sta-
tistical methods.
Understanding the scale helps in interpreting data and deter-
mining the level of analysis.

21
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 21 / 315
Self-Review Questions

Which of the following is an example of a nominal scale?


1 Temperature in Celsius
2 Socioeconomic status
3 Gender (Male/Female)
What distinguishes an ordinal scale from an interval scale?

22
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 22 / 315
Sampling and Statistical Inference

Statistical inference is the procedure by which we reach a


conclusion about a population based on a sample.
The validity of the inference depends on the sampling method.
A scientific sample, such as a simple random sample, is needed
to make valid inferences.

23
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 23 / 315
Definition: Simple Random Sample

Definition: A sample of size n drawn from a population of size


N is a simple random sample if every possible sample of size n
has the same chance of being selected.

Sampling Methods: With vs. Without Replacement


Sampling with replacement: After each draw, the selected
item is returned to the population.
Sampling without replacement: Once an item is drawn, it
is not returned, ensuring that no item appears more than once
in the sample.

24
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 24 / 315
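To make the two schemes concrete, here is a minimal Python sketch (an illustration, not part of the handout) that draws a sample of size 10 from a frame of 189 subject numbers, once without and once with replacement; the fixed seed is only for reproducibility.

import random

random.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical sampling frame: subject numbers 1..189, as in Example 1.4.1 below.
population = list(range(1, 190))

# Simple random sample WITHOUT replacement:
# every possible sample of size 10 is equally likely and no subject repeats.
srs_without = random.sample(population, 10)

# Sampling WITH replacement:
# each selected subject is returned to the frame, so repeats are possible.
srs_with = random.choices(population, k=10)

print("Without replacement:", sorted(srs_without))
print("With replacement:   ", sorted(srs_with))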
Example 1.4.1: Simple Random Sample

Example
Gold et al. studied the effectiveness of smoking cessation treatments.
A simple random sample of 10 subjects is drawn from a population of
189 subjects. Their ages are shown in Table 1.4.1.
Random Number Selection:
Use a random number table to select a sample.
Start at a random point and use three-digit numbers to select subjects.
Example: Start at row 21, column 28 in the random number table and select valid numbers (1–189).

Table: Ages of 189 subjects who participated in a study on smoking cessation (excerpt).
Subject No.   Age
1             48
2             35
3             46
4             44
5             43
...           ...
189           66
25
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 25 / 315
Systematic Sampling: Example 1.4.2

Definition: In systematic sampling, a starting point is selected,


and then every k-th subject is chosen.

Example
If the starting point is subject 4 and the interval k = 18, subjects 4, 22,
40, …, are selected.
Systematic Sampling:
The first subject is selected randomly (subject 4).
The sample interval k = 18, which means every 18th subject is chosen.
Continue until the desired sample size is achieved.

Table: Systematically selected sample of 10 subjects (excerpt).
Subject No.   Age
4             44
22            66
40            47
58            56
76            52
...           ...
26
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 26 / 315
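The selection rule just illustrated can be written as a short Python sketch (hypothetical helper, not from the textbook); the frame 1–189, the starting point 4, and the interval k = 18 follow Example 1.4.2.

def systematic_sample(frame, start, k):
    # Every k-th element of the frame, beginning at the 1-based position `start`.
    return frame[start - 1::k]

subjects = list(range(1, 190))                              # subject numbers 1..189
sample = systematic_sample(subjects, start=4, k=18)[:10]    # keep the first 10 subjects
print(sample)                                               # [4, 22, 40, 58, 76, 94, 112, 130, 148, 166]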
Summary of Sampling Methods

Simple Random Sampling: Every subject has an equal chance


of being selected.
Systematic Sampling: Select a random starting point and pick
every k-th subject.
Sampling with Replacement: Reintroduce the subject into the
population after selection.
Sampling without Replacement: Do not reintroduce selected
subjects.

27
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 27 / 315
Remarks on Sampling Methods

Both simple random sampling and systematic sampling


are widely used in biostatistics and healthcare research.
Careful attention to the sampling method is crucial for
drawing valid inferences.
Randomization ensures the reliability of conclusions drawn
from a sample.

28
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 28 / 315
The Scientific Method

The scientific method is a systematic process used to gather,


analyze, and report scientific data.
It is based on empirical data, ensuring results are unbiased,
replicable, and testable.

The key steps of the scientific method include:


Observation: Gathering information through senses or
instruments.
Formulation of Hypotheses: Proposing testable
explanations for the observations.
Designing an Experiment: Creating a test to validate or
refute the hypothesis.

29
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 29 / 315
Key Elements of the Scientific Method

1) Observation
Phenomena are observed to generate questions for further explo-
ration.
Example: Does regular exercise reduce body weight?
2) Formulating a Hypothesis
A hypothesis is a testable statement about the observed phenom-
ena.
Example: ”Exercise reduces body weight” (Research Hypothesis)
Statistical Hypothesis: ”The average weight loss in exercisers is
greater than in non-exercisers.”
3) Designing an Experiment
Random assignment to experimental and control groups ensures a
valid test.
Example: 100 participants assigned to either an exercise group or
a control group.

30
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 30 / 315
Experimental Design Example

Step 1: Observation: Regular exercise may reduce body weight.


Step 2: Hypothesis: ”Exercisers will lose more weight than non-
exercisers.”
Step 3: Experiment:
Randomly assign 100 participants to two groups: exercise (50) and
control (50).
Measure weight loss after a set period.

Example: In an experiment, the average weight loss in the exer-


cise group is compared to that in the control group.

31
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 31 / 315
Key Elements of the Scientific Method

You need to know ...


Measurements must be both accurate and precise to ensure
valid results.
If measurements are not accurate, the results of the experi-
ment may be invalid, even if the measurements are consis-
tent.
Accuracy refers to how close a measurement is to the true
value.
Precision refers to the consistency of measurements.
Even precise measurements can be invalid if they are not ac-
curate.

32
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 32 / 315
Self-Review

1 What are the key steps of the scientific method?


2 How does hypothesis formulation differ from hypothesis testing?
3 Why is experimental design important in biostatistics?
4 What is the difference between accuracy and precision in
measurements?

33
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 33 / 315
Conclusion and Remarks

The scientific method provides a structured framework for con-


ducting research.
Proper experimental design and statistical analysis are crucial for
producing valid, reliable, and replicable results.
In biostatistics, accuracy, precision, and appropriate hypothesis
testing are essential.

“Research findings must be replicated before they are considered


scientifically credible.”

34
Chapter 1: Introduction to Biostatistics 0214STAT: Fundamentals of Biostatistics 34 / 315
Chapter 2: Descriptive Statistics

Learning Objectives:
After studying this chapter, students will:

• understand how data can be appropriately organized and


displayed.
• understand how to reduce data sets into a few useful, de-
scriptive measures.
• be able to calculate and interpret measures of central ten-
dency, such as the mean, median, and mode.
• be able to calculate and interpret measures of dispersion,
such as the range, variance, and standard deviation.
35
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 35 / 315
Grouped Data: The Frequency Distribution

Grouping data helps in summarizing large datasets.


It allows for easier interpretation and calculation of descriptive
measures like percentages and averages.
The main purpose of grouping data is summarization, though it
can result in loss of specificity.
Group data into contiguous, non-overlapping intervals called class
intervals.

Note: Grouping data is a trade-off between simplification and the


loss of detailed information.

36
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 36 / 315
Considerations in Grouping Data

Number of intervals: A rule of thumb is to use between 5 and 15 intervals. Too few intervals lose information, and too many obscure patterns.

Sturges's Rule for the number of intervals:

k = 1 + 3.322 log10(n)

Width of intervals: The width can be determined by dividing the range R by the number of intervals k:

w = R / k
Choose interval widths that are convenient (e.g., 5, 10, or multiples
of 10).
37
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 37 / 315
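Both rules can be checked with a few lines of Python (illustrative only); the values n = 189 and R = 82 − 30 = 52 anticipate Example 2.3.1 on the next slides.

import math

n = 189                                   # number of observations
k = 1 + 3.322 * math.log10(n)             # Sturges's rule, about 8.6, rounded to 9 in the example
R = 82 - 30                               # range of the ages
w = R / round(k)                          # class width R / k = 52 / 9, about 5.8
print(round(k), round(w, 3))              # 9  5.778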
Example 2.3.1: Constructing a Frequency Distribution

To determine the number of class intervals for a frequency distri-


bution, Sturges’s Rule can be applied:

k = 1 + 3.322 log(n)

where k is the number of class intervals and n is the total number


of data points.

Sol: Applying Sturges’s rule to the given data (n = 189):

k = 1 + 3.322(log 189) = 1 + 3.322(2.2764618) ≈ 9

To find the class interval width, divide the range by k:

R / k = (82 − 30) / 9 = 52 / 9 ≈ 5.778
A class width of 5 or 10 is more convenient.
38
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 38 / 315
Example 2.3.1: Constructing a Frequency Distribution
Using a width of 10, intervals can be constructed starting from the
smallest value (30) to the largest value (82):
30–39, 40–49, 50–59, 60–69, 70–79, 80–89

Remark: The number of intervals is six, fewer than the nine sug-
gested by Sturges’s rule.

Class Interval Frequency


30–39 11
40–49 46
50–59 70
60–69 45
70–79 16
80–89 1
Table: Frequency Distribution of Ages
39
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 39 / 315
The Histogram

A histogram is a bar graph that represents the frequency


distribution.
The horizontal axis represents the class intervals.
The vertical axis represents the frequency (or relative frequency).
Bars are drawn above each class interval, and they should be
connected with no gaps between them.
Example histogram can be constructed using software.

Note: A histogram provides a visual representation of data distri-


bution and is helpful for identifying patterns or trends.

40
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 40 / 315
Midpoint of Class Intervals

The midpoint of a class interval is calculated by:

Midpoint = (Lower limit + Upper limit) / 2

Example: The midpoint of the interval 30–39 is:

(30 + 39) / 2 = 34.5

Note: The midpoint is useful for estimating the center of the class
interval when summarizing data.

41
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 41 / 315
Relative Frequency

The relative frequency is the proportion of data points in a particular class interval.
It is calculated by dividing the frequency of a class interval by the total number of observations.

Example: The relative frequency of the 50–59 class interval is:

70 / 189 ≈ 0.3704

To express it as a percentage, multiply by 100: 0.3704 × 100 = 37.04%

42
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 42 / 315
Relative Frequency

Note: Relative frequency helps in understanding the proportion


of data within each class interval and is often used to represent
data in percentage form.

Grouping data is a powerful tool for summarization, but it must


be done carefully to preserve the underlying information.
The choice of class intervals and widths depends on the dataset
and the desired level of summarization.
Statistical software can aid in creating frequency distributions and
visualizations like histograms.
A good understanding of the data is crucial when deciding on the
appropriate grouping and interval selection.
43
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 43 / 315
Frequency Distribution

The cumulative frequency is the running total of frequencies.
For example, in the age distribution, the cumulative frequency after the first interval (30–39) is 11, after the second interval (40–49) it is 11 + 46 = 57, and so on. The cumulative frequency allows us to determine the total number of observations that fall within two or more class intervals.
Class Interval Frequency Cumulative Frequency Relative Frequency Cumulative Relative Frequency
30–39 11 11 0.0582 0.0582
40–49 46 57 0.2434 0.3016
50–59 70 127 0.3704 0.6720
60–69 45 172 0.2381 0.9101
70–79 16 188 0.0847 0.9948
80–89 1 189 0.0053 1.0001

Table: Frequency, Cumulative Frequency, Relative Frequency, and


Cumulative Relative Frequency of Ages

44
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 44 / 315
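The cumulative and relative columns of the table above can be rebuilt from the six class frequencies with a short Python sketch (illustrative only); small differences in the last decimal place are rounding.

intervals = ["30-39", "40-49", "50-59", "60-69", "70-79", "80-89"]
freqs = [11, 46, 70, 45, 16, 1]          # class frequencies from the ages table
n = sum(freqs)                           # 189 observations in total

cum = 0
for interval, f in zip(intervals, freqs):
    cum += f                             # running total of frequencies
    rel = f / n                          # proportion of observations in this interval
    print(interval, f, cum, round(rel, 4), round(cum / n, 4))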
Self-Review Questions

What is the purpose of grouping data?


How do you determine the number of class intervals for a
frequency distribution?
Explain how to calculate the relative frequency of a class
interval.
What is the difference between cumulative frequency and
cumulative relative frequency?

45
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 45 / 315
Boundary (or True) Class Limits and Histogram
Construction

Some values in the second class interval, when measured


precisely, might be slightly less than 40 or slightly greater
than 49.
Considering the underlying continuity of the variable and
assuming data are rounded to the nearest whole number, the
Boundary limits of the second interval are 39.5 and 49.5.
The Boundary limits for all class intervals are shown in the
table below.

46
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 46 / 315
Boundary Class Limits Table

Boundary Class Limits Frequency


29.5–39.5 11
39.5–49.5 46
49.5–59.5 70
59.5–69.5 45
69.5–79.5 16
79.5–89.5 1
Total 189
Table: The Data of Table Showing Boundary Class Limits

47
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 47 / 315
Histogram Construction

Constructing a graph using class limits as the base of rectangles ensures no gaps, resulting in the histogram shown in Figure 1.
The space enclosed by the boundaries of the histogram is referred to as the area of the histogram, with each observation contributing one unit of area.
For 189 observations, the total area of the histogram is 189 units.
Each cell contains a proportion of the total area based on its frequency.
Example: The second cell contains 46/189 of the total area.

48
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 48 / 315
Histogram Example

Figure: Histogram of ages of 189 subjects (from Table).

49
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 49 / 315
Frequency Polygon

A frequency polygon is a graphical representation of a frequency


distribution, depicted as a special kind of line graph. To draw a
frequency polygon:

Place a dot above the midpoint of each class interval on the


horizontal axis, with the height corresponding to the
frequency of the class interval.
Connect the dots using straight lines to produce the
frequency polygon.

The polygon is brought down to the horizontal axis at the ends,


corresponding to the midpoints of hypothetical additional cells at
each end. This ensures the total area is enclosed.

50
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 50 / 315
Frequency Polygon vs Histogram

The total area under the frequency polygon is equal to the area under
the histogram. Figure 2 demonstrates this relationship by showing the
frequency polygon superimposed on the histogram.

Figure: Histogram of ages of 189 subjects (from Table).

51
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 51 / 315
Stem-and-Leaf Display

Stem-and-leaf displays are most effective with small data sets. They
are not generally suitable for use in publications intended for the
general public, but they help researchers and decision makers
understand the nature of their data. In contrast, histograms are more
appropriate for external communication.

Most effective for small data sets.


Not ideal for publications aimed at the general public.
Helpful for researchers and decision makers to understand
data.
Histograms are better for external communication.

52
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 52 / 315
Constructing a Stem-and-Leaf Display

Remember: Stem-and-leaf displays provide detailed insight but


lack the broad accessibility of histograms for non-technical audi-
ences.

To construct a stem-and-leaf display, each measurement is partitioned


into two parts:
The stem (one or more of the initial digits)
The leaf (one or more of the remaining digits)
These partitioned numbers are displayed with the stems in an ordered
column, and the leaves listed to the right.
For example, in a data set containing values such as 72, 74, and 78:
The stem would be 7.
The leaves would be 2, 4, 8.
When the leaves consist of more than one digit, all digits after the first
are omitted.
53
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 53 / 315
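A minimal Python sketch of the partitioning rule just described, for two-digit values (stem = tens digit, leaf = units digit); the data list mixes the values 72, 74, 78 from the text with a few hypothetical ones.

from collections import defaultdict

def stem_and_leaf(values):
    # Group the units digit (leaf) of each value under its tens digit (stem).
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)       # e.g. 72 -> stem 7, leaf 2
        leaves[stem].append(str(leaf))
    for stem in sorted(leaves):
        print(stem, "|", "".join(leaves[stem]))

stem_and_leaf([72, 74, 78, 65, 66, 69, 81])
# 6 | 569
# 7 | 248
# 8 | 1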
Example 2.3.2: Constructing a Stem-and-leaf chart

Stem-and-leaf display of ages of 189 subjects shown in Table 2.2.1


(stem unit = 10, leaf unit = 1)

54
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 54 / 315
Example: Heights of 15 Individuals

Suppose we have the heights (in centimeters) of 15 individuals:

150 155 160 162 168 170 170 172 175 178 180 185 190 192 195

We will create a frequency table using Sturges’ formula.


The first step is to calculate the number of classes (k) using
Sturges’ formula: k = 1 + 3.322 log10 (n) where n is the
number of observations (15 in this case).
Substituting n = 15 into the formula:

k = 1 + 3.322 log10 (15) ≈ 1 + 3.322 × 1.176 = 1 + 3.91 ≈ 5

So, we need 5 classes.


55
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 55 / 315
Step 2: Determine the Range

Note: The number of classes (k) is crucial for creating a mean-


ingful frequency distribution that balances detail and simplicity.

Next, we find the range of the data, which is the difference between
the maximum and minimum values:

Range = 195 − 150 = 45

Now, we can calculate the class width:

Class width = Range / Number of classes = 45 / 5 = 9

So, the class width is 9.

56
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 56 / 315
Step 3: Define the Classes

Based on the class width of 9, we can define the classes:


Class 1: 150 - 159
Class 2: 160 - 169
Class 3: 170 - 179
Class 4: 180 - 189
Class 5: 190 - 199

57
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 57 / 315
Step 4: Calculate the Frequencies

Now, we count how many values fall into each class:

Class 1 (150 - 159): 2 values (150, 155)
Class 2 (160 - 169): 3 values (160, 162, 168)
Class 3 (170 - 179): 5 values (170, 170, 172, 175, 178)
Class 4 (180 - 189): 2 values (180, 185)
Class 5 (190 - 199): 3 values (190, 192, 195)

58
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 58 / 315
Constructing the Frequency Table

We now construct the frequency table:

Class Frequency
150 - 159 2
160 - 169 3
170 - 179 5
180 - 189 2
190 - 199 3

59
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 59 / 315
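As a check on the counts above, a short Python sketch (not from the handout) recounts the 15 listed heights class by class.

heights = [150, 155, 160, 162, 168, 170, 170, 172,
           175, 178, 180, 185, 190, 192, 195]

classes = [(150, 159), (160, 169), (170, 179), (180, 189), (190, 199)]

for lower, upper in classes:
    members = [h for h in heights if lower <= h <= upper]
    print(f"{lower}-{upper}: frequency {len(members)}  values {members}")
# 150-159: 2, 160-169: 3, 170-179: 5, 180-189: 2, 190-199: 3 (total 15)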
Boundary Class Limits

True class limits are the actual boundaries of class intervals in a


frequency distribution, accounting for any gaps between consec-
utive intervals.
They represent the continuous range of data values that belong to
a class.
Example: If the class interval is 10 − 19, the boundary or true
class limits are 9.5 − 19.5, assuming the data is rounded to the
nearest whole number.
The lower boundary or true limit is 10 − 0.5 = 9.5, and the
upper boundary or true limit is 19 + 0.5 = 19.5.

60
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 60 / 315
Height Distribution and boundary Class Limits Example

In this example, we examine the height distribution of a group of


individuals. The boundary class limits for each height range are
determined by adjusting the boundaries of the intervals.
Height Range (cm) boundary Class Limits Frequency (f) Midpoint (X)
150-159 149.5 - 159.5 3 154.5
160-169 159.5 - 169.5 4 164.5
170-179 169.5 - 179.5 4 174.5
180-189 179.5 - 189.5 3 184.5
190-199 189.5 - 199.5 1 194.5

61
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 61 / 315
Cumulative Frequency Example

The cumulative frequency is the running total of frequencies.


Example: For the height distribution:
After the first interval (150–159): Cumulative frequency = 3.
After the second interval (160–169): Cumulative frequency = 3 +
4 = 7.
Continuing sequentially, we sum up frequencies over the
intervals.

62
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 62 / 315
Use of boundary Class Limits in Graphs

Histogram:
boundary class limits ensure that bars representing adjacent
intervals touch each other.
The base of each bar corresponds to the range between the lower
and upper boundary class limits.
The height of the bar reflects the frequency of the class.

63
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 63 / 315
Frequency Polygon

Midpoints of the boundary class limits are used as the


x-coordinates of the vertices.
Frequencies are used as the y-coordinates.
Points are plotted and connected with straight lines to provide a
smooth distribution.

64
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 64 / 315
Importance of boundary Class Limits

Using boundary class limits ensures consistent and precise visual


representation of data.
Gaps are avoided in histograms.
Frequency polygons accurately reflect the continuous nature of
the distribution.

Using boundary class limits is essential for maintaining the ac-


curacy of graphical data representations, especially in continuous
distributions.

65
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 65 / 315
Descriptive Statistics Overview

Frequency distributions are useful, but often we need a single


value to summarize data.

Descriptive Measures: Computed from a sample: Statistic.


Computed from a population: Parameter.
Focus areas in this chapter:
Measures of central tendency (mean, median, mode).
Measures of dispersion (next slides).

66
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 66 / 315
What Are Measures of Central Tendency?

Definition: The typical or central value of a dataset, often referred


to as the ”average.”
Conveys information about the center of the data.
Three common measures:
1 Mean (Arithmetic Mean)
2 Median
3 Mode

67
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 67 / 315
Arithmetic Mean

Definition: The most familiar measure of central tendency, often


referred to as the “average.”
Add all the values and divide by the number of values.

Formula for Population Mean:

µ = (∑_{i=1}^{N} x_i) / N
xi : individual values
N : total number of values

68
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 68 / 315
Example 2.4.1: Population Mean

Compute the mean age of a population of 189 subjects.

Sol:
µ = (48 + 35 + 46 + · · · + 73 + 66) / 189 = 55.032

Add all ages.


Divide by the total number of subjects (N = 189).

69
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 69 / 315
Sample Mean

Definition: When data is drawn from a sample, the mean is computed using:

x̄ = (∑_{i=1}^{n} x_i) / n
xi : individual sample values
n: total number of sample values

70
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 70 / 315
Example 2.4.2: Sample Mean

Compute the mean age of a sample of 10 subjects.

Sol:
x̄ = (43 + 66 + 61 + 64 + 65 + 38 + 59 + 57 + 57 + 50) / 10 = 56

Add all sample values.


Divide by the total number of values (n = 10).

71
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 71 / 315
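A two-line Python check of Example 2.4.2 (illustrative only), using the ten sample ages listed above.

ages = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]   # sample of n = 10 ages
x_bar = sum(ages) / len(ages)                     # arithmetic mean: sum of values / number of values
print(x_bar)                                      # 56.0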
Properties of the Arithmetic Mean

Uniqueness: Only one arithmetic mean exists for a dataset.

Simplicity: Easy to compute and understand.

Sensitivity to Extreme Values:


All data points affect the mean.
Extreme values can distort the mean, making it less
representative.

72
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 72 / 315
Example: Effect of Extreme Values

Consider the charges for a medical procedure:

75, 75, 80, 80, 280

Mean:
(75 + 75 + 80 + 80 + 280) / 5 = 118

The mean is inflated by the extreme value ($280), making it


unrepresentative.

73
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 73 / 315
Median

Definition: The median divides a dataset into two equal parts such
that half the values are less than or equal to it and half are greater
than or equal to it.
Odd number of observations: Median is the middle value.
Even number of observations: Median is the mean of the two
middle values.

Example: For 10 ages: 38, 43, 50, 57, 57, 59, 61, 64, 65, 66
Median = (57 + 59) / 2 = 58

74
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 74 / 315
Properties of the Median

Uniqueness: One unique median for a given dataset.

Simplicity: Easy to calculate.

Resistant to Outliers: Not drastically affected by extreme values


compared to the mean.

75
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 75 / 315
Mode

Definition: The mode is the value that occurs most frequently in


a dataset.
Data: 20, 20, 21, 34, 22, 27, 27. Modes: 20 and 27 (bimodal).
Data: 10, 21, 33, 53, 54. No mode (all values are unique).

Use in Qualitative Data: For example, in mental health diag-


noses, the most frequent diagnosis is the modal diagnosis.

76
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 76 / 315
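The median and mode can be reproduced with the standard library's statistics module (Python 3.8 or later for multimode); the sketch is illustrative, and the bimodal list is a small hypothetical example.

import statistics

ages = [38, 43, 50, 57, 57, 59, 61, 64, 65, 66]     # ordered sample from the median example
print(statistics.median(ages))                      # 58.0 = mean of the two middle values (57 and 59)

print(statistics.multimode([1, 3, 3, 5, 7, 7, 9]))  # [3, 7]: two most frequent values (bimodal)
print(statistics.multimode([10, 21, 33, 53, 54]))   # every value occurs once, so no single mode stands out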
Skewness

Definition: Skewness measures the asymmetry of a data distribu-


tion.
Positively Skewed: Tail extends to the right; mean > median.
Negatively Skewed: Tail extends to the left; mean < median.

Formula:

Skewness = (√n ∑_{i=1}^{n} (x_i − x̄)³) / ((n − 1)√(n − 1) · s³)
Examples:
Positively Skewed: Mean = 10.5, Median = 9.
Negatively Skewed: Mean = 8, Median = 9.

77
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 77 / 315
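A hedged Python sketch of the skewness formula above; the eight-value data set is hypothetical, chosen so that a long right tail produces a positive (right-skewed) result.

import math

def skewness(x):
    # sqrt(n) * sum((x_i - x_bar)^3) / ((n - 1) * sqrt(n - 1) * s^3), as in the formula above.
    n = len(x)
    x_bar = sum(x) / n
    s = math.sqrt(sum((v - x_bar) ** 2 for v in x) / (n - 1))   # sample standard deviation
    return math.sqrt(n) * sum((v - x_bar) ** 3 for v in x) / ((n - 1) * math.sqrt(n - 1) * s ** 3)

print(round(skewness([2, 3, 3, 4, 4, 5, 9, 12]), 3))   # positive value, so positively skewed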
Symmetry and Normal Distribution
Symmetric Distribution: Left and right halves of the graph are
mirror images.

Normal Distribution: Mean = Median = Mode. Represented


by the bell-shaped curve.

Examples: Normal distribution, t-distribution.

Figure: Three histograms illustrating skewness


78
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 78 / 315
Introduction to Dispersion

Dispersion refers to the variability or spread within a set of data.


Key concepts:
No dispersion: All values are identical.

Greater dispersion: Values are more spread out.

Synonyms for dispersion:

Variation, Spread, Scatter


.

79
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 79 / 315
Visualizing Dispersion
Populations with the same mean can have different levels of
variability.
Examples:
Population A: Less variable, values closer together.
Population B: More variable, values more spread out.
If we denote the range by R, the largest value by xL , and the
smallest value by xS , we compute the range as follows:

Figure: 2.5.1: Two frequency distributions with equal means but different
amounts of dispersion.
80
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 80 / 315
The Range

Definition: The difference between the largest and smallest val-


ues in a dataset.

Formula:
R = xL − xS
Where:
xL : Largest value.
xS : Smallest value.

Example:
R = 82 − 30 = 52

81
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 81 / 315
Advantages and Disadvantages of the Range

Advantages: Simple to compute.

Disadvantages:
Considers only two values.
Limited usefulness for describing the entire dataset.

Alternate Expression:

Range as a pair: [xS , xL ] = [30, 82]

82
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 82 / 315
Variance

Variance: Measures the spread of data around the mean in squared


units.

Sample Variance Formula:

s² = ∑_{i=1}^{n} (x_i − x̄)² / (n − 1)

Finite Population Variance Formula:

σ² = ∑_{i=1}^{N} (x_i − µ)² / N

Degrees of Freedom: Dividing by n − 1 (instead of n) adjusts for


bias when estimating population variance from a sample.
83
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 83 / 315
Example of Variance Calculation

Dataset: Ages of subjects from Example 2.4.2 (Page 38).


Steps:
Compute the mean (x̄).
Calculate squared deviations ((xi − x̄)2 ).
Find the variance (s2 ).
Dataset: x = {43, 66, 50, . . . }.
Mean: x̄ = 56.

Variance:
s² = [(43 − 56)² + (66 − 56)² + · · · + (50 − 56)²] / 9
s² = 810 / 9 = 90

84
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 84 / 315
Standard Deviation

Definition: The square root of the variance, representing disper-


sion in original units.

Formula:
s = √s² = √( ∑_{i=1}^{n} (x_i − x̄)² / (n − 1) )

Calculation:
s = √90 ≈ 9.49

The standard deviation is approximately 9.49.

85
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 85 / 315
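A short Python sketch (illustrative) reproducing the variance and standard deviation of the example, both step by step and with the statistics module.

import math
import statistics

ages = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]      # sample from Example 2.4.2

x_bar = sum(ages) / len(ages)                        # 56
ss = sum((x - x_bar) ** 2 for x in ages)             # sum of squared deviations = 810
s2 = ss / (len(ages) - 1)                            # sample variance = 810 / 9 = 90
s = math.sqrt(s2)                                    # standard deviation ≈ 9.49
print(ss, s2, round(s, 2))

print(statistics.variance(ages), round(statistics.stdev(ages), 2))   # same results from the library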
Coefficient of Variation (C.V.)

Definition: A relative measure of dispersion, expressed as a per-


centage of the mean.

Formula:
C.V. = (s / x̄) × 100

Use: Compares variability across datasets with different units or


scales.

86
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 86 / 315
Example - Coefficient of Variation

Sample 1:
Mean weight: 145 pounds, Standard deviation: 10 pounds
C.V. = (10 / 145) × 100 = 6.9%

Sample 2:
Mean weight: 80 pounds, Standard deviation: 10 pounds
C.V. = (10 / 80) × 100 = 12.5%

Conclusion: The weights of the 11-year-olds (Sample 2) are more


variable relative to their mean than those of the 25-year-olds (Sam-
ple 1).

87
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 87 / 315
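A minimal Python sketch of the comparison above, using the stated means and standard deviations of the two samples.

def coefficient_of_variation(mean, sd):
    # C.V. = (s / x_bar) * 100, a unit-free percentage.
    return sd / mean * 100

print(round(coefficient_of_variation(145, 10), 1))   # Sample 1: 6.9 (percent)
print(round(coefficient_of_variation(80, 10), 1))    # Sample 2: 12.5 (percent)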
Importance of C.V.

Comparison Across Units: Useful for comparing datasets mea-


sured in different units (e.g., pounds vs. kilograms).

Eliminates Unit Bias: The unit of measurement cancels out.

Practical Applications:
Compare variability in serum cholesterol levels (mg/dL)
and body weight (pounds).
Evaluate consistency in different experimental results.

88
Chapter 2: Descriptive Statistics 0214STAT: Fundamentals of Biostatistics 88 / 315
Chapter 3: The Normal Distribution

Learning Objectives: After studying this chapter, the student will

• understand normal continuous distributions and how to use them


to calculate probabilities in real-world problems.
• understand standard normal continuous distributions and how to
use them to calculate probabilities in real-world problems.
• understand the importance and basic principles of estimation.
• be able to calculate interval estimates for a variety of parameters.
• be able to interpret a confidence interval from both a practical
and a probabilistic viewpoint.
• understand the basic properties and uses of the t distribution and
chi-square distribution.
89
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 89 / 315
Understanding the Nature of Continuous Distributions

What is a Continuous Random Variable?


A variable that can take any value within a given range (e.g.,
height, weight, age).
Its distribution is represented by a smooth curve rather than
discrete points.
Key Insights:
Probability Density Function (pdf) f (X): The function that
describes the probability of different outcomes.
The area under the curve between any two points represents the
probability of the variable falling within that range.
As sample size increases and class intervals become smaller, the
distribution approaches a smooth curve (theoretical model).

90
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 90 / 315
Visualization of a Continuous Distribution

Transition to Smooth Curves:


As the number of observations increases, histograms
approximate a smooth curve (normal distribution or other forms).
A smooth curve (e.g., normal curve) provides a better
understanding of the distribution of the continuous random
variable.

Figure 1: Histogram with large sample size and small intervals. Figure 2: Smooth curve approximating the distribution (normal curve).

91
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 91 / 315
Properties of Continuous Distributions

Key Properties:
Total area under the curve = 1.
Relative frequency between a and b: Area under the curve
between a and b (Figure 4.5.3).

Figure 4.5.3: Graph of a continuous distribution showing area


between a and b.

92
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 92 / 315
Properties of Continuous Distributions

Finding Areas:
Use integral calculus to compute areas.
Density function f (x) is integrated over the interval [a, b]:
P(a ≤ X ≤ b) = ∫_a^b f(x) dx

Probability of a specific value: P (X = x) = 0, as area above a


single point is 0.

93
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 93 / 315
Definition of Probability Distribution

Definition:

A function f (x) is a probability density function of a continuous


random variable X if:
f (x) ≥ 0 for all x.
Total area under f (x) and above the x-axis equals 1:
∫_{−∞}^{∞} f(x) dx = 1

Probability that X is between a and b:


P(a ≤ X ≤ b) = ∫_a^b f(x) dx

94
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 94 / 315
Introduction to The Normal Distribution

The normal distribution is the most important distribution in


statistics.
First published by Abraham De Moivre (1667–1754) on
November 12, 1733.
Often referred to as the Gaussian distribution due to Carl
Friedrich Gauss (1777–1855).
Its graph produces the familiar bell-shaped curve.

95
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 95 / 315
The Normal Density Function

The formula for the normal density function is:

f(x) = (1 / (σ√(2π))) e^(−(x−µ)² / (2σ²)),  −∞ < x < ∞
Here:
π = 3.14159 . . . and e = 2.71828 . . . are mathematical constants.
µ: Mean (measure of central tendency).
σ: Standard deviation (measure of dispersion).

96
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 96 / 315
Characteristics of the Normal Distribution

1 Symmetrical about its mean (µ).


2 Mean, median, and mode are equal.
3 Total area under the curve above the x-axis is 1.
4 Approximately:
68% of the area lies within ±1σ of the mean.
95% within ±2σ.
99.7% within ±3σ.
5 Completely determined by the parameters µ (location) and σ
(shape).

97
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 97 / 315
Visualizing the Normal Distribution
The bell-shaped curve is symmetric about µ (mean).
Variations in µ and σ (standard deviation):
Changing µ: Shifts the curve along the x-axis.
Changing σ: Affects the flatness or peakedness of the curve.
Example graph:

Figure: Graph of a normal distribution.

98
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 98 / 315
Empirical Rule (68–95–99.7 Rule)
The areas under the normal curve:
±1σ: 68% of the total area.
±2σ: 95% of the total area.
±3σ: 99.7% of the total area.
Illustration:

Figure: Subdivision of the area under the normal curve

99
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 99 / 315
Remarks

The normal distribution is foundational in statistics.


It is defined by two parameters: µ (mean) and σ (standard
deviation).
Key properties:
Symmetry about the mean.
Total area under the curve equals 1.
Empirical Rule (68–95–99.7 Rule).
Applications: Widely used in probability, data analysis, and
hypothesis testing.

100
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 100 / 315
Graph of the Normal Distribution

The graph of the normal distribution produces the familiar bell-shaped


curve.

Figure: Three normal distributions with different means but the same amount
of variability.

101
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 101 / 315
Effect of Parameters on the Normal Distribution

The normal distribution is determined by two parameters:


µ (mean), which shifts the graph along the x-axis.
σ (standard deviation), which affects the shape (flatness or
peakedness).
The graph’s shape depends on µ (location parameter) and σ
(shape parameter).

Figure: Three normal distributions with different standard deviations but the
same mean.

102
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 102 / 315
The Standard Normal Distribution

The standard normal distribution is a specific normal distribution


with:
Mean µ = 0
Standard deviation σ = 1
The equation of the standard normal distribution is:
f(z) = (1 / √(2π)) e^(−z²/2),  for −∞ < z < ∞

The transformation to standard normal form is:
z = (x − µ) / σ

103
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 103 / 315
Graph of the Standard Normal Distribution

The graph of the standard normal distribution is:

Figure: The standard normal distribution.

104
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 104 / 315
Finding Probabilities with the Standard Normal Distribution

To find the probability that z lies between two values, z0 and z1 , we


calculate the area under the curve:
P(z₀ ≤ z ≤ z₁) = ∫_{z₀}^{z₁} (1 / √(2π)) e^(−z²/2) dz
This integral doesn’t have a closed-form solution, but tables are
available for finding these areas.

105
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 105 / 315
Using the Standard Normal Table

To find the area under the curve between −∞ and a value z₀, we use the standard normal table. For example, the shaded area in the graph below represents the area between −∞ and z₀.

Remark: Given the standard normal distribution, we can use the table to find probabilities. For example, suppose we want to find the probability that z is between 0 and 1:

P(0 ≤ z ≤ 1) = (area from the table for z = 1) − (area for z = 0) = 0.8413 − 0.5 = 0.3413

106
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 106 / 315
The z-score

The z-score standardizes any normal distribution to the standard


normal distribution. The formula is:
z = (X − µ) / σ
Where:
x is a value from the original normal distribution.
µ is the mean of the original distribution.
σ is the standard deviation of the original distribution.

The z-value tells you how many standard deviations a value is


from the mean.

107
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 107 / 315
Example 4.6.1

Given the standard normal distribution, find the area under the curve,
above the z-axis, between z = −∞ and z = 2.

Sol: From Table D, the area corresponding to z = 2 is 0.9772.


Thus, the desired area is:

P (z < 2) = 0.9772

Interpretation: The probability that a randomly chosen z falls


between −∞ and 2 is 0.9772, or 97.72%.

108
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 108 / 315
Example 4.6.2

What is the probability that a z value is between z = −2.55 and


z = 2.55?

Sol: From Table D:


The area for z = 2.55 is 0.9946.
The area for z = −2.55 is 0.0054.
Thus, the probability is:

P (−2.55 < z < 2.55) = 0.9946 − 0.0054 = 0.9892

Interpretation: The probability is 98.92%.

109
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 109 / 315
Example 4.6.3

What proportion of z values are between z = −2.74 and z = 1.53?

Sol: From Table D:


The area for z = 1.53 is 0.9370.
The area for z = −2.74 is 0.0031.
Thus, the desired probability is:

P (−2.74 < z < 1.53) = 0.9370 − 0.0031 = 0.9339

Interpretation: The proportion is 93.39%.

110
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 110 / 315
Example 4.6.4

Given the standard normal distribution, find P (z ≥ 2.71).

Sol: From Table D, the area for z = 2.71 is 0.9966.


Thus, the probability is:

P (z ≥ 2.71) = 1 − P (z ≤ 2.71) = 1 − 0.9966 = 0.0034

Interpretation: The probability is 0.34%.

111
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 111 / 315
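Assuming SciPy is available, scipy.stats.norm.cdf returns the cumulative area tabulated in Table D, so Examples 4.6.1–4.6.4 can be checked with a few lines (illustrative, not part of the handout).

from scipy.stats import norm

print(round(norm.cdf(2), 4))                        # Example 4.6.1: P(z < 2) = 0.9772
print(round(norm.cdf(2.55) - norm.cdf(-2.55), 4))   # Example 4.6.2: 0.9892
print(round(norm.cdf(1.53) - norm.cdf(-2.74), 4))   # Example 4.6.3: 0.9339
print(round(1 - norm.cdf(2.71), 4))                 # Example 4.6.4: P(z >= 2.71) = 0.0034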
Statistical Inference

Statistical inference: Statistical inference is the procedure by


which we reach a conclusion about a population based on infor-
mation from a sample.

Statistical inference is crucial for drawing conclusions from sam-


ple data to make predictions or decisions about a population.

Estimation Process:
Calculate a statistic from a sample to approximate a population
parameter.
Example: Hospital administrator estimating the mean age of
admitted patients.
Example: Physician estimating the proportion of patients with
drug side effects.
112
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 112 / 315
Types of Estimates

Point Estimate: A single numerical value used to estimate a pa-


rameter.

Interval Estimate: Two values defining a range that likely in-


cludes the parameter with a specified confidence level.

An interval estimate provides more information than a point esti-


mate because it gives a range within which the parameter lies.

Estimator: Rule/formula used to compute an estimate. Example: x̄ = (∑ x_i) / n as an estimator of µ.

113
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 113 / 315
Sampled vs. Target Populations

Sampled Population: Population from which a sample is drawn.

Target Population: Population about which conclusions are made.

Understanding the difference between the sampled and target pop-


ulations is crucial to correctly interpreting statistical results.
Key Considerations:
When populations differ, statistical inference applies only to the
sampled population.
Example: Assessing rheumatoid arthritis treatment effectiveness
using a clinic sample.

114
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 114 / 315
Random vs. Nonrandom Samples

Random Samples: Strict validity of statistical methods depends


on random sampling.

Nonrandom Samples: Practicality may necessitate convenience


samples (e.g., volunteers, available animals).

Random assignment in experiments allows valid inferences about


treatments, while nonrandom samples may limit the generalizabil-
ity of conclusions.

Example: Random assignment in experiments allows valid infer-


ences about treatments.

115
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 115 / 315
Remarks

Estimation is key to statistical inference, allowing conclusions


about populations.

Two main types of estimates: point and interval.

Understanding the relationship between sampled and target pop-


ulations is crucial.

Randomization enhances the validity of inferences.

116
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 116 / 315
Introduction to Confidence Interval

Researchers wish to estimate the mean (µ) of a normally


distributed population.
A random sample of size n is drawn, and x̄ is computed as a
point estimate of µ.
Due to random sampling, x̄ cannot be expected to equal µ.
An interval estimate provides a range that likely includes µ.

117
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 117 / 315
Sampling Distributions and Estimation

Sampling distribution of the sample mean x̄:


Mean: µx̄ = µ
Variance: σx̄² = σ²/n
Approximately 95% of x̄ values lie within µ ± 2σx̄ .
Replace µ with x̄ to construct a confidence interval: x̄ ± 2σx̄ .

118
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 118 / 315
Example 6.2.1

Study: A researcher sampled 10 individuals to estimate the average


level of a specific enzyme in a human population. The sample mean
was 22, and the variable is approximately normally distributed with a
known variance of 45. The goal is to estimate the population mean
(µ). Given:
A sample of 10 individuals yields x̄ = 22, and σ 2 = 45.
Compute a 95% confidence interval for µ.

Sol:
σx̄ = √(σ²/n) = √(45/10) = 2.1213,
CI = x̄ ± 2σx̄ = 22 ± 2(2.1213) = (17.76, 26.24).

119
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 119 / 315
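A minimal Python sketch of this interval, assuming SciPy is available; the slide's rough reliability coefficient 2 is used, with the exact 95% value from norm.ppf shown for comparison.

# Sketch: 95% CI for the mean with known variance (Example 6.2.1).
import math
from scipy.stats import norm

x_bar, sigma2, n = 22, 45, 10
se = math.sqrt(sigma2 / n)                                  # standard error, about 2.1213

print(round(x_bar - 2 * se, 2), round(x_bar + 2 * se, 2))   # about (17.76, 26.24), as on the slide

z = norm.ppf(0.975)                                         # exact 95% reliability coefficient, 1.96
print(round(x_bar - z * se, 2), round(x_bar + z * se, 2))   # slightly narrower interval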
General Formula

Interval Estimate:
Estimator ± (Reliability Coefficient) × (Standard Error)
For known variance:

x̄ ± z1−α/2 σx̄ (6.2.2)

Reliability Coefficients:
90% CI: z = 1.645
95% CI: z = 1.96
99% CI: z = 2.58

120
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 120 / 315
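These coefficients come from the standard normal table; a short Python sketch, assuming SciPy, reproduces them with norm.ppf.

# Sketch: reliability coefficients z for common confidence levels.
from scipy.stats import norm

for conf in (0.90, 0.95, 0.99):
    alpha = 1 - conf
    z = norm.ppf(1 - alpha / 2)                   # upper (1 - alpha/2) quantile
    print(f"{int(conf * 100)}% CI: z = {z:.3f}")  # 1.645, 1.960, 2.576 (rounded to 2.58 above)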
Example 6.2.2

Study: A physical therapist wants to estimate the mean maximal


strength of a muscle in a group, with 99% confidence. Assuming the
strength scores are normally distributed with a variance of 144, a
sample of 15 subjects had a mean strength of 84.3. Given:
Sample of 15 subjects: x̄ = 84.3, σ 2 = 144.
Compute a 99% confidence interval for µ.

Sol:
σx̄ = √(144/15) = 3.0984,
CI = x̄ ± 2.58σx̄ = 84.3 ± 2.58(3.0984) = (76.3, 92.3).

121
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 121 / 315
Example 6.2.3: Sampling from Nonnormal Populations

Study: A study of 35 patients showed an average lateness of 17.2


minutes for appointments, with a standard deviation of 8 minutes. The
population distribution was nonnormal. What is the 90 percent
confidence interval for µ, the true mean amount of time late for
appointments? Given:
Sample of 35 patients: x̄ = 17.2, σ = 8.
Population distribution is nonnormal.
Compute a 90% confidence interval for µ.

Sol:
σx̄ = 8/√35 = 1.3522,
CI = x̄ ± 1.645σx̄ = 17.2 ± 1.645(1.3522) = (15.0, 19.4).

122
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 122 / 315
Key Concepts

Confidence Coefficient: 1 − α (e.g., 0.90, 0.95, 0.99).


Precision: Reliability factor × standard error.
Use sample variance when population variance is unknown.
For large samples, the Central Limit Theorem ensures x̄ is
approximately normal.

123
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 123 / 315
Introduction to the t Distribution

Confidence intervals for population mean often require the


population variance.
When population variance is unknown, the sample standard
deviation s is used.
For small samples, normal distribution theory is insufficient.
Solution: Use the t distribution, introduced by Gosset
(“Student”).

124
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 124 / 315
Formula for the t Distribution

The statistic t is defined as:

t = (x̄ − µ)/(s/√n)

t follows the t distribution.


Properties of the t distribution:
1 Mean of 0.
2 Symmetrical about the mean.
3 Variance greater than 1; approaches 1 as sample size increases.
4 Family of distributions defined by degrees of freedom (df =
n − 1).

125
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 125 / 315
Properties of the t Distribution

Range: −∞ to +∞.
Shape: Compared to the normal distribution:
Less peaked center.
Thicker tails.
Approaches the normal distribution as n − 1 → ∞.

Figure: Comparison of t and normal distributions.

126
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 126 / 315
Confidence Intervals Using t

Confidence interval formula:

x̄ ± t(1−α/2) · s/√n
Reliability coefficient t(1−α/2) is derived from the t table.
Requirements:
Sample drawn from a normal distribution.
Small deviations from normality are acceptable.

127
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 127 / 315
Example 6.3.1: Confidence Interval
Study: Maffulli et al. (A-1) examined early weightbearing and
mobilization after Achilles tendon repair. Among 19 subjects, the
mean isometric strength was 250.8 N with a standard deviation of
130.9 N, used to estimate the population mean. Given:
Mean strength of 19 subjects: x̄ = 250.8.
Standard deviation: s = 130.9.
Degrees of freedom: n − 1 = 18.
Desired confidence level: 95%.

Sol:
Standard error: s/√n = 130.9/√19 = 30.03.
t-value: t0.975,18 = 2.1009.
Confidence interval: 250.8 ± 2.1009 · 30.03 = [187.7, 313.9].
128
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 128 / 315
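A minimal Python sketch of the same interval, assuming SciPy; t.ppf supplies the reliability coefficient t(0.975) with 18 degrees of freedom.

# Sketch: 95% t confidence interval for Example 6.3.1.
import math
from scipy.stats import t

x_bar, s, n = 250.8, 130.9, 19
se = s / math.sqrt(n)                  # about 30.03
t_crit = t.ppf(0.975, df=n - 1)        # about 2.1009
print(round(x_bar - t_crit * se, 1), round(x_bar + t_crit * se, 1))   # about (187.7, 313.9)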
Deciding Between z and t
Use z:
Large sample size (n > 30).
Known population variance.
Use t:
Small sample size.
Unknown population variance.
Sample from a normal or approximately normal distribution.

Figure: Flowchart for deciding between z and t.


129
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 129 / 315
Chi-Square Distribution

The Chi-Square (χ2 ) distribution is used for constructing


confidence intervals for the population variance (σ²). If samples of
size n are drawn from a normally distributed population, the
following quantity has a Chi-Square distribution:

(n − 1)s²/σ²
Where n is the sample size, s2 is the sample variance, and σ 2 is the
population variance.
The Chi-Square distribution depends on the degrees of freedom
(df = n − 1).

130
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 130 / 315
Confidence Interval for Variance (σ 2 )

To construct a confidence interval for the population variance (σ 2 ),


we use the following steps:

χ²α/2 < (n − 1)s²/σ² < χ²1−α/2

Rearranging to solve for σ²:

(n − 1)s²/χ²1−α/2 < σ² < (n − 1)s²/χ²α/2

This gives us the 100(1 − α)% confidence interval for the population
variance (σ 2 ).

131
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 131 / 315
Confidence Interval for Standard Deviation (σ)

Taking the square root of each term in the confidence interval for σ 2 ,
we obtain the confidence interval for the population standard
deviation (σ):
√[(n − 1)s²/χ²1−α/2] < σ < √[(n − 1)s²/χ²α/2]

This is the 100(1 − α)% confidence interval for the population


standard deviation (σ).

132
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 132 / 315
Example 6.9.1: Gluten-Free Diet Study
Study: In a study where seven subjects with type 1 diabetes were
placed on a gluten-free diet, the IAA levels measured were as follows:

9.7, 12.3, 11.2, 5.1, 24.8, 14.8, 17.7

What are the values of χ21−α/2 and χ2α/2 used to calculate the
95% confidence interval for the population variance?
What is the 95% confidence interval for the population variance
σ2?
What is the 95% confidence interval for the population standard
deviation σ?

Sol: The sample variance s2 = 39.763 and degrees of freedom


df = 6. Using the Chi-Square distribution with df = 6, we find:

χ21−α/2 = 14.449, χ2α/2 = 1.237


133
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 133 / 315
Example: Gluten-Free Diet Study

The 95% confidence interval for σ 2 is:


(6 × 39.763)/14.449 < σ² < (6 × 39.763)/1.237
16.512 < σ² < 192.868
The 95% confidence interval for σ (the standard deviation) is:

4.063 < σ < 13.888

134
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 134 / 315
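A minimal Python sketch of the variance and standard deviation intervals, assuming SciPy; chi2.ppf supplies the two chi-square quantiles for df = 6.

# Sketch: 95% CI for the variance and SD, gluten-free diet data (Example 6.9.1).
import math
from statistics import variance
from scipy.stats import chi2

iaa = [9.7, 12.3, 11.2, 5.1, 24.8, 14.8, 17.7]
s2 = variance(iaa)                    # sample variance, about 39.763
df = len(iaa) - 1                     # 6

chi2_upper = chi2.ppf(0.975, df)      # about 14.449
chi2_lower = chi2.ppf(0.025, df)      # about 1.237

var_ci = (df * s2 / chi2_upper, df * s2 / chi2_lower)   # about (16.5, 192.9)
sd_ci = (math.sqrt(var_ci[0]), math.sqrt(var_ci[1]))    # about (4.06, 13.89)
print(var_ci, sd_ci)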
Precautions and Considerations

While the method for constructing confidence intervals for σ 2 is


widely used, there are some important considerations:
The assumption of normality is crucial. If the population is not
normal, the results may be misleading.
The Chi-Square distribution is not symmetric, meaning the
estimator is not at the center of the confidence interval.
To address this asymmetry, some tables and corrections can be
used to improve the accuracy of the intervals.

135
Chapter 3: The Normal Distribution 0214STAT: Fundamentals of Biostatistics 135 / 315
Chapter 4: Hypothesis Testing

Learning Objectives:
After studying this chapter, the student will

• Understand how to correctly state a null and alternative


hypothesis and carry out a structured hypothesis test.
• Understand the concepts of type I error, type II error, and
the power of a test.
• Be able to calculate and interpret z and t test statistics for
making statistical inferences.
• Understand how to calculate and interpret p-values.

136
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 136 / 315
Introduction to Statistical Hypothesis Testing

Statistical inference consists of estimation and hypothesis


testing.
Hypothesis testing aims to draw conclusions about a
population based on a sample.
Confidence intervals and hypothesis testing can lead to
similar conclusions.

137
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 137 / 315
Definition of Hypothesis

What is Hypothesis Testing?


Hypothesis testing is a statistical method used to make decisions
about the population based on sample data. The goal is to test a claim
or hypothesis about the population parameter.

A hypothesis is a statement about one or more populations. Hy-


potheses often involve population parameters.

Examples:
The average length of hospital stay is 5 days.
A drug is effective in 90% of cases.
Hypothesis testing determines compatibility of statements with
data.

138
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 138 / 315
Key Concepts
Key Concepts in Hypothesis Testing:
Null Hypothesis (H0 ): A statement that there is no effect or no
difference.
Alternative Hypothesis (HA ): A statement that contradicts the
null hypothesis.

Important! Always define your hypotheses clearly before per-


forming the test.

Types of Hypotheses:
Research Hypothesis: The conjecture or supposition that
motivates the research.
Statistical Hypothesis: A statement expressed in a way that
allows evaluation through statistical techniques.
139
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 139 / 315
Types of Hypotheses

Examples:
Case 1: Testing for equality:
Null Hypothesis (H0 ): µ = 50 (e.g., The average length of
hospital stay is 50 days).
Alternative Hypothesis (HA ): µ ≠ 50 (e.g., The average
length of hospital stay differs from 50 days).
Case 2: Testing for an increase:
Null Hypothesis (H0 ): µ ≤ 50 (e.g., The drug is effective
in no more than 50% of cases).
Alternative Hypothesis (HA ): µ > 50 (e.g., The drug is
effective in more than 50% of cases).

140
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 140 / 315
Purpose of Hypothesis Testing

Purpose of Hypothesis Testing:


To provide a framework for making decisions using sample
data.
To evaluate whether the evidence supports rejecting the null
hypothesis.
To quantify the probability of observing sample results
under the assumption that H0 is true.

141
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 141 / 315
Common Hypothesis Tests

Common Tests:
Z-test: Used when population variance is known and the
sample size is large.
T-test: Used when population variance is unknown and the
sample size is small.
Chi-square test: Used for testing categorical data.

142
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 142 / 315
Rules for Hypothesis Tests

Rules for Stating Hypotheses:


The alternative hypothesis reflects what you hope or expect
to conclude.
The null hypothesis always includes equality (=, ≤, ≥).
The null hypothesis is tested; it is either rejected or not
rejected.
Null and alternative hypotheses are complementary,
exhausting all possibilities.

143
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 143 / 315
General Formula for Test Statistic

Formula of Test Statistic:


Test Statistic = (Relevant Statistic − Hypothesized Parameter) / (Standard Error of the Relevant Statistic)
Key Idea: The test statistic is computed from the sample data.
Decision to reject or not reject H0 depends on the test statistic.
Example of a test statistic:

z = (x̄ − µ0)/(σ/√n),
where µ0 is a hypothesized value of a population mean.
This test statistic relates to the familiar formula:
z = (x̄ − µ)/(σ/√n)   (7.1.2)
144
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 144 / 315
Test Statistic

Distribution of Test Statistic:


The distribution is shaped by the behavior of the data across
repeated samples.
Example: If testing the mean using a z-test, the statistic follows a
normal distribution under certain conditions.
If H0 is true and assumptions are met: z = (x̄ − µ0)/(σ/√n) follows a
standard normal distribution.

Decision Rule and Rejection Region:


Values of the test statistic are divided into:
Rejection Region: Unlikely values if H0 is true.
Non-rejection Region: Likely values if H0 is true.
Significance Level (α): Probability of rejecting a true H0 .
Common values: 0.01, 0.05, 0.10.
145
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 145 / 315
Types of Errors

Type I Error (α): Rejecting a true null hypothesis.

Type II Error (β): Failing to reject a false null hypothesis.

Condition of Null Hypothesis


True False
Fail to Reject H0 Correct action Type II error
Reject H0 Type I error Correct action
Table: Conditions under which Type I and Type II errors may occur.

146
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 146 / 315
Steps in Hypothesis Testing

1 State hypotheses (H0 and HA ).


2 Compute the test statistic.
3 Determine the test statistic’s distribution.
4 State the decision rule.
5 Calculate the test statistic from sample data.
6 Make the statistical decision:
Reject H0 if it falls in the rejection region.
Do not reject H0 otherwise.
7 Conclude based on the decision.

147
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 147 / 315
Conclusion and p-values

p-value: The probability of obtaining a test statistic as extreme


as the one computed, assuming H0 is true.

Making Decisions with p-values:


Decisions are based on the comparison of the p-value with the
significance level α.

Decision Rule:
Reject H0 if p-value < α. This indicates strong evidence against
H0 .
Do not reject H0 if p-value ≥ α. This does not prove H0 , but
suggests insufficient evidence to reject it.

Tip: If the p-value is less than the significance level, reject H0 .


148
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 148 / 315
Precaution

Precaution: Precaution refers to steps or considerations to ensure


valid and reliable results.

Precautions in Hypothesis Testing:


(1) Set the appropriate significance level α (e.g., 0.05).
(2) Understand the null and alternative hypotheses.
(3) Verify assumptions (e.g., normality, independence, equal
variance).
(4) Choose the correct test (e.g., t-test, z-test, chi-square test).
(5) Be cautious with p-values (interpret p-values correctly).

149
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 149 / 315
Precaution

Remarks:
Hypothesis testing does not prove a hypothesis; it evaluates
whether data supports or does not support it.
If H0 is not rejected:
We say it ”may be true,” but we do not claim it is proven.
Accepting H0 implies that the data do not provide strong
evidence against it.

150
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 150 / 315
Introduction to Hypothesis Testing: Single Population Mean

Test if the population mean (µ) differs from a hypothesized value


(µ0 ).

Conditions for the Test:


1 Population is normal or approximately normal.
2 Population variance (σ 2 ) is known.

Test Statistic:
z = (x̄ − µ0)/(σ/√n)

The z-test is used when the population variance is known and


the sample size is large enough to approximate the normal
distribution.

151
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 151 / 315
Example 7.2.1: Testing the Mean Age

Study: Researchers want to test if the mean age of a population


differs from 30.
Sample size (n) = 10.
Sample mean (x̄) = 27.

Population variance (σ²) = 20 (σ = √20 = 4.47).
Significance level (α) = 0.05.
Sol. Hypotheses:
Null Hypothesis (H0 ): µ = 30
Alternative Hypothesis (HA ): µ ≠ 30 (two-tailed test)
Significance level (α = 0.05):

α/2 = 0.025 (for each tail)

152
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 152 / 315
Example 7.2.1: Testing the Mean Age
Decision Rule: The critical values are:
z-value for α/2 = 0.025:
z = ±1.96
Reject H0 if:
z ≤ −1.96 or z ≥ 1.96
Rejection and Non-Rejection Regions:

Figure: Rejection and non-rejection regions for Example


153
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 153 / 315
Calculation of Test Statistic

Test Statistic Formula:


z = (x̄ − µ0)/(σ/√n)

Substitute Values:
z = (27 − 30)/(4.47/√10) = −3/1.4142 ≈ −2.12

Decision:
z = −2.12 falls in the rejection region (z ≤ −1.96).
Reject H0 .
Conclusion:
There is sufficient evidence to conclude that the mean age of the
population is different from 30 at the α = 0.05 significance level.
154
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 154 / 315
Understanding the p-value

The p-value represents the probability of observing the test


statistic (or more extreme values) assuming H0 is true.
A small p-value indicates that the observed data is unlikely under
H0 .
Interpretation:
If p ≤ α (significance level), reject H0 .
If p > α, fail to reject H0 .
For a Two-Tailed Test:

p = 2 × P (Z ≥ |zobserved |).

155
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 155 / 315
How to Compute the p-value
Steps to compute the p-value:
1 Identify the test statistic: In this example, the test statistic is

z = −2.12.
2 Determine the type of test: This is a two-tailed test. The

p-value is computed as:


p = 2 × P (Z ≥ |zobserved |).
3 Look up the z-score in the z-table: Find the cumulative
probability for z = −2.12. The cumulative probability is:
P (Z ≤ −2.12) ≈ 0.0170.
4 Compute the p-value: The p-value is:
p = 2 × (0.0170) = 0.0340.
5 Make a decision: If p ≤ α (significance level), reject H0 . For
α = 0.05 and p = 0.0340, we reject H0 .
156
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 156 / 315
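The steps above can be carried out in a few lines. A minimal Python sketch, assuming SciPy, using the numbers of Example 7.2.1:

# Sketch: two-sided z test of H0: mu = 30 (Example 7.2.1) and its p-value.
import math
from scipy.stats import norm

x_bar, mu0, sigma, n = 27, 30, math.sqrt(20), 10
z = (x_bar - mu0) / (sigma / math.sqrt(n))               # about -2.12
p_value = 2 * norm.cdf(-abs(z))                          # two-tailed area, about 0.034
print(round(z, 2), round(p_value, 4), p_value <= 0.05)   # True: reject H0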
EXAMPLE 7.2.2

Problem: Refer to Example 7.2.1. Suppose, instead of asking whether they
could conclude that µ ≠ 30, the researchers had asked: Can we conclude
that µ < 30? To this question we would reply that they can so conclude if
they can reject the null hypothesis that µ ≥ 30.

157
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 157 / 315
Introduction to Sampling from a Normally Distributed
Population: Population Variance Unknown

The population variance is often unknown in statistical inference.


When sampling from an approximately normal population with
unknown variance, the test statistic is given by:

t = (x̄ − µ0)/(s/√n)
Under the null hypothesis H0 : µ = µ0 , this statistic follows a
Student’s t-distribution with n − 1 degrees of freedom.

158
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 158 / 315
Example 7.2.3: MCL and ACL Tear Study
Research Context:
Study by Nakamura et al. on MRI timing for 17 patients with
MCL and ACL tears.
Variable: Days between injury and initial MRI.
Data Summary: Sample mean x̄ = 13.2941, sample standard
deviation s = 8.88654, sample size n = 17.
Subject Days Subject Days Subject Days
1 14 6 0 11 28
2 9 7 10 12 24
3 18 8 4 13 24
4 26 9 8 14 2
5 12 10 21 15 3
16 14 17 9
Table: Number of Days Until MRI for Subjects with medial collateral
ligament (MCL) and anterior cruciate ligament (ACL) tears.
159
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 159 / 315
Example 7.2.3: MCL and ACL Tear Study

Research Question: Does the average number of days between


injury and initial MRI for patients with MCL and ACL tears sig-
nificantly differ from a clinically recommended value?
Sol. Hypotheses:

H0 : µ = 15
HA : µ ≠ 15

Test Statistic:
t = (x̄ − µ0)/(s/√n)

160
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 160 / 315
Calculation and Statistical Decision
Degrees of Freedom:

df = n − 1 = 16

Test Statistic Calculation:


t = (13.2941 − 15)/(8.88654/√17) = −1.7059/2.1553 = −0.791

Decision Rule:
α = 0.05, two-tailed test.
Critical t values: ±2.1199.
Do not reject H0 since t = −0.791 falls in the non-rejection
region.
161
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 161 / 315
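A minimal Python sketch of the same one-sample t test on the 17 MRI-delay values, assuming SciPy; ttest_1samp reports the two-sided p-value.

# Sketch: one-sample t test of H0: mu = 15 for the MRI-delay data (Example 7.2.3).
from scipy.stats import ttest_1samp

days = [14, 9, 18, 26, 12, 0, 10, 4, 8, 21, 28, 24, 24, 2, 3, 14, 9]
result = ttest_1samp(days, popmean=15)   # two-sided test against mu0 = 15
print(round(result.statistic, 3))        # about -0.791
print(round(result.pvalue, 3))           # well above 0.05, so H0 is not rejected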
Rejection and Non-Rejection Regions

Figure: Rejection and nonrejection regions

The test statistic t = −0.791 lies within the non-rejection region.


Conclusion: Insufficient evidence to reject H0 .

162
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 162 / 315
Sampling from Non-Normal Populations

For large samples (n ≥ 30), use the central limit theorem.


Test statistic:

z = (x̄ − µ0)/(s/√n)
Approximates the standard normal distribution if H0 is true.
Use s as an estimate for σ when population standard deviation is
unknown.

163
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 163 / 315
EXAMPLE 7.2.4: Systolic Blood Pressure in
African-American Men

Study Context: The study focused on symptom recognition and


perception among African-American patients with chest pain in an
emergency department.
Sample size: n = 157
Sample mean: x̄ = 146 mm Hg
Sample standard deviation: s = 27
Goal: To determine if the mean systolic blood pressure for a
population of African-American men is greater than 140, based on a
study by Klingler et al. (A-2).
Data: The systolic blood pressure scores are: x̄ = 146, s = 27,
n = 157.

164
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 164 / 315
EXAMPLE 7.2.4: Systolic Blood Pressure in
African-American Men
Sol. Assumptions:
The data represent a simple random sample of African-American
men reporting similar symptoms.
Systolic blood pressure is not assumed to be normally
distributed; the Central Limit Theorem applies due to the large
sample size.
Hypotheses:
H0 : µ ≤ 140 (null hypothesis)
HA : µ > 140 (alternative hypothesis)
Test Statistic:
z = (x̄ − µ0)/(s/√n) = (146 − 140)/(27/√157) = 2.78
Distribution: By the central limit theorem, the test statistic z
approximately follows a standard normal distribution under H0 .
165
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 165 / 315
EXAMPLE 7.2.4: Systolic Blood Pressure in
African-American Men
Decision Rule: Using a significance level α = 0.05, the critical value
is:
z(α) = 1.645

Figure: Rejection and non-rejection regions

Rejection Region: Reject H0 if z ≥ z(α) = 1.645.


166
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 166 / 315
EXAMPLE 7.2.4: Systolic Blood Pressure in
African-American Men

Statistical Decision: The computed z = 2.78 > 1.645, so we reject


H0 .
Conclusion: There is sufficient evidence to conclude that the mean
systolic blood pressure for the population is greater than 140 mm Hg.
p-value:
Compute the p-value:

p = P (Z > 2.78) = 1 − P (Z ≤ 2.78)

Using the z-table, P (Z ≤ 2.78) ≈ 0.9973.

p = 1 − 0.9973 = 0.0027

Since p < 0.05, the null hypothesis is rejected.

167
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 167 / 315
Remarks

The t-test is a versatile tool for hypothesis testing when


population variance is unknown.
Real-world data often involve assumptions of approximate
normality and unknown variance.
The methods discussed are essential for sound statistical
inference in practical scenarios.

168
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 168 / 315
Introduction to Hypothesis Testing: The Difference
Between Two Population Means

Hypothesis testing involving the difference between two


population means determines if the two means are unequal.
Hypotheses may include:

1. H0 : µ1 − µ2 = 0, HA : µ1 − µ2 ≠ 0
2. H0 : µ1 − µ2 ≥ 0, HA : µ1 − µ2 < 0
3. H0 : µ1 − µ2 ≤ 0, HA : µ1 − µ2 > 0

Testing can also focus on differences other than zero.

169
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 169 / 315
Contexts for Testing the Difference

Three scenarios are considered:


1 Sampling from normally distributed populations with known
variances.
2 Sampling from normally distributed populations with unknown
variances.
3 Sampling from populations that are not normally distributed.

170
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 170 / 315
Testing When Population Variances are Known

If two independent samples come from normal populations with


known variances, the test statistic is:

z = [(x̄1 − x̄2) − (µ1 − µ2)] / √(σ1²/n1 + σ2²/n2)

Under H0 , z ∼ N(0,1) (standard normal distribution).

171
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 171 / 315
Example 7.3.1: Problem

Problem: Researchers want to determine if there is evidence of


a difference in mean serum uric acid levels between:
Down’s syndrome individuals (x̄1 = 4.5 mg/100 ml) and
Normal individuals (x̄2 = 3.4 mg/100 ml).
Additional Data:
Sample sizes: n1 = 12, n2 = 15
Variances: σ12 = 1, σ22 = 1.5

172
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 172 / 315
Example 7.3.1: Solution
Steps in Hypothesis Testing:
(1) Hypotheses:
H0 : µ1 = µ2 , HA : µ1 ≠ µ2
(2) Test Statistic:
z = [(4.5 − 3.4) − 0] / √(1/12 + 1.5/15)
(3) Calculation:
z = 1.1/0.4282 = 2.57
(4) Decision Rule: At α = 0.05, reject H0 if |z| > 1.96.

173
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 173 / 315
Example 7.3.1: p-value Calculation

(5) Decision: Since z = 2.57 > 1.96, reject H0 .


(6) Conclusion: There is evidence to suggest that the mean serum
uric acid levels are different between the two groups.
(7) p-value: p = 0.0102
To calculate the p-value:
1 Compute the area to the right of the observed z-value (z = 2.57)
using the z-table.
2 The cumulative probability for z = 2.57 is approximately 0.9948.
3 The p-value for a two-tailed test is:

p = 2 × (1 − 0.9948) = 2 × 0.0052 = 0.0104.

Since p < α = 0.05, we reject the null hypothesis (H0 ).

174
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 174 / 315
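A minimal Python sketch of the two-sample z statistic and two-sided p-value for Example 7.3.1, assuming SciPy for the normal tail area.

# Sketch: two-sample z test with known variances (Example 7.3.1).
import math
from scipy.stats import norm

x1, x2 = 4.5, 3.4        # sample means (Down's syndrome, normal)
var1, var2 = 1.0, 1.5    # known population variances
n1, n2 = 12, 15

z = (x1 - x2) / math.sqrt(var1 / n1 + var2 / n2)   # about 2.57
p_two_sided = 2 * (1 - norm.cdf(abs(z)))           # about 0.010
print(round(z, 2), round(p_two_sided, 4))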
Remarks

Hypothesis tests for two population means are used to evaluate


whether a significant difference exists between the means.
Steps:
1 Define hypotheses.
2 Identify the test statistic and distribution.
3 Compute the test statistic.
4 Make a decision and draw a conclusion.
Example 7.3.1 demonstrates testing under known variances.

175
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 175 / 315
Sampling from Normally Distributed Populations

When the population variances are unknown, two possibilities exist:


The two population variances may be equal.
The two population variances may be unequal.
We consider the case where the population variances are assumed to
be equal.

176
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 176 / 315
Population Variances Equal

When the population variances are unknown but assumed to be


equal, we pool the sample variances using the following formula:

sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)
where:
n1 and n2 are the sample sizes.
s21 and s22 are the sample variances.

177
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 177 / 315
Test Statistic for Testing H0 : µ1 = µ2

The test statistic for testing H0 : µ1 = µ2 is given by:

t = [(x̄1 − x̄2) − (µ1 − µ2)] / √[sp² (1/n1 + 1/n2)]

where:
x¯1 and x¯2 are the sample means.
s2p is the pooled sample variance.
Under the null hypothesis, this test statistic follows a Student’s
t-distribution with n1 + n2 − 2 degrees of freedom.

178
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 178 / 315
Example 7.3.2: Problem

Study: Tam et al. (A-6) investigated wheelchair maneuvering in


individuals with lower-level spinal cord injury (SCI) and healthy
controls (C).
Pressure measurements were recorded under static sitting
conditions.
Data for measurements of the left ischial tuberosity (in mm Hg)
for the SCI and control groups are shown in Table 7.3.1.
We aim to determine if healthy subjects exhibit lower pressure than
SCI subjects.

Control 131 115 124 131 122 117 88 114 150 169
SCI 60 150 130 180 163 130 121 119 130 148

Table: Pressures (mm Hg) Under the Pelvis during Static Conditions

179
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 179 / 315
Example 7.3.2: Solution
Steps in Hypothesis Testing:
(1) Data: The problem statement provides the data.
(2) Assumptions: Two independent simple random samples,
approximately normally distributed data, and equal population
variances.
(3) Hypotheses:

H0 : µC ≥ µSCI , HA : µC < µSCI

(4) Test Statistic: The test statistic follows Equation 7.3.2.


(5) Distribution of Test Statistic: Student’s t-distribution with
n1 + n2 − 2 degrees of freedom.
(6) Decision Rule:
Significance level: α = 0.05
Critical value of t: t0.05,18 = −1.7341
Decision rule: Reject H0 unless tcomputed > −1.7341.
180
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 180 / 315
Example 7.3.2: Solution
Given Information:
Control group: x̄C = 126.1, sC = 21.8, n1 = 10
SCI group: x̄SCI = 133.1, sSCI = 32.2, n2 = 10

Step 1: Pool the sample variances

sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)

Substituting the values:

sp² = [(10 − 1)(21.8)² + (10 − 1)(32.2)²] / (10 + 10 − 2)

First, calculate the squared sample standard deviations:

s1² = 21.8² = 475.24, s2² = 32.2² = 1036.84


181
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 181 / 315
Test Statistic Calculation
Now, substitute:

sp² = [9(475.24) + 9(1036.84)] / 18 = 13608.72/18 = 756.04

Thus, the pooled variance is:
sp² = 756.04
The test statistic formula is:

t = (x̄C − x̄SCI) / [sp √(1/n1 + 1/n2)]

Substitute the known values:

t = (126.1 − 133.1) / [√756.04 · √(1/10 + 1/10)]

First, compute the square root of the pooled variance:

√756.04 = 27.50
182
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 182 / 315
Decision Rule and Conclusion
Now substitute into the formula:
t = −7 / (27.50 × √0.2) = −7 / (27.50 × 0.4472) = −7 / 12.30 = −0.57

Decision Rule:
The critical value for t at α = 0.05 and df = n1 + n2 − 2 = 18
is −1.7341.
Reject H0 if tcomputed < −1.7341.
Since the computed value of t = −0.57 is greater than −1.7341, we
fail to reject H0 .

Conclusion: There is not sufficient evidence to suggest that the


mean pressure for the SCI group is greater than that for the control
group.

183
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 183 / 315
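A minimal Python sketch of the pooled-variance t statistic computed directly from the Table 7.3.1 data with the formulas above; because the slide works with rounded standard deviations, the values here differ slightly in the last decimals.

# Sketch: pooled two-sample t test for Example 7.3.2 (Control vs. SCI).
import math
from statistics import mean, variance

control = [131, 115, 124, 131, 122, 117, 88, 114, 150, 169]
sci = [60, 150, 130, 180, 163, 130, 121, 119, 130, 148]

n1, n2 = len(control), len(sci)
s2_pooled = ((n1 - 1) * variance(control) + (n2 - 1) * variance(sci)) / (n1 + n2 - 2)
t_stat = (mean(control) - mean(sci)) / math.sqrt(s2_pooled * (1 / n1 + 1 / n2))
print(round(s2_pooled, 2), round(t_stat, 2))   # roughly 756 and -0.57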
Population Variances Unequal
When two independent simple random samples have been drawn from
normally distributed populations with unknown and unequal
variances, the test statistic for testing H0 : µ1 = µ2 is given by:

t′ = [(x̄1 − x̄2) − (µ1 − µ2)0] / √(s1²/n1 + s2²/n2)   (7.3.3)

The critical value of t′ for a significance level α and a two-sided test is
approximately:

t′(1−α/2) = (w1 t1 + w2 t2) / (w1 + w2)   (7.3.4)

where:
w1 = s1²/n1, w2 = s2²/n2
and t1 = t(1−α/2) and t2 = t(1−α/2) are the critical values for n1 − 1 and
n2 − 1 degrees of freedom, respectively.
184
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 184 / 315
Critical Values for t′ and Decision Rules

Determining the critical value t′:
One-sided test:
Compute t′(1−α) from Equation 7.3.4, using
t1 = t(1−α) for n1 − 1 degrees of freedom and
t2 = t(1−α) for n2 − 1 degrees of freedom.
Two-sided test:
Reject H0 if:

t′ ≥ t′(1−α/2) or t′ ≤ −t′(1−α/2).

Rejection regions for one-sided tests:

Right-tailed test: Reject H0 if t′ ≥ t′(1−α).
Left-tailed test: Reject H0 if t′ ≤ −t′(1−α).

185
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 185 / 315
Example 7.3.3: Aortic Stiffness Index

Study: Dernellis and Panaretou (A-7) examined subjects with


hypertension and healthy control subjects. One of the variables of
interest was the aortic stiffness index.
Data:
n1 = 15, x¯1 = 19.16, s1 = 5.29
n2 = 30, x¯2 = 9.53, s2 = 2.69

Goal: The goal is to test if there is a significant difference between


the mean aortic stiffness index for subjects with hypertension and
healthy control subjects.

186
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 186 / 315
Example 7.3.3: Hypotheses and Test Statistic

Sol.

Hypotheses:

H0 : µ1 = µ2 , HA : µ1 ≠ µ2

Test Statistic: The test statistic is given by Equation (7.3.3):

t′ = [(x̄1 − x̄2) − (µ1 − µ2)0] / √(s1²/n1 + s2²/n2)

The critical value is obtained from Equation (7.3.4).

187
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 187 / 315
Example 7.3.3: Decision Rule

Distribution of the Test Statistic:


The test statistic from Equation 7.3.3 does not follow a
Student’s t-distribution.
Critical values are determined using Equation 7.3.4 instead.

Decision Rule: Calculation of w1 and w2 :

w1 = s1²/n1 = (5.29)²/15 = 1.8656, w2 = s2²/n2 = (2.69)²/30 = 0.2412

Critical Values: From the t-distribution table at α = 0.05 (two-


tailed): t1 = t0.975,14 = 2.1448, t2 = t0.975,29 = 2.0452

t′(0.975) = [(1.8656 · 2.1448) + (0.2412 · 2.0452)] / (1.8656 + 0.2412) = 2.133
188
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 188 / 315
Example 7.3.3: Calculation of Test Statistic

Calculation of test statistic: By Equation 7.3.3 we compute

t′ = [(19.16 − 9.53) − 0] / √[(5.29)²/15 + (2.69)²/30]

t′ = 9.63 / √(1.8656 + 0.2412) = 9.63 / √2.1068 = 9.63 / 1.4515 = 6.63

Conclusion: Since the computed value of t′ = 6.63 exceeds the critical
value 2.133, we reject the null hypothesis. Thus, we conclude that the
populations represented by the samples differ with respect to the mean
aortic stiffness index.

189
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 189 / 315
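A minimal Python sketch of the t′ statistic and the weighted critical value of Equation 7.3.4 for Example 7.3.3, assuming SciPy's t.ppf for the two table values.

# Sketch: t' statistic and its approximate critical value (Example 7.3.3).
import math
from scipy.stats import t

x1, s1, n1 = 19.16, 5.29, 15    # hypertensive group
x2, s2, n2 = 9.53, 2.69, 30     # healthy controls

w1, w2 = s1 ** 2 / n1, s2 ** 2 / n2                    # 1.8656 and 0.2412
t1, t2 = t.ppf(0.975, n1 - 1), t.ppf(0.975, n2 - 1)    # 2.1448 and 2.0452

t_crit = (w1 * t1 + w2 * t2) / (w1 + w2)   # about 2.133
t_prime = (x1 - x2) / math.sqrt(w1 + w2)   # about 6.63
print(round(t_prime, 2), round(t_crit, 3), abs(t_prime) >= t_crit)   # True: reject H0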
Sampling from Populations That Are Not Normally
Distributed

When sampling is from populations that are not normally dis-


tributed, the central limit theorem can be applied if the sample
sizes are large (typically n ≥ 30). This allows for the use of nor-
mal theory, as the distribution of the difference between sample
means will be approximately normal.

190
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 190 / 315
Test Statistic for Large Samples

When two large independent simple random samples are drawn


from populations that are not normally distributed, the test statis-
tic for testing H0 : µ1 = µ2 is:

z = [(x̄1 − x̄2) − (µ1 − µ2)0] / √(σ1²/n1 + σ2²/n2)   (7.3.5)

x̄1 and x̄2 are the sample means,


σ12 and σ22 are the population variances,
n1 and n2 are the sample sizes,
µ1 and µ2 are the population means under the null hypothesis.

191
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 191 / 315
Using the Test Statistic

When H0 is true, the test statistic follows the standard


normal distribution.
If the population variances are known, they should be used.
If they are unknown, sample variances are used as estimates.
Sample variances are not pooled because the equality of
population variances is not assumed for the z statistic.

192
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 192 / 315
Example 7.3.4: IgG Levels in Thrombosis Study

Study: The objective of the study by Sairam et al. was to determine if


people with thrombosis have higher levels of the anticardiolipin
antibody IgG compared to those without thrombosis. The data are
summarized below:
Group Mean IgG Level (ml/unit) Sample Size Standard Deviation
Thrombosis 59.01 53 44.89
No Thrombosis 46.61 54 34.85

Table: IgG Levels for Subjects With and Without Thrombosis

Goal: We will test if the mean IgG level for thrombosis subjects
is higher than that of non-thrombosis subjects.

193
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 193 / 315
Example 7.3.4: IgG Levels in Thrombosis Study

Sol.

Step 1: Hypotheses: The null and alternative hypotheses are:

H0 : µT − µN T ≤ 0

HA : µT − µN T > 0
where:
µT is the mean IgG level for thrombosis subjects,
µN T is the mean IgG level for non-thrombosis subjects.

194
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 194 / 315
Example 7.3.4: IgG Levels in Thrombosis Study

Step 2: Test Statistic: Since the samples are large, we apply the
central limit theorem, and the test statistic is:
z = (59.01 − 46.61) / √(44.89²/53 + 34.85²/54) = 1.59

Step 3: Decision Rule For a one-sided test with α = 0.01, the


critical value of z is 2.33. The decision rule is:

Reject H0 if zcomputed ≥ 2.33.

Since zcomputed = 1.59, which is less than 2.33, we fail to reject


H0 .

195
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 195 / 315
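A minimal Python sketch of the large-sample z statistic and its one-sided p-value for Example 7.3.4, assuming SciPy.

# Sketch: large-sample z test for the thrombosis IgG comparison (Example 7.3.4).
import math
from scipy.stats import norm

x_t, s_t, n_t = 59.01, 44.89, 53        # thrombosis group
x_nt, s_nt, n_nt = 46.61, 34.85, 54     # no-thrombosis group

z = (x_t - x_nt) / math.sqrt(s_t ** 2 / n_t + s_nt ** 2 / n_nt)   # about 1.59
p_one_sided = 1 - norm.cdf(z)                                     # about 0.056
print(round(z, 2), round(p_one_sided, 4), p_one_sided < 0.01)     # False: do not reject H0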
Example 7.3.4: IgG Levels in Thrombosis Study

Step 4: Conclusion: Based on the data, we conclude that there is


not enough evidence to suggest that people with thrombosis have
higher IgG levels on average compared to those without throm-
bosis.

Step 5: p-value
To compute the p-value, we need to find the probability of obtain-
ing a test statistic at least as extreme as the one computed from
the sample, assuming the null hypothesis H0 is true.
The null hypothesis: H0 : µT ≤ µN T
The alternative hypothesis: HA : µT > µN T
Test statistic computed: z = 1.59

196
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 196 / 315
Example 7.3.4: IgG Levels in Thrombosis Study

Since we are using a standard normal distribution


(z-distribution), the p-value corresponds to the area to the
right of z = 1.59 under the normal curve.
Step 5: p-value
Using standard normal tables or a calculator, we find that the p-
value is 0.0559.

Since the p-value is greater than the significance level α = 0.01,


we fail to reject the null hypothesis H0 .

197
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 197 / 315
Introduction to Paired Comparisons

Paired Comparisons Test: A hypothesis test based on data from


related observations or nonindependent samples.

Key Idea: Paired comparisons are used to control for extraneous


variation by comparing differences within pairs rather than be-
tween independent samples.

Example: Comparing the effectiveness of two sunscreens ap-


plied to opposite sides of a participant’s back.

198
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 198 / 315
Reasons for Pairing

Why Pairing is Used:


Reduces extraneous sources of variation.
Increases confidence in attributing results to the treatment or
intervention.

Remark: Without pairing, differences may be influenced by un-


controlled factors such as skin type or pigmentation.

199
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 199 / 315
Example: Sunscreen Study

Designing the Experiment:


Randomly assign sunscreens A and B to opposite sides of
participants’ backs.
Measure sun damage after exposure.

Objective: Eliminate sources of variation and attribute differ-


ences to the sunscreen effectiveness.

200
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 200 / 315
Hypothesis Testing for Paired Comparisons
Test Statistic:
t = (d̄ − µd0) / sd̄ ,   (1)
where:
d̄ = sample mean difference.
µd0 = hypothesized mean difference.
sd̄ = sd/√n.

Degrees of Freedom: n − 1 for paired comparisons.

Paired comparisons tests are powerful for controlling variation.


Ensure proper experimental design to maximize accuracy.

Alternative Approaches: Consider nonparametric tests such as


the Sign Test if data assumptions are not met.
201
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 201 / 315
Example 7.4.1: Gallbladder Function

Gallbladder Ejection Fraction (GBEF): The percentage of bile


that the gallbladder empties into the intestine after a meal.
John M. Morton et al. (A-14) examined gallbladder function before
and after fundoplication—a surgery used to stop stomach contents
from flowing back into the esophagus (reflux)—in patients with
gastroesophageal reflux disease. The authors measured gallbladder
functionality by calculating the gallbladder ejection fraction (GBEF)
before and after fundoplication. The goal of fundoplication is to
increase GBEF, which is measured as a percent.

The goal of the study is to determine if fundoplication increases


GBEF.
The data are shown in Table 7.4.1:

202
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 202 / 315
Solution: Hypothesis Testing

Preop (%) 22 63.3 96 9.2 3.1 50 33 69 64 18.8 0 34


Postop (%) 63.5 91.5 59 37.8 10.1 19.6 41 87.8 86 55 88 40

Table: Gallbladder Function in Patients with Presentations of


Gastroesophageal Reflux Disease Before and After Treatment

We will use hypothesis testing to determine if sufficient evidence is


provided to conclude that fundoplication increases GBEF.

Goal: Reject the null hypothesis if we can conclude that the pop-
ulation mean change in GBEF (µd ) is greater than zero.
Step 1: Data The data consist of the GBEF for 12 individuals, before
and after fundoplication.

203
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 203 / 315
Hypothesis Testing for Fundoplication

Can we conclude that the fundoplication is effective?

This is the same as asking if we can conclude that the population


mean difference µd is positive (greater than zero).

The null and alternative hypotheses are as follows:


H0 : µd ≤ 0 (The population mean difference is zero)
HA : µd > 0 (The population mean difference is greater than
zero)

204
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 204 / 315
Hypothesis Testing for Fundoplication

If we had obtained the differences by subtracting the postoperative


percentages from the preoperative percentages (preop − postop), our
hypotheses would be:

H0 : µd ≥ 0
HA : µd < 0

If the question had indicated a two-sided test, the hypotheses would


have been:

H0 : µd = 0
HA : µd , 0

205
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 205 / 315
Assumptions and Hypotheses

The differences in GBEF (Postop − Preop) are: Differences = [41.5,


28.2, -37.0, 28.6, 7.0, -30.4, 8.0, 18.8, 22.0, 36.2, 88.0, 6.0]

Assumptions: The observed differences constitute a simple ran-


dom sample from a normally distributed population of differences.

Step 2: Hypotheses
Null Hypothesis (H0 ): The population mean difference µd = 0
Alternative Hypothesis (HA ): The population mean difference
µd > 0

We are testing if fundoplication is effective in increasing GBEF,


so we expect the postoperative percentages to tend to be higher
than the preoperative ones.

206
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 206 / 315
Test Statistic and Decision Rule

Step 3: Test Statistic


The test statistic is given by the formula:

t = (d̄ − µd0)/(sd/√n)
Where:
- d¯ is the sample mean of the differences.
- sd is the sample standard deviation of the differences.
- n is the number of differences.

We will reject H0 if the test statistic is greater than or equal to the


critical value of t = 1.7959.

207
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 207 / 315
Test Statistic and Decision Rule

Step 4: Decision Rule


If t ≥ 1.7959, reject H0 .
Significance level α = 0.05.

Figure: Rejection and nonrejection regions for Example 7.4.1.

208
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 208 / 315
Calculation of the Test Statistic
Step 5: Calculation of Test Statistic
From the data, the sample mean of the differences is:
d̄ = 216.9/12 = 18.075
The sample variance s2d is calculated as:

sd² = ∑(di − d̄)² / (n − 1)
= [n ∑di² − (∑di)²] / [n(n − 1)]
= [12 × 15669.49 − (216.9)²] / (12 × 11)
= 1068.0930

Now, the test statistic is:


209
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 209 / 315
Conclusion

t = 18.075 / √(1068.0930/12) ≈ 1.9159

Since t = 1.9159 is greater than the critical value of 1.7959, we


reject the null hypothesis.

We reject H0 and conclude that there is sufficient evidence to suggest


that fundoplication increases the GBEF in patients with
gastroesophageal reflux disease.

This concludes that fundoplication is effective in improving gall-


bladder function, as measured by the GBEF.

210
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 210 / 315
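A minimal Python sketch of the paired calculation from the preop/postop GBEF data; the formulas mirror the slides, and scipy.stats.ttest_rel would give the same statistic.

# Sketch: paired t test on the GBEF data (Example 7.4.1), differences = postop - preop.
import math
from statistics import mean, stdev

preop = [22, 63.3, 96, 9.2, 3.1, 50, 33, 69, 64, 18.8, 0, 34]
postop = [63.5, 91.5, 59, 37.8, 10.1, 19.6, 41, 87.8, 86, 55, 88, 40]

d = [post - pre for pre, post in zip(preop, postop)]
n = len(d)
d_bar, s_d = mean(d), stdev(d)            # 18.075 and sqrt(1068.09)
t_stat = d_bar / (s_d / math.sqrt(n))     # about 1.916
print(round(d_bar, 3), round(s_d ** 2, 2), round(t_stat, 3))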
p-value Calculation

p-value: For this test, 0.025 < p < 0.05, since


1.7959 < 1.9159 < 2.2010.

MINITAB provides the exact p-value as 0.041 (Figure 7.4.2).

We may conclude that the fundoplication procedure increases GBEF


functioning.

211
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 211 / 315
Confidence Interval for µd

A 95 percent confidence interval for µd may be obtained as follows:


d̄ ± t(1−α/2) · sd/√n

18.075 ± 2.2010 · √(1068.0930/12)

18.075 ± 20.765

(−2.690, 38.840)

This confidence interval provides a range for the true population


mean difference in GBEF.

212
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 212 / 315
The Use of z

If, in the analysis of paired data, the population variance of the


differences is known, the appropriate test statistic is:

z = (d̄ − µd)/(σd/√n)

It is unlikely that σd will be known in practice. In such cases, t-


tests are more commonly used.

213
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 213 / 315
Assumption and Alternatives

If the assumption of normally distributed differences cannot be made,


the central limit theorem may be employed if n is large. In such cases,
the z statistic may be used, with sd substituted for the unknown
population standard deviation σd.

If neither z-test nor t-test is appropriate for use with available data,
one may consider using nonparametric methods.

Alternative: The sign test, discussed in Chapter 13, is a candidate for


testing a hypothesis about a median difference.

214
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 214 / 315
Disadvantages of Paired Comparisons

The use of the paired comparisons test is not without its problems. If
different subjects are used and randomly assigned to two treatments,
considerable time and expense may be involved in trying to match
individuals on relevant variables.

Another disadvantage is the loss of degrees of freedom when using


paired comparisons. If we do not use paired observations, we have
2n − 2 degrees of freedom, compared to n − 1 when paired.

In deciding whether or not to use the paired comparisons procedure,


one should consider the economics and the benefits of controlling
extraneous variation.

215
Chapter 4: Hypothesis Testing 0214STAT: Fundamentals of Biostatistics 215 / 315
Chapter 5: Simple Linear Regression and Correlation

Learning Objectives:

Obtain a simple linear regression model and use it to make


predictions.
Calculate the coefficient of determination and interpret
tests of regression coefficients.
Calculate correlations among variables.

216
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 216 / 315
Introduction to Simple Linear Regression and Correlation

In analyzing data for the health sciences disciplines, it is often de-


sirable to learn about the relationship between two numeric vari-
ables. Examples include:
Blood pressure and age
Height and weight
Concentration of an injected drug and heart rate
Consumption level of some nutrient and weight gain
Intensity of a stimulus and reaction time
Total family income and medical care expenditures

217
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 217 / 315
What is Regression?

Regression analysis assesses the relationship between variables


to predict or estimate one variable based on another.

Introduced by Sir Francis Galton, “regression” describes how vari-


ables revert toward the mean in heredity studies.

218
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 218 / 315
What is Correlation?

Correlation analysis measures the strength of the relationship be-


tween variables. The term was first used by Galton in 1888. This
chapter focuses on linear relationships, with regression followed
by correlation techniques.

Use of Computers in Analysis:

Regression and correlation analysis benefit from computa-


tional tools. Ensure familiarity with the input and output re-
quirements of the software used for processing data.

219
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 219 / 315
The Simple Linear Regression Model

The simple linear regression model is represented as:

y = b0 + b1 x + ϵ

y: Response variable.
b0 , b1 : Regression coefficients.
ϵ: Error term representing deviations from the mean.

220
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 220 / 315
LINE Assumptions

Helpful acronym: LINE


Linear relationship between variables.
Independent errors.
Normal distribution of errors.
Equal variance of errors.

221
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 221 / 315
Assumptions of Linear Regression: Linearity

Linearity:
A straight line is used to model the relationship between X
and Y .
A non-linear pattern indicates the linear regression model is
not suitable.

Figure: Comparison of Linear and Non-Linear Relationships

222
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 222 / 315
Assumptions of Linear Regression: Independence

Independence:
No autocorrelation:
Error terms (residuals) should not be correlated with each
other.
Errors for one observation should not depend on errors for
another.
Implication:
Violations can lead to unreliable standard errors, invalid
hypothesis tests, and confidence intervals.
Example of Violation:
Time-series data often exhibit autocorrelation.
Detection:
Use the Durbin-Watson test to check for autocorrelation.

223
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 223 / 315
Assumptions of Linear Regression: Normality

Normality:
The error term (ϵ) must follow a normal distribution.

Methods to Check Normality:


Analytical Methods:
Use statistical tests such as:
Kolmogorov-Smirnov Test
Shapiro-Wilk Test
p-value > 0.05 indicates no significant deviation from normality.
Graphical Methods:
Use visualizations such as:
Histogram of residuals
Q-Q plot

224
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 224 / 315
Assumptions of Linear Regression: Equal Variances

Equal Variances (Homoscedasticity):


Residuals should have constant variance across all levels of
the dependent variable.

Figure: Visual Test for Homoscedasticity

Testing Homoscedasticity:
Plot residuals (errors) against the dependent variable (Y ).
225
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 225 / 315
Assumptions of Linear Regression: Equal Variances

Homoscedasticity: If the residuals scatter evenly across the range


of Y , the data exhibit homoscedasticity.

Heteroscedasticity: If the residuals form patterns (e.g., funnel


shape), it indicates heteroscedasticity (unequal variances).

Implications of Heteroscedasticity:
Leads to inefficient estimates of regression coefficients.
May affect hypothesis tests and confidence intervals.

226
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 226 / 315
Graphical Representation of the Regression Model

Bivariate Normal Distribution:


The joint distribution of X (independent variable) and Y
(dependent variable) forms a symmetric, mound-shaped surface.
For a fixed X:
Imagine slicing the 3D surface parallel to the Y -axis.
Reveals the normal distribution of Y values for that X.
For a fixed Y :
Slicing parallel to the X-axis reveals the normal distribution of X
values for that Y .

227
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 227 / 315
Graphical Representation of the Regression Model

Figure: Regression Line and Subpopulations of Y

The regression line represents the means of the Y subpopulations


for each X, while the distribution around this line is normal.
228
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 228 / 315
The Sample Regression Equation

Population Regression Equation: Describes the true relation-


ship between Y (response variable) and X (predictor variable).

Sample Regression Equation: Computed using sample data, it


is used to infer the form of the population regression equation.

Steps in Regression Analysis:


1 Check assumptions underlying linearity.
2 Obtain the regression equation for the sample data.
3 Evaluate the regression equation for strength and utility.
4 Use the equation for prediction and estimation.

229
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 229 / 315
Prediction and Estimation

Use the regression equation to predict the value of Y for a given


X.
Use the regression equation to estimate the mean of the
subpopulation of Y values for a given X.

Key Note: The sample data provide known values of both X and
Y . When using the regression equation, only X values are known.

230
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 230 / 315
Example: Simple Linear Regression Analysis

We will now illustrate the steps involved in simple linear regression


analysis by means of an example.

Step 1: Check if assumptions for linearity are met.


Step 2: Obtain the regression equation.
Step 3: Evaluate the regression equation.
Step 4: Use the equation for prediction and estimation.

231
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 231 / 315
Example 9.3.1: Predicting Deep Abdominal AT from Waist
Circumference
Study:
Deep abdominal adipose tissue (AT) is linked to cardiovascular
disease risks.
Computed tomography (CT) accurately measures deep
abdominal AT but is expensive and involves radiation.
CT is not widely available for routine use by most physicians.
Study Objective:
Researchers aimed to predict deep abdominal AT using simpler
body measurements, like waist circumference.
The study involved healthy men aged 18–42 with no metabolic
diseases requiring treatment.
Data:
Measurements included deep abdominal AT (via CT) and waist
circumference (Table 9.3.1).
232
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 232 / 315
Table 9.3.1: Waist Circumference and Deep Abdominal AT
Subject Waist Circumference (cm) Deep Abdominal AT (cm2 )
1 74.75 25.72
2 72.60 25.89
3 81.80 42.60
4 83.95 42.80
5 74.65 29.84
6 71.85 21.68
7 80.90 29.08
8 83.40 32.98
9 63.50 11.44
10 73.20 32.22

Table: 9.3.1: Waist Circumference (X) and Deep Abdominal AT (Y) for a
Sample of 10 Subjects from 109 (Page 418).

Research Question: Can waist circumference reliably predict deep


abdominal AT?
Analysis:
This is a regression analysis problem:
Dependent variable: Deep abdominal AT.
Independent variable: Waist circumference.
233
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 233 / 315
Scatter Diagram

A first step in studying the relationship between two variables is to


prepare a scatter diagram.

The independent
variable X is plotted on
the horizontal axis.
The dependent variable
Y is plotted on the
vertical axis.
The pattern of points on the scatter diagram suggests the nature and
strength of the relationship.

Figure: Scatter diagram of the data shown in Table 9.3.1

234
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 234 / 315
Interpreting the Scatter Diagram

From the scatter diagram, we observe:

The points are scattered around an invisible straight line.


Subjects with larger waist circumferences generally have
larger amounts of deep abdominal AT.
The relationship between X and Y appears to be linear,
with a possible 45-degree angle.
However, drawing a line freehand might introduce subjective
judgment errors.

235
Chapter 5: Simple Linear Regression and Correlation 0214STAT: Fundamentals of Biostatistics 235 / 315
The Least-Squares Line

The Least-Squares Line is the line that best describes the


relationship between X and Y , minimizing the sum of squared
differences between the observed and predicted values.
Equation of a Straight Line
The general equation of a straight line is:

y = b0 + b1 x

where:
b0 is the y-intercept.
b1 is the slope of the line.

To find the line, we need to calculate the values of b0 and b1 .

Calculating the Least-Squares Line

The least-squares estimates for b1 (slope) and b0 (intercept) are given


by:
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²   (sums over i = 1, …, n)

b0 = ȳ − b1 x̄
where:
xi and yi are the data points.
x̄ and ȳ are the sample means of X and Y , respectively.
These formulas are typically computed using software like MINITAB.
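
As an illustration (not part of the textbook's MINITAB analysis), the following
short Python sketch applies these formulas to the 10-subject subsample listed in
Table 9.3.1. Because the textbook equation is fitted to all 109 subjects, the
coefficients obtained here will differ from ŷ = −216 + 3.46x.

import numpy as np

x = np.array([74.75, 72.60, 81.80, 83.95, 74.65, 71.85, 80.90, 83.40, 63.50, 73.20])
y = np.array([25.72, 25.89, 42.60, 42.80, 29.84, 21.68, 29.08, 32.98, 11.44, 32.22])

x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)   # slope estimate
b0 = y_bar - b1 * x_bar                                             # intercept estimate

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")
print(f"illustrative prediction at waist = 80 cm: {b0 + b1 * 80:.2f}")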

Example: Calculating the Regression Line

Using software such as MINITAB, the least-squares regression


equation can be obtained.
Enter the data for X and Y into the software.
The output will include the regression equation, which is used
for prediction.

The Least-Squares Regression Line

From Figure 9.3.2, the linear equation for the least-squares line
describing the relationship between waist circumference (X) and
deep abdominal AT (Y ) is: ŷ = −216 + 3.46x

Since b̂0 (intercept) is negative, the line crosses the Y-axis below
the origin.
Since b̂1 (slope) is positive, the line extends from the lower
left-hand corner to the upper right-hand corner of the graph.
For each unit increase in X, Y increases by 3.46 units.
The symbol ŷ represents a predicted value of Y , rather than an
observed value.

The least-squares regression line minimizes the sum of squared


vertical distances between the observed and predicted values of
Y.
Calculating Coordinates for the Line

To draw the least-squares line, substitute two convenient values of


X into the equation ŷ = −216 + 3.46x.
For X = 70:

ŷ = −216 + 3.46(70) = 26.2

For X = 110:

ŷ = −216 + 3.46(110) = 164

These coordinates, (70, 26.2) and (110, 164), can be used to plot
the least-squares line. The graph in the figure illustrates the orig-
inal data and the least-squares line.

Simple Linear Regression

Figure: 9.3.2: Original data and least-squares line for Example

The Least-Squares Criterion

The least-squares line is considered the ”best fit” line for describ-
ing the relationship between the two variables. But what makes it
the best?
The least-squares line minimizes the sum of squared vertical
deviations between the observed data points (yi ) and the line.
The sum of squared deviations is smaller for the least-squares
line than for any other line that can be drawn through the points.

In other words, if we calculate the vertical distance (deviation)


from each observed data point to the least-squares line, square
each of these deviations, and sum them up, the total will be the
smallest for the least-squares line compared to any other line. This
is why it is called the least-squares line.
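
A small numerical check of this criterion, again using the 10-subject subsample
from Table 9.3.1 as assumed data: the sum of squared vertical deviations for the
fitted line is smaller than for any other line, including slightly perturbed ones.

import numpy as np

x = np.array([74.75, 72.60, 81.80, 83.95, 74.65, 71.85, 80.90, 83.40, 63.50, 73.20])
y = np.array([25.72, 25.89, 42.60, 42.80, 29.84, 21.68, 29.08, 32.98, 11.44, 32.22])

# Least-squares estimates for this subsample
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

def sse(intercept, slope):
    # Sum of squared vertical deviations of the points from the line
    return np.sum((y - (intercept + slope * x)) ** 2)

print(sse(b0, b1))          # smallest possible value for these data
print(sse(b0 + 5, b1))      # shifting the line increases the sum
print(sse(b0, b1 + 0.3))    # changing the slope also increases it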

The Correlation Coefficient

The bivariate normal distribution has five parameters:


σx , σy (standard deviations)
µx , µy (means)
ρ (population correlation coefficient)
The sample correlation coefficient r estimates ρ and measures the strength
of the linear relationship between X and Y .

Formula for the Correlation Coefficient

A formula for computing r is:


r = [n Σxi yi − (Σxi)(Σyi)] / √[(n Σxi² − (Σxi)²)(n Σyi² − (Σyi)²)]

The formula allows computation of r without first computing the
regression coefficient b. Additionally, the correlation coefficient can
also be expressed as:

Sxy
r=
Sx Sy

where Sxy is the covariance between X and Y , Sx is the standard


deviation of X, and Sy is the standard deviation of Y .
The covariance Sxy is given by:

Sxy = [1/(n − 1)] Σ(xi − x̄)(yi − ȳ)
where x̄ and ȳ are the means of X and Y , respectively.
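
The two expressions for r give the same value. A brief sketch applied to the
10-subject subsample from Table 9.3.1 (an assumption for illustration; the
textbook's reported r values refer to the full data sets):

import numpy as np

x = np.array([74.75, 72.60, 81.80, 83.95, 74.65, 71.85, 80.90, 83.40, 63.50, 73.20])
y = np.array([25.72, 25.89, 42.60, 42.80, 29.84, 21.68, 29.08, 32.98, 11.44, 32.22])
n = len(x)

# Computational formula for r
num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
den = np.sqrt((n * np.sum(x**2) - np.sum(x)**2) * (n * np.sum(y**2) - np.sum(y)**2))
r_formula = num / den

# Covariance formula: r = Sxy / (Sx * Sy)
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
r_cov = s_xy / (x.std(ddof=1) * y.std(ddof=1))

print(r_formula, r_cov)   # the two expressions agree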

Correlation Assumptions

The following assumptions must hold for inferences about the


population to be valid when sampling is from a bivariate distribution:
For each value of X, there is a normally distributed
subpopulation of Y values.
For each value of Y , there is a normally distributed
subpopulation of X values.
The joint distribution of X and Y is a normal distribution, called
the bivariate normal distribution.
The subpopulations of Y values all have the same variance.
The subpopulations of X values all have the same variance.

Graphical Representation of the Bivariate Normal
Distribution
The bivariate normal distribution is represented graphically in Figure
9.6.1.

In this illustration, if we slice the mound parallel to Y at some


value of X, the cutaway reveals the corresponding normal distri-
bution of Y . Similarly, a slice through the mound parallel to X at
some value of Y reveals the corresponding normally distributed
subpopulation of X.
For the figure below:
(a) A bivariate normal distribution
(b) A cutaway showing a normally distributed subpopulation of Y
for given X.
(c) A cutaway showing a normally distributed subpopulation of X
for given Y .
Figure: 9.6.1: A bivariate normal distribution.

Sample Correlation Coefficient

The sample correlation coefficient r will always have the same sign as
the sample slope b.
If r = 1, perfect direct linear correlation.
If r = −1, perfect inverse linear correlation.
If r = 0, no linear correlation.

Table: Absolute Value of r and Strength of Relationship

Absolute Value of r Strength of Relationship


r < 0.3 None or Very Weak
0.3 ≤ r < 0.5 Weak
0.5 ≤ r < 0.7 Moderate
r ≥ 0.7 Strong

Figure 9.4.6: Scatter diagrams showing (a) direct linear relationship,
(b) inverse linear relationship, and (c) no linear relationship between
X and Y .

Hypothesis Test for Correlation

To test if there is a correlation between X and Y , we conduct the


following steps:
1 Data: Use the provided data (Example 9.7.1 Page 448).
2 Assumptions: Apply the above assumptions.
3 Hypotheses:
Null hypothesis H0 : ρ = 0
Alternative hypothesis HA : ρ ≠ 0

Test Statistic for Correlation

The test statistic for correlation is:



t = r √(n − 2) / √(1 − r²)
This follows a t-distribution with n − 2 degrees of freedom when H0
is true.

Decision Rule:
Using a significance level α = 0.05, we find the critical values
for t:
±1.9754 (for 153 degrees of freedom)
If the calculated t-value is outside this range, we reject H0 .

Output of Example 9.7.1

MINITAB output for Example 9.7.1 using the simple regression

procedure.

Test Statistic Calculation

The computed t-value for r = 0.848 is:



t = 0.848 √153 / √(1 − 0.719) = 19.787
Since t = 19.787 > 1.9754, we reject the null hypothesis.

Conclusion:
Based on the hypothesis test, we conclude that there is a significant
linear correlation between height and SEP levels in the population.
p-value < 0.005
t = 19.787 > 2.6085
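
A quick sketch that re-computes this t statistic and its two-sided p-value from
the quoted quantities (r = 0.848, n = 155, hence 153 degrees of freedom):

import numpy as np
from scipy import stats

r, n = 0.848, 155
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p_two_sided = 2 * stats.t.sf(abs(t), df=n - 2)

print(round(t, 3))     # about 19.79
print(p_two_sided)     # far below 0.005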

Hypotheses About the Slope b1 : The Test Statistic

The Test Statistic: For testing hypotheses about b1 , the test statistic
when σ²y|x is known is given by:

z = (b̂1 − b1(0)) / σb̂1        (9.4.8)

where b1(0) is the hypothesized value of b1 and the standard error of


the estimator is

σb̂1 = √[ σ²y|x / Σ(xi − x̄)² ]

Important Note: The hypothesized value b1(0) does not have to


be zero, but in practice, most null hypotheses assume b1 = 0.

Hypotheses About the Slope b1 : The Test Statistic When σ²y|x Is Unknown
As a rule, σ²y|x is unknown. In this case, the test statistic becomes:

t = (b̂1 − b1(0)) / sb̂1        (9.4.9)

where sb̂1 is an estimate of σb̂1 :

sb̂1 = √[ s²y|x / Σ(xi − x̄)² ]

and t follows a Student’s t distribution with n − 2 degrees of freedom.

In most practical situations, the 100(1 - α)% confidence interval


for b1 is given by:
b̂1 ± t(1−α/2) sb̂1

Example 9.4.2

Refer to Example 9.3.1. We wish to know if the slope of the


population regression line describing the relationship between X and
Y is zero.
Sol:
1. Data: See Example 9.3.1.
2. Assumptions: The simple linear regression model and its
assumptions apply.

Example 9.4.2
3. Hypotheses:

H0 : b1 = 0 vs. HA : b1 ≠ 0 at α = 0.05

4. Test Statistic: The test statistic is given by Equation (9.4.9).


5. Distribution: The test statistic follows a Student’s t distribution
with n − 2 degrees of freedom under H0 .
6. Decision Rule: Reject H0 if t ≥ 1.9826 or t ≤ −1.9826.
7. Calculation: Using output p.238 in this note, we have

t = (b̂1 − 0) / sb̂1 = 3.4589 / 0.2347 = 14.74

8. Statistical Decision: Reject H0 because t = 14.74 > 1.9826.


9. Conclusion: The slope of the population regression line is not
zero.
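
The same calculation, together with the 95% confidence interval for b1, can be
reproduced in a few lines of Python. This is only a sketch: the numbers
b̂1 = 3.4589, sb̂1 = 0.2347, and n = 109 are taken from the regression output
quoted earlier, and the computed critical value may differ slightly from 1.9826
because of rounding.

from scipy import stats

b1_hat, s_b1, n = 3.4589, 0.2347, 109
t = (b1_hat - 0) / s_b1                    # test statistic for H0: b1 = 0
t_crit = stats.t.ppf(0.975, df=n - 2)      # two-sided critical value, alpha = 0.05
lower = b1_hat - t_crit * s_b1
upper = b1_hat + t_crit * s_b1

print(round(t, 2), round(t_crit, 4))       # 14.74 and roughly 1.98
print(round(lower, 3), round(upper, 3))    # 95% confidence interval for the slope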
Chapter 6: The Chi-square Distribution and the Analysis of
Frequencies

Learning Objectives:

After studying this chapter, the student will:


Understand the mathematical properties of the chi-square
distribution.
Be able to use the chi-square distribution for goodness-of-fit
tests.
Be able to construct and use contingency tables to test
independence and homogeneity.

Introduction to The Mathematical Properties of the
Chi-Square Distribution

The chi-square distribution is essential in statistical analysis,


used extensively in hypothesis testing and confidence interval
estimation.
It is commonly employed to analyze frequency data, such as
testing the relationship between gender and insurance coverage
type, or geographical area and diagnosis.
In this chapter, we explore the applications and mathematical
properties of the chi-square distribution.

The Properties of the Chi-Square Distribution

The chi-square distribution can be derived from normal


distributions. If Y ∼ N (µ, σ 2 ) (a normally distributed random
variable), we transform it into a standard normal variable z using:

z = (y − µ) / σ

Squaring z gives a variable that follows a chi-square distri-


bution with 1 degree of freedom: χ2 (1) = z 2

For multiple variables, the sum of squared standard nor-


mal variables follows a chi-square distribution with degrees
of freedom equal to the number of terms: χ²(n) = z₁² + z₂² + · · · + zₙ²

Mathematical Formula for Chi-Square Distribution
The probability density function (pdf) of the chi-square distribution
with k degrees of freedom is given by:
f(u) = u^(k/2 − 1) e^(−u/2) / (Γ(k/2) 2^(k/2)),   u > 0
where:
Γ(k/2) is the Gamma function, a generalization of the factorial
function, defined for x > 0 as:
Γ(x) = ∫₀^∞ t^(x−1) e^(−t) dt

e is Euler’s number, approximately 2.71828.


Additional properties:
Mean: k
Variance: 2k
Mode: k − 2 (for k ≥ 2), and 0 for k = 1.
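
A small simulation sketch illustrating the construction above: summing k squared
standard normal variables produces a chi-square(k) variable whose sample mean is
close to k and whose sample variance is close to 2k. (The seed and number of
draws are arbitrary choices.)

import numpy as np

rng = np.random.default_rng(0)
k, n_draws = 5, 200_000

z = rng.standard_normal((n_draws, k))   # n_draws sets of k standard normal values
chi2_draws = np.sum(z**2, axis=1)       # each row sums to one chi-square(k) draw

print(chi2_draws.mean())   # close to k = 5
print(chi2_draws.var())    # close to 2k = 10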
Introduction to Chi-Square Tests

The chi-square distribution is used for testing hypotheses with


frequency data.
Hypothesis testing procedures include:
Goodness-of-fit tests
Tests of independence
Tests of homogeneity
All chi-square tests can be viewed as goodness-of-fit tests.
These tests compare observed frequencies with expected
frequencies under a specific theory or hypothesis.

Observed Versus Expected Frequencies

The chi-square statistic is most appropriate for use with categorical


variables, such as marital status, with categories: married, single,
widowed, and divorced. Key concepts:

Observed Frequencies: The number of subjects or objects in the


sample that fall into the various categories of the variable.

Expected Frequencies: The number of subjects or objects in the


sample expected under the null hypothesis.

Chi-Square Test Statistic

The test statistic for the chi-square tests is given by:

χ² = Σ (Oi − Ei)² / Ei

Where:
Oi is the observed frequency for the i-th category.
Ei is the expected frequency for the i-th category.

The test statistic χ2 follows a chi-square distribution with:

k−r degrees of freedom.

Perfect Match Between Observed and Expected Frequencies

If the observed (Oi ) and expected (Ei ) frequencies match perfectly


under H0 , the difference is:

Oi − Ei = 0 for all terms.

In this case, the chi-square test statistic is:


χ² = Σ (Oi − Ei)² / Ei = 0

Conclusion: Fail to reject H0 .

Disagreement Between Observed and Expected Frequencies

When Oi and Ei differ:


Oi − Ei ≠ 0

Disagreement increases (Oi − Ei )2 , making χ2 larger.


If χ2 exceeds the critical value from the chi-square distribution:

Reject H0

Larger χ2 =⇒ Poorer agreement =⇒ Reject H0

Chi-Square Test Statistic

The Decision Rule:

The quantity:
(Oi − Ei)² / Ei
will be small if the observed and expected frequencies are close, and
large if the differences are significant.

The computed value of χ2 is compared with the tabulated chi-square


value for the chosen significance level (α) and degrees of freedom.

Decision: Reject H0 if:

χ2 ≥ tabulated chi-square value at α.

Chi-Square Test Statistic
Handling Small Expected Frequencies:

In some applications of the chi-square test, one or more expected


frequencies may be small (e.g., less than 5 or even 1). When this
occurs, the approximation of χ2 to a chi-square distribution is not
strictly valid.

Solutions:
Combine adjacent categories to achieve the minimum expected
frequency (suggested to be at least 5).
Reduce the degrees of freedom.

Note: Cochran suggests that for goodness-of-fit tests of unimodal


distributions, the minimum expected frequency can be as low as
1.

Goodness-of-Fit Test

A goodness-of-fit test is appropriate when one wishes to decide if


an observed distribution of frequencies is incompatible with some
preconceived or hypothesized distribution.

For example, we may wish to determine whether or not a sample


of observed values of some random variable is compatible with
the hypothesis that it was drawn from a population of values that
is normally distributed.

Procedure for Goodness-of-Fit Test

The procedure for performing a goodness-of-fit test involves the


following steps:

Place the values into mutually exclusive categories or class


intervals.
Note the frequency of occurrence of values in each
category.
Use knowledge of the hypothesized distribution (e.g.,
normal) to determine expected frequencies.
If the discrepancy between observed and expected
frequencies is large, we may conclude that the sample did
not come from the hypothesized distribution.

Example 12.3.1: The Normal Distribution
Study: Cranor and Christensen (A-1) conducted a study to assess
short-term clinical, economic, and humanistic outcomes of
pharmaceutical care services for patients with diabetes. The
cholesterol levels of 47 subjects are summarized in the table below:
Cholesterol Level (mg/dl) Number of Subjects
100.0–124.9 1
125.0–149.9 3
150.0–174.9 8
175.0–199.9 18
200.0–224.9 6
225.0–249.9 4
250.0–274.9 4
275.0–299.9 3

The goal of this study is to determine whether the cholesterol levels


in the sample population are normally distributed.
Hypotheses and Test Statistic

Null Hypothesis H0 : In the population from which the sample


was drawn, cholesterol levels are normally distributed.
Alternative Hypothesis HA : The sampled population is not nor-
mally distributed.

The test statistic for the goodness-of-fit test is given by:

χ² = Σ (Oi − Ei)² / Ei   (summed over i = 1, …, k)

where:
Oi = Observed frequency in class i
Ei = Expected frequency in class i

Decision Rule

If H0 is true, the test statistic follows a chi-square distribution with


k − r degrees of freedom, where:
k = number of categories
r = number of parameters estimated

Reject H0 if the computed value of χ2 is greater than or equal to


the critical value of chi-square χ21−α,k−r at the chosen significance
level α.

Calculation of Test Statistic

The sample mean and standard deviation are used to estimate the
parameters of the hypothesized normal distribution:

x̄ = 198.67, s = 41.31

The next step is to determine the expected frequencies for each class
interval based on these estimates.

The expected frequency for each class interval is determined by


multiplying the expected relative frequency of values in the inter-
val by the total number of values in the sample.

This requires calculating the expected distribution based on the


normal distribution parameters.

Expected Relative Frequencies

We calculate the expected relative frequencies using the normal


distribution. We first compute the z-scores for the lower and upper
limits of each class interval.
The formula for the z-score is:

z0 = (x0 − µ) / σ

where x0 is the value in the class interval, µ is the mean, and σ is the
standard deviation.


Expected Relative Frequencies

For the class interval 100.0 to 124.9:


For x0 = 100.0: z0 = (100.0 − 198.67) / 41.31 = −2.39
For x0 = 125.0: z0 = (125.0 − 198.67) / 41.31 = −1.78
The area to the left of z = −2.39 is 0.0084, and the area to the left of
z = −1.78 is 0.0375.

The expected relative frequency for the interval 100.0 to 124.9 is


the area between the two z-values:

Expected Relative Frequency = 0.0375 − 0.0084 = 0.0291


This corresponds to 2.91% of the values in the sample.

Expected Relative Frequency

We should expect 2.91 percent of the values in our sample to be


between 100.0 and 124.9.
When we multiply our total sample size, 47, by 0.0291, we find
the expected frequency for the interval to be 1.4.
Similar calculations will give the expected frequencies for the
other intervals as shown in the table.
For the class interval 125.0 to 149.9:
For x0 = 125.0: z0 = (125.0 − 198.67) / 41.31 = −1.78
For x0 = 150.0: z0 = (150.0 − 198.67) / 41.31 = −1.18

Expected Relative Frequency

The area to the left of z = −1.78 is 0.0375, and the area to the
left of z = −1.18 is 0.1190.

Expected Relative Frequency = 0.1190 − 0.0375 = 0.0815

We should expect 8.15 percent of the values in our sample to be


between 125.0 and 149.9.
When we multiply our total sample size, 47, by 0.0815, we find
the expected frequency for the interval to be 3.8.
Similar calculations will give the expected frequencies for the
other intervals as shown in the table.
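
The whole set of expected frequencies can be generated at once. The sketch below
uses the sample estimates x̄ = 198.67 and s = 41.31 and the class boundaries from
the table, treating values below 100 and at or above 300 as open-ended classes.
The results agree with the table to within rounding (the handout rounds the
z-scores to two decimals before looking up the areas).

import numpy as np
from scipy.stats import norm

x_bar, s, n = 198.67, 41.31, 47
edges = np.array([100, 125, 150, 175, 200, 225, 250, 275, 300])   # class boundaries

cdf = norm.cdf(edges, loc=x_bar, scale=s)
rel_freq = np.diff(np.concatenate(([0.0], cdf, [1.0])))   # <100, eight classes, >=300
expected = n * rel_freq

print(np.round(rel_freq, 4))   # the 100.0-124.9 class gives about 0.029
print(np.round(expected, 1))   # about 0.4, 1.4, 3.8, 7.8, ... as in the table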

Observed and Expected Frequencies

Class Interval      Observed Frequency Oi      Expected Frequency Ei      (Oi − Ei)²/Ei
< 100 0 0.4 -
100.0–124.9 1 1.4 0.356
125.0–149.9 3 3.8 0.168
150.0–174.9 8 7.8 0.005
175.0–199.9 18 10.7 4.980
200.0–224.9 6 10.7 2.064
225.0–249.9 4 7.2 1.422
250.0–274.9 4 3.5 0.071
275.0–299.9 3 1.2 1.500
300.0 and greater 0 0.3 -

Observed and Expected Frequencies

The first, second, and last entries in the last column, for example, are
computed as:
First entry: (1 − 1.8)² / 1.8 = 0.356
Second entry: (3 − 3.8)² / 3.8 = 0.168
Last entry: (3 − 1.5)² / 1.5 = 1.500
The other values of (Oi − Ei)² / Ei are computed in a similar manner.
Note that the open-ended classes are combined with their small neighbouring
classes before these entries are computed: the first entry uses the combined
observed frequency 0 + 1 = 1 and expected frequency 0.4 + 1.4 = 1.8, and the
last entry uses 3 + 0 = 3 and 1.2 + 0.3 = 1.5.

Degrees of Freedom Calculation

Total number of groups or class intervals: 8


Number of restrictions: 3
Ensuring ΣEi = ΣOi
Estimating µ (mean) from the sample data
Estimating σ (standard deviation) from the sample data
Degrees of freedom: 8 − 3 = 5

Remark: Condition ΣEi = ΣOi
The condition ΣEi = ΣOi ensures consistency between:
ΣOi : The total observed frequencies from the data.
ΣEi : The total expected frequencies under H0 .
This is required to:
Ensure the total sample size is the same for observed and
expected distributions.
Maintain fairness in comparing observed (Oi ) and expected (Ei )
frequencies.
This condition is one of the restrictions subtracted from the total
degrees of freedom when performing the test.
Example from the preceding table:
Suppose there are 47 observations across 8 intervals.
ΣOi = 47, so ΣEi must also be adjusted to 47.

Chi-Square Test for Goodness-of-Fit

Chi-Square Statistic Calculation: The chi-square statistic is


calculated as: χ² = Σ (Oi − Ei)² / Ei .
For this dataset, the calculated value is: χ2 = 10.566.
Hypothesis Testing: We compare the calculated chi-square statistic
to the critical value from the chi-square distribution table with 5
degrees of freedom and a significance level of α = 0.05:
χ21−0.05,5 = χ20.95,5 = 11.070.
Since χ2 = 10.566 is less than χ20.95,5 = 11.070, we fail to
reject the null hypothesis.
Conclusion: As a result of the hypothesis test, we fail to reject the
null hypothesis, indicating that the cholesterol levels likely follow
a normal distribution.

Chi-Square Test for Goodness-of-Fit

P-Value Interpretation: The p-value is the probability of observing


a test statistic as extreme as χ2 = 10.566, assuming the null
hypothesis is true.
To determine the p-value range, we refer to the chi-square
distribution table. For 5 degrees of freedom:

At χ21−0.10,5 = χ20.90,5 = 9.236, the p-value is 0.10.


At χ21−0.05,5 = χ20.95,5 = 11.070, the p-value is 0.05.

Since χ2 = 10.566 lies between χ20.90,5 and χ20.95,5 , the p-value


is bounded as:
0.05 < p < 0.10.
The p-value is not small enough (p > 0.05) to reject the null
hypothesis, so the data are consistent with a normal distribution.
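
For reference, a short sketch that reproduces the test statistic, critical value,
and p-value from the observed and (rounded) expected frequencies in the table,
after combining the open-ended classes with their neighbours. Because the
expected values are the rounded table entries, the statistic is approximate.

import numpy as np
from scipy.stats import chi2

observed = np.array([1, 3, 8, 18, 6, 4, 4, 3])                    # <125 and >=275 combined
expected = np.array([1.8, 3.8, 7.8, 10.7, 10.7, 7.2, 3.5, 1.5])   # rounded values from the table

chi_sq = np.sum((observed - expected) ** 2 / expected)
df = 8 - 3                                                        # 8 classes, 3 restrictions

print(round(chi_sq, 3))                # about 10.57
print(round(chi2.ppf(0.95, df), 3))    # critical value 11.070
print(round(chi2.sf(chi_sq, df), 3))   # p-value, between 0.05 and 0.10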

Example 12.3.5: Chi-Square Goodness-of-Fit Test

Problem: A certain human trait is thought to be inherited according


to the ratio 1:2:1 for homozygous dominant, heterozygous, and
homozygous recessive. An examination of a simple random sample
of 200 individuals yielded the following distribution of the trait:
dominant, 43; heterozygous, 125; and recessive, 32. We wish to know
if these data provide sufficient evidence to cast doubt on the belief
about the distribution of the trait.

Note: This example sets up a chi-square goodness-of-fit test to


assess if the observed distribution deviates from the expected ratio
of 1:2:1.

Example 12.3.5: Chi-Square Goodness-of-Fit Test

Step 1: Data The data from the example are as follows:


Dominant: 43
Heterozygous: 125
Recessive: 32
Step 2: Assumptions We assume that the data meet the requirements
for the application of the chi-square goodness-of-fit test.

Step 3: Hypotheses The hypotheses are:


Null hypothesis H0 : The trait is distributed according to the ratio
1:2:1 for homozygous dominant, heterozygous, and homozygous
recessive.
Alternative hypothesis HA : The trait is not distributed according
to the ratio 1:2:1.

Example 12.3.5: Chi-Square Goodness-of-Fit Test

Step 4: Test Statistic The test statistic is:


χ² = Σ (Oi − Ei)² / Ei

where, for each i, Oi represents the observed frequency and Ei


represents the expected frequency.

To compute the expected frequencies (Ei ) under the null hypothesis:


1. Null hypothesis ratio: The trait is distributed in the ratio
1 : 2 : 1, meaning:
1 part for homozygous dominant,
2 parts for heterozygous,
1 part for homozygous recessive.

Example 12.3.5: Chi-Square Goodness-of-Fit Test
2. Total parts in the ratio: The sum of parts is:

1 + 2 + 1 = 4.

3. Sample size: The total number of individuals in the sample is


n = 200.
4. Proportion for each category: Based on the ratio:
Dominant: 1/4,   Heterozygous: 2/4 = 1/2,   Recessive: 1/4.
5. Expected frequencies: Multiply each proportion by the total
sample size:
Dominant: (1/4) × 200 = 50
Heterozygous: (1/2) × 200 = 100
Recessive: (1/4) × 200 = 50
6. Verification: The sum of expected frequencies matches the total
sample size: 50 + 100 + 50 = 200.
Example 12.3.5: Chi-Square Goodness-of-Fit Test
The test statistic is computed as:

χ² = (43 − 50)²/50 + (125 − 100)²/100 + (32 − 50)²/50 = 13.71
These expected frequencies represent the values we would expect if
the null hypothesis were true.
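
The same computation can be checked with SciPy's chisquare routine, passing the
observed counts and the 1:2:1 expected counts (a sketch, not part of the
original example):

from scipy.stats import chisquare

observed = [43, 125, 32]
expected = [50, 100, 50]        # 1:2:1 ratio applied to n = 200

result = chisquare(f_obs=observed, f_exp=expected)
print(result.statistic)         # 13.71
print(result.pvalue)            # about 0.001, well below 0.005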
Step 5: Distribution of Test Statistic If the null hypothesis H0 is
true, the test statistic χ2 follows a chi-square distribution with k − 1
degrees of freedom, where k is the number of categories. In this
P P
example, we have one constraint i Oi = i Ei :

Degrees of freedom = k − 1 = 3 − 1 = 2.

Step 6: Decision Rule For a significance level of α = 0.05, the


critical value for χ20.95,2 is:

χ21−0.05,2 = χ20.95,2 = 5.991.


Example 12.3.5: Chi-Square Goodness-of-Fit Test

This means that if the calculated value of χ2 exceeds 5.991, we reject


the null hypothesis H0 .

Remark: The degrees of freedom are calculated based on the


number of categories minus one. The critical value helps us deter-
mine whether the computed test statistic is large enough to reject
the null hypothesis at a 5% significance level.

Important: The test statistic χ2 measures how well the observed


data matches the expected data under the null hypothesis. A large
χ2 value suggests a poor fit and leads to rejection of the null hy-
pothesis.

Example 12.3.5: Chi-Square Goodness-of-Fit Test

In this case: χ2 = 13.71 > 5.991. Since 13.71 is greater than the
critical value, we reject H0 . This suggests that the observed data do
not align with the expected distribution under the 1:2:1 ratio.
Step 7: P-Value The p-value is the probability of observing a test
statistic at least as extreme as χ2 = 13.71 under the assumption that
H0 is true. From the chi-square distribution table for 2 degrees of
freedom:

At χ20.99,2 = 9.210, the p-value is 0.01. This means there is


only a 1% chance of observing a chi-square value greater
than 9.210 under H0 .
At χ20.995,2 = 10.597, the p-value is 0.005. This means there
is only a 0.5% chance of observing a chi-square value
greater than 10.597 under H0 .

Example 12.3.5: Chi-Square Goodness-of-Fit Test

Since χ2 = 13.71 exceeds χ20.995,2 = 10.597, the p-value is less than


0.005: p < 1 − 0.995 = 0.005.
Interpretation of the P-Value: The extremely small p-value
confirms the decision to reject H0 , supporting the conclusion that the
trait is not distributed according to the 1:2:1 ratio.

Remark: A p-value this small (less than 0.005) provides very


strong evidence against H0 , further confirming the observed data
does not align with the expected distribution under the null hy-
pothesis.

This p-value, being smaller than 0.005, leads to rejecting H0 , em-


phasizing that the trait’s distribution significantly differs from the
expected 1:2:1 ratio.

Introduction to Tests of Independence

The chi-square distribution is frequently used to test the null


hypothesis that two criteria of classification, when applied to the
same set of entities, are independent.
Two criteria are independent if the distribution of one criterion
is the same regardless of the distribution of the other.
Example: If socioeconomic status and area of residence are
independent, the proportion of families in different
socioeconomic groups should be the same across all areas.

Contingency Table

The classification of entities according to two criteria can be


shown in a contingency table.
The rows represent levels of one criterion, and the columns
represent levels of the second criterion.
A typical contingency table looks like the following:

Table: Two-Way Classification of a Finite Population of Entities

The table shows the classification of entities under two criteria, with the
rows representing levels of the first criterion and the columns representing
levels of the second criterion. The total for each row and column is also
provided.

                      Level 1   Level 2   ...   Total
Criterion 1 Level 1   N11       N12       ...   N1.
Criterion 1 Level 2   N21       N22       ...   N2.
Criterion 1 Level 3   N31       N32       ...   N3.
...                   ...       ...       ...   ...
Criterion 1 Level r   Nr1       Nr2       ...   Nr.
Total                 N.1       N.2       ...   N

Testing the Null Hypothesis

The null hypothesis is that the two criteria are independent.


If the hypothesis is rejected, we conclude that the two criteria are
not independent.
The observed frequency table (from a sample) will be displayed,
similar to the table shown below:

Table 12.4.2: Two-Way Classification of a Sample of Entities

This table shows the observed frequency for each pair of criteria levels in
the sample. It is used to calculate expected frequencies under the
assumption of independence.

                      Level 1   Level 2   ...   Total
Criterion 1 Level 1   n11       n12       ...   n1.
Criterion 1 Level 2   n21       n22       ...   n2.
Criterion 1 Level 3   n31       n32       ...   n3.
...                   ...       ...       ...   ...
Criterion 1 Level r   nr1       nr2       ...   nr.
Total                 n.1       n.2       ...   n

Calculating Expected Frequencies
To calculate expected frequencies under the assumption of
independence, use the formula:
Eij = (ni. · n.j) / n

This formula calculates the expected frequency Eij for each cell,
assuming the null hypothesis of independence.

For cell n11 , the expected frequency E11 is computed as:


E11 = (n1. · n.1) / n

Repeat this calculation for all cells in the table to obtain the ex-
pected frequencies.
Observed vs Expected Frequencies
The observed and expected frequencies are compared.
If the discrepancy between observed and expected frequencies is
sufficiently large, the null hypothesis is rejected.
The chi-square statistic χ2 is computed using:

χ² = Σ (Oij − Eij)² / Eij
Where Oij are the observed frequencies and Eij are the ex-
pected frequencies.

This formula computes the chi-square statistic by comparing the


squared difference between observed and expected frequencies,
normalized by the expected frequencies.
Chi-Square Distribution

If the null hypothesis is true, the chi-square statistic follows a


chi-square distribution with (r − 1)(c − 1) degrees of freedom.
If the computed χ2 is greater than the critical value χ21−α,df from
the chi-square distribution table, the null hypothesis is rejected.
If the computed χ2 is equal to or larger than the tabulated value
of χ2 at a significance level α, we reject the null hypothesis.

This step involves comparing the computed chi-square statistic


with the critical value to determine if the observed data signifi-
cantly deviates from the expected distribution.

EXAMPLE 12.4.1: Folic Acid Use and Race

In 1992, the U.S. Public Health Service and CDC recommended


400 mg of folic acid daily for women of childbearing age to
reduce neural tube defects.
Researchers examined whether preconceptional use of folic acid
and race are independent.
Data collected from 693 pregnant women calling a teratology
information service.
Race Yes No Total
White 260 299 559
Black 15 41 56
Other 7 14 21
Total 282 354 636
Table: Race of Pregnant Callers and Folic Acid Use

EXAMPLE 12.4.1: Folic Acid Use and Race

Question: Is there sufficient evidence to indicate a relationship


between race and preconceptional folic acid use?
Sol: To determine this, we will calculate the test statistic, compare it
to the critical value, and analyze the p-value. The hypotheses are
written as follows:
Null Hypothesis (H0 ): Race and preconceptional use of folic
acid are independent.
Alternative Hypothesis (HA ): Race and preconceptional use of
folic acid are not independent.
Significance Level: α = 0.05.

Chi-Square Test Statistic

Formula for the Test Statistic: χ² = Σ (Oi − Ei)² / Ei
Degrees of Freedom: (r − 1)(c − 1) = (3 − 1)(2 − 1) = 2.
Decision Rule: Reject H0 if χ2 ≥ χ21−0.05,2 = 5.991.
Race Yes (Observed, Expected) No (Observed, Expected) Total
White 260 (247.86) 299 (311.14) 559
Black 15 (24.83) 41 (31.17) 56
Other 7 (9.31) 14 (11.69) 21
Total 282 354 636
Table: Observed and Expected Frequencies.

Remark: The table displays the observed and expected frequencies for the race and folic
acid usage categories. These are used to calculate the chi-square statistic.

Expected Frequencies
Test Statistic (Chi-Square) Calculation: The test statistic χ2 is
calculated as:
χ² = (260 − 247.86)²/247.86 + · · · + (14 − 11.69)²/11.69
   = 0.59461 + 0.47368 + · · · + 0.45647
   = 9.08960.

Critical Value: The critical value for α = 0.05 and degrees of


freedom (df = 2) is: χ20.95,2 = 5.991.
Decision: Compare the computed test statistic to the critical
value: χ2 = 9.08960 > 5.991. Since the computed value
exceeds the critical value, we reject H0 .
Remark: Since the computed test statistic is greater than the critical
value, we reject the null hypothesis and conclude that there is a statisti-
cally significant relationship.
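
As a cross-check (not part of the original solution), SciPy's contingency-table
routine reproduces the expected frequencies, the test statistic, and the p-value
directly from the observed counts in the table above:

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[260, 299],
                     [ 15,  41],
                     [  7,  14]])

chi_sq, p_value, df, expected = chi2_contingency(observed)
print(round(chi_sq, 4), df)     # about 9.0896 with 2 degrees of freedom
print(round(p_value, 4))        # between 0.01 and 0.025
print(np.round(expected, 2))    # matches the expected frequencies shown above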
Expected Frequencies
P-value Bounds: The exact p-value lies between 0.01 and
0.025, based on the chi-square distribution table for df = 2.
This range is obtained by locating where the test statistic
(9.08960) falls relative to the chi-square values at different
significance levels:
χ21−0.025,2 = 7.378 and χ21−0.01,2 = 9.210.

Since 7.378 < 9.08960 < 9.210, we conclude that


0.01 < p < 0.025.
Conclusion: There is sufficient evidence to suggest that race and
preconceptional use of folic acid are not independent. That is,
the relationship is statistically significant at α = 0.05.
Remark: Given that the p-value is between 0.01 and 0.025, and the
significance level α = 0.05, we reject the null hypothesis, concluding
a significant relationship between race and preconceptional use of folic
acid.
Introduction to Tests of Homogeneity

Key Concepts:
A test of homogeneity examines if multiple samples come from
populations that are homogeneous with respect to a certain
classification.
It differs from a test of independence, where the goal is to
examine if two criteria of classification are independent.
Both tests use the chi-square statistic but differ in sampling
procedures and hypotheses.

Reminder: A test of homogeneity deals with multiple samples,


while the test of independence involves two classifications.

Calculating Expected Frequencies

Expected Frequencies:
Expected frequencies for each cell are computed as:

Eij = (Row Total × Column Total) / Grand Total

Shortcut: Multiply marginal totals and divide by the grand total.

Definition: The expected frequency is the frequency we would


expect if there were no association between the variables.

Example Table:
Variable   Population 1   Population 2   Total
A          nA1            nA2            nA.
B          nB1            nB2            nB.
C          nC1            nC2            nC.
Total      n.1            n.2            n

Solution and Conclusion

Note: The expected frequency calculation helps determine how


the observed data compares to the expected data, assuming no as-
sociation between variables.

Step-by-Step Solution:
1 Calculate expected frequencies for each cell.
2 Compute the χ2 statistic using observed and expected
frequencies.
3 Compare χ2 to the critical value (e.g. χ21−α,df ).

Reminder: The χ2 statistic is compared to a critical value to de-


cide whether to reject the null hypothesis.

Solution and Conclusion

Conclusion:
If χ2 is less than the critical value, fail to reject H0 : Populations
are homogeneous.
If χ2 is greater than the critical value, reject H0 : Populations are
not homogeneous.

Note: A higher χ2 value indicates a larger discrepancy between


observed and expected frequencies, leading to the rejection of the
null hypothesis.

Example 12.5.1: Narcolepsy and Migraine

Problem: Narcolepsy is a disorder involving disturbances of the


sleep-wake cycle.
Study: The German Migraine and Headache Society conducted a
study on the relationship between migraine headaches and narcolepsy
in 96 subjects diagnosed with narcolepsy and 96 healthy controls.
Data:

Yes No Total
Narcoleptic Subjects 21 75 96
Healthy Controls 19 77 96
Total 40 152 192

Example 12.5.1: Narcolepsy and Migraine

Question: Are the populations homogeneous with respect to mi-


graine frequency?

Remark: This study will be analyzed using a chi-square test for


homogeneity, comparing the observed frequencies with the ex-
pected frequencies to assess if there is a significant relationship
between narcolepsy and migraine occurrence.

Hypotheses and Test Statistic

Sol: Hypotheses:
Null Hypothesis (H0 ): The populations are homogeneous with
respect to migraine frequency.
Alternative Hypothesis (HA ): The populations are not
homogeneous with respect to migraine frequency.
Test Statistic:
χ² = Σ (Oi − Ei)² / Ei
Degrees of Freedom:

(r − 1)(c − 1)
where r = number of rows, c = number of columns.

Solution and Conclusion

Step-by-Step Solution:
1 Calculate expected frequencies for each cell.
E11 = (96 × 40)/192 = 20,   E12 = (96 × 152)/192 = 76
E21 = (96 × 40)/192 = 20,   E22 = (96 × 152)/192 = 76
2 Compute χ2 statistic using observed and expected frequencies:

χ² = (21 − 20)²/20 + (75 − 76)²/76 + (19 − 20)²/20 + (77 − 76)²/76
   = 0.05 + 0.013 + 0.05 + 0.013 = 0.126

3 Compare χ2 to the critical value (e.g., α = 0.05).
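
A brief sketch verifying the hand computation above and the p-value reported
below (observed counts from the 2 × 2 table in Example 12.5.1):

import numpy as np
from scipy.stats import chi2

observed = np.array([[21, 75],
                     [19, 77]])

row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
expected = row_tot * col_tot / observed.sum()   # 20 and 76 in each row

chi_sq = np.sum((observed - expected) ** 2 / expected)
p_value = chi2.sf(chi_sq, df=1)

print(round(chi_sq, 3))    # 0.126
print(round(p_value, 3))   # about 0.722, matching the MINITAB output quoted below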

Steps in Hypothesis Testing

In this case, since the computed χ2 statistic (0.126) is less than


the critical value, we fail to reject the null hypothesis, suggesting
that the populations are homogeneous with respect to migraine
frequency.

Distribution of the Test Statistic: Under the null hypothesis


(H0 ), the test statistic χ2 approximately follows a chi-square
distribution with degrees of freedom (df):

df = (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1.

Critical Value (χ21−α,df ): The critical value is determined using


the chi-square distribution table at a significance level α = 0.05
and df = 1: χ20.95,1 = 3.841. This value is the cutoff point for the
rejection region.
Steps in Hypothesis Testing

Decision Rule: Reject H0 if the computed value of χ2 is equal


to or greater than 3.841 (for α = 0.05).
Calculation of Test Statistic: Based on the MINITAB output,
the test statistic is: χ2 = 0.126.

Remark: The value of χ2 = 0.126 is much smaller than the crit-


ical value of 3.841, suggesting that the null hypothesis cannot be
rejected.

Steps in Hypothesis Testing
Statistical Decision: Since 0.126 < 3.841, we fail to reject the
null hypothesis.
P-value: From the MINITAB output, the p-value is:

p = 0.722.

Since p > α = 0.05, this further supports failing to reject the


null hypothesis.
Conclusion:

The two populations appear to be homogeneous with respect


to migraine frequency.
The handout is completed, and all fundamental concepts and
applications necessary for a comprehensive understanding of the
required topics are covered.

