
CHALIMBANA UNIVERSITY

DIRECTORATE OF DISTANCE EDUCATION

BIOSTATISTICS AND EXPERIMENTAL DESIGNS

(ABS 3101)
FIRST EDITION 2023

Chalimbana University
School of Mathematics and Science Education
Department of Agricultural Science Education
Private Bag E1
Lusaka

WRITTEN BY: Simushi Liswaniso


Copyright © 2023 Chalimbana University
First published 2023 by Chalimbana University
All rights reserved

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form or by any means, electronic, mechanical, photocopying or recording or otherwise without
prior written permission of the publisher.

CHALIMBANA UNIVERSITY
PRIVATE BAG E 1
LUSAKA
Acknowledgment
On behalf of Chalimbana University, the Directorate of Distance Education would like to sincerely
thank Simushi Liswaniso for producing this module. Special thanks also go to the team at the
Directorate of Distance Education who provided professional and technical support towards the
writing of the module. We hope you will be able to continue to render your services when called
upon in the near future.

Table of contents

Acknowledgment
Table of contents
Module Overview
Assessment
UNIT 1: USE OF STATISTICS IN BIOLOGY & AGRICULTURE
1.1 Introduction
1.2 Learning outcomes
1.3 Time Frame
1.4 Content
1.5 Activity
UNIT 2 MEASURES OF CENTRAL TENDENCY AND DISPERSION
2.1 Introduction
2.2 Learning outcomes
2.3 Time Frame
2.4 Content
2.5 Activity
UNIT 3. FREQUENCY DISTRIBUTION
3.1 Introduction
3.2 Learning outcomes
3.3 Time Frame
3.4 Content
3.5 Activity
UNIT 4: PROBABILITY DISTRIBUTION
4.1 Introduction
4.2 Learning outcomes
4.3 Time Frame
4.4 Content
4.5 Activity
UNIT 5: DATA COLLECTION METHODS
5.1 Introduction
5.2 Learning outcomes
5.3 Time Frame
5.4 Content
5.5 Activity
UNIT 6: ESTIMATION AND HYPOTHESIS TESTING
6.1 Introduction
6.2 Learning outcomes
6.3 Time Frame
6.4 Content
6.5 Activity
UNIT 7: CONTINGENCY TABLES
7.1 Introduction
7.2 Learning outcomes
7.3 Time Frame
7.4 Content
7.5 Activity
UNIT 8: CORRELATION, REGRESSION AND COVARIANCE
8.1 Introduction
8.2 Learning outcomes
8.3 Time Frame
8.4 Content
8.5 Activity
UNIT 9: SIMPLE EXPERIMENTAL DESIGN AND ANALYSIS OF VARIANCE (ANOVA)
9.1 Introduction
9.2 Learning outcomes
9.3 Time Frame
9.4 Content
9.5 Activity
REFERENCES

Module Overview
Introduction
This module serves as your gateway to the indispensable intersection of biology and statistics. In
the realm of biostatistics, data transforms into meaningful insights, empowering researchers to
unlock the secrets of the natural world. Biostatistics is not merely a tool but an essential discipline
that guides the design, analysis, and interpretation of experiments and observations in the realm of
biology, ecology, and life sciences. This module is your key to understanding the principles of
biostatistics and research designs, equipping you with the skills necessary to navigate the
complexities of data-driven decision-making in the field of biology.

Throughout this module, you will embark on a journey that demystifies the core concepts of
biostatistics and explores the diverse landscape of research designs tailored to answer critical
questions in biology. Whether you're a budding biologist, a seasoned researcher, or simply
intrigued by the intricacies of the natural world, this module is designed to be accessible and
engaging. We will delve into topics such as measures of central tendency and dispersion,
hypothesis testing, experimental design, observational studies, sampling techniques, and more. By
the end of this module, you will not only grasp the fundamental principles of biostatistics but also
be well-prepared to design, conduct, and analyze your own biologically-relevant research projects.
So, let's embark on this exciting journey together and unlock the mysteries of the living world
through the lens of biostatistics and research designs.
Rationale
This module has been written to provide training to pre-service and in-service secondary school
teachers of Agricultural Science. The trained teachers are expected to understand biostatistics and
design experiments competently and innovatively.
Course Objectives
To provide a basic understanding of Biostatistics and Experimental Designs. The subject will be
explored from the basics of biostatistics, through methods of data collection and analysis, to
experimental designs.
Aim
The main aim of this module is to introduce and impart knowledge of Biostatistics and
Experimental Designs to students taking the course.
Learning objectives

Learning outcomes are statements that tell you what knowledge and skills you will have gained
after working successfully through the module.
Knowledge
When you have worked through this module, you should be able to:
▪ Design experiments
▪ Design appropriate data collection tools
▪ Formulate hypotheses and test them
▪ Draw conclusions from designed experiments
Skills
After you have gone through this module, you will be able to:
▪ Identify the different data collection and analysis methods
▪ Identify suitable experimental designs
▪ Acquire skills on how to make scientific conclusions
Time Frame
This module is intended to help you acquire scholarly knowledge and skills on Biostatistics and
Experimental Designs. The study material will be covered in three residential sessions, after which
you will write a final examination. The programme includes coursework assessments in the form
of tests.
Need help?

You are encouraged to form discussion groups, for example a WhatsApp forum where you will be
interacting with your friends and possibly your lecturer as a way of learning and supporting each
other. Any other suggestion that supports your studies is welcome.

Assessment
Continuous assessment: 50%

Two theory tests: 25% each

Examination: 50%

Teaching hours:
• 2 lecture hours per week
• 1 tutorial per week

UNIT 1: USE OF STATISTICS IN BIOLOGY & AGRICULTURE
1.1 Introduction
Statistics plays a significant role in our modern world, providing valuable insights into various
aspects of the real world through the study of numerical relationships. It is widely employed
across diverse fields, including surveys, public health, sports, education, operations research,
quality control, estimation, and prediction.

This unit explores the meaning of Statistics and Biostatistics, delves into the application of
statistics in life sciences and related fields, and examines the limitations associated with such
applications.

1.2 Learning outcomes


By the end of this unit, students should be able to;

• Define the terms statistics and biostatistics.
• Talk about how statistics are used in the fields of agriculture and other biological sciences.
• Examine continuous and discrete variables.
• Describe the principles of sampling.
• Discuss the various sampling techniques and comprehend the advantages sampling provides as
well as its value and purpose.

1.3 Time Frame


• Self-study of 2 hours
• Lecture lesson of 2 hours

1.4 Content
Statistics and Biostatistics

Statistics is a field within mathematics that focuses on the collection, analysis, interpretation,
presentation, and utilization of data for making informed decisions. In other words, it is the science
of learning from data. Statistics plays a critical role in many fields, including science, business,
economics, social sciences, and healthcare, among others. It provides methods for summarizing
and describing data, identifying patterns and relationships, testing hypotheses, and making
predictions. By using statistical tools and techniques, researchers and analysts can gain insights
into complex phenomena and make data-driven decisions that are informed by evidence rather
than perceptions or guesswork.

Biostatistics involves the utilization of statistical techniques for analyzing and interpreting data
that is specifically related to biology and health. It involves the design, analysis, and interpretation
of experiments, observational studies, and clinical trials, among other types of research in the
biological and health sciences. Biostatistics aims to provide a rigorous framework for
understanding biological and health-related phenomena, and for making evidence-based decisions
in areas such as public health, medicine, genetics, and environmental science. Biostatistical
methods can be used to study a variety of topics, such as disease outbreaks, drug development,
genetic association studies, environmental health, and epidemiology.
Biostatisticians play a critical role in research teams, providing expertise in study design, data
analysis, and interpretation of findings.

Uses of Biostatistics in Agriculture, Medicine and Biology

Statistics plays a crucial role in biology, agriculture, and medicine by providing a framework for
analyzing and interpreting data in these fields. In biology, statistical methods are used to study
evolutionary processes, genetics, ecology, and biodiversity, among other topics. In agriculture,
statistics is used to analyze crop yields, soil characteristics, and animal production, among others,
to improve agricultural practices and food production. In medicine, statistics is used to design and
analyze clinical trials, assess the safety and efficacy of treatments, and identify risk factors for
diseases. Statistical models and methods are also used in epidemiology to track disease outbreaks
and monitor public health. Overall, the use of statistics in these fields helps researchers and
practitioners to make informed decisions and draw reliable conclusions from their data.

For a researcher to conduct research, they need to have an understanding of:

1. What experimental methodology to employ
2. Expected results
3. Guidelines to conform to for a particular experimental method, and the correct statistical
analysis method for data of a biological nature
4. How to conduct statistical significance tests for comparing sets of data
5. How to determine the relationship between two variables, for example by utilizing correlation
Discrete and Continuous Variables

The data that statisticians work with to draw conclusions differ in characteristics and nature. To
learn more about random events, biostatisticians gather data on the factors that define those
events. A variable is, therefore, a characteristic that can take on different values. Broadly,
variables can be classified as either quantitative or qualitative.

Qualitative variables, also known as categorical variables, are variables that describe
characteristics or attributes that cannot be measured numerically. Instead of numerical values,
qualitative variables are often represented by words, symbols, or codes. Examples of qualitative
variables include gender, ethnicity, hair color, and favorite color. Qualitative variables can be
further classified as nominal or ordinal. Nominal variables are categories that have no inherent
order or ranking, such as eye color or country of origin. Ordinal variables, on the other hand, have
a natural ordering or ranking, such as educational level or income bracket. Qualitative variables
can be analyzed using descriptive statistics, such as frequencies and percentages, and inferential
statistics, such as chi-square tests and logistic regression.

Quantitative variables are variables that are measured or expressed numerically. These variables
can be either discrete or continuous. Discrete variables are whole numbers or counts, such as the
number of babies born on a particular day or the number of motorcycles loaded for export in a day.
On the contrary, Continuous variables can be any value within a particular range like height,
weight, or temperature. Quantitative variables can be further classified as either interval or ratio.
Interval variables have equal units of measurement but no true zero point, such as temperature
measured in Celsius or Fahrenheit. Ratio variables have equal units of measurement and a true
zero point, such as height, weight, and income. Quantitative variables are typically analyzed using
descriptive statistics, such as measures of variability (standard deviation, range) and measures of
central tendency (mode, mean, and median), and inferential statistics, such as regression analysis
and hypothesis testing.

Sampling

Sampling refers to the procedure of choosing a smaller portion of individuals or items from a larger
population to be included in a statistical analysis. The individuals or items that are selected are
called the sample, and the larger population from which they are drawn is the population of interest.

Obtaining a representative sample that accurately reflects the traits of the population of interest is
the main aim of sampling. This is important because statistical inferences about the population are
based on the information obtained from the sample.

Importance of Sampling

1. Efficiency: Sampling can be more efficient than studying the entire population, especially
when the population size is large. By studying a sample, we can save time and resources
while still obtaining meaningful results.
2. Cost-effectiveness: Sampling is often less expensive than studying the entire population.
3. Accuracy: Sampling can provide accurate estimates of population parameters if the sample
is chosen carefully and is representative of the population.
4. Generalizability: The findings or conclusions drawn from a sample can be extended or
applied to the broader population from which the sample was originally selected.
5. Feasibility: In some cases, studying the entire population is not feasible or even possible.
Sampling allows us to obtain meaningful results in situations where studying the entire
population is not practical.
6. Diversity: Sampling can help ensure that a diverse range of individuals or items are
included in the study, which can improve the generalizability and applicability of the
results.
7. Ethical considerations: In some cases, it may not be ethical or practical to study an entire
population. Sampling can help mitigate ethical concerns and make sure the research is done
responsibly.

Sampling Methods

The common sampling methods include random, clustered, stratified, and systematic sampling.

Random sampling is a probability-based sampling method used in biostatistics to select a
representative sample from a population. In the context of random sampling, every individual
within the population has an equal probability of being chosen to be a part of the sample.

Random sampling is preferred in biostatistics because it helps ensure that the selected sample
accurately represents the entire population and minimizes the potential for bias. The random
selection process also ensures that each individual in the population has an equal opportunity
of being sampled, making the results more reliable and generalizable. However, it can be
challenging to implement in practice, especially in populations with complex characteristics or
when the population size is large.

Cluster sampling is a probability-based sampling method used in biostatistics to select a
representative sample from a population that is geographically or spatially dispersed. Cluster
sampling involves grouping the population into smaller groups, selecting a random subset of those
groups, and including everyone in that subset in the sample.

When it is neither possible nor feasible to sample members of the entire population, such as when
it is large or dispersed across a wide area, cluster sampling is frequently utilized. Cluster sampling
is more efficient than simple random sampling, as it reduces the cost and time required for data
collection. However, cluster sampling can lead to increased variability in the sample if there is
significant variation within clusters. To minimize this, it is important to select clusters that are as
similar as possible to each other in terms of the characteristics of interest.

Stratified sampling is a probability-based sampling method used in biostatistics to select a
representative sample from a population by first segmenting the population into subgroups, or
strata, according to particular criteria, and then choosing a random sample from each stratum.

Stratified sampling is preferred in biostatistics because it decreases the chance of bias and makes
sure the sample is representative of the population. By separating the population into strata,
stratified sampling takes into account the variability within each stratum and ensures that the
sample includes a proportional representation of individuals from each stratum. This can be
particularly beneficial if the population is assorted or when certain strata are underrepresented in
the population. Stratified sampling also allows for more precise estimates of population parameters
and reduces the variability in the sample. However, stratified sampling can be more complex and
time-consuming than simple random sampling, especially if the population has a large number of
strata or if the strata are difficult to define.

Systematic sampling is a probability-based sampling method used in biostatistics to select a
representative sample from a population by selecting every kth individual from a list or sequence
of individuals in the population.

Systematic sampling is preferred in biostatistics because it is a simple and effective sampling
technique that helps ensure that the sample is representative of the population. Systematic sampling
can be particularly convenient if the population is big and it isn’t practical to choose a simple
random sample, or when the population is arranged in a particular order. However, systematic
sampling can introduce bias if there is a periodicity or pattern in the population, such that every
kth individual has a similar characteristic. In such cases, systematic sampling may not be the best
option, and another sampling method, such as simple random sampling or stratified random
sampling, may be more appropriate.

Proportionate sampling is a probability-based sampling method used in biostatistics to select a
representative sample from a population by selecting individuals from each stratum in proportion
to their size in the population. Proportionate sampling is a type of stratified sampling.

Proportionate sampling is preferred in biostatistics because it helps ensure that the sample is
typical of the population and minimizes the danger of bias. By selecting a proportional sample from each
stratum, proportionate sampling takes into account the variability within each stratum and ensures
that the sample includes a proportional representation of individuals from each stratum. This can
be particularly useful when certain strata are over- or under-represented in the population.
Proportionate sampling also allows for more precise estimates of population parameters and
reduces the variability in the sample.

However, proportionate sampling is more complex and laborious when compared to other
sampling techniques, particularly when the population has a huge number of strata or if the strata
are difficult to define. In addition, proportionate sampling may not be appropriate if the scope of
the strata in the population is highly variable or if the cost of sampling from each stratum is high.
In such cases, other sampling methods, such as stratified random sampling, may be more
appropriate.
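
These sampling schemes can be illustrated with a short Python sketch. It is a minimal illustration only: the population of 100 numbered individuals, the 60/40 strata, and the sample size of 10 are all invented for the example.

import random

# Hypothetical population of 100 numbered individuals.
population = list(range(1, 101))

# Simple random sampling: every individual has an equal chance of selection.
simple = random.sample(population, 10)

# Systematic sampling: every kth individual after a random starting point.
k = len(population) // 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified (proportionate) sampling: split into strata (here, an invented
# 60/40 split) and sample each stratum in proportion to its size.
strata = {"stratum_A": population[:60], "stratum_B": population[60:]}
stratified = []
for name, members in strata.items():
    n = round(10 * len(members) / len(population))  # proportional allocation
    stratified.extend(random.sample(members, n))

print(simple)
print(systematic)
print(stratified)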

1.5 Activity
1. What is the importance of biostatistics?

2. What is a sample?

3. Differentiate between qualitative and quantitative variables.

4. A researcher wants to estimate the average weight of oranges in a large orchard. Which
sampling technique would be most appropriate, and why?

5. A government agency wants to determine the prevalence of a particular crop disease in a


region. What sampling method would you recommend, and why?

UNIT 2 MEASURES OF CENTRAL TENDENCY AND DISPERSION
2.1 Introduction
In biostatistics, measures of central tendency and dispersion are fundamental concepts used to
describe and summarize data from biological experiments and observations. These measures help
researchers understand the central values and variability within a dataset. These measures of
central tendency and dispersion are essential tools in biostatistics to summarize and interpret data,
allowing researchers to draw meaningful conclusions and make informed decisions in the field of
biology and life sciences. This module presents notes on measures of central tendency and
dispersion in biostatistics, along with examples.

2.2 Learning outcomes


By the end of this unit, students should be able to;
I. Define and explain the concept of central tendency in statistics.
II. Differentiate between mean, median, and mode as measures of central tendency.
III. Calculate measures of central tendency.
IV. Identify situations where each measure is most appropriate to use.
V. Calculate the range, variance, and standard deviation for a given dataset.

2.3 Time Frame


• Self-study of 2 hours

• Lecture lesson of 2 hours

2.4 Content
Measures of Central Tendency

Measures of central tendency provide insights into the central or typical value of a dataset.
Common measures of central tendency include:

Mean (Average)

The mean is calculated by summing up all data values and dividing by the number of data points.

Example: Calculate the mean height (in centimeters) of a sample of 10 individuals: 160, 165, 170,
155, 175, 168, 162, 158, 172, 167.

Mean = (160 + 165 + 170 + 155 + 175 + 168 + 162 + 158 + 172 + 167) / 10 = 1652 / 10 = 165.2 cm

Median (Middle Value)

The median is the middle value when data is arranged in ascending or descending order. If there
is an even number of data points, the median is the average of the two middle values.

Example: Calculate the median age (in years) of a group of 8 patients: 25, 32, 19, 28, 35, 40, 22,
31.

Arranged in ascending order: 19, 22, 25, 28, 31, 32, 35, 40.

Median = (28 + 31) / 2 = 29.5 years

Mode (Most Common Value)

The mode is the value that appears most frequently in a dataset. There can be multiple modes if
several values have the same highest frequency.

Example: Determine the mode of a dataset representing the colors of flowers observed in a
garden: Red, Blue, Yellow, Red, Green.

Mode = Red (because it appears most frequently).

Measures of Dispersion

Measures of dispersion provide information about the spread or variability of data points. Common
measures of dispersion include:

Range

The range is the difference between the maximum and minimum values in a dataset. It provides a
simple measure of the spread of data.

Example: Calculate the range of blood pressure (in mm Hg) in a sample of 15 patients: 110, 120,
105, 140, 135, 125, 115, 130, 150, 95, 108, 132, 138, 117, 122.

Range = 150 - 95 = 55 mm Hg

Variance

Variance measures the average of the squared differences between each data point and the mean.
It quantifies the overall variability of the data.

Example: Calculate the variance of the weights (in kg) of 20 lab mice: [25, 23, 26, 28, 30, 22, 24,
29, 27, 26, 28, 25, 23, 27, 30, 31, 29, 24, 23, 25].

Mean = 525 / 20 = 26.25 kg; Variance = 137.75 / 20 ≈ 6.89 kg² (rounded to two decimal places)

Standard Deviation

The standard deviation is the square root of the variance. It provides a measure of the spread of
data in the same units as the original data.

Standard Deviation = √(Variance)

Example: For the above dataset, the standard deviation ≈ √6.89 ≈ 2.62 kg (rounded to two decimal
places).
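
These hand calculations can be checked with Python's built-in statistics module. The sketch below re-uses the mouse-weight data from the variance example; the values in the comments follow from that data.

import statistics

# Weights (kg) of the 20 lab mice from the variance example.
weights = [25, 23, 26, 28, 30, 22, 24, 29, 27, 26,
           28, 25, 23, 27, 30, 31, 29, 24, 23, 25]

print(statistics.mean(weights))       # 26.25
print(statistics.median(weights))     # 26.0
print(statistics.mode(weights))       # 25 (23 and 25 both occur 3 times; the first is returned)
print(max(weights) - min(weights))    # range: 9
print(statistics.pvariance(weights))  # population variance: 6.8875
print(statistics.pstdev(weights))     # population standard deviation: ~2.62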

2.5 Activity
1. Calculate the mean (average) of the following dataset representing the number of hours
spent studying by a group of students in a week: 6, 7, 5, 8, 6, 9, 7.
2. Determine the median of the following dataset representing the ages of a sample of 12
people: 22, 30, 27, 24, 31, 29, 35, 25, 26, 28, 32, 23.
3. Find the mode of the following dataset representing the scores (out of 100) of a class on a
recent math test: 85, 90, 78, 90, 92, 78, 88, 92, 85, 75.
4. Calculate the range of temperatures (in degrees Celsius) recorded in a city over a week: 18,
23, 20, 16, 28, 14, 30.
5. Compute the variance of a dataset representing the weights (in grams) of 15 apples: 150,
160, 145, 155, 158, 150, 147, 162, 152, 155, 157, 160, 148, 153, 159.
6. Determine the standard deviation (rounded to two decimal places) of the following dataset
representing the heights (in centimeters) of 10 individuals: 162, 165, 170, 155, 175, 168,
162, 158, 172, 167.

UNIT 3. FREQUENCY DISTRIBUTION
3.1 Introduction
When measurements or counts are taken, the resulting data are raw and difficult to make sense of.
As such, they need to be organized or summarized to have meaning. This may be done through
tabulation or classification. The researcher must arrange the data in some useful form to describe
situations, draw conclusions, or deduce events. The easiest way to organize data is to create a
frequency distribution.

3.2 Learning outcomes


By the end of this unit, students should be able to;
I. Explain what frequency and frequency distribution are.
II. List the various frequency distribution types.
III. Classify information utilizing frequency distributions.
IV. Explain why the distributions are created.
V. Use techniques other than frequency distribution to represent data.

3.3 Time Frame


• Self-study of 2 hours
• Lecture lesson of 2 hours
3.4 Content
The Frequency distribution

Frequency distribution is a statistical concept used to summarize and present data in a meaningful
and organized way. A frequency distribution is a table that shows how frequently each value or
category of a variable occurs in a dataset. The first step in constructing a frequency distribution
is to determine the variable's range of values or categories. Next, the number of times each value
or category appears in the dataset is counted and recorded in the table. The frequencies can then
be presented as absolute numbers or as percentages of the dataset's observations. A frequency
distribution can be used to identify a dataset's most common or rare values or categories, detect
outliers, and gain insight into the data distribution.

Categorical Frequency

Categorical frequency is a statistical concept used to describe the distribution of categorical
variables in a dataset. To calculate categorical frequency, the first step is to determine the
categories of the variable. Next, the number of observations or individuals in each category is
counted and recorded. Categorical frequency can be presented as absolute numbers or as
percentages of the total number of observations in the dataset. Categorical frequency is useful for
analyzing the prevalence of different categories and identifying any imbalances or biases in the
data. It can also be used to explore relationships between different categorical variables, such as
gender and occupation or education level and income. To calculate the relative frequency, the
frequency of each category is divided by the sample size and then multiplied by one hundred. This
can be represented mathematically as follows:

Relative frequency (%) = (f / n) * 100

Where f = the number of occurrences of the category or class and n = the total number of values.

Example

Below are colors of 30 indigenous chickens sampled in Mbala during research. Construct a
frequency distribution for the data.

White, Brown, White, Black, Black, Brown, White, Black, White, Brown, Brown, White, Black,
Brown, Black, White, White, Brown, Black, Brown, White, Brown, Black, Brown, Brown,
Black, White, Brown, Black, White

Solution

Since this is categorical data with three categories, the chicken colors white, black and brown can
be used as classes for the distribution.

Class Tally Frequency Percent

White IIIII IIIII 10 33.33
Brown IIIII IIIII I 11 36.67
Black IIIII IIII 9 30.00
Total 30 100.00
It can, therefore, be concluded that there were more brown chickens in Mbala as brown had the
highest frequency.
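
A categorical frequency table like the one above can be generated with a short Python sketch using collections.Counter; the color list below simply reproduces the Mbala counts (10 white, 11 brown, 9 black).

from collections import Counter

# The 30 sampled chickens, reproduced by their category counts.
colors = ["White"] * 10 + ["Brown"] * 11 + ["Black"] * 9
counts = Counter(colors)
n = len(colors)

for color, f in counts.items():
    # Relative frequency (%) = (f / n) * 100
    print(f"{color:<6} {f:>3} {100 * f / n:6.2f}")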

Ungrouped Frequency Distribution

This is usually a list of measurements that are not categorical. The frequency is constructed based
on individual entries.

Example

Below is data of the thigh lengths of chickens collected in research conducted in Chongwe.

13 13 12 13 14 11 10 12 15 14
12 13 13 12 13 14 11 10 12 15
13 12 12 12 13 14 11 12 10 11
14 13 11 11 14 11 13 13 12 12
11 14 11 12 14 12 14 14 11 12
10 11 12 11 11 10 11 13 12 12
12 10 11 13 10 11 11 11 15 12
15 12 13 14 11 12 12 11 15 13
14 15 15 13 12 13 11 12 14 14

Solution

Calculate the range of this data: highest value minus the lowest number (15-10=5). Since the range
is small, there is no need to form classes.

Class limits Tally Frequency Cumulative frequency Relative frequency (%)

10 IIIII II 7 7 8
11 IIIII IIIII IIIII IIIII I 21 28 23
12 IIIII IIIII IIIII IIIII IIII 24 52 27
13 IIIII IIIII IIIII II 17 69 19
14 IIIII IIIII IIII 14 83 16
15 IIIII II 7 90 8

Example

The following data are the heights, in centimeters, of perennial fodder trees after 12 months.
Construct a frequency table.

100 150 110 186 115 120 115 167 106 164 161 145 170
194 128 117 168 177 181 193 154 129 130 163 155 171

Solution

Find the range: 194 - 100 = 94

Calculate the class width: range / number of classes = 94 / 5 = 18.8, so round off to 19

Class Tally Frequency Cumulative frequency Relative frequency (%)

100 - 118 IIIII I 6 6 23.08
119 - 137 IIII 4 10 15.38
138 - 156 IIII 4 14 15.38
157 - 175 IIIII II 7 21 26.92
176 - 194 IIIII 5 26 19.23
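
The grouping above can be automated with a minimal Python sketch; the class width of 19 and the lower limit of 100 follow the worked example.

# Heights (cm) of the 26 fodder trees from the example.
heights = [100, 150, 110, 186, 115, 120, 115, 167, 106, 164, 161, 145, 170,
           194, 128, 117, 168, 177, 181, 193, 154, 129, 130, 163, 155, 171]

num_classes = 5
width = 19            # range / number of classes = 94 / 5 = 18.8, rounded up
lower = min(heights)  # 100

cumulative = 0
for i in range(num_classes):
    lo = lower + i * width
    hi = lo + width - 1  # class limits are inclusive
    f = sum(lo <= h <= hi for h in heights)
    cumulative += f
    print(f"{lo}-{hi}: freq={f}, cum={cumulative}, rel={100 * f / len(heights):.2f}%")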

3.5 Activity

1. What is the purpose of a frequency distribution in biostatistics?


2. How is a frequency distribution table constructed?
3. What information does a frequency distribution provide about a dataset?
4. If you have a dataset with 50 observations, how many classes or intervals would you
typically use to construct a frequency distribution?
5. How can you calculate the class width for a frequency distribution?
6. What is the difference between a grouped frequency distribution and an ungrouped
frequency distribution?
7. Why must a cumulative frequency column be included in a frequency distribution table?

UNIT 4: PROBABILITY DISTRIBUTION
4.1 Introduction
Probability, as a fundamental concept in mathematics, refers to the likelihood or chance of an event
happening. It serves as the foundation for inferential statistics. This unit explores three specific
distributions: the Normal, the Poisson, and the Binomial. These distributions play a significant role
in sampling theory.

4.2 Learning outcomes


By the end of this unit, students should be able to;
I. Differentiate between probability distributions and frequency distributions
II. Describe the Normal, Poisson, and Binomial distributions
III. Compute the probabilities in Poisson and binomial probability distributions.

4.3 Time Frame


• Self-study of 2 hours
• Lecture lesson of 2 hours
4.4 Content
A probability distribution is a mathematical function that provides the probability of occurrence
of each possible outcome in a random experiment or event. Probability distributions are used in
various fields, including statistics, physics, finance, and engineering, to describe and model the
behavior of random variables.

There are two main types of probability distributions: discrete and continuous. Discrete probability
distributions are used when the outcome can only take on a finite number of values, such as the
number of heads in a coin toss. Examples of discrete probability distributions include the binomial
distribution and the Poisson distribution.

On the other hand, continuous probability distributions are used when the outcome can take on
any value within a range, such as the height of individuals in a population. Examples of continuous
probability distributions include the normal distribution and the exponential distribution.

A distribution is a scatter of similar values, such as the range of cattle weights. How frequently
specific values within a range of values occur is displayed in a frequency distribution. A probability
distribution is quite similar since it demonstrates the likelihood of various random variable values
within a specific range. As an illustration, if we toss two coins, we might get 0, 1, or 2 "heads."

If we build a table listing the probability of each value of the random variable, we get the
probability distribution displayed below.

Number of ‘heads’ Sequential event Probability

0 TT 0.5 x 0.5 = 0.25
1 TH or HT (0.5 x 0.5) + (0.5 x 0.5) = 0.5
2 HH 0.5 x 0.5 = 0.25
Total 1
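
The table can be verified by enumerating all equally likely outcomes of two tosses, as in this small Python sketch.

from collections import Counter
from itertools import product

# All equally likely outcomes of tossing two fair coins: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))
heads = Counter(seq.count("H") for seq in outcomes)

for k in sorted(heads):
    print(f"P({k} heads) = {heads[k] / len(outcomes)}")  # 0.25, 0.5, 0.25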

Normal Distribution

The normal distribution, or Gaussian distribution, is a continuous probability distribution widely
used in statistics, science, and engineering. The normal distribution is often called a bell-shaped
curve due to its symmetric shape, with the mean (average) of the distribution in the center and the
curve's tails tapering off towards infinity.

The normal distribution is characterized by the mean (μ) and the standard deviation (σ). The
mean is the central value around which the distribution is symmetric, and the standard deviation
measures the spread of the distribution.

Many real-world phenomena, such as heights and weights of individuals, IQ scores, and errors in
measurements, are known to follow a normal distribution. This makes the normal distribution
popular for modeling data in many fields.

The normal distribution has several properties that make it useful in statistical analysis. One
important property is the central limit theorem, which states that the sum of independent and
identically distributed random variables will tend to follow a normal distribution as the number of
variables increases.

[Figure: graph of the normal distribution, with the standard deviation intervals marked about the mean]

Key features of the curve

The significant feature of the curve lies in the probability area it covers. Erecting perpendiculars
at a distance of 1σ from the mean in both directions, the region enclosed by the curve and these
perpendiculars will constitute roughly 68.26% of the entire area. This suggests that around 68.26%
of all frequencies fall within one standard deviation of the mean. Moreover, the overall probability
represented by the curve's area is 1 or 100%.

The distribution's parameters are the measures of central tendency (μ = population mean) and
dispersion (σ = population standard deviation), which, once estimated for a specific population,
can be utilized to determine the shape of its distribution curve using the normal curve formula.
Generally, we lack knowledge of μ and σ and, therefore, have to estimate them from a sample as x̄
and s. If the sample size is over 30 observations, then x̄ and s are deemed to be dependable
estimates of the parameters.

Standardizing The Normal Curve

Standardizing the normal curve is a process that involves transforming a normally distributed data
set into a standard normal distribution. The standard normal distribution, also known as the Z-
distribution, has a mean of 0 and a standard deviation of 1. This transformation allows us to
compare and analyze different normal distributions more easily.

To standardize a normal distribution, we use a formula called the z-score. The z-score for a
particular data point represents the number of standard deviations that data point is away from the
mean. The formula for calculating the z-score is:

z = (x - μ) / σ

Where:

• z is the z-score

• x is the individual data point

• μ is the mean of the distribution

• σ is the standard deviation of the distribution

By calculating the z-score for each data point in a normal distribution, we can convert the
distribution into the standard normal distribution. This transformation allows us to compare data
from different normal distributions since they will all be measured on the same scale.

Once we have standardized the normal curve, we can use the standard normal distribution table
(also known as the z-table) to determine the probability associated with specific z-scores. The z-
table provides the area under the curve to the left of a given z-score. This allows us to find the
proportion of data that falls within a certain range or the probability of obtaining a particular value
or range of values.

Standardizing the normal curve is particularly useful in statistical analysis and hypothesis testing.
It enables us to make comparisons, calculate percentiles, and estimate probabilities accurately. The
standard normal distribution also serves as a foundation for many statistical techniques and
calculations.

In summary, standardizing the normal curve involves transforming a normal distribution into a
standard normal distribution by calculating the z-score for each data point. This process allows for
easier comparison and analysis of different normal distributions and facilitates the calculation of
probabilities and percentiles using the standard normal distribution table.

Poisson Distribution

The Poisson distribution is a discrete probability distribution that models the number of events
occurring within a fixed interval of time or space. It is often used to describe rare events that occur
independently of each other at a constant average rate.

The key characteristics of the Poisson distribution are:

1. Mean and Variance: The mean (λ) of the Poisson distribution represents the average rate
at which events occur within the given interval. The variance of the distribution is also
equal to λ.

2. Independent Events: Each event in a Poisson process is assumed to be independent of other
events. The occurrence of one event does not affect the probability of another event
happening.

3. Fixed Interval: The Poisson distribution applies to situations where the interval of time or
space is fixed, such as the number of phone calls received in an hour or the number of
accidents at a particular intersection in a day.

4. Discrete Events: The Poisson distribution deals with discrete events, meaning the number
of events can only take on non-negative integer values (0, 1, 2, 3, and so on).

The probability mass function (PMF) of the Poisson distribution gives the probability of observing
a specific number of events within the given interval, and it is defined as:

P(X = k) = (e^(-λ) * λ^k) / k!

Where:

• X is the random variable representing the number of events

• k is a non-negative integer representing the number of events observed

• e is the base of the natural logarithm (approximately 2.71828)

• λ is the average rate of events occurring within the interval

The Poisson distribution is commonly used in various fields, including telecommunications,
insurance, queueing theory, and reliability analysis. It provides a useful framework for analyzing
and predicting the occurrence of rare events, allowing us to estimate probabilities and make
informed decisions based on the expected frequency of these events.

Example;

In research investigating the distribution of birds roosting on trees, we have a scenario where there
are 200 birds randomly spread out among 500 trees. The question is to determine the probability
of a specific tree containing precisely three birds.

Solution;

λ = 200 / 500 = 2 / 5 = 0.4

That is, 0.4 birds per tree. Therefore λ = 0.4, while k = 3. We can then substitute into the formula
above.

P(3; 0.4) = (e^(-0.4) * 0.4^3) / 3! ≈ 0.0072

Thus, there is less than a 1% probability that any given tree will contain exactly three birds.
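
This calculation can be reproduced with a short Python sketch of the Poisson PMF given above.

import math

def poisson_pmf(k, lam):
    # P(X = k) = e^(-lam) * lam^k / k!
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 200 / 500  # 0.4 birds per tree
print(round(poisson_pmf(3, lam), 4))  # 0.0072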

Binomial Distribution

In the context of probability and statistics, the binomial distribution is a discrete probability
distribution that models the number of successes in a fixed number of independent Bernoulli trials.
It is used when each trial has only two possible outcomes, typically referred to as "success" and
"failure."

The key characteristics of the binomial distribution are:

1. Fixed Number of Trials: The binomial distribution applies to situations with a fixed number
of trials or experiments, denoted as n. Each trial is independent and has the same probability
of success.

2. Independent Trials: The outcome of each trial is assumed to be independent of the other
trials. The result of one trial does not affect the probability of success in subsequent trials.

3. Two Possible Outcomes: Each trial can only have two possible outcomes, usually labeled
as success (S) and failure (F), with a known probability of success denoted as p.

The probability mass function (PMF) of the binomial distribution gives the probability of
observing a specific number of successes, denoted as k, in the given number of trials, denoted as
n. It is defined as:

P(X = k) = (n choose k) * p^k * (1 - p)^(n - k)

Where:

• X is the random variable representing the number of successes

• k is a non-negative integer representing the number of successes observed

• n is the fixed number of trials

• p is the probability of success in a single trial

• "n choose k" represents the binomial coefficient, which calculates the number of ways to
choose k successes from n trials
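
A minimal Python sketch of the binomial PMF given above, using math.comb for the binomial coefficient; the coin-toss numbers in the example call are invented for illustration.

import math

def binomial_pmf(k, n, p):
    # P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Example: probability of exactly 2 successes in 3 trials with p = 0.5
# (e.g. exactly 2 heads in 3 tosses of a fair coin).
print(binomial_pmf(2, 3, 0.5))  # 0.375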

4.5 Activity
1. A fair six-sided die is rolled three times. Find the probability distribution of the sum of the
three rolls.

2. The number of cars passing through a certain intersection follows a Poisson distribution
with an average rate of 10 cars per hour. What is the probability that exactly 15 cars will
pass through the intersection in a given hour?

3. In a factory, the probability that a product is defective is 0.05. If a random sample of 200
products is taken, find the probability distribution of the number of defective products in
the sample.

UNIT 5: DATA COLLECTION METHODS
5.1 Introduction
Data collection is the systematic gathering of information about the objects of a study (people,
objects, and phenomena) and the setting in which they occur. This unit examines the main methods
of data collection, the criteria used to choose among them, and the design of questionnaires.

5.2 Learning outcomes


By the end of this unit, students should be able to;
I. Identify the different methods of data collection and the criteria used to select a method of
data collection
II. Define a questionnaire, identify the different parts of a questionnaire, and indicate the
procedure for preparing a questionnaire

5.3 Time Frame


• Self-study of 2 hours
• Lecture lesson of 2 hours
5.4 Content
Data collection techniques allow us to systematically collect data about our objects of study
(people, objects, and phenomena) and about the setting in which they occur. In the collection of
data we have to be systematic. If data are collected haphazardly, it will be difficult to answer our
research questions in a conclusive way.

Various data collection techniques can be used such as:

• Observation

• Face-to-face and self-administered interviews

• Postal or mail method and telephone interviews

• Using available information

• Focus group discussions (FGD)

• Other data collection techniques – Rapid appraisal techniques, 3L technique, Nominal group
techniques, Delphi techniques, life histories, case studies, etc.

1. Observation – Observation is a technique that involves systematically selecting, watching and
recording behaviors of people or other phenomena, and aspects of the setting in which they occur,
for the purpose of gaining specified information. It includes all methods, from simple visual
observation to the use of high-level machines and measurements, sophisticated equipment or
facilities such as radiographic and biochemical equipment, X-ray machines, microscopes, clinical
examinations, and microbiological examinations. Guidelines for the observations should be
outlined prior to actual data collection.

Advantages: Gives relatively more accurate data on behavior and activities.

Disadvantages: Subject to the investigator's or observer's own biases, prejudices, desires, etc.;
also needs more resources and skilled human power when high-level machines are used.

2. Interviews and self-administered questionnaires – Interviews and self-administered
questionnaires are probably the most commonly used research data collection techniques.
Therefore, designing good “questioning tools” forms an important and time-consuming phase in
the development of most research proposals.

Once the decision has been made to use these techniques, the following questions should be
considered before designing our tools:

• What exactly do we want to know, according to the objectives and variables we identified earlier?
Is questioning the right technique to obtain all answers, or do we need additional techniques, such
as observations or analysis of records?

• Of whom will we ask questions and what techniques will we use? Do we understand the topic
sufficiently to design a questionnaire, or do we need some loosely structured interviews with key
informants or a focus group discussion first to orient ourselves?

• Are our informants mainly literate or illiterate? If illiterate, the use of self-administered
questionnaires is not an option.

• How large is the sample that will be interviewed? Studies with many respondents often use
shorter, highly structured questionnaires, whereas smaller studies allow more flexibility and may
use questionnaires with a number of open-ended questions.

Interviews may be less or more structured. An unstructured interview is flexible: the content,
wording and order of the questions vary from interview to interview. The investigators only have
an idea of what they want to learn but do not decide in advance exactly what questions will be
asked, or in what order.

In other situations, a more standardized technique may be used, the wording and order of the
questions being decided in advance. This may take the form of a highly structured interview, in
which the questions are asked in a fixed order, or a self-administered questionnaire, in which case
the respondent reads the questions and fills in the answers by himself (sometimes in the presence
of an interviewer who ‘stands by’ to give assistance if necessary). Standardized methods of asking
questions are usually preferred in community medicine research, since they provide more
assurance that the data will be reproducible. Less structured interviews may be useful in a
preliminary survey, where the purpose is to obtain information to help in the subsequent planning
of a study rather than factors for analysis, and in intensive studies of perceptions, attitudes,
motivation and affective reactions. Unstructured interviews are characteristic of qualitative (non-
quantitative) research. The use of self-administered questionnaires is simpler and cheaper; such
questionnaires can be administered to many persons simultaneously (e.g. to a class of students),
and unlike interviews, can be sent by post. On the other hand, they demand a certain level of
education and skill on the part of the respondents; people of a low socio-economic status are less
likely to respond to a mailed questionnaire.

In interviewing using a questionnaire, the investigator appoints agents known as enumerators, who
go to the respondents personally with the questionnaire, ask them the questions given therein, and
record their replies. They can be either face-to-face or telephone interviews.

Face-to-face and telephone interviews have many advantages. A good interviewer can stimulate
and maintain the respondent’s interest, and can create a rapport (understanding, concord) and
atmosphere conducive to the answering of questions. If anxiety is aroused, the interviewer can
allay it. If a question is not understood, an interviewer can repeat it and, if necessary (and in
accordance with guidelines decided in advance), provide an explanation or alternative wording. Optional
follow-up or probing questions that are to be asked only if prior responses are inconclusive or
inconsistent cannot easily be built into self-administered questionnaires. In face-to-face interviews,
observations can be made as well. In general, apart from their expense, interviews are preferable
to self-administered questionnaires, with the important proviso that they are conducted by skilled
interviewers.

3. Mailed Questionnaire Method: Under this method, the investigator prepares a questionnaire
containing a number of questions pertaining to the field of inquiry. The questionnaires are sent by
post to the informants together with a polite covering letter explaining in detail the aims and
objectives of collecting the information, and requesting the respondents to cooperate by furnishing
the correct replies and returning the questionnaire duly filled in. In order to ensure quick response,
the return postage expenses are usually borne by the investigator.

The main problems with postal questionnaires are that response rates tend to be relatively low, and
that there may be under-representation of less literate subjects.

4. Use of documentary sources: Clinical and other personal records, death certificates, published
mortality statistics, census publications, etc. Examples include:

1. Official publications of Central Statistical Authority

2. Publication of Ministry of Health and Other Ministries

3. Newspapers and journals.

4. International publications, such as those by WHO, the World Bank and UNICEF.

5. Records of hospitals or any Health Institutions.

During the use of data from documents, though they are less time consuming and relatively low in
cost, care should be taken over the quality and completeness of the data. There could be differences
in objectives between the primary author of the data and the user.

Problems in Gathering Data

It is important to recognize some of the main problems that may be faced when collecting data so that
they can be addressed in the selection of appropriate collection methods and in the training of the
staff involved.

Common problems might include:

• Language barriers

• Lack of adequate time

• Expense

• Inadequately trained and experienced staff

• Invasion of privacy

• Suspicion

• Bias (spatial, project, person, season, diplomatic, professional)

• Cultural norms (e.g. which may preclude men interviewing women)

Choosing a Method of Data Collection

Decision-makers need information that is relevant, timely, accurate and usable. The cost of
obtaining, processing and analyzing these data is high. The challenge is to find ways that lead
to information that is cost-effective, relevant, timely and important for immediate use. Some
methods pay attention to timeliness and reduction in cost; others pay attention to accuracy and
the strength of the method in using scientific approaches. Statistical data may be classified under
two categories, depending upon the source:

1) Primary data 2) Secondary data

Primary Data: are those data, which are collected by the investigator himself for the purpose of
a specific inquiry or study. Such data are original in character and are mostly generated by surveys
conducted by individuals or research institutions.

The first-hand information obtained by the investigator is more reliable and accurate, since the
investigator can extract the correct information by removing doubts, if any, in the minds of the
respondents regarding certain questions. High response rates might be obtained since the answers
to various questions are obtained on the spot. It permits explanation of questions concerning
difficult subject matter.

Secondary Data: When an investigator uses data, which have already been collected by others,
such data are called "Secondary Data". Such data are primary data for the agency that collected
them, and become secondary for someone else who uses these data for his own purposes.

The secondary data can be obtained from journals, reports, government publications, publications
of professionals and research organizations. Secondary data are less expensive to collect both in
money and time.

These data can also be better utilized and sometimes the quality of such data may be better because
these might have been collected by persons who were specially trained for that purpose.

On the other hand, such data must be used with great care, because such data may also be full of
errors due to the fact that the purpose of the collection of the data by the primary agency may have
been different from the purpose of the user of these secondary data.

Secondly, there may have been bias introduced, the size of the sample may have been inadequate,
or there may have been arithmetic or definition errors, hence, it is necessary to critically investigate
the validity of the secondary data.

In general, the choice of methods of data collection is largely based on the accuracy of the
information they yield. In this context, ‘accuracy’ refers not only to correspondence between the
information and objective reality - although this certainly enters into the concept -but also to the
information’s relevance. The issue is the extent to which the method will provide a precise
measure of the variable the investigator wishes to study.

The selection of the method of data collection is also based on practical considerations, such as:

1. The need for personnel, skills, equipment, etc. in relation to what is available and the
urgency with which results are needed.

2. The acceptability of the procedures to the subjects - the absence of inconvenience,
unpleasantness, or untoward consequences.
3. The probability that the method will provide a good coverage, i.e. will supply the required
information about all or almost all members of the population or sample. If many people
will not know the answer to the question, the question is not an appropriate one.

The investigator’s familiarity with a study procedure may be a valid consideration. It comes as no
particular surprise to discover that a scientist formulates problems in a way which requires for their
solution just those techniques in which he himself is specially skilled.

Types of Questions

Before examining the steps in designing a questionnaire, we need to review the types of questions
used in questionnaires. Depending on how questions are asked and recorded we can distinguish
two major possibilities - Open –ended questions, and closed questions.

Open-ended questions: Open-ended questions permit free responses that should be recorded
in the respondent’s own words. The respondent is not given any possible answers to choose from.

Such questions are useful to obtain information on:

➢ Facts with which the researcher is not very familiar,

➢ Opinions, attitudes, and suggestions of informants, or

➢ Sensitive issues.

For example

“Can you describe exactly what the traditional birth attendant did when your labor started?”

“What do you think are the reasons for a high drop-out rate of village health committee members?”

“What would you do if you noticed that your daughter (school girl) had a relationship with a
teacher?”

Closed Questions: Closed questions offer a list of possible options or answers from which the
respondents must choose. When designing closed questions one should try to:

➢ Offer a list of options that are exhaustive and mutually exclusive

➢ Keep the number of options as few as possible.

Closed questions are useful if the range of possible responses is known.

For example

“What is your marital status?

1. Single

2. Married/living together

3. Separated/divorced/widowed

“Have you ever gone to the local village health worker for treatment?”

1. Yes

2. No

Closed questions may also be used if one is only interested in certain aspects of an issue and does
not want to waste the time of the respondent and interviewer by obtaining more information than
one needs.

For example, a researcher who is only interested in the protein content of a family diet may ask:

“Did you eat any of the following foods yesterday? (Circle yes or no for each set of items)

Peas, bean, lentils Yes No

Fish or meat Yes No

Eggs Yes No

Milk or Cheese Yes No

Closed questions may be used as well to get the respondents to express their opinions by choosing
rating points on a scale.

For example

“How useful would you say the activities of the Village Health Committee have been in the
development of this village?”

1. Extremely useful Ο

2. Very useful Ο

3. Useful Ο

4. Not very useful Ο

5. Not useful at all Ο

Requirements of questions

Must have face validity – that is, the question we design should be one that gives an obviously
valid and relevant measurement of the variable. For example, it may be self-evident that records
kept in an obstetrics ward will provide a more valid indication of birth weights than information
obtained by questioning mothers.

Must be clear and unambiguous – the way in which questions are worded can ‘make or break’ a
questionnaire. Questions must be clear and unambiguous. They must be phrased in language that
it is believed the respondent will understand, and that all respondents will understand in the same
way. To ensure clarity, each question should contain only one idea; ‘double-barrelled’ questions
like ‘Do you take your child to a doctor when he has a cold or has diarrhoea?’ are difficult to
answer, and the answers are difficult to interpret.

Must not be offensive – whenever possible it is wise to avoid questions that may offend the
respondent, for example those that deal with intimate matters, those which may seem to expose
the respondent’s ignorance, and those requiring him to give a socially unacceptable answer.

The questions should be fair - They should not be phrased in a way that suggests a specific answer,
and should not be loaded. Short questions are generally regarded as preferable to long ones.

Sensitive questions - It may not be possible to avoid asking ‘sensitive’ questions that may offend
respondents, e.g. those that seem to expose the respondent’s ignorance. In such situations the
interviewer (questioner) should do it very carefully and wisely.

Steps in Designing a Questionnaire

Designing a good questionnaire always takes several drafts. In the first draft we should concentrate
on the content. In the second, we should look critically at the formulation and sequencing of the
questions. Then we should scrutinize the format of the questionnaire.

Finally, we should do a test-run to check whether the questionnaire gives us the information we
require and whether both the respondents and we feel at ease with it. Usually the questionnaire
will need some further adaptation before we can use it for actual data collection.

Step1: CONTENT

Take your objectives and variables as your starting point. Decide what questions will be needed to
measure or to define your variables and reach your objectives. When developing the questionnaire,
you should reconsider the variables you have chosen, and, if necessary, add, drop or change some.
You may even change some of your objectives at this stage.

Step 2: FORMULATING QUESTIONS

Formulate one or more questions that will provide the information needed for each variable.

Take care that questions are specific and precise enough that different respondents do not interpret
them differently. For example, a question such as: “Where do community members usually seek
treatment when they are sick?” cannot be asked in such a general way because each respondent
may have something different in mind when answering the question:

➢ One informant may think of measles with complications and say he goes to the hospital;
another may think of a cough and say he goes to the private pharmacy;

➢ Even if both think of the same disease, they may have different degrees of seriousness in
mind and thus answer differently;

➢ In all cases, self-care may be overlooked.

The question, therefore, as a rule has to be broken up into different parts and made so specific that
all informants focus on the same thing. For example, one could:

➢ Concentrate on illness that has occurred in the family over the past 14 days and ask what
has been done to treat it from the onset; or
➢ Concentrate on a number of diseases, ask whether they have occurred in the family over
the past X months (chronic or serious diseases have a longer recall period than minor
ailments) and what has been done to treat each of them from the onset.

Check whether each question measures one thing at a time.

For example, the question, ''How large an interval would you and your husband prefer between
two successive births?'' would better be divided into two questions because husband and wife may
have different opinions on the preferred interval.

Avoid leading questions.

A question is leading if it suggests a certain answer. For example, the question, ''Do you agree that
the district health team should visit each health center monthly?'' hardly leaves room for “no” or
for other

options. Better would be: “Do you think that district health teams should visit each health center?
If yes, how often?” Sometimes, a question is leading because it presupposes a certain condition.
For example: “What action did you take when your child had diarrhoea the last time?” presupposes
the child has had diarrhoea. A better set of questions would be: “Has your child had diarrhoea? If
yes, when was the last time?” “Did you do anything to treat it? If yes, what?”

Step 3: Sequencing of Questions

Design your interview schedule or questionnaire to be “consumer friendly.”

The sequence of questions must be logical for the respondent and allow as much as possible for a
“natural” discussion, even in more structured interviews.

➢ At the beginning of the interview, keep questions concerning “background variables” (e.g.,
age, religion, education, marital status, or occupation) to a minimum. If possible, pose most
or all of these questions later in the interview. (Respondents may be reluctant to provide
“personal” information early in an interview).

➢ Start with an interesting but non-controversial question (preferably open) that is directly
related to the subject of the study. This type of beginning should help to raise the informants’
interest and lessen suspicions concerning the purpose of the interview (e.g., that it will be
used to provide information to use in levying taxes).

➢ Pose more sensitive questions as late as possible in the interview (e.g., questions pertaining
to income, sexual behavior, or diseases with stigma attached to them).

➢ Use simple everyday language.

Make the questionnaire as short as possible. Conduct the interview in two parts if the nature of the
topic requires a long questionnaire (more than 1 hour).

Step 4: formatting the questionnaire

When you finalize your questionnaire, be sure that:

➢ Each questionnaire has a heading and space to insert the number, date and location of the
interview, and, if required, the name of the informant. You may add the name of the
interviewer to facilitate quality control.

➢ Layout is such that questions belonging together appear together visually. If the
questionnaire is long, you may use subheadings for groups of questions.

➢ Sufficient space is provided for answers to open-ended questions.

➢ Boxes for pre-categorized answers are placed in a consistent position on the page.

Your questionnaire should be not only consumer-friendly but also user-friendly!

Step 5: Translation

If interviews will be conducted in one or more local languages, the questionnaire has to be translated
to standardize the way questions will be asked. After having it translated you should have it
retranslated into the original language. You can then compare the two versions for differences and
make a decision concerning the final phrasing of difficult concepts.

5.5 Activity
Design a questionnaire to collect data for your research topic of your choice.

UNIT 6: ESTIMATION AND HYPOTHESIS TESTING
6.1 Introduction

Researchers have a variety of questions they want to answer, such as whether a particular feed
ingredient improves the growth performance of livestock or a particular fertilizer enhances plant
growth. Additionally, managers may want to determine if a new management method is more
effective than a traditional one. These inquiries can be addressed using statistical hypothesis testing,
a method for assessing statements about a population and making decisions based on them.

6.2 Learning outcomes

At the end of this unit, you should be able to:


1. Explain the concepts of estimation and statistical hypothesis in the context of
statistics.
2. Come up with the null and alternative Hypothesis.
3. Differentiate and describe the potential results or outcomes of a hypothesis test.
4. Calculate the level of confidence associated with biological data
5. Outline the sequential steps involved in the process of hypothesis testing.
6. Elaborate on the relationship between type I and type II errors in statistical
hypothesis testing.

6.3 Time Frame

• Self-study of 2 hours
• Lecture lesson of 2 hours

6.4 Content

Estimation

Estimation involves utilizing an estimator to generate an approximation of a parameter. Estimation
and hypothesis testing have a close connection. An estimator refers to any statistic employed to
approximate a parameter, while an estimate is any particular statistic value. To illustrate this, the
sample mean x̄ can be utilized to estimate the population mean (μ). A point estimate is obtained
using a single value to estimate a population parameter, such as s = 20. On the other hand, an
interval estimate is derived by employing a range of values to approximate a population parameter.

For example, a range of values from 18 to 22 can be used to estimate the parameter; this
differs from a point estimate, which relies on a single value.

Statistical Hypothesis

Null and alternative hypotheses are two key concepts in statistical hypothesis testing:

Null Hypothesis: The null hypothesis, denoted by H0, is a statement that assumes there is no
significant difference between a population parameter and a sample statistic. It represents the status
quo or the default position considered to be true unless evidence suggests otherwise. For example,
the null hypothesis for a study on a new medication could be that it has no effect on lowering blood
pressure. A statistical hypothesis can either be accepted or rejected, depending on whether the
sample data supports it. If the data closely aligns with the hypothesis, it is accepted, but if it doesn't,
it is rejected. When the sample data contradicts the null hypothesis, the alternative hypothesis is
considered, which is the conclusion reached upon rejecting the null hypothesis.

Alternative Hypothesis: The alternative hypothesis, denoted by Ha, is a statement that contradicts
the null hypothesis and proposes that there is a significant difference between a population
parameter and a sample statistic. It represents the claim or hypothesis that the researcher wants to
support. For example, the alternative hypothesis for the same medication study could be that it
significantly lowers blood pressure.

Unless there is adequate evidence to support the alternative hypothesis, the null hypothesis is taken to
be true in hypothesis testing. The level of significance, or alpha (α), is the probability of rejecting the
null hypothesis when it is in fact true. The null hypothesis is rejected in favor of the alternative
hypothesis if the test statistic falls inside the rejection region, which is established by the alpha level
and the degrees of freedom.

Suppose a researcher wants to test whether a new diet plan results in significant weight loss. The
null hypothesis could be that the diet plan does not affect weight loss, while the alternative
hypothesis could be that it leads to significant weight loss. If the test statistic falls within the
rejection region at the alpha level of 0.05, the null hypothesis would be rejected, and the researcher
would conclude that the diet plan leads to significant weight loss.

Let's consider an example where the aim is to examine the hypothesis that a population mean (μ)
is equal to 22. The hypothesis will be stated as Ho: μ = 22, where μ represents the true value and
22 is the assumed value. As a result, we have three potential alternative hypotheses:

1. H1: μ ≠ 22, which corresponds to a two-tailed test.

2. H1: μ > 22, which corresponds to a right-tailed test.

3. H1: μ < 22, which corresponds to a left-tailed test.

A one-tailed test is one in which the inequality sign in the alternative hypothesis is either ">" (right-tailed)
or "<" (left-tailed). The null hypothesis is rejected if the test value falls within the critical region on one
side of the mean.

On the other hand, a two-tailed test involves rejecting the null hypothesis if the test value falls
within either of the two critical regions. The test value is a numerical outcome obtained from a
statistical test.

The Level of Confidence

The confidence level, confidence coefficient, or degree of confidence is the probability level
associated with an interval estimate. It is called "confidence" because it measures the degree of
certainty that a particular estimation method will generate an estimate that includes the true
population parameter, denoted by μ. The commonly used confidence levels in interval estimation
are 90%, 95%, and 99%, as they indicate the level of certainty associated with the estimation
process.

Table 6.1: Probability Levels and Interval Estimates

Probability level   Confidence coefficient (%)   z-value   Form of the interval estimate
0.1                 90                           1.64      x - 1.64σx < μ < x + 1.64σx
0.05                95                           1.96      x - 1.96σx < μ < x + 1.96σx
0.01                99                           2.58      x - 2.58σx < μ < x + 2.58σx

Where: x = Sample mean

μ = Population mean

σx = standard error of mean x.
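As an illustration of Table 6.1, the following is a minimal Python sketch (with hypothetical sample values; the z-values are those tabulated above) that computes the three interval estimates:

from math import sqrt

x_bar = 176.0   # hypothetical sample mean
sigma = 12.0    # hypothetical population standard deviation
n = 36          # hypothetical sample size

se = sigma / sqrt(n)  # standard error of the mean

# z-values taken from Table 6.1 for the 90%, 95% and 99% levels
for conf, z in [(90, 1.64), (95, 1.96), (99, 2.58)]:
    lower, upper = x_bar - z * se, x_bar + z * se
    print(f"{conf}% interval estimate: {lower:.2f} < mu < {upper:.2f}")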

Figure 6.1: Normal curve showing acceptance and rejection regions with a significance level (α)
of 0.05

A significant difference refers to the extent of difference between the sample mean (x) and
the hypothesized population mean (μHo) that results in rejecting the null hypothesis, because
such a difference has a probability of 5% or less of occurring by chance. In a two-tailed test, if the
null hypothesis cannot be accepted, it is concluded that the hypothesized and true values are not
equal.

The critical or rejection region is the range of values of the test statistic that indicates a significant
difference and leads to the rejection of the null hypothesis. This suggests that the degree of
difference between the two means cannot be entirely explained by chance.

Steps in Hypothesis Testing

There are various steps involved in the hypothesis testing process. It starts by stating the research
question, determining whether the data are continuous or discrete, and outlining the null and alternative
hypotheses. The next step is the study design, which entails picking the best statistical test,
setting its significance level, and coming up with a strategy for carrying it out.
Following the research design, the data are gathered by carrying out the study. After gathering the data,
the next step is to analyze them and decide whether or not to reject the null hypothesis. The results
are then summarized.

Type I and Type II Errors

There are four possible outcomes in a hypothesis-testing scenario. That is, two potential outcomes
for bad decisions and two for good ones. In the table below, these are listed.

A Type I error occurs when the null hypothesis is rejected even though it is true. In other words, it
is the probability of incorrectly rejecting a true hypothesis.

Conversely, a Type II error occurs when the null hypothesis is accepted even though it is false. This
entails accepting a false hypothesis as true.

            Ho ACCEPTED       H1 ACCEPTED
Ho True     Right decision    Type I error
H1 True     Type II error     Right decision


TESTING A HYPOTHESIS INVOLVING A MEAN

1. Determine the mean and standard deviations for the sample


2. The null hypothesis is as follows: Ho : x = μ
3. Approximate the standard error of the mean (σx) using the sample standard deviation (s):
σx = s/√n.
4. Determine the range: μ ± z·σx.
5. Verify whether the value of x is within the range, and
6. If it is, accept the Ho; otherwise, reject the Ho.

Examples:

Assuming the given sample values are: x = 454, n = 120, standard deviation (S) = 27, μ = 460, and
α = 0.05 (95% confidence), we can proceed with the solution.

1. State the null hypothesis: The null hypothesis is Ho: x = μ, indicating no difference between
the sample mean and the population mean.

2. Calculate the standard error: σx = 27/√120 = 2.46

3. Determine the range: Using the formula μ ± 1.96σx, we get 460 - 1.96 × 2.46 to
460 + 1.96 × 2.46. The range is 455.2 to 464.8.

4. Evaluate the sample mean, x = 454: Since the sample mean does not fall within the range
of 455.2 to 464.8, the null hypothesis is rejected. This indicates a significant difference
between the sample mean x and the population mean μ.

Example 2: Given the mean milk yield of 10 cows as 176.10 kg with a standard deviation of 3.88,
we need to estimate the 95% confidence limit for the mean yield of milk.

1. State the parameters: x̄ = 176.10 kg, α = 0.05, n = 10, s = 3.88

2. Use the t-distribution: Since the sample size is less than 30 (n = 10), we use the t-
distribution instead. The value for t9 at a 0.05 level is found from the t-distribution table
and is determined to be 2.262.

3. Calculate the standard error: s / √n = 3.88 / √10 = 1.23

4. Determine the confidence limits:
L1 = x̄ - tn-1 (s / √n) = 176.10 – 2.262 (1.23) = 173.318
L2 = x̄ + tn-1 (s / √n) = 176.10 + 2.262 (1.23) = 178.882

5. Interpret the results: At a 95% confidence level, the true population mean milk yield (μ)
lies between the limits of 173.318 kg and 178.882 kg, with the sample mean of 176.10 kg at
the centre of this interval.
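A short Python sketch of this computation, assuming scipy is available to supply the t critical value:

from math import sqrt
from scipy import stats

x_bar, s, n, alpha = 176.10, 3.88, 10, 0.05  # values from the example

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # ≈ 2.262 for 9 d.f.
se = s / sqrt(n)                               # ≈ 1.23

lower, upper = x_bar - t_crit * se, x_bar + t_crit * se
print(f"95% confidence limits: {lower:.3f} to {upper:.3f} kg")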

Testing A Hypothesis Involving Two Means


A representative group of 158 crossbred animals with an average age of 15 months was
examined in terms of several physical traits. A control group of 94 purebred animals of
comparable management and age was also examined. The sample means and standard
deviations of two traits shared by the two groups are provided in the table below.

Trait             Crossbred                 Purebred

Birth weight      x1 = 7.04, S1 = 1.2       x2 = 7.19, S2 = 0.9

One-year weight   x1 = 23.3, S1 = 2.8       x2 = 21.9, S2 = 3.0

(a) Use the information to test the null hypothesis Ho: μ1 = μ2, at α = 0.05, for

i. birth weight and

ii. one-year weight.

(b) What inferences can be made based on the findings in (a) i and ii?

Solution:

State the null hypothesis:

There is no difference between crossbred and purebred animals in birth weight and one-year weight in
the two populations.

The Z formula for comparing two means is given by:

Z = [(x̄1 – x̄2) – (µ1 – µ2)] / √(σ1²/n1 + σ2²/n2)

Where: x̄1 = the first sample's mean

x̄2 = the second sample's mean

µ1= The initial population's mean

µ2= The second population's mean

σ1² = the first population's variance

σ2² = the second population's variance

n1= sample size for the first group

n2= Sample size for the second group

Therefore,

(a) i. Birth weight: Z = (7.04 – 7.19)/√(1.2²/158 + 0.9²/94) = -0.15/0.133 = -1.13

ii. One-year weight: Z = (23.3 – 21.9)/√(2.8²/158 + 3.0²/94) = 1.4/0.381 = 3.67

Decision:
1. The null hypothesis is accepted since the Z-score (-1.13) for birth weight falls within the
non-critical region (which is between -1.96 and +1.96). This indicates that the disparity
between crossbreeds and pure breeds in terms of birth weight is not statistically significant.
2. The null hypothesis is rejected as the Z-score (3.67) for one year weight lies within the
critical region (outside the range of -1.96 to +1.96). Therefore, there is a statistically
significant difference between cross breeds and pure breeds regarding one-year weight.

(b) Based on the above decisions, it can be concluded that crossbred and purebred animals do not
differ significantly at birth, but a notable characteristic of crossbred animals is a significantly
greater weight gain during the first year of life.
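A minimal Python sketch of this two-sample Z computation (numbers from the example; the 1.96 cut-off corresponds to α = 0.05):

from math import sqrt

def two_sample_z(x1, x2, s1, s2, n1, n2):
    # Z statistic for comparing two means with large samples
    return (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)

z_birth = two_sample_z(7.04, 7.19, 1.2, 0.9, 158, 94)  # ≈ -1.13
z_year = two_sample_z(23.3, 21.9, 2.8, 3.0, 158, 94)   # ≈ 3.67

for trait, z in [("Birth weight", z_birth), ("One-year weight", z_year)]:
    decision = "reject Ho" if abs(z) > 1.96 else "accept Ho"
    print(f"{trait}: Z = {z:.2f} -> {decision} at alpha = 0.05")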

TWO-TAILED TEST – When σx is known (‘n’ more than 30)

Assume, for example, that a population's mean is 500 and that σ = 50. Conduct a two-tailed test
with α = 0.01 if a sample of 36 has a mean of 475.

Solution

These are the hypotheses: The null hypothesis is Ho: μ = 500.

The alternative hypothesis is H1: μ ≠ 500.

The z-value is 2.58 when α= 0.01

Accept the Ho if the Critical ratio (CR) falls between ± 2.58. OR Reject the Ho and accept the H1
if CR is greater than + 2.58 (CR > +2.58) or less than -2.58 (CR < -2.58).

Compute the CR

CR = (x – μHo)/(σ/√n)

=( 475 – 500)/ (50/√36)

= -3.00

Conclusion: We reject the null hypothesis (Ho) and accept the alternative hypothesis (H1) because
the CR value (-3.00) is smaller than -2.58. It is important to note that the z-distribution is applicable
in determining rejection regions only when the sample size (n) is greater than 30. However, if the
sample size is 30 or less, the sampling distribution takes the form of t-distribution.

Two-Tailed Test – When σx is unknown (‘n’ less than 30)

Assuming that the following information is provided: n = 13, s = 5, x = 608, μHo = 612, and
α = 0.05. Conduct a two-tailed test.

Solution:

Ho: μ = 612

H1: μ ≠ 612

We choose the t-distribution because n is less than 30 (i.e., n = 13) and σ is unknown; our
α = 0.05.

The degrees of freedom (DF) are first calculated as n – 1, or 13 – 1 = 12.

The t-value from the table at 12 DF and α = 0.05 (two-tailed, so 0.05/2 = 0.025 in each tail) is
t.025 = 2.179.
According to the decision rule, accept the Ho if the critical ratio (CR) is between ±2.179; if the
CR is greater than +2.179 or less than -2.179, reject the Ho and accept the H1.

σx = s/√n = 5/√13 = 1.39

As a result, CR = (x - μHo)/(s/√n) = (608 - 612)/1.39 = -2.88.

Conclusion: Given that the calculated critical ratio (CR = -2.88) is smaller than -2.179, we conclude
that the null hypothesis is rejected in favor of the alternative hypothesis.
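The same test can be sketched in Python, assuming scipy is available for the t critical value:

from math import sqrt
from scipy import stats

n, s, x_bar, mu0, alpha = 13, 5, 608, 612, 0.05  # values from the example

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # ≈ 2.179 for 12 d.f.
cr = (x_bar - mu0) / (s / sqrt(n))             # ≈ -2.88

print("Reject Ho" if abs(cr) > t_crit else "Accept Ho")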

Left – Tailed Test


Suppose that the following null hypothesis is presented to us: Ho: μ = 100. Use a left-tailed test
with α = 0.05, σ = 15, n = 36, and x = 88.
Solution:
State the hypotheses: Ho: μ = 100 and H1: μ < 100.
The critical z-value at α = 0.05 is -1.64.
Accept the Ho if CR is greater than or equal to -1.64. If CR is less than -1.64, reject Ho and accept
H1.
CR = (88 – 100)/(15/√36) = -4.8
We reject the Ho and accept the H1 since the CR value (-4.8) is less than -1.64.

Student’s T-Distribution (The T-Test)


A statistical probability distribution called the Student's t-distribution is frequently applied in
hypothesis testing. When the sample size is small, and the population standard deviation is
unknown, this modified variant of the normal distribution is utilized.

The t-test is a statistical test that uses the t-distribution to compare means and determine whether
they differ significantly from one another. It is used to investigate either the mean of a single
population or the difference between the means of two populations.

The one-sample t-test and the two-sample t-test are the two different forms of t-tests.

The one-sample t-test is used to test the mean of a single population. For instance, it might be
used to test whether the average height of all pupils in a school is 6 feet, i.e. whether the sample
mean differs significantly from 6 feet.

The two-sample t-test is employed to test hypotheses regarding the difference between the means
of two populations. For instance, if we wanted to test whether there is a significant difference in
the average height of male and female pupils in a school, a two-sample t-test would be used to
evaluate whether the difference in the sample means is statistically significant.

In both scenarios, the t-test is employed to calculate a t-value, which is then compared to a critical
value determined by the degrees of freedom and the desired significance level. If the t-value
exceeds the critical value, the null hypothesis is rejected, and the alternative hypothesis is accepted.

When the population standard deviation is unknown and the sample size is smaller than 30, the z-
test is not suitable for hypothesis testing. Instead, the t-test is utilized.

The t-test is a statistical test specifically designed for determining the mean of a population. It is
applied when the population follows a normal or approximately normal distribution, the population
standard deviation (σ) is unknown, and the sample size (n) is less than 30.
Formula for t-test: tn-1 = (x̄ – μ) / (s/√n)

Where, x, μ, s, and n are Sample mean, Population mean, Sample standard deviation, and sample
size, respectively. The process of testing hypotheses using the t-test is similar to that of the z-test,
with the only difference being the utilization of the t-table instead of the z-table.
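As an illustration, scipy offers ready-made versions of both t-tests. A sketch with made-up sample data (the heights are hypothetical):

from scipy import stats

# one-sample t-test: is the mean height 6 feet? (hypothetical data)
heights = [5.8, 6.1, 5.9, 6.3, 6.0, 5.7, 6.2]
t1, p1 = stats.ttest_1samp(heights, popmean=6.0)
print(f"One-sample t-test: t = {t1:.3f}, p = {p1:.3f}")

# two-sample t-test: do male and female heights differ? (hypothetical data)
males = [5.9, 6.2, 6.1, 6.4, 6.0]
females = [5.5, 5.8, 5.6, 5.9, 5.7]
t2, p2 = stats.ttest_ind(males, females)
print(f"Two-sample t-test: t = {t2:.3f}, p = {p2:.3f}")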

6.5 Activity

1. Given Ho: μ = 151, provide a list of the three probable alternative hypotheses.
2. From a population of housefly wing lengths, a sample of 35 has a mean of 44.8 and a
known population standard deviation of 3.90. Calculate the confidence limits at 95% and
99%.
3. What are the consequences or implications of making a Type II error in hypothesis testing?
4. Given a population mean of 500 and a known population standard deviation of 50, if a
sample of 36 has a mean of 475, perform a two-tailed test with a significance level (α) of
0.01.

UNIT 7: CONTINGENCY TABLES
7.1 Introduction

Contingency tables are commonly created to examine the connection between multiple variables
used for classification. Researchers often seek to determine whether these variables are
independent or if there is an association between them. The chi-square (χ2) test provides a means
to assess the hypothesis of independence, allowing researchers to determine if the variables are
indeed independent or not (referred to as the Independence test).

7.2 Learning outcomes

By the conclusion of this unit, you will have the capacity to:

1. Provide a definition of the chi-square distribution.

2. Identify the characteristics and procedures involved in computing chi-square.

3. Elucidate the significance of the goodness-of-fit test.

4. Outline the steps in conducting the goodness-of-fit test and employ them to make statistical
conclusions concerning the population.

7.3 Time Frame

• Self-study of 2 hours
• Lecture lesson of 2 hours

7.4 Content

Chi-Square Distribution

Chi-square is a statistical test commonly used to determine the association between two categorical
variables. It is a non-parametric test, meaning it does not make any assumptions about the
underlying distribution of the variables being tested. The test calculates the difference between the
observed and expected frequencies under the independence assumption. This difference is then
squared, summed across all categories, and divided by the expected frequencies to yield the chi-
square statistic. The resulting statistic can then be compared to a critical value from the chi-square
distribution to determine whether the association between the variables is statistically significant
or not.
The chi-square goodness-of-fit test relies on certain assumptions, including:

The data is collected from a randomly selected sample.

Each category in the data should have an expected frequency of 5 or greater.

The chi-square value, represented by χ2, is calculated using the formula:

χ2 = Σ (O – E)²/E

Where:

❖ Χ2 denotes the chi-square test statistic.

❖ The summation operator (Σ) signifies taking the sum over all categories.

❖ O represents the observed frequency in the data.

❖ E represents the expected frequency in the data.
❖ A significantly small Chi-square test statistic indicates strong agreement between the
observed and the expected data.

❖ A significantly large Chi-square test statistic indicates a poor match between the observed
and the expected data. If the chi-square value is large, it leads to the rejection of the null
hypothesis.
Properties of Chi-Square

1. Chi-square deals with normally distributed observations to estimate variance.


2. As the sample size (n) increases, Chi-square tends to become normal. It has n - 1 degrees
of freedom.
3. Chi-square is non-symmetrical.
4. Chi-square can take any value from zero to infinity.
5. Chi-square is always positive and additive.
6. A chi-square test can be based on a one-way or a two-way classification.

Chi-square testing

Chi-square testing involves comparing the differences between the sample frequencies and the
expected frequencies of occurrences or percentages. The following are steps in the general Chi-
square testing procedure:

1. Indicate both the null and alternative hypotheses.


2. Choose the significance level to be employed.
3. Obtain random samples and record the observed frequencies.
4. Compute the expected frequencies if the null hypothesis is true.
5. Compute the Chi-square value using the observed and expected frequencies.
6. Compare the calculated Chi-square value with the Chi-square table value at the
specified level of significance.

Example

During a study evaluating the efficacy of three distinct rat traps, the researchers recorded the
number of rats caught in each type of trap throughout the duration of the experiment as shown:

Type 1 Type 2 Type 3 Total


20 54 30 104

The frequencies observed in this case are 20, 54, and 30. We are concerned with establishing
whether the higher number of rats caught in trap type 2 represents a significant difference or if the
difference could be attributed to chance variation or sampling error. Additionally, we want to
assess whether the distribution of frequencies among the types of traps is even (homogeneous).

Based on these considerations, we can formulate the following hypotheses:

Ho: The recorded frequencies are similar, and any deviation can be explained by chance variation
or sampling error.

H1: The observed frequencies deviate from an expected homogenous distribution in a manner that
cannot be accounted for by sampling error alone.

If Ho is true, we can determine the expected frequencies. In this case, if the frequencies followed
a homogenous distribution, we would expect the 104 rats to be equally distributed among the three
trap designs, resulting in an expected frequency of 34.67 rats for each design.

To calculate the Chi-square (χ2) statistic, we can summarize the frequencies as follows:

Type   Observed frequency   Expected frequency   Obs. – Exp.   (Obs. – Exp.)²/Exp.
1      20                   34.67                -14.67        6.21
2      54                   34.67                19.33         10.78
3      30                   34.67                -4.67         0.63
                                                               χ2 = 17.62

The calculated χ2 = 17.62 can then be compared with the critical or table χ2 value at the 0.05 or
0.01 levels of significance.

The degrees of freedom for this test can be calculated as n - 1, where n is the number of categories.
We have 3 categories in our case, so the degrees of freedom are 3 - 1 = 2. Referring to the Chi-
square table for 2 degrees of freedom, the critical value is 5.99 at a significance level of 0.05 (5%)
and 9.21 at a significance level of 0.01 (1%).

When we compare these critical values with the computed Chi-square value, our calculated
Chi-square value of 17.62 exceeds the critical value at both the 0.05 and 0.01 significance levels.
Consequently, we may say that the discrepancy between the observed and expected frequencies is
statistically significant. With a Chi-square value of 17.62 at the 0.05 significance level, the
experiment shows that the traps differ in effectiveness, with trap type 2 catching significantly
more rats than the other two trap types examined.
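This goodness-of-fit test can be sketched with scipy (observed counts from the example; equal expected frequencies are scipy's default):

from scipy import stats

observed = [20, 54, 30]               # rats caught in trap types 1, 2, 3
chi2, p = stats.chisquare(observed)   # expected defaults to 104/3 ≈ 34.67 each
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")  # chi-square ≈ 17.62

crit = stats.chi2.ppf(0.95, df=2)     # critical value ≈ 5.99 at alpha = 0.05
print("Reject Ho" if chi2 > crit else "Accept Ho")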

Example

The progeny resulting from a specific cross exhibited colors in the ratio of 9:3:4, with the colors
being Red, Black, and White, respectively. Assuming that the experiment produced 75, 33, and 39
offspring in those respective categories, can we determine if the theory is supported?

           Red   Black   White   Total

Observed   75    33      39      147
Ratio      9     3       4       16

The expected frequencies are calculated as follows:

Red: 9/16 x 147 = 82.7 ≈ 83

Black: 3/16 x 147 = 27.6 ≈ 28

White: 4/16 x 147 = 36.8 ≈ 37

χ2 = (75 – 83)²/83 + (33 – 28)²/28 + (39 – 37)²/37

χ2 = 0.77 + 0.89 + 0.11 = 1.77

The d.f. = n – 1 = 3 – 1 = 2

The obtained Chi-square value of 1.77 is less than the critical Chi-square value of 5.99 from the
table at a significance level of α = 0.05 and degrees of freedom (d.f.) equal to 2. Hence, we can
conclude that the observed numbers of offspring in the three colors are consistent with the given
9:3:4 ratio. In other words, we fail to reject the null hypothesis.
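A sketch of this ratio test in Python, with the expected frequencies derived from the 9:3:4 ratio rather than rounded by hand:

from scipy import stats

observed = [75, 33, 39]                                     # Red, Black, White
ratio = [9, 3, 4]
expected = [r / sum(ratio) * sum(observed) for r in ratio]  # ≈ [82.7, 27.6, 36.8]

chi2, p = stats.chisquare(observed, f_exp=expected)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
# chi-square is well below the critical value 5.99, so we fail to reject Ho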

7.5 Activity

1. When can you apply Chi-Square?


2. Is the Chi-square test a one-tailed (left- or right-tailed) test or a two-tailed test?
3. Calculate the chi-square statistic, given a random sample of 600 Zampens has yielded the
following:
212 of the Zampens are blue.
147 of the Zampens are orange.
103 of the Zampens are green.
50 of the Zampens are red.
46 of the Zampens are yellow.
42 of the Zampens are brown.
Calculate the chi-square statistic.

UNIT 8: CORRELATION, REGRESSION AND COVARIANCE
8.1 INTRODUCTION
Correlation and regression are additional branches of inferential statistics that aim to ascertain the
presence of a connection between two or more numerical or quantitative variables. They involve
examining whether two characteristics are linked by studying them simultaneously across all
individuals in a population. For example, a researcher might want to investigate the correlation
between height and yield in maize plants or may seek to determine if there is a relationship between
body weight and shank length in chickens. Hence, correlation and regression analyses are
employed to quantify the association between two variables in a data set involving two
characteristics.
8.2 Learning outcomes:
At the end of this Unit, you should be able to:
1. Explain the terms regression, bivariate distribution and correlation.
2. Describe the different kinds of correlations and regressions.
3. Describe any potential connections between the variables.
4. For a bivariate data set, create a scatter diagram.
5. Calculate correlation and regression
8.3 Time Frame
• Self-study of 2 hours
• Lecture lesson of 2 hours
8.4 Content
Correlation

Correlation is a statistical method used to assess the strength and intensity of a linear association
between two continuous variables in a bivariate normal distribution drawn from the same
population.
When analyzing a sample of data, the correlation coefficient is computed to quantify the magnitude
and direction of the linear relationship between the variables, denoted by the symbol "r". On the
other hand, the population correlation coefficient, represented by the Greek letter ρ (rho), describes
the correlation in the entire population.

Formula

r = Σ((Xᵢ − X̄)(Yᵢ − Ȳ)) / √(Σ(Xᵢ − X̄)² · Σ(Yᵢ − Ȳ)²)

Or, in terms of deviations from the means (x = X − X̄, y = Y − Ȳ):

r = Σxy / √(Σx² · Σy²)
Where:

Xᵢ and Yᵢ are individual data points of the two variables.


X̄ and Ȳ are the means of the X and Y variables, respectively.
Σ represents the sum over all data points.
The numerator of this formula sums, over all data points, the product of each variable's deviation
from its mean, which is proportional to the covariance of X and Y. It is then divided by the square
root of the product of the sums of squared deviations of X and Y, which is proportional to the
product of their standard deviations.

The resulting correlation coefficient, "r," will be between -1 and 1. A value of -1 indicates a perfect
negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation.
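As an illustration, the correlation coefficient can be computed directly in Python; the rat height and weight figures below are hypothetical:

import numpy as np

height = np.array([10.2, 11.5, 12.1, 13.0, 14.2])  # hypothetical rat heights (cm)
weight = np.array([210, 235, 248, 262, 290])       # hypothetical rat weights (g)

r = np.corrcoef(height, weight)[0, 1]  # off-diagonal entry of the 2x2 matrix
print(f"r = {r:.3f}")                  # near +1: strong positive linear correlation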
Regression
Regression is a statistical technique used to characterize the relationships between variables,
whether they are positive or negative, linear or nonlinear. It provides insights into how two
variables exhibit a significant correlation and are associated with each other through a regression
line.
A regression line, commonly referred to as the least-squares line, is the line that best fits the data
points of a scatter diagram. It always passes through the point (X̄, Ȳ). Using the regression line,
we can estimate or predict the value of one variable from a known value of the other. The following
equation represents the general formula for a fitted regression line.

Y = a + bX

Where:

Y represents the dependent variable, which is the variable being predicted.

X represents the independent variable used to predict Y.

a represents the y-intercept, which is the value of Y when X = 0.

b represents the slope, indicating the change in Y for a unit change in X.
In multiple linear regression, where there are more than one independent variable, the formula is
extended as:
Y = a + b₁X₁ + b₂X₂ + ... + bₙXₙ

Where:
Y represents the dependent variable
X₁, X₂, ..., Xₙ represent the independent variables
a represents the y-intercept
b₁, b₂, ..., bₙ represent the coefficients/slopes associated with each independent variable

The regression coefficient is estimated as:

b = Σ((X − X̄)(Y − Ȳ)) / Σ((X − X̄)²)
Where:

b represents the regression coefficient (slope)


X represents the independent variable
Y represents the dependent variable
X̄ represents the mean of the independent variable
Ȳ represents the mean of the dependent variable
Σ denotes the sum of the calculations over all data points

In multiple linear regression, the regression coefficient (b) for each independent variable is
calculated using similar principles, but the formula incorporates all the independent variables and
their respective means.

OR, using deviations from the means:

b = Σxy / Σx²

The variable (Y) in a regression analysis that is not easily manipulated or altered is known as the
dependent variable. The independent variable (X), on the other hand, is a variable that may be
changed or controlled. It is important to remember that deciding what variable is dependent or
independent is often not simple and is occasionally a judgment call.
For instance, consider a researcher examining the relationship between the amount of fertilizer
applied and crop yield. In this scenario, crop yield would be considered the dependent variable
since it is the outcome of interest that is being investigated. Conversely, the amount of fertilizer
applied would be considered the independent variable since it can be controlled or adjusted by the
researcher.

Regression b = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]

Or

b = Σxy/Σx²
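As a sketch, the slope and intercept can be computed in Python with the formula above, using the age and blood-pressure data tabulated in the next subsection:

import numpy as np

x = np.array([32, 43, 48, 56, 61, 67, 70])        # age (from the table below)
y = np.array([96, 128, 120, 135, 143, 144, 154])  # pressure

n = len(x)
b = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum()**2)
a = y.mean() - b * x.mean()   # the fitted line passes through (x̄, ȳ)
print(f"Fitted line: Y = {a:.2f} + {b:.2f}X")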

Scatter diagram/plot
A scatter diagram visualizes the ordered pairs (x, y) consisting of the independent variable X and
the dependent variable Y. It provides a graphical depiction of the relationship between x and y,
allowing us to observe any patterns or trends among the points. The presence of a clear pattern
indicates a closer relationship between the two variables.

For instance, let's create a scatter diagram to represent the given data.
Subject Age Pressure
x y
1 32 96
2 43 128
3 48 120
4 56 135
5 61 143
6 67 144
7 70 154
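Since the original figure is not reproduced here, the following matplotlib sketch draws the scatter diagram for these data:

import matplotlib.pyplot as plt

age = [32, 43, 48, 56, 61, 67, 70]
pressure = [96, 128, 120, 135, 143, 144, 154]

plt.scatter(age, pressure)         # one point per subject
plt.xlabel("Age (x)")
plt.ylabel("Pressure (y)")
plt.title("Scatter diagram of age against pressure")
plt.show()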

Types of correlation and regression


Correlation can be classified as simple or multiple. The focus of simple relationships is on the
relationship between two variables. For example, a researcher might be interested in investigating
the relationship between height and weight in a rat population.

On the contrary, multiple relationships entail investigating more than two variables. A biologist,
for example, might want to look into how varying feed protein levels, the amount of feed supplied
per day, and hours of lighting each day affect snail growth.
There are two types of regression analysis: linear regression and curvilinear regression. Linear
regression encompasses both basic linear regression (two variables) and multiple linear regression
(more than two variables). Curvilinear regression encompasses various forms, such as quadratic,
exponential, and logarithmic.
In simple relationships, there can be positive or negative correlations. A positive correlation
indicates that both variables increase or decrease together. In contrast, a negative correlation
implies that as one variable increases, the other decreases, and vice versa.
Simple correlation and simple linear regression can be further described as:
1. Positive correlation: where a rise in one variable is associated with a rise in the other, to
varying degrees.
2. Negative correlation: where a rise in one variable results in a drop in the other, to varying
degrees.
3. Perfect correlation: where a change in one variable is precisely mirrored by a change in the
other. If the two variables rise together, there is a perfect positive correlation; if one falls
as the other rises, there is a perfect negative correlation.
4. High correlation: where a change in one variable closely matches a change in the other.
5. Low correlation: where a change in one variable is only marginally mirrored by a change
in the other.
6. Zero correlation: where there is no relationship between changes in one variable and
changes in the other, indicating that the variables are unrelated.

SPURIOUS CORRELATION
When interpreting the correlation coefficient, denoted as 'r', it is crucial to recognize that a high
correlation does not necessarily imply a direct causal connection between the variables. In such
cases, the correlation is referred to as spurious or nonsense correlation. This phenomenon can
occur through two mechanisms:

(a) Indirect connection: The variables may not have a direct relationship with each other but may
be influenced by a common underlying factor or a third variable, leading to a false appearance of
correlation.

(b) Coincidences: Random chance can sometimes create a series of coincidental associations
between variables, giving the impression of a correlation when none truly exists.

Covariance
When assessing the association between two variables, we can refer to the resulting evaluation as
the covariance ('Cov') of the variables. Analyzing covariance helps to account for variability.
Covariance is determined by calculating the average of the products of the deviations of each
paired variable from the overall mean of the respective variable.
Cov xy = Σ(xᵢ − x̄)(yᵢ − ȳ) / n

WORKED EXAMPLE.
Given the provided values of length (X) and yield (Y) for five banana plants, estimate and interpret
the linear association between these variables using the deviation method.
Length (X) 3 5 7 6 3
Yield (Y) 11 13 16 14 6

Find the means for X and Y.


Mean of X = (3 + 5 + 7 + 6 + 3) / 5 = 4.8
Mean of Y = (11 + 13 + 16 + 14 + 6) / 5 = 12
Length (X)   Yield (Y)   Deviation x   Deviation y   x²      y²    Product xy
3            11          -1.8          -1            3.24    1     1.8
5            13          0.2           1             0.04    1     0.2
7            16          2.2           4             4.84    16    8.8
6            14          1.2           2             1.44    4     2.4
3            6           -1.8          -6            3.24    36    10.8
Total                                                12.8    58    24

Then we can compute the correlation (r) value.

r = 24 / √((12.8)(58)) = 24 / 27.25 = 0.88

The r value of 0.88 is high and positive. That entails that the yield increases with increasing
length of the banana plant.
The degrees of freedom (d.f) are computed as:
d.f. = n – 2
The d.f. = 5 – 2 =3
Comparing the calculated r (0.88) with the critical (table) r value at 3 degrees of freedom and
α = 0.05 (about 0.878), we can conclude that the linear association between the length and yield
of the banana plants is statistically significant at the 95% confidence level. In other words, there
is a significant positive relationship between the length of the plants and their yield.
This implies that, in general, as the length of the banana plants increases, the yield also tends to
increase, and this relationship is deemed statistically significant based on the analysis performed.
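A short Python sketch verifying the covariance and correlation for this worked example:

import numpy as np

x = np.array([3, 5, 7, 6, 3])      # length
y = np.array([11, 13, 16, 14, 6])  # yield

cov = np.mean((x - x.mean()) * (y - y.mean()))  # Σxy / n = 24/5 = 4.8
r = np.corrcoef(x, y)[0, 1]                     # ≈ 0.88
print(f"Cov(x, y) = {cov:.2f}, r = {r:.3f}")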
8.5 Activity
1. What does a correlation coefficient of -0.75 indicate about the relationship between two
variables? If the correlation coefficient is 0.20, can we conclude that there is a strong
relationship between the variables? Why or why not?
2. How does a negative correlation differ from a positive correlation in terms of the direction
of the relationship?
3. What is the purpose of regression analysis in statistics?
4. Explain the concept of the coefficient of determination (R-squared) in regression. What
does it tell us about the goodness of fit of the regression model?
5. If the regression equation is y = 2x + 3, what is the predicted value of y when x = 5?
6. Define covariance and explain its significance in the context of two variables.
7. What does a positive covariance between two variables indicate about their relationship?
8. Calculate the covariance between two variables with the following data points: X = [1, 2,
3, 4, 5] and Y = [3, 5, 7, 9, 11].

UNIT 9: SIMPLE EXPERIMENTAL DESIGN AND ANALYSIS OF VARIANCE (ANOVA)

9.1 Introduction
Over time, a vast array of experimental patterns has been developed, catering to various situations.
There are two primary aspects to consider when discussing experimental designs. Firstly, it
involves determining the appropriate structure for the experiment, taking into account both
knowledge of the experimental subject and the specific research questions being posed. Secondly,
it entails analyzing the collected data, typically summarized using an analysis of variance
(ANOVA) table. The design and the analysis of an experiment significantly affect each other, as each
design has a preferred method of data analysis that aligns with it. In essence, for every design,
there exists a single well-suited approach to evaluating the data.

9.2 Learning outcomes

At the end of this unit, you should be able to:
I. Understand what between-group and within-group variability consist of and represent.
II. Understand the role of between-group and within-group variability in testing differences
between group means.
III. Understand what ‘ANOVA’ stands for, and why.
IV. Understand why, in testing the difference between means, the inferential statistic is called
the F-ratio.
V. Understand the characteristics of the theoretical distribution of F-ratios.
9.3 Time Frame
• Self-study of 2 hours
• Lecture lesson of 2 hours

9.4 Content
Experimental Design
Experimental design focuses on the methodologies used to construct and analyze comparative
experiments, particularly those that involve comparing the effects of different factors or treatments.
Understanding the advantages and disadvantages of various experimental designs is crucial for
effective experiment planning. This understanding helps in assessing the "experimental error,"
which serves as a basis for testing the differences between treatments. It is important to consider
how experimental design, treatment variations, and the number of replications impact the "residual
degrees of freedom" and determine whether one-tailed or two-tailed statistical tables should be
used.
The purpose of experimental design is to ensure that experiments yield the maximum level of
precision relative to the effort invested. In any design, it is essential to clearly define the roles of
the treatments and have a thorough understanding of the experiment's objectives.
An experiment adheres to a methodical process for gathering data under regulated conditions,
allowing generalizations about the population under study.

An experimental unit refers to a single replicate of a single treatment, which can be a plant, animal,
or a group of plants or animals. It is also referred to as a plot.

Essential Principles Of Experimentation

1. Randomization: This refers to the random assignment of treatments to the available


experimental subjects.
2. Replication / Reproducibility: It is important for an experiment to be designed in a way
that allows for replication or reproduction. Replication helps reduce errors and improves
accuracy. (Note: Errors are variations among the subjects that cannot be attributed to the
variation caused by treatments, blocks, or other controlled factors in the experiment).
Increasing the number of replications typically leads to the cancellation of errors that affect
individual treatments.
3. Homogeneity/Sensitivity: A homogeneous experimental setup is one that exhibits
uniformity in its materials and therefore does not require control of local variations.
Sensitivity refers to the ability to estimate the effects of treatments accurately, allowing for
valid conclusions to be drawn.

TYPES OF EXPERIMENTAL DESIGN


The Completely Randomized Design (CRD) is the most basic form of design where treatments
are randomly assigned to a group of plots. It is suitable when the experimental material is
homogeneous. In CRD, each treatment is replicated multiple times. The main purpose of CRD is
to ensure that the assignment of treatments is completely random, thus avoiding bias and reducing
inherent differences among the experimental
units or treatments. As an example, consider a CRD with 6 treatments (A–F), each replicated three
times; the allocation to plots would be randomized as follows.
A D F
B C E
C B A
D A C
E F B
F E D

The one-way analysis of variance (ANOVA) is used in the analysis of Completely Randomized
Design (CRD) results. Examples of such straightforward experiments include examining the
impact of various inclusion levels of a feed ingredient in diets (treatments) on the body
weight gain of broilers (experimental material) of the same sex, age, etc. Another instance would
be assessing the effects of different levels of fertilizer (treatments) on the yield of cotton
(experimental material).
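A minimal sketch of a one-way ANOVA for a CRD in Python; the weight-gain figures are hypothetical:

from scipy import stats

# hypothetical body-weight gains (g) of broilers under three diet treatments
diet_a = [410, 425, 398, 440, 415]
diet_b = [450, 462, 448, 455, 470]
diet_c = [430, 428, 445, 433, 440]

f_stat, p = stats.f_oneway(diet_a, diet_b, diet_c)
print(f"F = {f_stat:.2f}, p = {p:.4f}")
# p < 0.05 would suggest at least one diet mean differs significantly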
Advantages of CRD:
1. Simplicity: CRD is straightforward and easy to implement, making it suitable for beginners
or situations where simplicity is preferred.
2. Randomization: The random allocation of treatments helps minimize bias and ensures that
any differences observed are likely due to the treatments rather than other factors.
3. Efficiency: CRD allows for efficient use of resources as it requires fewer experimental
units compared to other designs.
Disadvantages of CRD:
1. The design provides less information than more advanced layouts, because all uncontrolled
variation among the experimental units is pooled into the experimental error; it is therefore
inefficient when the experimental material is heterogeneous.

The Randomized Complete Block Design (RCBD), often called simply the Randomized Block Design
(RBD), is an experimental design that aims to control the variability caused by certain
nuisance variables or factors. In an RCBD, the experimental units are divided into blocks
based on some characteristic that is expected to affect the response variable. Within each
block, the treatments are randomly assigned to the experimental units.
The primary objective of using an RBD is to reduce the impact of the nuisance variables on the
treatment comparisons. By blocking, the experimental units within each block become more
homogeneous, allowing for a more precise evaluation of treatment effects. This design helps
increase the sensitivity of detecting treatment differences by minimizing the variability caused by
the nuisance variables.
The RBD is most suitable when there are identified sources of variation that might affect the
response variable and when it is impractical to eliminate these sources completely. It is
commonly applied in agricultural research, where field variability or differences in soil fertility
can significantly influence the experimental outcomes.
Overall, the RCBD improves the precision and validity of experimental results by accounting
for the effect of nuisance variables through blocking and randomization.

Block Plots (fertility trend →)

I A B C D

II D A B C

III C D A B

IV B C D A

In the Randomized Complete Block Design, there are two distinct sources of variation, namely
Treatments and Blocks. Consequently, a two-way analysis of variance (ANOVA) is employed to
analyze the data collected from RCBD experiments. Illustrative examples of such experiments
include investigations into the response of quails from different age groups to varying
levels of dietary protein, as well as the response of five different maize varieties to three
levels of Indole-Acetic-Acid (IAA).
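
One common way to carry out such a two-way analysis in Python is with the statsmodels formula
interface. The sketch below assumes a long-format table with invented column names y,
treatment and block; the yield figures are made up for illustration:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical long-format data: one row per plot
df = pd.DataFrame({
    "block":     ["I", "I", "I", "II", "II", "II", "III", "III", "III"],
    "treatment": ["A", "B", "C", "A", "B", "C", "A", "B", "C"],
    "y":         [12.1, 14.3, 15.0, 11.8, 13.9, 14.6, 12.5, 14.1, 15.2],
})

# Additive two-way ANOVA: Treatment and Block as the two sources of variation
model = ols("y ~ C(treatment) + C(block)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))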

Advantages and Disadvantages of RBD
1. By selecting blocks of plots with similar characteristics, the Randomized Block Design
(RBD) mitigates the impact of heterogeneous material, resulting in a reduction of residual
variance.
2. The number of blocks or treatments is not limited, but it is essential to have an equal
number of plots per treatment within each block.
3. In the event of accidental yield losses, the analysis can still be conducted without
significant complications, although certain adjustments may be necessary.

Procedure for a randomized block experiment:


Phase 1:
➢ Determine the row totals and column totals for the experiment and verify the grand
total by checking that the row totals and the column totals add up to the same figure.
➢ Calculate the correction factor for the experiment by dividing the grand total squared by
the number of plots.
➢ Calculate the total sum of squares by subtracting the correction factor from the sum of
squared values.
Phase 2:
➢ Construct the analysis of variance (ANOVA) table with the headings: Source of variation,
degrees of freedom (d.f.), sum of squares, mean square, F-value, and P-value.
➢ Include the sources of variation as Treatments, Blocks, Residual, and Total in the ANOVA
table.
➢ Allocate degrees of freedom as t - 1 for Treatments, b - 1 for Replicates (Blocks), and
the product of these two degrees of freedom for the Residual. Verify that the three
degrees of freedom add up to n - 1 for the Total, where n is the total number of plots.
➢ Calculate the sum of squares of deviations for Treatments by squaring and adding the
Treatment totals, dividing by the number of plots per Treatment, and subtracting the
correction factor.
➢ Calculate the sum of squares of deviations for Replicates in the same way from the
Replicate totals.
➢ The "Residual" sum of squares of deviations represents the remaining portion of the "Total"
sum of squares of deviations.

End Phase:
➢ Calculate the mean square for Treatments, Replicates, and Residual by dividing each sum
of squares of deviations by its respective degrees of freedom, then obtain the F-values
by dividing the Treatment and Replicate mean squares by the Residual mean square, as in
the sketch below.
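
The three phases can be translated into code almost line for line. The Python function below
is a minimal sketch of the hand calculation, assuming the data arrive as a blocks-by-treatments
table of yields; the name rbd_anova and the return format are our own, not from any standard
library:

def rbd_anova(table):
    """Hand-style RBD ANOVA for a table given as rows (blocks) x columns (treatments)."""
    b, t = len(table), len(table[0])
    n = b * t
    grand_total = sum(sum(row) for row in table)
    cf = grand_total ** 2 / n                    # correction factor (Phase 1)

    total_ss = sum(x ** 2 for row in table for x in row) - cf
    block_ss = sum(sum(row) ** 2 for row in table) / t - cf
    col_totals = [sum(row[j] for row in table) for j in range(t)]
    treat_ss = sum(ct ** 2 for ct in col_totals) / b - cf
    error_ss = total_ss - block_ss - treat_ss    # residual (Phase 2)

    block_df, treat_df = b - 1, t - 1
    error_df = block_df * treat_df
    error_ms = error_ss / error_df               # mean squares and F (End Phase)
    return {
        "block":     (block_ss, block_df, (block_ss / block_df) / error_ms),
        "treatment": (treat_ss, treat_df, (treat_ss / treat_df) / error_ms),
        "error":     (error_ss, error_df, None),
    }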

Latin Square Experimental Design


The Latin square experimental design is an experimental design used in research to control
variation caused by two sources of variability: rows and columns. It is particularly useful when
there are two sources of potential variation, and the objective is to minimize the confounding
effects between them.

In a Latin square design, the number of treatments, rows, and columns are all equal. Each
treatment is assigned exactly once to each row and each column, ensuring that no treatment is
repeated in the same row or column. This arrangement guarantees that every row and every
column forms a complete replicate of the treatments.

The Latin square design helps reduce the impact of confounding factors by blocking in two
directions at once within the experimental design. It allows for more precise estimation of
treatment effects by accounting for both row and column effects.

This design is commonly used in agricultural research and other fields where there is a need to
control multiple sources of variation. It helps minimize bias and provides a structured approach to
experimental investigations.

C A B D
D B C A
A C D B
B D A C
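
A square of any size can be written down by cycling the treatment list one step per row, as
the Python sketch below illustrates; a real layout would additionally randomize the order of
rows, columns, and treatment labels, which is omitted here for brevity:

treatments = ["A", "B", "C", "D"]
n = len(treatments)

# Cyclic construction: row i is the treatment list rotated i places
square = [[treatments[(i + j) % n] for j in range(n)] for i in range(n)]
for row in square:
    print(" ".join(row))
# Each treatment appears exactly once in every row and every column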

Simple Factorial Experiment


A simple factorial experiment is a type of experimental design that investigates the effects of
two or more factors (independent variables) on a response variable. It involves systematically

varying the levels of each factor to examine their individual and combined effects on the response
variable.
In a simple factorial experiment, the factors are typically manipulated at two or more levels,
creating combinations of factor levels called treatment conditions. The experiment is designed in
such a way that each treatment condition is tested and replicated multiple times to ensure statistical
reliability.
The primary advantage of a simple factorial experiment is that it allows researchers to examine
both the main effects of each factor (the individual impact of each factor on the response variable)
and the interaction effects (how the combination of factors influences the response variable). By
analyzing these effects, researchers can gain insights into the relationships and dependencies
among the factors and their impact on the response variable.
Simple factorial experiments are widely used in various fields, including psychology, biology,
engineering, and social sciences, to understand the effects of multiple factors and their interactions
on the outcome of interest. The design provides a structured and efficient approach to studying
complex relationships between variables.
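
As a sketch of how such an analysis might look in practice, the statsmodels formula below
crosses two hypothetical factors (protein level and age group, with invented responses); the
term C(protein) * C(age) expands to both main effects plus their interaction:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical 2 x 3 factorial: protein level x age group, 2 replicates per cell
df = pd.DataFrame({
    "protein": ["low"] * 6 + ["high"] * 6,
    "age":     ["young", "young", "mid", "mid", "old", "old"] * 2,
    "y":       [20, 22, 25, 27, 23, 24, 28, 30, 33, 35, 29, 31],
})

model = ols("y ~ C(protein) * C(age)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # main effects and interaction rows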

Analysis Of Variance (ANOVA)


ANOVA, which stands for Analysis of Variance, is a statistical method employed to assess and
compare the average values of two or more groups or treatments in an experiment. It assesses the
amount of variation in the data and determines whether the observed differences among the groups
are statistically significant or simply due to chance.
ANOVA partitions the total variability in the data into different components: the variability
between groups (explained variance) and the variability within groups (unexplained variance). By
comparing these two sources of variance, ANOVA allows researchers to determine if the observed
differences among the groups are larger than what would be expected by random variation alone.
ANOVA produces an F-statistic and p-value that are used to make inferential conclusions. If the
p-value is below a predetermined significance level, typically 0.05, it indicates that the differences
among the groups are statistically significant, and the null hypothesis of equal means can be
rejected.
ANOVA is widely used in various fields, such as experimental research, social sciences, biology,
and business, to analyze data from experiments with multiple groups or treatments. It provides a

powerful tool for determining the presence of significant differences among groups and allows
researchers to make informed conclusions about the effects of independent variables on the
dependent variable of interest.

Assumptions in ANOVA
In ANOVA, the following assumptions must hold; otherwise its appropriateness becomes
questionable.

1. The samples are selected in a random manner, and each sample is unrelated to the other
samples.
2. The populations being studied exhibit a distribution that closely approximates the normal
curve.
3. The populations from which the sample values are derived share the same population
variance (σ²). That is, σ1² = σ2² = σ3² = … = σk²: the variances of all the populations
are equal.
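
These assumptions can be checked informally before running the analysis. The sketch below
uses two widely available scipy tests on invented sample values (assumption 1, random and
independent sampling, is a property of the design rather than of the data):

from scipy import stats

group_1 = [12, 14, 11, 13, 15]
group_2 = [18, 17, 19, 16, 18]
group_3 = [22, 24, 21, 23, 22]

# Assumption 2: approximate normality within each group (Shapiro-Wilk test)
for i, g in enumerate([group_1, group_2, group_3], start=1):
    w, p = stats.shapiro(g)
    print(f"group {i}: Shapiro-Wilk P = {p:.3f}")

# Assumption 3: equal population variances (Levene's test)
stat, p = stats.levene(group_1, group_2, group_3)
print(f"Levene P = {p:.3f}")   # a large P is consistent with equal variances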

Hypotheses in ANOVA
1. In ANOVA, the null hypothesis (Ho) stipulates that the independent samples are taken from
populations with equal means: Ho : μ1 = μ2 = μ3 = … = μk,
where k is the number of populations under investigation.
2. In ANOVA, the alternative hypothesis (H1) is that the population means are not all equal,
that is, at least one mean differs from the others.
The conclusions
In the ANOVA test, conclusions regarding the null hypothesis (Ho) are determined based on the
computed variance ratio, also known as the F-ratio.
If the computed F-ratio is less than the table value, the null hypothesis (Ho) is not
rejected, indicating that the results are not statistically significant.
If the computed F-ratio value is greater than the table value, the null hypothesis (Ho) is rejected,
and the alternative hypothesis (H1) is accepted, suggesting that the results are statistically
significant.

EXAMPLES:
The table below shows the number of seeds for five varieties of garden egg at three levels of
Indole-Acetic-Acid (IAA).
IAA\varieties A B C D E
I 3 5 10 7 8
II 2 4 7 4 5
III 4 5 8 6 7

1. Null hypothesis statement: there is no significant variation in the number of seeds among
the five garden egg varieties or among the three levels of IAA.
2. Calculate the row (IAA level) totals, the column (variety) totals, and the grand total
(GT) from the given table:

IAA\varieties A B C D E
I 3 5 10 7 8 33
II 2 4 7 4 5 22
III 4 5 8 6 7 30
Total 9 14 25 17 20 85

Then calculate the Correction Factor (CF) as:

CF = GT²/N = 85²/15 = 481.7
Calculate the Sums of Squares (SS):
BLOCKss = (33² + 22² + 30²)/5 – CF = 12.9
VARIETIESss = (9² + 14² + … + 20²)/3 – CF = 48.6
TOTALss = (3² + 5² + … + 7²) – CF = 65.3
ERRORss = TOTALss – (BLOCKss + VARIETIESss) = 3.8
Then calculate the Mean Squares (MS), dividing each SS by its degrees of freedom
(BLOCKDF = 3 - 1 = 2; VARIETIESDF = 5 - 1 = 4; ERRORDF = 2 × 4 = 8):
BLOCKMS = BLOCKss/BLOCKDF = 12.9/2 = 6.45
VARIETIESMS = VARIETIESss/VARIETIESDF = 48.6/4 = 12.15
ERRORMS = ERRORss/ERRORDF = 3.8/8 = 0.475
Block F-value = BlockMS/ErrorMS = 6.45/0.475 = 13.58
Varieties F-value = VarietiesMS/ErrorMS = 12.15/0.475 = 25.58

The F-values obtained from the calculations are compared to the F-distribution table, considering
their respective degrees of freedom.
SOURCE DF SS MS F
Block 2 12.9 6.45 13.58**
Varieties 4 48.6 12.15 25.58**
Error 8 3.8 0.475
Total 14 65.3
The presence of ** indicates that the values are highly significant (P < 0.01).

Conclusion:
As the F-values demonstrate high significance, we can reject the null hypothesis, indicating that
the three levels of IAA have a significant impact on the seed number of the five varieties of garden
egg.
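
For comparison, feeding this table to the rbd_anova sketch given earlier in this unit
reproduces the same analysis; the exact F-values it returns (about 13.9 and 26.1) differ
slightly from the hand calculation above because the hand version rounds each sum of squares
to one decimal place before dividing:

table = [
    [3, 5, 10, 7, 8],   # IAA level I
    [2, 4,  7, 4, 5],   # IAA level II
    [4, 5,  8, 6, 7],   # IAA level III
]
for source, (ss, df, f) in rbd_anova(table).items():
    print(source, round(ss, 2), df, None if f is None else round(f, 2))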

Example: Please fill in the missing values in the Anova table provided and then proceed to draw
your conclusions based on the analysis.
Source Sum of squares (SS) Degrees of freedom (DF) Mean squares (MS) F-ratio
Varieties 123.44 * * *
Residual * 15 *
Total 210.21 18

RESIDUALss = 210.21 – 123.44 = 86.8


VARIETIESDF = 18 – 15 = 3
VARIETIESMS = 123.44/3 = 41.15
RESIDUALMS = 86.8/15 = 5.79
F- ratio = 41.15/5.79 = 7.11

Source Sum of squares (SS) Degrees of freedom (DF) Mean squares (MS) F-ratio
Varieties 123.44 3 41.15 7.11
Residual 86.8 15 5.79
Total 210.21 18

Conclusion:
The calculated variance ratio of 7.11 exceeds the critical values from the table at both the 5% level
(3.29) and the 1% level (5.42). This indicates a significant difference among the varieties, leading
us to reject the null hypothesis that the varieties are identical.
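
The critical values quoted here can be read from printed F-tables or computed directly; a
short scipy sketch (the degrees of freedom are those of the table above):

from scipy import stats

df1, df2 = 3, 15                      # Varieties and Residual degrees of freedom
print(stats.f.ppf(0.95, df1, df2))    # 5% critical value, about 3.29
print(stats.f.ppf(0.99, df1, df2))    # 1% critical value, about 5.42
print(stats.f.sf(7.11, df1, df2))     # exact P-value of the computed F-ratio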

9.5 Activity
1. What is your understanding of 'Experimental design'?
2. Which type of experimental design would be most appropriate for the ANOVA analysis:
(a) One-way classification or (b) Two-way classification?
3. Outline and briefly discuss the fundamental principles involved in experimental design.
4. What is the significance of employing homogeneous materials in an experiment?
5. From the ANOVA table below:

Source of variation d.f. Sum of Squares Mean Square F
Block 2 16 8 11.94
Treatment 5 145.8 29.17 43.54
Residual/Error 10 6.67 0.67
Total 17 168.5

a. What type of design is used above?
b. State the number of treatments being studied.
c. State the number of observations in the above experiment.
d. What conclusions do you draw based on the F-values?
