1 Introduction
Biostat
10/20/24 1
What is statistics?
• The scientific study of numerical data
based on variation in nature.
• A set of procedures and rules for
reducing large masses of data into
manageable proportions allowing us
to draw conclusions from those data.
Statistics…
• Statistics is the science of collecting,
summarizing, presenting, and
interpreting data, and of using them
to test hypotheses.
• Biostatistics is statistics applied
to biological and health
problems
Uses of Biostatistics
• Assessment of health status
• Resource allocation
• Vaccination uptake
• Magnitudes of a disease/condition
• Assessing risk factors
• Making diagnosis and choosing an appropriate
treatment
Types of Statistics
1. Descriptive statistics:
• Ways of organizing and summarizing data
• Methods for identifying the important features
of a set of data and extracting useful
information
2. Inferential statistics:
• Methods used for drawing conclusions
about a population based on the
information contained in a sample of
observations drawn from that population
Types of statistics
• Descriptive Statistics
– Collection,
– organization,
– summarization, and
– presentation of data.
• Inferential Statistics
– Generalizing from samples to populations using
probabilities.
– Performing hypothesis testing,
– Determining relationships between variables,
– Making predictions.
Why study statistics in health?
Roles of statistics
• In clinical medicine
– Making clinical diagnosis
– Determining treatment (Rx) and prognosis
– Handling variations (defining normal values and
normal ranges)
• In public health
– Community diagnosis
• In Research
– Designing and undertaking clinical & public health
research
Limitations of statistics
1. Statistics doesn't deal with a single
(individual) value.
– It deals only with aggregate values
2. Statistics can't deal with qualitative
characteristics
– It deals with data which can be quantified
3. Statistical conclusions are not universally
true
– They are context specific
4. Statistical interpretations require a high degree
of skill & understanding of the subject.
Data
• Data are numbers which can be
measurements or can be obtained by counting
• The raw material for statistics
• Can be obtained from:
– Routinely kept records
– Surveys
– Counting
– Experiments
– Reports
Types of Data
1. Primary data: data collected directly from the
items or individual respondents by the
researcher for the purpose of a particular study.
Population and Sample
Target population:
• A collection of items that have
something in common for which we
wish to draw conclusions at a
particular time
Study (Sampled) Population:
• The subset of the target population that
has at least some chance of being
sampled
• The specific population from which data
are collected
Sample:
• A subset of a study population, about
which information is actually
obtained.
• The individuals who are actually
measured and comprise the actual
data.
[Figure: nested diagram showing the Sample inside the Study Population inside the Target Population]
Generalizability
• Generalization is a two-stage procedure: we need
to be able to generalize from the
sample to the study population and
then from the study population to
the target population
• Collect information from a comparatively SMALL sample
• Draw conclusions about a rather LARGE population
Parameter and Statistic
• Parameter: A descriptive measure computed
from the data of a population.
• Statistic: A descriptive measure computed
from the data of a sample.
Descriptive Statistics
• Techniques used to organize and
summarize a set of data.
Methods of Data Organization and
Presentation
Frequency Distributions
• Ordered array: A simple arrangement of
individual observations in order of magnitude.
• The actual summarization and organization of
data starts from frequency distribution.
• Frequency distribution: A table which
involves a listing of all values of the studied
variable and how many times each value is
observed.
• Tables make it easier to see how the data are
distributed
a) Qualitative variable: Count the number of
cases in each category.
- Example 1: The ICU type of 25 patients
entering the intensive care unit at a given hospital:
1. Medical
2. Surgical
3. Cardiac
4. Other
ICU Type    Frequency      Relative Frequency
            (how often)    (proportionately often)
Medical     12             0.48
Surgical     6             0.24
Cardiac      5             0.20
Other        2             0.08
Total       25             1.00
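The counting step for a qualitative variable can be sketched in a few lines of Python. The list `icu_types` is a hypothetical reconstruction that only matches the counts in the table (the slides do not list the 25 individual patients), and `frequency_table` is an illustrative helper name, not part of any library.

```python
from collections import Counter

# Hypothetical raw records, built only to match the table's counts
# (12 medical, 6 surgical, 5 cardiac, 2 other).
icu_types = ["Medical"] * 12 + ["Surgical"] * 6 + ["Cardiac"] * 5 + ["Other"] * 2

def frequency_table(values):
    """Return {category: (frequency, relative frequency)} in first-seen order."""
    counts = Counter(values)
    n = len(values)
    return {category: (freq, freq / n) for category, freq in counts.items()}

table = frequency_table(icu_types)
for category, (freq, rel) in table.items():
    print(f"{category:<10}{freq:>4}{rel:>8.2f}")
print(f"{'Total':<10}{len(icu_types):>4}{sum(rel for _, rel in table.values()):>8.2f}")
```

Relative frequency is simply frequency divided by the number of observations, so the relative frequencies always sum to 1.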
Example 2:
A study was conducted to assess the
characteristics of a group of 234 smokers by
collecting data on gender and other variables.
Gender, 1 = male, 2 = female
Sturges' rule:
$K = 1 + 3.322 \log_{10} n$
$W = \dfrac{L - S}{K}$
where
K = number of class intervals
n = number of observations
W = width of the class interval
L = the largest value
S = the smallest value
Example:
• Leisure time (hours) per week for 40 college
students:
23 24 18 14 20 36 24 26 23 21 16 15 19 20 22
14 13 10 19 27 29 22 38 28 34 32 23 19 21 31
16 28 19 18 12 27 15 21 25 16
K = 1 + 3.322 log10(40) = 6.32 ≈ 6
Maximum value = 38, Minimum value = 10
Width = (38 - 10)/6 = 4.67 ≈ 5
Time       Frequency   Relative    Cumulative Relative
(Hours)                Frequency   Frequency
10-14       5          0.125       0.125
15-19      11          0.275       0.400
20-24      12          0.300       0.700
25-29       7          0.175       0.875
30-34       3          0.075       0.950
35-39       2          0.050       1.000
Total      40          1.000
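The leisure-time example can be reproduced with a short Python sketch. It assumes integer-valued data, rounds K to the nearest whole number, and rounds the width up so the classes cover the whole range; the function names `sturges_classes` and `grouped_frequencies` are illustrative only.

```python
import math

leisure_hours = [23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20, 22,
                 14, 13, 10, 19, 27, 29, 22, 38, 28, 34, 32, 23, 19, 21, 31,
                 16, 28, 19, 18, 12, 27, 15, 21, 25, 16]

def sturges_classes(data):
    """Number of classes K and class width W from Sturges' rule."""
    k = round(1 + 3.322 * math.log10(len(data)))
    # Assumption: round the width up so K classes cover the full range.
    width = math.ceil((max(data) - min(data)) / k)
    return k, width

def grouped_frequencies(data, k, width):
    """Rows of (lower, upper, frequency, relative, cumulative relative)."""
    rows, cumulative = [], 0.0
    for i in range(k):
        lower = min(data) + i * width
        upper = lower + width - 1          # class limits for integer data
        freq = sum(lower <= x <= upper for x in data)
        relative = freq / len(data)
        cumulative += relative
        rows.append((lower, upper, freq, relative, cumulative))
    return rows

k, width = sturges_classes(leisure_hours)
rows = grouped_frequencies(leisure_hours, k, width)
for lower, upper, freq, relative, cumulative in rows:
    print(f"{lower}-{upper}  {freq:>2}  {relative:.3f}  {cumulative:.3f}")
```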
• Cumulative frequency: the sum of the frequency of a
class and the frequencies of all preceding classes.
Graphs for continuous data:
• Histogram
• Box plot
• Scatter plot
• Line graph
• Others
1. Bar charts (or graphs)
• Categories are listed on the horizontal axis (X-
axis)
• Frequencies or relative frequencies are
represented on the Y-axis (ordinate)
• The height of each bar is proportional to the
frequency or relative frequency of
observations in that category
[Figure: bar chart for the type of ICU for 25 patients]
Method of constructing bar chart
• All the bars must have equal width
• The bars are not joined together
• The different bars should be separated by
equal distances
• All the bars should rest on the same line
called the base
Example: Construct a bar chart for the
following data.
Distribution of patients in hospital by source of referral
Source of referral        No. of patients   Relative freq. (%)
Other hospital                   97                5.1
General practitioner            769               40.3
Out-patient department          623               32.7
Casualty                        256               13.4
Other                           161                8.5
Total                         1,906              100.0
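As a rough illustration of the rule that bar length is proportional to frequency, the sketch below draws a text bar chart from the referral counts. The function name `text_bar_chart`, the `#` symbol and the 40-character maximum are arbitrary choices made for this example.

```python
referrals = {
    "Other hospital": 97,
    "General practitioner": 769,
    "Out-patient department": 623,
    "Casualty": 256,
    "Other": 161,
}

def text_bar_chart(counts, max_width=40):
    """One line per category; bar length is proportional to the count."""
    largest = max(counts.values())
    lines = []
    for label, count in counts.items():
        bar = "#" * round(count / largest * max_width)
        lines.append(f"{label:<24}{bar} {count}")
    return lines

chart = text_bar_chart(referrals)
print("\n".join(chart))
```

Every bar starts at the same base column and has the same thickness (one text line), which mirrors the construction rules for a bar chart.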
[Figure: bar chart "Distribution of patients in hospital X by source of referral, 1999"; x-axis: Source of referral (Other hospital, GP, OPD, Casualty, Other); y-axis: No. of patients (0 to 800); bar heights 97, 769, 623, 256, 161]
2. Sub-divided bar chart
• If there are different quantities forming the
sub-divisions of the totals, simple bars may
be sub-divided in the ratio of the various
sub-divisions to exhibit the relationship of
the parts to the whole.
• The order in which the components are
shown in a “bar” is followed in all bars used
in the diagram.
Example: Plasmodium species distribution for
confirmed malaria cases, Zeway, 2003
[Figure: sub-divided bar chart; x-axis: August, October, December 2003; y-axis: Percent (0 to 100); components: P. falciparum, P. vivax, Mixed]
3. Multiple bar graph
• Bar charts can be used to represent the
relationships among more than two
variables.
• The following figure shows the relationship
between children’s reports of breathlessness
and cigarette smoking by themselves and
their parents.
[Figure: multiple bar graph "Prevalence of self-reported breathlessness among school children, 1998"; x-axis: Parents smoking (Neither, One, Both); y-axis: Breathlessness, per cent (0 to 35)]
We can see from the graph quickly that the prevalence of the
symptoms increases both with the child’s smoking and with
that of their parents.
There’s no reason why the bar chart can’t be
plotted horizontally instead of vertically.
[Figure: horizontal bar chart; y-axis: Type of source (CHA, HC, Reading, Training, Campaign, Anti FGMC, CAT); x-axis: Percent (0 to 50); bars grouped by sex (female, male)]
4. Pie chart
[Figure: pie chart; Circulatory system 42%, Neoplasms 30%, Respiratory system 13%, Others 8%, Digestive system 4%, Injury and poisoning 3%]
5. Stem and Leaf Plot
• A quick way to organize data to give visual
impression similar to a histogram while
retaining much more detail on the data.
• Example data: 43, 28, 34, 61, 77, 82, 22, 47, 49, 51, 29, 36,
66, 72, 41
Stem | Leaf
2 | 2 8 9
3 | 4 6
4 | 1 3 7 9
5 | 1
6 | 1 6
7 | 2 7
8 | 2
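The plot above can be built mechanically: sort the values, then split each one into a stem (tens digit) and a leaf (units digit). The sketch below assumes two-digit whole numbers; `stem_and_leaf` is an illustrative name.

```python
from collections import defaultdict

values = [43, 28, 34, 61, 77, 82, 22, 47, 49, 51, 29, 36, 66, 72, 41]

def stem_and_leaf(data):
    """Map each stem (tens digit) to its sorted leaves (units digits)."""
    plot = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(values)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Because every leaf is kept, the original values can be read back from the plot, which is the detail a histogram loses.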
6. Histogram
• Histograms are frequency distributions with
continuous class intervals that have been turned into
graphs.
• To construct a histogram, we draw the interval
boundaries on a horizontal line and the frequencies on
a vertical line.
• Non-overlapping intervals that cover all of the data
values must be used.
• Bars are then drawn over the intervals in such a way
that the areas of the bars are all proportional in the
same way to their interval frequencies.
Example: Distribution of the age of women at the time of
marriage
Age group      15-19  20-24  25-29  30-34  35-39  40-44  45-49
No. of women     11     36     28     13      7      3      2
[Figure: histogram "Age of women at the time of marriage"; x-axis: Age group, with class boundaries 14.5-19.5, 19.5-24.5, 24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5, 44.5-49.5; y-axis: No. of women (0 to 40)]
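The histogram uses continuous class boundaries (14.5-19.5 and so on) rather than the class limits in the table. A minimal sketch of that conversion, assuming ages recorded in whole years, is given below; it also computes bar heights as frequency divided by width so that bar areas stay proportional to the frequencies. The function names are illustrative.

```python
age_groups = [(15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44), (45, 49)]
women = [11, 36, 28, 13, 7, 3, 2]

def class_boundaries(limits, unit=1):
    """Extend each class by half a measurement unit on both sides so that
    adjacent intervals touch (15-19 and 20-24 become 14.5-19.5 and 19.5-24.5)."""
    half = unit / 2
    return [(low - half, high + half) for low, high in limits]

def bar_heights(boundaries, freqs):
    """Height = frequency / width, so bar area is proportional to frequency."""
    return [f / (high - low) for (low, high), f in zip(boundaries, freqs)]

boundaries = class_boundaries(age_groups)
heights = bar_heights(boundaries, women)
for (low, high), h in zip(boundaries, heights):
    print(f"{low}-{high}: height {h:.1f}")
```

When all classes have the same width, as here, plotting the raw frequencies gives a histogram of the same shape.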
[Figure: histogram for the ages of 2087 mothers with <5 children, Adami Tulu, 2003; x-axis: variable N1AGEMOTH (15.0 to 55.0); y-axis: frequency (0 to 700); N = 2087]
7. Frequency polygon
• A frequency distribution can be portrayed
graphically in yet another way by means of
a frequency polygon.
• To draw a frequency polygon we connect
the midpoints of the tops of the bars of the
histogram with straight lines.
[Figure: frequency polygon for the ages of 2087 mothers with <5 children, Adami Tulu, 2003; x-axis: variable N1AGEMOTH; y-axis: frequency (0 to 700)]
It can also be drawn without erecting rectangles, by plotting the
frequency of each class above the midpoint of its interval and joining
the points, as follows:
[Figure: frequency polygon "Age of women at the time of marriage"; x-axis: Age (12, 17, 22, 27, 32, 37, 42, 47); y-axis: No. of women (0 to 40)]
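The points of such a polygon can be listed directly from the class boundaries. The sketch below uses the age-at-marriage data and closes the polygon with a zero-frequency class on each side, which is a common convention assumed here (the figure's axis starts at 12, the midpoint of the class below the first one). `polygon_points` is an illustrative name.

```python
boundaries = [(14.5, 19.5), (19.5, 24.5), (24.5, 29.5), (29.5, 34.5),
              (34.5, 39.5), (39.5, 44.5), (44.5, 49.5)]
women = [11, 36, 28, 13, 7, 3, 2]

def polygon_points(boundaries, freqs):
    """(midpoint, frequency) pairs, padded with an empty class at each end."""
    width = boundaries[0][1] - boundaries[0][0]
    midpoints = [(low + high) / 2 for low, high in boundaries]
    xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
    ys = [0] + list(freqs) + [0]
    return list(zip(xs, ys))

points = polygon_points(boundaries, women)
for x, y in points:
    print(x, y)
```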
8. Ogive Curve
• Sometimes it may be necessary to know the number of
items whose values are more or less than a certain
amount.
• We may, for example, be interested to know the no. of
patients whose weight is <50 kg or >60 kg.
• To get this information it is necessary to change the
form of the frequency distribution from a ‘simple’ to a
‘cumulative’ distribution.
• An ogive curve turns a cumulative frequency distribution
into a graph.
• Ogive curves are much more common than frequency polygons
Cumulative Frequency and Cum. Rel. Freq. of Age
of 25 ICU Patients
Age      Frequency   Relative    Cumulative   Cumulative
                     freq. (%)   frequency    rel. freq. (%)
10-19        3          12           3            12
20-29        1           4           4            16
30-39        3          12           7            28
40-49        0           0           7            28
50-59        6          24          13            52
60-69        1           4          14            56
70-79        9          36          23            92
80-89        2           8          25           100
Total       25         100
[Figure: cumulative frequency of 25 ICU patients]
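The cumulative columns for the ICU ages are running totals of the simple frequencies. A minimal sketch follows; the helper `patients_up_to` is a hypothetical name used to answer a "how many patients are below this class limit" question of the kind an ogive is meant for.

```python
from itertools import accumulate

age_classes = ["10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89"]
freqs = [3, 1, 3, 0, 6, 1, 9, 2]

cum_freqs = list(accumulate(freqs))          # running total of frequencies
n = cum_freqs[-1]                            # total number of patients
cum_pct = [100 * c / n for c in cum_freqs]   # cumulative relative frequency (%)

def patients_up_to(age_class):
    """Number and percentage of patients in this class or any younger class."""
    i = age_classes.index(age_class)
    return cum_freqs[i], cum_pct[i]

print(patients_up_to("40-49"))   # patients younger than 50
```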
Example: Heart rate of patients admitted to
hospital Y, 1998
Heart rate     No. of patients   Cumulative frequency,     Cumulative frequency,
                                 Less than Method (LM)     More than Method (MM)
54.5-59.5             1                   1                        54
59.5-64.5             5                   6                        53
64.5-69.5             3                   9                        48
69.5-74.5             5                  14                        45
74.5-79.5            11                  25                        40
79.5-84.5            16                  41                        29
84.5-89.5             5                  46                        13
89.5-94.5             5                  51                         8
94.5-99.5             2                  53                         3
99.5-104.5            1                  54                         1
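Both cumulative columns in the heart-rate table can be derived from the simple frequencies: the less-than column accumulates from the first class downward, and the more-than column accumulates from the last class upward. A short sketch, with variable names chosen for this example:

```python
from itertools import accumulate

heart_rate_freqs = [1, 5, 3, 5, 11, 16, 5, 5, 2, 1]

# Less-than method: patients below the upper boundary of each class.
less_than = list(accumulate(heart_rate_freqs))

# More-than method: patients at or above the lower boundary of each class,
# obtained by accumulating from the last class back to the first.
more_than = list(accumulate(reversed(heart_rate_freqs)))[::-1]

for freq, lm, mm in zip(heart_rate_freqs, less_than, more_than):
    print(f"{freq:>3} {lm:>3} {mm:>3}")
```

For every class, the less-than value of the previous class plus the more-than value of the current class equals the total of 54 patients.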
[Figure: ogive curves "Heart rate of patients admitted to hospital Y, 1998"; x-axis: Heart rate (class boundaries 54.5 to 104.5); y-axis: Cum. frequency (0 to 60); two curves: LM and MM]
9. Box and Whisker Plot
• It is another way to display information when
the objective is to illustrate certain locations in
the distribution.
• Can be used to display a set of discrete or
continuous observations using a single vertical
axis – only certain summaries of the data are
shown
• First the percentiles (or quartiles) of the data
set must be defined
• A box is drawn with the top of the box at the
third quartile and the bottom at the first
quartile.
• The location of the mid-point of the
distribution is indicated with a horizontal line
in the box.
• Finally, straight lines, or whiskers, are drawn
from the centre of the top of the box to the
largest observation and from the centre of the
bottom of the box to the smallest observation.
• Arrange the numbers in ascending order
• Position of a percentile = p(n+1)th observation, where
p = the required percentile expressed as a proportion
A. 1st quartile = 0.25(n+1)th
B. 2nd quartile = 0.5(n+1)th
C. 3rd quartile = 0.75(n+1)th
D. 20th percentile = 0.2(n+1)th
E. 15th percentile = 0.15(n+1)th
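The p(n+1) position rule can be sketched as follows. The data set is hypothetical, and the linear interpolation used when the position falls between two observations is an assumption of this sketch; the slides give only the position formula.

```python
def percentile(data, p):
    """Value at position p*(n+1) (1-based) in the sorted data.
    Assumption: interpolate linearly when the position is not a whole number."""
    ordered = sorted(data)
    n = len(ordered)
    position = min(max(p * (n + 1), 1), n)   # keep the position inside 1..n
    lower = int(position)                    # whole part of the position
    fraction = position - lower
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

# Hypothetical observations, used only to illustrate the rule (n = 7).
sample = [12, 15, 11, 18, 20, 14, 17]
print(percentile(sample, 0.25))   # 1st quartile: 0.25 * 8 = 2nd observation
print(percentile(sample, 0.50))   # 2nd quartile: 0.50 * 8 = 4th observation
print(percentile(sample, 0.75))   # 3rd quartile: 0.75 * 8 = 6th observation
```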
Example: Percentage supersaturation of bile for 31 men and 29
women
Columns 1-3: Men (Subject, Age, % supersaturation)
Columns 4-6: Women (Subject, Age, % supersaturation)
1 23 40 1 40 65
2 31 86 2 33 86
3 58 11 3 49 76
4 25 86 4 44 89
5 63 106 5 63 142
6 43 66 6 27 58
7 67 123 7 23 98
8 48 90 8 56 146
9 29 112 9 41 80
10 26 52 10 30 66
11 64 88 11 38 52
12 55 137 12 23 35
13 31 88 13 35 55
14 20 80 14 50 127
15 23 65 15 47 77
16 43 79 16 36 91
17 27 87 17 74 128
18 63 56 18 53 75
19 59 110 19 41 82
20 53 106 20 25 89
21 66 110 21 57 84
22 48 78 22 42 116
23 27 80 23 49 73
24 32 47 24 60 87
25 62 74 25 23 76
26 36 58 26 48 107
27 29 88 27 44 84
28 27 73 28 37 120
29 65 118 29 57 123
30 42 67
31 60 57
[Figure: box and whisker plots for percentage saturation of bile; one box each for Men and Women; y-axis: 20 to 160]
• The percentage saturation of bile is a bit
more spread out among women with
range 35 to 146 but we see also that the
mid-points of the distributions are
almost the same and that most of the
spread in values in women occurs in the
upper half of the distribution.
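The summary values behind the women's box can be checked from the table with the p(n+1) rule. The sketch below assumes linear interpolation between neighbouring observations when the position is fractional; `percentile` and `five_number_summary` are illustrative names.

```python
women_saturation = [65, 86, 76, 89, 142, 58, 98, 146, 80, 66, 52, 35, 55, 127, 77,
                    91, 128, 75, 82, 89, 84, 116, 73, 87, 76, 107, 84, 120, 123]

def percentile(data, p):
    """Value at position p*(n+1) in the sorted data (interpolation assumed)."""
    ordered = sorted(data)
    n = len(ordered)
    position = min(max(p * (n + 1), 1), n)
    lower = int(position)
    fraction = position - lower
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

def five_number_summary(data):
    """Smallest value, Q1, median, Q3 and largest value."""
    return (min(data), percentile(data, 0.25), percentile(data, 0.5),
            percentile(data, 0.75), max(data))

summary = five_number_summary(women_saturation)
print(summary)
```

With n = 29 the quartiles sit at positions 7.5, 15 and 22.5. The distance from the median to the largest value is greater than the distance from the median to the smallest value, which agrees with the remark that most of the spread among women lies in the upper half.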
10. Scatter plot
• Most studies in medicine involve measuring more
than one characteristic, and graphs displaying the
relationship between two characteristics are
common in the literature.
• When both the variables are qualitative then we
can use a multiple bar graph.
• When one of the characteristics is qualitative and
the other is quantitative, the data can be displayed
in box and whisker plots.
• For two quantitative variables we use
bivariate plots (also called scatter plots or
scatter diagrams).
[Figure: scatter plot; x-axis: Age (0 to 80); y-axis: Saturation of bile (0 to 140)]
• The graph suggests the possibility of
a positive relationship between age
and percentage saturation of bile in
women.
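One way to put a number on the direction of the relationship seen in the scatter plot is the Pearson correlation coefficient. This is an illustrative choice for this sketch and not a calculation made in the slides; the ages and saturation values are the women's columns of the bile table.

```python
import math

ages = [40, 33, 49, 44, 63, 27, 23, 56, 41, 30, 38, 23, 35, 50, 47,
        36, 74, 53, 41, 25, 57, 42, 49, 60, 23, 48, 44, 37, 57]
saturation = [65, 86, 76, 89, 142, 58, 98, 146, 80, 66, 52, 35, 55, 127, 77,
              91, 128, 75, 82, 89, 84, 116, 73, 87, 76, 107, 84, 120, 123]

def pearson_r(x, y):
    """Pearson correlation: covariation of x and y scaled by their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(ages, saturation)
print(f"r = {r:.2f}")   # a value above 0 indicates a positive relationship
```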
11. Line graph
• Useful for assessing the trend of a particular situation
over time.
• Helps in monitoring the trend of epidemics.
• The time, in weeks, months or years, is marked along the
horizontal axis, and
• Values of the quantity being studied are marked on the
vertical axis.
• Values for each category are connected by a continuous
line.
• Sometimes two or more lines are drawn on the same
graph using the same scale so that the plotted lines are
comparable.
[Figure: line graph "No. of microscopically confirmed malaria cases by species and month at Zeway malaria control unit, 2003"; x-axis: Months (Jan to Dec); y-axis: No. of confirmed malaria cases (0 to 2100); lines: Positive, P. falciparum, P. vivax]
A line graph can also be used to depict the relationship between
two continuous variables, as a scatter diagram does.
[Figure: line graph; x-axis: Time since administration (min), 10 to 360; y-axis: Blood zidovudine concentration (0 to 8)]
[Figure: chart comparing Pre-eclampsia and Eclampsia across the Antepartum, Intrapartum and Postpartum periods; y-axis: 0 to 12]
Remember:
A graph is a tool.
It is not artwork to
hang above your sofa!
It is more important that it is
easy to correctly interpret
than it is that it is pretty!
THANK YOU