Biostat Compiled
DR ABDUL RAUF
ASSISTANT PROFESSOR
The word statistics is defined in three different ways. Firstly,
“Statistics are numerical facts systematically collected with some
definite object”.
For example, the figures 60, 62, 64, 65, 68 on their own are not statistics, but
identified as the heights of the students in a class they become
STATISTICS, e.g.
a) Statistics of deaths & births,
b) Statistics of educational institutions in Sargodha, etc.
Secondly:
Statistics is defined as “the science of systematic collection,
classification, tabulation, presentation, analysis and
interpretation of numerical data”.
Thirdly:
“Any numerical quantity (such as mean, median, mode &
standard deviation) computed from a SAMPLE is
known as a statistic” (singular).
Study of statistics in relation to
biological science (such as biological,
social and environmental factors) is
known as “BIO-STATISTICS”.
Study of statistics in relation to health
and disease of human population and
different factors related to them, is
known as “Health statistics".
Study of statistics in relation to the vital events of life such as
births, marriages, deaths, divorces, etc. is known as “vital
statistics”, which in turn is a branch of “Demography”, which
deals with the study of human populations.
1. Helps in effective comparisons
between two groups or two
countries.
2.Helps in measurement of health
status of a community in terms of
rates, ratios, proportions etc. which
in turn helps in comparison with
other countries and helps to study
the influencing factors.
For example, the prevalence of typhoid fever is higher among
people of poor socio-economic status, living in unhygienic
areas with unsafe water supply, not protected through
immunization and so on. Thus by a systematic analysis of
the factors related to the disease, a health worker or a health
administrator can define the problems in terms of contrast.
3.Helps in estimating the magnitude of a health problem.
4.Helps in analyzing the causes of the public health problems,
including epidemics, to the public health personnel.
5.Helps in monitoring & evaluation of the control measures and
also in introducing midcourse correction measures, where
ever necessary.
6.Helps in health planning and management
7.Helps in research purposes
Thus biostatistics, if properly recorded, constitutes the “Eyes and
Ears” of a health worker; without it, work would be like “sailing in
a ship without a compass”.
An inherent feature of all biological observations is their
variability, i.e. every individual differs from every other.
Similarly, each group of individuals is different from other
groups. For example, the pulse rate, hemoglobin level and the
number of white cells vary from person to person. They also
vary from one group to another, e.g. the pulse rate among
infants differs from that of the old age group.
VARIABLE:
Any numerical value which varies from one individual to
another is called a variable OR
It is a characteristic or attribute that varies from person to
person, from place to place & from time to time.
e.g. Height of students in the class.
Weight of school boys.
Variables are usually represented by the last letters of the English alphabet: X, Y, Z.
Other examples: Prices of goods.
Number of children in a family.
CONSTANT:
Any numerical quantity which is fixed
OR
“constant is any fixed quantity that has a single value”
OR
“A quantity which can assume Only One Value is called
constant”
e.g. π ≈ 22/7 (3.14), ‘g’ = 9.8 m/s²
Variables may be Qualitative, Quantitative,
Continuous, Discrete, Dichotomous (Binary) & Polyotomous.
QUALITATIVE VARIABLE
A characteristic which varies only in quality from one individual
to another individual is called “qualitative variable” e.g.
beauty, intelligence, severity of disease, color, ABO blood
group, gender. It is also called as attribute or categorical
variable.
QUANTITATIVE VARIABLE
Characteristic which can be measured numerically & varies from
one individual to another individual e.g. height, weight, B.P,
temperature of patients, hemoglobin level, blood sugar level,
mid-arm circumference,
Body mass index(BMI),Serum cholesterol level.
DISCRETE VARIABLE
A variable is called discrete variable if it can take some selected
values in a given interval
OR
A variable whose value is taken from some counting process
e.g. number of patients in a ward, rooms in a house , trees in
a row
CONTINUOUS VARIABLE
If the variable takes any value within an interval that variable is
called continuous variable.
OR
The variable whose value is taken from some measuring
process e.g. B.P, temperature, height & weight of patients
DICHOTOMOUS(BINARY) VARIABLE
It is a variable that has only two possible values.
Examples
Gender
weight more than 80 kg
Obesity
Rh blood group
POLYOTOMOUS VARIABLE
It is a variable that has more than two possible values.
Examples
ABO blood group
Weight
Height
Nominal scale                 Metric scale
Based on NOM (names),         Based on ME (measurement),
no specific order, e.g.       in terms of quantities, e.g.
Race/ethnicity                Blood glucose
Religion                      Mid-arm circumference
Sex of child/gender           Hemoglobin level
ABO blood group               Weight, height
Country of birth              Blood pressure
Type of anemia                Pulse rate
DATA
Data is the plural of the word datum. A set of numerical observations is
called “data”,
e.g. heights of students, temperatures of patients,
blood pressures of patients, number of paramedical staff.
DATA MATRIX: It is a kind of platform on which we
present primary data.
INFORMATION
Organized or optimized form of data is called
“information”
DATA REDUCTION
The process of converting raw data into a manageable form so
that some statistical analysis can be done.
TYPES OF DATA
Primary data: The data which is collected for the first time, OR
The data which has not gone through statistical
machine is called primary data.
Secondary data: The data which has already been
collected previously OR
The data which has gone through statistical machine is called
secondary data.
TYPES OF DATA
Bi-variate data have exactly 2 pieces of information for each
item, i.e. two observations are taken simultaneously from each
individual, e.g. salt intake & BP of patients, height & weight of
students.
Multivariate data have 3 or more pieces of information for each
item, i.e. more than two observations are taken simultaneously from each
individual, e.g. age, weight & height of patients.
TYPES OF DATA
Nominal data= data based on nominal scale or variable is called
nominal data.
Ordinal data= data based on ordinal scale is called ordinal data.
Ratio / interval data=data based on ratio/interval scale.
TYPES OF DATA
Time series data = A set of ordered data values observed at
successive points in time is called time series data, e.g.
hourly measurement of a patient's temperature.
Cross-sectional data = A set of data values observed at a fixed
point in time is called cross-sectional data, e.g. measurement
of temperature in the morning for a group of patients.
TYPES OF DATA
Ungrouped data = the data in the original form (without
frequencies) are referred to as ungrouped data.
Grouped data = data presented in the form of a frequency
distribution are called grouped data.
SOURCES OF DATA
The main sources of collection of medical or health related data
are as
1- Experiments/trials
2-Surveys/observations
3-Records/registration
4-Clinical practice
5-External sources
COLLECTION OF DATA
1. Direct personal observations
2. Indirect personal observations
3. Questionnaire method
4. Data collection through enumerators
5. Data collection through local sources
6. Electronic media
CLASS = an interval of values within the given data is called a
class.
CLASS INTERVALS = when the upper limit of a class does not coincide
with the lower limit of the next class, the classes are called class intervals, e.g. 0-10,
11-20, 21-30, 31-40.
CLASS BOUNDARIES = when the upper limit of a class coincides with the
lower limit of the next class, they are called class boundaries, e.g. 0-10, 10-20,
20-30.
Class limits=the smallest & largest values of any given class of a
frequency distribution are called class limits. 20-40,
Class magnitude= the difference between two class limits or
class intervals is called class magnitude. 40-20=20
Frequency(class frequency)=the frequency is defined as the no.
of observations in any class & is denoted by “f”.
FREQUENCY DISTRIBUTION
The arrangement of statistical data according to size or
magnitude is called “ frequency distribution”. There are 3 types
of frequency distribution.
1.Individual series
2.Discrete series
3.Continuous series
INDIVIDUAL SERIES = In an individual series items are arranged singly
according to their size, e.g.
Marks x = 14 16 13 17 19 15
DISCRETE SERIES = In a discrete series items are capable of exact
measurement. They are separate, complete and are arranged
according to their size, e.g.
Marks (X):             10   20   30   40   50
No. of students (f):    2    5   17    6    4
CONTINUOUS SERIES
In a continuous series items are not capable of exact measurement
but vary from one point to another and are arranged according
to their size, e.g.
Marks (C.B):           5-10   10-15   15-20   20-25
No. of students (f):     3       9      10       7
Individual series = ungrouped data
Discrete series + continuous series = grouped data
FREQUENCY DISTRIBUTION
It is arrangement of statistical data to their respective
frequencies e,g
Age groups: 0-10 10-20 20-30 30-40 40-50
No. of patients: 5 8 15 18 13
STEPS OF CONSTRUCTION
1. Determine the number of classes.
2. Determine the magnitude of classes.
3. Construct the class limit.
4. Locate values in each limit.
1. Determine the number of classes.
The no. of classes depends upon the no. of items in the data;
however, the no. of classes should not be less than 5 & not more than
20. The no. of classes can also be calculated by the following formula (Sturges' rule).
k = 1 + 3.322 log n
If n = 100 observations then k = 1 + 3.322 log 100
k = 1 + 3.322(2) = 7.644 ≈ 8 classes
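The rule for the number of classes can be checked with a few lines of Python. This is only a sketch: it assumes the usual form of Sturges' rule with the constant 3.322 and a base-10 logarithm, and it rounds up to the next whole class; the function name is chosen here for illustration.

```python
import math

def sturges_classes(n: int) -> int:
    """Number of classes k = 1 + 3.322 * log10(n), rounded up to a whole class."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.ceil(1 + 3.322 * math.log10(n))

# Number of classes suggested for a few sample sizes.
for n in (50, 80, 100, 1000):
    print(n, sturges_classes(n))
```

For n = 100 this gives 8 classes, which lies inside the 5 to 20 range recommended above.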
2. Determine the magnitude of classes.
To find the magnitude of a class, first find the range (difference
between the max. & min. values), then divide the range by the
no. of classes you require.
3. Construct the class limits
The class limits should be as close to the minimum & maximum
values as possible.
4. Locate values in each class limit
Counting of items against each class can be done in 2 ways.
= By actually listing of elements
= By tally sheet method
The marks of 80 MBBS students in Bio-statistics are as under
Make a frequency distribution, grouping in interval of 5 marks, e.g. 50-54 ,55-59 etc
68 84 75 82 68 90 62 88 76 93
73 79 88 73 60 93 71 59 85 75
61 65 75 87 74 62 95 78 63 72
66 78 82 75 94 77 69 74 68 60
96 78 89 61 75 95 60 79 83 71
79 62 67 97 78 85 76 65 71 75
65 80 73 57 88 78 62 76 53 74
86 67 73 61 72 63 76 75 85 77
KEY MAXIMUM VALUE=97, MINIMUM VALUE=53
C-I ITEMS f
50-54 53 1
55-59 59, 57 2
60-64 62, 60, 61, 62, 63, 60, 61, 60, 62, 62, 63 11
65-69 68, 68, 65, 66, 69, 68, 67, 65, 65, 67 10
70-74 73, 73, 71, 74, 72, 74, 71, 71, 73, 74, 73, 72 12
75-79 75, 76, 79, 75, 75, 78, 78, 75, 77, 78, 75, 79, 79, 78, 76, 76, 78, 76, 76, 75, 77 21
80-84 84, 82, 82, 83, 80, 81 6
85-89 88, 88, 85, 87, 89, 85, 88, 86, 85 9
90-94 90, 93, 93, 94 4
95-99 95, 96, 95, 97 4
Summation 80
C-I      Tally sheet             f
50-54    I                       1
55-59    II                      2
60-64    IIII IIII I             11
65-69    IIII IIII               10
70-74    IIII IIII II            12
75-79    IIII IIII IIII IIII I   21
80-84    IIII I                  6
85-89    IIII IIII               9
90-94    IIII                    4
95-99    IIII                    4
∑                                80
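The tally can also be done by machine. The sketch below counts the 80 marks listed above into classes of width 5 starting at 50; the function name and the assumption that no value lies below the first class are illustrative choices, not part of the original exercise.

```python
from collections import Counter

# The 80 marks exactly as listed in the exercise above.
marks = [
    68, 84, 75, 82, 68, 90, 62, 88, 76, 93,
    73, 79, 88, 73, 60, 93, 71, 59, 85, 75,
    61, 65, 75, 87, 74, 62, 95, 78, 63, 72,
    66, 78, 82, 75, 94, 77, 69, 74, 68, 60,
    96, 78, 89, 61, 75, 95, 60, 79, 83, 71,
    79, 62, 67, 97, 78, 85, 76, 65, 71, 75,
    65, 80, 73, 57, 88, 78, 62, 76, 53, 74,
    86, 67, 73, 61, 72, 63, 76, 75, 85, 77,
]

def frequency_distribution(values, start, width):
    """Count values into inclusive classes start..start+width-1, and so on.

    Assumes every value is >= start.
    """
    counts = Counter((v - start) // width for v in values)
    table = []
    for i in range(max(counts) + 1):
        lower = start + i * width
        table.append((lower, lower + width - 1, counts.get(i, 0)))
    return table

table = frequency_distribution(marks, start=50, width=5)
for lower, upper, f in table:
    print(f"{lower}-{upper}  {'|' * f:<21}  {f}")
print("Total", sum(f for _, _, f in table))
```

Tallying the marks exactly as they are listed gives 12 in 60-64 and 5 in 80-84, one apart from the key in each case. The key's 80-84 row contains an 81 that does not appear among the listed marks, so a single value in the listing is probably a misprint; all other classes and the total of 80 agree.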
DATA
CLASSIFICATION
The process or method of arranging the heterogeneous data into
homogenous classes or groups is called classification.
By homogenous we mean that like should go with like and unlike
should go with unlike.
Generally there are two types of classification.
(a)- Classification according to attribute.
(b)- Classification according to class interval.
DATA
(a) Classification according to attribute.
When statistical data is classified on basis of descriptive
characteristics , such classification is called classification
according to attribute.
This can be done in two ways.
1.Simple classification.
2.Manifold classification.
DATA
Simple classification
In simple classification only one attribute is
taken into account, which is subdivided
into two classes and no more, e.g. if we
need to study the poverty of a place, then
the only attribute is subdivided
into two classes:
1. The people who are poor.
2. The people who are not poor.
DATA
Manifold classification
If we need to study more than one attribute, we make use
of manifold classification, in which each attribute is further
subdivided into two classes and no more, e.g. to study
the poverty of a place sex-wise, the two attributes poverty
and sex can be classified as:
1. Males who are poor. 2. Females who are poor.
3. Males who are not poor. 4. Females who are not poor.
DATA
(b) CLASSIFICATION ACCORDING TO CLASS INTERVAL
When the data is classified on the basis of numerical
characteristics, it is known as classification according to class
interval, e.g. if we need to study the weight of blood sugar
patients, then the patients who weigh from
110 lbs to 120 lbs are placed in one group, those who weigh
from 120 lbs to 130 lbs are placed in the 2nd group & so on.
TABULATION
The process or method of arranging statistical data into rows &
columns is called tabulation.
There are 4 types of tabulation.
1. One- way table 2. Two- way table
3. Three- way table 4. Higher-order table
ONE- WAY TABLE
The table, which provides only one information in one table, is
called one way table.
Sargodha 1950
Khushab 1200
Mianwali 1600
TWO-WAY TABLE
The table, which provides two inter related information in one
table, is called two- way table
[Figure: bar diagram with bars for Italy, UK, USSR and USA]
DIAGRAMMATIC REPRESENTATION OF DATA
MULTIPLE BAR DIAGRAM
If more than one interrelated piece of information is required in one
diagram, we make use of a multiple bar diagram.
In multiple bar diagrams the simple bars are placed side by side
to provide more than one piece of information in the same diagram. For
the sake of distinction the bars should be colored.
[Figure: multiple bar diagram showing Series 1, 2 and 3 for Categories 1 to 4]
SUB-DIVIDED BAR DIAGRAM
If more than one piece of information is required in the same
diagram, we make use of a sub-divided bar diagram.
Sub-divided bar diagrams are those in which each bar represents
the total of the components and is then divided according to
the size of each item. The various sub-divisions are then
colored for the purpose of distinction.
[Figure: sub-divided bar diagram showing Series 1, 2 and 3 for Categories 1 to 4]
TWO DIMENSIONAL DIAGRAM
In a two dimensional diagram two dimensions, i.e. length and
breadth, are taken into account, which are represented by a square,
rectangle or circle.
As length and breadth are taken into account, two
dimensional diagrams are also called area diagrams.
Area diagrams can be further sub-divided into 2 categories:
a) Simple area diagram  b) Sub-divided area diagram
Simple area diagrams are,
a)Square diagram
b)Rectangle diagram
c)Circle diagram
Sub-divided diagrams are,
a) Sub-divided/component square diagram
b) Sub-divided/component rectangle diagram
c) Sub-divided circle diagram(pie diagram)
SIMPLE RECTANGLE DIAGRAM
In a simple rectangle diagram two dimensions, i.e. length and
breadth, are taken into account because the area of a rectangle is
equal to the product of its length and breadth. There are two
methods of drawing simple rectangles:
a) Keeping the width constant and the height (length)
proportional to the size of the figures.
b) Keeping the length constant and the width proportional to the size of
the figures.
SUB-DIVIDED RECTANGLE DIAGRAM
In order to compare two or more quantity as well as their
components we make the use of sub-divided rectangle
diagram . These diagrams are generally drawn to compare the
budgets of various families.
For construction of the sub-divided rectangle diagrams , the
following steps are involved
1.Convert each component into percentage of corresponding total.
2.Take the length of rectangle equal to 100 and width proportional to
given total.
3.Divide each length according to computed percentage.
4.Use colour to distinguish the various subdivisions of each
rectangle.
PIE DIAGRAM
It is an improvement over a bar diagram. It is a circular diagram,
in which the frequencies of observations are shown as sectors
or wedges in a circle, the size of each sector being proportional
to the frequency. Degrees of angle denote the frequency and
area of sector gives comparative difference at a glance.
“Pie means a piece or a sector”
PIE-DIAGRAM
To draw a pie diagram, first a circle is drawn and a radius is
marked. A second radius is drawn clockwise at an angle to the
first radius equal to the angle of the sector, which can
be calculated by the following formula:
Angle of any sector = (no. of observations / total no. of observations) × 360°
The sectors should be arranged clockwise either in ascending or
descending order of magnitude. It is often necessary to
indicate the percentage for easy comparison.
The pie diagram can be made more attractive by giving a 3
dimensional effect to it. Each sector can be sliced out from the
main diagram to highlight the fact.
PIE-DIAGRAM
Sectors are then outlined by coloring or shading.
Thus a pie diagram is more attractive.
[Figure: pie diagrams of sales by quarter, with sectors of 58%, 23%, 10% (3rd Qtr) and 9% (4th Qtr)]
PIE-DIAGRAM
The results of a study of domestic accidents were as follows. Out
of 180 domestic accidents, 60 occurred in the kitchen, 50 in the bed
room, 40 under the staircase & 30 in the drawing room.
PIE-DIAGRAM
Degree
Angle for accidents in kitchen = 60/180 × 360 = 120°
Angle for accidents in bed room = 50/180 × 360 = 100°
Angle for accidents under staircase = 40/180 × 360 = 80°
Angle for accidents in drawing room = 30/180 × 360 = 60°
Percentage
% of accidents in kitchen = 60/180 × 100 = 33.3%
% of accidents in bed room = 50/180 × 100 = 27.8%
% of accidents under staircase = 40/180 × 100 = 22.2%
% of accidents in drawing room = 30/180 × 100 = 16.7%
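The sector angles and percentages for the domestic accident example can be reproduced with a short script. It is a sketch only; the dictionary keys and the helper name `sector` are illustrative.

```python
# Domestic accidents by place, from the worked example (total 180).
accidents = {"kitchen": 60, "bed room": 50, "under staircase": 40, "drawing room": 30}
total = sum(accidents.values())

def sector(count, total):
    """Return (angle in degrees, percentage) for one sector of the pie."""
    share = count / total
    return share * 360, share * 100

for place, count in accidents.items():
    angle, percent = sector(count, total)
    print(f"{place:<16} {angle:6.1f} deg  {percent:5.1f} %")
```

The angles add up to 360° and the percentages to 100%, which is a quick check on the arithmetic before the diagram is drawn.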
DIAGRAMATIC REPRESENTATION OF DATA
PICTOGRAM/PICTURE DIAGRAM
It is a visual representation of statistical data by means of
pictures. This method is used to impress a layman, who cannot
understand the orthodox charts. The pictures are drawn in
horizontal lines, each picture indicating a unit of 10, 20, 30
etc. happenings. The number of pictures in each row gives an
idea of the frequency of the attribute.
DIAGRAMATIC REPRESENTATION OF DATA
CARTOGRAM/MAP DIAGRAM
It is a visual representation of statistical data by means of signs &
symbols on the map and is prepared to show geographical
distribution of frequencies of characteristic. This is commonly
used to represent geographic distribution of disease and
deaths of public health importance. These are of two types –
a)spot maps, b)shaded maps.
CARTOGRAM/MAP DIAGRAM
SPOT MAP
In this type the distribution of disease frequency is represented
in the form of dots or spots, each dot representing a unit
number of 10, 20, 30 etc. on the area map prepared. Such maps
show at a glance areas of high frequency (clustering of spots)
or low frequency. Clustering of spots may indicate a common
source of infection or a common risk factor shared by all cases.
CARTOGRAM/MAP DIAGRAM
Spot maps help epidemiologists to study the place
distribution, source / reservoir of infection and behavior of a
disease.
Two different colored dots may be marked on the map to show
attacks and deaths, in the area. Maps prepared on weekly or
monthly basis help in monitoring changes in the magnitude of
epidemics over a period of time and also direction of their
spread.
CARTOGRAM/MAP DIAGRAM
SHADED MAP
These are used to indicate variability in the incidence and
prevalence of diseases in different parts of the world / country
or from time to time. These maps also help in evaluating
progress achieved in reducing the burden of diseases over a
period of time.
GRAPHIC REPRESENTATION OF DATA
A graph is a device used for representing statistical data in a
simple, clear and effective manner. A graph consists of curves
or straight lines. Graphs are used to study the relationship
between two variables. Graphs of frequency distributions are:
1) Histogram
2) Frequency polygon
3) Frequency curve
4) Cumulative frequency curve (OGIVE)
5) Scatter/dot diagram
6) Line chart/graph
HISTOGRAM
It is a graphic representation of a frequency distribution table in
which the vertical axis represents the frequency & the horizontal
axis the class interval.
It consists of a series of bars adjoining each other, the length of
each bar being proportional to the frequency and the width to
the class interval.
Histograms are ideally suited to represent the distribution of
anthropometric values like height, weight, mid-arm
circumference, etc.
They can also represent other types of continuous data series
such as blood pressure pulse rate, hemoglobin level, etc.
Histograms provide a better understanding of quantitative data
of continuous type than frequency tables.
[Figure: histogram]
FREQUENCY POLYGON
It is a line diagram that represents a frequency
distribution table. It can be obtained by joining the mid-
points (dots) of the heads (heights) of the bars of a histogram. Each dot
represents two characteristics: the class interval as indicated on the
horizontal axis and the class frequency on the vertical axis. Joining the dots gives a curve with
many angles,
hence the name “frequency polygon”.
This type of diagram is especially useful when it is necessary to
compare two or more frequency distributions. The curves for
different distributions should be drawn with different types of
lines on the same graph paper for easy comparison, which is
not possible with histograms because overlapping
rectangles result in confusion.
[Figure: frequency polygon plotted over Class 1 to Class 6]
FREQUENCY CURVE
When no. of observations is very large and group interval is
reduced, the frequency polygon tends to lose its angulations,
resulting in a smooth curve, known as “Frequency curve”
[Figure: frequency curve]
CUMULATIVE FREQUENCY CURVE
Cumulative frequency
C-I      f     Cumulative frequency
10-20    7     7
21-30    5     12
31-40    10    22
41-50    9     31
51-60    3     34
∑f = 34
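A cumulative frequency column is simply the running total of the f column. The sketch below builds it from the class frequencies in the table above and also picks out the class in which the n/2-th observation falls; the variable names are illustrative.

```python
from itertools import accumulate

classes = ["10-20", "21-30", "31-40", "41-50", "51-60"]
frequencies = [7, 5, 10, 9, 3]

# Running total of the frequencies, one entry per class.
cumulative = list(accumulate(frequencies))

for c, f, cf in zip(classes, frequencies, cumulative):
    print(f"{c:<6} {f:>3} {cf:>4}")

# The median class is the first class whose cumulative frequency reaches n/2.
half = cumulative[-1] / 2
median_class = next(c for c, cf in zip(classes, cumulative) if cf >= half)
print("n =", cumulative[-1], " median class =", median_class)
```

The last cumulative frequency must equal ∑f, which is a useful check on any hand-built table.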
It is a line diagram representing the cumulative frequency
distribution of quantitative data.
To draw it, an ordinary frequency table of quantitative data has
first to be converted into a cumulative frequency table.
The curve is plotted by taking the variable on the x-axis
and the cumulative frequency on the y-axis.
From ogive, median value of the characteristics (variable) can
also be calculated.
GRAPHIC REPRESENTATION OF DATA
LINE DIAGRAM
In this diagram vertical axis represents the magnitude and
horizontal axis represents time
Thus this diagram provides a simple, easily understandable and
highly effective means of understanding the trend or behavior
of event over a period of time, e.g. rising or falling or
fluctuations, such as birth rate, death rate, population rate etc.
GRAPHIC REPRESENTATION OF DATA
The class interval may be one month, one year, one decade.
Since line diagrams do not occupy any space several lines may be
projected on the same graph for comparing the trends of
interrelated events.
Multiple line diagrams can coexist only if they share the scales
given at two axes of the graph.
[Figure: line diagram plotted over Categories 1 to 4]
SCATTER DIAGRAM OR DOT DIAGRAM
When observations for two variables (e.g. weight & mid-arm
circumference, or weight & height) are made in each of the
individuals in a group, a scatter diagram helps to study the relationship
between the two variables. One variable is represented on the x-axis
and the other on the y-axis. Perpendiculars drawn from
these readings meet to give one scatter point.
There will be as many points as there are individuals
in the observation. When all the points are plotted, the
diagram gives the picture of scatter, hence the name “scatter
diagram”
(dot diagram). The direction of scatter helps to determine the
presence or absence of an association.
[Figure: scatter diagram]
x̄ = ∑x/n
So Mean =1200/10=120 mm of Hg
MEAN
For grouped data
x̄=∑ fx /∑ f
Example
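As an illustration of the grouped-data formula, the sketch below applies x̄ = ∑fx / ∑f to the age-group frequency distribution given earlier (0-10, 10-20, ... with 5, 8, 15, 18 and 13 patients). Taking the class midpoint as x is the usual convention and is an assumption here, as is the function name.

```python
# Age groups (class boundaries) and number of patients in each group.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [5, 8, 15, 18, 13]

def grouped_mean(classes, frequencies):
    """Mean of grouped data: sum(f * x) / sum(f), with x the class midpoint."""
    midpoints = [(lower + upper) / 2 for lower, upper in classes]
    sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
    return sum_fx / sum(frequencies)

print(round(grouped_mean(classes, frequencies), 2))
```

Here ∑f = 59 and ∑fx = 1735, so the mean age is about 29.4 years.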
Median
Definition: The median is the central value of a data set or
distribution which, when the values are arranged in ascending or
descending order, divides the data or distribution into two
equal halves, one half comprising values greater than it and the
other half values smaller than it. It is easy to locate when there is
an odd number of values. When there is an even number of values,
the median is taken as the average of the two central values of the
data.
Median
Example---ungrouped data
In a hospital ward, following are the number of days of stay of
patients.
13,42,8,9,7,3,6,52,8,2,11,11,10,9
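The median of the hospital stay data can be worked out as follows. This is a minimal sketch of the definition above (sort, then take the middle value or the average of the two middle values); the function name is illustrative.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

stay_days = [13, 42, 8, 9, 7, 3, 6, 52, 8, 2, 11, 11, 10, 9]
print(sorted(stay_days))
print("median =", median(stay_days))
print("mean   =", round(sum(stay_days) / len(stay_days), 2))
```

With 14 values the median is the average of the 7th and 8th ordered values, both 9, so the median stay is 9 days, while the two long stays of 42 and 52 days pull the mean up to about 13.6 days.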
Advantages:-
• It eliminates the effect of extreme values
• It is easy to calculate & understand
• Only the values of the middle item need to be known
Median
Disadvantages:-
• Changing an extreme value has no effect on the median.
• It cannot be calculated unless the data is arranged in some
order (ascending or descending).
Median
The median has advantages over the mean, as the following
example shows. The daily salaries of 7 workers are
Rs. 5, 5, 5, 7, 10, 20, 102, a total of Rs. 154, so the mean is 22 but the median is 7. The
income of the 7th man, Rs. 102, has seriously affected the
mean but has not affected the median, so in this case the median
is nearer to the truth and therefore more representative than the
mean.
Median
GRAPHIC METHOD FOR LOCATING THE MEDIAN
The median can also be found from the “ogive” curve by the
following procedure.
On the y-axis the cumulative frequency n/2 is located, and from it a line parallel to the
x-axis is drawn to meet the curve. From the point of
intersection, a line parallel to the y-axis is drawn to meet the
x-axis. The point of intersection on the x-axis is the median value of the
observations.
Mode = 3
MODE
Example
Following is the ages of 10 medical students:
18, 18, 19, 19, 20, 20, 20, 21, 22, 23
What is the “mode”?
mode = 20 years of age
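Finding the mode amounts to counting how often each value occurs. The sketch below returns every value that shares the highest count, so it also copes with more than one mode; the second list is a made-up illustration, not data from these notes.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

ages = [18, 18, 19, 19, 20, 20, 20, 21, 22, 23]
print(modes(ages))

# Hypothetical list with two equally frequent values (two modes).
print(modes([1, 1, 2, 2, 3]))
```

Because the function only counts occurrences, it works equally well for qualitative categories such as test results.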
MODE
Example
To check the accuracy of the clinical diagnosis of malaria, blood
slides of 33 patients were examined for malaria parasites.
There were three possible results:
Negative
P. falciparum
P. vivax
MODE
Example
The results are presented in the following frequency distribution.
Negative 19
P. falciparum 13
P. vivax 1
Total 33
What is the mode?
The mode is “Negative.”
MODE
Example
Health personnel from 148 different rural health institutions were
asked the following question.
“How often have you run out of drugs for the treatment of malaria in
the past two years?”
This was a closed question with the following possible answers.
Never
1 to 2 times (rarely),
3 to 5 times (occasionally),
more than 5 times (frequently)
MODE
The numbers of responses in each category were totaled to give the
following frequency distribution.
• Never 47
• Rarely 71
• Occasionally 24
• Frequently 6
Total 148
What is the mode?
The mode is “rarely.”
MODE
Example
82 clinics in one district were asked to submit the number of
patients treated for malaria in one month. The researchers
presented both the frequency distribution and percentages (or
relative frequencies) as follows
MODE
NUMBER OF PATIENTS   NUMBER OF CLINICS   RELATIVE FREQUENCY
0 to 19              25                  31%
20 to 39             3                   4%
40 to 59             5                   6%
60 to 79             11                  14%
80 to 99             19                  24%
100 to 119           10                  12%
120 to 139           4                   5%
140 to 159           3                   4%
Total                80*                 100%
MODE
*Data from two clinics are missing.
Note: Usually you do not include missing data in the calculation
of percentages.
However, the number of missing data (e.g., people who did not
respond to a question) is a useful indication of the
adequacy of your data collection. Therefore, this number
should be mentioned as a note to your table.
MODE
What is the mode?
The mode is “0 to 19”, as this outcome is recorded most
frequently (25 times out of 80).
MODE
There can be more than one mode for a series of data. In a
distribution with two most frequent values, there will be 2
modes: Bimodal distribution
Mode= average of 2 modes
MODE
Grouped data
Mode = l + [(fm – f1) / ((fm – f1) + (fm – f2))] × h
l = lower class boundary of modal group
fm = frequency of modal group
f1 = frequency of the group preceding the modal group
f2 = frequency of the group following the modal group
h = class interval of modal group
MODE
Example
Following are the number of men in various age groups with
some form of paid employment in a village. The age recorded
for each man is the number of completed years lived.
Calculate “mode”
Age = 14-20, 21-30, 31-40, 41-50, 51-60, 61-70
men = 12 14 26 35 23 5
MODE
age f Class boundaries
14 – 20 12 13.5 – 20.5
21 – 30 14 20.5 – 30.5
31 – 40 26 30.5 - 40.5
41 – 50 35 40.5 -50.5
51 – 60 23 50.5- 60.5
61 – 70 5 60.5 – 70.5
71 - 90 1 70.5 – 90.5
MODE
Mode = l + [(fm – f1) / ((fm – f1) + (fm – f2))] × h
= 40.5 + [(35 – 26) / ((35 – 26) + (35 – 23))] × 10
= 40.5 + [9 / (9 + 12)] × 10
= 40.5 + (9/21) × 10
= 40.5 + 90/21 = 44.8
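The grouped-data mode calculation can be checked with a small function. The parameter names follow the symbols used in the formula (l, fm, f1, f2, h); the function name is an illustrative choice.

```python
def grouped_mode(l, fm, f1, f2, h):
    """Mode of grouped data: l + (fm - f1) / ((fm - f1) + (fm - f2)) * h.

    l  = lower class boundary of the modal group
    fm = frequency of the modal group
    f1 = frequency of the preceding group
    f2 = frequency of the following group
    h  = class interval of the modal group
    """
    return l + (fm - f1) / ((fm - f1) + (fm - f2)) * h

# Modal group 41-50 (boundaries 40.5-50.5) from the employment example.
mode = grouped_mode(l=40.5, fm=35, f1=26, f2=23, h=10)
print(round(mode, 1))
```

The result, about 44.8 years, lies inside the modal class 40.5 to 50.5 as it should.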
MODE
In distribution with extreme values
Most affected measure of central tendency;
MEAN
Least affected measure of central tendency;
MODE
Most preferable measure of central tendency;
MEDIAN
MODE
Example
The incidence of malaria in an area is
20,20,50,56,60,5000,678,898,345,456
Incidence in ascending order is
20,20,50,56,60,345,456,678,898,5000
Mean= ∑ x/n= 7583/10= 758.3
Median= average of 5th & 6th value=(60+345)/2
Median= 202.5
Mode= 20
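The three averages for the malaria incidence figures can be verified with Python's standard `statistics` module. The second list, in which the extreme value 5000 is replaced by 500, is a hypothetical variation added only to show how each average reacts to an extreme value.

```python
import statistics

incidence = [20, 20, 50, 56, 60, 5000, 678, 898, 345, 456]

print("mean   =", statistics.mean(incidence))
print("median =", statistics.median(incidence))
print("mode   =", statistics.mode(incidence))

# Hypothetical variation: the extreme value 5000 replaced by 500.
tamed = [500 if x == 5000 else x for x in incidence]
print("mean   =", statistics.mean(tamed))
print("median =", statistics.median(tamed))
print("mode   =", statistics.mode(tamed))
```

Changing the single extreme value moves the mean from 758.3 to 308.3, while the median (202.5) and the mode (20) stay where they were.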
MODE
ADVANTAGES
It is easy to calculate.
It is least influenced by extreme values.
It is the only average that can be applied to qualitative data.
MODE
DISADVANTAGES
It may not exist in a small group of values
It cannot be subjected to mathematical treatment
CENTRAL TENDENCY IN VARIOUS DISTRIBUTIONS
NORMAL (GAUSSIAN) DISTRIBUTION: MEAN = MEDIAN = MODE
RIGHT (POSITIVE) SKEW DISTRIBUTION: MEAN > MEDIAN > MODE
LEFT (NEGATIVE) SKEW DISTRIBUTION: MEAN < MEDIAN < MODE
Mode = 3median – 2mean
If median is 5 & mean is 4
What is the mode?
Mode= 3(5) – 2(4)
= 15 – 8
=7
MEASURES OF DISPERSION
The measures of central tendency are not sufficient to describe
all the characteristics of the data or distribution.
It is quite possible that two or more distributions may have the
same average, but the observations may differ from each
other.
DISPERSION:
By dispersion, we mean how far the values are scattered from
each other or from the average.
MEASURES OF DISPERSION
For example, the diastolic pressures (in mm Hg) of the players of
two cricket teams are
Team A = 92, 90, 88, 88, 88, 86, 84, 84, 84, 82, 80
Team B = 100, 98, 96, 94, 90, 86, 82, 78, 76, 74, 72
Both groups have a mean of 86 mm Hg. At
the same time, the range as well as the individual diastolic
pressures of the two groups are different.
MEASURES OF DISPERSION
1. Range (R)
2. Mean Deviation (M.D)
3. Standard Deviation (S.D)
4. Coefficient of Variation (CV )
RANGE
The range is defined as the difference between maximum values
( Xm) and the minimum values (Xo) in the data or distribution.
. R= (Xm – X0)
Where R = Range
Xm = Maximum Value
Xo = Minimum Value in the Data
Example: 60, 69, 70, 71, 72.
So, R = 72 – 60 = 12
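As a minimal sketch, the range of the example above and of the two cricket teams can be computed as follows. The function name `value_range` is illustrative.

```python
def value_range(values):
    """Range R = maximum value - minimum value."""
    return max(values) - min(values)

example = [60, 69, 70, 71, 72]
team_a = [92, 90, 88, 88, 88, 86, 84, 84, 84, 82, 80]
team_b = [100, 98, 96, 94, 90, 86, 82, 78, 76, 74, 72]

print(value_range(example))
print(sum(team_a) / len(team_a), value_range(team_a))
print(sum(team_b) / len(team_b), value_range(team_b))
```

Both teams have a mean of 86 mm Hg, but the range of Team B (28) is more than twice that of Team A (12).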
RANGE
Thus the range gives the extreme values but does not give any
information about the values in between them. It
usually defines the limits of normalcy, e.g.
Blood sugar random=110 to 160 mg
Cholesterol = 120 to 250 mg
RANGE
I. The Range is simple to understand & easy to calculate
II. It is useful as a rough measure of dispersion.
III. It is dependent upon the extreme values so it gives no
indication how the values within the two extremes are
distributed.
IV. It is a highly unstable measure of dispersion.
Mean deviation (MD)
x̄ = 1190 / 10 = 119
∑ |xi – x̄| = 22
MD = ∑ |xi – x̄| / n = 22 / 10 = 2.2
So the mean deviation is 2.2
Important characteristics of Mean Deviation.
i. It is simple to understand and easy to calculate
ii. It is not capable of further mathematical treatment.
iii. Though simple and easy, Mean Deviation is not used in
Statistical Analysis, being of less mathematical value
particularly in drawing Inferences (results).
MEASURES OF DISPERSION
STANDARD DEVIATION(SD)
BY
DR. ABDUL RAUF
STANDARD DEVIATION
Two classes took part in a recent quiz. There were
10 students in each class, and each class had an
average score of 81.5 Since
the averages are the same, can we assume that
the students in both classes perform the same on
the exam?
STANDARD DEVIATION
The answer is… No.
‘Standard Deviation’ is
represented by the
symbol sigma (σ)
Example
You and your friends have just measured the heights of
your dogs (in millimeters):
so the mean (average) height is 394 mm. Let's plot this on the chart:
Now, we calculate each dog's difference from the Mean:
√((7² + 1² + 6² + 2²) / 4) = √(90 / 4) = 4.74...
For example, start with the lowest score, 72. How far away is 72 from the mean
of 81.5?
72 - 81.5 = - 9.5
Or, start with the highest score, 89. How far away is 89 from the mean of 81.5?
89 - 81.5 = 7.5
So, the first step to finding the Standard Deviation is to find all the
distances from the mean. Next, you square each distance.
Score   Distance from Mean   Distance Squared
72      - 9.5                90.25
76      - 5.5                30.25
80      - 1.5                 2.25
80      - 1.5                 2.25
81      - 0.5                 0.25
83        1.5                 2.25
84        2.5                 6.25
85        3.5                12.25
85        3.5                12.25
89        7.5                56.25
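Divide the sum of the squared distances by (n - 1) to get the variance, then
take the square root to get the standard deviation.
For the other class, variance = 253.4 and standard deviation of the grades = 15.91
Score   Distance from Mean   Distance Squared
95       13.5                182.25
96       14.5                210.25
98       16.5                272.25
93       11.5                132.25
71      - 10.5               110.25
63      - 18.5               342.25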
Now, let's compare the two classes again
                      Team A   Team B
Average on the Quiz   81.5     81.5
Standard Deviation    4.88     15.91
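The steps for the first class (distances from the mean, squaring, dividing by n - 1, square root) can be sketched as below. The scores are those listed for the class with mean 81.5; the variable names are illustrative.

```python
from math import sqrt

scores = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]

n = len(scores)
mean_score = sum(scores) / n

# Step 1: distance of every score from the mean
distances = [x - mean_score for x in scores]

# Step 2: square each distance
squared_distances = [d ** 2 for d in distances]

# Step 3: divide the sum of squares by (n - 1) to get the variance
variance = sum(squared_distances) / (n - 1)

# Step 4: the standard deviation is the square root of the variance
sd = sqrt(variance)

print(mean_score, round(variance, 2), round(sd, 2))
```

The result agrees with the 4.88 shown in the comparison table.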
APPLICATIONS OF STANDARD DEVIATION
1).A standard deviation(SD) is the universally accepted unit of
dispersion of values, from the mean value.
2).SD summarizes the variation of a large distribution in one
figure
3).SD measures the position or distance of observation from the
mean
4).SD indicates whether variation of difference of an individual
from the mean, is by chance(natural) or real due to some
special reasons
5).SD helps in finding the size of the sample
6).SD is used to calculate Standard Error(SE) of mean & SE of
difference between two mean
7).SD is used for calculation of “relative deviate” or “Z score”
8).SD is used in the calculation of “Coefficient of Variation”(CV)
In a normal distribution series, an interval of x̄ ± 1 SD
encloses 68.27% of values, an interval of x̄ ± 2 SD encloses 95.45%
of values, and an interval of x̄ ± 3 SD encloses 99.73% of values. For
the purpose of simplicity, the limit of x̄ ± 2 SD is treated as including
95% of values.
In other words six standard deviations, 3 on either side of the
mean cover almost the entire range of a quantitative
series(which will be explained under “normal curve”)
Z- score is the difference of a specific observation from the mean
in terms of SD.
Formula is
Z = (x – x̄) / SD
Where Z = relative deviate
x = the observation in question
x̄ = the mean
EXERCISE
The mean height of 4th year students is 150 cm with SD of 10 cm.
Ahsan’s height is 165 cm.
Calculate Z- score
Z = (165 – 150) / 10 = 15 / 10
Z = 1.5 (a Z-score has no unit: Ahsan's height is 1.5 SD above the mean)
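A minimal sketch of the same calculation; the function name `z_score` is illustrative.

```python
def z_score(x, mean, sd):
    """Relative deviate: distance of x from the mean, measured in SD units."""
    return (x - mean) / sd

# Ahsan's height (165 cm) against a class mean of 150 cm and SD of 10 cm
z = z_score(165, 150, 10)
print(z)
```

Because the centimetres cancel in the division, the result is a pure number.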
THANK YOU
COEFFICIENT OF VARIATION
(CV)
BY
DR ABDUL RAUF
Can we compare the standard deviations of two
quantitative series in the same group if
the attributes are different, like SD of height and SD of
weight?
Answer is
“NO”
Can we compare the standard deviations of the same attribute, such as
height, if the units of measurement are different in the
two groups, for example cm and inch?
Answer is
“NO”
This limitation of SD is removed by converting SD into
“COEFFICIENT OF VARIATION” (CV)
The CV is the standard deviation expressed as the
“percentage of the mean”
CV is a unit less number, therefore CV is well suited for
all types of dissimilar measurements such as Height
and Weight, or Hemoglobin and Weight, or pulse rate
and mid-arm circumference
COEFFICIENT OF VARIATION
(CV)
CV = (S / X̄) × 100%
COEFFICIENT OF VARIATION
(CV)
Measure of Relative Variation
Always a %
Shows Variation Relative to Mean
Used to Compare 2 or More Groups
CV = (SD / Mean) × 100
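The formula can be sketched as follows. The height figures reuse the Z-score exercise (mean 150 cm, SD 10 cm); the weight figures are hypothetical and chosen only for illustration.

```python
def coefficient_of_variation(sd, mean):
    """CV = (SD / mean) x 100, a unit-free percentage."""
    return sd / mean * 100

# Height: mean 150 cm, SD 10 cm (from the Z-score exercise)
cv_height = coefficient_of_variation(sd=10, mean=150)

# Weight: mean 50 kg, SD 5 kg (hypothetical values, for illustration only)
cv_weight = coefficient_of_variation(sd=5, mean=50)

print(f"CV of height = {cv_height:.2f}%")
print(f"CV of weight = {cv_weight:.2f}%")
```

With these figures weight is relatively more variable than height, even though its SD is numerically smaller.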
p
x² = 0.76
c) expected number and χ²-value of “died” in
experimental group
E = (15 × 65) / 100 = 39 / 4 = 9.75
χ² = (O – E)² / E = (5 – 9.75)² / 9.75
χ² = (– 4.75)² / 9.75 = 22.56 / 9.75 = 2.31
d) expected number and χ²-value of “survived” in
experimental group
E = (85 × 65) / 100 = 85 × 0.65 = 55.25
χ² = (O – E)² / E = (60 – 55.25)² / 55.25
χ² = (4.75)² / 55.25 = 22.56 / 55.25 = 0.408
∑χ² = Total χ² value of all 4 cells
= 4.29 + 0.76 + 2.31 + 0.41
= 7.77
DF = (c – 1)(r – 1) = (2 – 1)(2 – 1) = ( 1x 1)= 1
Where
DF= Degree of freedom
c= no .of columns
r= no. of rows
On referring to Fisher’s χ²-table with 1 df, the tabulated χ²
value corresponding to a probability of 0.05 (5%
significance level, i.e. 95% confidence) is 3.84.
Since the calculated value (7.77) is more than the table
value (3.84), the null hypothesis is rejected and the
alternative hypothesis is accepted.
The assumption that the drug is not
efficacious (no difference between
drug and placebo) is ruled out, and it is
accepted that the drug is efficacious.
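The whole calculation can be sketched in a few lines. The experimental row (5 died, 60 survived) and the totals (15 died, 85 survived, 100 in all) appear in the working above; the control row (10 died, 25 survived) is an assumption inferred from those totals, and it reproduces the cell values 4.29 and 0.76.

```python
# Observed 2x2 table. The control row is inferred from the marginal totals.
observed = {
    "control": {"died": 10, "survived": 25},
    "experimental": {"died": 5, "survived": 60},
}

row_totals = {group: sum(row.values()) for group, row in observed.items()}
col_totals = {
    outcome: sum(row[outcome] for row in observed.values())
    for outcome in ("died", "survived")
}
grand_total = sum(row_totals.values())

chi_square = 0.0
for group, row in observed.items():
    for outcome, o in row.items():
        # Expected count = row total x column total / grand total
        e = row_totals[group] * col_totals[outcome] / grand_total
        chi_square += (o - e) ** 2 / e

df = (2 - 1) * (2 - 1)   # (columns - 1) x (rows - 1)
critical_value = 3.84    # tabulated chi-square for 1 df at p = 0.05
reject_null = chi_square > critical_value

print(round(chi_square, 2), df, reject_null)
```

Without intermediate rounding the total comes to about 7.78; the hand calculation gives 7.77 because each cell value was shortened before adding. The conclusion is the same.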
Example
Critical value: Referring to the critical values of chi square, at 0.05 level of
significance and 4 degrees of freedom, critical value is 9.49.
Computation:
Item                                      O (now)   E (last year)   (O-E)   (O-E)²   (O-E)²/E
1. Vacation Leave                          6         4               2       4       1.00
2. Salary Increase                        58        65              -7      49       0.75
3. Professional Growth                    14        13               1       1       0.08
4. Health and retirement benefits         14        12               2       4       0.33
5. Honorarium, incentives, overtime pay    8         6               2       4       0.67
χ² = 2.83
Since the computed value of 2.83 is
less than the tabular value of 9.49,
the null hypothesis is not
rejected. Therefore, at the 5 percent
significance level and 4 degrees of
freedom, there is no evidence that the present distribution of
responses differs from last year’s.
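A minimal sketch of this goodness-of-fit computation, using the observed and expected counts from the table; the variable names are illustrative.

```python
# Observed (now) and expected (last year) counts for the five items
observed = [6, 58, 14, 14, 8]
expected = [4, 65, 13, 12, 6]

# Chi-square = sum of (O - E)^2 / E over all categories
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1   # 5 categories give 4 degrees of freedom
critical_value = 9.49    # tabulated chi-square for 4 df at the 0.05 level
reject_null = chi_square > critical_value

print(round(chi_square, 2), df, reject_null)
```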
Example:
Marital Status
The chi-square test is not appropriate for a situation in which the sample size is small, yielding small expected
frequencies. There should be no expected frequency less than 1, and not more than 20% of the
expected frequencies should be less than 5. For a situation with a small sample size, we should
consider using Fisher’s Exact Test, which computes directly the probability of observing a
particular set of frequencies in a 2x2 table. For a 2x2 table with cells a, b, c, d and total n, that probability is
P = (a+b)! (c+d)! (a+c)! (b+d)! / (n! a! b! c! d!)
For comparison, the chi-square statistic of the same 2x2 table is
χ² = n(ad-bc)² / ((a+c)(b+d)(a+b)(c+d))
Example:
Consider the following 2x2 table showing the rating of successful or unsuccessful on a
job and pass or fail on an ability test:
                Test Item
                Fail    Pass    Total
Successful      a=4     b=1     5
Unsuccessful    c=1     d=3     4
Total           5       4       9
Computation:
P = 20 / 126 = 0.159
However, to compute the P value, we still need the probability of
obtaining this or a more extreme result while keeping the marginal totals in the
table fixed. To do this, reduce by 1 the smallest frequency that is greater than
zero while holding the marginal totals constant. Hence, the table will be:
5   0   5
0   4   4
5   4   9
P = (5! × 4∙3∙2∙1) / (9∙8∙7∙6 × 5!) = 24 / 3024
P = 0.008
Thus the probability of observing this particular frequency of getting successful
in a job or a more extreme frequency is 0.159 + 0.008= 0.167. This P value is for
one-tailed test. An estimate of a P value for a two-tailed test is obtained by
multiplying the value by 2; 2x 0.167= 0.334. Based on this value, the null
hypothesis that there is no difference in the success of job with or without
passing the ability test cannot be rejected.
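The two table probabilities can be checked with a short sketch. It writes the exact probability in its equivalent binomial-coefficient form, using `math.comb` from the standard library; the function name `table_probability` is illustrative.

```python
from math import comb

def table_probability(a, b, c, d):
    """Exact probability of one 2x2 table when the marginal totals are fixed."""
    n = a + b + c + d
    # Ways to fill the first column, divided by all ways to choose that column
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

p_observed = table_probability(4, 1, 1, 3)   # the observed table
p_extreme = table_probability(5, 0, 0, 4)    # the more extreme table
one_tailed = p_observed + p_extreme
two_tailed_estimate = 2 * one_tailed

print(round(p_observed, 3), round(p_extreme, 3), round(one_tailed, 3))
```

The exact two-tailed estimate is 0.333; the 0.334 in the text comes from doubling the rounded value 0.167.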
Yates’ Correction for Continuity
The statistic on which we base our decision has a distribution that is only approximated by the
chi-square distribution. The computed X2 values depend on the cell frequencies and consequently
are discrete. The continuous chi-square distribution seems to estimate the discrete sampling
distribution of X2 very well, provided that the number of degrees of freedom is greater than 1. In
a 2x2 contingency table, where we have only 1 degree of freedom, the Yates’ correction for
continuity may be applied. It is the process of subtracting 0.5 from the numerator at each term in
the chi-square statistic for 2x2 tables prior to squaring the term.
χ²(corrected) = ∑ (│O-E│ - 0.5)² / E
If the expected cell frequencies are large, the corrected and uncorrected results are almost the same.
When the expected frequencies are between 5 and 10, Yates’ correction should be applied. For
expected frequencies less than 5 the Fishers’ exact test should be used.
Chi –square Guidelines
When testing for “goodness of fit”, at least two
categories must be used so that the statistic has at least 1
degree of freedom. The general rule in
setting up the chi-square is to have as many
categories as possible, for the test will then be more
sensitive. The limitations are that no more than 20
percent of cells may have an expected frequency less than
5.0, and no cell may have an expected frequency
smaller than 1.0. If too many small expected
frequencies exist, the categories should be
combined, unless such combinations are not
possible.
Wt. (kg)      67   69   85   83   74   81   97   92   114   85
SBP (mmHg)   120  125  140  160  130  180  150  140  200   130
[Figure: scatter diagram of SBP (mmHg) against Wt (kg)]
[Figure: scatter diagrams showing a negative relationship, no relationship and a positive relationship]
[Figure: positive relationship, height in cm against age in weeks]
[Figure: negative relationship, reliability against age of car]
[Figure: no relation]
Correlation Coefficient
If r = 1 (or r = –1), the correlation is perfect.
How to compute the simple correlation
coefficient (r)
r = [∑xy – (∑x ∑y)/n] / √{[∑x² – (∑x)²/n] × [∑y² – (∑y)²/n]}
Example:
A sample of 6 children was selected, data about
their age in years and weight in kilograms was
recorded as shown in the following table . It is
required to find the correlation between age and
weight.
serial Age Weight
No (years) (Kg)
1 7 12
2 6 8
3 8 12
4 5 10
5 6 11
6 9 13
These 2 variables are of the quantitative type.
One variable (age) is called the independent
variable and is denoted (X); the other
(weight) is called the dependent variable and is denoted
(Y). To find the relation between
age and weight, compute the simple
correlation coefficient using the following
formula:
r = [∑xy – (∑x ∑y)/n] / √{[∑x² – (∑x)²/n] × [∑y² – (∑y)²/n]}
Serial n.   Age (years) (x)   Weight (Kg) (y)   xy          x²          y²
1           7                 12                84          49          144
2           6                 8                 48          36          64
3           8                 12                96          64          144
4           5                 10                50          25          100
5           6                 11                66          36          121
6           9                 13                117         81          169
Total       ∑x = 41           ∑y = 66           ∑xy = 461   ∑x² = 291   ∑y² = 742
r = [461 – (41 × 66)/6] / √{[291 – (41)²/6] × [742 – (66)²/6]}
r = 0.759
strong direct correlation
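The computational formula can be sketched directly from the column totals. The function name `pearson_r` is illustrative; the data are the six children from the table.

```python
from math import sqrt

ages = [7, 6, 8, 5, 6, 9]            # x, in years
weights = [12, 8, 12, 10, 11, 13]    # y, in kg

def pearson_r(x, y):
    """Simple correlation coefficient from the sums used in the table."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

r = pearson_r(ages, weights)
print(round(r, 3))
```

The same function can be applied to any pair of equal-length lists of measurements.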
EXAMPLE: Relationship between Anxiety and Test Scores
Anxiety (X)   Test score (Y)   X²          Y²          XY
10            2                100         4           20
8             3                64          9           24
2             9                4           81          18
1             7                1           49          7
5             6                25          36          30
6             5                36          25          30
∑X = 32       ∑Y = 32          ∑X² = 230   ∑Y² = 204   ∑XY = 129
Calculating Correlation Coefficient
r = [129 – (32 × 32)/6] / √{[230 – (32)²/6] × [204 – (32)²/6]}
r = – 0.94
rs = 1 – 6 ∑(di)² / (n(n² – 1))
∑ di² = 64
rs = 1 – (6 × 64) / (7 × 48) = 1 – 1.14 = – 0.14 (about – 0.1)
Comment:
There is an indirect weak correlation
between level of education and income.
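Since only ∑di² = 64 and n = 7 survive from this example, the sketch below starts from those two figures rather than from the original ranks. The function name is illustrative.

```python
def spearman_from_d2(sum_d2, n):
    """Spearman rank correlation from the sum of squared rank differences."""
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

rs = spearman_from_d2(sum_d2=64, n=7)
print(round(rs, 2))
```

The negative sign is what makes the correlation "indirect"; its small size is what makes it weak.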
exercise
Regression Analyses
[Figure: scatter diagram with Wt (kg) on the horizontal axis]
By using the least squares method (a procedure
that minimizes the vertical deviations of plotted
points surrounding a straight line) we are
able to construct a best fitting straight line to the
scatter diagram points and then formulate a
regression equation in the form of:
ŷ = a + bX
ŷ = ȳ + b(x – x̄)
b = [∑xy – (∑x ∑y)/n] / [∑x² – (∑x)²/n]
Regression Equation
The regression equation describes the regression line mathematically, by its intercept and slope.
[Figure: regression line of SBP (mmHg) on Wt (kg), with the intercept and slope marked]
Linear Equations
ŷ = a + bX
b = Slope = Change in Y / Change in X
a = Y-intercept
[Figure: straight line on X and Y axes showing the Y-intercept a and the slope b]
Hours studying and grades
Regressing grades on hours
[Figure: linear regression of final grade on hours of study]
Final grade in course = 59.95 + 3.17 * study
R-Square = 0.88
b = [461 – (41 × 66)/6] / [291 – (41)²/6] = 0.92
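A minimal sketch of the least squares slope, assuming (as the totals 41, 66, 461 and 291 indicate) that this calculation reuses the age and weight data of the six children. The intercept a = ȳ – b x̄ is added here for completeness and is not worked out in the slides.

```python
ages = [7, 6, 8, 5, 6, 9]            # x, in years
weights = [12, 8, 12, 10, 11, 13]    # y, in kg

n = len(ages)
sum_x, sum_y = sum(ages), sum(weights)
sum_xy = sum(x * y for x, y in zip(ages, weights))
sum_x2 = sum(x * x for x in ages)

# Slope: b = [sum(xy) - sum(x)sum(y)/n] / [sum(x^2) - (sum(x))^2/n]
b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)

# Intercept: the fitted line passes through the point (mean x, mean y)
a = sum_y / n - b * sum_x / n

def predict(x):
    """Predicted weight (kg) for a given age, y-hat = a + b x."""
    return a + b * x

print(round(b, 2), round(a, 2))
```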
Regression equation
x n
2 41678
20
ŷ =112.13 + 0.4547 x
for age 25
B.P = 112.13 + 0.4547 × 25 = 123.49 ≈ 123.5 mm Hg
Multiple Regression
Multiple regression analysis is a
straightforward extension of simple
regression analysis which allows more
than one independent variable.
GEOMETRIC MEAN
BY
DR ABDUL RAUF
Geometric mean
A / X = X / B
Find the geometric mean
between 4 and 9
4 / X = X / 9
X² = 36
√X² = √36
X = 6
Find the geometric mean
between 3 and 15
3 / X = X / 15
X² = 45
√X² = √45
X = 3√5
X = 6.7
8 is the geometric mean between
2 and what number
2 / 8 = 8 / X
2X = 64
X = 32
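The three exercises above can be checked with a short sketch; the function names are illustrative.

```python
from math import sqrt

def geometric_mean(a, b):
    """X such that a / X = X / b, i.e. X = sqrt(a * b)."""
    return sqrt(a * b)

def missing_term(a, x):
    """b such that x is the geometric mean between a and b: b = x^2 / a."""
    return x ** 2 / a

print(geometric_mean(4, 9))
print(round(geometric_mean(3, 15), 1))
print(missing_term(2, 8))
```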
The Altitude Drawn To The
Hypotenuse In A Right Triangle
Divides The Hypotenuse Into 2
Parts. The Altitude Is The Geometric
Mean Between The 2 Parts.
A / Alt = Alt / B
Find X (the altitude X divides the hypotenuse into parts 8 and 5)
8 / X = X / 5
X² = 40
X = √40
X = 2√10
X = 6.3
Each Leg Is The Geometric
Mean Between The Part Of The
Hypotenuse Adjacent To The Leg And
The Whole Hypotenuse
Y / W = W / (Y+Z)
Z / X = X / (Y+Z)
Find T
8 / T = T / (8+5)
T² = 8 (13)
T² = 104
T = √104
T = 2√26
T = 10.2
Find R
5 / R = R / (5+8)
R² = 5 (5+8)
R² = 5 (13)
R² = 65
R = √65
R = 8.1
Two Numbers: Detailed Method
•‘Bell Shaped’
• Symmetrical
• Mean, Median and Mode
are Equal
Mean
= Median
= Mode
[Figure: normal distribution curve. Mean ± 1s covers 68.26%, mean ± 2s covers 95.44% and mean ± 3s covers 99.73% of the area under the curve.]
Characteristics of Normal Distribution
1)Has a Bell Shape Curve and is Symmetric
2)The rim of the bell does not rest on the abscissa
but is separated from it by a gap.
3) It is Symmetric around the mean:
Two halves of the curve are the same (mirror
images)
Characteristics of Normal Distribution Cont’d
4)The total area under the curve is 1 (or 100%)
5)Normal Distribution has the same shape as Standard Normal
Distribution
6)All the three measures of central tendency i.e. mean, median,
mode coincide i.e. a perpendicular drawn from the peak of
curve to abscissa, that point on the abscissa is the mean,
median and the mode
Characteristics of Normal Distribution Cont’d
7) In a Standard Normal Distribution:
The mean (μ ) = 0 and
Standard deviation (σ) =1
8)Maximum no. of observations are at the value of
variable corresponding to the mean and the no. of
observation on both sides of this value gradually
decrease and there are few observations at the
extreme points
Characteristics of Normal Distribution Cont’d
9)The area under the curve (no. of observations) can be
represented in terms of the relationship between the mean and
the standard deviation. The relationship is expressed as
follows
Mean ± 1SD includes 68.3 % (roughly 2/3rd ) of all observations
Mean ± 2SD includes 95.4 % of all observations
Mean ± 3SD includes 99.7 % of all observations
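These percentages can be reproduced with the error function from Python's standard library. This is a minimal sketch; the function name `coverage` is illustrative.

```python
from math import erf, sqrt

def coverage(k):
    """Proportion of a normal distribution lying within mean ± k SD."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"Mean ± {k} SD includes {coverage(k) * 100:.2f}% of all observations")
```

The printed values are 68.27%, 95.45% and 99.73%, which the slides round in slightly different ways.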
Percent of Values Within One
Standard Deviation
68.26% of Cases
Percent of Values Within Two
Standard Deviations
95.44% of Cases
Percent of Values Within Three
Standard Deviations
99.73% of Cases
Characteristics of Normal Distribution Cont’d
10)Thus it is seen that almost all the values of observation will
be within the range mean ± 3SD, and most of the values are
within the range mean ± 2SD. This relationship is useful for
fixing the confidence intervals of the variables.
11)The properties of a normal distribution and a normal curve
form the basis of various tests of significance.
Characteristics of Normal Distribution Cont’d
12)Values larger and smaller than mean ± 3 SD will be rare (less
than 1%) in nature, and those larger and smaller than mean ±
2 SD will occur less than 5% of the time. In other words, if we say
that the confidence limit is 99%, it means that about 99% of the
values (more precisely 99.7%) lie within the range x̄ ± 3 SD and the
probability of any value falling outside this
range is only about 1% (p=0.01)
Normal Distribution
Similarly, suppose we say that the confidence limit
is 95% , that means 95% of the values are
distributed within the range of ×̄ ± 2SD and the
probability of occurrence of any value falling
outside or beyond this range is only 5% (p=0.05).
Formula
X < mean = 0.5-Z
X > mean = 0.5+Z
X = mean = 0.5
Z = (X-m) / σ
where,
m = Mean.
σ = Standard Deviation.
X = Normal Random Variable
Z score is the difference of a specific observation from the mean
in terms of SD.
Formula is
Z = (x – x̄) / SD
Where Z= relative deviate
x= the observation in question
The Normal Distribution: an example.
Coefficient of skewness = 3 (Mean – Median) / Standard deviation
Application/Uses of Normal Distribution
• Its application goes beyond describing distributions
• It is used by researchers and modelers.
Then:
Modified from Dawson-Saunders, B & Trapp, RG. Basic and Clinical Biostatistics,
2nd edition, 1994.
[Figure: normal curve, axis from –3 to 3 SD around μ; shaded area 0.159]
The exercises are modified from examples in Dawson-Saunders, B & Trapp, RG.
Basic and Clinical Biostatistics, 2nd edition, 1994.
Then:
[Figure: normal curve, axis from –3 to 3 SD around μ; shaded area 0.023]
[Figure: normal curve, axis from –3 to 3 SD around μ; shaded area 0.954]
[Figure: normal curve, axis from –3 to 3 SD around μ; shaded area 0.15%]
[Figure: normal curve, axis from –3 to 3 SD around μ; shaded area 0.15% in each tail]
1) 15.9% or 0.159
2) 2.3% or 0.023
3) 95.4% or 0.954
Solution/Answers Cont’d
4) 0.15% or 0.0015
PROBABILITY
BY
DR ABDUL RAUF
PROBABILITY
Probability means a chance factor for the
occurrence of a specific event e. g
chances of winning a lottery
chances of being selected
chances of getting a male child in the 1st
pregnancy etc.
PROBABILITY
This chance factor is associated with uncertainty,
because information in the happenings is not
available.
This uncertainty or mathematical quantity which
depends upon the occurrence of favorable or
unfavorable event, is numerically expressed as
probability
PROBABILITY
PROBABILITY of a particular event can be defined as, “the ratio
of no. of favorable cases for the particular event to the total
no. of cases both favorable & unfavorable to the particular
event”
Formula: P = n / N = No. of favorable cases / Total no. of both favorable and unfavorable cases
The Probability of an Event
P(Event) = the number of ways it can happen / the number of possible outcomes
Possible outcomes: HH, TT, HT, TH
P(H+T) = 2 / 4 = 1 / 2
Try these…
a) b)
1 1 1
B
4 B
4 B
2
CARDS
What is the
probability of getting
4 fives? 4
13
DICE
6 10 12 52
INDEPENDENT EVENTS
Two events A & B are said to be independent
events if the occurrence of A does not affect the
occurrence of B & vice versa, e.g. successive tosses of a
fair coin
PROBABILITY
• We write probabilities as ratios; these ratios
can then be written as fractions or percents.
• A probability of 0 means that the event
is impossible.
• A probability of 1 means that the event
is certain.
Probability   0                       ½                                           1
              Certain not to happen   Equally likely to happen or not to happen   Certain to happen
Chance        0%                      50%                                         100%
Chance
1/6
Question
½
Question
4 / 52 = 1 / 13
Question
13 / 32 = .40625
(favorable outcomes: 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 21, 31)
Question
12
Question
36 = (3)(3)(2)(2)(1)(1)
Question
If a customer makes exactly 1 selection from each of the 5
categories listed below, what is the greatest number of different
ice cream sundaes that a customer can create?
12 ice cream flavors;
10 kinds of candy;
8 liquid toppings;
5 kinds of nuts;
With or without whip cream;
Answer
9600=(12)(10)(8)(5)(2)
Question
You are at your school cafeteria that allows you to
choose a lunch meal from a set menu. You have
two choices for the Main course (a hot chicken or a
big burger), Two choices of a drink (orange juice,
apple juice) and Three choices of dessert(pie, ice
cream, jello). How many different meal combos can
you select?
Answer
12
Question
Licence plates for cars are labelled with 3
letters followed by 3 digits (in this case, digits
refer to the digits 0 - 9; if a question asks for
numbers, it means 1 - 9, because 0 is not counted as a
number here).
How many possible plates are there? You can
use the same letter or digit more than once.
Answer
(26)(26)(26)(10)(10)(10) = 17,576,000
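The last three answers all apply the same counting rule: when one choice is made from each category, the number of combinations is the product of the category sizes. A minimal sketch with `math.prod` from the standard library:

```python
from math import prod

# One selection from each category: multiply the number of options
sundaes = prod([12, 10, 8, 5, 2])    # flavors, candy, toppings, nuts, whip cream or not
meals = prod([2, 2, 3])              # main course, drink, dessert
plates = prod([26] * 3 + [10] * 3)   # 3 letters followed by 3 digits, repeats allowed

print(sundaes, meals, plates)
```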
Question
92%
Question
Convert 4% to a
simplified fraction
Answer
4 / 100 = 1 / 25
Answer
1 / 4 = .25
Question
A
B
C
Answer
1 / 8
Question
What is the
probability of
the following
spinner
landing on
Longitude?
Answer
1/3
Question
What is the probability of the
following spinner landing on blue
or red?
Answer
5 / 8 = .625
Question
What is the probability of the
following spinner landing on
yellow and red?
Answer
0
Question
What is the probability of landing on a
multiple of 3 if you spin the following
spinner?
Answer
2 / 8 = 1 / 4 = .25
Question
What is the probability of
landing on 2 if you spin the
following spinner?
Answer
3 / 8 = .375
Question
8 / 14 = 4 / 7 = .5714
Question
45 / 100 = 9 / 20 = .45
Question
What fractional portion of the
following picture is shaded?
Answer
3 / 8
Question
Which representation of
fractions below is the largest
number?
Answer
3 / 4
THANK YOU
Question from Permutations or
Combinations
Combinations
Question from Permutations or
Combinations
Permutation
Question from Permutations or
Combinations
Permutations
Question from Permutations or
Combinations
Combination
Question from Permutations or
Combinations
Permutations
THANK YOU
SAMPLING
BY
DR ABDUL RAUF
SAMPLING
It is not possible for any scientific study to cover the whole
population because of the
COST
TIME
PRACTICABILITY
So a representative portion of the universe is taken for the
study. It is called a SAMPLE
SAMPLING
Sampling is the process
of selecting a small number of elements
from a larger defined target group
of elements such that
the information gathered
from the small group will allow judgments
to be made about the larger group
SAMPLING
Universe: the theoretical aggregation of all
possible elements—unspecified to time and
space (e.g., University of Sargodha).
SAMPLING
Population: the theoretical aggregation of
specified elements as defined for a given
survey defined by time and space (e.g.,
medical students and staff in 2008).
SAMPLING
Population
The word “population” or “universe” means
An aggregate of all “elementary units”, each
unit may be animate or inanimate, about
which an information is required
SAMPLING
Universe or whole population may be finite,
e.g. 100 kg of rice in a sack,
all inhabitants of a city
Universe or whole population may be
infinite, e.g. stars in the sky
SAMPLING
Universe may be “HOMOGENEOUS”(made up of uniform
class) e. g. polished rice in a sack
All Muslim women of reproductive age in city
It may be “HETEROGENEOUS” (made of dissimilar classes
of persons or animals or objects)
• If all members of a population are
identical, the population is considered to
be homogeneous. That is, the
characteristics of any individual in the
population would be the same as the
characteristics of any other individual
(little or no variation among individuals).
So, if the human population on
Earth were homogeneous in
characteristics, how many
people would an alien need to
abduct in order to understand
what humans were like?
• When individual members of a population are
different from each other, the population is
considered to be heterogeneous (having significant
variation among individuals).
• How does this change an alien’s abduction scheme
to find out more about humans?
• In order to describe a heterogeneous population,
observations of multiple individuals are needed to
account for all possible characteristics that may
exist.
Defining Population of Interest
• Population of interest is entirely dependent on
Management Problem, Research Problems, and Research
Design.
• Some Bases for Defining Population:
– Geographic Area
– Demographics
– Usage/Lifestyle
– Awareness
SAMPLING
Sample or Target population: the
aggregation of the population from
which the sample is actually drawn (e.g.,
medical students and faculty in 2008-09
academic year)
SAMPLING
Sample element: a case or a single unit that is
selected from a population and measured in
some way—the basis of analysis (e.g., a
person, thing, specific time, etc.)
SAMPLING
Sample frame: a specific list that closely
approximates all elements in the
population, and from this, the researcher
selects units to create the study sample
Sampling Frame
• A list of population elements, (people,
companies, houses, cities, etc.) from which, units
to be sampled can be selected.
• It is difficult to get an accurate list.
• Sample frame error occurs when certain
elements of the population are accidentally
omitted or not included on the list.
SAMPLING
Sample: a set of cases that is drawn from a
larger pool and used to make
generalizations about the population
Conceptual Model
Universe
Population
Sample Population
Sample Frame
Elements
What is Sampling?
Population: what you want to talk about
Sample: what you actually observe in the data
Sampling process: Population → Sampling Frame → Sample
Inference: Sample → Population
Probability sampling, Nonprobability sampling
Classification of Sampling
Techniques
Sampling Techniques
Nonprobability Probability
Sampling Techniques Sampling Techniques
• Advantage
–Most representative group
• Disadvantage
–Difficult to identify every member of a
population
Systematic Random Sampling
Systematic random sampling
is a
method of
probability sampling
in which the defined
target population is ordered
and the sample is selected
according to position using a skip interval
Systematic Random Sampling
To select a systematic random sample of 25 students (n) from
a class of 75 students (N):
“N” is divided by “n” to get a quotient “r”
(75/25=3)
Then one unit among the first “r” units is selected at random, and the other units are
subsequently selected by adding this quotient “r” to
the previously selected number.
Systematic Random Sampling
Since “r” is 3, any one number (unit) is drawn between 1 and
3, say 2; then the sample will be made up of students with
numbers
2,
2+3=5,
5+3=8,
8+3=11 and so on
This quotient “r” is known as the “SAMPLING INTERVAL”
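The selection procedure can be sketched as below, assuming the units are numbered 1 to N and that N is an exact multiple of n, as in the class of 75 students. The function name and the `rng` parameter are illustrative.

```python
import random

def systematic_sample(population_size, sample_size, rng=random):
    """Select unit numbers 1..N with a random start and a fixed sampling interval."""
    interval = population_size // sample_size   # sampling interval r = N / n
    start = rng.randint(1, interval)            # random start between 1 and r
    return [start + i * interval for i in range(sample_size)]

# 25 students out of a class of 75; the seed only makes the run repeatable
students = systematic_sample(75, 25, random.Random(2))
print(students)
```

Whatever the random start, successive selected numbers always differ by the sampling interval of 3.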
EXERCISE
Suppose there are 210 villages in a
community development block
and 40 villages are desired. How
will you select ?
Systematic Random Sampling
Quotient “r” = N / n
= 210 / 40
= 5, remainder is 10
A random number is selected between 1 and 10, say 6,
which becomes the first number
“r” (5) is then added to 6, and so on
Hence, the serial numbers of the villages to be
selected will be
6, 11, 16, 21, 26, 31, 36, 41 ... and so on, up to 40
villages
Systematic Sampling
• Advantage
– Quick, efficient, saves time and energy
• Disadvantage
– Not entirely bias free; each item does not
have equal chance to be selected
– System for selecting subjects may
introduce systematic error
– Cannot generalize beyond population
actually sampled
Stratified Random Sampling
Stratified random sampling is a
method of
probability sampling
in which the population is divided
into different subgroups and samples
are selected from each
Steps in Drawing a Stratified Random
Sample
Divide the target population into
homogeneous subgroups or strata
depending upon the characteristics to be
studied(basis being age-group, sex-
group, area wise, socio-economic
status,PhD students, Masters Students,
Bachelors students). Elements within
each stratum are homogeneous, but are
heterogeneous across strata
Steps in Drawing a Stratified Random
Sample
A simple random or a systematic sample is
taken from each stratum relative to the
proportion of that stratum to each of the
others
Combine the samples from each stratum into a
single sample of the target population
Researchers use stratified sampling
–When a stratum of interest is a small
percentage of a population and
random processes could miss the
stratum by chance.
–When enough is known about the
population that it can be easily
broken into subgroups or strata.
This type of sampling is used when population is
heterogeneous. For example prevalence of a disease is
different in different age groups
The population is stratified into different sub groups as
Children
Adults
Old persons
ADVANTAGES
Precision of the estimate of the characteristic
under study is increased
Estimate of the characteristic under study can be
made for each stratum separately
Ensures that all strata are adequately represented
[Figure: a population (n = 1000; SE = 10%) divided into two strata and sampled either with equal intensity or proportional to size. Stratum 1: n = 400, SE = 7.5%; Stratum 2: n = 600, SE = 5.0%]
Sample with equal intensity vs. proportional to size?
“CLUMP”
POPULATION
Primary sampling
Unit
POPULATION
Non probability
Sampling Techniques
1.Sampling errors
2.Non- sampling errors
SAMPLING ERRORS
These are due to:
Faulty sampling method
Small size of sample
These errors can be minimized through proper sampling
method
NON- SAMPLING ERRORS
A. COVERAGE ERROR
This occurs when all the units in the sample are not covered, either due to
non co-operation or due to loss to follow-up.
This can be reduced by an intensive effort to get complete
coverage
NON- SAMPLING ERRORS
B. OBSERVATIONAL(OR
EXPERIMENTAL ) ERROR
This is due to interviewer’s bias or due to lack of training.
This can be reduced by setting up standards of interview or
proper training of the workers
NON- SAMPLING ERRORS
C. PROCESSING ERROR
This is due to clerical mistake or computational
error.
This can be reduced by administrative control
SIZE OF THE SAMPLE
Optimum size of the sample, has to be considered , keeping in
view the time, cost and the feasibility of the study.
FACTORS OF ESTIMATION OF SAMPLE SIZE
Characteristic
Permissible error
Probability level
Resources
SAMPLE SIZE FOR QUALITATIVE DATA
n = 4pq / L²
Where
n = required sample size,
p = approximate prevalence rate of disease (obtained from
previous studies or from a pilot study)
q = 1 – p (or 100 – p when p is expressed as a percentage)
L = permissible error in the estimate of “p”
The above formula has been worked out for a probability level
of p = 0.05 (i.e., a 5 percent chance of error, or 95 percent
confidence, in the estimate obtained from the sample).
Example: To estimate the prevalence rate of ascariasis in a
community where it is approximately known to be
40 percent, the required sample size to
estimate the morbidity (ascariasis) with 5% error
at a probability level of 0.05 is calculated as follows:
p = 40%
q = 100 – p = 100 – 40 = 60%
L = 5% of 40 = 5/100 × 40 = 2
n = 4pq / L² = (4 × 40 × 60) / (2)² = 2400
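A minimal sketch of the same calculation, with p given as a percentage and the permissible error given as a percentage of p, as in the ascariasis example. The function and parameter names are illustrative.

```python
def sample_size_qualitative(p_percent, error_percent_of_p):
    """n = 4pq / L^2, with p and q in percent and L as a share of p."""
    q_percent = 100 - p_percent
    permissible_error = error_percent_of_p / 100 * p_percent   # L
    return 4 * p_percent * q_percent / permissible_error ** 2

# Prevalence about 40%, permissible error 5% of that prevalence (L = 2)
n = sample_size_qualitative(40, 5)
print(round(n))
```

Because L is squared in the denominator, halving the permissible error quadruples the required sample size.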
FOR QUANTITATIVE DATA
n = (t²α × s²) / e²
Where
n= desired sample size
s= standard deviation of the observation
e = permissible error
tα= is the value of ‘t’ at 5% level from ‘t’ table
EXAMPLE
In a community survey to estimate the Hb level, suppose it is
known from data already available that the mean Hb level is
about 12 gm% with an SD of 1.5 gm%. The sample size required
to estimate the Hb level with a permissible error of
0.5 gm% is obtained as follows:
s = 1.5 gm%
e = 0.5 gm%
t₀.₀₅ = 1.96 (can be taken as 2)
$$ n = \frac{t_\alpha^2 \times s^2}{e^2} = \frac{2^2 \times (1.5)^2}{(0.5)^2} = \frac{4 \times 2.25}{0.25} = 36 \text{ persons} $$
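A minimal Python sketch of the same calculation follows. The function name is an assumption made for illustration; the default t value of 2 mirrors the rounding used in the example, and passing 1.96 shows how little the rounding matters.

```python
import math


def sample_size_quantitative(sd, permissible_error, t_value=2.0):
    """Sample size n = t^2 * s^2 / e^2 for quantitative data (hypothetical helper)."""
    n = (t_value ** 2) * (sd ** 2) / permissible_error ** 2
    # Round up to the next whole person.
    return math.ceil(n)


# Hb survey from the example: SD 1.5 gm%, permissible error 0.5 gm%.
print(sample_size_quantitative(1.5, 0.5))
# Same survey with the unrounded table value of t.
print(sample_size_quantitative(1.5, 0.5, t_value=1.96))
```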
In clinical trials there are two groups, one
experimental and the other control. In order to
estimate the size of the sample for each group,
the difference in the response rates of the two groups
is to be taken into consideration:
$$ n = \frac{2 t_\alpha^2 \times s^2}{d^2} $$
Where
n = required sample size for each group
s = pooled SD of the observations of the two groups
d = anticipated smallest difference
t_α = usually taken as ‘t’ at the 5% level
EXAMPLE
An investigator wants to investigate the increase in
Hb level in anemia cases following administration of a
particular drug, compared against a known drug.
The minimum number of cases to be investigated in each
group is calculated as follows.
Suppose
d = 2 gm%
s = 3 gm%
t at the 5% level is taken as 2
$$ n = \frac{2 t_\alpha^2 \times s^2}{d^2} = \frac{2 \times (2)^2 \times (3)^2}{(2)^2} = \frac{8 \times 9}{4} = 18 \text{ persons} $$
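The two-group calculation can be sketched in Python as below. The function and parameter names are assumptions chosen for readability; the formula is the one given above, with t taken as 2 by default.

```python
import math


def sample_size_per_group(pooled_sd, smallest_difference, t_value=2.0):
    """Per-group size n = 2 * t^2 * s^2 / d^2 for a two-group trial (hypothetical helper)."""
    n = 2 * (t_value ** 2) * (pooled_sd ** 2) / smallest_difference ** 2
    # Round up to the next whole person.
    return math.ceil(n)


# Anemia trial from the example: pooled SD 3, smallest difference 2.
print(sample_size_per_group(3, 2))
```

The result is the number needed in each group, so the trial as a whole needs twice that many subjects.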
SAMPLING VARIATION
If two or more samples are drawn from the same
population, their means (m₁, m₂, m₃, ...) may not be
equal but will show variation, even though they are
from the same population.
Such a difference between the means of the samples is
known as “sampling variation”.
The sampling variation from one sample to
another may be due to chance, when it is called
“Natural or Biological variability”, or
due to the play of certain factors, when it is called
“Real variability”,
e.g. the effect of nutrition, vaccine, smoking etc.
The means of the samples (m₁, m₂, m₃ etc.) show
dispersion around the population mean (M)
symmetrically, as in a Normal distribution, with a
central tendency and a definite standard
deviation.
The variation in the sample means is measured by the
“Standard error”.
STANDARD ERROR
Standard error is in fact not an error, but the SD of the
sample means around the population mean (M).
This standard deviation of the sample means about the
population mean is called the “Standard error of the
mean”, denoted as SE(x̄) or simply the standard error (SE).
$$ SE = \frac{SD}{\sqrt{n}} $$
Since the distribution of the means follows the pattern of the
normal distribution, about 95% of the sample means will lie
within the limits of two standard errors
(M ± 2SE, or M ± 2SD/√n).
Therefore, the chance that the population mean (M) lies
between the limits defined by the sample mean ± 2SE is also
about 95%.
These limits are referred to as the 95% “Confidence limits”.
The confidence level is increased to 99% by widening the
limits to ± 2.58SE; limits of ± 3SE cover about 99.7%.
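As an illustration of the standard error and the approximate 95% confidence limits, here is a small Python sketch. The Hb readings are invented for the example and the function name is an assumption; the multiplier of 2 follows the rule of thumb used above.

```python
import math
import statistics


def confidence_limits(sample, multiplier=2.0):
    """Return (lower, upper, se) using SE = SD / sqrt(n) (hypothetical helper)."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample SD, divisor n - 1
    se = sd / math.sqrt(n)
    return mean - multiplier * se, mean + multiplier * se, se


# Hypothetical Hb readings (gm%) from nine persons.
hb_sample = [10, 12, 14, 12, 12, 11, 13, 12, 12]
lower, upper, se = confidence_limits(hb_sample)
print(f"SE = {se:.3f}, limits = {lower:.2f} to {upper:.2f}")
```

With a sample this small, the value from the ‘t’ table would normally replace the multiplier 2.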
“t” TEST
W. S. Gosset observed that with small samples the sampling
variation will be large.
He demonstrated that the ratio of the observed difference between
two values to the standard error (SE) of the difference follows a
distribution called the “t” distribution, and such a ratio is
denoted as “t” (for small samples).
The ‘t value’ was derived by Gosset, writing as “Student”, in 1908.
“t” is calculated as the ratio of the difference between two
means or proportions to the standard error of the difference:
$$ t = \frac{\bar{x}_1 - \bar{x}_2}{SE} = \frac{\text{mean difference}}{\text{standard error of the mean difference}} $$
Unpaired t-test
If the observations are made on two independent groups, such as a
control group and an
experimental (treated) group,
and their means are compared for a significant difference,
this is known as “unpaired comparison” and the test applied is the
unpaired t-test.
Paired t-test
If the observations are made on a single sample and the
values of a certain characteristic are noted before and
after treatment with a particular drug, such a
comparison of values is known as
“paired comparison”, and the test applied is the paired t-test.
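A paired comparison can be sketched in Python as follows. The lecture does not spell out the paired formula, so this uses the standard form as an assumption: t is the mean of the before/after differences divided by the standard error of those differences, with n - 1 degrees of freedom for n pairs. The readings and the function name are invented for illustration.

```python
import math
import statistics


def paired_t(before, after):
    """Return (t, df) for a paired comparison (hypothetical helper)."""
    if len(before) != len(after):
        raise ValueError("paired data need equal-length series")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    se_diff = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / se_diff, n - 1


# Hypothetical Hb readings (gm%) on the same five patients.
before = [10, 11, 9, 12, 10]
after = [12, 12, 11, 13, 13]
t, df = paired_t(before, after)
print(f"t = {t:.2f}, df = {df}")
```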
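Equal sample size (n observations in each group)
$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\delta_1^2 + \delta_2^2}{n}}} $$
Unequal sample size
$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{(n_1 - 1)\delta_1^2 + (n_2 - 1)\delta_2^2}{n_1 + n_2 - 2} \cdot \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} $$
where δ₁² and δ₂² are the variances of the two samples.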
DEGREES OF FREEDOM (DF)
• Represents the number of
independent observations in a
sample.
• Is a measure of the
number of values that are free to
vary within a statistical test.
• Calculated as follows:
In the unpaired ‘t’ test of difference
between means,
DF = n₁ + n₂ - 2
• where n₁ & n₂ are the numbers of
observations in each series.
In the paired ‘t’ test,
DF = n - 1
• where n is the number of pairs.
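• A probability table (the ‘t’ table) is used to find the critical value of t
• First determine the degrees of freedom
• Decide the level of significance
• Example: degrees of freedom = 4 and α = .05
give a two-tailed critical value of t = 2.776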
Results of t test
• Example: Data obtained from an experiment
comparing the number of un-popped seeds in
popcorn brand A and popcorn brand B.
A     B
26    32
22    35
30    20
34    33
Is the difference significant?
• Determine the mean, variance and standard deviation
of the samples.
$$ \bar{x}_A = \frac{\sum x}{n} = \frac{26 + 22 + 30 + 34}{4} = 28 $$
$$ \bar{x}_B = \frac{\sum x}{n} = \frac{32 + 35 + 20 + 33}{4} = 30 $$
Variance:
$$ \delta^2 = \frac{\sum (x - \bar{x})^2}{n - 1} $$
Popcorn A:
$$ \delta_A^2 = \frac{(26-28)^2 + (22-28)^2 + (30-28)^2 + (34-28)^2}{3} = \frac{4 + 36 + 4 + 36}{3} = 26.67 $$
Popcorn B:
$$ \delta_B^2 = \frac{(32-30)^2 + (35-30)^2 + (20-30)^2 + (33-30)^2}{3} = \frac{4 + 25 + 100 + 9}{3} = 46 $$
Standard deviation: δ = √δ²
Popcorn A: √26.67 = 5.16
Popcorn B: √46 = 6.78
Finding calculated t
$$ t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{\delta_A^2 + \delta_B^2}{n}}} = \frac{28 - 30}{\sqrt{\dfrac{26.67 + 46}{4}}} = \frac{-2}{\sqrt{18.17}} = \frac{-2}{4.26} = -0.47 $$
The sign only shows the direction of the difference; the
magnitude |t| = 0.47 is compared with the table value.
Determine critical value of t
• Select level of significance α = .01
• Determine degrees of freedom
degrees of freedom of A = 3
degrees of freedom of B = 3
total degrees of freedom = 6
• Critical value of t = 3.707
The calculated value of t (0.47) is less than the critical value of
t from the table (3.707).
The null hypothesis is not rejected.
Descriptive statistics    popcorn A        popcorn B
Mean                      28               30
Variance                  26.67            46
Standard deviation        5.16             6.78
1 SD (68% band)           22.84 - 33.16    23.22 - 36.78
2 SD (95% band)           17.67 - 38.33    16.44 - 43.56
3 SD (99.7% band)         12.51 - 43.49    9.65 - 50.35
Number                    4                4
Results of t test         |t| = 0.47       df = 6
                          |t| of 0.47 < 3.707 at α = .01
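The popcorn comparison can be re-run in Python working directly from the eight observations tabulated above, so that the means and variances are recomputed rather than typed in. The function names are assumptions for illustration; the critical value 3.707 is the table value quoted in the example for α = .01 and 6 degrees of freedom. Both the equal-size formula and the pooled (unequal-size) formula are included, since they should agree when the two groups are the same size.

```python
import math
import statistics

brand_a = [26, 22, 30, 34]
brand_b = [32, 35, 20, 33]
CRITICAL_T = 3.707  # table value quoted in the text (alpha = .01, df = 6)


def unpaired_t_equal_n(x, y):
    """t with SE = sqrt((var_x + var_y) / n); needs equal group sizes."""
    n = len(x)
    if len(y) != n:
        raise ValueError("equal-size formula needs equal group sizes")
    se = math.sqrt((statistics.variance(x) + statistics.variance(y)) / n)
    return (statistics.mean(x) - statistics.mean(y)) / se, 2 * n - 2


def unpaired_t_pooled(x, y):
    """t with the pooled variance; works for unequal group sizes."""
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * statistics.variance(x)
                  + (n2 - 1) * statistics.variance(y)) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.mean(x) - statistics.mean(y)) / se, df


t, df = unpaired_t_equal_n(brand_a, brand_b)
decision = "reject H0" if abs(t) > CRITICAL_T else "do not reject H0"
print(f"t = {t:.2f}, df = {df}: {decision}")
```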
Error Types
• Type I Error: Reject H0 when it is true
• Type II Error: Do not reject H0 when it is false
• β (beta) is the probability of a type II error
(α being the probability of a type I error).
– A type II error is the error of not rejecting the null
hypothesis when the null is in fact false.
– A type II error is like not noticing a wolf that is really there.
– NOTE: β is not equal to 1 - α, although β is often larger
than α.
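Errors and correct conclusions
in a hypothesis test
State of reality:      H0 is true            H0 is false
Reject H0              Type I error          Correct conclusion
Do not reject H0       Correct conclusion    Type II error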
Absolute measures of dispersion are expressed in the same units in
which the original data are presented, so these measures cannot
be used to compare the variation between two series that are
in different units.
Relative measures are not expressed in units but are pure
numbers. Each is the ratio of an absolute measure of dispersion
to an appropriate average, such as the co-efficient of Standard
Deviation or the co-efficient of Mean Deviation.
Absolute Measures
Range
Quartile Deviation
Mean Deviation
Standard Deviation
Lorenz Curve
Relative Measures
Co-efficient of Range
Co-efficient of Quartile Deviation
Co-efficient of Mean Deviation
Co-efficient of Variation
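To show why a relative measure is needed when two series are in different units, the sketch below computes the co-efficient of variation for two invented series. The lecture does not give the formula, so the usual definition, SD divided by the mean and multiplied by 100, is assumed here; the data and the function name are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """CV (%) = SD / mean * 100, a unit-free measure (assumed definition)."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical series in different units.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [50, 60, 70, 80, 90]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)
print(f"CV of height = {cv_height:.2f}%, CV of weight = {cv_weight:.2f}%")
```

The two standard deviations (in cm and kg) cannot be compared directly, but the two percentages can.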
The semi-inter quartile range is a
measure of spread or dispersion. It is
computed as one half the difference
between the 75th percentile [often
called (Q3)] and the 25th percentile
(Q1). The formula for semi-inter quartile
range is therefore: (Q3-Q1)/2.
Since half the scores in a distribution lie between Q3 and Q1,
the semi-inter quartile range is 1/2 the distance needed to
cover 1/2 the scores. In a symmetric distribution, an interval
stretching from one semi-inter quartile range below the
median to one semi-inter quartile above the median will
contain 1/2 of the scores. This will not be true for a skewed
distribution, however.
The semi-inter quartile range is little affected by extreme
scores, so it is a good measure of spread for skewed
distributions. However, it is more subject to sampling
fluctuation in normal distributions than is the standard
deviation and therefore not often used for data that are
approximately normally distributed.
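The formula (Q3-Q1)/2 and its resistance to extreme scores can be sketched in Python. The function name and the data are assumptions for illustration. Quartiles can be computed by several conventions; this sketch relies on the default ("exclusive") method of `statistics.quantiles`, so other software may give slightly different values.

```python
import statistics


def semi_interquartile_range(values):
    """(Q3 - Q1) / 2 using the default quartile method of statistics.quantiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return (q3 - q1) / 2


# Hypothetical scores, and the same scores with one extreme value.
scores = [1, 2, 3, 4, 5, 6, 7]
scores_with_outlier = [1, 2, 3, 4, 5, 6, 70]

print(semi_interquartile_range(scores))
print(semi_interquartile_range(scores_with_outlier))
print(statistics.stdev(scores), statistics.stdev(scores_with_outlier))
```

The semi-interquartile range is the same for both series, while the standard deviation changes sharply once the extreme score is included.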
Understand measures of association and
difference
Outcome Measures
• Compare the incidence of disease among
people who have some characteristic with
those who do not
• The ratio of the incidence rate in one group
to that in another is called a rate ratio or
relative risk (RR)
• The difference in incidence rates between
the groups is called a risk difference or
attributable risk (AR)
Calculating Outcome Measures
                          Outcome
Exposure       Disease (cases)   No Disease (controls)   Incidence
Exposed        A                 B                       IE = A / (A+B)
Not Exposed    C                 D                       IN = C / (C+D)
Relative Risk = IE / IN
Attributable Risk = IE - IN
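The two outcome measures can be computed from the four cells of the table with a short Python sketch. The counts below are invented for illustration and the function name is an assumption; the cell labels A, B, C and D follow the table above.

```python
def outcome_measures(a, b, c, d):
    """Return (relative_risk, attributable_risk) from a 2x2 table.

    a, b: exposed with and without disease.
    c, d: not exposed with and without disease.
    """
    incidence_exposed = a / (a + b)      # IE
    incidence_unexposed = c / (c + d)    # IN
    relative_risk = incidence_exposed / incidence_unexposed
    attributable_risk = incidence_exposed - incidence_unexposed
    return relative_risk, attributable_risk


# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed develop disease.
rr, ar = outcome_measures(30, 70, 10, 90)
print(f"RR = {rr:.2f}, AR = {ar:.2f}")
```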
Lung Cancer
Exposure as a risk factor for the disease?
• Exposure reduces disease risk (Protective factor)
• Particular exposure is not a risk factor
• Exposure increases disease risk (Risk factor)
Annual Death Rates for Lung Cancer
and Coronary Heart Disease
by Smoking Status, Males
Annual Death Rate / 100,000
Exposure Lung Cancer Coronary Heart Disease